Modernize Dagster config and launch scripts #5071
Conversation
e759334 to fab9247
… and dg-first config

This branch advances the Dagster housekeeping effort by simplifying run semantics, reducing launch/config duplication, and making asset dependencies explicit enough to support a clean cutover to canonical `dg launch` workflows.

Why this change:
- Align local, test, and automation execution around the same Dagster-native patterns.
- Replace job proliferation and script-side wiring with asset selection + resource config.
- Reduce hidden coupling between FERC prebuild steps and downstream extraction assets.
- Improve debuggability and contributor ergonomics ahead of deeper defs/module cleanup.

Core architecture changes:
- Consolidate job naming and intent in ETL defs:
  - Main ETL job renamed to `pudl` (from legacy `etl_*` naming).
  - FERC EQR job renamed to `ferceqr`.
  - `src/pudl/definitions.py` now points to `pudl.etl.defs` as the canonical definitions source to avoid duplicate/merged registration drift.
- Integrate FERC-to-SQLite prerequisites as first-class Dagster assets:
  - Replace monolithic ferc_to_sqlite job wiring with granular DBF/XBRL prerequisite assets (dataset + format scoped) in `src/pudl/ferc_to_sqlite/__init__.py`.
  - Keep backward-compatibility graph exports for existing tests/callers.
- Make raw extraction dependencies explicit:
  - FERC1/FERC714 raw asset specs and metadata assets now depend on the relevant `raw_ferc*_...__sqlite` prerequisite keys.
  - This clarifies run ordering and improves selective materialization safety.

Config/resource and dg usability improvements:
- Add lightweight `dg` profile wrappers:
  - `src/pudl/package_data/settings/dg_fast.yml`
  - `src/pudl/package_data/settings/dg_full.yml`
- Extend resource config to load from an ETL settings path:
  - `dataset_settings` and `ferc_to_sqlite_settings` accept `etl_settings_path`.
- Centralize ETL settings loaders in `src/pudl/settings.py`:
  - `load_etl_settings(...)`
  - `load_packaged_etl_settings(...)`
  - Remove duplicate local loaders from resources/etl modules.
- Tighten typing while preserving practical flexibility:
  - `create_dagster_config(...)` now accepts the pydantic model types used by ETL resources (`FrozenBaseModel | BaseSettings`) rather than overly generic duck typing.

Runtime/logging behavior:
- Add runtime XBRL log-level control (`RuntimeSettings.xbrl_loglevel`).
- Thread the loglevel through XBRL conversion entrypoints.
- Filter known Arelle message spam while preserving useful conversion output.

Testing/integration hardening:
- Rework the integration DB prebuild in `test/conftest.py` to run `dg launch` under coverage via a subprocess helper.
- Preserve explicit `pudl.sqlite` schema initialization before the ETL prebuild.
- Remove no-op legacy extract fixtures and wire engines through one prebuild path.
- Update unit expectations for the new XBRL convert args (`loglevel`).

Type and safety cleanup:
- Resolve key `ferc1.py` annotation issues:
  - introduce a typed table mapping alias/TypedDict,
  - guard optional settings retrieval,
  - fix respondent record typing,
  - annotate known `PudlPaths()` static-check false-positive sites.

Net effect: This is a first round of Dagster housekeeping. It moves us toward a single mental model (assets + config + `dg launch`), reduces bespoke launch glue, and sets up safer, incremental follow-on work in `src/pudl/defs` and broader orchestration cleanup without changing core data semantics.
…g defaults
This commit makes Dagster-housekeeping changes that move runtime behavior into
tracked config files, simplify CI prebuild orchestration, and reduce logging noise.
Why this change:
- Keep `dg launch` behavior transparent and reviewable via committed config files.
- Make pytest integration prebuilds deterministic and faster by honoring fast-profile
settings that disable non-PUDL-integrated FERC forms.
- Unify logging policy in one place while keeping environment-specific toggles simple.
What changed:
- Add environment templates for local/CI ergonomics:
- `.env.example` with PUDL paths, logging controls, optional docs toggles, and
optional `DAGSTER_HOME`.
- `.envrc` with `dotenv` for terminal env injection via direnv.
- Centralize logging defaults in `src/pudl/logging_helpers.py`:
- Add `DEFAULT_DEPENDENCY_LOGLEVELS` with shared dependency logger levels.
- Add env-driven runtime logging controls:
- `PUDL_LOGLEVEL`
- `PUDL_LOGFILE`
- `PUDL_COLOR_LOGS`
- Keep dependency loglevel policy code-defined (not env-defined).
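The split between env-driven and code-defined logging policy can be sketched like this. It is a minimal illustration, not the actual `src/pudl/logging_helpers.py` code; the dependency logger names and levels shown are placeholders:

```python
import logging
import os

# Code-defined dependency loglevel policy (illustrative names/levels, not the
# real committed values).
DEFAULT_DEPENDENCY_LOGLEVELS = {
    "arelle": "CRITICAL",
    "alembic": "WARNING",
}


def configure_pudl_logging() -> logging.Logger:
    """Apply the env-driven root level, then the code-defined dependency levels."""
    root = logging.getLogger("pudl")
    # Runtime knobs come from the environment...
    root.setLevel(os.environ.get("PUDL_LOGLEVEL", "INFO").upper())
    if logfile := os.environ.get("PUDL_LOGFILE"):
        root.addHandler(logging.FileHandler(logfile))
    # ...but the dependency noise policy stays in code, not in env vars.
    for name, level in DEFAULT_DEPENDENCY_LOGLEVELS.items():
        logging.getLogger(name).setLevel(level)
    return root
```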
- Expand Dagster run configs to include execution, logger, and resource settings:
- `src/pudl/package_data/settings/dg_fast.yml`
- `src/pudl/package_data/settings/dg_full.yml`
- new `src/pudl/package_data/settings/dg_pytest_integration.yml`
- Add:
- `execution` executor settings (multiprocess for fast/full, in-process for pytest)
- `loggers.console.config.log_level`
- `datastore` config (`cloud_cache_path`, `use_local_cache`)
- `runtime_settings` (`xbrl_num_workers`, `xbrl_batch_size`, `xbrl_loglevel`)
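The shape of those config sections can be mirrored in Python as a run-config dict. Every concrete value below is a placeholder for illustration, not the committed profile contents:

```python
# Illustrative mirror of the sections added to the dg_*.yml profiles.
run_config = {
    "execution": {
        # multiprocess for fast/full profiles; in-process for pytest
        "config": {"multiprocess": {"max_concurrent": 4}},
    },
    "loggers": {"console": {"config": {"log_level": "INFO"}}},
    "resources": {
        "datastore": {
            "config": {"cloud_cache_path": "", "use_local_cache": True},
        },
        "runtime_settings": {
            "config": {
                "xbrl_num_workers": 2,
                "xbrl_batch_size": 50,
                "xbrl_loglevel": "WARNING",
            },
        },
    },
}
```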
- Narrow `etl_fast` FERC-to-SQLite scope in settings:
- In `src/pudl/package_data/settings/etl_fast.yml`, set `disabled: true` for
FERC Forms 2, 6, and 60 (DBF + XBRL), keeping Forms 1 and 714 enabled.
- Simplify and improve CI prebuild fixture behavior in `test/conftest.py`:
- Switch `dg launch` invocation from inline `--config-json` to `--config` file usage.
- Point pytest prebuild to `dg_pytest_integration.yml`.
- Stream child process output directly to stdout (with lightweight prefix) to avoid
double-formatted log lines.
- Remove duplicate dependency loglevel map from pytest fixture and rely on shared
logging defaults.
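The output-streaming approach above can be sketched with a small subprocess helper (the helper name is made up; the real fixture lives in `test/conftest.py` and runs `dg launch` under coverage):

```python
import subprocess
import sys


def stream_subprocess(cmd: list[str], prefix: str = "[dg] ") -> int:
    """Run a child process and pass its output through line by line.

    Child log lines go straight to stdout with a lightweight prefix, so they
    are not re-formatted a second time by the parent process's logger.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    assert proc.stdout is not None
    for line in proc.stdout:
        sys.stdout.write(prefix + line)
    return proc.wait()
```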
Net effect:
- Changes consolidate run-time knobs into config files and shared logging helpers,
reduce CI prebuild runtime by scoping `etl_fast` appropriately, and make logs easier
to read across local, CI, and nightly execution contexts.
5987d8c to 1ac1395
zaneselvans
left a comment
Well this clearly got out of hand. 🤖 🔥
```shell
    alembic upgrade head &&
    ferc_to_sqlite \
        --loglevel DEBUG \
        --workers 8 \
        "$PUDL_SETTINGS_YML"
}

function run_pudl_etl() {
    echo "Running PUDL ETL"
    pudl_etl \
        --loglevel DEBUG \
        "$PUDL_SETTINGS_YML"
}

function run_unit_tests() {
    echo "Running unit tests"
    pytest \
        -n auto \
        --etl-settings "$PUDL_SETTINGS_YML" \
        --live-dbs test/unit \
        --no-cov
}

function run_integration_tests() {
    echo "Running integration tests"
    pytest \
        -n auto \
        --etl-settings "$PUDL_SETTINGS_YML" \
        --live-dbs test/integration \
        --no-cov
```
This all goes away because the call details are now part of the pixi tasks, and we call them in one-liners below.
```diff
 @asset(
-    required_resource_keys={"dataset_settings"},
+    required_resource_keys={"etl_settings"},
```
This change shows up all over the place and is responsible for most of the changed files. Because we're passing the ETL settings in to Dagster's config system now directly via the config file, it's nice to just list it once, and then let the config system pull out the individual settings it needs wherever it's doing work. It means one Config wrapper class instead of two, one line in the config file instead of two, less need to track which setting object is getting passed around exactly. Seemed cleaner overall. But it did mean a lot of lines of code touched.
```diff
@@ -1,4 +1,4 @@
-"""Define tooling for monitoring the ferceqr_etl job during batch builds.
+"""Define tooling for monitoring the ferceqr job during batch builds.
```
Changed the name because all of these jobs are ETL, so it doesn't really distinguish anything.
```diff
-@resource(config_schema=create_dagster_config(DatasetsSettings()))
-def dataset_settings(init_context) -> DatasetsSettings:
-    """Dagster resource for parameterizing PUDL ETL assets.
+class PudlEtlSettingsResource(ConfigurableResource):
```
On main we had DatasetsSettings (for the PUDL ETL) and FercToSqliteSettings (FERC extraction config). They were assembled separately and passed around independently. The config schema was generated dynamically from create_dagster_config(), which mapped Pydantic model fields to Dagster's Field objects by hand. Merging them together and just having a single PUDL ETL settings object that can be accessed from context wherever any component is needed simplified things.
However, this change did touch a lot of lines, because everywhere we previously had the `dataset_settings` instance, we now have `etl_settings` -- when you see that elsewhere, this is why. It's not just a rename: the contents of that object are different now, and represent the contents of the entire PUDL ETL settings file.
```diff
-    in the Dagit UI.
-    """
-    return FercToSqliteSettings(**init_context.resource_config)
+class ZenodoDoiSettingsResource(ConfigurableResource):
```
New resource that gives Dagster direct access to the Zenodo DOIs that currently identify our raw inputs. I pulled this in to do the provenance tracking for the FERC-to-SQLite DBs, but it'll be useful for other "what data version is this?" tracking / provenance / asset-metadata features we want to add to the system over time, and it's less janky than sneaking around Dagster's back and reading the `zenodo_dois.yml` file off of disk ourselves. Having it integrated into Dagster's abstractions also means we can inject other values for testing if we want to.
test/unit/settings_test.py
Deleted tests: _convert_settings_to_dagster_config and create_dagster_config
Two tests that validated the old Dagster config-dict generation machinery are removed — test_convert_settings_to_dagster_config and test_dagster_config_excludes_computed_parts. Those functions (_convert_settings_to_dagster_config, create_dagster_config) no longer exist in settings.py; the resource layer now uses Pydantic-native ConfigurableResource rather than hand-assembling Dagster Field dicts.
TestDatasetsSettingsResource → TestPudlEtlSettingsResource
The old class tested dataset_settings, a legacy @resource-decorated function that accepted raw config dicts and validated them via Dagster's config schema layer. The tests checked that unknown datasource keys and wrong field types raised DagsterInvalidConfigError, and that default values were wired through config_schema correctly.
All three of those tests are replaced by two tests against the new PudlEtlSettingsResource:
- `test_invalid_field_type` — passes a non-string value for `etl_settings_path` and asserts that Pydantic raises `ValidationError`. The error type changes from `DagsterInvalidConfigError` to `ValidationError` because validation now happens in the Pydantic model, not in the Dagster config layer.
- `test_loads_from_file` — the more meaningful test: constructs a real `PudlEtlSettingsResource` pointed at the packaged `etl_fast.yml` and asserts that it returns an `EtlSettings` with a populated `ferc1` dataset settings object. This tests the full path from config → file load → Pydantic model.
New test_datastore_resource_loads
A new module-level test exercises the DatastoreResource migration. It constructs ZenodoDoiSettingsResource and DatastoreResource using the new context-manager resource API (from_resource_context_cm) and asserts the result is a real Datastore instance. This replaces implicit coverage that existed in the old pudl_datastore_fixture fixture wiring.
FercDbfToSqliteSettings added to abstract-base skip set
_all_settings_instances() collects every concrete settings class and tries to instantiate each with no arguments to check default-value sanity. FercDbfToSqliteSettings is added to the skip set alongside GenericDatasetSettings and FercGenericXbrlToSqliteSettings — it is an abstract intermediate base class, not a concrete settings type, so there is no meaningful default instance to construct.
```python
with pudl_engine.connect() as connection:
    for table_name in required_tables:
        first_row = connection.execute(
            sa.select(sa.literal(1)).select_from(sa.table(table_name)).limit(1)
        ).scalar()
        assert first_row is not None, f"Expected {table_name} to contain data."
```
Real change: check that the table not only exists, but that it has something in it to avoid the case where we initialize the DB but then don't write to it.
test/unit/io_managers_test.py
Apologies for the noise from the type annotation additions. Summary of the actual changes below.
Dead test removal
Three tests are removed that had already been abandoned on main:
- `test_error_when_handling_view_without_metadata` — tested a view-writing path that no longer exists in the IO manager.
- `test_handling_view_with_metadata` — was already `@pytest.mark.skip` with a comment saying "debug or remove"; it's now removed.
- `test_error_when_reading_view_without_metadata` — companion to the above.
New tests for the migrated ConfigurableIOManager classes
Four new tests that cover the Pydantic-native IO managers:
- `test_mixed_format_io_manager_invalid_config` — verifies that constructing `PudlMixedFormatIOManager(write_to_parquet=False, read_from_parquet=True)` raises `RuntimeError`. This guards the invariant that you can't read from parquet if you never wrote to it.
- `test_mixed_format_io_manager_initializes_backends` — confirms that `_sqlite_io_manager` and `_parquet_io_manager` are lazily constructed and that the correct underlying classes are instantiated.
- `test_ferc_dbf_io_manager_uses_injected_dataset_settings` and `test_ferc_xbrl_io_manager_uses_injected_dataset_settings` — the core regression tests for the new configurable managers. Each constructs a `FercDbfSQLiteConfigurableIOManager` / `FercXbrlSQLiteConfigurableIOManager` via `model_copy` with injected `EtlSettings`, mocks the underlying `FercDbfSQLiteIOManager` / `FercXbrlSQLiteIOManager`, and asserts that `_query` is called with the correct years extracted from the injected settings (`ferc1.dbf_years` and `ferc1.xbrl_years` respectively). A mock `DagsterInstance` returning valid provenance metadata is wired in so the provenance check passes.
- `test_ferc_dbf_io_manager_rejects_incompatible_provenance` — verifies the provenance staleness check introduced in `ferc_sqlite_provenance.py`. It supplies a `ZenodoDoiSettings` with a deliberately wrong DOI for ferc1 and asserts that `load_input` raises `RuntimeError` with the message "Zenodo DOI mismatch" before ever calling `pd.read_sql_query`.
- `test_ferc_dbf_io_manager_requires_provenance_metadata` — verifies the other provenance failure mode: when `instance.get_latest_materialization_event` returns `None` (no prior materialization recorded), `load_input` raises `RuntimeError` with "No Dagster provenance metadata".
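The no-metadata failure mode can be sketched with stdlib `unittest.mock`. The `check_provenance` helper below is made up for illustration; the real tests use `pytest-mock` against the actual IO manager classes:

```python
from unittest.mock import MagicMock


def check_provenance(instance, asset_key: str) -> None:
    """Sketch: fail loudly when no prior materialization was recorded."""
    event = instance.get_latest_materialization_event(asset_key)
    if event is None:
        raise RuntimeError(f"No Dagster provenance metadata for {asset_key}")


# A fake DagsterInstance with no recorded materialization event:
instance = MagicMock()
instance.get_latest_materialization_event.return_value = None

try:
    check_provenance(instance, "raw_ferc1_dbf__sqlite")
except RuntimeError as err:
    assert "No Dagster provenance metadata" in str(err)
```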
Class renames reflected
All references to FercXBRLSQLiteIOManager and FercDBFSQLiteIOManager are updated to FercXbrlSQLiteIOManager and FercDbfSQLiteIOManager respectively, and the new FercDbfSQLiteConfigurableIOManager and FercXbrlSQLiteConfigurableIOManager are imported and used in the new tests.
test/unit/ferc_sqlite_provenance_test.py
The tests divide cleanly into two groups:
get_ferc_sqlite_provenance and build_ferc_sqlite_provenance_metadata — these test the fingerprinting side: that a provenance object is constructed correctly from a db_name string, that the dataset/format fields are parsed correctly for all four FERC form/format combinations, that the years list is non-empty and sorted, that the metadata dict contains all the required keys, and that malformed db_name values (missing the _dbf/_xbrl suffix) are rejected with ValueError.
assert_ferc_sqlite_compatible — these test the compatibility-checking side: the function that a downstream IO manager calls before reading from a prebuilt SQLite file. The cases covered are:
- No instance available → silent no-op (safe to call unconditionally)
- Matching provenance → passes
- Stored DB covers a superset of the required years → passes (important: requesting a fast subset of years should still work against a full DB)
- DOI mismatch → `RuntimeError`
- Stored DB missing required years → `RuntimeError`
- No materialization event recorded → `RuntimeError`
- DB was built with `status="skipped"` (i.e. `years=[]`) → `RuntimeError`
The mocker.MagicMock() pattern is used throughout to simulate a DagsterInstance that returns pre-constructed provenance metadata without requiring a real Dagster run. This keeps all these tests fast and dependency-free.
```rst
co-located. Presuming you've run the ETL with the ``--ignore-foreign-key-constraints``
flag, you can also look at the PUDL ``plants_eia860`` and ``plants_all_ferc1`` tables
```
The `--ignore-foreign-key-constraints` flag is no longer required because the FK constraint check is now separated from building the DBs.
My goal with the updates to AGENTS.md in this PR was
- address the major issues that came up during the refactor using agents
- create agent guidance that's "feature complete"
- stay under the recommended 500 line limit.
Beyond ~500 lines the general consensus seems to be that you want to start using "progressive disclosure" to be more selective about what gets loaded into context. The initial pattern for that in Nov/Dec of last year was to have additional domain-specific docs-agent.md and test-agent.md etc. files at different levels in the repo hierarchy. That seems to have been largely supplanted now by Agent Skills.
The most helpful guide I came across for working on this file was Claude's Skill Authoring Best Practices. The AGENTS.md website was unreasonably vague.
This PR is... :chonk: and larger than anyone should really have to review in one pass. But a lot of the changes were entangled with each other. Sorry! Hopefully this overview can get you oriented.
The PR moves us closer to current Dagster patterns by making `dg launch` the canonical way to run the ETL, consolidating ETL configuration into Dagster-native resources and config files, and removing our legacy custom launcher CLI. It is part of the broader Dagster housekeeping work tracked in #5066, and implements the scoped first round of changes summarized in #5120.

Zane Why Did You Do This?! 🤦🏼‍♂️
I... wanted to try out working with coding agents and the dagster skills while doing something meaningful and I drank way too much ☕ and got carried away. Over the course of like 2 weeks.
Our current Dagster setup works, but it's accumulated multiple ways to do the same thing:
This PR isn't trying to redesign the ETL semantics. It's trying to simplify how we invoke the pipeline, make the configuration model more legible and composable, and enable future expansion and use of Dagster's built in features so we don't have as much custom glue code to maintain.
What Changed In The Core Dagster Setup
1. `dg launch` is now the canonical execution path

After this PR, PUDL no longer relies on custom `pudl_etl` and `ferc_to_sqlite` console scripts (or the job construction machinery they used) to launch Dagster jobs. This builds on the earlier `dg` compatibility groundwork in #5061 and #5062.

- `pudl_etl` and `ferc_to_sqlite` console entry points and their supporting modules.
- `dg launch` CLI.
- `dg_fast.yml`
- `dg_full.yml`
- `dg_pytest.yml`
- `dg_nightly.yml`

Why this matters:
2. The FERC-to-SQLite conversion is now part of the main asset graph
This PR pulls the raw FERC DBF/XBRL extraction work into the main PUDL Dagster definitions rather than treating it as a separate code location. This makes it easy to compose FERC assets with PUDL assets in a single highly parallelizable graph for performance (CI & nightly builds) -- or keep them separate (for day-to-day ergonomics).
- `raw_ferc_to_sqlite` assets for DBF and XBRL extraction.
- `pudl.etl.defs`, which is now the single code location loaded by `dg`.
- `pudl.ferc_to_sqlite` Dagster definitions package.
- `pudl`: main PUDL ETL without rebuilding the raw FERC SQLite prerequisites.
- `ferc_to_sqlite`: only rebuild the raw FERC prerequisite databases.
- `pudl_with_ferc_to_sqlite`: run the full end-to-end build in one Dagster job.
- `ferceqr`: keep the FERC EQR flow separate and clearly named.

Why this matters:
3. ETL configuration moved toward typed Dagster resources and config
Much of the refactor is about moving runtime configuration out of ad hoc script assembly and into typed Dagster objects.
- `etl_settings` resource.
- `ConfigurableResource` classes.
- `ConfigurableIOManager` classes.
- `RuntimeSettings` resource for execution knobs like XBRL worker counts and log levels.

Why this matters:
4. The definitions layer is easier to reuse in tests and follow-on organizational work
This PR also adds some structure that is mainly about maintainability rather than immediate user-visible behavior.
- `build_defs()` so tests and other callers can construct fresh `Definitions` objects with targeted overrides.
- `pudl.definitions` so `dg` loads a single merged source of truth.
- `execute_job` plumbing that was only needed by the retired CLIs.

Why this matters:
Testing And Developer Workflow Changes
One of my goals in this PR was to reduce the differences in how the ETL runs in tests, daily development, and our production builds. This resulted in a bunch of cleanup of our crufty test setup.
- `dg launch --job pudl_with_ferc_to_sqlite` instead of constructing separate in-process job graphs.
- `--dg-config` instead of `--etl-settings` for integration scenarios, meaning it can load the same config profiles as local and nightly runs.
- `--live-pudl-output`, which is more precise about what the flag actually does. Similarly, the flag to use a temporary input directory is now `--temp-pudl-input`.
- `PUDL_INPUT` and `PUDL_OUTPUT` handled more carefully to avoid mixed-suite path contamination.
- `DAGSTER_HOME` directory.
- With `--live-pudl-output` the integration tests instead use the user's Dagster instance and `DAGSTER_HOME`, meaning all Dagster outputs are either isolated or available (depending on the flag).
- `build_defs()` with resource overrides, which better matches the production definitions model.

Day to day, this should make development more uniform:
- `pixi run dg dev` to inspect or launch runs from the Dagster UI.
- `pixi run dg launch --job pudl --config ...` for direct CLI execution.
- `pixi run pudl`, `pixi run ferc-to-sqlite`, or `pixi run pudl-with-ferc-to-sqlite` depending on whether raw FERC prerequisites need to be rebuilt.
- `dg launch` and the nightly Dagster config profile.

The goal isn't to make everyone think more about Dagster. Really we want fewer custom choices, fewer custom flags, and less project-specific knowledge to keep track of and transfer to new contributors. Hopefully this will let us focus more on the energy data and less on the plumbing.
Asset Provenance, Logging, And Other Supporting Improvements
Several changes in this PR were prompted by the Dagster refactor but weren't strictly part of the scope of #5120.
- `pudl_with_ferc_to_sqlite` job).

Documentation changes worth noting
`docs/dev/run_the_etl.rst` — substantially rewritten

The "Running the ETL Pipeline" page is the primary reference for running PUDL locally. The rewrite replaces the old ops/graphs/jobs conceptual overview (which reflected the pre-asset Dagster programming model) with a description of the current asset-based model.
`docs/dev/nightly_data_builds.rst` — infrastructure update

The nightly build script steps are updated to reflect the new separate `pytest-unit-nightly`, `pytest-integration-nightly`, and `pytest-data-validation-nightly` pixi tasks. Reviewers should verify:

`docs/dev/clone_ferc1.rst` — new Mermaid diagram

A new flowchart showing the FERC extraction data flow (Zenodo → DBF/XBRL extraction → SQLite/DuckDB outputs → PUDL tables) was added.
Incidental Changes That Crept In
These changes are not the point of the PR, but they removed friction that came up during development that would have otherwise taken effort to work around, without fixing the underlying issues.
- `dev` feature and environment within pixi, to add some new requirements that are helpful for working with agents, and also other stuff we want installed locally but don't need in deployment.
- `AGENTS.md` file based on lessons learned in the process of this refactor.

Review Pointers
The most important changes are probably:
- `src/pudl/etl/__init__.py` and the new `src/pudl/etl/ferc_to_sqlite_assets.py`, which define the new Dagster execution model.
- `src/pudl/resources.py`, `src/pudl/io_managers.py`, and `src/pudl/settings.py`, which define the typed config/resource model.
- `test/conftest.py`, which shows how the new execution model is exercised in integration tests.
- The `ferc_to_sqlite_provenance.py` module safeguards against using stale FERC outputs but lets us keep them separate from the PUDL job day-to-day to avoid unnecessary materializations. It's an interesting example of the kinds of things you can do with asset-level metadata.

Remaining Tasks
- `AGENTS.md` to reflect lessons learned in this PR.

Followup Issues
- pudl_parquet_ / datapackage.json #5122
- `pudl.dagster` registry #5123