Modernize Dagster config and launch scripts #5071

Open

zaneselvans wants to merge 153 commits into main from dagster-housekeeping
Conversation


@zaneselvans zaneselvans commented Mar 9, 2026

This PR is... :chonk: and larger than anyone should really have to review in one pass. But a lot of the changes were entangled with each other. Sorry! Hopefully this overview can get you oriented.

The PR moves us closer to current Dagster patterns by making dg launch the canonical way to run the ETL, consolidating ETL configuration into Dagster-native resources and config files, and removing our legacy custom launcher CLI. It is part of the broader Dagster housekeeping work tracked in #5066, and implements the scoped first round of changes summarized in #5120.

Zane Why Did You Do This?! 🤦🏼‍♂️

I... wanted to try out working with coding agents and the dagster skills while doing something meaningful and I drank way too much ☕ and got carried away. Over the course of like 2 weeks.

Our current Dagster setup works, but it's accumulated multiple ways to do the same thing:

  • local development, tests, and nightly builds launched the ETL through slightly different interfaces.
  • important runtime parameters were assembled in custom CLI wrappers rather than in Dagster-native config.
  • the FERC-to-SQLite conversion step lived in a separate code location and separate launch path, which made the overall execution story harder to follow.
  • tests had their own setup logic that only partially matched how production runs were executed.

This PR isn't trying to redesign the ETL semantics. It's trying to simplify how we invoke the pipeline, make the configuration model more legible and composable, and enable future expansion and use of Dagster's built in features so we don't have as much custom glue code to maintain.

What Changed In The Core Dagster Setup

1. dg launch is now the canonical execution path

After this PR, PUDL no longer relies on custom pudl_etl and ferc_to_sqlite console scripts (or the job construction machinery they used) to launch Dagster jobs. This builds on the earlier dg compatibility groundwork in #5061 and #5062.

  • removed the legacy pudl_etl and ferc_to_sqlite console entry points and their supporting modules.
  • switched pixi tasks and documentation to Dagster's dg launch CLI.
  • added a small set of canonical Dagster config profiles for common run modes:
    • dg_fast.yml
    • dg_full.yml
    • dg_pytest.yml
    • dg_nightly.yml
  • kept the execution surface area smaller and more explicit by routing local development, tests, and nightly builds through the same Dagster-native interface.

Why this matters:

  • there is now one primary way to launch work.
  • configuration is expressed where Dagster expects to find it.
  • we have less custom code to maintain and less project-specific knowledge required to run the ETL.

2. The FERC-to-SQLite conversion is now part of the main asset graph

This PR pulls the raw FERC DBF/XBRL extraction work into the main PUDL Dagster definitions rather than treating it as a separate code location. This makes it easy to compose FERC assets with PUDL assets in a single highly parallelizable graph for performance (CI & nightly builds) -- or keep them separate (for day-to-day ergonomics).

  • added granular raw_ferc_to_sqlite assets for DBF and XBRL extraction.
  • folded those assets into pudl.etl.defs, which is now the single code location loaded by dg.
  • removed the separate pudl.ferc_to_sqlite Dagster definitions package.
  • defined a clearer set of jobs for common execution patterns:
    • pudl: main PUDL ETL without rebuilding the raw FERC SQLite prerequisites.
    • ferc_to_sqlite: only rebuild the raw FERC prerequisite databases.
    • pudl_with_ferc_to_sqlite: run the full end-to-end build in one Dagster job.
    • ferceqr: keep the FERC EQR flow separate and clearly named.

Why this matters:

  • the dependency structure is easier to understand in one asset graph.
  • the main ETL and its prerequisites are modeled in the same Dagster vocabulary.
  • local and automated runs can choose the right job for the situation without maintaining separate implementations.
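Pulling the FERC prerequisites and the main ETL into one asset graph means run ordering falls out of ordinary dependency resolution rather than a separately maintained launch path. A minimal stdlib sketch of that idea, using the standard library's `graphlib` (the asset names here are illustrative stand-ins, not the real Dagster asset keys):

```python
from graphlib import TopologicalSorter

# Illustrative dependency graph: each key maps to the set of assets it
# depends on. These names are hypothetical stand-ins for real asset keys.
asset_deps = {
    "raw_ferc1_dbf__sqlite": set(),
    "raw_ferc1_xbrl__sqlite": set(),
    "raw_ferc1__plants": {"raw_ferc1_dbf__sqlite", "raw_ferc1_xbrl__sqlite"},
    "core_ferc1__yearly_plants": {"raw_ferc1__plants"},
}

# A topological sort yields a valid materialization order: prerequisites
# always come before the assets that read from them.
order = list(TopologicalSorter(asset_deps).static_order())
assert order.index("raw_ferc1_dbf__sqlite") < order.index("raw_ferc1__plants")
```

Jobs like `pudl` vs. `pudl_with_ferc_to_sqlite` then amount to different selections over this one graph, rather than separate implementations.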

3. ETL configuration moved toward typed Dagster resources and config

Much of the refactor is about moving runtime configuration out of ad hoc script assembly and into typed Dagster objects.

  • unified ETL settings handling around a single etl_settings resource.
  • migrated runtime configuration to ConfigurableResource classes.
  • migrated key IO managers to ConfigurableIOManager classes.
  • introduced a small RuntimeSettings resource for execution knobs like XBRL worker counts and log levels.
  • exposed Zenodo DOI settings through Dagster resources so provenance and metadata can be attached during asset materialization.
  • updated job config assembly so ETL settings profiles can be loaded into Dagster run config directly.

Why this matters:

  • resource contracts are more explicit and type-checked.
  • tests and production runs can use the same config model.
  • adding new runtime settings should require less glue code in the future.
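To illustrate the shape of the typed settings objects, here is a stdlib-only sketch of a `RuntimeSettings`-style class. The field names (`xbrl_num_workers`, `xbrl_batch_size`, `xbrl_loglevel`) come from this PR's config profiles, but the class itself is a hypothetical stand-in using a frozen dataclass rather than Dagster's actual `ConfigurableResource` base:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeSettings:
    """Sketch of a typed runtime-settings resource: execution knobs with
    defaults, validated at construction time rather than at use time."""

    xbrl_num_workers: int = 4
    xbrl_batch_size: int = 50
    xbrl_loglevel: str = "INFO"

    def __post_init__(self) -> None:
        # Fail fast on bad config instead of deep inside an asset run.
        if self.xbrl_num_workers < 1:
            raise ValueError("xbrl_num_workers must be >= 1")
        if self.xbrl_loglevel not in {"DEBUG", "INFO", "WARNING", "ERROR"}:
            raise ValueError(f"invalid loglevel: {self.xbrl_loglevel}")
```

The real Dagster resources get the same effect from Pydantic validation, which is what makes the resource contracts "explicit and type-checked."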

4. The definitions layer is easier to reuse in tests and follow-on organizational work

This PR also adds some structure that is mainly about maintainability rather than immediate user-visible behavior.

  • added build_defs() so tests and other callers can construct fresh Definitions objects with targeted overrides.
  • simplified pudl.definitions so dg loads a single merged source of truth.
  • removed older reconstructable-job and execute_job plumbing that was only needed by the retired CLIs.

Why this matters:

  • follow-on Dagster reorganization can happen against a cleaner baseline.
  • tests can override resources in a controlled way.
  • there is less hidden coupling between launch code and definitions assembly.
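The `build_defs()` pattern can be sketched as a factory that merges targeted overrides over defaults. This is a hypothetical stand-in returning a plain dict; the real function returns a `dagster.Definitions` object and its signature may differ:

```python
def build_defs(resource_overrides=None):
    """Assemble a fresh definitions mapping; callers (e.g. tests) can swap
    individual resources without rebuilding everything by hand."""
    default_resources = {
        # Placeholder values standing in for real resource objects.
        "etl_settings": "packaged full ETL settings",
        "datastore": "production datastore",
    }
    resources = {**default_resources, **(resource_overrides or {})}
    return {
        "resources": resources,
        # Job names taken from this PR's job list.
        "jobs": ["pudl", "ferc_to_sqlite", "pudl_with_ferc_to_sqlite", "ferceqr"],
    }
```

A test would call something like `build_defs({"datastore": "tmpdir datastore"})` and get production-shaped definitions with just the one resource replaced, which is the "controlled override" behavior described above.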

Testing And Developer Workflow Changes

One of my goals in this PR was to reduce the differences in how the ETL runs in tests, daily development, and our production builds. This resulted in a bunch of cleanup of our crufty test setup.

  • integration-test prebuilds now run through dg launch --job pudl_with_ferc_to_sqlite instead of constructing separate in-process job graphs.
  • pytest now takes --dg-config instead of --etl-settings for integration scenarios, meaning it can load the same config profiles as local and nightly runs.
  • live-output testing is now expressed as --live-pudl-output, which is more precise about what the flag actually does. Similarly the flag to use a temporary input directory is now --temp-pudl-input.
  • test fixtures now manage PUDL_INPUT and PUDL_OUTPUT more carefully to avoid mixed-suite path contamination.
  • integration tests now use their own real Dagster instance, totally separate from the user's Dagster instance, with a standalone DAGSTER_HOME directory.
  • when using --live-pudl-output, the integration tests instead use the user's Dagster instance and DAGSTER_HOME, meaning all Dagster outputs are either isolated or available, depending on the flag.
  • asset value loading in tests now uses build_defs() with resource overrides, which better matches the production definitions model.
  • the new Dagster and pytest setup means we also have direct access to Dagster assets and other definitions from within the tests.
  • added or expanded unit coverage around IO manager behavior and FERC provenance metadata.

Day to day, this should make development more uniform:

  • use pixi run dg dev to inspect or launch runs from the Dagster UI.
  • use pixi run dg launch --job pudl --config ... for direct CLI execution.
  • use the Dagster config profiles instead of passing ETL settings through custom wrappers or modifying the single set of default config values embedded in the code (e.g. to lower concurrency on local runs).
  • use pixi run pudl, pixi run ferc-to-sqlite, or pixi run pudl-with-ferc-to-sqlite depending on whether raw FERC prerequisites need to be rebuilt.
  • updated the nightly PUDL build to launch the ETL with dg launch and the nightly Dagster config profile.

The goal isn't to make everyone think more about Dagster. Really we want fewer custom choices, fewer custom flags, and less project-specific knowledge to keep track of and transfer to new contributors. Hopefully this will let us focus more on the energy data and less on the plumbing.

Asset Provenance, Logging, And Other Supporting Improvements

Several changes in this PR were prompted by the Dagster refactor but weren't strictly part of the scope of #5120.

  • added FERC SQLite provenance metadata to the Dagster materializations for raw FERC prerequisite assets. This lets us keep them in a separate job for day-to-day work (since we hardly ever want to rebuild them) while still getting an error if they need to be rebuilt (because the source Zenodo DOI or their ETL settings have changed).
  • exposed Zenodo DOIs as a Dagster resource so provenance records can point back to the specific source releases being used.
  • made the FERC SQLite IO managers more robust for single-pass Dagster runs by reflecting metadata lazily when upstream databases do not exist yet at resource initialization time (which is what happens when the FERC assets are run as part of the main pudl_with_ferc_to_sqlite job).
  • cleaned up logging defaults and dependency log levels. though tbh this is still a mess.
  • filtered known Arelle console spam so XBRL runs are easier to read in CI and nightly logs.
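The provenance metadata idea can be sketched with the standard library: fingerprint the inputs (source DOI, processed years, and a hash of the settings) and attach that dict to the materialization. This is a hypothetical helper, not the real module's API; the PR's remaining tasks note the check may move from a settings hash to a year subset:

```python
import hashlib
import json


def build_provenance_metadata(db_name, zenodo_doi, years, settings):
    """Build a provenance record for a FERC SQLite materialization.

    A later run can compare this record against its own requirements and
    fail loudly if the prebuilt database is stale.
    """
    # sort_keys makes the fingerprint independent of dict ordering.
    settings_hash = hashlib.sha256(
        json.dumps(settings, sort_keys=True).encode()
    ).hexdigest()
    return {
        "db_name": db_name,
        "zenodo_doi": zenodo_doi,
        "years": sorted(years),
        "settings_hash": settings_hash,
    }
```

The key property is determinism: the same inputs always produce the same fingerprint, so a mismatch reliably indicates that the DOI, years, or settings actually changed.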

Documentation changes worth noting

docs/dev/run_the_etl.rst — substantially rewritten

The "Running the ETL Pipeline" page is the primary reference for running PUDL locally. The rewrite replaces the old ops/graphs/jobs conceptual overview (which reflected the pre-asset Dagster programming model) with a description of the current asset-based model.

docs/dev/nightly_data_builds.rst — infrastructure update

The nightly build script steps are updated to reflect the new separate pytest-unit-nightly, pytest-integration-nightly, and pytest-data-validation-nightly pixi tasks. Reviewers should verify:

docs/dev/clone_ferc1.rst — new Mermaid diagram

A new flowchart showing the FERC extraction data flow (Zenodo → DBF/XBRL extraction → SQLite/DuckDB outputs → PUDL tables) was added.

Incidental Changes That Crept In

These changes are not the point of the PR, but they remove friction that came up during development and would otherwise have taken effort to work around without fixing the underlying issues.

  • a pretty major cleanup of our pytest fixtures and configuration to avoid environment variable and path contamination between test suites, which became an issue with the new execution model.
  • split the data validation tests (back!) out into a separate test module so they can easily be run separately from the main integration tests.
  • simplification of the integration tests fixture dependencies by using a single set of prebuilt outputs.
  • simplified some workflow conditionals and lint issues that surfaced while touching the build and test pipelines.
  • added a dev feature and environment within pixi, providing some new requirements that are helpful for working with agents, plus other tools we want installed locally but don't need in deployment.
  • big update to the AGENTS.md file based on lessons learned in the process of this refactor.

Review Pointers

The most important changes are probably:

  • the changes to src/pudl/etl/__init__.py and the new src/pudl/etl/ferc_to_sqlite_assets.py, which define the new Dagster execution model.
  • the changes to src/pudl/resources.py, src/pudl/io_managers.py, and src/pudl/settings.py, which define the typed config/resource model.
  • the removal of the legacy CLI modules and console entry points.
  • the updates to test/conftest.py, which show how the new execution model is exercised in integration tests.
  • the new ferc_to_sqlite_provenance.py module safeguards against using stale FERC outputs but lets us keep them separate from the PUDL job day-to-day to avoid unnecessary materializations. It's an interesting example of the kinds of things you can do with asset-level metadata.

Remaining Tasks

  • Define an ergonomic workflow that avoids materializing FERC assets unnecessarily.
  • Factor peripheral changes out into "Improve nightly build script readability and Slack reporting" (#5134) to ease review.
  • Factor peripheral changes out into "Fix mutable-default and validator bug in GridPathRAToolkitSettings" (#5136) to ease review.
  • Change FERC SQLite provenance check to use year subset not settings hash
  • Do a thorough self-review of the PR, adding comments & explanations where needed.
  • Scan developer docs for readability, completeness, and correctness.
  • Do a final update to the release notes.
  • Update AGENTS.md to reflect lessons learned in this PR.
  • Do a final branch deployment to check everything still works.

Followup Issues

@zaneselvans zaneselvans added the dagster (Issues related to our use of the Dagster orchestrator) and developer experience (Things that make the developers' lives easier, but don't necessarily directly improve the data) labels Mar 9, 2026
Base automatically changed from subset-multiassets to main March 11, 2026 00:15
… and dg-first config

This branch advances the Dagster housekeeping effort by simplifying run semantics,
reducing launch/config duplication, and making asset dependencies explicit enough to
support a clean cutover to canonical `dg launch` workflows.

Why this change:
- Align local, test, and automation execution around the same Dagster-native patterns.
- Replace job proliferation and script-side wiring with asset selection + resource config.
- Reduce hidden coupling between FERC prebuild steps and downstream extraction assets.
- Improve debuggability and contributor ergonomics ahead of deeper defs/module cleanup.

Core architecture changes:
- Consolidate job naming and intent in ETL defs:
  - Main ETL job renamed to `pudl` (from legacy `etl_*` naming).
  - FERC EQR job renamed to `ferceqr`.
  - `src/pudl/definitions.py` now points to `pudl.etl.defs` as the canonical
definitions source to avoid duplicate/merged registration drift.
- Integrate FERC-to-SQLite prerequisites as first-class Dagster assets:
  - Replace monolithic ferc_to_sqlite job wiring with granular DBF/XBRL prerequisite
assets (dataset + format scoped) in `src/pudl/ferc_to_sqlite/__init__.py`.
  - Keep backward-compatibility graph exports for existing tests/callers.
- Make raw extraction dependencies explicit:
  - FERC1/FERC714 raw asset specs and metadata assets now depend on the relevant
`raw_ferc*_...__sqlite` prerequisite keys.
  - This clarifies run ordering and improves selective materialization safety.

Config/resource and dg usability improvements:
- Add lightweight `dg` profile wrappers:
  - `src/pudl/package_data/settings/dg_fast.yml`
  - `src/pudl/package_data/settings/dg_full.yml`
- Extend resource config to load from an ETL settings path:
  - `dataset_settings` and `ferc_to_sqlite_settings` accept `etl_settings_path`.
- Centralize ETL settings loaders in `src/pudl/settings.py`:
  - `load_etl_settings(...)`
  - `load_packaged_etl_settings(...)`
  - Remove duplicate local loaders from resources/etl modules.
- Tighten typing while preserving practical flexibility:
  - `create_dagster_config(...)` now accepts pydantic model types used by ETL
resources (`FrozenBaseModel | BaseSettings`) rather than overly generic duck typing.

Runtime/logging behavior:
- Add runtime XBRL log-level control (`RuntimeSettings.xbrl_loglevel`).
- Thread loglevel through XBRL conversion entrypoints.
- Filter known Arelle message spam while preserving useful conversion output.

Testing/integration hardening:
- Rework integration DB prebuild in `test/conftest.py` to run `dg launch` under
coverage via subprocess helper.
- Preserve explicit `pudl.sqlite` schema initialization before ETL prebuild.
- Remove no-op legacy extract fixtures and wire engines through one prebuild path.
- Update unit expectations for new XBRL convert args (`loglevel`).

Type and safety cleanup:
- Resolve key `ferc1.py` annotation issues:
  - introduce typed table mapping alias/TypedDict,
  - guard optional settings retrieval,
  - fix respondent record typing,
  - annotate known `PudlPaths()` static-check false-positive sites.

Net effect:
This is a first round of Dagster housekeeping: it moves us toward a single mental model
(assets + config + `dg launch`), reduces bespoke launch glue, and sets up safer,
incremental follow-on work in `src/pudl/defs` and broader orchestration cleanup without
changing core data semantics.
…g defaults

This commit makes Dagster-housekeeping changes that move runtime behavior into
tracked config files, simplify CI prebuild orchestration, and reduce logging noise.

Why this change:
- Keep `dg launch` behavior transparent and reviewable via committed config files.
- Make pytest integration prebuilds deterministic and faster by honoring fast-profile
  settings that disable non-PUDL-integrated FERC forms.
- Unify logging policy in one place while keeping environment-specific toggles simple.

What changed:
- Add environment templates for local/CI ergonomics:
  - `.env.example` with PUDL paths, logging controls, optional docs toggles, and
    optional `DAGSTER_HOME`.
  - `.envrc` with `dotenv` for terminal env injection via direnv.

- Centralize logging defaults in `src/pudl/logging_helpers.py`:
  - Add `DEFAULT_DEPENDENCY_LOGLEVELS` with shared dependency logger levels.
  - Add env-driven runtime logging controls:
    - `PUDL_LOGLEVEL`
    - `PUDL_LOGFILE`
    - `PUDL_COLOR_LOGS`
  - Keep dependency loglevel policy code-defined (not env-defined).

- Expand Dagster run configs to include execution, logger, and resource settings:
  - `src/pudl/package_data/settings/dg_fast.yml`
  - `src/pudl/package_data/settings/dg_full.yml`
  - new `src/pudl/package_data/settings/dg_pytest_integration.yml`
  - Add:
    - `execution` executor settings (multiprocess for fast/full, in-process for pytest)
    - `loggers.console.config.log_level`
    - `datastore` config (`cloud_cache_path`, `use_local_cache`)
    - `runtime_settings` (`xbrl_num_workers`, `xbrl_batch_size`, `xbrl_loglevel`)

- Narrow `etl_fast` FERC-to-SQLite scope in settings:
  - In `src/pudl/package_data/settings/etl_fast.yml`, set `disabled: true` for
    FERC Forms 2, 6, and 60 (DBF + XBRL), keeping Forms 1 and 714 enabled.

- Simplify and improve CI prebuild fixture behavior in `test/conftest.py`:
  - Switch `dg launch` invocation from inline `--config-json` to `--config` file usage.
  - Point pytest prebuild to `dg_pytest_integration.yml`.
  - Stream child process output directly to stdout (with lightweight prefix) to avoid
    double-formatted log lines.
  - Remove duplicate dependency loglevel map from pytest fixture and rely on shared
    logging defaults.

Net effect:
- Changes consolidate run-time knobs into config files and shared logging helpers,
  reduce CI prebuild runtime by scoping `etl_fast` appropriately, and make logs easier
  to read across local, CI, and nightly execution contexts.
@zaneselvans zaneselvans force-pushed the dagster-housekeeping branch from 5987d8c to 1ac1395 Compare March 11, 2026 04:04
@aesharpe aesharpe moved this from New to In progress in Catalyst Megaproject Mar 11, 2026
@zaneselvans zaneselvans requested a review from jdangerx March 28, 2026 23:19
@zaneselvans zaneselvans marked this pull request as ready for review March 28, 2026 23:45

@zaneselvans zaneselvans left a comment


Well this clearly got out of hand. 🤖 🔥

Comment on lines -44 to -73
alembic upgrade head &&
    ferc_to_sqlite \
        --loglevel DEBUG \
        --workers 8 \
        "$PUDL_SETTINGS_YML"
}

function run_pudl_etl() {
    echo "Running PUDL ETL"
    pudl_etl \
        --loglevel DEBUG \
        "$PUDL_SETTINGS_YML"
}

function run_unit_tests() {
    echo "Running unit tests"
    pytest \
        -n auto \
        --etl-settings "$PUDL_SETTINGS_YML" \
        --live-dbs test/unit \
        --no-cov
}

function run_integration_tests() {
    echo "Running integration tests"
    pytest \
        -n auto \
        --etl-settings "$PUDL_SETTINGS_YML" \
        --live-dbs test/integration \
        --no-cov

This all goes away because the call details are now part of the pixi tasks, and we call them in one-liners below.


@asset(
-    required_resource_keys={"dataset_settings"},
+    required_resource_keys={"etl_settings"},

This change shows up all over the place and is responsible for most of the changed files. Because we now pass the ETL settings directly into Dagster's config system via the config file, it's nice to list it once and let the config system pull out the individual settings it needs wherever work is happening. It means one Config wrapper class instead of two, one line in the config file instead of two, and less need to track exactly which settings object is getting passed around. Seemed cleaner overall, but it did mean a lot of lines of code touched.

@@ -1,4 +1,4 @@
-"""Define tooling for monitoring the ferceqr_etl job during batch builds.
+"""Define tooling for monitoring the ferceqr job during batch builds.

Changed the name because all of these jobs are ETL jobs, so the _etl suffix doesn't really distinguish anything.

-@resource(config_schema=create_dagster_config(DatasetsSettings()))
-def dataset_settings(init_context) -> DatasetsSettings:
-    """Dagster resource for parameterizing PUDL ETL assets.
+class PudlEtlSettingsResource(ConfigurableResource):

On main we had DatasetsSettings (for the PUDL ETL) and FercToSqliteSettings (FERC extraction config). They were assembled separately and passed around independently. The config schema was generated dynamically from create_dagster_config(), which mapped Pydantic model fields to Dagster's Field objects by hand. Merging them together and just having a single PUDL ETL settings object that can be accessed from context wherever any component is needed simplified things.

However, this change did lead to a lot of lines of code getting touched, because everywhere we previously had the dataset_settings instance, we now have etl_settings — when you see that elsewhere, this is why. It's not just a rename: the contents of that object are different now, and represent the contents of the entire PUDL ETL settings file.

-    in the Dagit UI.
-    """
-    return FercToSqliteSettings(**init_context.resource_config)
+class ZenodoDoiSettingsResource(ConfigurableResource):

New resource that gives Dagster direct access to the Zenodo DOIs that currently identify our raw inputs. I pulled this in to do the provenance tracking for the FERC-to-SQLite DBs, but it'll be useful for other "what data version is this?" tracking / provenance / asset metadata we want to add to the system over time, and is less janky than sneaking around Dagster's back and reading the zenodo_dois.yml file off of disk ourselves. Having it integrated into Dagster's abstractions also means we can inject other values for testing if we want to.


test/unit/settings_test.py

Deleted tests: _convert_settings_to_dagster_config and create_dagster_config

Two tests that validated the old Dagster config-dict generation machinery are removed — test_convert_settings_to_dagster_config and test_dagster_config_excludes_computed_parts. Those functions (_convert_settings_to_dagster_config, create_dagster_config) no longer exist in settings.py; the resource layer now uses Pydantic-native ConfigurableResource rather than hand-assembling Dagster Field dicts.

TestDatasetsSettingsResource → TestPudlEtlSettingsResource

The old class tested dataset_settings, a legacy @resource-decorated function that accepted raw config dicts and validated them via Dagster's config schema layer. The tests checked that unknown datasource keys and wrong field types raised DagsterInvalidConfigError, and that default values were wired through config_schema correctly.

All three of those tests are replaced by two tests against the new PudlEtlSettingsResource:

  • test_invalid_field_type — passes a non-string value for etl_settings_path and asserts that Pydantic raises ValidationError. The error type changes from DagsterInvalidConfigError to ValidationError because validation now happens in the Pydantic model, not in the Dagster config layer.
  • test_loads_from_file — the more meaningful test: constructs a real PudlEtlSettingsResource pointed at the packaged etl_fast.yml and asserts that it returns an EtlSettings with a populated ferc1 dataset settings object. This tests the full path from config → file load → Pydantic model.

New test_datastore_resource_loads

A new module-level test exercises the DatastoreResource migration. It constructs ZenodoDoiSettingsResource and DatastoreResource using the new context-manager resource API (from_resource_context_cm) and asserts the result is a real Datastore instance. This replaces implicit coverage that existed in the old pudl_datastore_fixture fixture wiring.

FercDbfToSqliteSettings added to abstract-base skip set

_all_settings_instances() collects every concrete settings class and tries to instantiate each with no arguments to check default-value sanity. FercDbfToSqliteSettings is added to the skip set alongside GenericDatasetSettings and FercGenericXbrlToSqliteSettings — it is an abstract intermediate base class, not a concrete settings type, so there is no meaningful default instance to construct.
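The sweep described above can be sketched in pure Python: walk the subclass tree, skip the known abstract intermediate bases, and instantiate everything else with no arguments. The class names below are illustrative stand-ins for the real settings hierarchy:

```python
class SettingsBase:
    """Stand-in root of the settings class hierarchy."""


class GenericDatasetSettings(SettingsBase):
    """Abstract intermediate base -- no meaningful default instance."""


class Ferc1DbfToSqliteSettings(GenericDatasetSettings):
    """A concrete settings class with sane defaults."""

    def __init__(self, years=(2020, 2021)):
        self.years = list(years)


# Classes that should never be instantiated directly.
ABSTRACT_BASES = {GenericDatasetSettings}


def all_settings_instances(base=SettingsBase):
    """Instantiate every concrete subclass of `base` with no arguments,
    recursing through the hierarchy and skipping abstract bases."""
    found = []
    for cls in base.__subclasses__():
        if cls not in ABSTRACT_BASES:
            found.append(cls())  # raises if defaults are broken
        found.extend(all_settings_instances(cls))
    return found
```

Any concrete class whose defaults are broken raises at collection time, which is exactly the default-value sanity check the test performs.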

Comment on lines +36 to +41
with pudl_engine.connect() as connection:
for table_name in required_tables:
first_row = connection.execute(
sa.select(sa.literal(1)).select_from(sa.table(table_name)).limit(1)
).scalar()
assert first_row is not None, f"Expected {table_name} to contain data."

Real change: check that the table not only exists but actually contains data, to avoid the case where we initialize the DB but then never write to it.


test/unit/io_managers_test.py

Apologies for the noise from the type annotation additions. Summary of the actual changes below.

Dead test removal

Three tests are removed that had already been abandoned on main:

  • test_error_when_handling_view_without_metadata — tested a view-writing path that no longer exists in the IO manager.
  • test_handling_view_with_metadata — was already @pytest.mark.skip with a comment saying "debug or remove"; it's now removed.
  • test_error_when_reading_view_without_metadata — companion to the above.

New tests for the migrated ConfigurableIOManager classes

Four new tests that cover the Pydantic-native IO managers:

test_mixed_format_io_manager_invalid_config — verifies that constructing PudlMixedFormatIOManager(write_to_parquet=False, read_from_parquet=True) raises RuntimeError. This guards the invariant that you can't read from parquet if you never wrote to it.

test_mixed_format_io_manager_initializes_backends — confirms that _sqlite_io_manager and _parquet_io_manager are lazily constructed and that the correct underlying classes are instantiated.

test_ferc_dbf_io_manager_uses_injected_dataset_settings and test_ferc_xbrl_io_manager_uses_injected_dataset_settings — the core regression tests for the new configurable managers. Each constructs a FercDbfSQLiteConfigurableIOManager / FercXbrlSQLiteConfigurableIOManager via model_copy with injected EtlSettings, mocks the underlying FercDbfSQLiteIOManager / FercXbrlSQLiteIOManager, and asserts that _query is called with the correct years extracted from the injected settings (ferc1.dbf_years and ferc1.xbrl_years respectively). A mock DagsterInstance returning valid provenance metadata is wired in so the provenance check passes.

test_ferc_dbf_io_manager_rejects_incompatible_provenance — verifies the provenance staleness check introduced in ferc_sqlite_provenance.py. It supplies a ZenodoDoiSettings with a deliberately wrong DOI for ferc1 and asserts that load_input raises RuntimeError with the message "Zenodo DOI mismatch" before ever calling pd.read_sql_query.

test_ferc_dbf_io_manager_requires_provenance_metadata — verifies the other provenance failure mode: when instance.get_latest_materialization_event returns None (no prior materialization recorded), load_input raises RuntimeError with "No Dagster provenance metadata".

Class renames reflected

All references to FercXBRLSQLiteIOManager and FercDBFSQLiteIOManager are updated to FercXbrlSQLiteIOManager and FercDbfSQLiteIOManager respectively, and the new FercDbfSQLiteConfigurableIOManager and FercXbrlSQLiteConfigurableIOManager are imported and used in the new tests.


test/unit/ferc_sqlite_provenance_test.py

The tests divide cleanly into two groups:

get_ferc_sqlite_provenance and build_ferc_sqlite_provenance_metadata — these test the fingerprinting side: that a provenance object is constructed correctly from a db_name string, that the dataset/format fields are parsed correctly for all four FERC form/format combinations, that the years list is non-empty and sorted, that the metadata dict contains all the required keys, and that malformed db_name values (missing the _dbf/_xbrl suffix) are rejected with ValueError.

assert_ferc_sqlite_compatible — these test the compatibility-checking side: the function that a downstream IO manager calls before reading from a prebuilt SQLite file. The cases covered are:

  • No instance available → silent no-op (safe to call unconditionally)
  • Matching provenance → passes
  • Stored DB covers a superset of the required years → passes (important: requesting a fast subset of years should still work against a full DB)
  • DOI mismatch → RuntimeError
  • Stored DB missing required years → RuntimeError
  • No materialization event recorded → RuntimeError
  • DB was built with status="skipped" (i.e. years=[]) → RuntimeError

The mocker.MagicMock() pattern is used throughout to simulate a DagsterInstance that returns pre-constructed provenance metadata without requiring a real Dagster run. This keeps all these tests fast and dependency-free.
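The cases above can be condensed into a stdlib sketch of the check plus the mock pattern. This is a hypothetical signature and metadata layout, not the real module's API; the point is that `unittest.mock.MagicMock` lets the compatibility logic be exercised without a real Dagster run:

```python
from unittest.mock import MagicMock


def assert_ferc_sqlite_compatible(instance, db_name, required_doi, required_years):
    """Sketch of the compatibility check a downstream IO manager runs
    before reading from a prebuilt SQLite file."""
    if instance is None:
        return  # no Dagster instance available: silent no-op
    event = instance.get_latest_materialization_event(db_name)
    if event is None:
        raise RuntimeError(f"No Dagster provenance metadata for {db_name}")
    meta = event.metadata  # assumed to be a plain dict in this sketch
    if meta["zenodo_doi"] != required_doi:
        raise RuntimeError("Zenodo DOI mismatch")
    # Superset check: a fast subset of years is fine against a full DB.
    if not set(required_years) <= set(meta["years"]):
        raise RuntimeError("Stored DB missing required years")


# Simulate a DagsterInstance whose last materialization covers 2020-2022.
instance = MagicMock()
instance.get_latest_materialization_event.return_value = MagicMock(
    metadata={"zenodo_doi": "10.5281/zenodo.123", "years": [2020, 2021, 2022]}
)
# Requesting a subset of years against the full DB passes quietly.
assert_ferc_sqlite_compatible(instance, "ferc1_dbf", "10.5281/zenodo.123", [2021])
```

Because the mock returns pre-constructed metadata, every branch of the check is reachable synchronously, which is what keeps these tests fast and dependency-free.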

Comment on lines -195 to -196
-co-located. Presuming you've run the ETL with the ``--ignore-foreign-key-constraints``
-flag, you can also look at the PUDL ``plants_eia860`` and ``plants_all_ferc1`` tables

The --ignore-foreign-key-constraints flag is no longer required because the FK constraint check is now separated from building the DBs.

@zaneselvans zaneselvans requested a review from krivard March 29, 2026 04:49
@zaneselvans zaneselvans moved this from In progress to In review in Catalyst Megaproject Mar 29, 2026
@zaneselvans zaneselvans added the agents label (Issues related to working with PUDL data or code using LLM/AI based coding agents) Mar 29, 2026

My goals with the updates to AGENTS.md in this PR were to:

  • address the major issues that came up during the refactor using agents
  • create agent guidance that's "feature complete"
  • stay under the recommended 500 line limit.

Beyond ~500 lines the general consensus seems to be that you want to start using "progressive disclosure" to be more selective about what gets loaded into context. The initial pattern for that in Nov/Dec of last year was to have additional domain-specific docs-agent.md and test-agent.md etc. files at different levels in the repo hierarchy. That seems to have been largely supplanted now by Agent Skills.

The most helpful guide I came across for working on this file was Claude's Skill Authoring Best Practices. The AGENTS.md website was unreasonably vague.

@zaneselvans zaneselvans requested a review from zschira March 30, 2026 22:10

Labels

  • agents — Issues related to working with PUDL data or code using LLM/AI based coding agents
  • cli — Scripts and other command line interfaces to PUDL.
  • dagster — Issues related to our use of the Dagster orchestrator
  • developer experience — Things that make the developers' lives easier, but don't necessarily directly improve the data.
  • nightly-builds — Anything having to do with nightly builds or continuous deployment.
  • testing — Writing tests, creating test data, automating testing, etc.

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

Modernize Dagster config and launch scripts

4 participants