Organize canonical pudl.dagster registry... and other things#5124
Organize canonical pudl.dagster registry... and other things#5124zaneselvans wants to merge 46 commits intodagster-housekeepingfrom
pudl.dagster registry... and other things#5124Conversation
pudl.dagster the canonical Dagster registry
Move the canonical Dagster assembly from pudl.defs into a new pudl.dagster package, split the old registry module by Dagster abstraction, and update the stable code location plus internal imports to point at the new package layout.
- convert pudl.dagster.assets from a single module into a package - move the orphan asset-definition modules out of pudl.etl and into: - pudl.dagster.assets.core.glue - pudl.dagster.assets.core.static - pudl.dagster.assets.core.eia_bulk_elec - pudl.dagster.assets.raw.ferc_to_sqlite - preserve the existing asset loader surface from pudl.dagster.assets while updating it to import from the new core/ and raw/ subpackages - update Dagster configuration so dg resolves the canonical defs module from pudl.dagster and uses pudl.definitions as the code location entrypoint - update metadata doc references to the new glue module paths - update focused unit tests to import and patch the moved asset modules - keep pixi.lock in sync with the local editable package metadata after validation
- Move reusable foreign key validation helpers into pudl.validate and move the pudl_check_fks CLI wrapper into pudl.scripts. - Update the pudl_check_fks console entrypoint and dependent imports so foreign key checks keep working after removing pudl.etl. - Move group_mean_continuity_check into pudl.dagster.asset_checks, where the EIA 860 and EIA 923 Dagster asset checks actually use it. - Remove the now-empty pudl.etl package so the module layout matches current orchestration and validation responsibilities. - Keep Dagster's runtime input annotation for schema asset checks and add a narrow type ignore because IO manager loading depends on it. - Refresh pixi.lock to capture the package metadata update from these staged Python changes.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dagster-housekeeping #5124 +/- ##
=======================================================
Coverage ? 93.22%
=======================================================
Files ? 228
Lines ? 19357
Branches ? 0
=======================================================
Hits ? 18045
Misses ? 1312
Partials ? 0 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- Make pudl.dagster the single, explicit home for orchestration code so Dagster behavior is no longer scattered across the top-level package. - Use lazy imports for the dagster subpackage to avoid circular imports that have caused problems in the past in other contexts. - Keep the FERC SQLite provenance checks next to the assets and IO managers that consume them, since they are part of the runtime contract that tells us whether a stored SQLite artifact can be reused safely in the current Dagster run. - Treat the FERC and PUDL IO managers as Dagster orchestration infrastructure rather than generic helpers, because they control how assets read prerequisite SQLite outputs and write managed Parquet and SQLite artifacts. - Leave only genuinely shared helpers at the top level so package boundaries better match architectural responsibility. - Update tests and developer docs to reinforce pudl.dagster as the stable orchestration API.
- Move Dagster resources into pudl.dagster.resources and remove pudl.resources. - Add dedicated pudl.dagster.partitions and pudl.dagster.sensors modules for FERC EQR orchestration state. - Update IO managers, extractors, glue code, and tests to use the new Dagster imports and resource names. - Expand module docstrings and update developer docs and release notes to point at the canonical Dagster package. - Clarify the pytest-integration Pixi task description to mention data validation coverage.
pudl.dagster the canonical Dagster registrypudl.dagster
The pudl.convert subpackage was a loose collection of two unrelated utilities that didn't belong together. This commit moves each module to a more semantically correct home: - censusdp1tract_to_sqlite.py → pudl.extract.censusdp1tract: The Census DP1 conversion is logically an extraction step (raw geodata to SQLite), so it belongs in pudl.extract alongside the other extractors that feed the pipeline. - metadata_to_rst.py → pudl.scripts.metadata_to_rst: This is a documentation-generation script, not a data conversion utility. Moving it to pudl.scripts puts it with other standalone scripts. Related doc and reference updates follow from these renames. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Convert pudl.validate from a flat module into a subpackage with three submodules organized by validation framework and concern: - pudl.validate.dbt: dbt invocation wrappers, relocated from the top-level pudl.dbt_wrapper module - pudl.validate.integrity: database integrity checks (ForeignKeyError, ForeignKeyErrors, check_foreign_keys), split from the old validate.py - pudl.validate.quality: bespoke data quality utilities (no_null_rows, weighted_quantile, ExcessiveNullRowsError), split from the old validate.py The package __init__.py contains only a docstring; callers import from the specific submodule. All affected scripts, tests, and docs have been updated to use the new import paths. Also adds two commented-out pytestmark stubs in ferc_xbrl_extract_test.py, glue_test.py, and ferc_sqlite_provenance_test.py as placeholders for a planned ferc1_sqlite_provenance test label. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pudl.dagsterpudl.dagster registry... and other things
zaneselvans
left a comment
There was a problem hiding this comment.
Finished the self-review. Small cleanup tasks that fell out of it:
- Rationalize ETL settings class names and instance names and standardize across the whole repo.
- Put
get_ferc_form_nameinpudl.dagster.io_managerswith leading underscore, since that's the only place it's used. - Document the reason we need
dagster_asset_checks_test.pyregression test. - Review the remaining non-asset dagster imports happening outside of
pudl.dagsterand see if they ought to be pulled underpudl.dagsteror if there's a good reason for them to be where they are (ferceqr,sec10k,record_linkage,ml_tools,validate/dbt,transform/epacems
| .. code-block:: console | ||
|
|
||
| $ pudl_datastore --dataset eia860 | ||
| $ pudl_datastore eia860 |
There was a problem hiding this comment.
Mandatory args with no flag are more the standard CLI idiom, and we've already changed to this on the zenodo DOI udpater script.
| - name: Update AWS S3 cache with any new Zenodo archives | ||
| run: | | ||
| pixi run pudl_datastore --cloud-cache-path ${{ env.CLOUD_CACHE_PATH }} --bypass-local-cache --loglevel DEBUG | ||
| pixi run pudl_datastore --all --cloud-cache-path ${{ env.CLOUD_CACHE_PATH }} --bypass-local-cache --loglevel DEBUG |
There was a problem hiding this comment.
Switch to explicit --all to run all available datasets.
Mandatory args with no flag are more the standard CLI idiom, and we've already changed to this on the zenodo DOI udpater script.
| ################################################################################ | ||
| # Provide a CLI for generating service territories | ||
| ################################################################################ |
There was a problem hiding this comment.
CLI wrapper relocated to src/pudl/scripts/
| _write_status_file("FAILURE") | ||
|
|
||
|
|
||
| @dg.sensor( |
There was a problem hiding this comment.
Sensor relocated to src/pudl/datster/sensors.py
| logger = get_logger(__name__) | ||
|
|
||
|
|
||
| ferceqr_sensor_status = ( |
There was a problem hiding this comment.
Sensor relocated to src/pudl/datster/sensors.py
| @@ -0,0 +1,34 @@ | |||
| """Regression tests for Dagster asset-check input typing.""" | |||
There was a problem hiding this comment.
Our dataframe handling in IO Managers depends on getting the correct types. But the type annotation we have in the function that deals with this looks fishy, and either humans or agents could be prone to removing it as spurious. This test ensures that the tests break if they do. Need to give a better real explanation in the docstring of the what/why.
| @@ -14,6 +14,8 @@ | |||
| from pudl.settings import EtlSettings, FercToSqliteSettings | |||
| from pudl.workspace.datastore import ZenodoDoiSettings | |||
|
|
|||
| # pytestmark: MarkDecorator = pytest.mark.ferc1_sqlite_provenance | |||
| - `etl/` — Dagster asset and job definitions, organized by data source; top-level | ||
| `defs` object and all jobs live in `pudl.etl` | ||
| - `dagster/` — all Dagster orchestration code; the canonical home for everything | ||
| Dagster-specific. Sub-structure: | ||
| - `dagster/asset_checks.py` — asset-check definitions, factories, and helpers | ||
| - `dagster/assets/` — oddball asset definitions that don't fit the per-source layout; | ||
| `assets/core/` for processed assets, `assets/raw/` for raw-extraction assets | ||
| - `dagster/build.py` — assembles the `dagster.Definitions` object | ||
| - `dagster/config.py` — reusable Dagster run-config fragments and helpers | ||
| - `dagster/io_managers.py` — IO managers for SQLite, Parquet, and FERC SQLite reads | ||
| - `dagster/jobs.py` — named Dagster jobs (`pudl`, `ferc_to_sqlite`, `ferceqr`) | ||
| - `dagster/partitions.py` — shared partition definitions | ||
| - `dagster/provenance.py` — FERC SQLite fingerprinting and compatibility checks | ||
| - `dagster/resources.py` — `ConfigurableResource` definitions and default resource map | ||
| - `dagster/sensors.py` — sensor-based automation (e.g. the FERC EQR sensor) | ||
| - `deploy/` — post-ETL deployment logic: publishing outputs to GCS/S3, updating | ||
| `nightly`/`stable` git branches, triggering Zenodo releases, applying GCS holds, | ||
| and redeploying the PUDL Viewer Cloud Run service. `ferceqr.py` contains Dagster | ||
| assets specific to the FERC EQR batch pipeline; `pudl.py` covers full PUDL builds. | ||
| - `scripts/` — all CLI entry points as thin wrappers; one module per script, each | ||
| exposing a `main` Click command. Registered in `[project.scripts]` in `pyproject.toml` |
There was a problem hiding this comment.
Update repo map to reflect new repo layout.
src/pudl/helpers.py
Outdated
| return all_strings.difference(old_strings) | ||
|
|
||
|
|
||
| def get_ferc_form_name(db_name: str) -> str: |
There was a problem hiding this comment.
It looks like this function is used only in pudl.dagster.io_managers.py in which case I should put it (back) there and treat it as an internal-only helper with a leading underscore _get_ferc_form
| "etl_settings": etl_settings, | ||
| "zenodo_dois": zenodo_dois, | ||
| "etl_settings": pudl_etl_settings_resource, | ||
| "zenodo_dois": zenodo_doi_settings_resource, |
There was a problem hiding this comment.
With the unification of the PUDL ETL Settings object, I want to rationalize the names for readability and make sure they're the same everywhere. EtlSettings and DatasetsSettings aren't great.
Overview
This PR is about rationalizing the organization of the PUDL repo's Dagster-specific code and CLI entry points. It does not (should not!) change any of the actual Dagster code or behavior, just where it lives in the repo and how it's organized.
In the process, some non-Dagster code that had been living in sin with Dagster was evicted and found a new home in packages further from the city center, leading to a much longer commute. In some cases whole modules or packages were demolished, and replaced with Dagster luxury apartments or clean script wrappers. Some names have been changed to protect the innocent and ensure that object references still make sense in their new contexts. In a subsequent PR we will definitely start charging for on-street parking in the new Dagster neighborhood, so this is the last chance to get free parking for your Dagster code!
The new
pudl.dagsterpackageWhat's in there?
The new dedicated Dagster orchestration package at
pudl.dagsteris structured like this:pudl.dagster.asset_checksfor asset-check definitions, factories and helperspudl.dagster.assetsfor asset definitions and registration (only the oddball assets previously defined inpudl.etl)pudl.dagster.buildtools for dynamically assemblingdagster.Definitionspudl.dagster.configfor reusable Dagster run-config fragments and helperspudl.dagster.io_managersfor the Dagster IO managers that read and write PUDL assets and prerequisite SQLite databasespudl.dagster.jobsfor the standard named Dagster jobs likepudl,ferc_to_sqlite, andferceqrpudl.dagster.partitionsfor shared partition definitions used across assets, checks, and automationpudl.dagster.provenancefor FERC SQLite provenance metadata and compatibility checking helperspudl.dagster.resourcesfor DagsterConfigurableResourcedefinitions and the default resource mappingpudl.dagster.sensorsfor sensor-based automation like the FERC EQR deployment sensorThe package exposes its public surface (e.g.
build_defs,defs,default_assets, …) via lazy imports inpudl.dagster.__init__, so importing the package name does not trigger the full Dagster import chain.Where'd it come from?
Dagster-related code that was previously scattered throughout the repo was collected under
pudl.dagster, in some cases resulting in elimination of existing modules or packages. Relocated code included:pudl.io_managers→pudl.dagster.io_managerspudl.ferc_sqlite_provenance→pudl.dagster.provenancepudl.resources→pudl.dagster.resourcespudl.settings→pudl.dagster.partitions(partitions only!)pudl.etl.static_assets→pudl.dagster.assets.core.staticpudl.etl.glue_assets→pudl.dagster.assets.core.gluepudl.etl.eia_bulk_elec_assets→pudl.dagster.assets.core.eiaapi_electricity(also renamed to drop the staleeia_bulk_elecprefix)pudl.etl.ferc_to_sqlite_assets→pudl.dagster.assets.raw.ferc_to_sqlitepudl.deploy.ferceqr→pudl.dagster.sensors(sensor only!)pudl.etl→pudl.dagster.configpudl.etl.asset_checks→pudl.dagster.asset_checkspudl.etl→pudl.dagster.jobspudl.etl→pudl.dagster.buildNo longer living in sin with Dagster 😈
pudl.etl.check_foreign_keys→pudl.validate.integrity(foreign key checking function, error reporting, exception classes)pudl.etl.check_foreign_keys→pudl.scripts.pudl_check_fks(foreign key check script entrypoint / CLI logic)pudl.etl.asset_checks→pudl.validate.quality(no_null_rowsandweighted_quantiledata quality helpers)pudl.dbt_wrapper→pudl.validate.dbt(dbt invocation wrapper with failure context reporting)Re-organization of the
pudl.validatepackagepudl.validateis now a subpackage rather than a flat module, providing a structured home for all PUDL data validation logic organized by framework and approach:pudl.validate.dbt— dbt invocation wrappers (build_with_context,dagster_to_dbt_selection, etc.), relocated from the top-levelpudl.dbt_wrappermodulepudl.validate.integrity— database integrity checks (foreign key validation classes andcheck_foreign_keys), split out from the old flatpudl.validatemodulepudl.validate.quality— bespoke data quality utilities (no_null_rows,weighted_quantile), split out from the old flatpudl.validatemoduleThe package
__init__.pycontains only a docstring — all callers have been updated to import from the specific submodule.Consolidated CLI scripts under
pudl.scriptsWhat changed?
All PUDL console script entry points are now defined as thin wrappers in
pudl.scripts, with substantive logic remaining in the modules that own it. Previously several scripts had their CLI code mixed in with non-CLI module logic scattered across the package. Two wayward CLIs were relocated topudl.scripts:pudl.workspace.datastore(thepudl_datastoreCLI) →pudl.scripts.pudl_datastorepudl.analysis.service_territory(thepudl_service_territoriesCLI) →pudl.scripts.pudl_service_territoriesAll entry point functions are standardized to be named
main, and all scripts now useif __name__ == "__main__": sys.exit(main())to propagate exit status to the shell.The
pudl_datastoreCLI also gained a--allflag that operates on every known dataset without requiring a hardcoded list, replacing the old--dataset/-dflag with a positionalnargs=-1argument.Structural conventions enforced by tests
A new test module
test/unit/scripts/test_scripts_conventions.pyenforces two structural rules:@click.commandor@click.groupdecorators outsidesrc/pudl/scripts/if __name__ == "__main__": sys.exit(main())Dissolved
pudl.convertsubpackageThe
pudl.convertsubpackage was a loose collection of two unrelated utilities. Each module was moved to a more semantically appropriate home:pudl.convert.censusdp1tract_to_sqlite→pudl.extract.censusdp1tract(Census DP1 conversion is logically an extraction step)pudl.convert.metadata_to_rst→pudl.scripts.metadata_to_rst(documentation-generation script, not a data conversion utility)Dissolved
pudl.defsplaceholder packageThe
pudl.defspackage previously existed solely as an empty placeholder to makedg check defswork. With the lazy-loading wiring added topudl.dagster.__init__, thedefsobject is assembled there and re-exported from the existingpudl.definitionsmodule, which remains the stabledg-compatible entry point. Thepudl.defspackage is no longer needed.Other Changes
test/unit/dagster_asset_checks_test.pyto verify behavior of the runtime types expected by generated asset checks. This was causing a type warning which makes it tempting to re-write naively, but actually the runtime typing is necessary to get the correct behavior from the generated checks, so this test is intended to prevent future accidental breakage of that behavior.dg.*import prefix is now used canonically for all dagster imports within the registrypudl_datastore --allinstead of relying on a default of downloading all datasets when none are explicitly providedMigration map
flowchart LR subgraph Old_Locations[Previous locations] io[pudl.io_managers] prov[pudl.ferc_sqlite_provenance] res[pudl.resources] settings[pudl.settings] etlinit[pudl.etl] etlchecks[pudl.etl.asset_checks] static[pudl.etl.static_assets] glue[pudl.etl.glue_assets] bulk[pudl.etl.eia_bulk_elec_assets] sqlite[pudl.etl.ferc_to_sqlite_assets] ferceqr[pudl.deploy.ferceqr] fk[pudl.etl.check_foreign_keys] cvt_census[pudl.convert.censusdp1tract_to_sqlite] cvt_rst[pudl.convert.metadata_to_rst] ds_cli[pudl.workspace.datastore CLI] svc_cli[pudl.analysis.service_territory CLI] defs_pkg[pudl.defs] dbt_wrapper[pudl.dbt_wrapper] end subgraph New_Dagster_Home[pudl.dagster] io2[pudl.dagster.io_managers] prov2[pudl.dagster.provenance] res2[pudl.dagster.resources] parts[pudl.dagster.partitions] cfg[pudl.dagster.config] checks2[pudl.dagster.asset_checks] jobs[pudl.dagster.jobs] build[pudl.dagster.build] static2[pudl.dagster.assets.core.static] glue2[pudl.dagster.assets.core.glue] eiaapi[pudl.dagster.assets.core.eiaapi_electricity] sqlite2[pudl.dagster.assets.raw.ferc_to_sqlite] sensors[pudl.dagster.sensors] end subgraph Validate_Package[pudl.validate subpackage] v_integrity[pudl.validate.integrity] v_quality[pudl.validate.quality] v_dbt[pudl.validate.dbt] end subgraph Non_Dagster_Destinations[Non-Dagster destinations] fk_cli[pudl.scripts.pudl_check_fks] census_ext[pudl.extract.censusdp1tract] meta_script[pudl.scripts.metadata_to_rst] ds_script[pudl.scripts.pudl_datastore] svc_script[pudl.scripts.pudl_service_territories] definitions[pudl.definitions] end io --> io2 prov --> prov2 res --> res2 settings -->|partitions only| parts etlinit --> cfg etlinit --> jobs etlinit --> build etlchecks --> checks2 static --> static2 glue --> glue2 bulk -->|renamed| eiaapi sqlite --> sqlite2 ferceqr -->|sensor only| sensors fk -->|FK checker and helpers| v_integrity fk -->|CLI entrypoint| fk_cli etlchecks -->|no_null_rows, weighted_quantile| v_quality dbt_wrapper --> v_dbt cvt_census --> census_ext cvt_rst --> meta_script ds_cli --> ds_script svc_cli --> svc_script defs_pkg -->|lazy defs via pudl.dagster.__init__| definitions