Skip to content

Organize canonical pudl.dagster registry... and other things#5124

Open
zaneselvans wants to merge 46 commits intodagster-housekeepingfrom
canonical-defs
Open

Organize canonical pudl.dagster registry... and other things#5124
zaneselvans wants to merge 46 commits intodagster-housekeepingfrom
canonical-defs

Conversation

@zaneselvans
Copy link
Copy Markdown
Member

@zaneselvans zaneselvans commented Mar 23, 2026

Overview

This PR is about rationalizing the organization of the PUDL repo's Dagster-specific code and CLI entry points. It does not (should not!) change any of the actual Dagster code or behavior, just where it lives in the repo and how it's organized.

In the process, some non-Dagster code that had been living in sin with Dagster was evicted and found a new home in packages further from the city center, leading to a much longer commute. In some cases whole modules or packages were demolished, and replaced with Dagster luxury apartments or clean script wrappers. Some names have been changed to protect the innocent and ensure that object references still make sense in their new contexts. In a subsequent PR we will definitely start charging for on-street parking in the new Dagster neighborhood, so this is the last chance to get free parking for your Dagster code!

The new pudl.dagster package

What's in there?

The new dedicated Dagster orchestration package at pudl.dagster is structured like this:

  • pudl.dagster.asset_checks for asset-check definitions, factories and helpers
  • pudl.dagster.assets for asset definitions and registration (only the oddball assets previously defined in pudl.etl)
  • pudl.dagster.build tools for dynamically assembling dagster.Definitions
  • pudl.dagster.config for reusable Dagster run-config fragments and helpers
  • pudl.dagster.io_managers for the Dagster IO managers that read and write PUDL assets and prerequisite SQLite databases
  • pudl.dagster.jobs for the standard named Dagster jobs like pudl, ferc_to_sqlite, and ferceqr
  • pudl.dagster.partitions for shared partition definitions used across assets, checks, and automation
  • pudl.dagster.provenance for FERC SQLite provenance metadata and compatibility checking helpers
  • pudl.dagster.resources for Dagster ConfigurableResource definitions and the default resource mapping
  • pudl.dagster.sensors for sensor-based automation like the FERC EQR deployment sensor

The package exposes its public surface (e.g. build_defs, defs, default_assets, …) via lazy imports in pudl.dagster.__init__, so importing the package name does not trigger the full Dagster import chain.

Where'd it come from?

Dagster-related code that was previously scattered throughout the repo was collected under pudl.dagster, in some cases resulting in elimination of existing modules or packages. Relocated code included:

  • pudl.io_managerspudl.dagster.io_managers
  • pudl.ferc_sqlite_provenancepudl.dagster.provenance
  • pudl.resourcespudl.dagster.resources
  • pudl.settingspudl.dagster.partitions (partitions only!)
  • pudl.etl.static_assetspudl.dagster.assets.core.static
  • pudl.etl.glue_assetspudl.dagster.assets.core.glue
  • pudl.etl.eia_bulk_elec_assetspudl.dagster.assets.core.eiaapi_electricity (also renamed to drop the stale eia_bulk_elec prefix)
  • pudl.etl.ferc_to_sqlite_assetspudl.dagster.assets.raw.ferc_to_sqlite
  • pudl.deploy.ferceqrpudl.dagster.sensors (sensor only!)
  • pudl.etlpudl.dagster.config
  • pudl.etl.asset_checkspudl.dagster.asset_checks
  • pudl.etlpudl.dagster.jobs
  • pudl.etlpudl.dagster.build

No longer living in sin with Dagster 😈

  • pudl.etl.check_foreign_keyspudl.validate.integrity (foreign key checking function, error reporting, exception classes)
  • pudl.etl.check_foreign_keyspudl.scripts.pudl_check_fks (foreign key check script entrypoint / CLI logic)
  • pudl.etl.asset_checkspudl.validate.quality (no_null_rows and weighted_quantile data quality helpers)
  • pudl.dbt_wrapperpudl.validate.dbt (dbt invocation wrapper with failure context reporting)

Re-organization of the pudl.validate package

pudl.validate is now a subpackage rather than a flat module, providing a structured home for all PUDL data validation logic organized by framework and approach:

  • pudl.validate.dbt — dbt invocation wrappers (build_with_context, dagster_to_dbt_selection, etc.), relocated from the top-level pudl.dbt_wrapper module
  • pudl.validate.integrity — database integrity checks (foreign key validation classes and check_foreign_keys), split out from the old flat pudl.validate module
  • pudl.validate.quality — bespoke data quality utilities (no_null_rows, weighted_quantile), split out from the old flat pudl.validate module

The package __init__.py contains only a docstring — all callers have been updated to import from the specific submodule.

Consolidated CLI scripts under pudl.scripts

What changed?

All PUDL console script entry points are now defined as thin wrappers in pudl.scripts, with substantive logic remaining in the modules that own it. Previously several scripts had their CLI code mixed in with non-CLI module logic scattered across the package. Two wayward CLIs were relocated to pudl.scripts:

  • pudl.workspace.datastore (the pudl_datastore CLI) → pudl.scripts.pudl_datastore
  • pudl.analysis.service_territory (the pudl_service_territories CLI) → pudl.scripts.pudl_service_territories

All entry point functions are standardized to be named main, and all scripts now use if __name__ == "__main__": sys.exit(main()) to propagate exit status to the shell.

The pudl_datastore CLI also gained a --all flag that operates on every known dataset without requiring a hardcoded list, replacing the old --dataset/-d flag with a positional nargs=-1 argument.

Structural conventions enforced by tests

A new test module test/unit/scripts/test_scripts_conventions.py enforces two structural rules:

  • No @click.command or @click.group decorators outside src/pudl/scripts/
  • Every CLI script module contains if __name__ == "__main__": sys.exit(main())

Dissolved pudl.convert subpackage

The pudl.convert subpackage was a loose collection of two unrelated utilities. Each module was moved to a more semantically appropriate home:

  • pudl.convert.censusdp1tract_to_sqlitepudl.extract.censusdp1tract (Census DP1 conversion is logically an extraction step)
  • pudl.convert.metadata_to_rstpudl.scripts.metadata_to_rst (documentation-generation script, not a data conversion utility)

Dissolved pudl.defs placeholder package

The pudl.defs package previously existed solely as an empty placeholder to make dg check defs work. With the lazy-loading wiring added to pudl.dagster.__init__, the defs object is assembled there and re-exported from the existing pudl.definitions module, which remains the stable dg-compatible entry point. The pudl.defs package is no longer needed.

Other Changes

  • docs, imports, script entrypoints, and tests were updated to match the new structure
  • added test/unit/dagster_asset_checks_test.py to verify behavior of the runtime types expected by generated asset checks. This was causing a type warning which makes it tempting to re-write naively, but actually the runtime typing is necessary to get the correct behavior from the generated checks, so this test is intended to prevent future accidental breakage of that behavior.
  • dg.* import prefix is now used canonically for all dagster imports within the registry
  • The zenodo-cache-sync workflow was updated to use pudl_datastore --all instead of relying on a default of downloading all datasets when none are explicitly provided

Migration map

flowchart LR
  subgraph Old_Locations[Previous locations]
    io[pudl.io_managers]
    prov[pudl.ferc_sqlite_provenance]
    res[pudl.resources]
    settings[pudl.settings]
    etlinit[pudl.etl]
    etlchecks[pudl.etl.asset_checks]
    static[pudl.etl.static_assets]
    glue[pudl.etl.glue_assets]
    bulk[pudl.etl.eia_bulk_elec_assets]
    sqlite[pudl.etl.ferc_to_sqlite_assets]
    ferceqr[pudl.deploy.ferceqr]
    fk[pudl.etl.check_foreign_keys]
    cvt_census[pudl.convert.censusdp1tract_to_sqlite]
    cvt_rst[pudl.convert.metadata_to_rst]
    ds_cli[pudl.workspace.datastore CLI]
    svc_cli[pudl.analysis.service_territory CLI]
    defs_pkg[pudl.defs]
    dbt_wrapper[pudl.dbt_wrapper]
  end

  subgraph New_Dagster_Home[pudl.dagster]
    io2[pudl.dagster.io_managers]
    prov2[pudl.dagster.provenance]
    res2[pudl.dagster.resources]
    parts[pudl.dagster.partitions]
    cfg[pudl.dagster.config]
    checks2[pudl.dagster.asset_checks]
    jobs[pudl.dagster.jobs]
    build[pudl.dagster.build]
    static2[pudl.dagster.assets.core.static]
    glue2[pudl.dagster.assets.core.glue]
    eiaapi[pudl.dagster.assets.core.eiaapi_electricity]
    sqlite2[pudl.dagster.assets.raw.ferc_to_sqlite]
    sensors[pudl.dagster.sensors]
  end

  subgraph Validate_Package[pudl.validate subpackage]
    v_integrity[pudl.validate.integrity]
    v_quality[pudl.validate.quality]
    v_dbt[pudl.validate.dbt]
  end

  subgraph Non_Dagster_Destinations[Non-Dagster destinations]
    fk_cli[pudl.scripts.pudl_check_fks]
    census_ext[pudl.extract.censusdp1tract]
    meta_script[pudl.scripts.metadata_to_rst]
    ds_script[pudl.scripts.pudl_datastore]
    svc_script[pudl.scripts.pudl_service_territories]
    definitions[pudl.definitions]
  end

  io --> io2
  prov --> prov2
  res --> res2
  settings -->|partitions only| parts
  etlinit --> cfg
  etlinit --> jobs
  etlinit --> build
  etlchecks --> checks2
  static --> static2
  glue --> glue2
  bulk -->|renamed| eiaapi
  sqlite --> sqlite2
  ferceqr -->|sensor only| sensors

  fk -->|FK checker and helpers| v_integrity
  fk -->|CLI entrypoint| fk_cli
  etlchecks -->|no_null_rows, weighted_quantile| v_quality
  dbt_wrapper --> v_dbt

  cvt_census --> census_ext
  cvt_rst --> meta_script
  ds_cli --> ds_script
  svc_cli --> svc_script
  defs_pkg -->|lazy defs via pudl.dagster.__init__| definitions
Loading

@zaneselvans zaneselvans changed the title Make pudl.defs the canonical Dagster registry Make pudl.dagster the canonical Dagster registry Mar 23, 2026
@zaneselvans zaneselvans added dagster Issues related to our use of the Dagster orchestrator developer experience Things that make the developers' lives easier, but don't necessarily directly improve the data. agents Issues related to working with PUDL data or code using LLM/AI based coding agents labels Mar 23, 2026
@zaneselvans zaneselvans moved this from New to In progress in Catalyst Megaproject Mar 23, 2026
@zaneselvans zaneselvans self-assigned this Mar 23, 2026
Move the canonical Dagster assembly from pudl.defs into a new pudl.dagster package, split the old registry module by Dagster abstraction, and update the stable code location plus internal imports to point at the new package layout.
zaneselvans

This comment was marked as outdated.

@zaneselvans zaneselvans linked an issue Mar 23, 2026 that may be closed by this pull request
- convert pudl.dagster.assets from a single module into a package
- move the orphan asset-definition modules out of pudl.etl and into:
  - pudl.dagster.assets.core.glue
  - pudl.dagster.assets.core.static
  - pudl.dagster.assets.core.eia_bulk_elec
  - pudl.dagster.assets.raw.ferc_to_sqlite
- preserve the existing asset loader surface from pudl.dagster.assets
  while updating it to import from the new core/ and raw/ subpackages
- update Dagster configuration so dg resolves the canonical defs module
  from pudl.dagster and uses pudl.definitions as the code location entrypoint
- update metadata doc references to the new glue module paths
- update focused unit tests to import and patch the moved asset modules
- keep pixi.lock in sync with the local editable package metadata after validation
- Move reusable foreign key validation helpers into pudl.validate and
  move the pudl_check_fks CLI wrapper into pudl.scripts.
- Update the pudl_check_fks console entrypoint and dependent imports so
  foreign key checks keep working after removing pudl.etl.
- Move group_mean_continuity_check into pudl.dagster.asset_checks,
  where the EIA 860 and EIA 923 Dagster asset checks actually use it.
- Remove the now-empty pudl.etl package so the module layout matches
  current orchestration and validation responsibilities.
- Keep Dagster's runtime input annotation for schema asset checks and
  add a narrow type ignore because IO manager loading depends on it.
- Refresh pixi.lock to capture the package metadata update from these
  staged Python changes.
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 92.00000% with 16 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (dagster-housekeeping@8262bca). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/pudl/scripts/pudl_check_fks.py 68.42% 6 Missing ⚠️
src/pudl/validate.py 89.13% 5 Missing ⚠️
src/pudl/dagster/asset_checks.py 88.00% 3 Missing ⚠️
src/pudl/dagster/assets/__init__.py 95.24% 1 Missing ⚠️
src/pudl/dagster/config.py 91.67% 1 Missing ⚠️
Additional details and impacted files
@@                   Coverage Diff                   @@
##             dagster-housekeeping    #5124   +/-   ##
=======================================================
  Coverage                        ?   93.22%           
=======================================================
  Files                           ?      228           
  Lines                           ?    19357           
  Branches                        ?        0           
=======================================================
  Hits                            ?    18045           
  Misses                          ?     1312           
  Partials                        ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- Make pudl.dagster the single, explicit home for orchestration code so
  Dagster behavior is no longer scattered across the top-level package.
- Use lazy imports for the dagster subpackage to avoid circular imports
  that have caused problems in the past in other contexts.
- Keep the FERC SQLite provenance checks next to the assets and IO
  managers that consume them, since they are part of the runtime
  contract that tells us whether a stored SQLite artifact can be reused
  safely in the current Dagster run.
- Treat the FERC and PUDL IO managers as Dagster orchestration
  infrastructure rather than generic helpers, because they control how
  assets read prerequisite SQLite outputs and write managed Parquet and
  SQLite artifacts.
- Leave only genuinely shared helpers at the top level so package
  boundaries better match architectural responsibility.
- Update tests and developer docs to reinforce pudl.dagster as the
  stable orchestration API.
- Move Dagster resources into pudl.dagster.resources and remove
  pudl.resources.
- Add dedicated pudl.dagster.partitions and pudl.dagster.sensors
  modules for FERC EQR orchestration state.
- Update IO managers, extractors, glue code, and tests to use the
  new Dagster imports and resource names.
- Expand module docstrings and update developer docs and release
  notes to point at the canonical Dagster package.
- Clarify the pytest-integration Pixi task description to mention
  data validation coverage.
@zaneselvans zaneselvans changed the title Make pudl.dagster the canonical Dagster registry Organize canonical Dagster registry under pudl.dagster Mar 24, 2026
zaneselvans and others added 6 commits March 24, 2026 15:55
The pudl.convert subpackage was a loose collection of two unrelated
utilities that didn't belong together. This commit moves each module
to a more semantically correct home:

- censusdp1tract_to_sqlite.py → pudl.extract.censusdp1tract: The
  Census DP1 conversion is logically an extraction step (raw geodata
  to SQLite), so it belongs in pudl.extract alongside the other
  extractors that feed the pipeline.

- metadata_to_rst.py → pudl.scripts.metadata_to_rst: This is a
  documentation-generation script, not a data conversion utility.
  Moving it to pudl.scripts puts it with other standalone scripts.

Related doc and reference updates follow from these renames.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
zaneselvans and others added 13 commits March 28, 2026 23:12
Convert pudl.validate from a flat module into a subpackage with three
submodules organized by validation framework and concern:

- pudl.validate.dbt: dbt invocation wrappers, relocated from the
  top-level pudl.dbt_wrapper module
- pudl.validate.integrity: database integrity checks (ForeignKeyError,
  ForeignKeyErrors, check_foreign_keys), split from the old validate.py
- pudl.validate.quality: bespoke data quality utilities (no_null_rows,
  weighted_quantile, ExcessiveNullRowsError), split from the old validate.py

The package __init__.py contains only a docstring; callers import from
the specific submodule. All affected scripts, tests, and docs have been
updated to use the new import paths.

Also adds two commented-out pytestmark stubs in ferc_xbrl_extract_test.py,
glue_test.py, and ferc_sqlite_provenance_test.py as placeholders for a
planned ferc1_sqlite_provenance test label.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zaneselvans zaneselvans changed the title Organize canonical Dagster registry under pudl.dagster Organize canonical pudl.dagster registry... and other things Mar 30, 2026
Copy link
Copy Markdown
Member Author

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished the self-review. Small cleanup tasks that fell out of it:

  • Rationalize ETL settings class names and instance names and standardize across the whole repo.
  • Put get_ferc_form_name in pudl.dagster.io_managers with leading underscore, since that's the only place it's used.
  • Document the reason we need dagster_asset_checks_test.py regression test.
  • Review the remaining non-asset dagster imports happening outside of pudl.dagster and see if they ought to be pulled under pudl.dagster or if there's a good reason for them to be where they are (ferceqr, sec10k, record_linkage, ml_tools, validate/dbt, transform/epacems

.. code-block:: console

$ pudl_datastore --dataset eia860
$ pudl_datastore eia860
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mandatory args with no flag are more the standard CLI idiom, and we've already changed to this on the zenodo DOI udpater script.

- name: Update AWS S3 cache with any new Zenodo archives
run: |
pixi run pudl_datastore --cloud-cache-path ${{ env.CLOUD_CACHE_PATH }} --bypass-local-cache --loglevel DEBUG
pixi run pudl_datastore --all --cloud-cache-path ${{ env.CLOUD_CACHE_PATH }} --bypass-local-cache --loglevel DEBUG
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switch to explicit --all to run all available datasets.

Mandatory args with no flag are more the standard CLI idiom, and we've already changed to this on the zenodo DOI udpater script.

Comment on lines -568 to -570
################################################################################
# Provide a CLI for generating service territories
################################################################################
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLI wrapper relocated to src/pudl/scripts/

_write_status_file("FAILURE")


@dg.sensor(
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sensor relocated to src/pudl/datster/sensors.py

logger = get_logger(__name__)


ferceqr_sensor_status = (
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sensor relocated to src/pudl/datster/sensors.py

@@ -0,0 +1,34 @@
"""Regression tests for Dagster asset-check input typing."""
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our dataframe handling in IO Managers depends on getting the correct types. But the type annotation we have in the function that deals with this looks fishy, and either humans or agents could be prone to removing it as spurious. This test ensures that the tests break if they do. Need to give a better real explanation in the docstring of the what/why.

@@ -14,6 +14,8 @@
from pudl.settings import EtlSettings, FercToSqliteSettings
from pudl.workspace.datastore import ZenodoDoiSettings

# pytestmark: MarkDecorator = pytest.mark.ferc1_sqlite_provenance
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bring this back after #5133

Comment on lines -37 to +55
- `etl/` — Dagster asset and job definitions, organized by data source; top-level
`defs` object and all jobs live in `pudl.etl`
- `dagster/` — all Dagster orchestration code; the canonical home for everything
Dagster-specific. Sub-structure:
- `dagster/asset_checks.py` — asset-check definitions, factories, and helpers
- `dagster/assets/` — oddball asset definitions that don't fit the per-source layout;
`assets/core/` for processed assets, `assets/raw/` for raw-extraction assets
- `dagster/build.py` — assembles the `dagster.Definitions` object
- `dagster/config.py` — reusable Dagster run-config fragments and helpers
- `dagster/io_managers.py` — IO managers for SQLite, Parquet, and FERC SQLite reads
- `dagster/jobs.py` — named Dagster jobs (`pudl`, `ferc_to_sqlite`, `ferceqr`)
- `dagster/partitions.py` — shared partition definitions
- `dagster/provenance.py` — FERC SQLite fingerprinting and compatibility checks
- `dagster/resources.py` — `ConfigurableResource` definitions and default resource map
- `dagster/sensors.py` — sensor-based automation (e.g. the FERC EQR sensor)
- `deploy/` — post-ETL deployment logic: publishing outputs to GCS/S3, updating
`nightly`/`stable` git branches, triggering Zenodo releases, applying GCS holds,
and redeploying the PUDL Viewer Cloud Run service. `ferceqr.py` contains Dagster
assets specific to the FERC EQR batch pipeline; `pudl.py` covers full PUDL builds.
- `scripts/` — all CLI entry points as thin wrappers; one module per script, each
exposing a `main` Click command. Registered in `[project.scripts]` in `pyproject.toml`
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update repo map to reflect new repo layout.

return all_strings.difference(old_strings)


def get_ferc_form_name(db_name: str) -> str:
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this function is used only in pudl.dagster.io_managers.py in which case I should put it (back) there and treat it as an internal-only helper with a leading underscore _get_ferc_form

Comment on lines -339 to +346
"etl_settings": etl_settings,
"zenodo_dois": zenodo_dois,
"etl_settings": pudl_etl_settings_resource,
"zenodo_dois": zenodo_doi_settings_resource,
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the unification of the PUDL ETL Settings object, I want to rationalize the names for readability and make sure they're the same everywhere. EtlSettings and DatasetsSettings aren't great.

@zaneselvans zaneselvans marked this pull request as ready for review March 30, 2026 23:36
@zaneselvans zaneselvans requested a review from zschira March 30, 2026 23:36
@zaneselvans zaneselvans moved this from In progress to In review in Catalyst Megaproject Mar 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Issues related to working with PUDL data or code using LLM/AI based coding agents dagster Issues related to our use of the Dagster orchestrator developer experience Things that make the developers' lives easier, but don't necessarily directly improve the data.

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

Organize canonical pudl.dagster registry

1 participant