# Modernize Dagster config and launch scripts (#5071)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Base branch: `main`
Changes from 122 commits.
---

New file: an example `.env` template (23 lines added):

```shell
# Copy this file to .env and customize values for your local environment.
# The .env file is ignored by git.

# Required workspace paths
PUDL_INPUT=/absolute/path/to/pudl-input
PUDL_OUTPUT=/absolute/path/to/pudl-output
DAGSTER_HOME=/absolute/path/to/dagster-home

# Logging controls (read by pudl.logging_helpers.configure_root_logger)
PUDL_LOGLEVEL=INFO
PUDL_COLOR_LOGS=true

# Optional: write logs to a file in addition to console output.
# Leave unset for console-only logging.
# PUDL_LOGFILE=/absolute/path/to/pudl/logs/pudl.log

# Optional: don't try to use intersphinx to link to external documentation
# during the docs build -- it can be flaky and isn't required for most docs edits.
# PUDL_DOCS_DISABLE_INTERSPHINX=1

# Optional: don't remove generated rst files after the docs build. Can be helpful
# when debugging formatting errors.
# PUDL_DOCS_KEEP_GENERATED_FILES=1
```
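At runtime these variables are read from the environment. The sketch below is a hypothetical helper, written only to illustrate the required-versus-optional split described in the template's comments; the real logic lives in the `pudl` package (e.g. `pudl.logging_helpers.configure_root_logger` for the logging controls):

```python
from pathlib import Path

def collect_pudl_settings(env: dict[str, str]) -> dict:
    """Gather PUDL workspace settings from environment variables.

    Hypothetical illustration only -- not the actual PUDL API. Required
    paths raise KeyError if missing; logging controls fall back to the
    defaults shown in the .env template.
    """
    return {
        # Required workspace paths: fail loudly if unset.
        "pudl_input": Path(env["PUDL_INPUT"]),
        "pudl_output": Path(env["PUDL_OUTPUT"]),
        "dagster_home": Path(env["DAGSTER_HOME"]),
        # Logging controls, with defaults matching the template.
        "loglevel": env.get("PUDL_LOGLEVEL", "INFO"),
        "color_logs": env.get("PUDL_COLOR_LOGS", "true").lower() == "true",
        # Optional: None means console-only logging.
        "logfile": env.get("PUDL_LOGFILE"),
    }

settings = collect_pudl_settings({
    "PUDL_INPUT": "/tmp/pudl-input",
    "PUDL_OUTPUT": "/tmp/pudl-output",
    "DAGSTER_HOME": "/tmp/dagster-home",
})
```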
---

New file (1 line added):

```
skills/**
```
---

```diff
@@ -139,7 +139,7 @@ jobs:
             --container-env OMP_NUM_THREADS=4 \
             --container-env PUDL_BOT_PAT=${{ secrets.PUDL_BOT_PAT }} \
             --container-env PUDL_GCS_OUTPUT=${{ env.PUDL_GCS_OUTPUT }} \
-            --container-env PUDL_SETTINGS_YML="/home/ubuntu/pudl/src/pudl/package_data/settings/etl_full.yml" \
+            --container-env DG_NIGHTLY_CONFIG="src/pudl/package_data/settings/dg_nightly.yml" \
             --container-env SLACK_TOKEN=${{ secrets.PUDL_DEPLOY_SLACK_TOKEN }} \
             --container-env ZENODO_SANDBOX_TOKEN_PUBLISH=${{ secrets.ZENODO_SANDBOX_TOKEN_PUBLISH }} \
             --container-env ZENODO_TARGET_ENV=${{ (startsWith(github.ref_name, 'v20') && 'production') || 'sandbox' }} \
```

**Member (Author):** Just switching to the new dagster config file, which defines some additional dagster settings, and references the same PUDL ETL settings file we were using before internally.
---

```diff
@@ -7,17 +7,23 @@ on:
     types: [created]

 env:
-  username: ${{ github.event.issue.user.login }}
-  url: ${{ github.event.issue.html_url }}
+  username: ""
+  url: ""
   org: catalyst-cooperative

 jobs:
   com-dev-notify:
     name: Notify Catalyst of community activity
     runs-on: ubuntu-latest
     steps:
+      - name: Get username if an issue was opened
+        if: ${{ github.event_name == 'issues' }}
+        run: |
+          echo "username=${{ github.event.issue.user.login }}" >> "${GITHUB_ENV}"
+          echo "url=${{ github.event.issue.html_url }}" >> "${GITHUB_ENV}"
+
       - name: Get username if a discussion was created
-        if: ${{ (github.event_name == 'discussion') }}
+        if: ${{ github.event_name == 'discussion' }}
         run: |
          echo "username=${{ github.event.discussion.user.login }}" >> "${GITHUB_ENV}"
          echo "url=${{ github.event.discussion.html_url }}" >> "${GITHUB_ENV}"
```

**Member (Author):** Moved these down into the jobs because of a linting error, which highlighted that at this point in the workflow, we don't yet know if we're looking at an issue or a discussion, and so there's no single correct value to look at. Down below we know which kind of event we're in so we can use the right value. Empty strings prevent "potentially undefined" warnings.
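The `>> "${GITHUB_ENV}"` lines work because the runner appends `key=value` pairs to a file and re-reads it before the next step, with later assignments overriding earlier ones. A rough stand-in for that mechanism (helper names and sample values are hypothetical, for illustration only):

```python
import tempfile
from pathlib import Path

def append_github_env(env_file: Path, **pairs: str) -> None:
    """Append key=value lines, like `echo "k=v" >> "$GITHUB_ENV"`."""
    with env_file.open("a") as f:
        for key, value in pairs.items():
            f.write(f"{key}={value}\n")

def read_github_env(env_file: Path) -> dict[str, str]:
    """Parse the file into a dict; later lines override earlier ones."""
    env: dict[str, str] = {}
    for line in env_file.read_text().splitlines():
        key, _, value = line.partition("=")
        env[key] = value
    return env

with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / "github_env"
    # Top-level defaults, like `username: ""` / `url: ""` in the workflow.
    env_file.write_text("username=\nurl=\n")
    # A later step fills in the event-specific values.
    append_github_env(env_file, username="someuser", url="https://example.com/1")
    env = read_github_env(env_file)
```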
```diff
@@ -36,13 +42,11 @@ jobs:
         uses: slackapi/slack-github-action@v3
         with:
-          # Slack channel id, channel name, or user id to post message.
-          # See also: https://api.slack.com/methods/chat.postMessage#channels
-          # You can pass in multiple channels to post to by providing a comma-delimited list of channel IDs.
-          channel-id: "community-dev"
           # For posting a markdown message
           method: chat.postMessage
+          token: ${{ secrets.COMMUNITY_DEV_SLACK_BOT_TOKEN }}
           payload: |
             {
+              "channel": "community-dev",
               "blocks": [
                 {
                   "type": "section",
```

**Member (Author):** Deprecated syntax was giving an error/warning.

```diff
@@ -53,5 +57,3 @@ jobs:
             }
           ]
         }
-      env:
-        SLACK_BOT_TOKEN: ${{ secrets.COMMUNITY_DEV_SLACK_BOT_TOKEN }}
```
---

```diff
@@ -20,8 +20,8 @@ jobs:
     permissions:
       pull-requests: read
     outputs:
-      # 2025-07-17: because merge_group is an Object and run_code_checks is not a conditional, need to explicitly check for null-ness instead of relying on truthiness.
-      run_code_checks: ${{ github.event_name =='workflow_dispatch' || (steps.filter.outputs.code == 'true' && (github.event.merge_group != null)) }}
+      # Run code checks for manual dispatches and merge queue runs with code changes.
+      run_code_checks: ${{ github.event_name == 'workflow_dispatch' || (github.event_name == 'merge_group' && steps.filter.outputs.code == 'true') }}
```

**Member (Author)** (on lines -23 to +24): Found a more appropriate element of the github environment to check on here.
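The revised expression's logic, restated as a plain function so the truth table is explicit (a sketch; the function name is illustrative, not part of the workflow):

```python
def run_code_checks(event_name: str, code_changed: bool) -> bool:
    """Mirror of the workflow expression:
    event_name == 'workflow_dispatch'
    or (event_name == 'merge_group' and the path filter matched code).
    """
    return event_name == "workflow_dispatch" or (
        event_name == "merge_group" and code_changed
    )

# Manual dispatches always run checks; merge queue runs do so only when
# the path filter found code changes; other events never set this output.
```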
```diff
     steps:
       - uses: actions/checkout@v6
         with:
@@ -47,6 +47,7 @@ jobs:
           - '!.github/workflows/bot-auto-merge.yml'
           - '!.github/workflows/build-deploy-docs.yml'
           - '!.github/workflows/build-deploy-pudl.yml'
+          - '!.github/workflows/deploy-pudl.yml'
           - '!.github/workflows/com-dev-notify.yml'
           - '!.github/workflows/docker-build-test.yml'
           - '!.github/workflows/q-update-issue-scheduler.yml'
@@ -77,7 +78,7 @@ jobs:
         run: |
           echo "event name (${{ github.event_name }}) is workflow dispatch: ${{ github.event_name == 'workflow_dispatch' }}"
           echo "found code changes: ${{ steps.filter.outputs.code }}"
-          echo "merge_group ${{ github.event.merge_group }} is not null: ${{ github.event.merge_group != null }}"
+          echo "event name (${{ github.event_name }}) is merge_group: ${{ github.event_name == 'merge_group' }}"

   ci-docs:
     permissions:
```
---

```diff
@@ -72,6 +72,7 @@ repos:
           |.*\.bib
           |.*\.csv
           |.*\.html
+          |.*\.json
           |src/pudl/package_data/ferc1/.*_categories\.yaml
           )$ | migrations/ | devtools/ | test/ | notebooks/ | src/pudl/metadata/codes.py | src/pudl/transform/params/ferc1.py
         args: [] # Make this read, not write
@@ -131,6 +132,15 @@ repos:
         always_run: false
         entry: pixi run jupyter nbconvert --clear-output

+      - id: pixi-lock-update
+        name: pixi-lock-update
+        stages: [pre-commit]
+        language: system
+        verbose: false
+        pass_filenames: false
+        always_run: true
+        entry: bash -c 'pixi install --quiet && git add pixi.lock'
+
       - id: unit-tests
         name: unit-tests
         stages: [pre-commit]
@@ -150,5 +160,5 @@ ci:
   autoupdate_branch: main
   autoupdate_commit_msg: "[pre-commit.ci] pre-commit autoupdate"
   autoupdate_schedule: weekly
-  skip: [unit-tests, nb-output-clear, shellcheck]
+  skip: [pixi-lock-update, unit-tests, nb-output-clear, shellcheck]
   submodules: false
```

**Member (Author)** (on lines +135 to +142): This handles the annoying little round-trip that happens when you run the tests and as a result pixi updates the hash and version in `pixi.lock`.
**Member (Author):** My goal with the updates to `AGENTS.md`: beyond ~500 lines the general consensus seems to be that you want to start using "progressive disclosure" to be more selective about what gets loaded into context. The initial pattern for that in Nov/Dec of last year was to have additional domain-specific files. The most helpful guide I came across for working on this file was Claude's Skill Authoring Best Practices. The AGENTS.md website was unreasonably vague.
---

The `AGENTS.md` rewrite (`@@ -1,112 +1,69 @@`):

# LLM coding agent instructions for the Public Utility Data Liberation (PUDL) Project
**Removed:**

## PUDL Project Overview

- The PUDL Project implements a data processing pipeline that ingests raw energy system data from public agencies like the US Energy Information Administration (EIA) and the Federal Energy Regulatory Commission (FERC) and transforms it into clean, well-organized tables for use in analysis and modeling.
- PUDL uses the Dagster data orchestration framework to manage dependencies between different assets, and to enable parallel execution of different portions of the data processing pipeline.
- The raw input data for the PUDL data processing pipeline can be found in the directory indicated by the `$PUDL_INPUT` environment variable. The raw inputs are downloaded as needed by the data pipeline, but can be pre-downloaded in bulk using the `pudl_datastore` command line interface.
- The PUDL data processing pipeline primarily generates Apache Parquet files as its outputs. These outputs can be found in `$PUDL_OUTPUT/parquet/`, where `$PUDL_OUTPUT` is an environment variable which should be set by the user.

## Development environment tips

- PUDL uses pixi to manage its Python environment and dependencies. All dependencies and configuration are defined in `pyproject.toml`.
- The default pixi environment includes all development tools.
- To run commands in the pixi environment, prefix them with `pixi run` (e.g., `pixi run pytest`).
- Pixi environments and tasks are defined in `pyproject.toml` under `[tool.pixi]` sections.
- PUDL uses ruff to lint and automatically format Python code. Before staging files for a commit, always run `pixi run pre-commit run ruff-check --all-files` and `pixi run pre-commit run ruff-format --all-files`.
- A number of pre-commit hooks are defined in `.pre-commit-config.yaml`.
- We try to use appropriate type annotations in function, class, and method definitions, but they are not yet checked or enforced. They are primarily to improve readability for humans, LLMs, and IDEs.
**Added:**

## PUDL project overview

- PUDL ingests raw public energy data (EIA, FERC, EPA, and others) and transforms it into clean, analysis-ready tables.
- The pipeline is orchestrated using Dagster assets and jobs.
- Raw inputs are managed through the datastore and are rooted at `$PUDL_INPUT`.
- Primary outputs are Parquet files rooted at `$PUDL_OUTPUT/parquet/`.

## Python environment and tooling

- PUDL uses `pixi` for dependency and task management.
- Use `pixi run <command>` to ensure commands run in the project environment.
- Never try to create or manage Python environments manually; always use `pixi` to ensure consistency.
- Project tasks and environments are defined in `pyproject.toml` under `[tool.pixi]`.
- Git pre-commit hooks are defined in `.pre-commit-config.yaml`.

## Available skills

There are a number of skills defined in `skills-lock.json` that should be available to you. If they're not available, use `pixi run install-skills` to install them.
**Removed:**

## Testing instructions

- PUDL uses pytest to manage its unit and integration tests.
- Tests should avoid using unittest and monkeypatch, and use pytest-mock instead.
- Rather than enumerating various test cases within a single test function, tests should use the `pytest.parametrize` decorator to enumerate test cases, specifying the expected success, failure, or exception for each case.
- Tests must be run inside the pixi environment.
- Disable test coverage collection with `--no-cov` when running individual tests; otherwise coverage checks fail with spurious warnings, since a few tests cover only a small portion of the codebase. For example, the unit tests can be run with `pixi run pytest --no-cov test/unit`.
- We use dbt only for data validation, and NOT for data transformations. The PUDL data tests are under the `dbt/` directory.
- dbt commands must typically be run from within the dbt directory, e.g.: `cd dbt && pixi run dbt build`.
- The PUDL integration tests process a substantial amount of data and take up to an hour to run, so they should not generally be run interactively during development.
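The parametrization pattern described above can be sketched as follows; `parse_year` and the cases are hypothetical, invented only to show how success and exception cases can share one parametrized test body (the actual decorator is `pytest.mark.parametrize`):

```python
import contextlib

import pytest

def parse_year(value: str) -> int:
    """Toy function under test -- hypothetical, for illustration only."""
    year = int(value)
    if not 1900 <= year <= 2100:
        raise ValueError(f"Year out of range: {year}")
    return year

@pytest.mark.parametrize(
    "value,expectation,expected",
    [
        # Success cases pass through the nullcontext unchanged...
        ("2020", contextlib.nullcontext(), 2020),
        # ...while failure cases assert that the exception is raised.
        ("1850", pytest.raises(ValueError), None),
        ("not-a-year", pytest.raises(ValueError), None),
    ],
)
def test_parse_year(value, expectation, expected):
    with expectation:
        assert parse_year(value) == expected
```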
## Code Style Guidelines

- Follow pandas naming conventions: use `df` for DataFrames and descriptive column names.
- Prefer longer, readable, descriptive variable names over short, cryptic ones.
- Use explicit type hints for function parameters and returns where helpful.
- Prefer method chaining for pandas operations when it improves readability.
- Use `pathlib.Path` for file system operations instead of string concatenation.
- Follow snake_case for functions/variables, PascalCase for classes.
- Use f-strings for string formatting, including in logging statements.
- Write docstrings for all public functions/classes using Google-style Python docstrings.
- Limit lines to 88 characters for better readability. Do not artificially restrict line length to 80 characters.
- Do not use `print()` statements; use Python's logging system instead.

## PUDL-Specific Patterns

- Asset dependencies in Dagster should be explicit and well-documented.
- In general, data validation should happen in dbt, not in Dagster asset checks.
- Sanity checks that validate assumptions about the data should be done as it is being transformed, with assertions failing loudly if expectations are not met.
- Use PUDL's existing utility functions in `pudl.helpers` when available.
- Raw data access should use the datastore pattern, not direct file I/O.
- Use nullable pandas dtypes (e.g. `pd.Int64Dtype()` or `pd.StringDtype()`) when possible, to avoid generic `object` dtypes and mixed NULL values.
- Parquet outputs should use snappy compression and pyarrow dtypes.
- Metadata describing the tables, columns, and data sources can be found in the `pudl.metadata` subpackage. "Resources" are tables and "Fields" are columns.
- Metadata classes defined in the `pudl.metadata.classes` module using Pydantic generally mirror the frictionless datapackage standard.
- Our documentation is built using Sphinx. The source files are in the `docs/` directory, in reStructuredText format.
- Whenever we make significant changes to the codebase, they should be noted in the PUDL release notes found at `docs/release_notes.rst`.
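A minimal illustration of the nullable-dtype guidance above (sample values invented for illustration):

```python
import pandas as pd

# With a missing value, a plain integer column silently becomes float64
# (or object), losing its integer dtype.
raw = pd.Series([1, 2, None])

# Nullable extension dtypes keep integers integral and use pd.NA for
# missing values instead of NaN or None.
plant_ids = raw.astype(pd.Int64Dtype())
utility_names = pd.Series(["Gadsby", None, "Huntington"], dtype=pd.StringDtype())
```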
## Performance Considerations

- Use vectorized pandas operations instead of row-wise `apply` or loops.
- Consider using just-in-time compilation with numba for performance-critical code.
- Do not use inplace operations on pandas DataFrames.
- Avoid chained indexing in pandas to prevent `SettingWithCopyWarning`.
- Use efficient pandas merging and joining techniques, ensuring indexes are set appropriately.
- Avoid creating unnecessary intermediate DataFrames.
- Use categorical dtypes for columns with a limited set of values to save memory.
- Profile and optimize any code that processes large datasets.
- PUDL relies primarily on pandas for data processing, but where performance or memory limitations matter, we may also use DuckDB or polars dataframes. For large datasets (>1 GB) or memory-intensive aggregations, consider polars before pandas.
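The first bullet in practice: a row-wise `apply` and its vectorized equivalent produce identical results, but the vectorized form runs as a single columnar operation (toy data, for illustration):

```python
import pandas as pd

# Toy generation data (values invented for illustration).
df = pd.DataFrame({"mwh": [100.0, 250.0, 0.0], "hours": [10.0, 25.0, 5.0]})

# Row-wise apply runs Python code once per row -- slow on large frames.
capacity_slow = df.apply(lambda row: row["mwh"] / row["hours"], axis=1)

# The vectorized equivalent is a single columnar operation.
capacity_fast = df["mwh"] / df["hours"]
```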
**Added:**

## Sandbox safe execution

- Prefer already-installed binaries before invoking commands that may trigger package resolution or updates.
- Prefer direct binaries (for example `dg`, `rg`, `ruff`) when they are already available in the active environment.
- When using `pixi run`, prefer frozen/locked execution modes that avoid dependency updates.
- For sandboxed terminal runs, keep cache and temporary directories writable and local to the workspace when possible, e.g. `TMPDIR`, `PIXI_HOME`.

## Always-on coding expectations

- Prefer explicit, readable code and descriptive names over terse names.
- Follow existing naming conventions and data model conventions.
- Reuse existing project helpers and established patterns before introducing new ones.
- Prefer dbt for data validation by default; use Dagster/Python validation only when there is a clear project-specific reason.

## Where to find additional detailed instructions

- If the `pudl` agent skill is enabled, it should be used to read PUDL database schemas and descriptions of tables and columns.
- If the `pudl-dev` agent skill is enabled, it should be used to inform software development tasks in the PUDL project.
- If the `dagster-expert` agent skill is enabled, it should be used when adding or modifying code related to orchestration of the data processing pipeline, and any of the concepts and classes defined by Dagster. This includes assets, resources, jobs, IO managers, sensors, etc.

## Developer documentation references

- General contributor docs: `docs/dev/`
- Python testing: `docs/dev/testing.rst`
- Data validation quickstart and reference using dbt: `docs/dev/data_validation_quickstart.rst`, `docs/dev/data_validation_reference.rst`
- Dagster development: `docs/dev/dev_dagster.rst`, `docs/dev/run_the_etl.rst`
- Editing PUDL Metadata: `docs/dev/metadata.rst`
- PUDL naming conventions: `docs/dev/naming_conventions.rst`

## Release notes

- Significant user-visible or developer-visible changes should be summarized in `docs/release_notes.rst`.
---

```diff
@@ -31,8 +31,8 @@ function run_ferceqr_etl() {
   authenticate_gcp &&
   dagster dev &

-  # Kick off the ferceqr_etl job asynchronously
-  dagster job backfill --noprompt -j ferceqr_etl --location pudl.etl
+  # Kick off the ferceqr job asynchronously
+  dagster job backfill --noprompt -j ferceqr --location pudl.etl

   # Wait for a file called 'SUCCESS' or 'FAILURE' to be created in PUDL_OUTPUT indicating completion
   # Timeout after 6 hours if file still doesn't exist
   inotifywait -e create -t 21600 --include 'SUCCESS|FAILURE' "$PUDL_OUTPUT"
```

**Member (Author)** (on lines -34 to +35): Update job name.
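The `inotifywait` call blocks until a matching file appears (or the timeout expires). A portable, if less efficient, polling stand-in for that wait logic (the function and names are hypothetical, written only to illustrate the sentinel-file mechanism):

```python
import tempfile
import time
from pathlib import Path

def wait_for_sentinel(output_dir: Path, timeout_s: float, poll_s: float = 0.01):
    """Poll for a SUCCESS or FAILURE sentinel file in output_dir.

    Returns the sentinel's name, or None if the timeout expires first --
    roughly what `inotifywait -e create --include 'SUCCESS|FAILURE'` does,
    minus the efficiency of kernel-level notification.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        for name in ("SUCCESS", "FAILURE"):
            if (output_dir / name).exists():
                return name
        time.sleep(poll_s)
    return None

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "SUCCESS").touch()  # simulate the job finishing successfully
    result = wait_for_sentinel(out, timeout_s=1.0)
```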
**Member (Author):** I'm predefining environment variables in several workflows to get rid of "potentially invalid reference" linting errors that were cluttering things up. Should have no actual impact.