EIA191 rp8 data integration #5058
Hi @e-belfer I opened this draft PR implementing the transform layer for Would love your feedback on the transform approach and column handling before I continue with the metadata/resource definitions. One thing I surfaced during the investigation: about 23% of rows have |
|
@irubey Hey! I'm planning to review this tomorrow and will take a look at this issue in particular, thanks for flagging it! |
Hi @irubey!
This is awesome! Thanks for the detailed notebook.
I agree with your choice of primary key, and it's good to see that all the records without working_gas_capacity_mcf are inactive fields. Generally, the data looks pretty clean, with the exceptions you noted.
I tried to push a small extraction change to your PR, but seems like I don't have permission, so I've described the changes directly below.
First and foremost, we have more up to date EIA 191 data available. Let's go ahead and update the DOI in pudl/package_data/settings/zenodo_dois.yaml to be 10.5281/zenodo.18823073 to get the latest data in.
Next, you'll need to implement a small workaround to correctly rename the columns at the extraction step, in src/pudl/extract/eia191.py:

    def process_raw(
        self, df: pd.DataFrame, page: str, **partition: PartitionSelection
    ) -> pd.DataFrame:
        """Rename columns using `any_year` partition."""
        return df.rename(
            columns=self._metadata.get_column_map(page=page, year="any_year")
        )

Since year and report_year are 100% identical, we can safely replace the existing process_raw function, which is currently adding the report_year column, and remove the drop step in the transform.
You'll also want to update the column names in src/pudl/package_data/eia191/column_maps/data/csv as follows:
| year_index | year | month | reservoir_state | operator_id_eia | gas_field_code | reservoir_code | company_name | field_name | reservoir_name | field_type | reservoir_county | reservoir_status | base_gas_mcf | working_gas_capacity_mcf | total_field_capacity_mcf | maximum_daily_delivery_mcf | storage_region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| any_year | year | month | report_state | id | gas_field_code | reservoir_code | company_name | field_name | reservoir_name | field_type | county_name | status | base_gas | working_gas_capacity_mcf | total_field_capacity_mcf | maximum_daily_delivery_mcf | region |
This names the columns correctly at the source, rather than renaming them in the transform.
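To illustrate how a row of the column map turns into a rename mapping, here is a minimal standalone sketch using only the standard library. It assumes the header row holds the PUDL names and the `any_year` row holds the raw source headers, as in the table above; `build_rename_map` and the trimmed `COLUMN_MAP_CSV` excerpt are hypothetical and are not part of PUDL's extractor, which reads the map through its own metadata class.

```python
import csv
import io

# Excerpt of the column map proposed above (a subset of columns only).
# Header row: PUDL column names. `any_year` row: raw source headers.
COLUMN_MAP_CSV = """\
year_index,reservoir_state,operator_id_eia,reservoir_county,reservoir_status,base_gas_mcf,storage_region
any_year,report_state,id,county_name,status,base_gas,region
"""


def build_rename_map(csv_text: str, partition: str = "any_year") -> dict[str, str]:
    """Return a {raw_name: pudl_name} mapping for one partition row.

    Hypothetical helper for illustration; not a PUDL function.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row["year_index"] == partition:
            # Invert the row: keys are PUDL names, values are raw names.
            return {raw: pudl for pudl, raw in row.items() if pudl != "year_index"}
    raise KeyError(f"No column map row for partition {partition!r}")


rename_map = build_rename_map(COLUMN_MAP_CSV)
```

A mapping like `rename_map` is the kind of dictionary that `df.rename(columns=...)` consumes in `process_raw`.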
A few other suggestions:
- Data types will be enforced when the table is written to the database, and they'll get defined in fields.py, so there's no need to manually coerce data types in the transform.
- I recommend running the function simplify_strings() from pudl.helpers on all string fields other than ID and state - it'll handle case, extra whitespace, and non-standard characters.
- status, region and field_type will all want to get further constrained in fields.py, using the enums field - see byproduct_description for an example of how this looks.
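For a sense of what this kind of string cleanup involves, here is a rough standard-library sketch of the normalization described above (case, whitespace, non-standard characters). It is not the implementation of `pudl.helpers.simplify_strings()`, whose exact behavior may differ; `simplify_text` is a hypothetical name used only for illustration.

```python
import re
import unicodedata
from typing import Optional


def simplify_text(value: Optional[str]) -> Optional[str]:
    """Lowercase, trim, collapse whitespace, and drop non-ASCII characters.

    Illustrative only; an approximation, not the PUDL helper.
    """
    if value is None:
        return None
    # Decompose accented characters, then drop anything outside ASCII.
    ascii_only = (
        unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    )
    # Collapse any run of whitespace to a single space and trim the ends.
    return re.sub(r"\s+", " ", ascii_only).strip().lower()


cleaned = [simplify_text(v) for v in ["  LIBERTY   North ", "Café\tField", None]]
```

In the transform this would apply to name-like columns such as company_name, field_name and reservoir_name, leaving ID and state columns untouched.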
Regarding the total vs working vs base gas distinction - we don't generally correct without a high degree of certainty, so I would do the same as you and pass through the data. Let's make sure we keep checking this relationship by writing this as a DBT validation at the end of the transform step and document the extent of failure using:
    config:
      error_if: ">X"
and leaving a note in the description. This will flag this for us to investigate further in the future, and easily flag for us if a new month of data has way more errors than are anticipated. You can also note this in the additional_details_text for the table-level metadata, and I'm happy to help wordsmith when you get there if you want a second brain.
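The logic behind such a threshold check can be sketched in plain Python. The real validation would live in dbt, so this only illustrates the idea: compute the share of complete rows where base plus working gas differs from the reported total, then compare it against an allowed rate. The function name, the `tolerance` parameter and the sample rows are all hypothetical.

```python
def discrepancy_rate(rows, tolerance=0.0):
    """Share of complete rows where base + working differs from total."""
    checked = 0
    mismatched = 0
    for row in rows:
        base = row["base_gas_mcf"]
        working = row["working_gas_capacity_mcf"]
        total = row["total_field_capacity_mcf"]
        # Skip rows with a missing component; they cannot be checked.
        if base is None or working is None or total is None:
            continue
        checked += 1
        if abs(base + working - total) > tolerance:
            mismatched += 1
    return mismatched / checked if checked else 0.0


# Made-up sample rows: one mismatch out of four checkable rows.
sample_rows = [
    {"base_gas_mcf": 100, "working_gas_capacity_mcf": 50, "total_field_capacity_mcf": 150},
    {"base_gas_mcf": 100, "working_gas_capacity_mcf": 50, "total_field_capacity_mcf": 200},
    {"base_gas_mcf": 10, "working_gas_capacity_mcf": 5, "total_field_capacity_mcf": 15},
    {"base_gas_mcf": 10, "working_gas_capacity_mcf": None, "total_field_capacity_mcf": 15},
    {"base_gas_mcf": 1, "working_gas_capacity_mcf": 1, "total_field_capacity_mcf": 2},
]
rate = discrepancy_rate(sample_rows)
```

A new month of data that pushes this rate well past the documented level is the situation the `error_if` threshold is meant to catch.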
Let me know if you have questions on any of this, otherwise ping me when you're ready for me to take a look again!
|
Hi @e-belfer, Thanks for the detailed review and suggestions. I've pushed several commits addressing the first-checkpoint feedback and would appreciate another look.
Implemented changes
All items from the first-checkpoint feedback are now implemented:
A few implementation details differed slightly from the suggestions, so I wanted to confirm they look reasonable.
Source column mapping
Three capacity columns include parentheses in the source headers, so the column map now renames:
To align with PUDL naming conventions, the following columns are also renamed:
Identifier naming (
|
This is in really good shape!
A few small notes about metadata for fields, mostly focused on trimming, and one sneaky non-standard NA value.
> Instead of a custom error_if expression, I reused the existing expect_sum_close_to_total_column test (already used in several EIA-923 tables).
> I set max_discrepancy_rate: 0.25, slightly above the observed ~23% mismatch rate. This seemed to capture the intent of the suggested validation while reusing existing macros.

Great, I have no concerns about this.
Not at all a blocker, but you could also consider using the add_fips_ids() helper function as we have both a state and a county column (with geocodes originating from table _core_censuspep__yearly_geocodes). There will be 4 pairs of county/state names that aren't parsed correctly and could be fixed by hand:
| county_name | state | count |
|---|---|---|
| la salle | IL | 138 |
| glacier national | MT | 12 |
| skagway-yakutat- | AK | 12 |
| w. carroll | LA | 12 |
Once these metadata changes are implemented, you'll be ready to write this table to DB!
    "description": (
        "Volume of base gas (cushion gas) in the underground storage reservoir, "
        "as reported by the operator to EIA on Form 191. Base gas is the volume "
        "of gas intended as permanent inventory in a reservoir to maintain "
        "adequate pressure and deliverability rates."
    ),

Suggested change:

    -"description": (
    -    "Volume of base gas (cushion gas) in the underground storage reservoir, "
    -    "as reported by the operator to EIA on Form 191. Base gas is the volume "
    -    "of gas intended as permanent inventory in a reservoir to maintain "
    -    "adequate pressure and deliverability rates."
    -),
    +"description": (
    +    "Volume of base gas (cushion gas) in the underground storage reservoir. "
    +    "Base gas is the volume of gas intended as permanent inventory in a reservoir "
    +    "to maintain adequate pressure and deliverability rates."
    +),

Lovely clear description!
In general, we don't need to note that this was reported by the operator on form 191 in column-level descriptions for any columns in this table. This will be clear from the context of the primary key and table-level descriptions.
     "company_name": {
         "type": "string",
    -    "description": "Name of company submitting SEC 10k filing.",
    +    "description": "Name of the reporting company.",

This should be clear enough to cover both use cases, good update.
src/pudl/metadata/fields.py

                "pattern": r"^\d{5}$",
            },
        },
        "county_name": {

There's already a county field we should use instead of creating a new one. If you want to make the description more specific, you can override the description in FIELD_METADATA_BY_RESOURCE. See line 9914 for an example of this.
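As a rough picture of what a per-resource override looks like, here is a standalone sketch. The two dictionaries are tiny hypothetical stand-ins for the shared field definitions and `FIELD_METADATA_BY_RESOURCE`, the descriptions are invented for illustration, and `resolve_field` is a made-up helper; how PUDL actually merges overrides may differ.

```python
from copy import deepcopy

# Hypothetical, trimmed-down stand-ins; the real dictionaries are far larger.
FIELD_METADATA = {
    "county": {"type": "string", "description": "County name."},
}
FIELD_METADATA_BY_RESOURCE = {
    "core_eia191__monthly_gas_storage": {
        "county": {
            "description": "County in which the storage field is located.",
        },
    },
}


def resolve_field(field_name: str, resource_name: str) -> dict:
    """Merge resource-specific overrides on top of the shared definition."""
    merged = deepcopy(FIELD_METADATA[field_name])
    overrides = FIELD_METADATA_BY_RESOURCE.get(resource_name, {})
    merged.update(overrides.get(field_name, {}))
    return merged


eia191_county = resolve_field("county", "core_eia191__monthly_gas_storage")
```

The shared `county` definition stays untouched, so other tables keep the generic description while this table gets the more specific one.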
src/pudl/metadata/fields.py

            ),
            "unit": "MWh",
        },
        "gas_field_code": {

Suggested change:

    -"gas_field_code": {
    +"gas_field_id_eia": {

When the ID is assigned by a particular agency, we note this as follows. For us, code is typically something that is part of a coding table (e.g., 1 = "active", 2 = "retired"), so in this case I think id is clearer.
src/pudl/metadata/fields.py

                "EIA storage region in which the underground natural gas storage field "
                "is located, as reported on Form 191."
            ),
            "constraints": {

This would be better as a dictionary called EIA191_STORAGE_REGIONS in enums.py.
            "One row per storage reservoir per month."
        ),
        "additional_source_text": (
            "EIA Form 191 (Schedule RP8 — Monthly Underground Gas Storage Report). "

We only use this if there's a particular part of the form that a table corresponds to (e.g., Part A). In this case, no need to provide this key at all.
            "See https://www.eia.gov/survey/form/eia_191/form.pdf."
        ),
        "additional_details_text": (
            "The ``total_field_capacity_mcf`` field is reported as the total design "

As noted elsewhere, I think this makes more sense as a custom usage warning. See here for instructions on how to do so.
src/pudl/metadata/fields.py

            "unit": "MWh",
        },
        "gas_field_code": {
            "type": "string",

This should be integer, as it looks like it is always numeric and not zero-padded in the raw data.
src/pudl/metadata/fields.py

        "storage_field_id_eia191": {
            "type": "string",
            "description": (
                "EIA-assigned identifier for an underground natural gas storage reservoir, "

Would be useful to note here that each ID is per company and state, so one company may have multiple.
src/pudl/metadata/fields.py

            },
        },
        "reservoir_code": {
            "type": "string",

integer would be the observed data type here.
|
Hi @e-belfer, Thanks for the detailed feedback. I’ve pushed updates addressing the second-round review comments.
Field renaming
Metadata cleanup
The table now materializes successfully in Dagster, and all pre-commit hooks and unit tests are passing. I left the optional FIPS enrichment out of this PR. The remaining county-name cleanup requires manual fixes for four records, including |
Add io_manager_key to the core_eia191__monthly_gas_storage asset and add an Alembic migration to create the table in the PUDL SQLite database. Enum constraints on field_type, status, and region match field metadata.
- Rename gas_field_code -> gas_field_id_eia (integer type)
- Change reservoir_code type to integer
- Replace county_name with existing county field; remove county_name from fields.py
- Move storage region enum to EIA191_STORAGE_REGIONS in enums.py
- Clarify storage_field_id_eia191 description (per-company-per-state)
- Trim form citation from base_gas_mcf description
- Remove additional_source_text from resource metadata
- Move capacity additivity note to usage_warnings custom format
Context is clear from table-level metadata; field descriptions don't need to repeat the source form. Also removes the non-additivity note from total_field_capacity_mcf since it is covered by the table-level usage warning.
- Remove redundant additional_details_text from resource (storage field ID semantics already captured in field description)
- Regenerate Alembic migration to pick up clean field descriptions (strip stale 'as reported on Form 191' column comments) and add explicit primary key constraint
Replace the EIA-191-specific `status` field with the existing `operational_status` field, overriding description and enum constraints via FIELD_METADATA_BY_RESOURCE for the eia191 table.
Follows agency-provenance naming convention (id not code, _eia suffix). Adds 999 → NA conversion in transform for Lyon 29 sentinel values.
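A sentinel-to-NA conversion of this kind can be sketched without pandas as below. The commit message does not say which column carries the 999 values, so the column name in the example (`example_id`) is a placeholder, and `null_out_sentinel` is a hypothetical helper rather than the code in the transform.

```python
SENTINEL = 999


def null_out_sentinel(records, column, sentinel=SENTINEL):
    """Return copies of records with sentinel values in `column` set to None."""
    return [
        {**rec, column: None} if rec.get(column) == sentinel else dict(rec)
        for rec in records
    ]


# `example_id` is a placeholder column name, not one taken from the PR.
raw_records = [{"example_id": 999}, {"example_id": 12}]
clean_records = null_out_sentinel(raw_records, "example_id")
```

Working on copies keeps the raw records intact, which makes it easy to count how many values were replaced.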
Overview
Initial Progress and checking in on #4757
What problem does this address?
EIA Form 191 monthly underground gas storage data (RP8 dataset, 2014–present) is not yet available in PUDL. This adds the transform layer as a step toward full integration.
What did you change?
- core_eia191__monthly_gas_storage transform asset in src/pudl/transform/eia191.py:
  - report_year column
  - year + month into report_date via pudl.helpers.convert_to_date()
  - base_gas → base_gas_mcf to make the Mcf unit explicit
  - StringDtype, integer columns to Int64Dtype
  - (id, report_date), verified unique across all 58,633 rows

Clarifications
The notebook uncovered that the three capacity columns base_gas_mcf, working_gas_capacity_mcf, and total_field_capacity_mcf don't add up for ~23% of rows.
This is an upstream reporting artifact, not a transform error. I looked at the forms and found that:
- working_gas_capacity_mcf and total_field_capacity_mcf are design/capacity specs from Part 3 of the form (static, public)
- base_gas_mcf is the one public field from Part 4 (actual operational volumes, which are otherwise confidential)

The mismatch is mostly systematic: 63 reservoirs always mismatch, 269 rarely do, confirming it reflects genuine differences in how operators interpret "total capacity," not data entry errors.
One exception: LIBERTY NORTH KS (2023) appears to have swapped/misreported values.
All three columns are passed through as reported. Users should not assume total = working + base.
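One way to see whether mismatches are systematic is to compute, per reservoir, the fraction of months in which the three columns fail to add up. The sketch below uses made-up records and a hypothetical helper name; it assumes `id` identifies the reservoir, as in the primary key listed above, and it is not the notebook's actual code.

```python
from collections import defaultdict


def mismatch_share_by_reservoir(rows):
    """Map each reservoir id to the fraction of its rows that mismatch."""
    counts = defaultdict(lambda: [0, 0])  # id -> [mismatched, total]
    for row in rows:
        tally = counts[row["id"]]
        tally[1] += 1
        expected = row["base_gas_mcf"] + row["working_gas_capacity_mcf"]
        if expected != row["total_field_capacity_mcf"]:
            tally[0] += 1
    return {rid: bad / n for rid, (bad, n) in counts.items()}


def _row(rid, base, working, total):
    return {
        "id": rid,
        "base_gas_mcf": base,
        "working_gas_capacity_mcf": working,
        "total_field_capacity_mcf": total,
    }


# Made-up data: A always mismatches, B never does, C does once in four months.
rows = [
    _row("A", 10, 5, 20), _row("A", 10, 5, 20),
    _row("B", 10, 5, 15), _row("B", 10, 5, 15),
    _row("C", 10, 5, 15), _row("C", 10, 5, 15), _row("C", 10, 5, 15), _row("C", 10, 5, 99),
]
shares = mismatch_share_by_reservoir(rows)
```

Reservoirs with a share of 1.0 correspond to the "always mismatch" group, while small non-zero shares correspond to the "rarely" group.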
Next Steps
- define the table in pudl/src/metadata/resources/eia191.py
- add any new fields into pudl/src/pudl/metadata/fields.py