EIA191 rp8 data integration #5058

Open
irubey wants to merge 16 commits into catalyst-cooperative:main from irubey:isaac/eia191_rp8_data_integration

Conversation

@irubey
Contributor

@irubey irubey commented Mar 6, 2026

Overview

Initial progress and a check-in on #4757.

What problem does this address?

EIA Form 191 monthly underground gas storage data (RP8 dataset, 2014–present) is not yet available in PUDL. This adds the transform layer as a step toward full integration.

What did you change?

  • Add core_eia191__monthly_gas_storage transform asset in src/pudl/transform/eia191.py:
    • Drops redundant report_year column
    • Combines year + month into report_date via pudl.helpers.convert_to_date()
    • Renames base_gas → base_gas_mcf to make the Mcf unit explicit
    • Converts string columns to nullable StringDtype, integer columns to Int64Dtype
    • Primary key: (id, report_date) — verified unique across all 58,633 rows
    • Add exploration notebook justifying all transform decisions (PK choice, null handling, capacity column non-additivity, etc.)
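
A minimal sketch of the date and key handling listed above, assuming raw columns named `id`, `year`, `month`, and `base_gas`. The PR uses `pudl.helpers.convert_to_date()`; plain `pd.to_datetime` stands in for it here so the snippet runs without PUDL, and the sample rows are invented for illustration.

```python
import pandas as pd

# Invented sample rows; the real table has 58,633 rows.
raw = pd.DataFrame(
    {
        "id": ["00000001KS", "00000001KS", "00000002TX"],
        "year": [2014, 2014, 2014],
        "month": [1, 2, 1],
        "base_gas": [100.0, 100.0, 250.0],
    }
)

# Combine year + month into a first-of-month report_date.
report_date = pd.to_datetime(raw[["year", "month"]].assign(day=1))

monthly = (
    raw.assign(report_date=report_date)
    .drop(columns=["year", "month"])
    .rename(columns={"base_gas": "base_gas_mcf"})
)

# The proposed primary key should identify each row uniquely.
pk = ["id", "report_date"]
pk_is_unique = not monthly.duplicated(subset=pk).any()
```

Checking `duplicated(subset=pk)` on the full table is one way to reproduce the uniqueness claim made for the primary key.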

Clarifications

The notebook uncovered that the three capacity columns base_gas_mcf, working_gas_capacity_mcf, and total_field_capacity_mcf don't add up for ~23% of rows.
This is an upstream reporting artifact, not a transform error. I looked at the forms and found that:

  • working_gas_capacity_mcf and total_field_capacity_mcf are design/capacity specs from Part 3 of the form (static, public)
  • base_gas_mcf is the one public field from Part 4 (actual operational volumes, which are otherwise confidential)

The mismatch is mostly systematic: 63 reservoirs always mismatch, 269 rarely do, confirming it reflects genuine differences in how operators interpret "total capacity," not data entry errors.

One exception: LIBERTY NORTH KS (2023) appears to have swapped/misreported values.
All three columns are passed through as reported. Users should not assume total = working + base.
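
A rough sketch of how the mismatch could be measured per reservoir, assuming the three capacity columns named above and an `id` column. The rows are invented, and the exact equality test is an assumption; the notebook may use a tolerance.

```python
import pandas as pd

# Invented rows for two hypothetical reservoirs, "A" and "B".
df = pd.DataFrame(
    {
        "id": ["A", "A", "B", "B"],
        "base_gas_mcf": [40.0, 40.0, 10.0, 10.0],
        "working_gas_capacity_mcf": [60.0, 60.0, 20.0, 20.0],
        "total_field_capacity_mcf": [100.0, 100.0, 35.0, 35.0],
    }
)

# Flag rows where total != working + base. Rows with a missing value in any
# of the three columns would also be flagged by this comparison.
expected_total = df["working_gas_capacity_mcf"] + df["base_gas_mcf"]
df["mismatch"] = df["total_field_capacity_mcf"] != expected_total

# Share of mismatching rows overall, and per reservoir.
overall_rate = df["mismatch"].mean()
rate_by_reservoir = df.groupby("id")["mismatch"].mean()

# Reservoirs that mismatch in every reported month.
always_mismatch = rate_by_reservoir[rate_by_reservoir == 1.0].index.tolist()
```

Grouping by reservoir is what separates a systematic pattern (a rate of 1.0 for some ids) from scattered one-off errors.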

Next Steps

  • Define the table in pudl/src/pudl/metadata/resources/eia191.py
  • Add any new fields to pudl/src/pudl/metadata/fields.py

@irubey irubey changed the title from "Isaac/eia191 rp8 data integration" to "EIA191 rp8 data integration" Mar 6, 2026
@e-belfer e-belfer self-requested a review March 9, 2026 18:41
@irubey
Contributor Author

irubey commented Mar 9, 2026

Hi @e-belfer

I opened this draft PR implementing the transform layer for core_eia191__monthly_gas_storage.

Would love your feedback on the transform approach and column handling before I continue with the metadata/resource definitions.

One thing I surfaced during the investigation: about 23% of rows have total ≠ working + base. My assumption is that this reflects how the fields are reported in different sections of the EIA form rather than a transform issue, but I wanted to check whether PUDL typically enforces relationships like this or simply documents them.

@e-belfer
Member

e-belfer commented Mar 9, 2026

@irubey Hey! I'm planning to review this tomorrow and will take a look at this issue in particular, thanks for flagging it!

Member

@e-belfer e-belfer left a comment

Hi @irubey!

This is awesome! Thanks for the detailed notebook.

I agree with your choice of primary key, and it's good to see that all the records without working_gas_capacity_mcf are inactive fields. Generally, the data looks pretty clean, with the exceptions you noted.

I tried to push a small extraction change to your PR, but it seems I don't have permission, so I've described the changes directly below.

First and foremost, we have more up-to-date EIA 191 data available. Let's go ahead and update the DOI in src/pudl/package_data/settings/zenodo_dois.yaml to be 10.5281/zenodo.18823073 to get the latest data in.

Next, you'll need to implement a small workaround to correctly rename the columns at the extraction step, in src/pudl/extract/eia191.py:

def process_raw(
    self, df: pd.DataFrame, page: str, **partition: PartitionSelection
) -> pd.DataFrame:
    """Rename columns using `any_year` partition."""
    return df.rename(
        columns=self._metadata.get_column_map(page=page, year="any_year")
    )

Since year and report_year are 100% identical, we can safely replace the existing process_raw function, which is currently adding the report_year column, and remove the drop step in the transform.

You'll also want to update the column names in src/pudl/package_data/eia191/column_maps/data/csv as follows:

year_index,year,month,reservoir_state,operator_id_eia,gas_field_code,reservoir_code,company_name,field_name,reservoir_name,field_type,reservoir_county,reservoir_status,base_gas_mcf,working_gas_capacity_mcf,total_field_capacity_mcf,maximum_daily_delivery_mcf,storage_region
any_year,year,month,report_state,id,gas_field_code,reservoir_code,company_name,field_name,reservoir_name,field_type,county_name,status,base_gas,working_gas_capacity_mcf,total_field_capacity_mcf,maximum_daily_delivery_mcf,region

This names the columns correctly at the source, rather than renaming them in the transform.
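
To illustrate the effect of the column map above, here is a small sketch using a few of its entries. It assumes the map is applied as a raw-header to PUDL-name dictionary, which is how `df.rename(columns=...)` consumes it; the one-row frame is invented.

```python
import pandas as pd

# A few entries from the `any_year` row of the column map above,
# keyed as raw header -> PUDL column name.
column_map = {
    "report_state": "reservoir_state",
    "id": "operator_id_eia",
    "county_name": "reservoir_county",
    "status": "reservoir_status",
    "base_gas": "base_gas_mcf",
    "region": "storage_region",
}

# Invented single-row frame carrying only some of the raw headers.
raw = pd.DataFrame(
    {"report_state": ["KS"], "id": ["00000001KS"], "base_gas": [100.0]}
)

# Map keys that are absent from the frame are ignored by DataFrame.rename.
renamed = raw.rename(columns=column_map)
```

Because the rename happens at extraction, the transform only ever sees the PUDL-side names.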

A few other suggestions:

  • Data types will be enforced when the table is written to the database, and they'll get defined in fields.py, so there's no need to manually coerce data types in the transform.
  • I recommend running the function simplify_strings() from pudl.helpers on all string fields other than ID and state - it'll handle case, extra whitespace, and non-standard characters.
  • status, region and field_type will all want to get further constrained in fields.py, using the enums field - see byproduct_description for an example of how this looks.
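
For readers unfamiliar with this kind of string cleanup, the sketch below approximates it with the standard library. `simplify_text` is a hypothetical stand-in written for this example, not PUDL's `simplify_strings()`, whose exact behavior may differ.

```python
import re
import unicodedata


def simplify_text(value: str) -> str:
    """Hypothetical stand-in: lowercase, ASCII-fold, collapse whitespace."""
    # Decompose accented characters, then drop anything outside ASCII.
    folded = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", folded).strip().lower()


# Mixed case, repeated spaces, and a non-breaking space in one value.
cleaned = simplify_text("  LIBERTY   North\u00a0 ")
```

Leaving ID and state columns out of this step keeps their original casing intact, which matters when they are used as keys.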

Regarding the total vs working vs base gas distinction - we don't generally correct without a high degree of certainty, so I would do the same as you and pass through the data. Let's make sure we keep checking this relationship by writing this as a DBT validation at the end of the transform step and document the extent of failure using:

config:
   error_if: ">X"

and leaving a note in the description. This will flag this for us to investigate further in the future, and easily flag for us if a new month of data has way more errors than are anticipated. You can also note this in the additional_details_text for the table-level metadata, and I'm happy to help wordsmith when you get there if you want a second brain.

Let me know if you have questions on any of this, otherwise ping me when you're ready for me to take a look again!

@zaneselvans zaneselvans added the new-data, eia191, and community labels Mar 10, 2026
@cmgosnell cmgosnell moved this from New to In progress in Catalyst Megaproject Mar 11, 2026
@irubey irubey force-pushed the isaac/eia191_rp8_data_integration branch from 46075e5 to b3229d1 Compare March 14, 2026 02:57
@irubey
Contributor Author

irubey commented Mar 14, 2026

Hi @e-belfer,

Thanks for the detailed review and suggestions. I've pushed several commits addressing the first-checkpoint feedback and would appreciate another look.

Implemented changes

All items from the first-checkpoint feedback are now implemented:

  • Updated DOI to 10.5281/zenodo.18823073 and re-downloaded raw data
  • Moved column renaming to extraction (process_raw() in src/pudl/extract/eia191.py)
  • Updated the column map CSV to match the RP8 source headers
  • Removed manual dtype coercion (types handled in fields.py)
  • Applied simplify_strings() to string columns except state and storage_field_id_eia191
  • Added fields to fields.py with descriptions, units, and enums
  • Added resource schema core_eia191__monthly_gas_storage
  • Added dbt validation tests

A few implementation details differed slightly from the suggestions, so I wanted to confirm they look reasonable.


Source column mapping

Three capacity columns include parentheses in the source headers, and base_gas lacks a unit suffix, so the column map now renames:

  • base_gas → base_gas_mcf
  • working_gas_capacity_(mcf) → working_gas_capacity_mcf
  • total_field_capacity_(mcf) → total_field_capacity_mcf
  • maximum_daily_delivery_(mcf) → maximum_daily_delivery_mcf

To align with PUDL naming conventions, the following columns are also renamed:

  • report_state → state

Identifier naming (id)

Your suggested name operator_id_eia for the raw id column prompted some deeper investigation.

Across the dataset, the id value appears to identify individual storage field operations, not operators:

  • the value consists of an 8-digit prefix plus the state abbreviation
  • the same physical reservoir can appear with multiple id values over time when ownership changes
  • operators frequently have multiple id values within a state

This means the column is not equivalent to operator_id_eia as used in EIA-176.

To reflect what the identifier represents, the current proposal is:

  • id → storage_field_id_eia191

Happy to defer to your convention if there's a preferred naming pattern for this type of identifier.
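
The checks behind this conclusion can be sketched roughly as follows. The rows and company names are invented, and the pattern assumes the 8-digit prefix and the two-letter state abbreviation are simply concatenated.

```python
import re
from collections import defaultdict

# Invented (company_name, state, id) rows for illustration.
rows = [
    ("acme storage", "KS", "00000001KS"),
    ("acme storage", "KS", "00000002KS"),
    ("acme storage", "TX", "00000003TX"),
    ("plains gas", "TX", "00000004TX"),
]

# Eight digits followed by a two-letter state abbreviation.
ID_PATTERN = re.compile(r"^\d{8}[A-Z]{2}$")
all_match = all(ID_PATTERN.fullmatch(field_id) for _, _, field_id in rows)

# Collect distinct ids per (company, state); more than one suggests the id
# tracks something finer-grained than the operator.
ids_by_operator_state = defaultdict(set)
for company, state, field_id in rows:
    ids_by_operator_state[(company, state)].add(field_id)

multi_id_operators = sorted(
    key for key, ids in ids_by_operator_state.items() if len(ids) > 1
)
```

Any (company, state) pair that shows up in `multi_id_operators` is evidence that the identifier is not a one-to-one operator ID.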


Capacity validation test

Instead of a custom error_if expression, I reused the existing expect_sum_close_to_total_column test (already used in several EIA-923 tables).

I set max_discrepancy_rate: 0.25, slightly above the observed ~23% mismatch rate. This seemed to capture the intent of the suggested validation while reusing existing macros.

Please let me know if you'd prefer a raw SQL expression test instead.
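
As a plain-Python restatement of what such a test checks (not the dbt macro itself, whose internals are not shown in this thread), the sketch below computes a discrepancy rate and compares it to the 0.25 threshold. The `discrepancy_rate` helper and the sample rows are invented, and treating a rate equal to the threshold as passing is an assumption.

```python
import pandas as pd


def discrepancy_rate(df: pd.DataFrame, total: str, parts: list[str]) -> float:
    """Share of rows where the parts do not sum to the total column."""
    # Note: sum(axis=1) skips missing values by default.
    summed = df[parts].sum(axis=1)
    return float((summed != df[total]).mean())


# Invented rows: only the second one fails to add up (10 + 20 != 35).
df = pd.DataFrame(
    {
        "base_gas_mcf": [40.0, 10.0, 5.0, 5.0],
        "working_gas_capacity_mcf": [60.0, 20.0, 5.0, 5.0],
        "total_field_capacity_mcf": [100.0, 35.0, 10.0, 10.0],
    }
)

MAX_DISCREPANCY_RATE = 0.25  # threshold chosen in the PR
rate = discrepancy_rate(
    df,
    total="total_field_capacity_mcf",
    parts=["base_gas_mcf", "working_gas_capacity_mcf"],
)
passes = rate <= MAX_DISCREPANCY_RATE
```

Setting the threshold just above the observed rate means the check stays quiet today but fails if a new month of data pushes the mismatch share noticeably higher.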


Table documentation

Resource metadata has been added in src/pudl/metadata/resources/eia191.py using the standard fields (additional_summary_text, additional_source_text, additional_details_text). Please let me know if additional documentation steps are expected.


company_name field

company_name already existed in fields.py with a SEC-specific description. Since EIA-191 also reports company names, I reused the field and generalized the description to "Name of the reporting company."

Happy to adjust if there's a preferred convention for shared fields across datasets.

Member

@e-belfer e-belfer left a comment

This is in really good shape!

A few small notes about metadata for fields, mostly focused on trimming, and one sneaky non-standard NA value.

> Instead of a custom error_if expression, I reused the existing expect_sum_close_to_total_column test (already used in several EIA-923 tables).
>
> I set max_discrepancy_rate: 0.25, slightly above the observed ~23% mismatch rate. This seemed to capture the intent of the suggested validation while reusing existing macros.

Great, I have no concerns about this.

Not at all a blocker, but you could also consider using the add_fips_ids() helper function as we have both a state and a county column (with geocodes originating from table _core_censuspep__yearly_geocodes). There will be 4 pairs of county/state names that aren't parsed correctly and could be fixed by hand:

county_name        state
la salle           IL     138
glacier national   MT     12
skagway-yakutat-   AK     12
w. carroll         LA     12

Once these metadata changes are implemented, you'll be ready to write this table to DB!

Comment on lines +573 to +578
"description": (
"Volume of base gas (cushion gas) in the underground storage reservoir, "
"as reported by the operator to EIA on Form 191. Base gas is the volume "
"of gas intended as permanent inventory in a reservoir to maintain "
"adequate pressure and deliverability rates."
),

Suggested change
-"description": (
-"Volume of base gas (cushion gas) in the underground storage reservoir, "
-"as reported by the operator to EIA on Form 191. Base gas is the volume "
-"of gas intended as permanent inventory in a reservoir to maintain "
-"adequate pressure and deliverability rates."
-),
+"description": (
+"Volume of base gas (cushion gas) in the underground storage reservoir. "
+"Base gas is the volume of gas intended as permanent inventory in a reservoir "
+"to maintain adequate pressure and deliverability rates."
+),

Lovely clear description!

In general, we don't need to note that this was reported by the operator on Form 191 in column-level descriptions for any columns in this table. This will be clear from the context of the primary key and table-level descriptions.

"company_name": {
"type": "string",
-"description": "Name of company submitting SEC 10k filing.",
+"description": "Name of the reporting company.",

This should be clear enough to cover both use cases, good update.

"pattern": r"^\d{5}$",
},
},
"county_name": {

There's already a county field we should use instead of creating a new one. If you want to make the description more specific, you can override the description in FIELD_METADATA_BY_RESOURCE. See line 9914 for an example of this.

),
"unit": "MWh",
},
"gas_field_code": {

Suggested change
-"gas_field_code": {
+"gas_field_id_eia": {

When the ID is assigned by a particular agency, we note this as follows. For us, code is typically something that is part of a coding table (e.g., 1 = "active", 2 = "retired"), so in this case I think id is clearer.

"EIA storage region in which the underground natural gas storage field "
"is located, as reported on Form 191."
),
"constraints": {

This would be better as a dictionary called EIA191_STORAGE_REGIONS in enums.py.

"One row per storage reservoir per month."
),
"additional_source_text": (
"EIA Form 191 (Schedule RP8 — Monthly Underground Gas Storage Report). "

We only use this if there's a particular part of the form that a table corresponds to (e.g., Part A). In this case, no need to provide this key at all.

"See https://www.eia.gov/survey/form/eia_191/form.pdf."
),
"additional_details_text": (
"The ``total_field_capacity_mcf`` field is reported as the total design "

As noted elsewhere, I think this makes more sense as a custom usage warning. See here for instructions on how to do so.

"unit": "MWh",
},
"gas_field_code": {
"type": "string",

This should be integer, as it looks like it is always numeric and not zero-padded in the raw data.

"storage_field_id_eia191": {
"type": "string",
"description": (
"EIA-assigned identifier for an underground natural gas storage reservoir, "

Would be useful to note here that each ID is per company and state, so one company may have multiple.

},
},
"reservoir_code": {
"type": "string",

integer would be the observed data type here.

@irubey
Contributor Author

irubey commented Mar 24, 2026

Hi @e-belfer,

Thanks for the detailed feedback. I’ve pushed updates addressing the second-round review comments.

Field renaming

  • storage_field_id_eia191 → storage_field_id_eia
  • reservoir_code → reservoir_id_eia (integer; 999 sentinel converted to NA in transform)
  • gas_field_code → gas_field_id_eia (integer)
  • status → operational_status (reusing the existing field with a per-resource enum override)
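
The sentinel handling mentioned above could look roughly like this in pandas. This is a sketch under the assumption that 999 is the only sentinel value; the sample values are invented and the actual transform code may differ.

```python
import pandas as pd

# Invented values; 999 is the sentinel the PR converts to NA.
reservoir_code = pd.Series([1, 2, 999, 3], name="reservoir_id_eia")

# where() keeps values that satisfy the condition and sets the rest to NaN;
# casting to the nullable Int64 dtype then turns NaN into <NA>.
reservoir_id = reservoir_code.where(reservoir_code != 999).astype("Int64")
```

Using the nullable `Int64` dtype keeps the column integer-typed even though it now contains missing values.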

Metadata cleanup

  • Trimmed additional_summary_text to one sentence
  • Removed the additional_source_text key
  • Removed “as reported on Form 191” from column descriptions
  • Moved the capacity additivity note to a custom table-level usage warning
  • Moved the region enum to EIA191_STORAGE_REGIONS in enums.py
  • Reused the existing county field instead of introducing a new one

The table now materializes successfully in Dagster, and all pre-commit hooks and unit tests are passing.

I left the optional FIPS enrichment out of this PR. The remaining county-name cleanup requires manual fixes for four records, including skagway-yakutat-/AK, which appears truncated and would need confirmation against current Alaska county-equivalent geography before assigning a FIPS code. Since this is non-blocking and the core table is ready, I kept that work scoped to a follow-up.

irubey and others added 8 commits March 24, 2026 17:08
Add io_manager_key to the core_eia191__monthly_gas_storage asset and
add an Alembic migration to create the table in the PUDL SQLite database.
Enum constraints on field_type, status, and region match field metadata.
- Rename gas_field_code -> gas_field_id_eia (integer type)
- Change reservoir_code type to integer
- Replace county_name with existing county field; remove county_name from fields.py
- Move storage region enum to EIA191_STORAGE_REGIONS in enums.py
- Clarify storage_field_id_eia191 description (per-company-per-state)
- Trim form citation from base_gas_mcf description
- Remove additional_source_text from resource metadata
- Move capacity additivity note to usage_warnings custom format
Context is clear from table-level metadata; field descriptions don't need
to repeat the source form. Also removes the non-additivity note from
total_field_capacity_mcf since it is covered by the table-level usage warning.
irubey added 7 commits March 24, 2026 17:08
- Remove redundant additional_details_text from resource (storage field
  ID semantics already captured in field description)
- Regenerate Alembic migration to pick up clean field descriptions
  (strip stale 'as reported on Form 191' column comments) and add
  explicit primary key constraint
Replace the EIA-191-specific `status` field with the existing
`operational_status` field, overriding description and enum constraints
via FIELD_METADATA_BY_RESOURCE for the eia191 table.
Follows agency-provenance naming convention (id not code, _eia suffix).
Adds 999 → NA conversion in transform for Lyon 29 sentinel values.
@irubey irubey force-pushed the isaac/eia191_rp8_data_integration branch from 7688b65 to cc28a45 Compare March 24, 2026 23:10
@irubey irubey marked this pull request as ready for review March 24, 2026 23:30

Labels

community Issues that contributors have volunteered to take on or fostering more community eia191 Issues related to EIA Form 191: Monthly underground natural gas storage report new-data Requests for integration of new data.

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

Create core_eia191__monthly_gas_storage

4 participants