WIP: Add FERC Form 1 identification table to PUDL #5008

Draft
e-belfer wants to merge 7 commits into main from add-eia-ferc-utility-address-match

Conversation

@e-belfer
Member

@e-belfer e-belfer commented Feb 11, 2026

Overview

Addresses first step in #5150

What problem does this address?

In #4975, we updated the utility matching method to auto-match utilities when they share identical (cleaned) utility names. We want to extend this method to also consider utility addresses when matching.

The first step to doing this will be to bring in the FERC Form 1 identification table, which includes data on utility addresses necessary for matching.
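
As a rough illustration of the idea, and not PUDL's actual matching code, exact matching on cleaned names could be extended with a second key taken from the address, such as a ZIP code. The function names, field names, and sample records below are all hypothetical.

```python
import re


def clean_name(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (hypothetical cleaner)."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    return re.sub(r"\s+", " ", name).strip()


def auto_match(ferc_rows, eia_rows, keys=("name",)):
    """Pair FERC and EIA records whose cleaned key fields are all identical."""

    def key_of(row):
        return tuple(clean_name(row[k]) for k in keys)

    # Index EIA records by their cleaned key.
    eia_index = {}
    for row in eia_rows:
        eia_index.setdefault(key_of(row), []).append(row["id"])

    matches = []
    for row in ferc_rows:
        candidates = eia_index.get(key_of(row), [])
        # Only auto-match when the key is unambiguous on the EIA side.
        if len(candidates) == 1:
            matches.append((row["id"], candidates[0]))
    return matches


# Made-up records for illustration only.
ferc = [{"id": "F1", "name": "Acme Power Co.", "zip": "12345"}]
eia = [
    {"id": "E1", "name": "ACME POWER CO", "zip": "12345"},
    {"id": "E2", "name": "Acme Power Co", "zip": "99999"},
]

name_only = auto_match(ferc, eia, keys=("name",))
name_and_zip = auto_match(ferc, eia, keys=("name", "zip"))
```

In this toy case the cleaned name alone is ambiguous (two EIA candidates), while adding the ZIP code leaves a single candidate.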

What did you change?

  • Add FERC Form 1 identification table as core_ferc1__yearly_identification_and_certification
  • Determine method for pre-processing of addresses on FERC side

Todos

  • Once transform approved, write to DB and add metadata
  • Add DBT tests

Documentation

Make sure to update relevant aspects of the documentation:

  • Update the release notes: reference the PR and related issues.
  • Update relevant Data Source jinja templates (see docs/data_sources/templates).
  • Update relevant table or source description metadata (see src/metadata).
  • Review and update any other aspects of the documentation that might be affected by this PR.

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

  • If updating analyses or data processing functions: make sure to update row count expectations in dbt tests.
  • Run pixi run pre-commit-run to run linters and static code analysis checks.
  • Run pixi run pytest-ci locally to ensure that the merge queue will accept your PR.
  • Review the PR yourself and call out any questions or issues you have.
  • For PRs that change the PUDL outputs significantly, run the full ETL locally and then run the data validations using dbt. If you can't run the ETL locally then run the build-deploy-pudl GitHub Action manually and ensure that it succeeds.

@e-belfer e-belfer self-assigned this Feb 11, 2026
@e-belfer e-belfer added the glue, eia923, ferc1, and eia860 labels Feb 11, 2026
@e-belfer e-belfer added the record-linkage label Feb 11, 2026
@e-belfer e-belfer force-pushed the add-eia-ferc-utility-address-match branch from 5b0c503 to 5ed40cc Compare February 11, 2026 16:58
@jdangerx jdangerx moved this from New to In progress in Catalyst Megaproject Feb 11, 2026
@e-belfer e-belfer requested a review from katie-lamb March 26, 2026 16:32
@e-belfer e-belfer changed the title WIP: Add FERC Form 1 identification table and add addresses to FERC-EIA utility match WIP: Add FERC Form 1 identification table to PUDL Mar 26, 2026
@e-belfer e-belfer linked an issue Mar 31, 2026 that may be closed by this pull request
6 tasks
@e-belfer e-belfer requested a review from cmgosnell April 1, 2026 15:48
Comment on lines +3157 to +3162
df[
    ["office_street_address", "office_city", "office_state", "office_zip_code"]
] = pd.DataFrame(
    df["office_street_address"].apply(parse_address).tolist(),
    index=df.index,
)
Member

@katie-lamb katie-lamb Apr 2, 2026

a few suggested cleaning steps:

  • it might be nice to make them lowercase or title case, but not a big deal. will be done in splink/during match anyways
  • expand address suffixes to their full name and strip punctuation, or vice versa, e.g. st. -> street. this could also happen in a separate cleaning step within the match. i believe i do this for the SEC addresses and can dig it up
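
A minimal sketch of what these cleaning steps might look like, assuming a small hand-written abbreviation map. `SUFFIXES` and `clean_street` are hypothetical names and are not taken from this PR or from the SEC address cleaning mentioned above.

```python
import re

# Hypothetical, deliberately tiny abbreviation map; a real one would be longer.
SUFFIXES = {
    "st": "street",
    "ave": "avenue",
    "blvd": "boulevard",
    "rd": "road",
    "dr": "drive",
}


def clean_street(address: str) -> str:
    """Lowercase, strip punctuation, and expand a trailing street suffix."""
    text = re.sub(r"[^\w\s]", " ", address.lower())
    tokens = text.split()
    # Expand only the last token so an earlier "St." (as in "St. Paul") is left alone.
    if tokens and tokens[-1] in SUFFIXES:
        tokens[-1] = SUFFIXES[tokens[-1]]
    return " ".join(tokens)
```

Because only the final token is expanded, an address with a unit after the suffix (for example "123 Main St Suite 4") would keep its abbreviation; handling that case would need a proper address parser.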

Member

taking these cleaning steps also depends on how useful address actually is for the match. maybe zip or state is really more useful

Comment on lines +2612 to +2614
parsed.get("PlaceName", None),
parsed.get("StateName", None),
parsed.get("ZipCode", None),
Member

it seems like we can maybe do the match without state and zip, but it's probably nice to have those columns in this table anyways.
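
For illustration only, here is one way tagged address parts could be turned into the four columns, leaving `None` wherever a component is missing. The `PlaceName`, `StateName`, and `ZipCode` labels come from the diff above; the street labels, the `address_columns` helper, and the sample dictionaries are assumptions, and the PR's actual `parse_address` may differ.

```python
def address_columns(parsed: dict) -> tuple:
    """Return (street, city, state, zip) from tagged parts, with None where absent."""
    # Street labels here are assumed; only the city/state/zip labels appear in the diff.
    street_parts = [
        parsed[label]
        for label in ("AddressNumber", "StreetName", "StreetNamePostType")
        if label in parsed
    ]
    return (
        " ".join(street_parts) or None,
        parsed.get("PlaceName"),
        parsed.get("StateName"),
        parsed.get("ZipCode"),
    )


# Hand-written stand-ins for parser output, not real FERC data.
full = {
    "AddressNumber": "123",
    "StreetName": "Main",
    "StreetNamePostType": "St",
    "PlaceName": "Springfield",
    "StateName": "IL",
    "ZipCode": "62701",
}
partial = {"AddressNumber": "123", "StreetName": "Main", "StreetNamePostType": "St"}

full_row = address_columns(full)
partial_row = address_columns(partial)
```

Keeping state and ZIP as separate, nullable columns means the match can ignore them while the table still exposes them to other users.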

Labels

  • eia860: Anything having to do with EIA Form 860
  • eia923: Anything having to do with EIA Form 923
  • ferc1: Anything having to do with FERC Form 1
  • glue: PUDL specific structures & metadata. Stuff that connects datasets together.
  • record-linkage: Issues related to connecting related records / entities that don't have explicit IDs or keys.

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

Match FERC and EIA utilities using splink

3 participants