Show most recent data available as a component in table descriptions #4632

Open
krivard wants to merge 29 commits into main from meta2_availability

krivard commented Sep 24, 2025

Overview

Addresses #4586.

What did you change?

  • Add a new description component called "availability"
    • configured using key "availability_text" or "availability_offset" (mutually exclusive; only one is allowed)
    • defaults to the most recent partition of the table's row counts. If the row counts are not partitioned, it uses the most recent partition of the table's source. If more than one source is listed, it finds the most recent partition of each source, then takes the least recent of those. If availability_offset is specified, the partition obtained from the source is shifted by that many partitions (forward or back).
    • if no temporal partitions are available, hides the availability component
  • Add warnings for discontinued tables
  • Add a new method to DataSource that lists all temporal partitions
  • Add some new functions in a lookup table in pudl.metadata.descriptions for doing partition arithmetic
  • Refactor Resource.dict_from_resource_descriptor to pull some useful stuff into a separate function that can be called by both Resource.dict_from_resource_descriptor and the unit tests
  • Refactor to move DBT_DIR to somewhere that doesn't import pudl.metadata.classes; pudl.workspace.setup seemed fine
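
To make the default selection rule for the availability component concrete, here is a minimal sketch. It is not the code in this PR: the function name, its parameters, and the use of plain integer years as partitions are all hypothetical. It also assumes that availability_text takes precedence over everything else, that the offset only applies on the source path, and that sources without temporal partitions are skipped.

```python
def resolve_availability(
    source_partitions,
    row_count_partitions=None,
    availability_text=None,
    availability_offset=None,
):
    """Return an availability label, or None to hide the component.

    Hypothetical sketch: partitions are plain integer years, and every
    name here is invented for illustration.
    """
    if availability_text is not None and availability_offset is not None:
        raise ValueError("Set availability_text or availability_offset, not both.")
    if availability_text is not None:
        # Assumption: explicit text overrides any computed value.
        return availability_text
    # Preferred path: the table's own row counts are partitioned.
    if row_count_partitions:
        return str(max(row_count_partitions))
    # Fallback: most recent partition of each source, then the least
    # recent among those. Sources with no partitions are skipped.
    latest_per_source = [max(parts) for parts in source_partitions.values() if parts]
    if not latest_per_source:
        return None  # no temporal partitions at all: hide the component
    # Assumption: the offset applies only to the source-derived partition.
    return str(min(latest_per_source) + (availability_offset or 0))


sources = {"eia860": [2022, 2023, 2024], "eia923": [2023, 2024, 2025]}
print(resolve_availability(sources))
print(resolve_availability(sources, availability_offset=1))
```

With the two sources above, the most recent partitions are 2024 and 2025, so the sketch reports the least recent of the two; the offset then shifts that year forward by one.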

Already merged to metadata-phase-2 using a rebase:

  • Refactor description machinery according to plan
    • Extract a data class from ResourceDescriptionBuilder
    • Use decorators to define components

Documentation

Make sure to update relevant aspects of the documentation:

  • Update the release notes: reference the PR and related issues.
  • Update relevant Data Source jinja templates (see docs/data_sources/templates).
  • Update relevant table or source description metadata (see src/metadata).
  • Review and update any other aspects of the documentation that might be affected by this PR.

Testing

How did you make sure this worked? How can a reviewer verify this?

  • Used the metadata wizard to examine normal tables and some weird ones
    • normals: out_eia__monthly_generators, core_epacems__hourly_emissions, out_sec10k__parents_and_subsidiaries
    • weirdos: out_censusdp1tract__states, out_gridpathratoolkit__hourly_available_capacity_factor, core_eia861__yearly_green_pricing

checking tables with year-based availability

Ran some code to check for gross mismatches. It is pretty quick:

code behind the cut
import duckdb

from pudl.metadata.classes import PudlResourceDescriptor, Resource
from pudl.metadata.descriptions import ResourceDescriptionBuilder
from pudl.metadata.resources import RESOURCE_METADATA
from pudl.workspace.setup import PudlPaths

pp = PudlPaths()

def prefer(candidates, fn):
    """Return the candidates that satisfy fn, or all of them if none do."""
    preferred_candidates = [c for c in candidates if fn(c)]
    if len(preferred_candidates) > 0:
        return preferred_candidates
    return candidates

def find_date_column(fields):
    """Pick a single date-like column name, or None if the choice is ambiguous."""
    candidates = [c for c in fields if c.endswith(("date", "year", "datetime_utc"))]
    # Narrow down ties: prefer report_* columns, then *year columns.
    if len(candidates) > 1:
        candidates = prefer(candidates, lambda c: c.startswith("report"))
    if len(candidates) > 1:
        candidates = prefer(candidates, lambda c: c.endswith("year"))
    if len(candidates) == 1:
        return candidates[0]
    return None

for resource_id, resource_dict in sorted(RESOURCE_METADATA.items()):
    desc = ResourceDescriptionBuilder(
        resource_id=resource_id,
        settings=Resource._resolve_references_from_resource_descriptor(
            resource_id, PudlResourceDescriptor.model_validate(resource_dict)
        ),
    ).build()
    date_column = find_date_column(resource_dict["schema"]["fields"])
    if not date_column:
        print(f"no_date {resource_id}")
        continue
    try:
        resource_path = str(pp.parquet_path(resource_id))
    except FileNotFoundError:
        print(f"no_file {resource_id}")
        continue
    try:
        query = f"""SELECT MAX({date_column}) as max_date FROM '{resource_path}'"""
        df = duckdb.sql(query).df()
        max_year = df.max_date[0]
        if not date_column.endswith("year"):
            max_year = df.max_date.dt.year[0]
        if desc.availability.type != "True":
            print(f"no_avail {resource_id} {date_column} {max_year}")
            continue
        predicted_max_year = int(desc.availability.description[:4])
        if max_year != predicted_max_year:
            print(f"mismatch {resource_id} {date_column} {max_year} {desc.source.type} {predicted_max_year}")
    except Exception:
        # Report which table failed, then re-raise so the error is not hidden.
        print(f"wtf {resource_id} {date_column}")
        raise

The first pass found the following, which have all been addressed using row counts:

[RESOLVED] 8 lag a year behind (the max year in the data is one less than the predicted availability):
  • _core_eia923__monthly_cooling_system_information (2024 from report_date vs 2025 from eia923)
  • _core_eia923__yearly_byproduct_disposition (2024 from report_year vs 2025 from eia923)
  • _core_eia923__yearly_byproduct_expenses_and_revenues (2024 from report_year vs 2025 from eia923)
  • _core_eia923__yearly_fgd_operation_maintenance (2024 from report_date vs 2025 from eia923)
  • out_eia923__yearly_boiler_fuel (2024 from report_date vs 2025 from eia923)
  • out_eia923__yearly_fuel_receipts_costs (2024 from report_date vs 2025 from eia923)
  • out_eia923__yearly_generation (2024 from report_date vs 2025 from eia923)
  • out_eia923__yearly_generation_fuel_combined (2024 from report_date vs 2025 from eia923)
[RESOLVED] 15 are a year ahead (the max year in the data is one more than the predicted availability):
  • core_eia860__scd_boilers (2025 from report_date vs 2024 from eia860)
  • core_eia860__scd_generators (2025 from report_date vs 2024 from eia860)
  • core_eia860__scd_plants (2025 from report_date vs 2024 from eia860)
  • core_eia860__scd_utilities (2025 from report_date vs 2024 from eia860)
  • core_eia860m__changelog_generators (2025 from report_date vs 2024 from eia860m)
  • out_eia__monthly_generators (2025 from report_date vs 2024 from eia)
  • out_eia__yearly_assn_plant_parts_plant_gen (2025 from report_date vs 2024 from eia)
  • out_eia__yearly_boilers (2025 from report_date vs 2024 from eia)
  • out_eia__yearly_generators (2025 from report_date vs 2024 from eia)
  • out_eia__yearly_generators_by_ownership (2025 from report_date vs 2024 from eia)
  • out_eia__yearly_plant_parts (2025 from report_year vs 2024 from eia)
  • out_eia__yearly_plants (2025 from report_date vs 2024 from eia)
  • out_eia__yearly_utilities (2025 from report_date vs 2024 from eia)
  • out_ferc714__georeferenced_respondents (2024 from report_date vs 2023 from ferc714)
  • out_ferc714__respondents_with_fips (2024 from report_date vs 2023 from ferc714)

tables with subannual availability

[RESOLVED] All the tables below have temporal partitions in row counts, and no longer show up with subannual availability.

  • CEMS is the only source with quarters. The only CEMS table we publish also draws from eia860, which only tracks annual availability. Because we base availability on the least recent among the available sources, availability for core_epacems__hourly_emissions will only ever report the annual figure from eia860. This means it was checked adequately in the previous section ✅.
  • 7 eia930 tables and 1 eia code table use half-years. Here are the predicted availability and max date for each table:
    • ✅ core_eia930__hourly_interchange 2025half2 2025-10-31 07:00:00
    • ✅ core_eia930__hourly_net_generation_by_energy_source 2025half2 2025-11-02 07:00:00
    • ✅ core_eia930__hourly_operations 2025half2 2025-11-02 07:00:00
    • ✅ core_eia930__hourly_subregion_demand 2025half2 2025-11-01 07:00:00
    • ✅ no date columns in core_eia__codes_balancing_authority_subregions (though we still display availability for it)
    • ✅ out_eia930__hourly_aggregated_demand 2025half2 2025-11-02 07:00:00
    • ✅ out_eia930__hourly_operations 2025half2 2025-11-02 07:00:00
    • ✅ out_eia930__hourly_subregion_demand 2025half2 2025-11-01 07:00:00
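
The half-year check above can be reproduced mechanically. The sketch below is an illustration, not code from this PR: the function names are hypothetical, and it assumes that a label such as 2025half2 means July through December of that year (with half1 meaning January through June).

```python
import re
from datetime import datetime


def half_year_bounds(label):
    """Return (start, end) for a label like '2025half2'; end is exclusive.

    Assumption: half1 covers January-June and half2 covers July-December.
    """
    match = re.fullmatch(r"(\d{4})half([12])", label)
    if match is None:
        raise ValueError(f"Not a half-year label: {label!r}")
    year = int(match.group(1))
    if match.group(2) == "1":
        return datetime(year, 1, 1), datetime(year, 7, 1)
    return datetime(year, 7, 1), datetime(year + 1, 1, 1)


def falls_in_partition(label, timestamp):
    """Check whether a timestamp lies inside the labelled half-year."""
    start, end = half_year_bounds(label)
    return start <= timestamp < end


# Max timestamp listed above for core_eia930__hourly_interchange.
print(falls_in_partition("2025half2", datetime(2025, 10, 31, 7)))
```

Under that assumption, every max date listed above falls inside 2025half2, which is what the ✅ marks record.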

To-do list

  • Check CEMS and other tables with subannual releases, since the above code probably doesn't catch them
  • Review the PR yourself and call out any questions or issues you have.

Labels

  • docs: Documentation for users and contributors.
  • metadata: Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages.

Projects

Status: Blocked
