diff --git a/docs/api/countries.rst b/docs/api/countries.rst
index 1c2b6fc8..c2ef6739 100644
--- a/docs/api/countries.rst
+++ b/docs/api/countries.rst
@@ -6,7 +6,7 @@ A common list of countries
 ==========================
 
 Having an agreed list of country names including a mapping to alpha-3 and alpha-2 codes
-(also know as ISO3 and ISO2 codes) is an important prerequisite for scenario analysis
+(also known as ISO3 and ISO2 codes) is an important prerequisite for scenario analysis
 and model comparison.
 
 The :class:`nomenclature` package builds on the :class:`pycountry` package
diff --git a/docs/api/nuts.rst b/docs/api/nuts.rst
index 3ff47d8e..0e0019be 100644
--- a/docs/api/nuts.rst
+++ b/docs/api/nuts.rst
@@ -23,8 +23,92 @@ The full list of NUTS regions is accessible via the Eurostat website (`xlsx, 500
 
     from nomenclature import nuts
 
-    # list of NUTS region codes
-    nuts.codes
+    # Access NUTS region information
+    nuts.codes  # List of all NUTS codes
+    nuts.names  # List of all NUTS region names
 
-    # list of NUTS region names
-    nuts.names
+    # Query specific NUTS levels
+    nuts.get(level=3)  # Get all NUTS3 regions
+
+    # Query by country
+    nuts.get(country_code="AT")  # Get all NUTS regions in Austria
+
+.. currentmodule:: nomenclature.processor.nuts
+
+
+**NutsProcessor**
+-----------------
+
+The :class:`NutsProcessor` class provides automated aggregation of scenario data
+across NUTS regions. It performs hierarchical aggregation in the following order:
+
+1. NUTS3 → NUTS2
+2. NUTS2 → NUTS1
+3. NUTS1 → Country
+4. Country → European Union (if ≥ 23 of the 27 EU member states are present)
+5. Country + UK → European Union and United Kingdom (if the United Kingdom is also present)
+
+The EU-level aggregations (steps 4-5) are only performed if the corresponding
+target regions (``European Union`` and ``European Union and United Kingdom``) are
+defined in the project's region codelist. If fewer than 23 EU member states are
+present in the data, the EU aggregation is skipped silently.
+
+The processor ensures that regional data is consistently aggregated and validated
+according to the configured NUTS regions and variable code lists.
+
+Consider the example below for configuring a project using NUTS aggregation.
+The *nomenclature.yaml* in the project directory is as follows:
+
+.. code:: yaml
+
+    dimensions:
+      - region
+      - variable
+    definitions:
+      region:
+        nuts:
+          nuts-1: [ AT ]
+          nuts-2: [ AT ]
+          nuts-3: [ AT ]
+        country: true
+    processors:
+      nuts: [ Model A ]
+
+With this configuration, calling :func:`process` will automatically instantiate
+and apply the :class:`NutsProcessor`.
+
+.. code:: python
+
+    import pyam
+    from nomenclature import DataStructureDefinition, process
+
+    df = pyam.IamDataFrame(data="path/to/file.csv")
+    dsd = DataStructureDefinition("definitions")
+    aggregated_data = process(df, dsd)
+
+The data is aggregated for the applicable variables, creating the common region
+``Austria`` (AT) from its constituent NUTS subregions.
+The country-level regions must be defined in a region definition file or by setting
+*definitions.region.country* as *true* in the configuration file
+(see :ref:`adding-countries`).
+
+.. note::
+
+    Only NUTS regions explicitly listed under ``definitions.region.nuts`` are present
+    in the output. The :class:`NutsProcessor` always aggregates through all levels,
+    but intermediate levels are **dropped** from the result unless they are listed
+    in the configuration. In the example above, all three levels (NUTS1, NUTS2, NUTS3)
+    are listed, so the final output includes the original NUTS3 data as well as
+    the aggregated NUTS2 and NUTS1 regions alongside the country-level result.
+    If only ``nuts-3`` were listed, the aggregated NUTS2 and NUTS1 regions would
+    be discarded and only the NUTS3 regions and the country total would be retained.
+
+.. note::
+
+    Only models listed under ``processors.nuts`` in *nomenclature.yaml* are processed
+    by :class:`NutsProcessor`. Data for other models is passed through unchanged.
+    If a NUTS region appears in the data for a listed model but the corresponding
+    country is missing from ``definitions.region.nuts``, a ``ValueError`` is raised.
+
+.. autoclass:: NutsProcessor
+   :members: from_definition, apply
diff --git a/docs/user_guide/config.rst b/docs/user_guide/config.rst
index 65ef0634..645b63a9 100644
--- a/docs/user_guide/config.rst
+++ b/docs/user_guide/config.rst
@@ -114,6 +114,8 @@ the nomenclature package will add all countries to the *region* codelist.
 
 More details on the list of countries can be found here: :ref:`countries`.
 
+.. _adding-countries:
+
 Adding NUTS to the region codelist
 ----------------------------------
 
@@ -174,3 +176,42 @@ the filtering for definitions.
 
 The above example retrieves only the model mapping for *MESSAGEix-GLOBIOM 2.1-M-R12*
 from the common-definitions repository.
+
+Configuring processors
+----------------------
+
+The ``processors`` section of *nomenclature.yaml* allows processors to be declared
+directly in the configuration file, so they are applied automatically when calling
+:func:`process` without passing an explicit ``processor`` argument.
+
+Region processor
+^^^^^^^^^^^^^^^^
+
+Setting *processors.region-processor* as *true* will automatically create a
+:class:`RegionProcessor` from the project's default ``mappings/`` directory:
+
+.. code:: yaml
+
+    processors:
+      region-processor: true
+
+This is equivalent to calling:
+
+.. code:: python
+
+    from nomenclature.processor import RegionProcessor
+
+    processor = RegionProcessor.from_directory("mappings", dsd)
+
+NUTS processor
+^^^^^^^^^^^^^^
+
+Setting *processors.nuts* to a list of model names will automatically create a
+:class:`NutsProcessor` and apply NUTS hierarchical aggregation (NUTS3 → NUTS2 →
+NUTS1 → Country → EU27) for those models:
+
+.. code:: yaml
+
+    processors:
+      nuts: [ Model A, Model B ]
+
+More details on NUTS aggregation can be found here: :ref:`nuts`.
diff --git a/nomenclature/__init__.py b/nomenclature/__init__.py
index a0b99f86..f3e56a0a 100644
--- a/nomenclature/__init__.py
+++ b/nomenclature/__init__.py
@@ -14,6 +14,7 @@ from nomenclature.nuts import nuts  # noqa
 from nomenclature.processor import (  # noqa
     RegionAggregationMapping,  # noqa
+    NutsProcessor,
     RegionProcessor,
     RequiredDataValidator,
 )
diff --git a/nomenclature/codelist.py b/nomenclature/codelist.py
index e9590c09..b193310c 100644
--- a/nomenclature/codelist.py
+++ b/nomenclature/codelist.py
@@ -537,8 +537,12 @@ def matches_filter(code, filters, keep):
 
 def check_attribute_match(code_value, filter_value):
     # if is list -> recursive
     # if is str -> escape all special characters except "*" and use a regex
+    # if is bool -> match exactly (must be checked before int since bool
+    # is a subclass of int)
     # if is int -> match exactly
     # if is None -> Attribute does not exist therefore does not match
+    if isinstance(filter_value, bool):
+        return code_value == filter_value
     if isinstance(filter_value, int):
         return code_value == filter_value
     if isinstance(filter_value, str):
@@ -592,6 +596,17 @@ class VariableCodeList(CodeList):
     unknown_code_error: ClassVar[type[UnknownCodeError]] = UnknownVariableError
 
     _data_validator = None
+    _region_aggregation_variables = None
+
+    @property
+    def region_aggregation_variables(self) -> list[str]:
+        """Variable names where skip_region_aggregation is False, cached on first access."""
+        if self._region_aggregation_variables is not None:
+            return self._region_aggregation_variables
+        self._region_aggregation_variables = [
+            var.name for var in self.mapping.values() if not var.skip_region_aggregation
+        ]
+        return self._region_aggregation_variables
 
     @property
     def data_validator(self):
@@ -812,6 +827,7 @@ def from_directory(
                     RegionCode(
                         name=r.code,
                         hierarchy=f"NUTS {level[-1]} regions (2024 edition)",
+                        extra_attributes={"nuts": True},
                     )
                 )
@@ -937,5 +953,4 @@ class MetaCodeList(CodeList):
 
 
 class ScenarioCodeList(CodeList):
-
     unknown_code_error = UnknownScenarioError
diff --git a/nomenclature/config.py b/nomenclature/config.py
index 68af1285..ee6cde2e 100644
--- a/nomenclature/config.py
+++ b/nomenclature/config.py
@@ -27,12 +27,24 @@
 class CodeListFromRepository(BaseModel):
+    """
+    Configuration for a codelist from an external repository.
+
+    The `include` and `exclude` filters allow selecting which definitions to import.
+    """
+
     name: str
     include: list[dict[str, Any]] = [{"name": "*"}]
     exclude: list[dict[str, Any]] = Field(default_factory=list)
 
 
 class CodeListConfig(BaseModel):
+    """Configuration for a dimension's codelist.
+
+    This class lists external repositories for codelists, importing definitions
+    from remote sources.
+    """
+
     dimension: str | None = None
     repositories: list[CodeListFromRepository] = Field(
         default_factory=list, alias="repository"
@@ -60,6 +72,13 @@ def repository_dimension_path(self) -> str:
 
 
 class RegionCodeListConfig(CodeListConfig):
+    """
+    Configuration for a region codelist.
+
+    This class allows selecting which regions to import from external repositories
+    and importing the definitions for ISO3 countries and NUTS regions.
+    """
+
     country: bool = False
     nuts: dict[str, str | list[str] | bool] | None = None
 
@@ -77,11 +96,12 @@ def check_nuts(
 
 
 class Repository(BaseModel):
+    """Configuration for an external codelist repository."""
+
     url: str
     hash: str | None = None
     release: str | None = None
     local_path: Path | None = Field(default=None, validate_default=True)
-    # defined via the `repository` name in the configuration
 
     @model_validator(mode="after")
     @classmethod
@@ -150,21 +170,22 @@ def check_external_repo_double_stacking(self):
 
 
 class DataStructureConfig(BaseModel):
-    """A class for configuration of a DataStructureDefinition
+    """
+    Configuration class for the data structure definition.
 
-    Attributes
-    ----------
-    region : RegionCodeListConfig
-        Attributes for configuring the RegionCodeList
+    This class defines the configuration for the main IAMC dimensions:
+
+    - scenario
+    - region
+    - variable
+
+    Each dimension can be configured with its own code list and repository sources.
     """
 
-    model: CodeListConfig = Field(default_factory=CodeListConfig)
     scenario: CodeListConfig = Field(default_factory=CodeListConfig)
     region: RegionCodeListConfig = Field(default_factory=RegionCodeListConfig)
     variable: CodeListConfig = Field(default_factory=CodeListConfig)
 
-    @field_validator("model", "scenario", "region", "variable", mode="before")
+    @field_validator("scenario", "region", "variable", mode="before")
     @classmethod
     def add_dimension(cls, v, info: ValidationInfo):
         return {"dimension": info.field_name, **v}
@@ -173,12 +194,14 @@ def add_dimension(cls, v, info: ValidationInfo):
     def repos(self) -> dict[str, str]:
         return {
             dimension: getattr(self, dimension).repositories
-            for dimension in ("model", "scenario", "region", "variable")
+            for dimension in ("scenario", "region", "variable")
             if getattr(self, dimension).repositories
         }
 
 
 class MappingRepository(BaseModel):
+    """Configuration for a mapping repository."""
+
     name: str
     include: list[str] = ["*"]
 
@@ -196,6 +219,8 @@ def match_models(self, models: list[str]) -> list[str]:
 
 
 class RegionMappingConfig(BaseModel):
+    """Configuration for region mapping/aggregation external repositories."""
+
     repositories: list[MappingRepository] = Field(
         default_factory=list, alias="repository"
    )
@@ -217,7 +242,20 @@ def convert_to_set_of_repos(cls, v):
         return v
 
 
+class ProcessorConfig(BaseModel):
+    """Configuration for region processor settings."""
+
+    nuts: list[str] | None = None
+    region_processor: bool = Field(False, alias="region-processor")
+
+    model_config = ConfigDict(
+        validate_by_name=True, validate_by_alias=True, extra="forbid"
+    )
+
+
 class TimeDomainConfig(BaseModel):
+    """Configuration for time domain validation settings."""
+
     year_allowed: bool = Field(default=True, alias="year")
     datetime_allowed: bool = Field(default=False, alias="datetime")
     timezone: str | None = Field(
@@ -305,6 +343,9 @@ class NomenclatureConfig(BaseModel):
     repositories: dict[str, Repository] = Field(default_factory=dict)
     definitions: DataStructureConfig = Field(default_factory=DataStructureConfig)
     mappings: RegionMappingConfig = Field(default_factory=RegionMappingConfig)
+    processor: ProcessorConfig = Field(
+        default_factory=ProcessorConfig, alias="processors"
+    )
     illegal_characters: list[str] = Field(
         default=[":", ";", '"'], alias="illegal-characters"
     )
@@ -326,6 +367,7 @@ def check_illegal_chars(cls, v: str | list[str]) -> list[str]:
     def check_definitions_repository(
         cls, v: "NomenclatureConfig"
     ) -> "NomenclatureConfig":
+        """Check that all repositories referenced in definitions and mappings exist."""
         mapping_repos = {"mappings": v.mappings.repositories} if v.mappings else {}
         repos: dict[str, list[MappingRepository]] = {
             **v.definitions.repos,
@@ -337,6 +379,16 @@ def check_definitions_repository(
                 raise ValueError((f"Unknown repository {unknown_repos} in '{use}'."))
         return v
 
+    @model_validator(mode="after")
+    @classmethod
+    def check_nuts_consistency(cls, v: "NomenclatureConfig") -> "NomenclatureConfig":
+        if v.processor.nuts and not v.definitions.region.nuts:
+            raise ValueError(
+                "`nuts` region processor set but no NUTS regions in `definitions`. "
+                "To fix, add NUTS regions under `definitions.region.nuts`."
+            )
+        return v
+
     def fetch_repos(self, target_folder: Path):
         for repo_name, repo in self.repositories.items():
             repo.fetch_repo(target_folder / repo_name)
diff --git a/nomenclature/core.py b/nomenclature/core.py
index 9686e058..c49cf118 100644
--- a/nomenclature/core.py
+++ b/nomenclature/core.py
@@ -5,6 +5,7 @@
 
 from nomenclature.definition import DataStructureDefinition
 from nomenclature.processor import Processor, RegionProcessor
+from nomenclature.processor.nuts import NutsProcessor
 
 logger = logging.getLogger(__name__)
@@ -21,11 +22,13 @@ def process(
     This function is the recommended way of using the nomenclature package.
     It performs the following operations:
 
-    * Validation against the codelists and criteria of a DataStructureDefinition
-    * Region-processing, which can consist of three parts:
-        1. Model native regions not listed in the model mapping will be dropped
-        2. Model native regions can be renamed
-        3. Aggregation from model native regions to "common regions"
+    * Validation against the codelists and criteria of a :class:`DataStructureDefinition`
+    * Region processing, which can occur via one or more :class:`Processor` instances. This can be:
+        * Region aggregation (via :class:`RegionProcessor`), which renames and aggregates based on user-provided mappings.
+            1. Model native regions not listed in the model mapping will be dropped
+            2. Model native regions can be renamed
+            3. Aggregation from model native regions to "common regions"
+        * NUTS aggregation (via :class:`NutsProcessor`), which aggregates NUTS3 -> NUTS2 -> NUTS1 -> Country -> EU27(+UK)
     * Validation of consistency across the variable hierarchy
 
     Parameters
@@ -36,9 +39,9 @@ def process(
         Codelists that are used for validation.
     dimensions : list, optional
         Dimensions to be used in the validation, defaults to all dimensions defined in
-        `dsd`
-    processor : :class:`RegionProcessor`, optional
-        Region processor to perform region renaming and aggregation (if given)
+        ``dsd``.
+    processor : :class:`Processor` or list of :class:`Processor`, optional
+        One or more processors to apply. These run before any config-declared processors.
 
     Returns
     -------
@@ -56,8 +59,30 @@ def process(
 
     dimensions = dimensions or dsd.dimensions
 
+    # Auto-instantiate processors declared in nomenclature.yaml under 'processors'.
+    # Raise an error if both explicit and config-based processors exist.
+    if getattr(dsd.config.processor, "region_processor", False):
+        if any(isinstance(p, RegionProcessor) for p in processor):
+            raise ValueError(
+                "Config declares 'region-processor: true' but an explicit "
+                "RegionProcessor was provided. Please specify only one source of "
+                "RegionProcessor (either via config or explicitly)."
+            )
+        processor.append(
+            RegionProcessor.from_directory(dsd.project_folder / "mappings", dsd)
+        )
+
+    if getattr(dsd.config.processor, "nuts", None) is not None:
+        if any(isinstance(p, NutsProcessor) for p in processor):
+            raise ValueError(
+                "Config declares 'nuts' processor but an explicit NutsProcessor "
+                "was provided. Please specify only one source of NutsProcessor "
+                "(either via config or explicitly)."
+            )
+        processor.append(NutsProcessor.from_definition(dsd))
+
     if (
-        any(isinstance(p, RegionProcessor) for p in processor)
+        any(isinstance(p, (RegionProcessor, NutsProcessor)) for p in processor)
         and "region" in dimensions
     ):
         dimensions.remove("region")
diff --git a/nomenclature/processor/__init__.py b/nomenclature/processor/__init__.py
index 64962735..291b7c53 100644
--- a/nomenclature/processor/__init__.py
+++ b/nomenclature/processor/__init__.py
@@ -3,6 +3,7 @@
     RegionAggregationMapping,
     RegionProcessor,
 )
+from nomenclature.processor.nuts import NutsProcessor  # noqa
 from nomenclature.processor.required_data import RequiredDataValidator  # noqa
 from nomenclature.processor.data_validator import DataValidator  # noqa
 from nomenclature.processor.aggregator import Aggregator  # noqa
diff --git a/nomenclature/processor/nuts.py b/nomenclature/processor/nuts.py
new file mode 100644
index 00000000..1bfb3443
--- /dev/null
+++ b/nomenclature/processor/nuts.py
@@ -0,0 +1,357 @@
+import logging
+import re
+import pyam
+import pandas as pd
+
+from collections import defaultdict
+from pathlib import Path
+from pyam import IamDataFrame
+from pyam.utils import adjust_log_level
+from pydantic import ConfigDict
+
+from nomenclature.codelist import VariableCodeList, RegionCodeList
+from nomenclature.definition import DataStructureDefinition
+from nomenclature.processor import Processor
+from nomenclature.processor.region import (
+    aggregate_region_with_variable_rules,
+    merge_with_preaggregated_data,
+)
+from nomenclature.exceptions import UnknownRegionError
+from nomenclature.countries import countries
+from nomenclature.nuts import nuts
+
+logger = logging.getLogger(__name__)
+
+here = Path(__file__).parent.absolute()
+
+# EU27 member states alpha-2 codes (ISO 3166-1), membership as of 2026
+EU27_ALPHA2: frozenset[str] = frozenset(
+    {
+        "AT",  # Austria
+        "BE",  # Belgium
+        "BG",  # Bulgaria
+        "CY",  # Cyprus
+        "CZ",  # Czechia
+        "DE",  # Germany
+        "DK",  # Denmark
+        "EE",  # Estonia
+        "ES",  # Spain
+        "FI",  # Finland
+        "FR",  # France
+        "GR",  # Greece
+        "HR",  # Croatia
+        "HU",  # Hungary
+        "IE",  # Ireland
+        "IT",  # Italy
+        "LT",  # Lithuania
+        "LU",  # Luxembourg
+        "LV",  # Latvia
+        "MT",  # Malta
+        "NL",  # Netherlands
+        "PL",  # Poland
+        "PT",  # Portugal
+        "RO",  # Romania
+        "SE",  # Sweden
+        "SI",  # Slovenia
+        "SK",  # Slovakia
+    }
+)
+# Minimum number of EU27 member countries required to aggregate to "European Union"
+EU27_MIN_COUNTRIES: int = 23
+# UK alpha-2 code for "European Union and United Kingdom" aggregation
+UK_ALPHA2: str = "UK"
+
+
+class NutsProcessor(Processor):
+    """NUTS region aggregation mappings for scenario processing"""
+
+    variable_codelist: VariableCodeList
+    region_codelist: RegionCodeList
+    models: list[str]
+
+    model_config = ConfigDict(hide_input_in_errors=True)
+
+    @classmethod
+    def from_definition(
+        cls, dsd: DataStructureDefinition, models: list[str] | None = None
+    ):
+        """Instantiate from a :class:`DataStructureDefinition`.
+
+        Parameters
+        ----------
+        dsd : DataStructureDefinition
+            Project data structure definition.
+        models : list[str], optional
+            Models to apply NUTS aggregation to. Defaults to the list configured
+            under ``config.processor.nuts`` in *dsd*.
+
+        Raises
+        ------
+        ValueError
+            If no models are configured for NUTS processing.
+        """
+        models = models or dsd.config.processor.nuts
+        if not models:
+            raise ValueError("No models configured for NUTS processor")
+        return cls(
+            variable_codelist=dsd.variable, region_codelist=dsd.region, models=models
+        )
+
+    @property
+    def nuts_codelist(self):
+        return RegionCodeList(
+            name="NUTS",
+            mapping={
+                code.name: code
+                for code in self.region_codelist.mapping.values()
+                if re.search(r"NUTS \d regions \(2024 edition\)", code.hierarchy)
+            },
+        )
+
+    def apply(self, df: IamDataFrame):
+        """Apply NUTS region aggregation.
+
+        Parameters
+        ----------
+        df : IamDataFrame
+            Input data to be aggregated.
+
+        Returns
+        -------
+        IamDataFrame
+            Aggregated data.
+
+        Raises
+        ------
+        ValueError
+            If a NUTS region in *df* is not listed in ``definitions.region.nuts``.
+        UnknownRegionError
+            If the result contains regions not defined in the region codelist.
+        """
+        processed_dfs: list[IamDataFrame] = []
+
+        # Check for NUTS regions not listed in the configuration
+        all_nuts = {r.code for r in nuts.get(level=[1, 2, 3])}
+        if unaccounted_nuts := self.nuts_codelist.validate_df(
+            df.filter(region=all_nuts), "region"
+        ):
+            raise ValueError(
+                f"Did not find NUTS region(s) {unaccounted_nuts} in 'region.nuts' configuration."
+            )
+
+        for model in df.model:
+            model_df = df.filter(model=model)
+
+            # Skip unlisted models
+            if model not in self.models:
+                logger.info(
+                    f"Skipping NUTS region aggregation for model '{model}' (no NUTS aggregation mapping)"
+                )
+                processed_dfs.append(model_df)
+            else:
+                logger.info(f"Applying NUTS processing for model '{model}'")
+                processed_dfs.append(self._apply_nuts_processing(model_df)[0])
+
+        res = pyam.concat(processed_dfs)
+        if not_defined_regions := self.region_codelist.validate_items(res.region):
+            raise UnknownRegionError(not_defined_regions)
+
+        return res
+
+    def _aggregate_nuts_level(
+        self,
+        model_df: IamDataFrame,
+        source_regions: list[str],
+        parent_prefix_length: int,
+    ) -> IamDataFrame:
+        """Aggregate source NUTS regions to their parent region.
+
+        Parameters
+        ----------
+        model_df : IamDataFrame
+            Input data
+        source_regions : list[str]
+            List of NUTS region codes to aggregate
+        parent_prefix_length : int
+            Length of parent region code (4 for NUTS2, 3 for NUTS1, 2 for country)
+
+        Returns
+        -------
+        IamDataFrame
+            Aggregated data
+        """
+
+        aggregated_data = []
+
+        # Group by parent region
+        parent_groups = defaultdict(list)
+        for source_region in source_regions:
+            parent = source_region[:parent_prefix_length]
+            parent_groups[parent].append(source_region)
+
+        # Aggregate each parent from its constituents
+        for parent_code, constituents in parent_groups.items():
+            parent = (
+                parent_code
+                if len(parent_code) > 2  # NUTS parent: keep code; country: use name
+                else countries.get(alpha_2=parent_code).name
+            )
+            aggregated = aggregate_region_with_variable_rules(
+                model_df,
+                parent,
+                constituents,
+                self.variable_codelist,
+            )
+            aggregated_data.extend(aggregated)
+
+        return IamDataFrame(pd.concat(aggregated_data), meta=model_df.meta)
+
+    def _aggregate_to_eu27(self, df: IamDataFrame) -> list[pd.Series]:
+        """Aggregate country-level data to European Union (and United Kingdom).
+
+        Aggregation is performed if at least 23 of the 27 EU member states
+        are present in `df`.
+        Aggregation to EU27+UK is additionally performed if the United Kingdom
+        is also present.
+
+        Both aggregations are **only** attempted if the target region is defined in
+        the project's region codelist. If either target is not defined, the
+        corresponding aggregation is silently skipped.
+
+        Parameters
+        ----------
+        df : IamDataFrame
+            Country-level data (after NUTS aggregation).
+
+        Returns
+        -------
+        list[pd.Series]
+            Aggregated EU data series (empty if threshold or codelist conditions are
+            not met).
+        """
+        eu27_names = {countries.get(alpha_2=alpha2).name for alpha2 in EU27_ALPHA2}
+        uk_name = countries.get(alpha_2=UK_ALPHA2).name
+
+        available_eu27 = eu27_names & set(df.region)
+        result: list[pd.Series] = []
+
+        if len(available_eu27) < EU27_MIN_COUNTRIES:
+            return result
+
+        if "European Union" in self.region_codelist.mapping:
+            logger.info(
+                f"Aggregating {len(available_eu27)} EU27 member countries "
+                "to 'European Union'"
+            )
+            result.extend(
+                aggregate_region_with_variable_rules(
+                    df,
+                    "European Union",
+                    sorted(available_eu27),
+                    self.variable_codelist,
+                )
+            )
+
+        if (
+            "European Union and United Kingdom" in self.region_codelist.mapping
+            and uk_name in set(df.region)
+        ):
+            logger.info(
+                "Aggregating EU27 countries + United Kingdom to 'European Union and United Kingdom'"
+            )
+            result.extend(
+                aggregate_region_with_variable_rules(
+                    df,
+                    "European Union and United Kingdom",
+                    sorted(available_eu27) + [uk_name],
+                    self.variable_codelist,
+                )
+            )
+
+        return result
+
+    def _apply_nuts_processing(
+        self,
+        model_df: IamDataFrame,
+        return_aggregation_difference: bool = False,
+        rtol_difference: float = 0.01,
+    ):
+        """Apply the full NUTS aggregation pipeline for a single model.
+
+        Parameters
+        ----------
+        model_df : IamDataFrame
+            Data for a single model.
+        return_aggregation_difference : bool, optional
+            Whether to return aggregation differences for diagnostics.
+        rtol_difference : float, optional
+            Relative tolerance used when comparing pre-aggregated country data
+            against freshly aggregated values.
+
+        Returns
+        -------
+        tuple[IamDataFrame, Any]
+            Processed data and (optionally populated) aggregation difference.
+        """
+        model = model_df.model[0]
+
+        _df = model_df.copy()
+
+        # Silence pyam's empty filter warnings
+        with adjust_log_level(logger="pyam", level="ERROR"):
+            # NUTS3 > NUTS2 aggregation
+            if nuts3_in_data := _df.filter(region={r.code for r in nuts.get(level=3)}):
+                # Keep NUTS3, add aggregated NUTS2
+                _df = pyam.concat(
+                    [_df, self._aggregate_nuts_level(_df, nuts3_in_data.region, 4)]
+                )
+
+            # NUTS2 > NUTS1 aggregation (uses original NUTS2 + aggregated NUTS2)
+            if nuts2_in_data := _df.filter(region={r.code for r in nuts.get(level=2)}):
+                # Keep NUTS2, add aggregated NUTS1
+                _df = pyam.concat(
+                    [_df, self._aggregate_nuts_level(_df, nuts2_in_data.region, 3)]
+                )
+
+            # NUTS1 > Country aggregation (uses original NUTS1 + aggregated NUTS1)
+            if nuts1_in_data := _df.filter(region={r.code for r in nuts.get(level=1)}):
+                _nuts1_agg = self._aggregate_nuts_level(_df, nuts1_in_data.region, 2)
+
+            # Compare & merge country-level aggregated data with any pre-aggregated
+            # country data in the original model input
+            _data, difference = merge_with_preaggregated_data(
+                model_df,
+                [_nuts1_agg._data] if nuts1_in_data else [],
+                countries.names,
+                self.variable_codelist,
+                rtol_difference,
+                return_aggregation_difference,
+                model,
+            )
+
+            # EU27(+UK) aggregation from country-level data
+            _country_df = IamDataFrame(_data, meta=model_df.meta)
+            if eu_target_regions := [
+                r
+                for r in ("European Union", "European Union and United Kingdom")
+                if r in self.region_codelist.mapping
+            ]:
+                _eu_aggregated = self._aggregate_to_eu27(_country_df)
+                if _eu_aggregated:
+                    _eu_data, _ = merge_with_preaggregated_data(
+                        model_df,
+                        _eu_aggregated,
+                        eu_target_regions,
+                        self.variable_codelist,
+                        rtol_difference,
+                        return_aggregation_difference,
+                        model,
+                    )
+                    _data = pd.concat([_data, _eu_data])
+
+            # Include all NUTS regions (source + intermediate aggregated levels)
+            # that are present in the configured nuts_codelist
+            if nuts_to_keep := set(_df.region) & set(self.nuts_codelist.mapping):
+                _data = pd.concat([_data, _df.filter(region=list(nuts_to_keep))._data])
+
+        return IamDataFrame(_data, meta=model_df.meta), difference
diff --git a/nomenclature/processor/region.py b/nomenclature/processor/region.py
index 762a9a82..10801492 100644
--- a/nomenclature/processor/region.py
+++ b/nomenclature/processor/region.py
@@ -130,9 +130,11 @@ def convert_to_list(cls, v):
     @field_validator("native_regions")
     @classmethod
     def validate_native_regions_name(cls, v, info: ValidationInfo):
-        """Checks if a native region occurs a maximum of two ways:
-        * at most once in both keep name AND rename format
-        * only once in either keep name OR rename format"""
+        """
+        Validate that each native region source name appears *at most* once as:
+
+        - a region without renaming (keep original name), and/or
+        - a region with renaming (assign new name)
+        """
         keep = [nr.name for nr in v if nr.rename is None]
         rename = [nr.name for nr in v if nr.rename is not None]
         keep_dups = [item for item, count in Counter(keep).items() if count > 1]
@@ -151,6 +153,7 @@ def validate_native_regions_name(cls, v, info: ValidationInfo):
     @field_validator("native_regions")
     @classmethod
     def validate_native_regions_target(cls, v, info: ValidationInfo):
+        """Check that target region names (after renaming, if applicable) are unique."""
         target_names = [nr.target_native_region for nr in v]
         duplicates = [
             item for item, count in Counter(target_names).items() if count > 1
@@ -370,7 +373,7 @@ def from_excel(cls, file) -> "RegionAggregationMapping":
                     0,
                     CommonRegion(
                         name="World", constituent_regions=constituent_world_regions
-                    )
+                    ),
                 )
         except Exception as error:
             raise ValueError(f"{error} in {get_relative_path(file)}") from error
@@ -427,9 +430,9 @@ def check_unexpected_regions(self, df: IamDataFrame) -> None:
             self.model_native_region_names
             + self.common_region_names
             + [
-                const_reg
-                for comm_reg in self.common_regions or []
-                for const_reg in comm_reg.constituent_regions
+                constituent_region
+                for common_region in self.common_regions or []
+                for constituent_region in common_region.constituent_regions
             ]
             + (self.exclude_regions or [])
         ):
@@ -492,6 +495,7 @@ class RegionProcessor(Processor):
         str,
         Annotated[RegionAggregationMapping, AfterValidator(validate_with_definition)],
     ]
+    model_config = ConfigDict(hide_input_in_errors=True)
 
     @classmethod
@@ -660,21 +664,21 @@ def _apply_region_processing(
         return_aggregation_difference: bool = False,
         rtol_difference: float = 0.01,
     ) -> tuple[IamDataFrame, pd.DataFrame]:
-        """Apply the region processing for a single model"""
+        """Apply region processing for a single model"""
         if len(model_df.model) != 1:
             raise ValueError(
                 f"Must be called for a unique model, found: {model_df.model}"
             )
         model = model_df.model[0]
 
-        # check for regions not mentioned in the model mapping
         self.mappings[model].check_unexpected_regions(model_df)
 
         _processed_data: list[pd.Series] = []
 
-        # silence pyam's empty filter warnings
+        # Silence pyam's empty filter warnings
         with adjust_log_level(logger="pyam", level="ERROR"):
-            # add unchanged native regions to processed data
+            # Native region handling
+            # Unchanged regions are added to processed data directly
             keep = [
                 r.name for r in self.mappings[model].native_regions if r.rename is None
             ]
@@ -682,7 +686,7 @@ def _apply_region_processing(
             if not keep_df.empty:
                 _processed_data.append(keep_df._data)
 
-            # add renamed native regions to processed data
+            # Renamed regions are added to processed data
             rename = [
                 r.name
                 for r in self.mappings[model].native_regions
@@ -694,90 +698,38 @@ def _apply_region_processing(
                 rename_df.rename(region=self.mappings[model].rename_mapping)._data
             )
 
-            # aggregate common regions
+            # Aggregation
             for common_region in self.mappings[model].common_regions:
-                # if common region consists of single native region
-                # treat as rename that filters skip-region-aggregation variables
-                non_skip_vars = [
-                    var.name
-                    for var in self.variable_codelist.values()
-                    if not var.skip_region_aggregation
-                ]
+                # Single-constituent common regions are a special rename case
+                # (technically aggregated, so aggregation-skipped variables are excluded)
                 if common_region.is_single_constituent_region:
                     _df = model_df.filter(
                         region=common_region.constituent_regions[0],
-                        variable=non_skip_vars,
+                        variable=self.variable_codelist.region_aggregation_variables,
                     ).rename(region=common_region.rename_dict)
-                    regions = [common_region.name, common_region.constituent_regions]
-
-                # first, perform 'simple' aggregation (no arguments)
-                simple_vars = [
-                    var
-                    for var in self.variable_codelist.vars_default_args(
-                        model_df.variable
+                    if not _df.empty:
+                        _processed_data.append(_df._data)
+                else:
+                    # Use aggregation function
+                    aggregated = aggregate_region_with_variable_rules(
+                        model_df,
+                        common_region.name,
+                        common_region.constituent_regions,
+                        self.variable_codelist,
                     )
-                ]
-                _df = model_df.aggregate_region(
-                    simple_vars,
-                    *regions,
-                )
-                if _df is not None and not _df.empty:
-                    _processed_data.append(_df._data)
-
-                # second, special weighted aggregation
-                for var in self.variable_codelist.vars_kwargs(model_df.variable):
-                    if var.region_aggregation is None:
-                        _df = _aggregate_region(
-                            model_df,
-                            var.name,
-                            *regions,
-                            **var.pyam_agg_kwargs,
-                        )
-                        if _df is not None and not _df.empty:
-                            _processed_data.append(_df._data)
-                    else:
-                        for rename_var in var.region_aggregation:
-                            for _rename, _kwargs in rename_var.items():
-                                _df = _aggregate_region(
-                                    model_df,
-                                    var.name,
-                                    *regions,
-                                    **_kwargs,
-                                )
-                                if _df is not None and not _df.empty:
-                                    _processed_data.append(
-                                        _df.rename(variable={var.name: _rename})._data
-                                    )
-
-            # add pre-aggregated common region data to processed data
-            common_region_df = model_df.filter(
-                region=self.mappings[model].common_region_names,
-                variable=self.variable_codelist,
+                    _processed_data.extend(aggregated)
+
+            # Compare & merge with pre-aggregated data
+            _data, difference = merge_with_preaggregated_data(
+                model_df,
+                _processed_data,
self.mappings[model].common_region_names, + self.variable_codelist, + rtol_difference, + return_aggregation_difference, + model, ) - # concatenate and merge with data provided at common-region level - difference = pd.DataFrame() - if _processed_data: - _data = pd.concat(_processed_data) - if not common_region_df.empty: - _data, difference = _compare_and_merge( - common_region_df._data, - _data, - rtol_difference, - return_aggregation_difference, - ) - - # if data exists only at the common-region level - elif not common_region_df.empty: - _data = common_region_df._data - - # raise an error if region-processing yields an empty result - else: - raise ValueError( - f"Region-processing for model '{model}' returned an empty dataset" - ) - - # cast processed timeseries data and meta indicators to IamDataFrame return IamDataFrame(_data, meta=model_df.meta), difference def revert(self, df: pyam.IamDataFrame) -> pyam.IamDataFrame: @@ -793,6 +745,127 @@ def revert(self, df: pyam.IamDataFrame) -> pyam.IamDataFrame: return pyam.concat(model_dfs) +def aggregate_region_with_variable_rules( + df: IamDataFrame, + target_region: str, + constituent_regions: list[str], + variable_codelist: VariableCodeList, +) -> list[pd.Series]: + """ + Core region aggregation logic with variable-specific rules. + + This is the shared aggregation engine used by different processors. 
+ It handles: + - Variables with simple aggregation (sum) + - Variables with weighted aggregation + - Variables with custom methods + - Variables with skip_region_aggregation flag + + Parameters + ---------- + df : IamDataFrame + Source data + target_region : str + Name of region to create + constituent_regions : list of str + Regions to aggregate from + variable_codelist : VariableCodeList + Variable definitions with aggregation rules + + Returns + ------- + list of pd.Series + Aggregated data series + """ + aggregated_data = [] + regions = [target_region, constituent_regions] + + # Simple aggregation (default sum) + simple_vars = [var for var in variable_codelist.vars_default_args(df.variable)] + if simple_vars: + _df = df.aggregate_region(simple_vars, *regions) + if _df is not None and not _df.empty: + aggregated_data.append(_df._data) + + # Weighted/special aggregation + for var in variable_codelist.vars_kwargs(df.variable): + if var.region_aggregation is None: + # Standard weighted aggregation + _df = _aggregate_region(df, var.name, *regions, **var.pyam_agg_kwargs) + if _df is not None and not _df.empty: + aggregated_data.append(_df._data) + else: + # Aggregation with variable renaming + for rename_var in var.region_aggregation: + for _rename, _kwargs in rename_var.items(): + _df = _aggregate_region(df, var.name, *regions, **_kwargs) + if _df is not None and not _df.empty: + aggregated_data.append( + _df.rename(variable={var.name: _rename})._data + ) + + return aggregated_data + + +def merge_with_preaggregated_data( + model_df: IamDataFrame, + aggregated_data: list[pd.Series], + target_regions: list[str], + variable_codelist: VariableCodeList, + rtol_difference: float = 0.01, + return_aggregation_difference: bool = False, + model_name: str = "", +) -> tuple[pd.Series, pd.DataFrame]: + """Merge aggregated data with any pre-aggregated data that exists at target regions. 
+ + Parameters + ---------- + model_df : IamDataFrame + Original model data + aggregated_data : list of pd.Series + List of aggregated data series + target_regions : list of str + Regions to filter for pre-aggregated data + variable_codelist : VariableCodeList + Variables to include + rtol_difference : float + Relative tolerance for comparison + return_aggregation_difference : bool + Whether to return difference dataframe + model_name : str + Model name for error messages + + Returns + ------- + tuple of (pd.Series, pd.DataFrame) + Merged data and difference report + """ + # Filter for pre-aggregated data + pre_aggregated_df = model_df.filter( + region=target_regions, + variable=variable_codelist, + ) + + difference = pd.DataFrame() + if aggregated_data: + _data = pd.concat(aggregated_data) + if not pre_aggregated_df.empty: + _data, difference = _compare_and_merge( + pre_aggregated_df._data, + _data, + rtol_difference, + return_aggregation_difference, + ) + elif not pre_aggregated_df.empty: + _data = pre_aggregated_df._data + else: + raise ValueError( + f"Region-processing for model '{model_name}' returned an empty dataset" + ) + + return _data, difference + + def _aggregate_region(df, var, *regions, **kwargs): """Perform region aggregation with kwargs catching inconsistent-index errors""" try: diff --git a/tests/data/nuts_processing/dsd/definitions/region/regions.yaml b/tests/data/nuts_processing/dsd/definitions/region/regions.yaml new file mode 100644 index 00000000..8e4e8cd5 --- /dev/null +++ b/tests/data/nuts_processing/dsd/definitions/region/regions.yaml @@ -0,0 +1,3 @@ +- Supranational: + - European Union + - European Union and United Kingdom diff --git a/tests/data/nuts_processing/dsd/definitions/variable/variables.yaml b/tests/data/nuts_processing/dsd/definitions/variable/variables.yaml new file mode 100644 index 00000000..79342dfc --- /dev/null +++ b/tests/data/nuts_processing/dsd/definitions/variable/variables.yaml @@ -0,0 +1,3 @@ +- Primary Energy: + 
definition: Total primary energy consumption + unit: EJ/yr \ No newline at end of file diff --git a/tests/data/nuts_processing/dsd/nomenclature.yaml b/tests/data/nuts_processing/dsd/nomenclature.yaml new file mode 100644 index 00000000..9efde39f --- /dev/null +++ b/tests/data/nuts_processing/dsd/nomenclature.yaml @@ -0,0 +1,12 @@ +dimensions: + - region + - variable +definitions: + region: + country: true + nuts: + nuts-1: [AT] + nuts-2: [AT] + nuts-3: [AT] +processors: + nuts: [model_a] \ No newline at end of file diff --git a/tests/data/nuts_processing/dsd_no_eu/definitions/region/regions.yaml b/tests/data/nuts_processing/dsd_no_eu/definitions/region/regions.yaml new file mode 100644 index 00000000..bf7aa984 --- /dev/null +++ b/tests/data/nuts_processing/dsd_no_eu/definitions/region/regions.yaml @@ -0,0 +1,29 @@ +- Country: + - Austria + - Belgium + - Bulgaria + - Croatia + - Cyprus + - Czechia + - Denmark + - Estonia + - Finland + - France + - Germany + - Greece + - Hungary + - Ireland + - Italy + - Latvia + - Lithuania + - Luxembourg + - Malta + - Netherlands + - Poland + - Portugal + - Romania + - Slovakia + - Slovenia + - Spain + - Sweden + - United Kingdom diff --git a/tests/data/nuts_processing/dsd_no_eu/definitions/variable/variables.yaml b/tests/data/nuts_processing/dsd_no_eu/definitions/variable/variables.yaml new file mode 100644 index 00000000..7a367a2b --- /dev/null +++ b/tests/data/nuts_processing/dsd_no_eu/definitions/variable/variables.yaml @@ -0,0 +1,3 @@ +- Primary Energy: + definition: Total primary energy consumption + unit: EJ/yr diff --git a/tests/data/nuts_processing/dsd_no_eu/nomenclature.yaml b/tests/data/nuts_processing/dsd_no_eu/nomenclature.yaml new file mode 100644 index 00000000..2584aebd --- /dev/null +++ b/tests/data/nuts_processing/dsd_no_eu/nomenclature.yaml @@ -0,0 +1,11 @@ +dimensions: + - region + - variable +definitions: + region: + nuts: + nuts-1: [AT] + nuts-2: [AT] + nuts-3: [AT] +processors: + nuts: [model_a] diff --git 
a/tests/data/processor/region_processor/definitions/region/regions.yaml b/tests/data/processor/region_processor/definitions/region/regions.yaml new file mode 100644 index 00000000..28e01921 --- /dev/null +++ b/tests/data/processor/region_processor/definitions/region/regions.yaml @@ -0,0 +1,6 @@ +- common: + - World +- model_native: + - region_A + - region_B + - region_C diff --git a/tests/data/processor/region_processor/definitions/variable/variables.yaml b/tests/data/processor/region_processor/definitions/variable/variables.yaml new file mode 100644 index 00000000..7a367a2b --- /dev/null +++ b/tests/data/processor/region_processor/definitions/variable/variables.yaml @@ -0,0 +1,3 @@ +- Primary Energy: + definition: Total primary energy consumption + unit: EJ/yr diff --git a/tests/data/processor/region_processor/mappings/model_a.yaml b/tests/data/processor/region_processor/mappings/model_a.yaml new file mode 100644 index 00000000..8200617f --- /dev/null +++ b/tests/data/processor/region_processor/mappings/model_a.yaml @@ -0,0 +1,7 @@ +model: model_a +common_regions: + - World: + - region_A + - region_B +exclude_regions: + - region_C diff --git a/tests/data/processor/region_processor/nomenclature.yaml b/tests/data/processor/region_processor/nomenclature.yaml new file mode 100644 index 00000000..c0ebed71 --- /dev/null +++ b/tests/data/processor/region_processor/nomenclature.yaml @@ -0,0 +1,5 @@ +dimensions: + - region + - variable +processors: + region-processor: true diff --git a/tests/test_core.py b/tests/test_core.py index cf838ee9..d55674b8 100644 --- a/tests/test_core.py +++ b/tests/test_core.py @@ -1,4 +1,5 @@ import copy +from unittest.mock import patch import numpy as np import pandas as pd @@ -614,3 +615,68 @@ def test_region_aggregation_unknown_region(simple_df, simple_definition, caplog) RegionProcessor.from_directory( TEST_DATA_DIR / "region_processing" / "no_mapping", simple_definition ).apply(df_with_unknown_region) + + +CONFIG_PROCESSOR_DIR = 
TEST_DATA_DIR / "processor" + + +def test_config_region_processor_auto_loaded(): + """ + Test `region-processor: true` in nomenclature.yaml creates `RegionProcessor` + from the default mappings directory. + """ + test_df = IamDataFrame( + pd.DataFrame( + [ + ["model_a", "scen_a", "region_A", "Primary Energy", "EJ/yr", 1, 2], + ["model_a", "scen_a", "region_B", "Primary Energy", "EJ/yr", 3, 4], + ["model_a", "scen_a", "region_C", "Primary Energy", "EJ/yr", 5, 6], + ], + columns=IAMC_IDX + [2005, 2010], + ) + ) + + exp = IamDataFrame( + pd.DataFrame( + [ + ["model_a", "scen_a", "World", "Primary Energy", "EJ/yr", 4, 6], + ], + columns=IAMC_IDX + [2005, 2010], + ) + ) + + dsd = DataStructureDefinition(CONFIG_PROCESSOR_DIR / "region_processor/definitions") + obs = process(test_df, dsd) + + assert_iamframe_equal(obs, exp) + + +def test_config_and_explicit_region_processor_raise(): + """ + Test that providing an explicit `RegionProcessor` when config also declares one + raises a `ValueError`. 
+ """ + test_df = IamDataFrame( + pd.DataFrame( + [ + ["model_a", "scen_a", "region_A", "Primary Energy", "EJ/yr", 1, 2], + ["model_a", "scen_a", "region_B", "Primary Energy", "EJ/yr", 3, 4], + ["model_a", "scen_a", "region_C", "Primary Energy", "EJ/yr", 5, 6], + ], + columns=IAMC_IDX + [2005, 2010], + ) + ) + + dsd = DataStructureDefinition(CONFIG_PROCESSOR_DIR / "region_processor/definitions") + explicit_rp = RegionProcessor.from_directory( + CONFIG_PROCESSOR_DIR / "region_processor/mappings", dsd + ) + + with patch.object( + RegionProcessor, "from_directory", wraps=RegionProcessor.from_directory + ) as from_dir: + with pytest.raises( + ValueError, match="Config declares 'region-processor: true' but an explicit" + ): + process(test_df, dsd, processor=explicit_rp) + from_dir.assert_not_called() diff --git a/tests/test_nuts_aggregation.py b/tests/test_nuts_aggregation.py new file mode 100644 index 00000000..8a59c207 --- /dev/null +++ b/tests/test_nuts_aggregation.py @@ -0,0 +1,198 @@ +# tests/test_nuts_aggregation.py +import pandas as pd +import pytest +from pathlib import Path +from pyam import IamDataFrame, assert_iamframe_equal +from pyam.utils import IAMC_IDX + +from nomenclature import DataStructureDefinition +from nomenclature.processor.nuts import NutsProcessor, EU27_MIN_COUNTRIES + +here = Path(__file__).parent +TEST_DATA_DIR = here / "data" +NUTS_TEST_DIR = TEST_DATA_DIR / "nuts_processing" / "dsd" +NUTS_NO_EU_TEST_DIR = TEST_DATA_DIR / "nuts_processing" / "dsd_no_eu" + +# 27 EU member country names (as returned by the countries module) +EU27_NAMES = [ + "Austria", + "Belgium", + "Bulgaria", + "Croatia", + "Cyprus", + "Germany", + "Denmark", + "Estonia", + "Finland", + "France", + "Greece", + "Hungary", + "Ireland", + "Italy", + "Lithuania", + "Luxembourg", + "Latvia", + "Malta", + "Netherlands", + "Poland", + "Portugal", + "Romania", + "Sweden", + "Slovenia", + "Slovakia", + "Spain", + "Czechia", +] +assert len(EU27_NAMES) == 27 + +UK_NAME = "United 
Kingdom" + + +def _make_country_df(countries: list[str], value: float = 1.0) -> IamDataFrame: + """Build a minimal IamDataFrame with one row per country.""" + return IamDataFrame( + pd.DataFrame( + [ + ["model_a", "scen_a", c, "Primary Energy", "EJ/yr", value, value] + for c in countries + ], + columns=IAMC_IDX + [2005, 2010], + ) + ) + + +def test_nuts_simple_aggregation(): + """Test basic NUTS3 -> NUTS2 -> NUTS1 -> Country aggregation""" + + # Create test data with NUTS3 regions (Austria) + # AT111, AT112 should aggregate to AT11 (NUTS2) -> AT1 (NUTS1) -> Austria + test_df = IamDataFrame( + pd.DataFrame( + [ + ["model_a", "scen_a", "AT111", "Primary Energy", "EJ/yr", 1.0, 2.0], + ["model_a", "scen_a", "AT112", "Primary Energy", "EJ/yr", 3.0, 4.0], + ["model_a", "scen_a", "Belgium", "Primary Energy", "EJ/yr", 5.0, 6.0], + ], + columns=IAMC_IDX + [2005, 2010], + ) + ) + + # Expected output: aggregated hierarchies + all original/intermediate NUTS regions + expected = IamDataFrame( + pd.DataFrame( + [ + ["model_a", "scen_a", "AT1", "Primary Energy", "EJ/yr", 4.0, 6.0], + ["model_a", "scen_a", "AT11", "Primary Energy", "EJ/yr", 4.0, 6.0], + ["model_a", "scen_a", "AT111", "Primary Energy", "EJ/yr", 1.0, 2.0], + ["model_a", "scen_a", "AT112", "Primary Energy", "EJ/yr", 3.0, 4.0], + ["model_a", "scen_a", "Austria", "Primary Energy", "EJ/yr", 4.0, 6.0], + ["model_a", "scen_a", "Belgium", "Primary Energy", "EJ/yr", 5.0, 6.0], + ], + columns=IAMC_IDX + [2005, 2010], + ) + ) + + # Load DSD and apply NUTS processor + dsd = DataStructureDefinition(NUTS_TEST_DIR / "definitions") + processor = NutsProcessor.from_definition(dsd) + + result = processor.apply(test_df) + + assert_iamframe_equal(result, expected) + + +def test_nuts_duplicate_aggregation_raises(): + """Test that NUTS aggregation on a region and its children raises.""" + + test_df = IamDataFrame( + pd.DataFrame( + [ + ["model_a", "scen_a", "AT111", "Primary Energy", "EJ/yr", 1.0, 2.0], + ["model_a", "scen_a", "AT112", 
"Primary Energy", "EJ/yr", 3.0, 4.0], + ["model_a", "scen_a", "AT11", "Primary Energy", "EJ/yr", 5.0, 6.0], + ], + columns=IAMC_IDX + [2005, 2010], + ) + ) + + dsd = DataStructureDefinition(NUTS_TEST_DIR / "definitions") + processor = NutsProcessor.from_definition(dsd) + + with pytest.raises(ValueError, match="Duplicate rows in `data`"): + processor.apply(test_df) + + +def test_eu27_aggregation_sufficient_countries(): + """EU27 aggregate is produced when at least 23 members are present.""" + countries = EU27_NAMES[:EU27_MIN_COUNTRIES] + test_df = _make_country_df(countries) + + dsd = DataStructureDefinition(NUTS_TEST_DIR / "definitions") + processor = NutsProcessor.from_definition(dsd) + result = processor.apply(test_df) + + # All original countries still present, plus European Union + assert "European Union" in result.region + # Individual country data preserved + for country in countries: + assert country in result.region + # No EU+UK since UK not in data + assert "European Union and United Kingdom" not in result.region + + +def test_eu27_aggregation_insufficient_countries(): + """No EU27 aggregate produced when fewer than 23 members present.""" + countries = EU27_NAMES[:3] + test_df = _make_country_df(countries) + + dsd = DataStructureDefinition(NUTS_TEST_DIR / "definitions") + processor = NutsProcessor.from_definition(dsd) + result = processor.apply(test_df) + + assert "European Union" not in result.region + assert "European Union and United Kingdom" not in result.region + + +def test_eu27_uk_aggregation_with_uk(): + """Test both EU27 and EU27+UK are produced when UK is present.""" + countries = EU27_NAMES[:EU27_MIN_COUNTRIES] + [UK_NAME] + test_df = _make_country_df(countries) + + dsd = DataStructureDefinition(NUTS_TEST_DIR / "definitions") + processor = NutsProcessor.from_definition(dsd) + result = processor.apply(test_df) + + assert "European Union" in result.region + assert "European Union and United Kingdom" in result.region + + # EU values: sum of 23 
countries at value=1.0 + eu_value = float(EU27_MIN_COUNTRIES) + eu_data = result.filter(region="European Union") + assert (eu_data.data["value"] == eu_value).all() + eu_uk_data = result.filter(region="European Union and United Kingdom") + assert (eu_uk_data.data["value"] == eu_value + 1.0).all() + + +def test_eu27_aggregation_without_uk(): + """Test only EU27 (not EU27+UK) is produced when UK is absent.""" + countries = EU27_NAMES[:EU27_MIN_COUNTRIES] + test_df = _make_country_df(countries) + + dsd = DataStructureDefinition(NUTS_TEST_DIR / "definitions") + processor = NutsProcessor.from_definition(dsd) + result = processor.apply(test_df) + + assert "European Union" in result.region + assert "European Union and United Kingdom" not in result.region + + +def test_eu27_aggregation_codelist_gating(): + """Test no EU aggregation is attempted when 'European Union' not in region codelist.""" + test_df = _make_country_df(EU27_NAMES) + + dsd = DataStructureDefinition(NUTS_NO_EU_TEST_DIR / "definitions") + processor = NutsProcessor.from_definition(dsd) + result = processor.apply(test_df) + + assert "European Union" not in result.region + assert "European Union and United Kingdom" not in result.region
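
A note on the hierarchical roll-up tested above: NUTS codes nest by string prefix (NUTS3 `AT111` → NUTS2 `AT11` → NUTS1 `AT1` → country `AT`), which is what makes the level-by-level aggregation in `test_nuts_simple_aggregation` well-defined. A minimal pandas sketch of that idea — the `rollup_nuts` helper is illustrative, not part of the `NutsProcessor` API, and plain summation is assumed:

```python
import pandas as pd

def rollup_nuts(series: pd.Series) -> pd.Series:
    """Roll values up one NUTS level by truncating the last character
    of each region code and summing (illustrative plain-sum assumption)."""
    return series.groupby(series.index.str[:-1]).sum()

# NUTS3 data for three Austrian subregions
nuts3 = pd.Series({"AT111": 1.0, "AT112": 3.0, "AT121": 2.0})

nuts2 = rollup_nuts(nuts3)    # AT11 -> 4.0, AT12 -> 2.0
nuts1 = rollup_nuts(nuts2)    # AT1 -> 6.0
country = rollup_nuts(nuts1)  # AT -> 6.0
```

This mirrors the expected values in `test_nuts_simple_aggregation`, where `AT111` (1.0) and `AT112` (3.0) roll up to `AT11`, `AT1`, and `Austria` at 4.0.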
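
The ≥23-of-27 gating exercised by `test_eu27_aggregation_sufficient_countries` and `test_eu27_aggregation_insufficient_countries` can be sketched as follows. The helper name `maybe_eu_aggregate` is hypothetical, and plain summation is assumed; only the threshold constant mirrors `EU27_MIN_COUNTRIES` from the tests:

```python
EU27_MIN_COUNTRIES = 23  # threshold mirrored from the tests in this diff

EU27 = {
    "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia",
    "Denmark", "Estonia", "Finland", "France", "Germany", "Greece",
    "Hungary", "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg",
    "Malta", "Netherlands", "Poland", "Portugal", "Romania", "Slovakia",
    "Slovenia", "Spain", "Sweden",
}

def maybe_eu_aggregate(values):
    """Return the EU27 total if at least EU27_MIN_COUNTRIES member states
    are present in `values`, else None (aggregation skipped silently).
    Hypothetical helper sketching the gating logic; sum-only."""
    members = EU27 & values.keys()
    if len(members) < EU27_MIN_COUNTRIES:
        return None
    return sum(values[c] for c in members)
```

With 23 members present at value 1.0 this yields 23.0 (matching `eu_value` in `test_eu27_uk_aggregation_with_uk`); with fewer members it returns `None` instead of a partial total.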
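
Finally, the compare-and-merge step (`merge_with_preaggregated_data` delegating to `_compare_and_merge`) reconciles computed aggregates with data already reported at the common-region level, within a relative tolerance. A scalar-only sketch of that idea — the helper name is hypothetical, and the assumption that reported data takes precedence over the computed aggregate is illustrative, not taken from this diff:

```python
import math

def compare_and_merge(reported, computed, rtol=0.01):
    """Keep reported values where both exist; record a discrepancy when
    reported and computed differ by more than rtol. Hypothetical sketch,
    scalar values keyed by (region, variable) tuples."""
    merged, differences = {}, {}
    for key in reported.keys() | computed.keys():
        if key in reported and key in computed:
            if not math.isclose(reported[key], computed[key], rel_tol=rtol):
                differences[key] = (reported[key], computed[key])
            merged[key] = reported[key]  # assumption: reported data wins
        else:
            merged[key] = reported.get(key, computed.get(key))
    return merged, differences

merged, diff = compare_and_merge(
    {("World", "Primary Energy"): 4.0},
    {("World", "Primary Energy"): 4.001, ("World", "Other"): 1.0},
)
```

Here the reported and computed `("World", "Primary Energy")` values agree within the default `rtol=0.01`, so no difference is recorded, while the computed-only entry is carried through to the merged result.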