Add 13 German country-specific predefined PII PatternRecognizers and… #1830
Open
shokrydev wants to merge 12 commits into microsoft:main from shokrydev:feat/german-patternrecognizers
+3,028
−2
Changes from 4 commits
12 commits
53bc259 Add 13 German country-specific predefined PII PatternRecognizers and… (shokrydev)
9e8304a Apply all suggestions from copilot review except the suggestion for P… (shokrydev)
9d1ca9a Update German postal code range in recognizer (shokrydev)
15cae06 Merge branch 'main' into feat/german-patternrecognizers (omri374)
5396e3f Extend German license plate recognizer (E/H suffix, umlaute), fix pas… (shokrydev)
c390342 Merge pull request #14 from microsoft/main (shokrydev)
9341cd3 Merge branch 'main' into feat/german-patternrecognizers (omri374)
3c5081a Merge branch 'main' into feat/german-patternrecognizers (omri374)
ad05a1b Merge branch 'main' into feat/german-patternrecognizers (omri374)
7c68412 Merge branch 'main' into feat/german-patternrecognizers (SharonHart)
7c0cc5c Merge branch 'main' into feat/german-patternrecognizers (omri374)
9c75f10 Merge branch 'main' into feat/german-patternrecognizers (omri374)
31 changes: 31 additions & 0 deletions
...io-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/__init__.py
@@ -0,0 +1,31 @@

    """Germany-specific recognizers."""

    from .de_bsnr_recognizer import DeBsnrRecognizer
    from .de_commercial_register_recognizer import DeCommercialRegisterRecognizer
    from .de_driver_license_recognizer import DeDriverLicenseRecognizer
    from .de_kvnr_recognizer import DeKvnrRecognizer
    from .de_lanr_recognizer import DeLanrRecognizer
    from .de_license_plate_recognizer import DeLicensePlateRecognizer
    from .de_passport_recognizer import DePassportRecognizer
    from .de_personal_id_recognizer import DePersonalIdRecognizer
    from .de_postal_code_recognizer import DePostalCodeRecognizer
    from .de_social_security_recognizer import DeSocialSecurityRecognizer
    from .de_tax_id_recognizer import DeTaxIdRecognizer
    from .de_telematik_id_recognizer import DeTelematikIdRecognizer
    from .de_vat_code_recognizer import DeVatCodeRecognizer

    __all__ = [
        "DeBsnrRecognizer",
        "DeCommercialRegisterRecognizer",
        "DeDriverLicenseRecognizer",
        "DeKvnrRecognizer",
        "DeLanrRecognizer",
        "DeLicensePlateRecognizer",
        "DePassportRecognizer",
        "DePersonalIdRecognizer",
        "DePostalCodeRecognizer",
        "DeSocialSecurityRecognizer",
        "DeTaxIdRecognizer",
        "DeTelematikIdRecognizer",
        "DeVatCodeRecognizer",
    ]

142 changes: 142 additions & 0 deletions
...r/presidio_analyzer/predefined_recognizers/country_specific/germany/de_bsnr_recognizer.py
@@ -0,0 +1,142 @@

    from typing import List, Optional, Tuple

    from presidio_analyzer import EntityRecognizer, Pattern, PatternRecognizer


    class DeBsnrRecognizer(PatternRecognizer):
        """
        Recognize German BSNR (Betriebsstättennummer) using regex and validation.

        The BSNR is a facility number that uniquely identifies the location of
        service provision in the German statutory health insurance system:
        - 9 digits total
        - Digits 1-2: KV state/regional association code (see VALID_KV_CODES)
        - Digits 3-7: Facility identifier (assigned by KV)
        - Digits 8-9: Additional digits (often "00" for older BSNRs)

        The BSNR appears in prescriptions, discharge letters, and billing documents,
        identifying the treatment facility. This is quasi-PII as it can narrow down
        where a patient received treatment.

        Legal basis: §75 Abs. 7 SGB V
        Issuing authority: Kassenärztliche Vereinigungen (KV)
        Source: KBV Arztnummern-Richtlinie Anlage 1

        :param patterns: List of patterns to be used by this recognizer
        :param context: List of context words to increase confidence in detection
        :param supported_language: Language this recognizer supports
        :param supported_entity: The entity this recognizer can detect
        :param replacement_pairs: List of tuples with potential replacement values
        """

        # Pattern source: https://wiki.hl7.de/index.php/LANR_und_BSNR

        # Valid KV region codes per KBV Arztnummern-Richtlinie Anlage 1
        # Standard KV regions
        VALID_KV_CODES = {
            "01",  # Schleswig-Holstein
            "02",  # Hamburg
            "03",  # Bremen
            "17",  # Niedersachsen
            "20",  # Westfalen-Lippe
            "38",  # Nordrhein
            "46",  # Hessen
            "51",  # Rheinland-Pfalz
            "52",  # Baden-Württemberg
            "71",  # Bayern
            "72",  # Berlin
            "73",  # Saarland
            "74",  # KBV (Kassenärztliche Bundesvereinigung)
            "78",  # Mecklenburg-Vorpommern
            "83",  # Brandenburg
            "88",  # Sachsen-Anhalt
            "93",  # Thüringen
            "98",  # Sachsen
            # Special codes for hospitals (Anlage 8 BMV-Ä)
            "35",  # Krankenhäuser
        }

        PATTERNS = [
            Pattern(
                "BSNR (9 digits)",
                r"\b[0-9]{9}\b",
                0.05,  # Very low score - requires context or valid KV code
            ),
            Pattern(
                "BSNR (with context)",
                r"(?i)(?:bsnr|betriebsstättennummer|betriebsstaetten-nr|betriebsstätten-nr)[\s:]*([0-9]{9})\b",
                0.5,
            ),
        ]

        CONTEXT = [
            "bsnr",
            "betriebsstättennummer",
            "betriebsst\u00e4ttennummer",  # With umlaut
            "betriebsstaetten-nr",
            "betriebsst\u00e4tten-nr",  # With umlaut
            "facility number",
            "praxis",
            "praxisnummer",
            "behandlungsort",
            "einrichtung",
            "klinik",
            "krankenhaus",
            "behandlungsstelle",
        ]

        def __init__(
            self,
            patterns: Optional[List[Pattern]] = None,
            context: Optional[List[str]] = None,
            supported_language: str = "de",
            supported_entity: str = "DE_BSNR",
            replacement_pairs: Optional[List[Tuple[str, str]]] = None,
            name: Optional[str] = None,
        ):
            self.replacement_pairs = (
                replacement_pairs if replacement_pairs else [("-", ""), (" ", ""), (".", "")]
            )
            patterns = patterns if patterns else self.PATTERNS
            context = context if context else self.CONTEXT
            super().__init__(
                supported_entity=supported_entity,
                patterns=patterns,
                context=context,
                supported_language=supported_language,
                name=name,
            )

        def validate_result(self, pattern_text: str) -> Optional[bool]:
            """
            Validate the BSNR format using KV regional code validation.

            Validates that the first 2 digits match a valid KV region code
            per KBV Arztnummern-Richtlinie Anlage 1.

            :param pattern_text: Text detected as pattern by regex
            :return: True if valid KV code, False if invalid, None if uncertain
            """
            pattern_text = EntityRecognizer.sanitize_value(
                pattern_text, self.replacement_pairs
            )

            if len(pattern_text) != 9:
                return False

            if not pattern_text.isdigit():
                return False

            # Basic validation: BSNR should not be all zeros
            if pattern_text == "000000000":
                return False

            # Validate KV regional code (digits 1-2)
            kv_code = pattern_text[:2]
            if kv_code in self.VALID_KV_CODES:
                # Valid KV code - increase confidence
                return True

            # Unknown KV code - could be valid (historic or special cases)
            # but reduce confidence by returning None
            return None

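To make the three-valued result of `validate_result` easier to review, here is a minimal standalone sketch that uses only the standard library. It copies the KV codes, the default separator pairs, and the bare nine-digit pattern from the diff above. The helper names `check_bsnr` and `find_candidates` are hypothetical and are not part of this PR or of Presidio, and the sketch does not reproduce Presidio's scoring or context handling.

```python
import re
from typing import List, Optional, Tuple

# KV region codes copied from VALID_KV_CODES in the diff above.
VALID_KV_CODES = {
    "01", "02", "03", "17", "20", "38", "46", "51", "52", "71",
    "72", "73", "74", "78", "83", "88", "93", "98", "35",
}

# The separators the recognizer strips by default before validating.
REPLACEMENT_PAIRS = [("-", ""), (" ", ""), (".", "")]

# The bare nine-digit pattern from PATTERNS.
NINE_DIGITS = re.compile(r"\b[0-9]{9}\b")


def check_bsnr(candidate: str) -> Optional[bool]:
    """Hypothetical helper mirroring the checks in validate_result."""
    # Strip separators, as the sanitizing step does with the default pairs.
    for old, new in REPLACEMENT_PAIRS:
        candidate = candidate.replace(old, new)

    # Wrong length, non-digits, or all zeros: rejected outright.
    if len(candidate) != 9 or not candidate.isdigit():
        return False
    if candidate == "000000000":
        return False

    # Known KV code in digits 1-2: accepted.
    if candidate[:2] in VALID_KV_CODES:
        return True

    # Unknown prefix: neither confirmed nor rejected.
    return None


def find_candidates(text: str) -> List[Tuple[str, Optional[bool]]]:
    """Pair every nine-digit match with its validation verdict."""
    return [(m.group(), check_bsnr(m.group())) for m in NINE_DIGITS.finditer(text)]


if __name__ == "__main__":
    for value, verdict in find_candidates("BSNR 021234500, Tel 991234567"):
        print(value, verdict)
```

As far as I understand Presidio's `PatternRecognizer`, a `True` verdict raises the match to the maximum score, `False` discards it, and `None` leaves the pattern's own score in place. That is why the unknown-prefix case returns `None` rather than `False`.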