# Language models integration (LangExtract) #1775
**Merged**

Changes from 87 commits (114 commits total):
All commits are by RonShakutai unless noted otherwise.

- `8c0e19a` Add LangExtract recognizer for PII extraction
- `e3f68d1` refine the docs
- `811c639` narrow support for oollama only
- `3640527` Refactor LangExtract tests to use Ollama; remove API key dependency
- `9a99a0d` adding first draft of docker compose
- `3c88dd9` Update model_id in tests to use 'gemma2:2b' instead of 'gemini-2.5-fl…
- `feaee53` Refactor LangExtract documentation to focus on Ollama support; remove…
- `013198e` Update README to remove Ollama setup instructions and clarify integra…
- `f8326f9` Enhance Ollama installation script with progress messages and error h…
- `f8b7d03` auto ruff fixes
- `dcc1450` Enhance LangExtractRecognizer tests with real Ollama integration
- `64d0f09` Remove unnecessary line breaks in LLM-based PII detection section of …
- `0905619` Add LangExtract LLM-based PII detection test and configuration
- `d5dde47` Improve Ollama availability check with setup attempt message
- `0afa506` Increase wait time for services and update healthcheck parameters for…
- `7fbf111` Add Ollama setup for Analyzer tests and improve availability check
- `31c0f93` Set timeout for Ollama setup in Analyzer tests to 8 minutes
- `f087983` Enhance Ollama setup for Analyzer tests with improved installation an…
- `ed9917f` Update Ollama model references from gemma2:2b to llama3.2:1b across c…
- `ffa44e1` Update model references from llama3.2:1b to gemma2:2b across configur…
- `1e0d41b` Remove 'enabled' configuration from LangExtract settings in YAML file…
- `574312f` Update Ollama service configuration: change port mapping and modify h…
- `304dbf3` Refactor logging in LangExtractRecognizer: reduce verbosity and impro…
- `f808daf` Update Ollama service configuration: modify port mapping and healthch…
- `27ee08b` Update CI workflow and tests: reduce sleep duration and add environme…
- `1b9e97d` Reduce sleep duration in CI workflow from 150 to 60 seconds
- `f9e5435` Update LangExtract model references from gemma2:2b to gemma3:1b and r…
- `299f0ea` docs and prompt fixes
- `2e593b9` finalizing the pr
- `109a25e` Update Ollama image to latest version and add LangExtract PII/PHI ext…
- `054ea1c` fix bad example
- `fe10361` fix unit-tests
- `af51ceb` refactor: clean up .env file and simplify skip_engine logic in tests
- `0e866dd` Merge branch 'main' of https://github.com/microsoft/presidio into fea…
- `159c955` chore: add a new line to .env file for better readability
- `a4a63c4` chore: remove unnecessary blank line from .env file
- `320fe9e` revert .env
- `d77958a` intial commit
- `286b78a` Remove skip marker for spacy_nlp_engine fixture
- `b50a6be` Remove skip markers for stanza and transformers NLP engine fixtures
- `d7a91c2` Merge branch 'fix-ci-unit-tests' of https://github.com/microsoft/pres…
- `c51a695` Remove Ollama recognizer test and update default recognizers configur…
- `f56b97e` Remove unused Ollama recognizer configuration and update prompt file …
- `92b1136` Add end-to-end tests for API anonymization and redaction features
- `253c220` Merge branch 'main' of https://github.com/microsoft/presidio into fea…
- `4b82e84` Remove unused Ollama recognizer imports and related tests
- `8c81ca7` Update requirements and improve Ollama recognizer availability checks…
- `d205959` Fix formatting in requirements.txt for analyzer and anonymizer depend…
- `2ee5179` Update Ollama model ID from gemma3:1b to gemma2:2b in configuration a…
- `1dc2328` gemma2:2b
- `26d5097` finalizing the pr
- `7684c79` Remove unused ABC import from lm_recognizer.py
- `097ef84` Fix indentation in docker-compose.yml for volumes section
- `e693ca6` Fix line break for clarity in adding_recognizers.md
- `95e559a` Add timeout settings for Ollama recognizer and test cases
- `7b7f8fa` Refactor timeout comment for clarity in OllamaLangExtractRecognizer
- `f10a103` Update Ollama model version and add configuration for LangExtract rec…
- `703aab6` Update examples_file path in configuration for Ollama recognizer
- `034a1bf` Remove timeout decorator from Ollama recognizer
- `768cb0e` Add rerun settings to unit and E2E tests for improved stability
- `0f8fad6` Remove rerun settings from unit and E2E test commands for simplification
- `aed8a2d` Set max-parallel to 2 for local build and E2E tests
- `5e4e0a0` Remove max-parallel setting from local build and E2E tests
- `0220c5b` move poetry cache dir (tamirkamara)
- `f30bb7a` Merge branch 'tamirkamara/fix-disk-space' of https://github.com/micro…
- `7b5f75a` Merge branch 'main' into feature-langextract (SharonHart)
- `2e8165c` pr changes
- `0ef9d31` code review changes
- `be7ee67` ruff check
- `1e1e107` remove unused json import in test_ollama_recognizer.py
- `083e853` refactor test names for clarity and consistency in test_ollama_recogn…
- `3a339b7` finalizing the PR
- `43a196e` Merge branch 'main' into feature-langextract
- `d377a56` self code review fixes
- `e27faa7` Merge branch 'feature-langextract' of https://github.com/microsoft/pr…
- `f9097ff` Refactor LangExtractRecognizer to use yaml for configuration loading
- `68275e3` ruff fixes
- `21fb33c` Update error messages in OllamaLangExtractRecognizer tests for clarity
- `679b363` CR comment addressed
- `73cb9d1` Remove unused variables from Jinja2 prompt rendering in LangExtractRe…
- `da1c563` exporting functionality to helpers enlarging composition
- `974eff6` composition
- `5039eb2` Refactor entity mapper and langextract utilities for improved clarity…
- `f78ec8e` Refactor tests to use get_langextract_module for mocking LangExtract …
- `b27b09c` Refactor langextract utilities to improve clarity and error handling;…
- `bb58002` Refactor LLM utilities by simplifying docstrings and consolidating im…
- `08ef556` Update error message for missing Jinja2 installation to include poetr…
- `a2d4db7` Add Ollama recognizer configuration and tests for YAML integration
- `09a02f4` Refactor docstrings in OllamaLangExtractRecognizer for improved clari…
- `da16ede` Enhance OllamaLangExtractRecognizer initialization docstring to clari…
- `6f72919` Merge branch 'main' into feature-langextract
- `7185449` pr comments
- `d45ff0c` Refactor OllamaLangExtractRecognizer to streamline config path handli…
- `6b027bb` Merge branch 'feature-langextract' of https://github.com/microsoft/pr…
- `d5e05e0` Refactor Ollama recognizer test to improve clarity and enhance entity…
- `651ecce` Refactor tests for Ollama recognizer and LMRecognizer to improve exce…
- `95f5a6a` Update config path for Ollama recognizer in test configuration
- `85f1c7a` Update config paths for Ollama recognizer and add test configuration …
- `f87ae78` Remove test configuration for Ollama LangExtract
- `8158cfd` Remove test configuration for Ollama LangExtract recognizer
- `025a5b0` Fix formatting in resolve_config_path function for improved readability
- `0b6c62f` Enable UsLangExtractRecognizer and update its config path
- `1fe8a2b` change all configs to use gemma3:1b
- `f492cf5` Disable Ollama LangExtract recognizer and update its configuration path
- `94320e0` Update langextract configuration paths to use absolute paths for prom…
- `4024901` Remove test script for Ollama recognizer configuration loading
- `8397d43` Refactor config loading in examples and prompt loaders to use resolve…
- `2b42d6f` Refactor parameter description in load_yaml_examples and clean up imp…
- `6af3bb9` Update langextract paths to use repo-root-relative paths in tests and…
- `4dfe7fd` Enhance documentation for Ollama setup and improve __init__.py import…
- `54418fb` code review changes
- `dcb9f8c` Merge branch 'main' of https://github.com/microsoft/presidio into fea…
- `e682f60` pr comments & align to main
- `c43dd92` Merge branch 'main' into feature-langextract
# Language Model-based PII/PHI Detection

## Introduction
Presidio supports language model-based PII/PHI detection for flexible entity recognition using language models (LLMs, SLMs, etc.). This approach enables detection of both:

- **PII (Personally Identifiable Information)**: names, emails, phone numbers, SSNs, credit cards, etc.
- **PHI (Protected Health Information)**: medical records, health identifiers, etc.

The current implementation uses [LangExtract](https://github.com/google/langextract) with **Ollama** for local model deployment. Additional provider integrations are planned.
## Entity Detection Capabilities

Unlike pattern-based recognizers, language model-based detection is flexible and depends on:

- The language model being used
- The prompt description provided
- The few-shot examples configured

The default configuration includes examples for common PII/PHI entities such as PERSON, EMAIL_ADDRESS, PHONE_NUMBER, US_SSN, CREDIT_CARD, MEDICAL_LICENSE, and more. **You can customize the prompts and examples to detect any entity types relevant to your use case.**

For the default entity mappings and examples, see the [default configuration](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/langextract_config_ollama.yaml).
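Few-shot examples are what anchor the model to your entity types. The actual schema lives in the examples YAML file shipped with the recognizer; the dict layout below is an illustrative sketch, not LangExtract's real API, but it shows the one invariant that matters: every annotated span must occur verbatim in its example text.

```python
# Hedged sketch of a few-shot example entry; field names ("class", "span")
# are illustrative, not the actual LangExtract schema.
few_shot_examples = [
    {
        "text": "Contact Jane Doe at jane@example.com or 555-0100.",
        "extractions": [
            {"class": "person", "span": "Jane Doe"},
            {"class": "email", "span": "jane@example.com"},
            {"class": "phone", "span": "555-0100"},
        ],
    },
]

# Sanity check: a span that does not literally appear in the text would show
# the model an impossible extraction and degrade its output.
for ex in few_shot_examples:
    for e in ex["extractions"]:
        assert e["span"] in ex["text"]
print("examples consistent")
```

The same check is worth running against any custom examples file before pointing the recognizer at it.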
## Prerequisites

### Setting up Ollama

You have two options for setting up Ollama:

**Option 1: Docker Compose** (recommended)

```bash
# Start the Ollama service
docker compose up -d ollama

# Pull the language model (required; takes ~1-2 minutes)
docker exec -it presidio-ollama-1 ollama pull gemma2:2b
```

**Option 2: Manual setup**

Follow the [official LangExtract Ollama guide](https://github.com/google/langextract?tab=readme-ov-file#using-local-llms-with-ollama).

!!! note "Note"
    The model must be pulled before using the recognizer. The default model is `gemma2:2b` (~1.6GB).
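Before wiring the recognizer into an analyzer, it can help to confirm the server is actually reachable. A minimal stdlib check, assuming Ollama's default URL (`http://localhost:11434`):

```python
import urllib.request
import urllib.error

def is_ollama_reachable(url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if a server answers HTTP 200 at `url` within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, DNS failure, and timeouts.
        return False

if __name__ == "__main__":
    print("Ollama reachable:", is_ollama_reachable())
```

A plain Ollama install responds to `GET /` on its root URL, which is enough for a liveness probe; adjust the URL if you remapped the port in Docker Compose.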
## Language Model-based Recognizer Implementation

Presidio provides a hierarchy of recognizers for language model-based PII/PHI detection:

- **`LMRecognizer`**: abstract base class for all language model recognizers (LLMs, SLMs, etc.)
- **`LangExtractRecognizer`**: abstract base class for the LangExtract library integration (model-agnostic)
- **`OllamaLangExtractRecognizer`**: concrete implementation for Ollama local language models

[The implementation can be found here](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/third_party/ollama_langextract_recognizer.py).
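The hierarchy above can be sketched with abstract base classes. This is a simplified illustration of the layering, not Presidio's actual signatures; only the three class names come from the document, and the method names here are hypothetical.

```python
from abc import ABC, abstractmethod

class LMRecognizer(ABC):
    """Base for all language-model-backed recognizers (LLMs, SLMs, etc.)."""

    @abstractmethod
    def analyze_text(self, text: str) -> list:
        """Return detected entities for `text`."""

class LangExtractRecognizer(LMRecognizer, ABC):
    """Adds LangExtract-specific plumbing: prompts, examples, entity mappings."""

    @abstractmethod
    def call_model(self, prompt: str) -> str:
        """Send a rendered prompt to the underlying model."""

class OllamaLangExtractRecognizer(LangExtractRecognizer):
    """Concrete recognizer targeting a local Ollama server."""

    def call_model(self, prompt: str) -> str:
        return ""  # the real class would POST to the Ollama API here

    def analyze_text(self, text: str) -> list:
        return []  # the real class runs LangExtract and maps entities
```

The split keeps provider-specific transport (`OllamaLangExtractRecognizer`) separate from LangExtract orchestration, so further providers can be added by subclassing `LangExtractRecognizer`.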
## How to integrate Language Model-based detection into Presidio

### Option 1: Enable in Configuration (Recommended)

1. Install with LangExtract support and set up Ollama (see Prerequisites above):

```sh
pip install presidio-analyzer[langextract]
```

2. Enable the recognizer in `default_recognizers.yaml`:

```yaml
- name: OllamaLangExtractRecognizer
  enabled: true  # Change from false to true
```
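If you manage that YAML file from a deployment script, the flag can be flipped mechanically. A stdlib-only sketch, assuming the file layout matches the snippet above (a `- name:` line followed by an `enabled:` line); the function name is illustrative:

```python
def enable_recognizer(yaml_text: str, name: str) -> str:
    """Flip `enabled: false` to `enabled: true` for the recognizer named `name`."""
    out, in_target = [], False
    for line in yaml_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("- name:"):
            # Track whether we are inside the target recognizer's entry.
            in_target = stripped.endswith(name)
        if in_target and stripped.startswith("enabled:"):
            line = line.replace("enabled: false", "enabled: true")
        out.append(line)
    return "\n".join(out)

sample = """\
- name: OllamaLangExtractRecognizer
  enabled: false
- name: OtherRecognizer
  enabled: false
"""
print(enable_recognizer(sample, "OllamaLangExtractRecognizer"))
```

A real deployment would more likely round-trip the file with a YAML library, but the string approach preserves comments and ordering, which YAML dumpers usually discard.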
### Option 2: Add Programmatically

```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.predefined_recognizers.third_party.ollama_langextract_recognizer import OllamaLangExtractRecognizer

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(OllamaLangExtractRecognizer())

results = analyzer.analyze(text="My email is john.doe@example.com", language="en")
```
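The call above needs a running Ollama server, so this sketch uses a stand-in result object to show how the returned results are typically consumed. Presidio's `RecognizerResult` exposes `entity_type`, `start`, `end`, and `score`; the dataclass below only mimics those fields and the sample score is invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class FakeResult:
    """Stand-in mimicking the fields of presidio_analyzer.RecognizerResult."""
    entity_type: str
    start: int
    end: int
    score: float

text = "My email is john.doe@example.com"
# What a detection for the text above might look like (score is hypothetical).
results = [FakeResult("EMAIL_ADDRESS", 12, 32, 0.85)]

for r in results:
    # start/end are character offsets into the analyzed text.
    print(f"{r.entity_type} [{r.start}:{r.end}] -> {text[r.start:r.end]} (score={r.score})")
```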
### Custom Configuration

To use a custom configuration file:

```python
analyzer.registry.add_recognizer(
    OllamaLangExtractRecognizer(config_path="/path/to/custom_config.yaml")
)
```

!!! note "Note"
    The recognizer is disabled by default in `default_recognizers.yaml` to avoid requiring Ollama for basic Presidio usage. Enable it once Ollama is set up and running.
## Configuration Options

The `langextract_config_ollama.yaml` file supports the following options:

- **`model_id`**: the Ollama model to use (default: `"gemma2:2b"`)
- **`model_url`**: Ollama server URL (default: `"http://localhost:11434"`)
- **`temperature`**: model temperature for generation (default: `null`, i.e. the model's own default)
- **`supported_entities`**: PII/PHI entity types to detect
- **`entity_mappings`**: map LangExtract entity classes to Presidio entity names
- **`min_score`**: minimum confidence score (default: `0.5`)

See the [default configuration](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/langextract_config_ollama.yaml) for complete examples.
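The interplay of `entity_mappings` and `min_score` can be sketched as a post-processing step: raw extraction classes are translated to Presidio entity names, and unmapped or low-confidence hits are dropped. This is a hedged illustration of the likely behavior, not Presidio's actual code; the mapping values mirror entries from the default configuration.

```python
# Subset of the default entity_mappings, plus the default min_score.
ENTITY_MAPPINGS = {"email": "EMAIL_ADDRESS", "phone": "PHONE_NUMBER", "person": "PERSON"}
MIN_SCORE = 0.5

def map_and_filter(raw_results: list[dict]) -> list[dict]:
    """Translate raw extraction classes to Presidio names; drop weak/unmapped hits."""
    mapped = []
    for r in raw_results:
        presidio_name = ENTITY_MAPPINGS.get(r["class"])
        if presidio_name and r["score"] >= MIN_SCORE:
            mapped.append({**r, "entity_type": presidio_name})
    return mapped

raw = [
    {"class": "email", "text": "a@b.com", "score": 0.9},
    {"class": "person", "text": "Jo", "score": 0.3},           # below min_score
    {"class": "payment_status", "text": "paid", "score": 0.8}, # no mapping
]
print(map_and_filter(raw))  # only the email survives
```

Raising `min_score` trades recall for precision; unmapped classes (like `payment_status`, which the config lists under `labels_to_ignore`) never reach the caller.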
## Troubleshooting

**ConnectionError: "Ollama server not reachable"**

- Ensure Ollama is running: `docker ps`, or check `http://localhost:11434`
- Verify that the `model_url` in your configuration matches your Ollama server address

**RuntimeError: "Model 'gemma2:2b' not found"**

- Pull the model: `docker exec -it presidio-ollama-1 ollama pull gemma2:2b`
- Or, for a manual setup: `ollama pull gemma2:2b`
- Verify that the model name matches the `model_id` in your configuration
```diff
@@ -1,4 +1,4 @@
 requests>=2.32.4
 pytest
-file:../presidio-analyzer
-file:../presidio-anonymizer
+-e ../presidio-analyzer[langextract]
+-e ../presidio-anonymizer
```
```yaml
# LMRecognizer base configuration
lm_recognizer:
  supported_entities:
    - PERSON
    - LOCATION
    - ORGANIZATION
    - PHONE_NUMBER
    - EMAIL_ADDRESS
    - DATE_TIME
    - US_SSN
    - CREDIT_CARD
    - MEDICAL_LICENSE
    - IP_ADDRESS
    - URL
    - IBAN_CODE

  labels_to_ignore:
    - payment_status

  enable_generic_consolidation: true
  min_score: 0.5

langextract:
  prompt_file: langextract_prompts/default_pii_phi_prompt.j2
  examples_file: langextract_prompts/default_pii_phi_examples.yaml

  entity_mappings:
    person: PERSON
    full_name: PERSON
    name_first: PERSON
    name_last: PERSON
    name_middle: PERSON
    location: LOCATION
    address: LOCATION
    organization: ORGANIZATION
    phone: PHONE_NUMBER
    phone_number: PHONE_NUMBER
    email: EMAIL_ADDRESS
    date: DATE_TIME
    ssn: US_SSN
    identification_number: US_SSN
    credit_card: CREDIT_CARD
    medical_record: MEDICAL_LICENSE
    ip_address: IP_ADDRESS
    url: URL
    iban: IBAN_CODE

  model:
    model_id: gemma3:1b
    model_url: http://localhost:11434
    temperature: 0.0
```
File renamed without changes.
File renamed without changes.