Language models integration (LangExtract) #1775
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
# Language Model-based PII/PHI Detection (Experimental Feature)

## Introduction

Presidio supports language model-based PII/PHI detection for flexible entity recognition using language models (LLMs, SLMs, etc.). This approach enables detection of both:

- **PII (Personally Identifiable Information)**: names, emails, phone numbers, SSNs, credit cards, etc.
- **PHI (Protected Health Information)**: medical records, health identifiers, etc.

(The default approach uses [LangExtract](https://github.com/google/langextract) under the hood to integrate with language model providers.)

## Entity Detection Capabilities

Unlike pattern-based recognizers, language model-based detection is flexible and depends on:

- The language model being used
- The prompt description provided
- The few-shot examples configured

The default configuration includes examples for common PII/PHI entities such as PERSON, EMAIL_ADDRESS, PHONE_NUMBER, US_SSN, CREDIT_CARD, MEDICAL_LICENSE, and more.
**You can customize the prompts and examples to detect any entity types relevant to your use case.**

For the default entity mappings and examples, see the [default configuration](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/langextract_config_ollama.yaml).
## Supported Language Model Providers

Presidio supports the following language model providers through LangExtract:

1. **Ollama** - local language model deployment (open-source models such as Gemma, Llama, etc.)
2. **Azure OpenAI** - _documentation coming soon_

## Language Model-based Recognizer Implementation

Presidio provides a hierarchy of recognizers for language model-based PII/PHI detection:

- **`LMRecognizer`**: abstract base class for all language model recognizers (LLMs, SLMs, etc.)
- **`LangExtractRecognizer`**: abstract base class for LangExtract library integration (model-agnostic)
- **`OllamaLangExtractRecognizer`**: concrete implementation for Ollama local language models
- **`AzureOpenAILangExtractRecognizer`**: _documentation coming soon_

See the [OllamaLangExtractRecognizer implementation](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/third_party/ollama_langextract_recognizer.py).

---

## Using Ollama (Local Models)

### Prerequisites

1. **Install Presidio with LangExtract support**:

    ```sh
    pip install "presidio-analyzer[langextract]"
    ```

    (The extras specifier is quoted so that shells such as zsh do not expand the square brackets.)

2. **Set up Ollama**

    You have two options to set up Ollama:

    **Option 1: Docker Compose** (recommended for CPU-only machines)

    This option requires Docker to be installed on your system. Run the following from the root presidio directory (where `docker-compose.yml` is located):

    ```bash
    docker compose up -d ollama
    docker exec presidio-ollama-1 ollama pull qwen2.5:1.5b
    docker exec presidio-ollama-1 ollama list
    ```
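    For orientation, a minimal Compose service for Ollama might look like the sketch below. This is an illustrative fragment, not the `docker-compose.yml` shipped with Presidio; the service name, volume, and image tag here are assumptions.

    ```yaml
    services:
      ollama:
        image: ollama/ollama:latest
        ports:
          - "11434:11434"            # default Ollama API port
        volumes:
          - ollama-data:/root/.ollama  # persist pulled models across restarts
    volumes:
      ollama-data:
    ```

    Persisting `/root/.ollama` matters in practice: without a volume, every container restart discards the multi-hundred-megabyte model pull.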
    **Platform differences:**

    - **Linux/Mac**: the commands above work as-is
    - **Windows**: use PowerShell or CMD; the commands are the same

    If you don't have Docker installed:

    - Linux: follow the [Docker Engine installation guide](https://docs.docker.com/engine/install/)
    - Mac: install [Docker Desktop for Mac](https://docs.docker.com/desktop/install/mac-install/)
    - Windows: install [Docker Desktop for Windows](https://docs.docker.com/desktop/install/windows-install/)

    **Option 2: Native installation** (recommended for GPU acceleration)

    Follow the [official LangExtract Ollama guide](https://github.com/google/langextract?tab=readme-ov-file#using-local-llms-with-ollama).

    After installation, pull and run the model:

    ```bash
    ollama pull qwen2.5:1.5b
    ollama run qwen2.5:1.5b
    ```

    > This option provides better performance with GPU acceleration (e.g., on a Mac with Metal Performance Shaders or on systems with NVIDIA GPUs).
    > The model must be pulled and running before using the recognizer. The default model is `qwen2.5:1.5b`.

3. **Configuration** (optional): create your own `ollama_config.yaml` or use the [default configuration](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/langextract_config_ollama.yaml).

### Usage

**Option 1: Enable in the configuration file**

Enable the recognizer in [`default_recognizers.yaml`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml):

```yaml
- name: OllamaLangExtractRecognizer
  enabled: true  # Change from false to true
```

Then load the analyzer using this modified configuration file:

```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.recognizer_registry import RecognizerRegistryProvider

# Point to your modified default_recognizers.yaml with the Ollama recognizer enabled
provider = RecognizerRegistryProvider(
    conf_file="/path/to/your/modified/default_recognizers.yaml"
)
registry = provider.create_recognizer_registry()

# Create an analyzer with the registry that includes the Ollama recognizer
analyzer = AnalyzerEngine(registry=registry, supported_languages=["en"])

# Analyze text - the Ollama recognizer will participate in detection
results = analyzer.analyze(text="My email is john.doe@example.com", language="en")
```

**Option 2: Add programmatically**

```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.predefined_recognizers.third_party.ollama_langextract_recognizer import (
    OllamaLangExtractRecognizer,
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(OllamaLangExtractRecognizer())

results = analyzer.analyze(text="My email is john.doe@example.com", language="en")
```

!!! note "Note"
    The recognizer is disabled by default in `default_recognizers.yaml` so that basic Presidio usage does not require Ollama. Enable it once you have Ollama set up and running.

### Custom Configuration

To use a custom configuration file:

```python
analyzer.registry.add_recognizer(
    OllamaLangExtractRecognizer(config_path="/path/to/custom_config.yaml")
)
```

### Configuration Options

The `langextract_config_ollama.yaml` file supports the following options:

- **`model_id`**: the Ollama model to use (default: `"qwen2.5:1.5b"`)
- **`model_url`**: the Ollama server URL (default: `"http://localhost:11434"`)
- **`temperature`**: model temperature for generation (default: `null`, i.e., the model's own default)
- **`supported_entities`**: the PII/PHI entity types to detect
- **`entity_mappings`**: mapping from LangExtract entity classes to Presidio entity names
- **`min_score`**: minimum confidence score (default: `0.5`)
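Taken together, a minimal custom configuration might look like the following sketch. The field names follow the options listed above, but the exact nesting is an assumption; check it against the shipped `langextract_config_ollama.yaml` before relying on it.

```yaml
# Illustrative custom configuration sketch (nesting assumed)
lm_recognizer:
  supported_entities:
    - PERSON
    - EMAIL_ADDRESS
  min_score: 0.5

langextract:
  entity_mappings:
    person: PERSON
    email: EMAIL_ADDRESS

model:
  model_id: qwen2.5:1.5b
  model_url: http://localhost:11434
  temperature: null   # use the model's own default
```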
See the [configuration file](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/ollama_config.yaml) for all options.

## Troubleshooting

**ConnectionError: "Ollama server not reachable"**

- Ensure Ollama is running: `docker ps`, or check `http://localhost:11434`
- Verify that the `model_url` in your configuration matches your Ollama server address
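As a quick diagnostic, a short stdlib-only script can confirm whether anything is answering HTTP requests at the configured `model_url`. The helper below is illustrative and not part of Presidio.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server responds at `url`, False otherwise."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status, so it is reachable.
        return True
    except (urllib.error.URLError, OSError):
        return False


# Example: check the default Ollama address
print(is_reachable("http://localhost:11434"))
```

If this prints `False` for your `model_url`, fix connectivity (container not started, wrong port mapping, firewall) before debugging the recognizer itself.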
**RuntimeError: "Model 'qwen2.5:1.5b' not found"**

- Pull the model: `docker exec -it presidio-ollama-1 ollama pull qwen2.5:1.5b`
- For a native setup: `ollama pull qwen2.5:1.5b`
- Verify that the model name matches the `model_id` in your configuration

---

## Using Azure OpenAI (Cloud Models)

_Documentation coming soon_

---

## Choosing Between Ollama and Azure OpenAI

_Comparison documentation coming soon_
```diff
@@ -1,4 +1,4 @@
 requests>=2.32.4
 pytest
-file:../presidio-analyzer
-file:../presidio-anonymizer
+-e ../presidio-analyzer[langextract]
+-e ../presidio-anonymizer
```
```yaml
# LMRecognizer base configuration
lm_recognizer:
  supported_entities:
    - PERSON
    - LOCATION
    - ORGANIZATION
    - PHONE_NUMBER
    - EMAIL_ADDRESS
    - DATE_TIME
    - US_SSN
    - CREDIT_CARD
    - MEDICAL_LICENSE
    - IP_ADDRESS
    - URL
    - IBAN_CODE

  labels_to_ignore:
    - payment_status

  enable_generic_consolidation: true
  min_score: 0.5

langextract:
  prompt_file: presidio-analyzer/presidio_analyzer/conf/langextract_prompts/default_pii_phi_prompt.j2
  examples_file: presidio-analyzer/presidio_analyzer/conf/langextract_prompts/default_pii_phi_examples.yaml

  entity_mappings:
    person: PERSON
    full_name: PERSON
    name_first: PERSON
    name_last: PERSON
    name_middle: PERSON
    location: LOCATION
    address: LOCATION
    organization: ORGANIZATION
    phone: PHONE_NUMBER
    phone_number: PHONE_NUMBER
    email: EMAIL_ADDRESS
    date: DATE_TIME
    ssn: US_SSN
    identification_number: US_SSN
    credit_card: CREDIT_CARD
    medical_record: MEDICAL_LICENSE
    ip_address: IP_ADDRESS
    url: URL
    iban: IBAN_CODE

model:
  model_id: qwen2.5:1.5b
  model_url: http://localhost:11434
  temperature: 0.0
```