diff --git a/.claude/README.md b/.claude/README.md new file mode 100644 index 0000000000..f2b72878a3 --- /dev/null +++ b/.claude/README.md @@ -0,0 +1,303 @@ +# Presidio Claude Code Plugin and Skills + +This directory contains Claude Code plugins and skills for working with Microsoft Presidio, a comprehensive Data Protection and De-identification SDK. + +## Overview + +The Presidio toolkit provides Claude with comprehensive capabilities for: +- Detecting PII (Personally Identifiable Information) in text, images, and structured data +- Anonymizing sensitive data using various operators (redact, mask, hash, encrypt, replace) +- Creating custom PII recognizers for domain-specific patterns +- Processing images and DICOM medical files +- Handling structured data (DataFrames, CSV, JSON, Parquet) + +## Directory Structure + +``` +.claude/ +├── README.md # This file +├── plugins/ +│ └── presidio-toolkit.json # Plugin definition +└── skills/ + └── presidio-pii-detection/ + ├── SKILL.md # Main skill instructions + ├── references/ + │ └── api_quick_reference.md # API documentation + └── scripts/ + ├── analyze_text.py # PII detection script + └── anonymize_text.py # Anonymization script +``` + +## Installation + +### Prerequisites + +```bash +# Install Presidio components +pip install presidio-analyzer presidio-anonymizer +python -m spacy download en_core_web_lg + +# Optional: Image redaction +pip install presidio-image-redactor + +# Optional: Structured data +pip install presidio-structured +``` + +### Installing the Plugin in Claude Code + +The plugin is automatically available when you're working in this repository. Claude Code will detect the `.claude` directory and load the plugin. + +To manually reference the skill, you can mention "presidio" or "PII" in your prompts, and Claude will automatically use the presidio-pii-detection skill. 
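Before relying on the skill, it can help to confirm that the prerequisites listed above are actually importable. The following is a minimal sketch, not part of the plugin: the helper name `missing_modules` is hypothetical, and the import names are assumed to follow the usual pip-name-to-module mapping (for example `presidio-analyzer` to `presidio_analyzer`).

```python
import importlib.util

# Import names assumed from the pip packages listed in the prerequisites;
# adjust them if your environment uses different package names.
REQUIRED = ["presidio_analyzer", "presidio_anonymizer", "en_core_web_lg"]
OPTIONAL = ["presidio_image_redactor", "presidio_structured"]


def missing_modules(names):
    """Return the top-level module names that cannot be located (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for label, names in (("required", REQUIRED), ("optional", OPTIONAL)):
        missing = missing_modules(names)
        status = "all found" if not missing else "missing: " + ", ".join(missing)
        print(f"{label}: {status}")
```

If anything is reported as missing, the commands in the Prerequisites section are the place to start.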
+ +## Skills + +### presidio-pii-detection + +**Description**: Comprehensive skill for detecting, analyzing, and anonymizing PII in text, images, and structured data. + +**Capabilities**: +- Text PII detection and anonymization +- Custom recognizer creation +- Image and DICOM redaction +- Structured data anonymization +- Multi-language support +- Context-aware detection +- Reversible anonymization (encryption) + +**When to Use**: +- User mentions PII, sensitive data, or privacy +- User asks about redacting or anonymizing data +- User needs to detect credit cards, SSNs, emails, phone numbers, etc. +- User works with documents containing personal information +- User needs GDPR, HIPAA, or privacy compliance + +**Example Usage**: +``` +# Claude will automatically use this skill when you ask: +"Help me detect and anonymize PII in this text: My SSN is 123-45-6789" +"I need to redact sensitive information from this document" +"How can I create a custom recognizer for employee IDs?" +``` + +## Scripts + +### analyze_text.py + +Detect PII entities in text. + +```bash +# Analyze text +python .claude/skills/presidio-pii-detection/scripts/analyze_text.py "My phone is 555-1234" + +# Analyze from file +python .claude/skills/presidio-pii-detection/scripts/analyze_text.py --file input.txt + +# Detect specific entities +python .claude/skills/presidio-pii-detection/scripts/analyze_text.py \ + --entities PHONE_NUMBER EMAIL_ADDRESS \ + "Contact me at john@example.com or 555-1234" + +# Save results to file +python .claude/skills/presidio-pii-detection/scripts/analyze_text.py \ + --file input.txt --output results.json +``` + +### anonymize_text.py + +Anonymize PII in text. 
+ +```bash +# Anonymize text (default: replace) +python .claude/skills/presidio-pii-detection/scripts/anonymize_text.py \ + "My SSN is 123-45-6789" + +# Redact PII +python .claude/skills/presidio-pii-detection/scripts/anonymize_text.py \ + --operator redact \ + "John's email is john@example.com" + +# Mask PII +python .claude/skills/presidio-pii-detection/scripts/anonymize_text.py \ + --operator mask \ + "My credit card is 4111-1111-1111-1111" + +# Hash PII +python .claude/skills/presidio-pii-detection/scripts/anonymize_text.py \ + --operator hash \ + "Patient SSN: 123-45-6789" + +# Process file +python .claude/skills/presidio-pii-detection/scripts/anonymize_text.py \ + --file input.txt --output anonymized.txt --operator redact +``` + +## Using the Skill in Claude Code + +The skill is automatically activated when you discuss topics related to: +- PII detection or anonymization +- Sensitive data or privacy +- Data protection or de-identification +- GDPR, HIPAA compliance +- Redacting or masking data +- Credit cards, SSNs, emails, phone numbers, etc. + +### Example Interactions + +**Basic Text Anonymization**: +``` +User: "I have a document with customer data. How can I anonymize phone numbers and emails?" + +Claude (using presidio-pii-detection skill): +"I'll help you anonymize phone numbers and emails using Presidio. Here's how..." +[Provides code example using AnalyzerEngine and AnonymizerEngine] +``` + +**Custom Recognizer**: +``` +User: "I need to detect custom employee IDs in the format EMP-123456" + +Claude (using presidio-pii-detection skill): +"I'll show you how to create a custom pattern recognizer for your employee ID format..." +[Provides PatternRecognizer example] +``` + +**Image Redaction**: +``` +User: "Can you help me redact PII from a scanned document?" + +Claude (using presidio-pii-detection skill): +"Yes, I can help you use Presidio's image redactor..." 
+[Provides ImageRedactorEngine example] +``` + +## Plugin Configuration + +The plugin is defined in `plugins/presidio-toolkit.json`: + +```json +{ + "name": "presidio-toolkit", + "version": "1.0.0", + "description": "Comprehensive toolkit for PII detection and anonymization", + "skills": [ + "../skills/presidio-pii-detection" + ] +} +``` + +## References + +- **Main Repository Guide**: See `../claude.md` for comprehensive repository documentation +- **API Quick Reference**: See `skills/presidio-pii-detection/references/api_quick_reference.md` +- **Presidio Documentation**: https://microsoft.github.io/presidio +- **Presidio Repository**: https://github.com/microsoft/presidio + +## Supported PII Entities + +### Global +- CREDIT_CARD, CRYPTO, DATE_TIME, EMAIL_ADDRESS, IBAN_CODE +- IP_ADDRESS, NRP, LOCATION, PERSON, PHONE_NUMBER +- MEDICAL_LICENSE, URL + +### Country-Specific +- **US**: US_SSN, US_PASSPORT, US_DRIVER_LICENSE, US_BANK_NUMBER, US_ITIN +- **UK**: UK_NHS, UK_NINO +- **Spain**: ES_NIF, ES_NIE +- **Italy**: IT_FISCAL_CODE, IT_DRIVER_LICENSE, IT_VAT_CODE, IT_PASSPORT, IT_IDENTITY_CARD +- **Others**: PL_PESEL, SG_NRIC_FIN, AU_ABN, IN_AADHAAR, FI_PERSONAL_IDENTITY_CODE, KR_RRN, TH_TNIN +- And many more... + +## Anonymization Operators + +- **replace** - Replace with placeholder or static value +- **redact** - Remove the PII entirely +- **hash** - One-way hash (permanent) +- **mask** - Mask characters with * or other character +- **encrypt** - Reversible encryption +- **keep** - Keep original (for allowlisting) + +## Multi-Language Support + +Supported languages: en, es, fr, de, it, pt, nl, he, ru, pl, zh, ja, ko, ar, th, fi, and more. + +## Best Practices + +1. **Test with sample data first** - Validate detection accuracy before production +2. **Use appropriate operators** - Encryption for reversible, hash for permanent +3. **Consider context** - Use context-aware recognizers for better accuracy +4. 
**Secure encryption keys** - Store keys securely, never hardcode +5. **Handle multiple languages** - Specify correct language code +6. **Validate results** - Review confidence scores and false positives +7. **Additional safeguards** - Presidio uses automation; add extra protections in production + +## Development + +### Adding New Scripts + +Add new scripts to `skills/presidio-pii-detection/scripts/`: +1. Create Python script with clear docstring +2. Make it executable: `chmod +x script.py` +3. Add usage examples in comments +4. Update this README + +### Adding New References + +Add new reference documents to `skills/presidio-pii-detection/references/`: +1. Create markdown file with clear structure +2. Focus on specific use cases or API details +3. Keep it concise and actionable +4. Update main SKILL.md if needed + +## Troubleshooting + +**Issue**: "Model 'en_core_web_lg' not found" +```bash +python -m spacy download en_core_web_lg +``` + +**Issue**: "presidio_analyzer module not found" +```bash +pip install presidio-analyzer presidio-anonymizer +``` + +**Issue**: Low detection accuracy +- Try using transformers NLP engine +- Lower score_threshold +- Add context words to recognizers +- Use appropriate language code + +**Issue**: False positives +- Implement allowlists +- Adjust score_threshold +- Use more specific entity types +- Add context awareness + +## License + +This plugin and skill follow the same MIT license as the Presidio repository. + +## Contributing + +To improve or extend these skills: +1. Edit `skills/presidio-pii-detection/SKILL.md` for instruction changes +2. Add scripts to `skills/presidio-pii-detection/scripts/` +3. Add reference docs to `skills/presidio-pii-detection/references/` +4. Update this README +5. 
Test thoroughly with Claude Code + +## Support + +For issues with: +- **Presidio Library**: https://github.com/microsoft/presidio/issues +- **This Plugin**: Create an issue in this repository +- **Claude Code**: Follow Claude Code documentation + +## Version History + +- **1.0.0** (2025-01-XX) - Initial release + - Comprehensive PII detection skill + - Text analysis and anonymization scripts + - API quick reference + - Multi-language support + - Custom recognizer examples diff --git a/.claude/plugins/presidio-toolkit.json b/.claude/plugins/presidio-toolkit.json new file mode 100644 index 0000000000..5039145aca --- /dev/null +++ b/.claude/plugins/presidio-toolkit.json @@ -0,0 +1,16 @@ +{ + "name": "presidio-toolkit", + "version": "1.0.0", + "description": "Comprehensive toolkit for PII detection and anonymization using Microsoft Presidio. Automatically detects and helps anonymize sensitive data in text, images, and structured data.", + "author": "Presidio Contributors", + "skills": [ + "../skills/presidio-pii-detection" + ], + "metadata": { + "repository": "https://github.com/microsoft/presidio", + "documentation": "https://microsoft.github.io/presidio", + "license": "MIT", + "tags": ["pii", "privacy", "security", "anonymization", "data-protection", "gdpr", "hipaa", "deidentification"], + "categories": ["data-protection", "security", "privacy"] + } +} diff --git a/.claude/skills/presidio-pii-detection/LICENSE.txt b/.claude/skills/presidio-pii-detection/LICENSE.txt new file mode 100644 index 0000000000..a22ce18576 --- /dev/null +++ b/.claude/skills/presidio-pii-detection/LICENSE.txt @@ -0,0 +1,29 @@ +MIT License + +Copyright (c) Microsoft Corporation + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of 
the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. + +--- + +This skill is based on Microsoft Presidio: +https://github.com/microsoft/presidio + +The skill definition and related documentation are provided to help users +work with Presidio in Claude Code and are subject to the same MIT license. diff --git a/.claude/skills/presidio-pii-detection/SKILL.md b/.claude/skills/presidio-pii-detection/SKILL.md new file mode 100644 index 0000000000..3c71fe6640 --- /dev/null +++ b/.claude/skills/presidio-pii-detection/SKILL.md @@ -0,0 +1,512 @@ +--- +name: presidio-pii-detection +description: A comprehensive skill for detecting, analyzing, and anonymizing Personally Identifiable Information (PII) in text, images, and structured data using Microsoft Presidio. Use this skill when you need to identify sensitive data like credit cards, SSNs, emails, phone numbers, names, locations, or create custom PII recognizers. Also use for redacting PII from documents, images, or data frames. +--- + +# Presidio PII Detection and Anonymization Skill + +## Overview + +You are now equipped with comprehensive knowledge of Microsoft Presidio, a powerful Data Protection and De-identification SDK. This skill enables you to help users detect and anonymize PII in various data formats. + +## Core Capabilities + +### 1. 
Text PII Detection and Anonymization + +When a user needs to detect or anonymize PII in text: + +1. **Use AnalyzerEngine** to identify PII entities: +```python +from presidio_analyzer import AnalyzerEngine + +analyzer = AnalyzerEngine() +results = analyzer.analyze( + text="User provided text", + entities=["PHONE_NUMBER", "EMAIL_ADDRESS", "PERSON"], # Specify or use None for all + language='en', + score_threshold=0.5 # Adjust confidence threshold +) +``` + +2. **Use AnonymizerEngine** to de-identify detected PII: +```python +from presidio_anonymizer import AnonymizerEngine +from presidio_anonymizer.entities import OperatorConfig + +anonymizer = AnonymizerEngine() + +# With custom operators per entity type +operators = { + "PHONE_NUMBER": OperatorConfig("mask", {"chars_to_mask": 4, "masking_char": "*"}), + "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": ""}), + "PERSON": OperatorConfig("hash", {"hash_type": "sha256"}), + "CREDIT_CARD": OperatorConfig("encrypt", {"key": "WmZq4t7w!z%C&F)J"}), +} + +anonymized_result = anonymizer.anonymize( + text="User provided text", + analyzer_results=results, + operators=operators +) + +print(anonymized_result.text) +``` + +### 2. 
Supported PII Entity Types + +**Global Entities:** +- CREDIT_CARD - Credit card numbers with checksum validation +- CRYPTO - Cryptocurrency wallet addresses (Bitcoin) +- DATE_TIME - Dates, times, and time periods +- EMAIL_ADDRESS - Email addresses with RFC-822 validation +- IBAN_CODE - International Bank Account Numbers +- IP_ADDRESS - IPv4 and IPv6 addresses +- NRP - Nationality, religious, or political group +- LOCATION - Geographic locations (cities, countries, regions) +- PERSON - Full person names +- PHONE_NUMBER - Telephone numbers +- MEDICAL_LICENSE - Medical license numbers +- URL - Web URLs + +**US Entities:** +- US_BANK_NUMBER - Bank account numbers +- US_DRIVER_LICENSE - Driver's licenses +- US_ITIN - Individual Taxpayer ID Numbers +- US_PASSPORT - Passport numbers +- US_SSN - Social Security Numbers + +**Other Country-Specific:** +- UK_NHS, UK_NINO (UK) +- ES_NIF, ES_NIE (Spain) +- IT_FISCAL_CODE, IT_DRIVER_LICENSE, IT_VAT_CODE, IT_PASSPORT, IT_IDENTITY_CARD (Italy) +- PL_PESEL (Poland) +- SG_NRIC_FIN, SG_UEN (Singapore) +- AU_ABN, AU_ACN, AU_TFN, AU_MEDICARE (Australia) +- IN_PAN, IN_AADHAAR, IN_VEHICLE_REGISTRATION, IN_VOTER, IN_PASSPORT, IN_GSTIN (India) +- FI_PERSONAL_IDENTITY_CODE (Finland) +- KR_RRN (Korea) +- TH_TNIN (Thailand) + +### 3. Custom PII Recognizer Creation + +When users need to detect custom PII patterns: + +```python +from presidio_analyzer import PatternRecognizer, Pattern, AnalyzerEngine + +# Example: Detect custom employee IDs +employee_recognizer = PatternRecognizer( + supported_entity="EMPLOYEE_ID", + name="employee_id_recognizer", + patterns=[ + Pattern( + name="employee_id_pattern", + regex=r"EMP-\d{6}", + score=0.9 + ) + ], + context=["employee", "emp", "staff", "worker"] # Context words increase confidence +) + +# Add to analyzer +analyzer = AnalyzerEngine() +analyzer.registry.add_recognizer(employee_recognizer) + +# Use as normal +results = analyzer.analyze("My ID is EMP-123456", language='en') +``` + +### 4. 
Image PII Redaction + +When users need to redact PII from images: + +```python +from presidio_image_redactor import ImageRedactorEngine +from PIL import Image + +# Load image +image = Image.open("path/to/image.png") + +# Redact PII +engine = ImageRedactorEngine() +redacted_image = engine.redact( + image, + fill=(0, 0, 0), # RGB color of the redaction boxes (black here) + entities=["PERSON", "EMAIL_ADDRESS"] # Optional; forwarded to the text analyzer +) + +# Save result +redacted_image.save("path/to/redacted_image.png") +``` + +For DICOM medical images: +```python +from presidio_image_redactor import DicomImageRedactorEngine + +engine = DicomImageRedactorEngine() +engine.redact_from_file( + input_dicom_path="input.dcm", + output_dir="output/", # Directory that receives the redacted file + fill="background" # or "contrast" +) +``` + +### 5. Structured Data PII Detection + +When users need to anonymize DataFrames, CSV, JSON, or Parquet: + +```python +from presidio_anonymizer.entities import OperatorConfig +from presidio_structured import StructuredEngine, PandasAnalysisBuilder +import pandas as pd + +# Create sample DataFrame +df = pd.DataFrame({ + "name": ["John Doe", "Jane Smith"], + "email": ["john@example.com", "jane@example.com"], + "phone": ["212-555-5555", "415-555-1234"] +}) + +# Map columns to entity types, then anonymize by entity type +engine = StructuredEngine() +analysis = PandasAnalysisBuilder().generate_analysis(df) +anonymized_df = engine.anonymize( + df, + analysis, + operators={ + "PERSON": OperatorConfig("mask", {"chars_to_mask": 4, "masking_char": "*", "from_end": False}), + "EMAIL_ADDRESS": OperatorConfig("replace", {}), # Falls back to the entity-type placeholder + "PHONE_NUMBER": OperatorConfig("hash", {"hash_type": "sha256"}) + } +) +``` + +### 6. 
Anonymization Operators + +Available operators for different use cases: + +- **replace** - Replace with a static value or entity type placeholder + ```python + OperatorConfig("replace", {"new_value": ""}) + ``` + +- **redact** - Remove the PII entirely + ```python + OperatorConfig("redact", {}) + ``` + +- **hash** - Hash the PII (one-way, irreversible) + ```python + OperatorConfig("hash", {"hash_type": "sha256"}) # or "sha512", "md5" + ``` + +- **mask** - Mask characters with a masking character + ```python + OperatorConfig("mask", {"chars_to_mask": 4, "masking_char": "*", "from_end": True}) + ``` + +- **encrypt** - Encrypt the PII (reversible with key) + ```python + OperatorConfig("encrypt", {"key": "your-encryption-key-16-bytes"}) + ``` + +- **keep** - Keep the original value (useful for allowlisting) + ```python + OperatorConfig("keep", {}) + ``` + +- **custom** - Use custom anonymization logic + ```python + OperatorConfig("custom", {"lambda": lambda x: f"ANON_{len(x)}"}) + ``` + +### 7. Multi-Language Support + +When working with non-English text: + +```python +analyzer = AnalyzerEngine() + +# French +results = analyzer.analyze( + text="Mon numéro est 01 23 45 67 89", + language='fr' +) + +# Spanish +results = analyzer.analyze( + text="Mi correo es usuario@ejemplo.com", + language='es' +) + +# German +results = analyzer.analyze( + text="Meine Telefonnummer ist 030-12345678", + language='de' +) +``` + +Supported languages: en, es, fr, de, it, pt, nl, he, ru, pl, and more. + +### 8. 
Using Transformers NLP Engine + +For improved accuracy with transformer models: + +```python +from presidio_analyzer import AnalyzerEngine +from presidio_analyzer.nlp_engine import TransformersNlpEngine + +# Configure transformers model +model_config = [{ + "lang_code": "en", + "model_name": { + "spacy": "en_core_web_sm", + "transformers": "dslim/bert-base-NER" + } +}] + +nlp_engine = TransformersNlpEngine(models=model_config) +analyzer = AnalyzerEngine(nlp_engine=nlp_engine) + +results = analyzer.analyze(text="...", language='en') +``` + +### 9. Context-Aware Detection + +Improve detection accuracy by providing context: + +```python +from presidio_analyzer import PatternRecognizer, Pattern + +# Custom recognizer with context awareness +account_recognizer = PatternRecognizer( + supported_entity="ACCOUNT_NUMBER", + patterns=[Pattern("account_pattern", r"\d{8,12}", 0.4)], # Low initial score + context=["account", "account number", "acct", "acc#"], # Boost score when these appear nearby + # The size of the boost is set on the analyzer's context enhancer, not on the recognizer +) +``` + +### 10. De-anonymization (Reversible Operations) + +When encryption is used, de-anonymize later: + +```python +from presidio_anonymizer import AnonymizerEngine, DeanonymizeEngine +from presidio_anonymizer.entities import OperatorConfig + +# Anonymize with encryption (demo key only; AES keys must be 16, 24, or 32 bytes) +anonymizer = AnonymizerEngine() +encrypted_result = anonymizer.anonymize( + text="My SSN is 123-45-6789", + analyzer_results=results, # Output of AnalyzerEngine.analyze() on the same text + operators={"US_SSN": OperatorConfig("encrypt", {"key": "WmZq4t7w!z%C&F)J"})} +) + +# Later, de-anonymize +deanonymizer = DeanonymizeEngine() +original_result = deanonymizer.deanonymize( + text=encrypted_result.text, + entities=encrypted_result.items, + operators={"US_SSN": OperatorConfig("decrypt", {"key": "WmZq4t7w!z%C&F)J"})} +) +``` + +## Decision Making Guidelines + +### When to Use Which Component + +1. 
**Use presidio-analyzer** when: + - User needs to identify what PII exists in text + - User wants to scan documents for sensitive data + - User needs confidence scores for detected entities + +2. **Use presidio-anonymizer** when: + - User needs to de-identify detected PII + - User wants to redact, mask, or replace sensitive data + - User needs reversible anonymization (encryption) + +3. **Use presidio-image-redactor** when: + - User has images or PDFs with visible PII + - User needs to redact text from scanned documents + - User works with medical DICOM images + +4. **Use presidio-structured** when: + - User has CSV, JSON, Parquet, or DataFrame data + - User needs to anonymize entire datasets + - User wants to preserve data structure while removing PII + +5. **Create custom recognizers** when: + - User has organization-specific ID formats + - Standard recognizers miss domain-specific patterns + - User needs to detect non-standard PII types + +## Common Patterns and Solutions + +### Pattern 1: Full Text Anonymization Pipeline +```python +from presidio_analyzer import AnalyzerEngine +from presidio_anonymizer import AnonymizerEngine + +def anonymize_text(text, entities=None, language='en'): + """Complete anonymization pipeline.""" + analyzer = AnalyzerEngine() + anonymizer = AnonymizerEngine() + + # Detect PII + results = analyzer.analyze(text=text, entities=entities, language=language) + + # Anonymize + anonymized = anonymizer.anonymize(text=text, analyzer_results=results) + + return anonymized.text, results +``` + +### Pattern 2: Batch Processing Multiple Documents +```python +from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine +from presidio_anonymizer import AnonymizerEngine + +def batch_anonymize(documents): + """Process multiple documents efficiently.""" + analyzer = AnalyzerEngine() + batch_analyzer = BatchAnalyzerEngine(analyzer_engine=analyzer) + anonymizer = AnonymizerEngine() + + # Analyze all documents + batch_results = 
batch_analyzer.analyze_dict(documents, language='en') + + # Anonymize each (analyze_dict yields one result per key, not a dict) + anonymized_docs = [] + for item in batch_results: + anonymized = anonymizer.anonymize( + text=item.value, + analyzer_results=item.recognizer_results + ) + anonymized_docs.append(anonymized.text) + + return anonymized_docs +``` + +### Pattern 3: Allowlist Specific Values +```python +from presidio_analyzer import AnalyzerEngine +from presidio_anonymizer import AnonymizerEngine +from presidio_anonymizer.entities import OperatorConfig + +def anonymize_with_allowlist(text, allowed_emails): + """Don't anonymize specific allowed values.""" + analyzer = AnalyzerEngine() + anonymizer = AnonymizerEngine() + + results = analyzer.analyze(text=text, language='en') + + # Filter out allowed values + filtered_results = [ + r for r in results + if not (r.entity_type == "EMAIL_ADDRESS" and text[r.start:r.end] in allowed_emails) + ] + + anonymized = anonymizer.anonymize(text=text, analyzer_results=filtered_results) + return anonymized.text +``` + +### Pattern 4: Different Anonymization Per Entity Type +```python +from presidio_anonymizer import AnonymizerEngine +from presidio_anonymizer.entities import OperatorConfig + +def smart_anonymize(text, results): + """Apply different strategies based on entity type.""" + anonymizer = AnonymizerEngine() + + operators = { + "PERSON": OperatorConfig("replace", {"new_value": ""}), + "EMAIL_ADDRESS": OperatorConfig("mask", {"chars_to_mask": 5, "masking_char": "*", "from_end": False}), + "PHONE_NUMBER": OperatorConfig("redact", {}), + "CREDIT_CARD": OperatorConfig("hash", {"hash_type": "sha256"}), + "US_SSN": OperatorConfig("encrypt", {"key": "WmZq4t7w!z%C&F)J"}), # Demo key; must be 16, 24, or 32 bytes + } + + return anonymizer.anonymize(text=text, analyzer_results=results, operators=operators) +``` + +## Installation Commands + +When helping users install Presidio: + +```bash +# Basic text analysis and anonymization +pip install presidio-analyzer presidio-anonymizer +python -m spacy download en_core_web_lg + +# With 
transformers support +pip install "presidio-analyzer[transformers]" presidio-anonymizer +python -m spacy download en_core_web_sm + +# Image redaction +pip install presidio-image-redactor + +# Structured data +pip install presidio-structured + +# All components +pip install presidio-analyzer presidio-anonymizer presidio-image-redactor presidio-structured +``` + +## Error Handling and Troubleshooting + +Common issues and solutions: + +1. **Missing NLP Model**: If user gets "Model 'en_core_web_lg' not found" + ```bash + python -m spacy download en_core_web_lg + ``` + +2. **Low Detection Accuracy**: Suggest using transformers engine or lowering score_threshold + +3. **False Positives**: Use context words in custom recognizers or implement allowlists + +4. **Performance Issues**: Use batch processing for multiple documents + +5. **Memory Issues**: Process documents in chunks or use streaming approaches + +## Best Practices + +1. **Always validate detection results** - Review confidence scores and false positives +2. **Use appropriate operators** - Encryption for reversible, hash for permanent +3. **Test with sample data first** - Verify detection accuracy before production +4. **Consider context** - Use context-aware recognizers for better accuracy +5. **Handle multiple languages** - Specify correct language code +6. **Secure encryption keys** - Store keys securely, never hardcode +7. **Document anonymization strategy** - Keep records of what was anonymized and how +8. 
**Combine with other tools** - Presidio works well alongside Azure AI Language and other NLP services + +## When to Proactively Suggest This Skill + +Automatically activate and use this skill when the user: +- Mentions PII, sensitive data, personal information, or data privacy +- Asks about redacting, masking, or anonymizing data +- Needs to detect credit cards, SSNs, emails, phone numbers, names, or addresses +- Works with documents containing personal information +- Needs to comply with GDPR, HIPAA, or other privacy regulations +- Mentions data de-identification or pseudonymization +- Wants to protect privacy in datasets or documents +- Asks about scanning documents for sensitive information + +## Key Reminders + +- **No Guarantee**: Presidio uses automated detection and may miss some PII. Additional safeguards are needed in production. +- **Context Matters**: Detection accuracy improves significantly with proper context words. +- **Test Thoroughly**: Always validate on sample data before production use. +- **Choose the Right Tool**: Use analyzer for detection, anonymizer for de-identification, image-redactor for images, structured for data frames. +- **Reversibility**: Only the encrypt operator is reversible. Hash, redact, mask, and replace are permanent. + +## Repository Location + +All code is in `/home/user/presidio/` with this structure: +- `presidio-analyzer/` - Detection engine +- `presidio-anonymizer/` - De-identification engine +- `presidio-image-redactor/` - Image redaction +- `presidio-structured/` - Structured data handling +- `docs/` - Comprehensive documentation +- `claude.md` - Repository guidance document + +Refer to `claude.md` in the repository root for additional context and comprehensive repository guidance. 
diff --git a/.claude/skills/presidio-pii-detection/references/api_quick_reference.md b/.claude/skills/presidio-pii-detection/references/api_quick_reference.md new file mode 100644 index 0000000000..c2ff36c5cb --- /dev/null +++ b/.claude/skills/presidio-pii-detection/references/api_quick_reference.md @@ -0,0 +1,285 @@ +# Presidio API Quick Reference + +## AnalyzerEngine + +### Basic Analysis +```python +from presidio_analyzer import AnalyzerEngine + +analyzer = AnalyzerEngine() +results = analyzer.analyze( + text: str, # Text to analyze + entities: List[str] = None, # Specific entities or None for all + language: str = 'en', # Language code + score_threshold: float = 0.0, # Minimum confidence (0.0-1.0) + return_decision_process: bool = False, + ad_hoc_recognizers: List[EntityRecognizer] = None, + context: List[str] = None, + allow_list: List[str] = None +) +``` + +### RecognizerResult Properties +```python +result.entity_type # e.g., "PHONE_NUMBER" +result.start # Start position in text +result.end # End position in text +result.score # Confidence score (0.0-1.0) +result.analysis_explanation # Detection reasoning +result.recognition_metadata # Additional metadata +``` + +## AnonymizerEngine + +### Basic Anonymization +```python +from presidio_anonymizer import AnonymizerEngine +from presidio_anonymizer.entities import OperatorConfig + +anonymizer = AnonymizerEngine() +result = anonymizer.anonymize( + text: str, + analyzer_results: List[RecognizerResult], + operators: Dict[str, OperatorConfig] = None, # Per-entity operators + conflict_resolution: str = "merge_similar_or_contained" +) + +# Result properties +result.text # Anonymized text +result.items # List of anonymized entities with metadata +``` + +### Operator Configuration +```python +# Replace +OperatorConfig("replace", {"new_value": ""}) + +# Redact +OperatorConfig("redact", {}) + +# Hash +OperatorConfig("hash", {"hash_type": "sha256"}) # or "sha512", "md5" + +# Mask +OperatorConfig("mask", { + 
"chars_to_mask": int, # Number of characters to mask + "masking_char": str, # Character to use (default "*") + "from_end": bool # True masks from the end of the value, False from the start +}) + +# Encrypt (reversible) +OperatorConfig("encrypt", {"key": "16-byte-key-here"}) + +# Keep (no anonymization) +OperatorConfig("keep", {}) +``` + +## Custom Pattern Recognizer + +```python +from presidio_analyzer import PatternRecognizer, Pattern + +recognizer = PatternRecognizer( + supported_entity: str, # e.g., "EMPLOYEE_ID" + name: str = None, # Recognizer name + patterns: List[Pattern] = None, # List of patterns to match + context: List[str] = None, # Context words to boost score + supported_language: str = "en", + deny_list: List[str] = None # Exact terms to flag as this entity +) + +pattern = Pattern( + name: str, # Pattern identifier + regex: str, # Regular expression + score: float # Confidence score (0.0-1.0) +) +``` + +## ImageRedactorEngine + +```python +from presidio_image_redactor import ImageRedactorEngine +from PIL import Image + +engine = ImageRedactorEngine() +redacted_image = engine.redact( + image: Image, + fill: Union[int, Tuple[int, int, int]] = (0, 0, 0), # Fill color of the redaction boxes + ocr_kwargs: dict = None, # Extra arguments for the OCR engine + ad_hoc_recognizers: List[PatternRecognizer] = None, # Extra recognizers for this call + **text_analyzer_kwargs # Passed to the analyzer, e.g. entities=[...] +) +``` + +## DICOM Image Redaction + +```python +from presidio_image_redactor import DicomImageRedactorEngine + +engine = DicomImageRedactorEngine() +engine.redact_from_file( + input_dicom_path: str, + output_dir: str, # Directory that receives the redacted file + fill: str = "contrast", # "contrast" or "background" + padding_width: int = 25, + crop_ratio: float = 0.75, + use_metadata: bool = True +) +``` + +## StructuredEngine (DataFrames) + +```python +from presidio_structured import StructuredEngine, PandasAnalysisBuilder +import pandas as pd + +engine = StructuredEngine() + +# Anonymize DataFrame +# structured_analysis maps columns to entity types +anonymized_df = engine.anonymize( + data: pd.DataFrame, + structured_analysis: StructuredAnalysis, # e.g., PandasAnalysisBuilder().generate_analysis(df) + operators: Dict[str, OperatorConfig] = None # Keyed by entity type +) +``` + +## 
BatchAnalyzerEngine + +```python +from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine + +analyzer = AnalyzerEngine() +batch_analyzer = BatchAnalyzerEngine(analyzer_engine=analyzer) + +# Analyze dictionary of documents +results = batch_analyzer.analyze_dict( + input_dict: Dict[str, Any], + language: str = 'en', + **kwargs +) + +# Analyze list of documents +results = batch_analyzer.analyze_iterator( + texts: Iterator[Union[str, Dict]], + language: str = 'en', + **kwargs +) +``` + +## TransformersNlpEngine + +```python +from presidio_analyzer import AnalyzerEngine +from presidio_analyzer.nlp_engine import TransformersNlpEngine + +model_config = [{ + "lang_code": "en", + "model_name": { + "spacy": "en_core_web_sm", # For lemmas, tokens + "transformers": "dslim/bert-base-NER" # For NER + } +}] + +nlp_engine = TransformersNlpEngine(models=model_config) +analyzer = AnalyzerEngine(nlp_engine=nlp_engine) +``` + +## DeanonymizeEngine (Reversible Operations) + +```python +from presidio_anonymizer import DeanonymizeEngine +from presidio_anonymizer.entities import OperatorConfig + +deanonymizer = DeanonymizeEngine() +original_result = deanonymizer.deanonymize( + text: str, + entities: List[OperatorResult], # From anonymizer result.items + operators: Dict[str, OperatorConfig] +) +``` + +## Entity Registry Management + +```python +# Add custom recognizer +analyzer.registry.add_recognizer(custom_recognizer) + +# Remove recognizer +analyzer.registry.remove_recognizer(recognizer_name) + +# Load the predefined recognizers into the registry (in place, returns None) +analyzer.registry.load_predefined_recognizers() + +# Get supported entities +entities = analyzer.registry.get_supported_entities() +``` + +## Common Entity Types + +**Global**: CREDIT_CARD, CRYPTO, DATE_TIME, EMAIL_ADDRESS, IBAN_CODE, IP_ADDRESS, NRP, LOCATION, PERSON, PHONE_NUMBER, MEDICAL_LICENSE, URL + +**US**: US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_PASSPORT, US_SSN + +**UK**: UK_NHS, UK_NINO + +**Spain**: ES_NIF, ES_NIE + +**Italy**: 
IT_FISCAL_CODE, IT_DRIVER_LICENSE, IT_VAT_CODE, IT_PASSPORT, IT_IDENTITY_CARD + +**Other**: PL_PESEL, SG_NRIC_FIN, SG_UEN, AU_ABN, AU_ACN, AU_TFN, AU_MEDICARE, IN_PAN, IN_AADHAAR, FI_PERSONAL_IDENTITY_CODE, KR_RRN, TH_TNIN + +## Language Codes + +en (English), es (Spanish), fr (French), de (German), it (Italian), pt (Portuguese), nl (Dutch), he (Hebrew), ru (Russian), pl (Polish), zh (Chinese), ja (Japanese), ko (Korean), ar (Arabic), th (Thai), fi (Finnish). Only `en` is configured by default; each additional language needs its own NLP model and recognizers. + +## Common Patterns + +### Complete Pipeline +```python +analyzer = AnalyzerEngine() +anonymizer = AnonymizerEngine() + +results = analyzer.analyze(text=text, language='en') +anonymized = anonymizer.anonymize(text=text, analyzer_results=results) +``` + +### With Custom Operators +```python +operators = { + "PERSON": OperatorConfig("replace", {"new_value": ""}), + "EMAIL_ADDRESS": OperatorConfig("mask", {"chars_to_mask": 5, "masking_char": "*", "from_end": True}), + "CREDIT_CARD": OperatorConfig("hash", {"hash_type": "sha256"}), +} +anonymized = anonymizer.anonymize(text=text, analyzer_results=results, operators=operators) +``` + +### Filtered Analysis +```python +# Only detect specific entities +results = analyzer.analyze( + text=text, + entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"], + language='en' +) +``` + +### With Allowlist +```python +results = analyzer.analyze( + text=text, + allow_list=["support@company.com", "John Public"], + language='en' +) +``` + +### Image to Redacted Image +```python +from PIL import Image +from presidio_image_redactor import ImageRedactorEngine + +image = Image.open("input.png") +engine = ImageRedactorEngine() +redacted = engine.redact(image, fill=(0, 0, 0)) # fill is a color: black boxes over detected PII +redacted.save("output.png") +``` diff --git a/.claude/skills/presidio-pii-detection/scripts/analyze_text.py b/.claude/skills/presidio-pii-detection/scripts/analyze_text.py new file mode 100755 index 0000000000..135b5ece49 --- /dev/null +++ b/.claude/skills/presidio-pii-detection/scripts/analyze_text.py @@ -0,0 +1,109 @@ +#!/usr/bin/env python3 +""" 
+Analyze text for PII entities using Presidio Analyzer. + +Usage: + python analyze_text.py "Text to analyze" + python analyze_text.py --file input.txt + python analyze_text.py --entities PHONE_NUMBER EMAIL_ADDRESS "My phone is 555-1234" +""" + +import argparse +import json +import sys +from typing import List, Optional + + +def analyze_text( + text: str, + entities: Optional[List[str]] = None, + language: str = "en", + score_threshold: float = 0.5 +): + """Analyze text for PII entities.""" + try: + from presidio_analyzer import AnalyzerEngine + + analyzer = AnalyzerEngine() + results = analyzer.analyze( + text=text, + entities=entities, + language=language, + score_threshold=score_threshold + ) + + # Format results as JSON + results_json = [ + { + "entity_type": r.entity_type, + "start": r.start, + "end": r.end, + "score": r.score, + "text": text[r.start:r.end] + } + for r in results + ] + + return { + "text": text, + "language": language, + "entities_found": len(results_json), + "results": results_json + } + + except ImportError: + return { + "error": "presidio-analyzer not installed. 
Run: pip install presidio-analyzer" + } + except Exception as e: + return { + "error": str(e) + } + + +def main(): + parser = argparse.ArgumentParser(description="Analyze text for PII entities") + parser.add_argument("text", nargs="?", help="Text to analyze") + parser.add_argument("--file", "-f", help="Read text from file") + parser.add_argument("--entities", "-e", nargs="+", help="Specific entities to detect") + parser.add_argument("--language", "-l", default="en", help="Language code (default: en)") + parser.add_argument("--threshold", "-t", type=float, default=0.5, help="Score threshold (default: 0.5)") + parser.add_argument("--output", "-o", help="Output file for results") + + args = parser.parse_args() + + # Get text from argument or file + if args.file: + try: + with open(args.file, 'r', encoding='utf-8') as f: + text = f.read() + except Exception as e: + print(f"Error reading file: {e}", file=sys.stderr) + sys.exit(1) + elif args.text: + text = args.text + else: + parser.print_help() + sys.exit(1) + + # Analyze + results = analyze_text( + text=text, + entities=args.entities, + language=args.language, + score_threshold=args.threshold + ) + + # Output + output_json = json.dumps(results, indent=2) + + if args.output: + with open(args.output, 'w', encoding='utf-8') as f: + f.write(output_json) + print(f"Results written to {args.output}") + else: + print(output_json) + + +if __name__ == "__main__": + main() diff --git a/.claude/skills/presidio-pii-detection/scripts/anonymize_text.py b/.claude/skills/presidio-pii-detection/scripts/anonymize_text.py new file mode 100755 index 0000000000..0caccce2d2 --- /dev/null +++ b/.claude/skills/presidio-pii-detection/scripts/anonymize_text.py @@ -0,0 +1,147 @@ +#!/usr/bin/env python3 +""" +Anonymize PII in text using Presidio. 
+ +Usage: + python anonymize_text.py "My SSN is 123-45-6789" + python anonymize_text.py --file input.txt --output output.txt + python anonymize_text.py --operator redact "John's email is john@example.com" +""" + +import argparse +import json +import sys +from typing import Optional + + +def anonymize_text( + text: str, + operator: str = "replace", + entities: Optional[list] = None, + language: str = "en", + operator_params: Optional[dict] = None +): + """Anonymize PII in text.""" + try: + from presidio_analyzer import AnalyzerEngine + from presidio_anonymizer import AnonymizerEngine + from presidio_anonymizer.entities import OperatorConfig + + # Analyze + analyzer = AnalyzerEngine() + results = analyzer.analyze( + text=text, + entities=entities, + language=language + ) + + if not results: + return { + "original_text": text, + "anonymized_text": text, + "entities_found": 0, + "message": "No PII detected" + } + + # Anonymize + anonymizer = AnonymizerEngine() + + # Build operator config + if operator_params: + operator_config = OperatorConfig(operator, operator_params) + else: + operator_config = OperatorConfig(operator, {}) + + # Apply same operator to all entities + operators = {r.entity_type: operator_config for r in results} + + anonymized = anonymizer.anonymize( + text=text, + analyzer_results=results, + operators=operators + ) + + return { + "original_text": text, + "anonymized_text": anonymized.text, + "entities_found": len(results), + "operator": operator, + "entities": [ + { + "entity_type": r.entity_type, + "start": r.start, + "end": r.end, + "original_text": text[r.start:r.end] + } + for r in results + ] + } + + except ImportError as e: + return { + "error": f"Missing package: {e}. 
Install with: pip install presidio-analyzer presidio-anonymizer" + } + except Exception as e: + return { + "error": str(e) + } + + +def main(): + parser = argparse.ArgumentParser(description="Anonymize PII in text") + parser.add_argument("text", nargs="?", help="Text to anonymize") + parser.add_argument("--file", "-f", help="Read text from file") + parser.add_argument("--output", "-o", help="Output file for anonymized text") + parser.add_argument("--operator", default="replace", + choices=["replace", "redact", "hash", "mask", "encrypt"], + help="Anonymization operator (default: replace)") + parser.add_argument("--entities", "-e", nargs="+", help="Specific entities to anonymize") + parser.add_argument("--language", "-l", default="en", help="Language code (default: en)") + parser.add_argument("--json", action="store_true", help="Output as JSON") + + args = parser.parse_args() + + # Get text + if args.file: + try: + with open(args.file, 'r', encoding='utf-8') as f: + text = f.read() + except Exception as e: + print(f"Error reading file: {e}", file=sys.stderr) + sys.exit(1) + elif args.text: + text = args.text + else: + parser.print_help() + sys.exit(1) + + # Anonymize + result = anonymize_text( + text=text, + operator=args.operator, + entities=args.entities, + language=args.language + ) + + # Check for errors + if "error" in result: + print(f"Error: {result['error']}", file=sys.stderr) + sys.exit(1) + + # Output + if args.output: + with open(args.output, 'w', encoding='utf-8') as f: + if args.json: + f.write(json.dumps(result, indent=2)) + else: + f.write(result['anonymized_text']) + print(f"Anonymized text written to {args.output}") + else: + if args.json: + print(json.dumps(result, indent=2)) + else: + print(result['anonymized_text']) + + +if __name__ == "__main__": + main() diff --git a/claude.md b/claude.md new file mode 100644 index 0000000000..217f43b3da --- /dev/null +++ b/claude.md @@ -0,0 +1,516 @@ +# Presidio - Claude Code Guide + +## Overview + 
+**Presidio** (Latin for "protection, garrison") is Microsoft's comprehensive Data Protection and De-identification SDK. It provides fast, context-aware, pluggable, and customizable PII (Personally Identifiable Information) detection and anonymization services for text, images, and structured data. + +**Repository**: https://github.com/microsoft/presidio +**Documentation**: https://microsoft.github.io/presidio +**Demo**: https://aka.ms/presidio-demo + +## Key Features + +1. **Predefined & Custom PII Recognizers** - Leverages NER, regex, rule-based logic, and checksums with contextual awareness in multiple languages +2. **External Model Integration** - Connect to third-party PII detection models +3. **Multiple Deployment Options** - Python packages, PySpark, Docker, Kubernetes +4. **Highly Customizable** - Extensible PII identification and de-identification +5. **Image Redaction** - Redact PII from standard images and DICOM medical images + +## Repository Structure + +``` +presidio/ +├── presidio-analyzer/ # PII identification in text +├── presidio-anonymizer/ # De-identify detected PII entities +├── presidio-image-redactor/ # Redact PII from images using OCR +├── presidio-structured/ # PII identification in structured/semi-structured data +├── presidio-cli/ # Command-line interface +├── docs/ # Comprehensive documentation +├── e2e-tests/ # End-to-end tests +├── .devcontainer/ # Development container configuration +└── docker-compose*.yml # Docker deployment configurations +``` + +## Core Components + +### 1. Presidio Analyzer + +**Purpose**: Identify PII entities in text using NLP and pattern matching. 
+ +**Key Classes**: +- `AnalyzerEngine` - Main entry point for PII analysis +- `RecognizerResult` - Contains detected PII information +- `EntityRecognizer` - Base class for custom recognizers +- `PatternRecognizer` - Pattern-based recognition +- `AnalyzerEngineProvider` - Factory for creating analyzer instances + +**Basic Usage**: +```python +from presidio_analyzer import AnalyzerEngine + +analyzer = AnalyzerEngine() +results = analyzer.analyze( + text="My phone number is 212-555-5555", + entities=["PHONE_NUMBER"], + language='en' +) +``` + +**Supported Entities**: CREDIT_CARD, CRYPTO, DATE_TIME, EMAIL_ADDRESS, IBAN_CODE, IP_ADDRESS, NRP, LOCATION, PERSON, PHONE_NUMBER, MEDICAL_LICENSE, URL, US_SSN, US_PASSPORT, UK_NHS, and many more (see docs/supported_entities.md for full list) + +**Location**: `presidio-analyzer/presidio_analyzer/` + +### 2. Presidio Anonymizer + +**Purpose**: De-identify detected PII using various operators (redact, replace, encrypt, hash, mask, etc.). + +**Key Classes**: +- `AnonymizerEngine` - Main entry point for anonymization +- `OperatorConfig` - Configuration for anonymization operators +- `DeanonymizeEngine` - Reverse anonymization when using reversible operators + +**Basic Usage**: +```python +from presidio_anonymizer import AnonymizerEngine + +anonymizer = AnonymizerEngine() +anonymized = anonymizer.anonymize( + text="My phone number is 212-555-5555", + analyzer_results=results +) +``` + +**Anonymization Operators**: +- `replace` - Replace with a new value +- `redact` - Remove the PII +- `hash` - Hash the PII +- `mask` - Mask characters +- `encrypt` - Encrypt the PII (reversible) +- `custom` - Custom anonymization logic + +**Location**: `presidio-anonymizer/presidio_anonymizer/` + +### 3. Presidio Image Redactor + +**Purpose**: Detect and redact PII from images using OCR and the analyzer. 
+ +**Key Classes**: +- `ImageRedactorEngine` - Main entry point for image redaction +- `ImageAnalyzerEngine` - Analyze images for PII +- `DicomImageRedactorEngine` - Specialized for DICOM medical images + +**Basic Usage**: +```python +from presidio_image_redactor import ImageRedactorEngine +from PIL import Image + +engine = ImageRedactorEngine() +image = Image.open("image_with_pii.png") +redacted_image = engine.redact(image) +``` + +**Location**: `presidio-image-redactor/presidio_image_redactor/` + +### 4. Presidio Structured + +**Purpose**: Detect and anonymize PII in structured/semi-structured data (JSON, CSV, Parquet, DataFrames). + +**Key Classes**: +- `StructuredEngine` - Main entry point for structured data +- `PandasAnalysisBuilder` / `JsonAnalysisBuilder` - Build the mapping from columns or keys to entity types +- `StructuredAnalysis` - Holds that mapping and is passed to `StructuredEngine.anonymize` + +**Basic Usage**: +```python +from presidio_structured import StructuredEngine, PandasAnalysisBuilder +import pandas as pd + +engine = StructuredEngine() +df = pd.DataFrame({"name": ["John"], "phone": ["212-555-5555"]}) +# Map each column to an entity type, then anonymize using that mapping +structured_analysis = PandasAnalysisBuilder().generate_analysis(df) +anonymized_df = engine.anonymize(df, structured_analysis) +``` + +**Location**: `presidio-structured/presidio_structured/` + +### 5. Presidio CLI + +**Purpose**: Command-line interface for quick PII detection and anonymization. 
+ +**Location**: `presidio-cli/` + +## Installation + +### Using pip (Python 3.8 - 3.13): +```bash +# Analyzer with default spaCy model +pip install presidio-analyzer +python -m spacy download en_core_web_lg + +# Anonymizer +pip install presidio-anonymizer + +# Image Redactor +pip install presidio-image-redactor + +# Structured +pip install presidio-structured + +# CLI +pip install presidio-cli +``` + +### Using Docker: +```bash +# Pull images +docker pull mcr.microsoft.com/presidio-analyzer +docker pull mcr.microsoft.com/presidio-anonymizer +docker pull mcr.microsoft.com/presidio-image-redactor + +# Run with docker-compose +docker-compose up +``` + +### From Source: +```bash +# Install all components in development mode +pip install -e ./presidio-analyzer +pip install -e ./presidio-anonymizer +pip install -e ./presidio-image-redactor +pip install -e ./presidio-structured +pip install -e ./presidio-cli +``` + +## Common Workflows + +### Text Anonymization Workflow + +```python +from presidio_analyzer import AnalyzerEngine +from presidio_anonymizer import AnonymizerEngine + +# 1. Initialize engines +analyzer = AnalyzerEngine() +anonymizer = AnonymizerEngine() + +# 2. Analyze text for PII +text = "My name is John Doe and my email is john@example.com" +results = analyzer.analyze(text=text, language='en') + +# 3. 
Anonymize detected PII +anonymized = anonymizer.anonymize(text=text, analyzer_results=results) + +print(f"Original: {text}") +print(f"Anonymized: {anonymized.text}") +``` + +### Custom Recognizer Workflow + +```python +from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern + +# Create custom recognizer for employee IDs +employee_id_recognizer = PatternRecognizer( + supported_entity="EMPLOYEE_ID", + patterns=[Pattern("emp_id_pattern", r"EMP-\d{6}", 0.9)] +) + +# Add to analyzer +analyzer = AnalyzerEngine() +analyzer.registry.add_recognizer(employee_id_recognizer) + +# Use analyzer +results = analyzer.analyze(text="My ID is EMP-123456", language='en') +``` + +### Image Redaction Workflow + +```python +from presidio_image_redactor import ImageRedactorEngine +from PIL import Image + +# Load image +image = Image.open("document.png") + +# Redact PII +engine = ImageRedactorEngine() +redacted_image = engine.redact(image) + +# Save result +redacted_image.save("document_redacted.png") +``` + +## Development Setup + +### Prerequisites +- Python 3.8 - 3.13 +- Docker (optional, for containerized development) +- VS Code (optional, .devcontainer available) + +### Setting Up Development Environment + +```bash +# 1. Clone repository +git clone https://github.com/microsoft/presidio.git +cd presidio + +# 2. Create virtual environment +python -m venv venv +source venv/bin/activate # On Windows: venv\Scripts\activate + +# 3. Install dependencies for all modules +pip install -e ./presidio-analyzer[dev] +pip install -e ./presidio-anonymizer[dev] +pip install -e ./presidio-image-redactor[dev] +pip install -e ./presidio-structured[dev] + +# 4. Install pre-commit hooks +pip install pre-commit +pre-commit install + +# 5. 
Download NLP models +python -m spacy download en_core_web_lg +``` + +### Using Dev Container +```bash +# Open in VS Code with the Dev Containers extension (formerly Remote-Containers) +# The .devcontainer configuration will handle setup automatically +``` + +## Testing + +### Running Tests + +```bash +# Run tests for a specific module +cd presidio-analyzer +pytest + +# Run with coverage +pytest --cov=presidio_analyzer --cov-report=html + +# Run all tests from root +pytest presidio-analyzer/tests +pytest presidio-anonymizer/tests +pytest presidio-image-redactor/tests +pytest presidio-structured/tests + +# Run end-to-end tests +cd e2e-tests +pytest +``` + +### Running Linter/Formatter +```bash +# Format code with ruff +ruff format . + +# Lint code +ruff check . + +# Run pre-commit checks +pre-commit run --all-files +``` + +## Docker Deployment + +### Single Module +```bash +# Build analyzer +docker build -f presidio-analyzer/Dockerfile -t presidio-analyzer . + +# Run analyzer +docker run -p 5002:3000 presidio-analyzer +``` + +### Full Stack +```bash +# Start all services +docker-compose up + +# Analyzer API: http://localhost:5002 +# Anonymizer API: http://localhost:5001 +``` + +### Using the REST API + +```bash +# Analyze text +curl -X POST http://localhost:5002/analyze \ + -H "Content-Type: application/json" \ + -d '{"text": "My SSN is 123-45-6789", "language": "en"}' + +# Anonymize text +curl -X POST http://localhost:5001/anonymize \ + -H "Content-Type: application/json" \ + -d '{ + "text": "My SSN is 123-45-6789", + "analyzer_results": [ + {"start": 10, "end": 21, "score": 0.95, "entity_type": "US_SSN"} + ] + }' +``` + +## Key Configuration Options + +### Analyzer Configuration + +```python +from presidio_analyzer import AnalyzerEngine +from presidio_analyzer.nlp_engine import NlpEngineProvider + +# Custom NLP engine (e.g., transformers) +# model_name takes a spaCy model (tokens, lemmas) and a transformers model (NER) +nlp_config = { + "nlp_engine_name": "transformers", + "models": [{ + "lang_code": "en", + "model_name": {"spacy": "en_core_web_sm", "transformers": "dslim/bert-base-NER"} + }] +} +nlp_engine = 
NlpEngineProvider(nlp_configuration=nlp_config).create_engine() + +analyzer = AnalyzerEngine(nlp_engine=nlp_engine) + +# Analyze with specific entities only +results = analyzer.analyze( + text="...", + entities=["PHONE_NUMBER", "EMAIL_ADDRESS"], + language='en', + score_threshold=0.7 # Minimum confidence score +) +``` + +### Anonymizer Configuration + +```python +from presidio_anonymizer import AnonymizerEngine +from presidio_anonymizer.entities import OperatorConfig + +anonymizer = AnonymizerEngine() + +# Custom operators per entity type +operators = { + "PHONE_NUMBER": OperatorConfig("mask", {"chars_to_mask": 4, "masking_char": "*", "from_end": True}), + "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": ""}), + "PERSON": OperatorConfig("hash", {"hash_type": "sha256"}), +} + +result = anonymizer.anonymize( + text=text, + analyzer_results=results, + operators=operators +) +``` + +## Multi-Language Support + +Presidio can be configured for multiple languages. The default `AnalyzerEngine` loads English only, so any other language needs an NLP model and recognizers for that language. Commonly used languages include: +- English (en) +- Spanish (es) +- French (fr) +- German (de) +- Italian (it) +- Portuguese (pt) +- Dutch (nl) +- Hebrew (he) +- Russian (ru) +- Polish (pl) +- And more... + +```python +# Using a different language: load a model for it and declare it as supported +# (requires: python -m spacy download fr_core_news_md) +nlp_engine = NlpEngineProvider(nlp_configuration={ + "nlp_engine_name": "spacy", + "models": [{"lang_code": "fr", "model_name": "fr_core_news_md"}], +}).create_engine() +analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["fr"]) +results = analyzer.analyze( + text="Mon numéro de téléphone est 01 23 45 67 89", + language='fr' +) +``` + +## Extending Presidio + +### Creating Custom Recognizers + +See `docs/analyzer/developing_recognizers.md` for detailed guide. + +### Creating Custom Anonymizers + +See `docs/anonymizer/adding_operators.md` for detailed guide. + +### Integrating External Services + +See `docs/tutorial/04_external_services.md` for integrating services like Azure AI Language. 
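As a lightweight alternative to a full operator class, custom anonymization logic can often be expressed as a plain string-to-string function. The sketch below is a minimal, stdlib-only illustration of that idea; `keep_last_four` is a hypothetical helper introduced here, not part of Presidio.

```python
def keep_last_four(value: str, masking_char: str = "*") -> str:
    """Mask every character except the last four (hypothetical helper)."""
    if len(value) <= 4:
        # Too short to leave anything visible: mask the whole value.
        return masking_char * len(value)
    return masking_char * (len(value) - 4) + value[-4:]


if __name__ == "__main__":
    print(keep_last_four("212-555-5555"))
```

With Presidio installed, a callable of this shape could be supplied to the `custom` operator, for example `OperatorConfig("custom", {"lambda": keep_last_four})`. The operator expects a function that takes the detected text and returns a string.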
+ +## Important Files & Directories + +| Path | Purpose | +|------|---------| +| `docs/` | Comprehensive documentation (MkDocs) | +| `docs/tutorial/` | Step-by-step tutorials | +| `docs/samples/` | Usage examples and deployment samples | +| `docs/api/` | Python API reference | +| `CONTRIBUTING.md` | Contribution guidelines | +| `SECURITY.md` | Security policy | +| `CHANGELOG.md` | Version history | +| `mkdocs.yml` | Documentation site configuration | +| `pyproject.toml` | Ruff linter/formatter configuration | +| `.pre-commit-config.yaml` | Pre-commit hooks configuration | + +## Build & Release + +See `docs/build_release.md` for information on: +- Building Python packages +- Building Docker images +- Release process +- Version management + +## Common Issues & FAQ + +See `docs/faq.md` for frequently asked questions including: +- Performance optimization +- Accuracy improvement +- Custom entity types +- Memory usage +- Integration patterns + +## Support & Contributing + +- **Issues**: https://github.com/microsoft/presidio/issues +- **Discussions**: https://github.com/microsoft/presidio/discussions +- **Email**: presidio@microsoft.com +- **Contributing Guide**: See CONTRIBUTING.md +- **Code of Conduct**: See CODE_OF_CONDUCT + +## License + +MIT License - See LICENSE file + +## Warning + +Presidio uses automated detection mechanisms. There is no guarantee that Presidio will find all sensitive information. Additional systems and protections should be employed in production environments. + +## Quick Reference - Common Commands + +```bash +# Install all components +pip install presidio-analyzer presidio-anonymizer presidio-image-redactor presidio-structured + +# Run tests +pytest presidio-analyzer/tests + +# Format code +ruff format . + +# Build docs locally +mkdocs serve + +# Run docker services +docker-compose up + +# Run pre-commit checks +pre-commit run --all-files +``` + +## Next Steps + +1. Read the [Getting Started Guide](docs/getting_started.md) +2. 
Explore [Tutorials](docs/tutorial/index.md) +3. Check [Sample Implementations](docs/samples/index.md) +4. Review [API Documentation](docs/api/analyzer_python.md) +5. See [Deployment Examples](docs/samples/deployments/index.md)
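
The `/anonymize` request shown under "Using the REST API" hard-codes the character offsets of the SSN. A minimal, stdlib-only sketch of deriving such a request body from regex matches follows. `build_anonymize_payload` and the simplified SSN regex are assumptions made for illustration, and the score is a placeholder rather than a value produced by the analyzer.

```python
import json
import re


def build_anonymize_payload(text: str, regex: str, entity_type: str, score: float) -> dict:
    """Build an /anonymize-style request body from regex matches (hypothetical helper)."""
    results = [
        {"start": m.start(), "end": m.end(), "score": score, "entity_type": entity_type}
        for m in re.finditer(regex, text)
    ]
    return {"text": text, "analyzer_results": results}


# Simplified SSN pattern, for illustration only.
payload = build_anonymize_payload("My SSN is 123-45-6789", r"\d{3}-\d{2}-\d{4}", "US_SSN", 0.95)
print(json.dumps(payload, indent=2))
```

In practice the `analyzer_results` list would normally come from the `/analyze` endpoint rather than from a hand-written regex.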