# Customizing Presidio Docker Images

## Overview

This guide provides detailed instructions on how to build and customize Presidio Docker images to support additional languages and custom configurations. The official Presidio Docker images support English by default, but you can create custom images to work with other languages.

## Prerequisites

- Docker installed ([Download Docker](https://docs.docker.com/get-docker/))
- Basic knowledge of Docker and YAML
- Familiarity with spaCy language models

## Understanding Presidio's Docker Architecture

Presidio consists of three main Docker images:
- `presidio-analyzer`: Detects PII entities in text
- `presidio-anonymizer`: Anonymizes detected PII
- `presidio-image-redactor`: Redacts PII from images

For multi-language support, you'll primarily need to customize the `presidio-analyzer` image.

## Step 1: Clone the Presidio Repository

First, clone the Presidio repository:

```bash
git clone https://github.com/microsoft/presidio.git
cd presidio
```

## Step 2: Locate Configuration Files

The key files for customization are:

- `presidio-analyzer/Dockerfile`: Defines the analyzer Docker image and accepts build arguments for custom configuration files
- `presidio-analyzer/presidio_analyzer/conf/default.yaml`: The NLP engine configuration, which controls which language models are installed
- `presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml`: Configures recognizers and the supported languages

## Step 3: Configure Additional Language Models

Presidio installs NLP models at build time through a configuration-driven approach: the Dockerfile runs `install_nlp_models.py --conf_file ${NLP_CONF_FILE}`, which reads the NLP configuration YAML and downloads every model it declares. Avoid adding `RUN python -m spacy download ...` commands to the Dockerfile directly; models downloaded that way are ignored unless they are also declared in the NLP configuration. Instead, either edit `presidio_analyzer/conf/default.yaml` or create a custom configuration file (the repository already ships `presidio_analyzer/conf/spacy_multilingual.yaml`) and pass it as a build argument in Step 5.

### Example: Adding Spanish Support

Add a Spanish entry to the `models` list in the NLP configuration:

```yaml
nlp_engine_name: spacy
models:
  - lang_code: en
    model_name: en_core_web_lg
  - lang_code: es
    model_name: es_core_news_md
```

### Example: Adding Multiple Languages

```yaml
nlp_engine_name: spacy
models:
  - lang_code: en
    model_name: en_core_web_lg
  - lang_code: es
    model_name: es_core_news_md  # Spanish
  - lang_code: fr
    model_name: fr_core_news_md  # French
  - lang_code: de
    model_name: de_core_news_md  # German
```

Presidio supports spaCy, stanza, and transformers models through this same configuration mechanism.

## Step 4: Configure Language Support

### Update the Recognizer Registry

The `supported_languages` list sits at the top level of `presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml`. Update it to include every language code you added to the NLP configuration:

```yaml
supported_languages:
  - en
  - es
  - fr
  - de
```
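
Individual recognizers in the same file can optionally declare language-specific context words to boost detection confidence. The entry below is an illustrative sketch; the exact recognizer names and context words in your `default_recognizers.yaml` may differ:

```yaml
recognizers:
  - name: CreditCardRecognizer
    type: predefined
    supported_languages:
      - language: en
        context: [credit, card, visa, mastercard]
      - language: es
        context: [crédito, tarjeta, visa, mastercard]
```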

## Step 5: Build the Custom Docker Image

The Dockerfile accepts build arguments for specifying custom configuration files: `NLP_CONF_FILE` (defaulting to `presidio_analyzer/conf/default.yaml`) and `RECOGNIZER_REGISTRY_CONF_FILE` (defaulting to `presidio_analyzer/conf/default_recognizers.yaml`). Build your customized image by pointing these at your configuration:

```bash
cd presidio-analyzer
docker build --build-arg NLP_CONF_FILE=presidio_analyzer/conf/spacy_multilingual.yaml \
  -t presidio-analyzer-custom:latest .
```

## Step 6: Run Your Custom Image

The analyzer service listens on port 3000 inside the container by default (the Dockerfile sets `ENV PORT=3000`). Map it to a host port of your choice, or override `PORT` if needed:

```bash
docker run -d -p 5002:3000 presidio-analyzer-custom:latest
```

## Common Pitfalls and Best Practices

### 1. Memory Issues with Multiple Languages

**Problem**: Adding 10+ languages at once can cause the Docker image to run out of memory during build or runtime.

**Solutions**:
- Use smaller spaCy models (e.g., `es_core_news_sm` instead of `es_core_news_lg`); note that smaller models are typically less accurate at detecting PII
- Increase Docker memory allocation:
```bash
docker run -d -p 5002:3000 --memory="4g" presidio-analyzer-custom:latest
```
- Build images with only the languages you actually need
- Consider using a single multilingual transformers model instead of multiple language-specific spaCy models, which can reduce the total model footprint (transformers models are not inherently more memory-efficient than spaCy models)

### 2. Warning: NLP Recognizer Not in List

If you see warnings like:
```
UserWarning: NLP recognizer (e.g. SpacyRecognizer, StanzaRecognizer) is not in the list of recognizers for language en.
```

**Solution**: This warning typically appears when a recognizer is configured for a language in the recognizer registry but no NLP model is configured for that language. The NLP configuration and the recognizer registry must be aligned:

1. Check that the NLP configuration file (e.g., `default.yaml`) declares a model for the language
2. Check that the top-level `supported_languages` list in `default_recognizers.yaml` includes the language
3. Ensure the language codes match in both files (e.g., `en` for English, `es` for Spanish)
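
Because the two files must stay in sync, a quick sanity check is to diff the language codes they declare. The snippet below is an illustrative sketch using stand-in files in `/tmp`; point the `grep` commands at your actual `conf/default.yaml` and `conf/default_recognizers.yaml` instead:

```shell
# Stand-in NLP configuration (replace with your real conf/default.yaml)
cat > /tmp/nlp_conf.yaml <<'EOF'
nlp_engine_name: spacy
models:
  - lang_code: en
    model_name: en_core_web_lg
  - lang_code: es
    model_name: es_core_news_md
EOF

# Stand-in recognizer registry (replace with your real conf/default_recognizers.yaml)
cat > /tmp/recognizers.yaml <<'EOF'
supported_languages:
  - en
  - es
EOF

# Extract and compare the language codes declared in each file
grep -oE 'lang_code: [a-z]+' /tmp/nlp_conf.yaml | awk '{print $2}' | sort > /tmp/nlp_langs
grep -oE '^  - [a-z]+$' /tmp/recognizers.yaml | awk '{print $2}' | sort > /tmp/rec_langs
diff /tmp/nlp_langs /tmp/rec_langs && echo "Configs aligned"
```

If `diff` prints anything, a language is declared in one file but missing from the other.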

### 3. Model Size vs. Accuracy Trade-off

spaCy offers different model sizes:
- `sm` (small): ~15MB, faster but less accurate
- `md` (medium): ~40MB, balanced
- `lg` (large): ~500MB+, most accurate but resource-intensive

**Recommendation**: Start with `lg` models, which give the best detection accuracy; fall back to `md` or `sm` models only if resources are constrained.
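
As a rough capacity check, you can sum the approximate download sizes above. The figures below are ballpark assumptions (see spacy.io/models for current numbers), and runtime memory use is typically a multiple of on-disk size:

```shell
# Approximate on-disk model sizes in MB (ballpark figures, not exact)
en_lg=560; es_md=40; fr_md=45; de_md=45

total=$((en_lg + es_md + fr_md + de_md))
echo "Approximate combined model size: ${total} MB"
```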

## Complete Example: Building a Multi-Language Analyzer

Here's a complete example for Spanish and French support:

### Custom NLP Configuration File

Rather than modifying the Dockerfile, create a custom NLP configuration (or reuse `presidio_analyzer/conf/spacy_multilingual.yaml`, which already ships with the repository). The official Dockerfile is based on `python:3.12-slim` as of this writing, uses Poetry for dependency management, and installs whatever models the configuration declares via `install_nlp_models.py`, so no Dockerfile changes are needed. A minimal configuration for English, Spanish, and French might look like:

```yaml
nlp_engine_name: spacy
models:
  - lang_code: en
    model_name: en_core_web_lg
  - lang_code: es
    model_name: es_core_news_md
  - lang_code: fr
    model_name: fr_core_news_md
```

Save it under `presidio_analyzer/conf/` and reference it with the `NLP_CONF_FILE` build argument. Remember to also add `es` and `fr` to the `supported_languages` list in the recognizer registry configuration.

### Test Your Custom Image

```bash
# Build the image, passing a custom NLP configuration as a build argument
docker build --build-arg NLP_CONF_FILE=presidio_analyzer/conf/spacy_multilingual.yaml \
  -t my-presidio-analyzer .

# Run the container (the service listens on internal port 3000 by default)
docker run -d -p 5002:3000 my-presidio-analyzer

# Test with curl
curl -X POST http://localhost:5002/analyze \
-H "Content-Type: application/json" \
-d '{"text": "Mi nombre es David y mi email es david@example.com", "language": "es"}'
```
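
The response from `/analyze` is a JSON array of detected entities. The values below are illustrative stand-ins, not real service output, but the filter shows how to pull out just the entity types; pipe the real `curl` output through the same command:

```shell
# Simulated /analyze response (illustrative values); in practice, pipe the
# curl command's output into the python3 filter instead of this echo
echo '[{"entity_type": "PERSON", "start": 13, "end": 18, "score": 0.85},
      {"entity_type": "EMAIL_ADDRESS", "start": 32, "end": 49, "score": 1.0}]' \
  | python3 -c 'import json, sys; print(",".join(r["entity_type"] for r in json.load(sys.stdin)))'
# prints: PERSON,EMAIL_ADDRESS
```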

## Using Docker Compose

For complex setups, use docker-compose.yml:

```yaml
services:
  presidio-analyzer:
    build:
      context: ./presidio-analyzer
      args:
        NLP_CONF_FILE: presidio_analyzer/conf/spacy_multilingual.yaml
    ports:
      - "5002:3000"
    environment:
      - PORT=3000
      - LOG_LEVEL=INFO
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 4G

  presidio-anonymizer:
    image: mcr.microsoft.com/presidio-anonymizer:latest
    ports:
      - "5001:3000"
    environment:
      - PORT=3000
```

Note that the `version` field is omitted: it is deprecated in the current Compose specification.

Run with:
```bash
docker compose up --build
```

## Additional Resources

- [Presidio Analyzer Documentation](https://microsoft.github.io/presidio/analyzer/)
- [spaCy Language Models](https://spacy.io/models)
- [Presidio Custom Recognizers](https://microsoft.github.io/presidio/analyzer/adding_recognizers/)
- [Analyzer Engine Provider](https://microsoft.github.io/presidio/analyzer/analyzer_engine_provider/)

## Troubleshooting

### Issue: Build fails with "No space left on device"
**Solution**: Clean up Docker resources:
```bash
docker system prune -a
```

### Issue: Container crashes on startup
**Solution**: Check logs and increase memory:
```bash
docker logs <container-id>
docker run --memory="6g" ...
```
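
When a container crashes, grepping its logs for error-level lines narrows down the cause quickly. The log lines below are simulated for illustration; in practice, replace the `printf` with `docker logs <container-id> 2>&1`:

```shell
# Simulated log stream; substitute `docker logs <container-id> 2>&1`
printf 'INFO: loading model en_core_web_lg\nERROR: MemoryError while loading model\n' \
  | grep -E '^(ERROR|WARNING)'
# prints: ERROR: MemoryError while loading model
```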

## Contributing

For questions or contributions, please refer to the [Presidio Contributing Guide](https://github.com/microsoft/presidio/blob/main/CONTRIBUTING.md).

## Related Documentation

- [Installation Guide](./installation.md)
- [Getting Started with Presidio](./getting_started/getting_started_text.md)
- [Supported Languages](https://microsoft.github.io/presidio/analyzer/languages/)

This guide addresses [Issue #1663](https://github.com/microsoft/presidio/issues/1663) - More elaborate description for building custom Docker images for Presidio.