Feature/evaluation - Dataset Interface by RonShakutai · Pull Request #1893 · microsoft/presidio

RonShakutai · 2026-03-08T12:11:39Z

Summary

Adds real dataset loading capabilities to the evaluation flow, replacing the static mock-only UI with an interactive dataset interface. Users can load CSV/JSON files from local paths, preview records, and configure detection options based on dataset content. Aligns the entity schema with Presidio Analyzer's RecognizerResult format.

What's Included

Dataset Loading (Backend)

New router backend/routers/upload.py — POST /api/datasets/load accepts a local file path, format (CSV/JSON), and column names. Parses the file, extracts entities if present, and stores records in memory.
- CSV parser via csv.DictReader with configurable text/entities columns
- JSON parser supporting both JSON arrays and JSONL format
- Preview endpoint (GET /api/datasets/{id}/preview?limit=5) and full records endpoint
New models — DatasetLoadRequest (path, format, text_column, entities_column) and UploadedDataset (id, filename, format, record_count, has_entities, columns)
Sample dataset — data/sample_medical_records.csv with 10 medical records containing pre-tagged entities (PERSON, DATE_TIME, EMAIL_ADDRESS, PHONE_NUMBER, US_SSN, CREDIT_CARD, etc.)
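
The loading flow described above can be sketched roughly as follows. This is an illustrative sketch, not the code in backend/routers/upload.py: the function names, the module-level DATASETS store, and the assumption that a CSV entities cell holds a JSON-encoded list are all hypothetical. Only the csv.DictReader usage, the JSON array/JSONL support, the in-memory storage, and the metadata field names come from the description.

```python
import csv
import json
import uuid
from pathlib import Path

# Hypothetical in-memory store: dataset id -> list of parsed records.
DATASETS = {}


def _parse_entities(raw):
    """Normalize an entities cell (assumed JSON string, list, or empty) to a list."""
    if raw is None or raw == "":
        return []
    if isinstance(raw, str):
        return json.loads(raw)
    return list(raw)


def _read_rows(path, fmt):
    """Read raw rows from a CSV file, a JSON array file, or a JSONL file."""
    if fmt == "csv":
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    if fmt == "json":
        content = Path(path).read_text(encoding="utf-8").strip()
        if content.startswith("["):
            return json.loads(content)  # a single JSON array
        # Otherwise treat the file as JSONL: one JSON object per non-empty line.
        return [json.loads(line) for line in content.splitlines() if line.strip()]
    raise ValueError(f"Unsupported format: {fmt}")


def load_dataset(path, fmt, text_column, entities_column=None):
    """Parse the file, keep the records in memory, and return dataset metadata."""
    rows = _read_rows(path, fmt.lower())
    records = []
    for row in rows:
        raw = row.get(entities_column) if entities_column else None
        records.append({"text": row[text_column],
                        "dataset_entities": _parse_entities(raw)})
    dataset_id = uuid.uuid4().hex
    DATASETS[dataset_id] = records
    # Keys mirror the UploadedDataset fields listed in the description.
    return {
        "id": dataset_id,
        "filename": Path(path).name,
        "format": fmt.lower(),
        "record_count": len(records),
        "has_entities": any(r["dataset_entities"] for r in records),
        "columns": list(rows[0].keys()) if rows else [],
    }


def preview(dataset_id, limit=5):
    """Return the first `limit` records, as the preview endpoint would."""
    return DATASETS[dataset_id][:limit]
```

In this sketch the preview helper only slices the stored list, so calling it repeatedly does not re-read the file.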

Dataset Selector (Frontend — Setup Page)

Replaced static dataset list with a dropdown selector containing a seed example and an "Add new dataset…" option
Inline form for loading datasets: absolute path input, format dropdown (CSV/JSON), text column name, optional entities column name
Selected dataset displays a green summary card with record count, format, entity status, and a preview of the first few records
Detection options (Run Presidio / Run LLM) appear when a dataset is selected

Greyed-Out Sections

Compliance Context and Data Access Constraints cards are visible but disabled (opacity-50, pointer-events-none) with "Coming soon" badges
Data Access Constraints text notes that only cloud-based LLM processing is currently supported

Entity Schema Alignment

Renamed Entity.type → Entity.entity_type across the entire codebase (backend models, mock data, frontend types, all components) to match Presidio Analyzer's RecognizerResult output format
Added dataset_entities field to Record model for entities loaded from datasets
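
As a rough illustration of the aligned schema, the sketch below uses plain dataclasses. Presidio Analyzer's RecognizerResult exposes entity_type, start, end, and score; the rest is assumed: the actual backend models may use a different base class and carry more fields, and the from_dict helper that accepts the legacy "type" key is hypothetical, shown only to make the rename concrete.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entity:
    entity_type: str              # formerly Entity.type
    start: int
    end: int
    score: Optional[float] = None  # dataset-provided entities may carry no score

    @classmethod
    def from_dict(cls, data):
        """Build an Entity, tolerating the legacy "type" key (hypothetical helper)."""
        entity_type = data.get("entity_type", data.get("type"))
        if entity_type is None:
            raise ValueError("entity dict needs 'entity_type' (or legacy 'type')")
        return cls(entity_type=entity_type,
                   start=int(data["start"]),
                   end=int(data["end"]),
                   score=data.get("score"))


@dataclass
class Record:
    text: str
    # Entities that came with the loaded dataset rather than from a detector.
    dataset_entities: list = field(default_factory=list)
```

For example, Entity.from_dict({"type": "PERSON", "start": 0, "end": 4}) yields an Entity whose entity_type is "PERSON".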

Conditional Flow

Anonymization page — reads setupConfig from session storage and conditionally renders Presidio/LLM processing cards based on user selection
Human Review page — "Skip Tagging" button for datasets with pre-existing entities; uses entity_type field throughout
EntityComparison component — supports optional datasetEntities prop
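
The decision logic behind this flow can be sketched in a few lines. The real implementation lives in the frontend; the Python below only mirrors the branching, and the SetupConfig field names (run_presidio, run_llm, has_entities) and the accepted_entities key are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SetupConfig:
    # Hypothetical field names; the stored setupConfig may differ.
    run_presidio: bool = False
    run_llm: bool = False
    has_entities: bool = False


def processing_cards(config):
    """Return which processing cards the Anonymization page would render."""
    cards = []
    if config.run_presidio:
        cards.append("presidio")
    if config.run_llm:
        cards.append("llm")
    return cards


def can_skip_tagging(config):
    """"Skip Tagging" only makes sense when the dataset already has entities."""
    return config.has_entities


def skip_tagging(records):
    """Auto-accept every dataset entity on every record."""
    return [{"text": r["text"],
             "accepted_entities": list(r.get("dataset_entities", []))}
            for r in records]
```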

screenshots

…auto-accept functionality
- Updated HumanReview component to include a "Skip Tagging" button that auto-accepts all entities from records.
- Integrated session storage for setup configuration in HumanReview.
- Modified Setup component to allow loading datasets from CSV/JSON files with a preview feature.
- Added new types for UploadedDataset and SetupConfig to manage dataset metadata.
- Implemented backend API for loading datasets, including CSV and JSON parsing.
- Created sample medical records dataset for testing and demonstration purposes.

RonShakutai added 2 commits March 8, 2026 13:39

feat: Implement auto-confirm all functionality in Human Review page

b0b54cf

RonShakutai self-assigned this Mar 8, 2026

RonShakutai requested a review from a team as a code owner March 8, 2026 12:11

RonShakutai merged commit 0054d54 into ronshakutai/presidio-evaluation-repo Mar 8, 2026
2 checks passed

RonShakutai deleted the ronshakutai/evaluation/dataset-interface branch March 8, 2026 12:12

RonShakutai merged 2 commits into ronshakutai/presidio-evaluation-repo from ronshakutai/evaluation/dataset-interface
