Feature/evaluation - Dataset Interface #1893

Merged
RonShakutai merged 2 commits into ronshakutai/presidio-evaluation-repo from ronshakutai/evaluation/dataset-interface on Mar 8, 2026

RonShakutai (Collaborator) commented Mar 8, 2026

Summary

Adds real dataset loading capabilities to the evaluation flow, replacing the static mock-only UI with an interactive dataset interface. Users can load CSV/JSON files from local paths, preview records, and configure detection options based on dataset content. Aligns the entity schema with Presidio Analyzer's RecognizerResult format.

What's Included

Dataset Loading (Backend)

  • New router backend/routers/upload.py: POST /api/datasets/load accepts a local file path, a format (CSV/JSON), and column names. It parses the file, extracts entities if present, and stores the records in memory.
    • CSV parser via csv.DictReader with configurable text/entities columns
    • JSON parser supporting both JSON arrays and JSONL format
    • Preview endpoint (GET /api/datasets/{id}/preview?limit=5) and full records endpoint
  • New models: DatasetLoadRequest (path, format, text_column, entities_column) and UploadedDataset (id, filename, format, record_count, has_entities, columns)
  • Sample dataset: data/sample_medical_records.csv with 10 medical records containing pre-tagged entities (PERSON, DATE_TIME, EMAIL_ADDRESS, PHONE_NUMBER, US_SSN, CREDIT_CARD, etc.)
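
For readers who want a concrete picture of the parsing behaviour listed above, here is a minimal standard-library sketch. It is not the code in backend/routers/upload.py. The function names (`load_csv`, `load_json`, `preview`), the default column name `text`, and the assumption that a CSV entities cell holds a JSON-encoded list are all hypothetical choices made for illustration.

```python
import csv
import io
import json


def _parse_entities(raw):
    """Return a list of entity dicts; a CSV cell is assumed to hold JSON text."""
    if raw is None or raw == "":
        return []
    if isinstance(raw, list):  # already parsed (JSON / JSONL input)
        return raw
    return json.loads(raw)


def _to_records(rows, text_column, entities_column):
    """Map raw rows to records, attaching entities when the column is present."""
    records = []
    for row in rows:
        record = {"text": row[text_column]}
        if entities_column and entities_column in row:
            record["dataset_entities"] = _parse_entities(row[entities_column])
        records.append(record)
    return records


def load_csv(text, text_column="text", entities_column=None):
    """Parse CSV content with csv.DictReader (header row gives the column names)."""
    reader = csv.DictReader(io.StringIO(text))
    return _to_records(reader, text_column, entities_column)


def load_json(text, text_column="text", entities_column=None):
    """Accept either a JSON array or JSONL (one JSON object per line)."""
    stripped = text.strip()
    if stripped.startswith("["):
        rows = json.loads(stripped)
    else:
        rows = [json.loads(line) for line in stripped.splitlines() if line.strip()]
    return _to_records(rows, text_column, entities_column)


def preview(records, limit=5):
    """Return the first `limit` records, mirroring the ?limit=5 query parameter."""
    return records[:limit]
```

In a real handler the content would come from the path in the request, for example `open(path, newline="", encoding="utf-8").read()`. The sketch takes the text directly so it can be exercised without touching the filesystem.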

Dataset Selector (Frontend — Setup Page)

  • Replaced static dataset list with a dropdown selector containing a seed example and an "Add new dataset…" option
  • Inline form for loading datasets: absolute path input, format dropdown (CSV/JSON), text column name, optional entities column name
  • Selected dataset displays a green summary card with record count, format, entity status, and a preview of the first few records
  • Detection options (Run Presidio / Run LLM) appear when a dataset is selected

Greyed-Out Sections

  • Compliance Context and Data Access Constraints cards are visible but disabled (opacity-50, pointer-events-none) with "Coming soon" badges
  • Data Access Constraints text notes that only cloud-based LLM processing is currently supported

Entity Schema Alignment

  • Renamed Entity.type to Entity.entity_type across the entire codebase (backend models, mock data, frontend types, all components) to match Presidio Analyzer's RecognizerResult output format
  • Added dataset_entities field to Record model for entities loaded from datasets
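
As a rough illustration of the aligned schema, the sketch below models an entity with the `entity_type`, `start`, `end`, and `score` fields that Presidio Analyzer's RecognizerResult exposes, plus a record carrying `dataset_entities`. It uses plain dataclasses for brevity, so the actual backend models may look different. The `entity_from_dict` helper and its fallback to the legacy `type` key are hypothetical and are not described in this PR.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    entity_type: str  # formerly `type`; matches RecognizerResult.entity_type
    start: int
    end: int
    score: Optional[float] = None  # dataset-provided entities may have no score


@dataclass
class Record:
    text: str
    # Entities loaded from the dataset itself (as opposed to detected ones).
    dataset_entities: List[Entity] = field(default_factory=list)


def entity_from_dict(data):
    """Build an Entity, accepting the legacy "type" key as a fallback (assumption)."""
    entity_type = data["entity_type"] if "entity_type" in data else data["type"]
    return Entity(
        entity_type=entity_type,
        start=data["start"],
        end=data["end"],
        score=data.get("score"),
    )
```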

Conditional Flow

  • Anonymization page — reads setupConfig from session storage and conditionally renders Presidio/LLM processing cards based on user selection
  • Human Review page — "Skip Tagging" button for datasets with pre-existing entities; uses entity_type field throughout
  • EntityComparison component — supports optional datasetEntities prop
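
The conditional flow above lives in the frontend, but its decision logic can be sketched language-neutrally in Python. Everything here is an assumption for illustration: the config keys `run_presidio` and `run_llm`, the `accepted_entities` field, and the rule that tagging can be skipped when any record carries dataset entities.

```python
def processing_steps(setup_config):
    """Return which processing cards to show, based on the saved setup config."""
    steps = []
    if setup_config.get("run_presidio"):
        steps.append("presidio")
    if setup_config.get("run_llm"):
        steps.append("llm")
    return steps


def can_skip_tagging(records):
    """Offer "Skip Tagging" only when at least one record has pre-existing entities."""
    return any(record.get("dataset_entities") for record in records)


def auto_accept(records):
    """Accept every dataset-provided entity without mutating the input records."""
    return [
        {**record, "accepted_entities": list(record.get("dataset_entities", []))}
        for record in records
    ]
```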

…auto-accept functionality

- Updated HumanReview component to include a "Skip Tagging" button that auto-accepts all entities from records.
- Integrated session storage for setup configuration in HumanReview.
- Modified Setup component to allow loading datasets from CSV/JSON files with a preview feature.
- Added new types for UploadedDataset and SetupConfig to manage dataset metadata.
- Implemented backend API for loading datasets, including CSV and JSON parsing.
- Created sample medical records dataset for testing and demonstration purposes.
RonShakutai self-assigned this on Mar 8, 2026
RonShakutai requested a review from a team as a code owner on March 8, 2026 12:11
RonShakutai merged commit 0054d54 into ronshakutai/presidio-evaluation-repo on Mar 8, 2026
2 checks passed
RonShakutai deleted the ronshakutai/evaluation/dataset-interface branch on March 8, 2026 12:12