DocExtract AI

Extract structured data from unstructured documents in seconds -- not hours.

For Hiring Managers

If you're evaluating for...	Where to look	Training behind it
AI / ML Engineer	Agentic RAG ReAct loop (`app/services/agentic_rag.py`), RAGAS evaluation pipeline (`app/services/ragas_evaluator.py`), QLoRA fine-tuning pipeline (`scripts/train_qlora.py`, `adapters/`), W&B experiment tracking, golden eval CI gate	IBM GenAI Engineering (144h), IBM RAG & Agentic AI (24h), DeepLearning.AI Deep Learning (120h)
Backend / Platform Engineer	Circuit breaker model fallback (`app/services/circuit_breaker.py`), async ARQ job queue (`worker/`), K8s/HPA manifests (`deploy/k8s/`), Terraform IaC (`deploy/aws/`), sliding-window rate limiter	Microsoft AI & ML Engineering (75h), Google Cloud GenAI Leader (25h)
Full-Stack AI Engineer	14-page Streamlit dashboard (`frontend/`), SSE streaming progress, MCP tool server (`mcp_server.py`), interactive demo sandbox	IBM BI Analyst (141h), Google Data Analytics (181h), Microsoft Data Viz (87h)
MLOps / LLMOps Engineer	Prompt versioning + regression testing (`app/services/prompt_registry.py`), model A/B testing with z-test significance (`app/services/model_ab_test.py`), DeepEval CI gates, cost tracking per request	Duke LLMOps (48h), Google Advanced Data Analytics (200h)

→ Full cert-to-code mapping: docs/certifications.md (1,208h across 14 certifications)

Quickstart

git clone https://github.com/ChunkyTortoise/docextract.git
cd docextract
cp .env.example .env  # Add ANTHROPIC_API_KEY + GEMINI_API_KEY
docker compose up -d
open http://localhost:8501  # Streamlit UI

Services: API at :8000 (/docs for Swagger) | Frontend at :8501 | PostgreSQL :5432 | Redis :6379

Demo

First visit may take 30 seconds to wake up. Pre-cached results for invoice, contract, and receipt extraction.

Local demo (no API key needed):

DEMO_MODE=true streamlit run frontend/app.py

Architecture

graph LR
  A[Browser / API Client] -->|POST /documents| B[FastAPI]
  B -->|enqueue| C[ARQ Worker]
  C -->|classify| D{Model Router}
  D -->|primary| E[Claude Sonnet]
  D -->|fallback| F[Claude Haiku]
  E -->|Pass 2: extract + correct| G[pgvector HNSW]
  B -->|SSE stream stages| A
  G -->|semantic search| B
  B -->|/metrics| H[Prometheus]
  D --- I[Circuit Breaker]

Supported Models

Model	Provider	Env Var	Notes
`claude-sonnet-4-6`	Anthropic	`ANTHROPIC_API_KEY`	Default extraction model
`claude-haiku-4-5-20251001`	Anthropic	`ANTHROPIC_API_KEY`	Default classification + circuit breaker fallback
`glm-4-plus`	Zhipu AI	`ZHIPUAI_API_KEY`	Chinese AI model, OpenAI-compatible API
`glm-4-flash`	Zhipu AI	`ZHIPUAI_API_KEY`	Fast/cheap GLM variant
Gemini (embedding)	Google	`GEMINI_API_KEY`	Used for pgvector embeddings only

GLM-4 models use an OpenAI-compatible endpoint (https://open.bigmodel.cn/api/paas/v4/). Configure via EXTRACTION_MODELS env var.

Screenshots

Upload & Extraction	Extracted Records & ROI

SSE Streaming Demo

Real-time progress: PREPROCESSING > EXTRACTING > CLASSIFYING > VALIDATING > EMBEDDING > COMPLETED

Key Capabilities

Extraction: Two-pass Claude pipeline (draft + verify via tool_use), 6 document types, 94.6% accuracy on 28-fixture eval suite
Search & RAG: pgvector semantic search (768-dim HNSW), hybrid BM25+RRF retrieval, agentic ReAct loop with 5 tools, map-reduce multi-document synthesis, semantic deduplication cache
Reliability: Circuit breaker (Sonnet to Haiku fallback), dead-letter queue, idempotent retries, HMAC-signed webhooks with 4-attempt retry, SHA-256 upload dedup
Observability: OpenTelemetry traces (Jaeger/Tempo), Prometheus metrics, Grafana dashboards, per-request cost tracking, structured logging
Developer Experience: SSE streaming progress, MCP server integration, prompt versioning (semver), model A/B testing (z-test), 12 ADRs, 90%+ test coverage

Performance

Metric	Value
Document extraction (p50)	~8s (two-pass Claude)
SSE first token (p50)	<500ms
Semantic search (p95)	<100ms
Extraction accuracy (golden eval)	94.6% across 28 fixtures, 6 document types
Test suite	~5s (1,155 tests)
Coverage	90%+ (CI-enforced)

Evaluation Results

Measured against 28 hand-crafted golden fixtures (including 12 adversarial cases with 4 prompt injection attacks) covering all 6 document types. Scores are field-level F1. Run in CI on every push.

Document Type	Accuracy
Invoice	95.0%
Purchase Order	96.4%
Bank Statement	91.6%
Medical Record	98.9%
Receipt	82.1%
Identity Document	81.4%
Overall	94.6%

# Reproduce locally (no API calls):
python scripts/run_eval_ci.py --ci

Project Structure

app/
  api/          -- FastAPI route modules (10 routers)
  auth/         -- API key auth + rate limiting middleware
  models/       -- SQLAlchemy models (8 tables)
  schemas/      -- Pydantic request/response schemas
  services/     -- Extraction, classification, embedding, validation
  storage/      -- Pluggable storage backends (local, R2)
  utils/        -- Hashing, MIME detection, token counting
worker/         -- ARQ async job processor
frontend/       -- Streamlit 14-page dashboard
alembic/        -- Database migrations (001-010)
scripts/        -- Seed scripts (API keys, sample docs, cleanup)
tests/          -- Unit + integration tests

Architecture Decisions

12 Architecture Decision Records (ADRs) document the key design choices: docs/adr/

ADR	Decision
ADR-0001	ARQ over Celery for async job queue
ADR-0002	pgvector over Pinecone/Weaviate
ADR-0003	Two-pass Claude extraction with confidence gating
ADR-0006	Circuit breaker model fallback chain
ADR-0011	API key auth over OAuth/JWT
ADR-0012	Pluggable storage backend (Local/R2)

Production Readiness

Deployed with: Docker Compose / AWS Terraform (RDS + ElastiCache + ECR) / Kubernetes (Kustomize + HPA). Grafana observability. 80% coverage gate. 94.6% eval gate in CI. HIPAA/SOC2 alignment documented.

Cloud infrastructure (deploy/aws/main.tf, deploy/k8s/): Full Terraform IaC for AWS — RDS PostgreSQL, ElastiCache Redis, ECR registry, EC2 with auto-scaling. Kubernetes manifests with Kustomize overlays, Horizontal Pod Autoscaler, and Ingress. Applied from Google Cloud GenAI Leader (25h) coursework.

Document	Purpose
SLO Targets	Latency, availability, quality, cost targets
Common Failure Runbook	Circuit breaker, Redis, DB, queue, vector index recovery
Security Guide	API keys, webhooks, CORS, data handling
Compliance & Privacy	HIPAA/PHI handling, PII detection, SOC 2 alignment
Architecture	Full system architecture overview
Case Study	Engineering journey from prototype to production
MCP Integration	Claude Desktop / agent framework setup
Cost Model	Token costs, per-document pricing, volume estimates
Certifications Applied	1,208h across 14 certifications mapped to features

Deployment

Render (one-click):

Kubernetes: kubectl apply -k deploy/k8s/ (HPA auto-scaling, nginx ingress, SSE buffering disabled)

AWS Terraform: cd deploy/aws && terraform apply (EC2 + RDS PostgreSQL 16 + ElastiCache Redis 7, free-tier eligible)

See deploy/ for full manifests and configuration.

Running Tests

pytest tests/ -v                      # Full suite (1,155 tests, ~5s)
pytest tests/ -v --run-eval           # Include golden eval (requires API key)
python scripts/run_eval_ci.py --ci    # Deterministic eval (no API key)

Known Limitations

Tesseract degradation on handwriting: OCR accuracy drops significantly on handwritten documents. Set OCR_ENGINE=vision to route through Claude's vision API instead.
English-only extraction prompts: Non-English documents may extract with lower accuracy.

Contributing

See CONTRIBUTING.md for development setup, testing, and PR guidelines.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 83 Commits
.claude		.claude
.github		.github
.streamlit		.streamlit
adapters		adapters
alembic		alembic
app		app
autoresearch		autoresearch
deploy		deploy
docs		docs
frontend		frontend
notebooks		notebooks
prompts		prompts
scripts		scripts
storage/reports		storage/reports
tests		tests
worker		worker
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
.gitleaksignore		.gitleaksignore
.python-version		.python-version
CASE_STUDY.md		CASE_STUDY.md
CONTRIBUTING.md		CONTRIBUTING.md
DECISIONS.md		DECISIONS.md
Dockerfile		Dockerfile
Dockerfile.frontend		Dockerfile.frontend
Dockerfile.worker		Dockerfile.worker
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
SECURITY.md		SECURITY.md
alembic.ini		alembic.ini
docker-compose.observability.yml		docker-compose.observability.yml
docker-compose.prod.yml		docker-compose.prod.yml
docker-compose.yml		docker-compose.yml
fly.toml		fly.toml
locust.conf		locust.conf
mcp_server.py		mcp_server.py
offer-kit.yaml		offer-kit.yaml
pyproject.toml		pyproject.toml
render.yaml		render.yaml
requirements.txt		requirements.txt
requirements_demo.txt		requirements_demo.txt
requirements_full.txt		requirements_full.txt
streamlit_demo.py		streamlit_demo.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DocExtract AI

For Hiring Managers

Quickstart

Demo

Architecture

Supported Models

Screenshots

SSE Streaming Demo

Key Capabilities

Performance

Evaluation Results

Project Structure

Architecture Decisions

Production Readiness

Deployment

Running Tests

Known Limitations

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DocExtract AI

For Hiring Managers

Quickstart

Demo

Architecture

Supported Models

Screenshots

SSE Streaming Demo

Key Capabilities

Performance

Evaluation Results

Project Structure

Architecture Decisions

Production Readiness

Deployment

Running Tests

Known Limitations

Contributing

License

About

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages