
Phase 5: Script & Health Hardening #242

Draft
paultranvan wants to merge 5 commits into hardening/phase-4 from hardening/phase-5

@paultranvan
Collaborator

Summary

  • Enhance /health_check endpoint with concurrent LLM/VLM service probes and response time metrics
  • Harden restore script with state tracking, rollback on critical failure, and progress logging

Changes

  • api.py: check_service_health() helper, asyncio.gather for concurrent probes, HTTP 503 if LLM down, 200 degraded if only VLM down
  • scripts/restore.py: restore_state dict, critical vs non-critical failure handling, reverse-order rollback (VDB first then RDB), progress logging every 100 files, error cap at 100
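
The status rule in the api.py bullet (503 if the LLM is down, 200 "degraded" if only the VLM is down) can be sketched as a small pure function. This is an illustration, not the PR's code: the function name `summarize_health` and the labels "healthy" and "unhealthy" are assumptions; only "degraded" and the status codes come from the description above.

```python
from typing import Dict, Tuple


def summarize_health(checks: Dict[str, bool]) -> Tuple[int, str]:
    """Map per-service health flags to (HTTP status code, overall status).

    Hypothetical helper: the key names "llm" and "vlm" and the labels
    "healthy"/"unhealthy" are assumed for illustration.
    """
    # The LLM is critical: without it the endpoint reports 503.
    if not checks.get("llm", False):
        return 503, "unhealthy"
    # The VLM is non-critical: the service still answers 200, flagged as degraded.
    if not checks.get("vlm", False):
        return 200, "degraded"
    return 200, "healthy"
```

Keeping the rule separate from the probing code makes it easy to unit-test each combination without starting any service.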

Test plan

  • All 98 existing tests pass
  • ruff check openrag/ passes
  • Health endpoint responds within ~3s even with slow services
  • Restore script rolls back on critical failure

🤖 Generated with Claude Code

@coderabbitai

coderabbitai bot commented Feb 11, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


paultranvan and others added 5 commits February 12, 2026 22:31
- Add check_service_health() helper function with 3-second timeout
- Probe LLM and VLM services concurrently using asyncio.gather
- Return HTTP 503 when LLM (critical) is unhealthy
- Return HTTP 200 with degraded status when only VLM (non-critical) is down
- Include response_time_ms metrics for each service
- Use httpx.AsyncClient with proper timeout and exception handling
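
A minimal sketch of the concurrent-probe pattern listed above, assuming a simplified signature. The commit uses httpx.AsyncClient for the actual request; to stay standard-library only, this version takes the probe as an injected coroutine function, so `probe`, `probe_all`, and the returned field names other than response_time_ms are hypothetical.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, Optional


async def check_service_health(
    name: str,
    probe: Callable[[], Awaitable[None]],  # assumed: raises on failure
    timeout_s: float = 3.0,
) -> Dict[str, object]:
    """Run one probe with a timeout and record how long it took."""
    start = time.perf_counter()
    error: Optional[str] = None
    try:
        await asyncio.wait_for(probe(), timeout=timeout_s)
    except asyncio.TimeoutError:
        error = "timeout"
    except Exception as exc:  # connection refused, bad status, etc.
        error = str(exc)
    elapsed_ms = round((time.perf_counter() - start) * 1000, 1)
    return {
        "service": name,
        "healthy": error is None,
        "response_time_ms": elapsed_ms,
        "error": error,
    }


async def probe_all(
    probes: Dict[str, Callable[[], Awaitable[None]]],
    timeout_s: float = 3.0,
) -> Dict[str, Dict[str, object]]:
    """Probe every service concurrently; total time is bounded by the slowest probe."""
    results = await asyncio.gather(
        *(check_service_health(name, probe, timeout_s) for name, probe in probes.items())
    )
    return {str(result["service"]): result for result in results}
```

Because each probe carries its own timeout and the probes run under asyncio.gather, the whole check takes roughly one timeout in the worst case rather than one per service, which is consistent with the "~3s even with slow services" item in the test plan.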
- Add restore_state dict to track partitions_created, files_added, files_failed, chunks_inserted, errors
- Fix MilvusDB init failure TODO: now returns 1 to stop execution
- Distinguish critical vs non-critical failures: file insert failures log and continue instead of raising
- Track partition creation on first successful file add
- Log progress milestones every 100 files processed
- Log final summary with counts and first 10 errors
- Cap error list at 100 entries to prevent memory issues
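
The state tracking described above might look roughly like the following. The restore_state keys are taken from the commit message; everything else (`restore_files`, the `insert_file` callback, the constant names, and the log wording) is assumed for illustration.

```python
import logging
from typing import Callable, Dict, Iterable, Tuple

logger = logging.getLogger("restore_sketch")

MAX_ERRORS = 100      # cap on stored error messages
PROGRESS_EVERY = 100  # log a milestone every N files


def new_restore_state() -> Dict[str, object]:
    """Fresh tracking dict with the keys named in the commit message."""
    return {
        "partitions_created": [],
        "files_added": 0,
        "files_failed": 0,
        "chunks_inserted": 0,
        "errors": [],
    }


def restore_files(
    files: Iterable[Tuple[str, str]],        # (partition, file_id) pairs
    insert_file: Callable[[str, str], int],  # hypothetical: returns chunks inserted
    state: Dict[str, object],
) -> Dict[str, object]:
    for processed, (partition, file_id) in enumerate(files, start=1):
        try:
            chunks = insert_file(partition, file_id)
        except Exception as exc:
            # Non-critical failure: record it and keep going.
            state["files_failed"] += 1
            if len(state["errors"]) < MAX_ERRORS:
                state["errors"].append(f"{partition}/{file_id}: {exc}")
            logger.warning("Failed to restore %s/%s: %s", partition, file_id, exc)
        else:
            state["files_added"] += 1
            state["chunks_inserted"] += chunks
            # A partition counts as created on its first successful file add.
            if partition not in state["partitions_created"]:
                state["partitions_created"].append(partition)
        if processed % PROGRESS_EVERY == 0:
            logger.info("Progress: %d files processed", processed)
    logger.info(
        "Restore summary: added=%d failed=%d chunks=%d first_errors=%s",
        state["files_added"], state["files_failed"],
        state["chunks_inserted"], state["errors"][:10],
    )
    return state
```

The failure counter keeps incrementing after the cap is reached, so the summary stays accurate even though only the first 100 messages are retained.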
- Replace bare exception handler with proper rollback flow
- Log critical failure with restore_state context
- Roll back in reverse order: VDB first (prevents orphaned vectors), then RDB (cascades to files)
- Log per-partition rollback success/failure
- Log rollback summary with partition count
- Re-raise exception after rollback for proper exit code
- Rollback loop handles empty partition list (early failures)
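
A sketch of the rollback flow above, under the assumption that partition cleanup is exposed as two callables. `delete_vdb`, `delete_rdb`, `rollback`, and `run_restore` are hypothetical names, not the script's actual functions.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("rollback_sketch")


def rollback(
    partitions: List[str],
    delete_vdb: Callable[[str], None],  # hypothetical: drop the partition's vectors
    delete_rdb: Callable[[str], None],  # hypothetical: drop the partition row
) -> int:
    """Undo created partitions; returns how many were fully rolled back."""
    rolled_back = 0
    for partition in partitions:  # an empty list (early failure) is a no-op
        try:
            delete_vdb(partition)  # vectors first, so none are left orphaned
            delete_rdb(partition)  # then the relational record
        except Exception:
            logger.exception("Rollback failed for partition %s", partition)
        else:
            rolled_back += 1
            logger.info("Rolled back partition %s", partition)
    logger.info("Rollback finished: %d/%d partitions", rolled_back, len(partitions))
    return rolled_back


def run_restore(
    do_restore: Callable[[Dict[str, object]], None],
    state: Dict[str, object],
    delete_vdb: Callable[[str], None],
    delete_rdb: Callable[[str], None],
) -> None:
    try:
        do_restore(state)
    except Exception:
        logger.critical("Critical failure during restore, state=%s", state)
        rollback(list(state["partitions_created"]), delete_vdb, delete_rdb)
        raise  # propagate so the process exits with a non-zero code
```

A failure while rolling back one partition is logged and does not stop the loop, so the remaining partitions still get cleaned up.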
The LLM/VLM base_url config includes the API path (e.g. http://host:8000/v1/)
but the /health endpoint is at the service root. Strip /v1 before probing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
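
The base-URL fix in the commit above can be illustrated with the standard library. This is a sketch, not the PR's implementation: the helper name `health_url` is an assumption, and it only strips a trailing /v1 segment, leaving any other path prefix in place.

```python
from urllib.parse import urlsplit, urlunsplit


def health_url(base_url: str) -> str:
    """Derive the health probe URL from an API base URL such as http://host:8000/v1/."""
    parts = urlsplit(base_url)
    path = parts.path.rstrip("/")
    # Drop the trailing API version segment, if present.
    if path.endswith("/v1"):
        path = path[: -len("/v1")]
    return urlunsplit((parts.scheme, parts.netloc, path + "/health", "", ""))
```

For example, `health_url("http://host:8000/v1/")` yields `http://host:8000/health`, and a base URL without the /v1 suffix gets /health appended unchanged.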
The enhanced health check now returns JSON with status/checks instead
of plain text. Update assertion to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
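
For the test update in the commit above, a check against the JSON body could be shaped like this. Only the top-level keys status and checks come from the commit message; the helper name and the sample payload used to exercise it are assumptions.

```python
import json


def parse_health_body(body: str) -> dict:
    """Parse a health check response body and verify the expected top-level keys."""
    payload = json.loads(body)  # raises ValueError for the old plain-text response
    missing = {"status", "checks"} - payload.keys()
    if missing:
        raise AssertionError(f"health payload missing keys: {sorted(missing)}")
    return payload
```

Asserting on keys rather than on an exact body keeps the test stable when per-service fields such as response_time_ms vary between runs.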