
Add CUDA health checking to /healthz endpoint#2204

Open
hansent wants to merge 2 commits into main from cuda-health-check

Conversation


@hansent hansent commented Apr 6, 2026

Description

When a GPU encounters a fatal CUDA error (e.g., "illegal memory access"), the CUDA context is permanently corrupted — all subsequent GPU operations fail. But the current /healthz endpoint always returns {"status": "healthy"}, so pods keep receiving work and failing on every request until something else (timeout accumulation, OOM) eventually triggers a restart.

This PR makes /healthz actually validate GPU health by checking the CUDA runtime on each call.

Companion PR: roboflow/async-serverless#227 (points consumer health monitors, k8s probes, and sidecar at /healthz)

What changed

New module: inference/core/utils/cuda_health.py

A CudaHealthChecker singleton that:

  • Calls torch.cuda.synchronize() to surface any pending asynchronous CUDA errors (the most common way "illegal memory access" manifests — CUDA errors are reported asynchronously)
  • Calls torch.cuda.mem_get_info() to verify the CUDA runtime is still queryable
  • Caches failure permanently — CUDA context corruption is unrecoverable, so after the first failure we set a flag and never call CUDA again. Subsequent health checks return in nanoseconds.
  • On CPU-only servers: torch.cuda.is_available() is checked once and cached. Health check always returns healthy, zero overhead.
  • Thread-safe with double-checked locking

Performance: ~2-15 microseconds when healthy (CUDA synchronize + mem_get_info), nanoseconds after failure or on CPU.
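The checker described above can be sketched roughly as follows. This is illustrative only: the module path and class name come from the PR, but the internal method names (`check`, `_gpu_available`) and exact state layout are assumptions.

```python
from __future__ import annotations

import threading


class CudaHealthChecker:
    """Sketch of the singleton described in the PR (internals assumed).

    Failure is cached permanently: CUDA context corruption is unrecoverable,
    so after the first failure we set a flag and never call CUDA again.
    """

    _instance = None
    _instance_lock = threading.Lock()

    def __new__(cls):
        # Double-checked locking for the singleton instance.
        if cls._instance is None:
            with cls._instance_lock:
                if cls._instance is None:
                    instance = super().__new__(cls)
                    instance._lock = threading.Lock()
                    instance._cuda_available = None  # cached is_available() result
                    instance._failure = None         # permanently cached failure
                    cls._instance = instance
        return cls._instance

    def _gpu_available(self) -> bool:
        # Checked once and cached: CPU-only servers pay this cost a single time.
        if self._cuda_available is None:
            try:
                import torch
                self._cuda_available = torch.cuda.is_available()
            except Exception:
                self._cuda_available = False
        return self._cuda_available

    def check(self) -> tuple[bool, str | None]:
        """Return (healthy, detail). Returns instantly after a cached failure."""
        if self._failure is not None:
            return False, self._failure
        if not self._gpu_available():
            return True, None
        try:
            import torch
            # synchronize() surfaces pending asynchronous CUDA errors, which is
            # how "illegal memory access" typically manifests.
            torch.cuda.synchronize()
            # mem_get_info() verifies the CUDA runtime is still queryable.
            torch.cuda.mem_get_info()
            return True, None
        except Exception as exc:
            # Corruption is permanent: cache the failure and stop touching CUDA.
            with self._lock:
                if self._failure is None:
                    self._failure = f"{exc}"
            return False, self._failure
```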

Updated /healthz endpoint

Now returns 503 with {"status": "unhealthy", "reason": "cuda_error", "detail": "..."} when CUDA is broken. Uses a lazy import, matching existing patterns in the codebase (base.py:680).
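Framework-agnostic sketch of the endpoint logic (the real handler's wiring into the server is not shown in this PR description, so this only illustrates the status-code contract):

```python
def healthz_response(check_cuda):
    """Build the /healthz response.

    check_cuda is a callable returning (healthy, detail), e.g. the
    CudaHealthChecker's check method. Returns (status_code, body), which the
    real endpoint would serialize; per the PR, 503 signals a broken CUDA
    context. Note the second commit moves the raw error text into server-side
    logs rather than the HTTP response.
    """
    healthy, detail = check_cuda()
    if healthy:
        return 200, {"status": "healthy"}
    return 503, {"status": "unhealthy", "reason": "cuda_error", "detail": detail}
```

For example, `healthz_response(lambda: (True, None))` yields the familiar healthy payload, while a failing checker yields the 503 body that downstream health monitors react to.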

How the recovery chain works after this change

CUDA error occurs (e.g., illegal memory access)
  → /healthz returns 503
  → Consumer health monitor: Healthy → Degraded → Unhealthy (3 consecutive failures)
  → Coordinator pauses message consumption (no new work to this pod)
  → K8s liveness probe fails 3× → restarts inference container
  → If restart doesn't help → pod sidecar detects → full pod restart

The existing async-serverless health monitoring chain already has all the right failure-reaction logic — it just never got a failure signal because /healthz was blind to CUDA state.
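The consumer-side transition described above (Healthy → Degraded → Unhealthy after 3 consecutive failures) can be modeled as a small state machine. This is a sketch of the assumed monitor behavior, not the async-serverless implementation; the class and method names are illustrative.

```python
from enum import Enum


class HealthState(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


class HealthMonitor:
    """Illustrative consumer health monitor: any failed /healthz poll degrades,
    a threshold of consecutive failures marks the pod unhealthy, and any
    success resets the counter."""

    def __init__(self, unhealthy_after: int = 3):
        self.unhealthy_after = unhealthy_after
        self.failures = 0
        self.state = HealthState.HEALTHY

    def record(self, probe_ok: bool) -> HealthState:
        if probe_ok:
            self.failures = 0
            self.state = HealthState.HEALTHY
        else:
            self.failures += 1
            self.state = (
                HealthState.UNHEALTHY
                if self.failures >= self.unhealthy_after
                else HealthState.DEGRADED
            )
        return self.state
```

With a permanently cached CUDA failure, every poll after the first error fails, so the monitor reaches UNHEALTHY on the third consecutive 503 and the coordinator can stop routing work to the pod.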

Type of change

  • New feature (non-breaking change which adds functionality)

How has this change been tested?

  • 8 unit tests in tests/unit/core/test_cuda_health.py:
    • CPU environment (no torch / CUDA unavailable) → always healthy
    • Healthy GPU (mock synchronize + mem_get_info) → healthy
    • CUDA synchronize failure → unhealthy, error cached
    • mem_get_info failure → unhealthy
    • Failure caching verified (torch not called after first failure)
    • GPU availability caching verified
  • All tests passing: pytest tests/unit/core/test_cuda_health.py
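The failure-caching assertion (torch not called after the first failure) follows a standard mock pattern. The sketch below uses a minimal stand-in for the checker rather than the real module, purely to show the shape of that test:

```python
from unittest import mock


def make_checker(cuda):
    """Minimal stand-in for CudaHealthChecker.check(), just enough to
    demonstrate the failure-caching assertion (illustrative only)."""
    state = {"failure": None}

    def check():
        if state["failure"] is not None:
            return False, state["failure"]
        try:
            cuda.synchronize()
            cuda.mem_get_info()
            return True, None
        except Exception as exc:
            state["failure"] = str(exc)
            return False, state["failure"]

    return check


# A mock whose synchronize() raises; repeated checks must hit CUDA only once.
cuda = mock.Mock()
cuda.synchronize.side_effect = RuntimeError("illegal memory access")
check = make_checker(cuda)
assert check() == (False, "illegal memory access")
assert check() == (False, "illegal memory access")
assert cuda.synchronize.call_count == 1  # failure cached, CUDA never re-touched
```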

Any specific deployment considerations

• Backward compatible: /healthz currently isn't used by any health monitors (they poll / or /info). The companion async-serverless PR switches them to /healthz.
  • Deploy this first, then the async-serverless config changes. No coordination required.
  • On CPU-only deployments (cpu-direct, inference without GPU): zero behavioral change, health check returns healthy immediately.

Docs

  • Docs updated? No documentation changes needed

The /healthz endpoint now verifies CUDA context health when running on
GPU by calling torch.cuda.synchronize() (surfaces async errors) and
torch.cuda.mem_get_info() (verifies runtime). Returns 503 when CUDA is
corrupted.

Failure state is cached permanently since CUDA context corruption is
unrecoverable -- subsequent health checks return instantly without
touching CUDA. On CPU-only servers, behaves exactly as before.

Log the error server-side instead of exposing it in the HTTP response.
Addresses CodeQL information-exposure-through-an-exception finding.
