
Add CUDA health checking to /healthz endpoint#2204

Open
hansent wants to merge 2 commits into main from cuda-health-check

Conversation


@hansent hansent commented Apr 6, 2026

Description

When a GPU encounters a fatal CUDA error (e.g., "illegal memory access"), the CUDA context is permanently corrupted — all subsequent GPU operations fail. But the current /healthz endpoint always returns {"status": "healthy"}, so pods keep receiving work and failing on every request until something else (timeout accumulation, OOM) eventually triggers a restart.

This PR makes /healthz actually validate GPU health by checking the CUDA runtime on each call.

Companion PR: roboflow/async-serverless#227 (points consumer health monitors, k8s probes, and sidecar at /healthz)

What changed

New module: inference/core/utils/cuda_health.py

A CudaHealthChecker singleton that:

  • Calls torch.cuda.synchronize() to surface any pending asynchronous CUDA errors (the most common way "illegal memory access" manifests — CUDA errors are reported asynchronously)
  • Calls torch.cuda.mem_get_info() to verify the CUDA runtime is still queryable
  • Caches failure permanently — CUDA context corruption is unrecoverable, so after the first failure we set a flag and never call CUDA again. Subsequent health checks return in nanoseconds.
  • On CPU-only servers: torch.cuda.is_available() is checked once and cached. Health check always returns healthy, zero overhead.
  • Thread-safe with double-checked locking

Performance: ~2-15 microseconds when healthy (CUDA synchronize + mem_get_info), nanoseconds after failure or on CPU.
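The checker described above can be sketched roughly as follows. This is illustrative only: the module path and class name come from the PR, but the internal method names (`check`, `_gpu_available`) and exact state layout are assumptions.

```python
from __future__ import annotations

import threading


class CudaHealthChecker:
    """Sketch of the singleton described in the PR (internals assumed).

    Failure is cached permanently: CUDA context corruption is unrecoverable,
    so after the first failure we set a flag and never call CUDA again.
    """

    _instance = None
    _instance_lock = threading.Lock()

    def __new__(cls):
        # Double-checked locking for the singleton instance.
        if cls._instance is None:
            with cls._instance_lock:
                if cls._instance is None:
                    instance = super().__new__(cls)
                    instance._lock = threading.Lock()
                    instance._cuda_available = None  # cached is_available() result
                    instance._failure = None         # permanently cached failure
                    cls._instance = instance
        return cls._instance

    def _gpu_available(self) -> bool:
        # Checked once and cached: CPU-only servers pay this cost a single time.
        if self._cuda_available is None:
            try:
                import torch
                self._cuda_available = torch.cuda.is_available()
            except Exception:
                self._cuda_available = False
        return self._cuda_available

    def check(self) -> tuple[bool, str | None]:
        """Return (healthy, detail). Returns instantly after a cached failure."""
        if self._failure is not None:
            return False, self._failure
        if not self._gpu_available():
            return True, None
        try:
            import torch
            # synchronize() surfaces pending asynchronous CUDA errors, which is
            # how "illegal memory access" typically manifests.
            torch.cuda.synchronize()
            # mem_get_info() verifies the CUDA runtime is still queryable.
            torch.cuda.mem_get_info()
            return True, None
        except Exception as exc:
            # Corruption is permanent: cache the failure and stop touching CUDA.
            with self._lock:
                if self._failure is None:
                    self._failure = f"{exc}"
            return False, self._failure
```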

Updated /healthz endpoint

Now returns 503 with {"status": "unhealthy", "reason": "cuda_error", "detail": "..."} when CUDA is broken. Uses a lazy import, matching existing patterns in the codebase (base.py:680).
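Framework-agnostic sketch of the endpoint logic (the real handler's wiring into the server is not shown in this PR description, so this only illustrates the status-code contract):

```python
def healthz_response(check_cuda):
    """Build the /healthz response.

    check_cuda is a callable returning (healthy, detail), e.g. the
    CudaHealthChecker's check method. Returns (status_code, body), which the
    real endpoint would serialize; per the PR, 503 signals a broken CUDA
    context. Note the second commit moves the raw error text into server-side
    logs rather than the HTTP response.
    """
    healthy, detail = check_cuda()
    if healthy:
        return 200, {"status": "healthy"}
    return 503, {"status": "unhealthy", "reason": "cuda_error", "detail": detail}
```

For example, `healthz_response(lambda: (True, None))` yields the familiar healthy payload, while a failing checker yields the 503 body that downstream health monitors react to.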

How the recovery chain works after this change

CUDA error occurs (e.g., illegal memory access)
  → /healthz returns 503
  → Consumer health monitor: Healthy → Degraded → Unhealthy (3 consecutive failures)
  → Coordinator pauses message consumption (no new work to this pod)
  → K8s liveness probe fails 3× → restarts inference container
  → If restart doesn't help → pod sidecar detects → full pod restart

The existing async-serverless health monitoring chain already has all the right failure-reaction logic — it just never got a failure signal because /healthz was blind to CUDA state.
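The consumer-side transition described above (Healthy → Degraded → Unhealthy after 3 consecutive failures) can be modeled as a small state machine. This is a sketch of the assumed monitor behavior, not the async-serverless implementation; the class and method names are illustrative.

```python
from enum import Enum


class HealthState(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


class HealthMonitor:
    """Illustrative consumer health monitor: any failed /healthz poll degrades,
    a threshold of consecutive failures marks the pod unhealthy, and any
    success resets the counter."""

    def __init__(self, unhealthy_after: int = 3):
        self.unhealthy_after = unhealthy_after
        self.failures = 0
        self.state = HealthState.HEALTHY

    def record(self, probe_ok: bool) -> HealthState:
        if probe_ok:
            self.failures = 0
            self.state = HealthState.HEALTHY
        else:
            self.failures += 1
            self.state = (
                HealthState.UNHEALTHY
                if self.failures >= self.unhealthy_after
                else HealthState.DEGRADED
            )
        return self.state
```

With a permanently cached CUDA failure, every poll after the first error fails, so the monitor reaches UNHEALTHY on the third consecutive 503 and the coordinator can stop routing work to the pod.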

Type of change

  • New feature (non-breaking change which adds functionality)

How has this change been tested?

  • 8 unit tests in tests/unit/core/test_cuda_health.py:
    • CPU environment (no torch / CUDA unavailable) → always healthy
    • Healthy GPU (mock synchronize + mem_get_info) → healthy
    • CUDA synchronize failure → unhealthy, error cached
    • mem_get_info failure → unhealthy
    • Failure caching verified (torch not called after first failure)
    • GPU availability caching verified
  • All tests passing: pytest tests/unit/core/test_cuda_health.py
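The failure-caching assertion (torch not called after the first failure) follows a standard mock pattern. The sketch below uses a minimal stand-in for the checker rather than the real module, purely to show the shape of that test:

```python
from unittest import mock


def make_checker(cuda):
    """Minimal stand-in for CudaHealthChecker.check(), just enough to
    demonstrate the failure-caching assertion (illustrative only)."""
    state = {"failure": None}

    def check():
        if state["failure"] is not None:
            return False, state["failure"]
        try:
            cuda.synchronize()
            cuda.mem_get_info()
            return True, None
        except Exception as exc:
            state["failure"] = str(exc)
            return False, state["failure"]

    return check


# A mock whose synchronize() raises; repeated checks must hit CUDA only once.
cuda = mock.Mock()
cuda.synchronize.side_effect = RuntimeError("illegal memory access")
check = make_checker(cuda)
assert check() == (False, "illegal memory access")
assert check() == (False, "illegal memory access")
assert cuda.synchronize.call_count == 1  # failure cached, CUDA never re-touched
```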

Any specific deployment considerations

• Backward compatible: /healthz currently isn't used by any health monitors (they poll / or /info). The companion async-serverless PR switches them to /healthz.
  • Deploy this first, then the async-serverless config changes. No coordination required.
  • On CPU-only deployments (cpu-direct, inference without GPU): zero behavioral change, health check returns healthy immediately.

Docs

  • Docs updated? No documentation changes needed

The /healthz endpoint now verifies CUDA context health when running on
GPU by calling torch.cuda.synchronize() (surfaces async errors) and
torch.cuda.mem_get_info() (verifies runtime). Returns 503 when CUDA is
corrupted.

Failure state is cached permanently since CUDA context corruption is
unrecoverable -- subsequent health checks return instantly without
touching CUDA. On CPU-only servers, behaves exactly as before.

Log the error server-side instead of exposing it in the HTTP response.
Addresses CodeQL information-exposure-through-an-exception finding.
