The /healthz endpoint now verifies CUDA context health when running on GPU by calling torch.cuda.synchronize() (surfaces async errors) and torch.cuda.mem_get_info() (verifies runtime). Returns 503 when CUDA is corrupted. Failure state is cached permanently since CUDA context corruption is unrecoverable -- subsequent health checks return instantly without touching CUDA. On CPU-only servers, behaves exactly as before.
Log the error server-side instead of exposing it in the HTTP response. Addresses CodeQL information-exposure-through-an-exception finding.
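The pattern is roughly the following (a minimal sketch with hypothetical names; `client_safe_detail` and the generic message are illustrative, and `logger` stands in for the server's existing logger):

```python
import logging

logger = logging.getLogger("healthz")

def client_safe_detail(exc: Exception) -> str:
    # Log the full exception (with traceback) server-side only...
    logger.error("CUDA health check failed", exc_info=exc)
    # ...and hand the client a fixed, generic string, so the HTTP response
    # body cannot leak stack traces, paths, or other internals.
    return "CUDA health check failed"
```

The 503 body's `detail` field then carries the generic string while the real error stays in the server logs.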
Description
When a GPU encounters a fatal CUDA error (e.g., "illegal memory access"), the CUDA context is permanently corrupted: all subsequent GPU operations fail. But the current `/healthz` endpoint always returns `{"status": "healthy"}`, so pods keep receiving work and failing on every request until something else (timeout accumulation, OOM) eventually triggers a restart. This PR makes `/healthz` actually validate GPU health by checking the CUDA runtime on each call.

Companion PR: roboflow/async-serverless#227 (points consumer health monitors, k8s probes, and the sidecar at `/healthz`).

What changed
New module: `inference/core/utils/cuda_health.py`

A `CudaHealthChecker` singleton that:

- calls `torch.cuda.synchronize()` to surface any pending asynchronous CUDA errors (the most common way "illegal memory access" manifests, since CUDA errors are reported asynchronously)
- calls `torch.cuda.mem_get_info()` to verify the CUDA runtime is still queryable
- checks `torch.cuda.is_available()` once and caches the result; on CPU-only machines the health check always reports healthy, with zero overhead

Performance: ~2-15 microseconds when healthy (CUDA synchronize + mem_get_info), nanoseconds after a cached failure or on CPU.
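A minimal, hypothetical sketch of this fail-fast pattern. The real checker wraps `torch.cuda.synchronize()` and `torch.cuda.mem_get_info()`; here an injected `probe` callable stands in for those calls so the sketch runs without a GPU:

```python
import threading

class HealthChecker:
    """Sketch: `probe` stands in for the real CUDA calls
    (torch.cuda.synchronize() + torch.cuda.mem_get_info())."""

    def __init__(self, probe):
        self._probe = probe
        self._lock = threading.Lock()
        self._failure = None  # cached detail; non-None means permanently unhealthy

    def check(self):
        # CUDA context corruption is unrecoverable, so a cached failure
        # short-circuits: no further CUDA calls are ever made.
        if self._failure is None:
            try:
                self._probe()
            except RuntimeError as exc:
                with self._lock:
                    self._failure = str(exc)
        if self._failure is not None:
            return 503, {"status": "unhealthy", "reason": "cuda_error",
                         "detail": self._failure}
        return 200, {"status": "healthy"}
```

A `/healthz` route would return the body with the given status code; a pod whose context is corrupted then fails its probe and stops receiving traffic.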
Updated `/healthz` endpoint

Now returns 503 with `{"status": "unhealthy", "reason": "cuda_error", "detail": "..."}` when CUDA is broken. Uses a lazy import, matching existing patterns in the codebase (base.py:680).

How the recovery chain works after this change
The existing async-serverless health monitoring chain already has all the right failure-reaction logic; it just never got a failure signal, because `/healthz` was blind to CUDA state.

Type of change
How has this change been tested?

New unit tests in `tests/unit/core/test_cuda_health.py`; run with `pytest tests/unit/core/test_cuda_health.py`.

Any specific deployment considerations

`/healthz` currently isn't used by any health monitors (they poll `/` or `/info`). The companion async-serverless PR switches them to `/healthz`.

Docs