fix: NVML memory query fallback for DGX Spark #2203

Open
dbuos wants to merge 2 commits into NVIDIA-NeMo:main from dbuos:fix/nvml-dgx-spark-fallback

Conversation


@dbuos dbuos commented Apr 3, 2026

Problem

Running GRPO training on NVIDIA DGX Spark (examples/run_grpo.py) crashes during weight refit because nvmlDeviceGetMemoryInfo is not supported on GB10:

```
pynvml.NVMLError_NotSupported: Not Supported

RuntimeError: Failed to get free memory for device 0 (global index: 0): Not Supported
```

Tested with the latest nvidia-ml-py==13.595.45; the same error occurs.

Fix

Fall back to torch.cuda.mem_get_info() when the NVML call fails. This API works on all CUDA devices including DGX Spark.
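A minimal sketch of the fallback described above (illustrative names only, not the repository's actual code). The NVML query and the torch fallback are passed in as callables so the logic can run without a GPU; in the real patch these would be `pynvml.nvmlDeviceGetMemoryInfo` and `torch.cuda.mem_get_info`.

```python
import logging

logger = logging.getLogger(__name__)


def get_free_and_total_memory(nvml_query, fallback_query):
    """Return (free_bytes, total_bytes), falling back when NVML is unsupported.

    nvml_query: callable returning (free, total); may raise on devices such as
        DGX Spark (GB10), where nvmlDeviceGetMemoryInfo is Not Supported.
    fallback_query: callable returning (free, total) via a path that works on
        all CUDA devices, e.g. torch.cuda.mem_get_info().
    """
    try:
        return nvml_query()
    except Exception as err:  # e.g. pynvml.NVMLError_NotSupported
        # Log the unsupported query and fall back, as the PR describes.
        logger.warning("NVML memory query failed (%s); using fallback", err)
        return fallback_query()
```

Injecting the two queries as callables keeps the fallback decision separate from the device libraries, which is what makes the behavior easy to exercise on machines without NVML at all.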

@dbuos dbuos requested a review from a team as a code owner April 3, 2026 11:44

copy-pr-bot bot commented Apr 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@dbuos dbuos force-pushed the fix/nvml-dgx-spark-fallback branch from d3d46a9 to fa4ec27 Compare April 3, 2026 11:49
…supported

nvmlDeviceGetMemoryInfo returns NVML_ERROR_NOT_SUPPORTED on DGX Spark
(GB10). Log the error and fall back to torch.cuda.mem_get_info which
works on all CUDA devices.

Signed-off-by: Daniel Bustamante Ospina <dbustamante70@gmail.com>
@dbuos dbuos force-pushed the fix/nvml-dgx-spark-fallback branch from db8ae12 to 07a4c07 Compare April 3, 2026 11:57
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 5, 2026
