fix: NVML memory query fallback for DGX Spark #2203

Open
dbuos wants to merge 2 commits into NVIDIA-NeMo:main from dbuos:fix/nvml-dgx-spark-fallback

Conversation


@dbuos dbuos commented Apr 3, 2026

Problem

Running GRPO training on NVIDIA DGX Spark (examples/run_grpo.py) crashes during weight refit because nvmlDeviceGetMemoryInfo is not supported on GB10:

```
pynvml.NVMLError_NotSupported: Not Supported

RuntimeError: Failed to get free memory for device 0 (global index: 0): Not Supported
```

Tested with the latest nvidia-ml-py==13.595.45; the same error occurs.

Fix

Fall back to torch.cuda.mem_get_info() when the NVML call fails. This API works on all CUDA devices including DGX Spark.
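A minimal sketch of the fallback described above (illustrative names only, not the repository's actual code). The NVML query and the torch fallback are passed in as callables so the logic can run without a GPU; in the real patch these would be `pynvml.nvmlDeviceGetMemoryInfo` and `torch.cuda.mem_get_info`.

```python
import logging

logger = logging.getLogger(__name__)


def get_free_and_total_memory(nvml_query, fallback_query):
    """Return (free_bytes, total_bytes), falling back when NVML is unsupported.

    nvml_query: callable returning (free, total); may raise on devices such as
        DGX Spark (GB10), where nvmlDeviceGetMemoryInfo is Not Supported.
    fallback_query: callable returning (free, total) via a path that works on
        all CUDA devices, e.g. torch.cuda.mem_get_info().
    """
    try:
        return nvml_query()
    except Exception as err:  # e.g. pynvml.NVMLError_NotSupported
        # Log the unsupported query and fall back, as the PR describes.
        logger.warning("NVML memory query failed (%s); using fallback", err)
        return fallback_query()
```

Injecting the two queries as callables keeps the fallback decision separate from the device libraries, which is what makes the behavior easy to exercise on machines without NVML at all.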

@dbuos dbuos requested a review from a team as a code owner April 3, 2026 11:44

copy-pr-bot bot commented Apr 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@dbuos dbuos force-pushed the fix/nvml-dgx-spark-fallback branch from d3d46a9 to fa4ec27 Compare April 3, 2026 11:49
…supported

nvmlDeviceGetMemoryInfo returns NVML_ERROR_NOT_SUPPORTED on DGX Spark
(GB10). Log the error and fall back to torch.cuda.mem_get_info which
works on all CUDA devices.

Signed-off-by: Daniel Bustamante Ospina <dbustamante70@gmail.com>
@dbuos dbuos force-pushed the fix/nvml-dgx-spark-fallback branch from db8ae12 to 07a4c07 Compare April 3, 2026 11:57
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 5, 2026
