feat: Add pinned memory optimizer offload for Megatron policy worker #2248
Open
snivertynv wants to merge 1 commit into main from
Conversation
…olled using the use_pinned_optimizer_offload setting; set to false in a couple of grpo_math* YAML configs as an example. Added test cases for this feature in test_megatron_worker.py.

Signed-off-by: Sriharsha Niverty <sniverty@nvidia.com>
Author

/ok to test 13e7f01
What does this PR do ?

Optimizer D2H/H2D transfers used per-tensor pageable allocations, causing expensive cudaHostAlloc calls and a synchronous memcpy on every step. This PR adds an opt-in mode (`use_pinned_optimizer_offload`) that coalesces all optimizer state into a single cached pinned buffer, eliminating cudaHostAlloc from the hot path and enabling non-blocking DMA transfers. This significantly improves performance for the optimizer_offload_before_refit pass, which is quite expensive in co-located/syncRL cases.

The feature is enabled/disabled via the `use_pinned_optimizer_offload` setting (default: disabled). It has been set to `false` in a couple of grpo_math* YAML configs as an example, and test cases for this feature were added in `test_megatron_worker.py`.
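The coalescing idea described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: the class and method names are invented, the sketch assumes all state tensors share one dtype, and the pinned allocation falls back to pageable memory on CPU-only hosts so it stays runnable anywhere.

```python
import torch


class PinnedOffloadBuffer:
    """Pack same-dtype state tensors into one cached host buffer.

    Hypothetical sketch of the technique: allocate the (pinned) host
    buffer once and reuse it every step, so the hot path contains only
    copies, not cudaHostAlloc calls.
    """

    def __init__(self):
        self._host_buf = None   # allocated once, reused across steps
        self._layout = None     # (offset, numel, shape) per tensor

    def offload(self, tensors):
        # Compute a contiguous layout for all tensors in one buffer.
        layout, total = [], 0
        for t in tensors:
            layout.append((total, t.numel(), t.shape))
            total += t.numel()
        if self._host_buf is None or self._host_buf.numel() < total:
            # Pinned allocation requires CUDA; pageable fallback keeps
            # this sketch runnable on CPU-only machines.
            self._host_buf = torch.empty(
                total,
                dtype=tensors[0].dtype,
                pin_memory=torch.cuda.is_available(),
            )
        for (off, n, _), t in zip(layout, tensors):
            # With a pinned destination and a GPU source, non_blocking=True
            # lets the D2H copy overlap with other work.
            self._host_buf[off:off + n].copy_(t.reshape(-1), non_blocking=True)
        self._layout = layout

    def restore(self, device="cpu"):
        # Reverse direction (H2D when device is a GPU): views into the one
        # buffer are reshaped and moved back without per-tensor host allocs.
        return [
            self._host_buf[off:off + n].reshape(shape).to(device, non_blocking=True)
            for off, n, shape in self._layout
        ]
```

The key design point is that the buffer is cached: later steps whose state fits in the existing buffer skip allocation entirely, which is what removes cudaHostAlloc from the steady-state offload path.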
Issues
List issues that this PR closes (syntax):
Usage
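A hypothetical config fragment showing how the flag might be toggled in a grpo_math* YAML config. Only the `use_pinned_optimizer_offload` key name comes from the PR description; the surrounding nesting is illustrative, so check the repo's actual configs for the exact location.

```yaml
policy:
  megatron_cfg:
    # Opt in to the cached pinned-buffer optimizer offload path
    # (default: disabled). Key nesting here is illustrative.
    use_pinned_optimizer_offload: true
```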
```python
# Add a code snippet demonstrating how to use this
```

Before your PR is "Ready for review"
Pre checks:
Additional Information