
feat: add SDPO algorithm #2217

Open
celineltan wants to merge 2 commits into NVIDIA-NeMo:main from celineltan:feat/sdpo

Conversation


@celineltan celineltan commented Apr 6, 2026

What does this PR do ?


This PR adds an implementation of Self-Distillation Policy Optimization (SDPO, https://arxiv.org/pdf/2601.20802).

Issues


Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests.
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@celineltan celineltan requested review from a team as code owners April 6, 2026 05:10

copy-pr-bot bot commented Apr 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


Implements SDPO from Hübotter et al. (2026), "Reinforcement Learning
via Self-Distillation" (arXiv:2601.20802).

Signed-off-by: Celine Tan <tancelinetan@gmail.com>
- SDPOLossFn: next_token_logprobs from the training forward already has
  shape [B, S-1] (next-token convention), so do not apply [:, 1:] again.
  teacher_logprobs / prev_logprobs come from get_logprobs() [B, S] and
  still need [:, 1:].  Without this fix training crashes with a tensor
  size mismatch (e.g. 1022 vs 1023).

- SDPOLossFn: add num_valid_samples to the returned metrics dict;
  DTensorPolicyWorkerV2.train() requires this key.

- sdpo_train: replace the non-existent checkpointer.maybe_save_checkpoint /
  checkpointer.save_checkpoint / timeout.update / timeout.timed_out calls
  with the CheckpointManager / TimeoutChecker API that actually exists:
  init_tmp_checkpoint + finalize_checkpoint, mark_iteration + check_save.

- sdpo_train: pass step argument to validate() calls.

Signed-off-by: Celine Tan <tancelinetan@gmail.com>
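
The shape fix in the first bullet can be illustrated with a short, self-contained PyTorch sketch. This is not the actual SDPOLossFn code: the tensor names follow the commit message above, while the batch size, sequence length, the mask, and the way num_valid_samples is computed are illustrative assumptions.

```python
import torch

B, S = 2, 1024  # illustrative batch size and sequence length

# next_token_logprobs from the training forward pass already uses the
# next-token convention, so it arrives with shape [B, S-1].
next_token_logprobs = torch.randn(B, S - 1)

# teacher_logprobs / prev_logprobs come from get_logprobs() over the full
# sequence, shape [B, S]; they still need the [:, 1:] shift to align.
teacher_logprobs = torch.randn(B, S)[:, 1:]  # [B, S-1]
prev_logprobs = torch.randn(B, S)[:, 1:]     # [B, S-1]

# Applying [:, 1:] to next_token_logprobs as well would yield [B, S-2] and
# crash with a size mismatch (e.g. 1022 vs 1023 when S = 1024).
assert next_token_logprobs.shape == teacher_logprobs.shape == prev_logprobs.shape

# DTensorPolicyWorkerV2.train() requires num_valid_samples in the metrics
# dict; counting samples with at least one unmasked token is an assumed
# definition for illustration, not the actual SDPOLossFn implementation.
token_mask = torch.ones(B, S - 1, dtype=torch.bool)  # illustrative mask
metrics = {"num_valid_samples": token_mask.any(dim=-1).sum().item()}
```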
@celineltan celineltan changed the title Feat/sdpo feat: add SDPO algorithm Apr 6, 2026
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 8, 2026