
feat: add SDPO algorithm #2217

Open
celineltan wants to merge 2 commits into NVIDIA-NeMo:main from celineltan:feat/sdpo

Conversation


@celineltan celineltan commented Apr 6, 2026

What does this PR do ?


This PR adds an implementation of Self-Distillation Policy Optimization (SDPO, https://arxiv.org/pdf/2601.20802).

Issues


Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests.
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@celineltan celineltan requested review from a team as code owners April 6, 2026 05:10

copy-pr-bot bot commented Apr 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


Implements SDPO from Hübotter et al. (2026), "Reinforcement Learning
via Self-Distillation" (arXiv:2601.20802).

Signed-off-by: Celine Tan <tancelinetan@gmail.com>
- SDPOLossFn: next_token_logprobs from the training forward already has
  shape [B, S-1] (next-token convention), so do not apply [:, 1:] again.
  teacher_logprobs / prev_logprobs come from get_logprobs() [B, S] and
  still need [:, 1:].  Without this fix training crashes with a tensor
  size mismatch (e.g. 1022 vs 1023).

- SDPOLossFn: add num_valid_samples to the returned metrics dict;
  DTensorPolicyWorkerV2.train() requires this key.

- sdpo_train: replace the non-existent checkpointer.maybe_save_checkpoint /
  checkpointer.save_checkpoint / timeout.update / timeout.timed_out calls
  with the CheckpointManager / TimeoutChecker API that actually exists:
  init_tmp_checkpoint + finalize_checkpoint, mark_iteration + check_save.

- sdpo_train: pass step argument to validate() calls.

Signed-off-by: Celine Tan <tancelinetan@gmail.com>
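
The shape fix in the first bullet can be illustrated with a short, self-contained PyTorch sketch. This is not the actual SDPOLossFn code: the tensor names follow the commit message above, while the batch size, sequence length, the mask, and the way num_valid_samples is computed are illustrative assumptions.

```python
import torch

B, S = 2, 1024  # illustrative batch size and sequence length

# next_token_logprobs from the training forward pass already uses the
# next-token convention, so it arrives with shape [B, S-1].
next_token_logprobs = torch.randn(B, S - 1)

# teacher_logprobs / prev_logprobs come from get_logprobs() over the full
# sequence, shape [B, S]; they still need the [:, 1:] shift to align.
teacher_logprobs = torch.randn(B, S)[:, 1:]  # [B, S-1]
prev_logprobs = torch.randn(B, S)[:, 1:]     # [B, S-1]

# Applying [:, 1:] to next_token_logprobs as well would yield [B, S-2] and
# crash with a size mismatch (e.g. 1022 vs 1023 when S = 1024).
assert next_token_logprobs.shape == teacher_logprobs.shape == prev_logprobs.shape

# DTensorPolicyWorkerV2.train() requires num_valid_samples in the metrics
# dict; counting samples with at least one unmasked token is an assumed
# definition for illustration, not the actual SDPOLossFn implementation.
token_mask = torch.ones(B, S - 1, dtype=torch.bool)  # illustrative mask
metrics = {"num_valid_samples": token_mask.any(dim=-1).sum().item()}
```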
@celineltan celineltan changed the title Feat/sdpo feat: add SDPO algorithm Apr 6, 2026
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 8, 2026