Add startup resume_if_exists handling for checkpoint reuse by taivu1998 · Pull Request #2206 · NVIDIA-NeMo/RL

taivu1998 · 2026-04-03T17:55:00Z

Summary

This PR adds explicit startup handling for checkpointing.resume_if_exists so training can either resume from the latest active root checkpoint or archive existing root checkpoints and start fresh in the same checkpoint directory.

Why

Issue #345 reported that NeMo RL always resumed from an existing root step_<N> checkpoint when one was present, even when the desired behavior was to reuse the checkpoint directory for a new run. That made it difficult to intentionally cold-start a run without manually moving or deleting prior checkpoints.

Root Cause

Checkpoint startup behavior was effectively hard-coded around get_latest_checkpoint_path(), so the algorithms always interpreted any existing root checkpoint as a resume signal. There was no startup-only API to distinguish:

- resume from the active root checkpoint set
- archive the active root checkpoint set and start from scratch
- ignore existing checkpoints entirely when checkpointing is disabled
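
The three cases above can be sketched as one pure decision function. This is a minimal illustration, not the code in this PR: `CheckpointingConfigSketch`, `StartupAction`, and `decide_startup_action` are hypothetical names, and only the `enabled` and `resume_if_exists` keys come from the description. The extra `COLD_START` case (checkpointing enabled but nothing found in the root) is an assumption added so the function is total.

```python
from enum import Enum
from typing import Optional, TypedDict


class CheckpointingConfigSketch(TypedDict):
    # Hypothetical stand-in for the typed checkpoint config; only the two
    # keys named in the PR description are modeled here.
    enabled: bool
    resume_if_exists: bool


class StartupAction(Enum):
    SKIP = "skip"  # checkpointing disabled: neither resume nor archive
    COLD_START = "cold_start"  # assumption: no root checkpoint was found
    RESUME = "resume"
    ARCHIVE_THEN_COLD_START = "archive_then_cold_start"


def decide_startup_action(
    cfg: CheckpointingConfigSketch,
    latest_root_checkpoint: Optional[str],
) -> StartupAction:
    """Map config plus the result of a pure lookup to a startup action."""
    if not cfg["enabled"]:
        # Existing checkpoints are ignored entirely.
        return StartupAction.SKIP
    if latest_root_checkpoint is None:
        # Nothing to resume from and nothing to archive.
        return StartupAction.COLD_START
    if cfg["resume_if_exists"]:
        return StartupAction.RESUME
    return StartupAction.ARCHIVE_THEN_COLD_START


if __name__ == "__main__":
    cfg: CheckpointingConfigSketch = {"enabled": True, "resume_if_exists": False}
    print(decide_startup_action(cfg, "results/step_40"))
```

Keeping the decision free of filesystem side effects mirrors the stated goal of leaving the lookup helpers pure: the caller performs the lookup, and only the archive branch needs to touch the disk.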

What Changed

- Added checkpointing.resume_if_exists to the typed checkpoint config.
- Added CheckpointManager.resolve_training_start_checkpoint() as the single startup decision point.
- Kept get_latest_checkpoint_path() and related lookup helpers pure, so they remain usable without side effects.
- When resume_if_exists=false, root step_<N> checkpoints are moved under the next run_<N>/ archive directory and training starts cold.
- When checkpointing.enabled=false, startup resume/archive behavior is skipped entirely.
- Updated the shared setup paths for GRPO, SFT, DPO, RM, and distillation to use the new startup API.
- Updated exemplar configs and standalone NeMo Gym configs to declare resume_if_exists: true explicitly.
- Documented the new behavior in the checkpointing design doc.
- Added unit coverage for resume, archive, archive numbering, and disabled-checkpointing behavior.
- Added a functional regression script that verifies resume_if_exists=false archives the previous run and produces a cold-start SFT run.
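
To make the archive step concrete, here is a rough sketch of how root step_<N> directories could be moved under the next run_<N>/ directory while the lookup stays side-effect free. It is not the CheckpointManager implementation: `numbered_dirs`, `latest_step_dir`, and `archive_root_steps` are hypothetical helpers, and choosing the next index as "highest existing run index plus one, starting at 0" is an assumption consistent with the run_0/ and run_1/ cases listed under Validation.

```python
import re
import shutil
from pathlib import Path
from typing import List, Optional, Tuple


def numbered_dirs(root: Path, prefix: str) -> List[Tuple[int, Path]]:
    """Return (index, path) for child dirs named <prefix>_<N>, sorted by N."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)$")
    found = []
    if root.is_dir():
        for child in root.iterdir():
            match = pattern.match(child.name)
            if match and child.is_dir():
                found.append((int(match.group(1)), child))
    # Sort numerically so step_10 ranks above step_9.
    return sorted(found, key=lambda item: item[0])


def latest_step_dir(root: Path) -> Optional[Path]:
    """Pure lookup: highest-numbered step_<N> directly under root, or None."""
    steps = numbered_dirs(root, "step")
    return steps[-1][1] if steps else None


def archive_root_steps(root: Path) -> Optional[Path]:
    """Move all root step_<N> dirs under the next run_<N>/ and return it.

    Returns None (and creates nothing) when there is nothing to archive.
    """
    steps = numbered_dirs(root, "step")
    if not steps:
        return None
    runs = numbered_dirs(root, "run")
    next_index = runs[-1][0] + 1 if runs else 0  # assumption: max + 1
    archive_dir = root / f"run_{next_index}"
    archive_dir.mkdir()
    for _, step_dir in steps:
        shutil.move(str(step_dir), str(archive_dir / step_dir.name))
    return archive_dir
```

After a call to `archive_root_steps`, `latest_step_dir` on the same root returns None because archived steps live one level down, so a cold start follows naturally from the same lookup that drives resume.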

Impact

Users can now safely reuse an existing checkpoint directory for a fresh run without manually cleaning up old checkpoints, while preserving previous runs under run_<N>/ for later inspection or conversion.

Validation

- python -m py_compile on all changed Python files
- bash -n tests/functional/sft_resume_if_exists_false.sh
- checkpoint-manager smoke coverage for:
  - resume from latest root checkpoint
  - archive to run_0/ and run_1/
  - disabled-checkpointing no-op startup behavior

Issue

Closes #345.

copy-pr-bot · 2026-04-03T17:55:04Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Add startup resume_if_exists checkpoint handling

1535632

github-actions bot added the documentation and community-request labels · Apr 3, 2026

taivu1998 wants to merge 1 commit into NVIDIA-NeMo:main from taivu1998:tdv/issue-345-resume-if-exists