
Add startup resume_if_exists handling for checkpoint reuse #2206

Draft
taivu1998 wants to merge 1 commit into NVIDIA-NeMo:main from taivu1998:tdv/issue-345-resume-if-exists


@taivu1998

Summary

This PR adds explicit startup handling for checkpointing.resume_if_exists so training can either resume from the latest active root checkpoint or archive existing root checkpoints and start fresh in the same checkpoint directory.

Why

Issue #345 reported that NeMo RL always resumed from an existing root step_<N> checkpoint when one was present, even when the desired behavior was to reuse the checkpoint directory for a new run. That made it difficult to intentionally cold-start a run without manually moving or deleting prior checkpoints.

Root Cause

Checkpoint startup behavior was effectively hard-coded around get_latest_checkpoint_path(), so the algorithms always interpreted any existing root checkpoint as a resume signal. There was no startup-only API to distinguish:

  • resume from the active root checkpoint set
  • archive the active root checkpoint set and start from scratch
  • ignore existing checkpoints entirely when checkpointing is disabled
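The three cases above can be sketched as a small pure decision function. This is only an illustration, not the code in this PR: `StartupAction` and `decide_startup_action` are hypothetical names, and the extra `COLD_START` case (checkpointing enabled but nothing on disk) is an assumption about how an empty directory would be treated.

```python
from enum import Enum
from typing import Optional


class StartupAction(Enum):
    """Hypothetical labels for the startup outcomes described above."""

    RESUME = "resume"
    ARCHIVE_AND_COLD_START = "archive_and_cold_start"
    COLD_START = "cold_start"  # assumption: enabled, but no checkpoint found
    SKIP = "skip"  # checkpointing disabled: ignore existing checkpoints


def decide_startup_action(
    enabled: bool,
    resume_if_exists: bool,
    latest_checkpoint: Optional[str],
) -> StartupAction:
    """Map the two config flags and the lookup result to one startup action."""
    if not enabled:
        # Disabled checkpointing never resumes and never archives.
        return StartupAction.SKIP
    if latest_checkpoint is None:
        # Nothing to resume from or archive.
        return StartupAction.COLD_START
    if resume_if_exists:
        return StartupAction.RESUME
    return StartupAction.ARCHIVE_AND_COLD_START
```

Keeping the decision free of filesystem access means the checkpoint lookup can stay side-effect free and the archive step only runs once the action is known.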

What Changed

  • Added checkpointing.resume_if_exists to the typed checkpoint config.
  • Added CheckpointManager.resolve_training_start_checkpoint() as the single startup decision point.
  • Kept get_latest_checkpoint_path() and related lookup helpers pure, so they remain usable without side effects.
  • When resume_if_exists=false, root step_<N> checkpoints are moved under the next run_<N>/ archive directory and training starts cold.
  • When checkpointing.enabled=false, startup resume/archive behavior is skipped entirely.
  • Updated the shared setup paths for GRPO, SFT, DPO, RM, and distillation to use the new startup API.
  • Updated exemplar configs and standalone NeMo Gym configs to declare resume_if_exists: true explicitly.
  • Documented the new behavior in the checkpointing design doc.
  • Added unit coverage for resume, archive, archive numbering, and disabled-checkpointing behavior.
  • Added a functional regression script that verifies resume_if_exists=false archives the previous run and produces a cold-start SFT run.
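As a rough illustration of the archive behavior listed above, the sketch below resolves a start checkpoint from a directory containing `step_<N>` and `run_<N>` subdirectories. It is not the body of `CheckpointManager.resolve_training_start_checkpoint()`: the function name `resolve_start_checkpoint`, its signature, and the "highest existing `run_<N>` plus one" numbering rule are assumptions made for the example.

```python
import re
import shutil
from pathlib import Path
from typing import List, Optional, Pattern, Tuple

STEP_RE = re.compile(r"^step_(\d+)$")
RUN_RE = re.compile(r"^run_(\d+)$")


def _numbered_dirs(root: Path, pattern: Pattern[str]) -> List[Tuple[int, Path]]:
    """Return (index, path) pairs for matching subdirectories, sorted by index."""
    if not root.is_dir():
        return []
    found = []
    for child in root.iterdir():
        match = pattern.match(child.name)
        if match and child.is_dir():
            found.append((int(match.group(1)), child))
    return sorted(found, key=lambda item: item[0])


def resolve_start_checkpoint(
    root: Path, enabled: bool, resume_if_exists: bool
) -> Optional[Path]:
    """Hypothetical startup helper: return a checkpoint to resume from, or None."""
    if not enabled:
        return None  # checkpointing disabled: touch nothing on disk
    steps = _numbered_dirs(root, STEP_RE)
    if not steps:
        return None  # nothing to resume or archive
    if resume_if_exists:
        return steps[-1][1]  # latest root checkpoint
    # Assumed numbering rule: next index after the highest existing run_<N>.
    runs = _numbered_dirs(root, RUN_RE)
    next_index = runs[-1][0] + 1 if runs else 0
    archive = root / f"run_{next_index}"
    archive.mkdir()
    for _, step_dir in steps:
        shutil.move(str(step_dir), str(archive / step_dir.name))
    return None  # cold start in the now-empty root
```

Under these assumptions, calling the helper with `resume_if_exists=False` on a directory holding `step_2/` and `step_10/` moves both under `run_0/`, and a later archived run would land in `run_1/`.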

Impact

Users can now safely reuse an existing checkpoint directory for a fresh run without manually cleaning up old checkpoints, while preserving previous runs under run_<N>/ for later inspection or conversion.

Validation

  • python -m py_compile on all changed Python files (syntax check only)
  • bash -n tests/functional/sft_resume_if_exists_false.sh (syntax check only; the script is not executed)
  • checkpoint-manager smoke coverage for:
    • resume from latest root checkpoint
    • archive to run_0/ and run_1/
    • disabled-checkpointing no-op startup behavior

Issue

Closes #345.

@copy-pr-bot

copy-pr-bot bot commented Apr 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions bot added the documentation and community-request labels on Apr 3, 2026


Development

Successfully merging this pull request may close these issues.

[Feature Request] Add configurable resume_if_exists flag to control automatic checkpoint resumption
