Add startup resume_if_exists handling for checkpoint reuse #2206
Draft
taivu1998 wants to merge 1 commit into NVIDIA-NeMo:main from
Summary
This PR adds explicit startup handling for `checkpointing.resume_if_exists` so training can either resume from the latest active root checkpoint or archive existing root checkpoints and start fresh in the same checkpoint directory.

Why
Issue #345 reported that NeMo RL always resumed from an existing root `step_<N>` checkpoint when one was present, even when the desired behavior was to reuse the checkpoint directory for a new run. That made it difficult to intentionally cold-start a run without manually moving or deleting prior checkpoints.

Root Cause
Checkpoint startup behavior was effectively hard-coded around `get_latest_checkpoint_path()`, so the algorithms always interpreted any existing root checkpoint as a resume signal. There was no startup-only API to distinguish an intentional resume from an intentional cold start in the same directory.

What Changed
- Added `checkpointing.resume_if_exists` to the typed checkpoint config.
- Added `CheckpointManager.resolve_training_start_checkpoint()` as the single startup decision point.
- Kept `get_latest_checkpoint_path()` and related lookup helpers pure, so they remain usable without side effects.
- With `resume_if_exists=false`, root `step_<N>` checkpoints are moved under the next `run_<N>/` archive directory and training starts cold.
- With `checkpointing.enabled=false`, startup resume/archive behavior is skipped entirely.
- Default configs set `resume_if_exists: true` explicitly, preserving the previous resume behavior.
- Added a functional test in which `resume_if_exists=false` archives the previous run and produces a cold-start SFT run.

Impact
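In practice, the new knob is driven from the checkpoint config. A minimal illustrative fragment (only `enabled` and `resume_if_exists` are named in this PR; the surrounding config structure is assumed):

```yaml
checkpointing:
  enabled: true
  # true  -> resume from the latest root step_<N> checkpoint, if any
  # false -> archive existing step_<N> dirs under run_<N>/ and start cold
  resume_if_exists: false
```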
Users can now safely reuse an existing checkpoint directory for a fresh run without manually cleaning up old checkpoints, while preserving previous runs under `run_<N>/` for later inspection or conversion.

Validation
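For reference, archiving with `resume_if_exists=false` leaves a layout along these lines (illustrative step numbers, not from the PR):

```
checkpoints/
├── run_0/            # archived previous run
│   ├── step_100/
│   └── step_200/
└── step_300/         # checkpoints written by the new cold-start run
```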
- `python -m py_compile` on all changed Python files
- `bash -n tests/functional/sft_resume_if_exists_false.sh`
- Manual check that successive cold starts archive prior checkpoints under `run_0/` and `run_1/`

Issue
Closes #345.
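To make the startup decision concrete, here is a minimal self-contained sketch of the resume/archive logic described under What Changed. It is an illustration, not the NeMo RL implementation: the function names follow the PR description, but the signatures, config plumbing, and exact archive layout are assumptions.

```python
import shutil
from pathlib import Path


def get_latest_checkpoint_path(ckpt_dir: Path):
    """Pure lookup: return the highest-numbered root step_<N> dir, or None."""
    steps = sorted(
        (p for p in ckpt_dir.glob("step_*") if p.is_dir()),
        key=lambda p: int(p.name.split("_")[1]),
    )
    return steps[-1] if steps else None


def resolve_training_start_checkpoint(
    ckpt_dir: Path, *, enabled: bool, resume_if_exists: bool
):
    """Single startup decision point: a checkpoint path to resume from,
    or None for a cold start (archiving any existing root checkpoints)."""
    if not enabled:
        return None  # checkpointing disabled: skip resume/archive entirely
    latest = get_latest_checkpoint_path(ckpt_dir)
    if latest is None:
        return None  # empty directory: nothing to resume or archive
    if resume_if_exists:
        return latest  # resume from the latest active root checkpoint
    # Cold start: move all root step_<N> checkpoints under the next run_<N>/
    run_ids = [
        int(p.name.split("_")[1]) for p in ckpt_dir.glob("run_*") if p.is_dir()
    ]
    archive_dir = ckpt_dir / f"run_{max(run_ids, default=-1) + 1}"
    archive_dir.mkdir()
    for step in sorted(ckpt_dir.glob("step_*")):
        shutil.move(str(step), str(archive_dir / step.name))
    return None
```

Note that the lookup helper stays side-effect free; all moves happen only inside the startup decision point, matching the "pure lookup helpers" point above.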