
Expose fully_parallel_save in save_megatron_model #3207

Open
nic-nvidia wants to merge 2 commits into NVIDIA-NeMo:main from nic-nvidia:feat/expose-fully-parallel-save

Conversation

@nic-nvidia

@nic-nvidia nic-nvidia commented Apr 8, 2026

Summary

Add fully_parallel_save parameter to AutoBridge.save_megatron_model() and the underlying model_load_save.save_megatron_model(), forwarded to CheckpointConfig.

Defaults to True — no behavior change for existing callers.

Problem

save_megatron_model() always creates a CheckpointConfig with fully_parallel_save=True (the dataclass default). This activates FullyParallelSaveStrategyWrapper, which calls all_gather_object on DP sub-groups. When the distributed world includes ranks that never enter the save path (e.g., vLLM inference workers in NeMo RL non-colocated setups), these collectives deadlock permanently.

Callers have no way to disable this behavior through the public API.
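The failure mode above can be illustrated without torch at all. The toy below models a collective (like `all_gather_object` on a DP group) as a barrier that only completes once every rank in the group enters it; the names `save_rank` and `inference_rank` are illustrative, not from the Megatron-Bridge code. In the real system the collective blocks indefinitely rather than timing out:

```python
import threading

# Toy model of the hang: a collective completes only once every rank
# in the group has entered it, like all_gather_object on a DP group.
barrier = threading.Barrier(parties=2)  # a "DP group" of two ranks
result = {}

def save_rank():
    # Rank that enters the save path and posts the collective.
    try:
        barrier.wait(timeout=0.5)  # real collectives block forever
        result["rank0"] = "saved"
    except threading.BrokenBarrierError:
        result["rank0"] = "deadlock"

def inference_rank():
    # Rank that never enters the save path (e.g. a vLLM worker),
    # so it never reaches the collective.
    result["rank1"] = "idle"

t0 = threading.Thread(target=save_rank)
t1 = threading.Thread(target=inference_rank)
t0.start(); t1.start()
t0.join(); t1.join()
print(result)  # rank0 is stuck ("deadlock"); rank1 finished "idle"
```

The saving rank waits on a collective that its peer will never post, which is exactly the permanent hang described above, minus the timeout used here for demonstration.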

Fix

Thread fully_parallel_save: bool = True through both layers:

  • AutoBridge.save_megatron_model() → model_load_save.save_megatron_model() → CheckpointConfig(..., fully_parallel_save=...)
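The threading can be sketched as below. This is a simplified stand-in, not the real Megatron-Bridge code: the class and function bodies are reduced to the flag-forwarding that the PR adds, and the signatures are trimmed to the parameters relevant here.

```python
from dataclasses import dataclass

@dataclass
class CheckpointConfig:
    # Dataclass default stays True, matching the existing behavior.
    fully_parallel_save: bool = True

def save_megatron_model(model, path, fully_parallel_save: bool = True):
    # model_load_save layer: forward the flag into the config.
    cfg = CheckpointConfig(fully_parallel_save=fully_parallel_save)
    return cfg

class AutoBridge:
    def save_megatron_model(self, model, path, fully_parallel_save: bool = True):
        # Public API layer: forwards the flag to the training-side helper.
        return save_megatron_model(model, path, fully_parallel_save)

cfg = AutoBridge().save_megatron_model(None, "ckpt", fully_parallel_save=False)
print(cfg.fully_parallel_save)  # → False
```

Because the default is `True` at every layer, existing callers are unaffected; only callers that explicitly pass `False` opt out of the fully parallel save strategy.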

Motivation

Needed by NVIDIA-NeMo/RL#2226 which fixes a deadlock during HF-to-Megatron checkpoint conversion in non-colocated training/inference setups.

Test plan

Summary by CodeRabbit

  • New Features
    • Added a configurable option for model checkpoint saving in distributed training scenarios, allowing control over parallel save behavior during the checkpoint process.

Add fully_parallel_save parameter to AutoBridge.save_megatron_model()
and model_load_save.save_megatron_model(), forwarded to CheckpointConfig.

Defaults to True (no behavior change). Callers can pass False to disable
FullyParallelSaveStrategyWrapper, which deadlocks when the distributed
world includes ranks that do not participate in the save (e.g., vLLM
inference workers in NeMo RL non-colocated setups).

Needed by NVIDIA-NeMo/RL#2226.
@copy-pr-bot

copy-pr-bot bot commented Apr 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5e2d489e-a80c-4001-b51b-88a012a4a824

📥 Commits

Reviewing files that changed from the base of the PR and between 00ddf16 and f010b59.

📒 Files selected for processing (2)
  • src/megatron/bridge/models/conversion/auto_bridge.py
  • src/megatron/bridge/training/model_load_save.py

📝 Walkthrough

Walkthrough

A new optional parameter fully_parallel_save: bool = True was added to save_megatron_model functions across two files. This parameter controls whether all Data Parallel ranks must participate in save collectives, with the flag being propagated through the call chain from the conversion API to the core training save function.

Changes

Cohort: Save Model Parameter Addition
Files: src/megatron/bridge/models/conversion/auto_bridge.py, src/megatron/bridge/training/model_load_save.py
Summary: Added the fully_parallel_save: bool = True parameter to the save_megatron_model functions. Updated docstrings to document how the flag controls collective participation requirements for DP ranks during save operations. The parameter is propagated through the call chain to control the checkpoint save strategy.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes — ⚠️ Warning. The PR introduces a new parameter to public APIs, but validation is explicitly marked as pending; default behavior is verified, but full test results are not yet documented. Resolution: complete the pending validation on the NeMo RL non-colocated 120B recipe and document the test results in the PR description before merging.

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title directly and accurately describes the main change: exposing the fully_parallel_save parameter in save_megatron_model.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


…_megatron_model

The all_gather_object in determine_global_metadata (validation.py:518) uses
the default PG. When some ranks take longer to build their state dict (e.g.,
due to expert parallelism), the collective times out. For one-time conversion
saves, validation is unnecessary and can be safely skipped.

Also adds distributed_timeout_minutes for callers that need longer timeouts
during large model saves.
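One way such a caller-supplied timeout could be applied is sketched below. The parameter name distributed_timeout_minutes comes from the commit message; the helper function and the 30-minute fallback (torch.distributed's documented default process-group timeout) are illustrative assumptions, not the real implementation.

```python
from datetime import timedelta
from typing import Optional

def resolve_pg_timeout(distributed_timeout_minutes: Optional[int]) -> timedelta:
    """Turn an optional minutes value into the timedelta that
    torch.distributed process-group init expects, falling back to
    the default of 30 minutes when the caller passes nothing."""
    minutes = 30 if distributed_timeout_minutes is None else distributed_timeout_minutes
    return timedelta(minutes=minutes)

print(resolve_pg_timeout(None))  # 0:30:00
print(resolve_pg_timeout(120))   # 2:00:00
```

A longer timeout only papers over slow ranks; skipping the validation collective entirely, as the commit does for one-time conversion saves, removes the failure mode rather than delaying it.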
@yaoyu-33
Contributor

yaoyu-33 commented Apr 8, 2026

/ok to test 664a8a8

@yaoyu-33 yaoyu-33 added the ready-to-merge PR is approved, current, and only waiting for CI to pass before merge label Apr 8, 2026
@yaoyu-33 yaoyu-33 enabled auto-merge (squash) April 8, 2026 22:56
@yaoyu-33 yaoyu-33 added the needs-more-tests Requires additional L0 and L1 test coverage before merge label Apr 8, 2026
@yaoyu-33
Contributor

yaoyu-33 commented Apr 9, 2026

@nic-nvidia please fix the failing test

@yaoyu-33 yaoyu-33 added needs-author Author action is required before review or merge can continue feature New capabilities, enhancements, or enablement work area:ckpt Checkpoint conversion, loading, export, and save paths labels Apr 9, 2026
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 11, 2026
