
Expose fully_parallel_save in save_megatron_model #3207

Open
nic-nvidia wants to merge 2 commits into NVIDIA-NeMo:main from nic-nvidia:feat/expose-fully-parallel-save

Conversation

@nic-nvidia

@nic-nvidia nic-nvidia commented Apr 8, 2026

Summary

Add fully_parallel_save parameter to AutoBridge.save_megatron_model() and the underlying model_load_save.save_megatron_model(), forwarded to CheckpointConfig.

Defaults to True — no behavior change for existing callers.

Problem

save_megatron_model() always creates a CheckpointConfig with fully_parallel_save=True (the dataclass default). This activates FullyParallelSaveStrategyWrapper, which calls all_gather_object on DP sub-groups. When the distributed world includes ranks that never enter the save path (e.g., vLLM inference workers in NeMo RL non-colocated setups), these collectives deadlock permanently.

Callers have no way to disable this behavior through the public API.
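The failure mode above can be illustrated without torch at all. The toy below models a collective (like `all_gather_object` on a DP group) as a barrier that only completes once every rank in the group enters it; the names `save_rank` and `inference_rank` are illustrative, not from the Megatron-Bridge code. In the real system the collective blocks indefinitely rather than timing out:

```python
import threading

# Toy model of the hang: a collective completes only once every rank
# in the group has entered it, like all_gather_object on a DP group.
barrier = threading.Barrier(parties=2)  # a "DP group" of two ranks
result = {}

def save_rank():
    # Rank that enters the save path and posts the collective.
    try:
        barrier.wait(timeout=0.5)  # real collectives block forever
        result["rank0"] = "saved"
    except threading.BrokenBarrierError:
        result["rank0"] = "deadlock"

def inference_rank():
    # Rank that never enters the save path (e.g. a vLLM worker),
    # so it never reaches the collective.
    result["rank1"] = "idle"

t0 = threading.Thread(target=save_rank)
t1 = threading.Thread(target=inference_rank)
t0.start(); t1.start()
t0.join(); t1.join()
print(result)  # rank0 is stuck ("deadlock"); rank1 finished "idle"
```

The saving rank waits on a collective that its peer will never post, which is exactly the permanent hang described above, minus the timeout used here for demonstration.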

Fix

Thread fully_parallel_save: bool = True through both layers:

  • AutoBridge.save_megatron_model() → model_load_save.save_megatron_model() → CheckpointConfig(..., fully_parallel_save=...)
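The threading can be sketched as below. This is a simplified stand-in, not the real Megatron-Bridge code: the class and function bodies are reduced to the flag-forwarding that the PR adds, and the signatures are trimmed to the parameters relevant here.

```python
from dataclasses import dataclass

@dataclass
class CheckpointConfig:
    # Dataclass default stays True, matching the existing behavior.
    fully_parallel_save: bool = True

def save_megatron_model(model, path, fully_parallel_save: bool = True):
    # model_load_save layer: forward the flag into the config.
    cfg = CheckpointConfig(fully_parallel_save=fully_parallel_save)
    return cfg

class AutoBridge:
    def save_megatron_model(self, model, path, fully_parallel_save: bool = True):
        # Public API layer: forwards the flag to the training-side helper.
        return save_megatron_model(model, path, fully_parallel_save)

cfg = AutoBridge().save_megatron_model(None, "ckpt", fully_parallel_save=False)
print(cfg.fully_parallel_save)  # → False
```

Because the default is `True` at every layer, existing callers are unaffected; only callers that explicitly pass `False` opt out of the fully parallel save strategy.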

Motivation

Needed by NVIDIA-NeMo/RL#2226 which fixes a deadlock during HF-to-Megatron checkpoint conversion in non-colocated training/inference setups.

Test plan

Summary by CodeRabbit

  • New Features
    • Added a configurable option for model checkpoint saving in distributed training scenarios, allowing control over parallel save behavior during the checkpoint process.

Add fully_parallel_save parameter to AutoBridge.save_megatron_model()
and model_load_save.save_megatron_model(), forwarded to CheckpointConfig.

Defaults to True (no behavior change). Callers can pass False to disable
FullyParallelSaveStrategyWrapper, which deadlocks when the distributed
world includes ranks that do not participate in the save (e.g., vLLM
inference workers in NeMo RL non-colocated setups).

Needed by NVIDIA-NeMo/RL#2226.
@copy-pr-bot

copy-pr-bot bot commented Apr 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5e2d489e-a80c-4001-b51b-88a012a4a824

📥 Commits

Reviewing files that changed from the base of the PR and between 00ddf16 and f010b59.

📒 Files selected for processing (2)
  • src/megatron/bridge/models/conversion/auto_bridge.py
  • src/megatron/bridge/training/model_load_save.py

📝 Walkthrough

Walkthrough

A new optional parameter fully_parallel_save: bool = True was added to save_megatron_model functions across two files. This parameter controls whether all Data Parallel ranks must participate in save collectives, with the flag being propagated through the call chain from the conversion API to the core training save function.

Changes

Cohort: Save Model Parameter Addition
Files: src/megatron/bridge/models/conversion/auto_bridge.py, src/megatron/bridge/training/model_load_save.py
Summary: Added the fully_parallel_save: bool = True parameter to the save_megatron_model functions. Updated docstrings to document how the flag controls collective participation requirements for DP ranks during save operations. The parameter is propagated through the call chain to control the checkpoint save strategy.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes — ⚠️ Warning. The PR introduces a new parameter to public APIs, but validation is explicitly marked as pending; default behavior is verified, but full test results are not yet documented. Resolution: complete the pending validation on the NeMo RL non-colocated 120B recipe and document the test results in the PR description before merging.

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title directly and accurately describes the main change: exposing the fully_parallel_save parameter in save_megatron_model.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


…_megatron_model

The all_gather_object in determine_global_metadata (validation.py:518) uses
the default PG. When some ranks take longer to build their state dict (e.g.,
due to expert parallelism), the collective times out. For one-time conversion
saves, validation is unnecessary and can be safely skipped.

Also adds distributed_timeout_minutes for callers that need longer timeouts
during large model saves.
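One way such a caller-supplied timeout could be applied is sketched below. The parameter name distributed_timeout_minutes comes from the commit message; the helper function and the 30-minute fallback (torch.distributed's documented default process-group timeout) are illustrative assumptions, not the real implementation.

```python
from datetime import timedelta
from typing import Optional

def resolve_pg_timeout(distributed_timeout_minutes: Optional[int]) -> timedelta:
    """Turn an optional minutes value into the timedelta that
    torch.distributed process-group init expects, falling back to
    the default of 30 minutes when the caller passes nothing."""
    minutes = 30 if distributed_timeout_minutes is None else distributed_timeout_minutes
    return timedelta(minutes=minutes)

print(resolve_pg_timeout(None))  # 0:30:00
print(resolve_pg_timeout(120))   # 2:00:00
```

A longer timeout only papers over slow ranks; skipping the validation collective entirely, as the commit does for one-time conversion saves, removes the failure mode rather than delaying it.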
@yaoyu-33
Contributor

yaoyu-33 commented Apr 8, 2026

/ok to test 664a8a8

@yaoyu-33 yaoyu-33 added the ready-to-merge PR is approved, current, and only waiting for CI to pass before merge label Apr 8, 2026
@yaoyu-33 yaoyu-33 enabled auto-merge (squash) April 8, 2026 22:56
@yaoyu-33 yaoyu-33 added the needs-more-tests Requires additional L0 and L1 test coverage before merge label Apr 8, 2026
@yaoyu-33
Contributor

yaoyu-33 commented Apr 9, 2026

@nic-nvidia please fix the failing test

@yaoyu-33 yaoyu-33 added needs-author Author action is required before review or merge can continue feature New capabilities, enhancements, or enablement work area:ckpt Checkpoint conversion, loading, export, and save paths labels Apr 9, 2026
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 11, 2026
