[model] fix: fix gpt oss export #3271
Conversation
…ert weights (NVIDIA-NeMo#3250)" This reverts commit b139a4b. Signed-off-by: Yuki Huang <yukih@nvidia.com>
Walkthrough: The PR removes GPT-OSS-specific export transpose handling that was previously applied globally during HF safetensors weight export. This includes deleting utility functions (…).
/ok to test c4d5eff
Summary
Move the transpose of `down_proj` for GPT-OSS expert weights to the NeMo-RL side instead of the bridge side, to avoid a mismatched layout when saving checkpoints.

History:
- … `down_proj`. The import-side transpose was removed, leaving only the export transpose in `GPTOSSMLPDownProjMapping.megatron_to_hf`. This reconciled the Megatron `[in, out]` layout with the vLLM `[out, in]` layout.
- However, the `[out, in]` layout produced by the bridge transpose corrupted saved checkpoints: `save_hf_pretrained` wrote weights that didn't match the original HF `[in, out]` layout.

Root cause: the transpose is a NeMo-RL refit concern, not a bridge concern. The bridge's `megatron_to_hf` should always produce the standard HF layout.

Changes

- … (`transpose_on_export` adaptive wrapper in `GPTOSSMLPDownProjMapping`).
- Removed the `down_proj` transpose from `GPTOSSMLPDownProjMapping.megatron_to_hf`. The export path now returns weights in the correct HF `[in, out]` layout for checkpoint saving.

Testing
Validated. See validation steps and results in NVIDIA-NeMo/RL#2249.
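The split of responsibilities can be sketched as follows. This is a minimal illustration, not the real NeMo/Megatron-Bridge API: the function names `megatron_to_hf` and `refit_for_vllm` and the tensor shapes are hypothetical stand-ins. The point is that the bridge export keeps the HF `[in, out]` layout (so checkpoints round-trip cleanly), while the transpose to the vLLM `[out, in]` layout happens only on the NeMo-RL refit path.

```python
import numpy as np

# Hypothetical sketch of the post-fix responsibility split.
# HF stores the GPT-OSS expert down_proj in [in, out] layout;
# vLLM consumes it in [out, in] layout.

rng = np.random.default_rng(0)
hf_weight = rng.standard_normal((8, 4))  # HF layout: [in, out]

def megatron_to_hf(weight):
    """Bridge export: return the weight in standard HF [in, out] layout,
    with no GPT-OSS-specific transpose applied."""
    return weight

def refit_for_vllm(hf_layout_weight):
    """NeMo-RL refit: transpose to the [out, in] layout vLLM expects."""
    return hf_layout_weight.T

exported = megatron_to_hf(hf_weight)    # what gets saved to the checkpoint
vllm_weight = refit_for_vllm(exported)  # what gets handed to vLLM

assert exported.shape == (8, 4)              # checkpoint stays in HF layout
assert vllm_weight.shape == (4, 8)           # vLLM gets the transposed layout
assert np.array_equal(exported, hf_weight)   # save/load round-trips cleanly
```

With the pre-fix behavior, the transpose lived inside the export mapping itself, so `exported` would already be `[out, in]` and the saved checkpoint would no longer match the original HF layout.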