
[Cherry-pick] PRs #1352 #1351 #1330 #1354 #1355 #1360 #1342 #1324 #1340 #1368 #1373 #1359 #1361 #1325 #1369 #1370 #1371 #1375 #1386 #1353 #1356 #1390 #1385

Merged
kevalmorabia97 merged 24 commits into release/0.44.0 from cherry-picks/release-0.44.0
May 5, 2026

Conversation

Collaborator

@kevalmorabia97 kevalmorabia97 commented May 4, 2026

Cherry-picked PRs

Summary by CodeRabbit

  • New Features

    • Added Python 3.14 support (basic unit tests verified; production defaults on Python 3.12)
    • Added Windows CUDA 13.x installation guidance
    • Introduced LLM ONNX export utilities with quantization support
    • Extended Medusa mode support in speculative decoding pipeline
  • Bug Fixes

    • Fixed FP8 quantization for vision transformer multi-head attention
    • Improved MoE expert handling in quantization calibration and inference
    • Enhanced ONNX graph utilities for FP8 weight transformation
  • Documentation

    • Comprehensive Minitron pruning + distillation + quantization + vLLM tutorials with ablation studies
    • Megatron data preparation guide for tokenization workflows
    • Puzzletron distillation results and cross-reference updates

kevalmorabia97 and others added 17 commits May 4, 2026 12:49
Adds `.claude/skills/release-cherry-pick/SKILL.md` — a Claude Code skill
for cherry-picking labeled PRs to a release branch.

Invoke with `/release-cherry-pick <version>`.

See this PR created with the skill:
#1350

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added automated release cherry-pick workflow to streamline selecting
and applying multiple PRs into release branches.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
…#1351)

PR #1289 (FP8 MHA quantization for ViT) was merged to `main` after
`0.44.0rc1` was tagged, so the rc1 wheel ships without the
`nn.LayerNorm` registration that the example's `_FP8_MHA_OVERRIDE` now
references — surfaced as nvbug 6114983 (`ValueError: parent_class
'nn.LayerNorm' not found in QuantModuleRegistry` when running
`torch_quant_to_onnx.py --quantize_mode=fp8`). PR #1289 is labeled
`cherry-pick-0.44.0` and will be cherry-picked to `release/0.44.0` for
the next rc, so the feature ships in 0.44 — this PR moves the
corresponding release-notes bullet from the `0.45 (Future)` section to
`0.44 (2026-05-xx)` to match.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…int (#1330)

### What does this PR do?

Type of change: Bug fix

Fixes
`https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/puzzletron/main.py`,
where a multi-GPU run caused only part of the file
`model.safetensors.index.json` to be written to disk.

### Usage

does not apply

### Testing

Follow [instructions, step
3](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/puzzletron#compress-the-model)
- run with `--nproc_per_node 2`

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A

### Additional Information
<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added a public checkpoint-saving entry that consolidates distributed
sharded model shards into a single filesystem checkpoint; retains direct
saving for single-process runs.

* **Refactor**
* Validation/evaluation tooling now uses the consolidated
checkpoint-saving flow when persisting realized model checkpoints during
runs.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…1354)

## Summary

- NVBug: [6103846](https://nvbugspro.nvidia.com/bug/6103846) —
`Qwen3-30B-A3B nvfp4_awq` quantization fails at export with
`AssertionError: Modules have different quantization formats`.
- Root cause: in `model_calib.awq_lite`, MoE experts that end up
disabled (NaN in act/weight scales, or no search-pass tokens) get
`max_calibrate`-d but no `pre_quant_scale`. `get_quantization_format`
then returns `nvfp4` for those experts while siblings stay `nvfp4_awq`.
`unified_export_hf.requantize_resmooth_fused_llm_layers` groups all 128
experts of each linear name (gate_proj/down_proj/up_proj) and calls
`preprocess_linear_fusion(..., resmooth_only=True)`, which asserts
uniform format → fires for any single mismatched expert.
- Fix: unify the disabled-expert paths in the awq_lite postprocess loop
so any expert with `is_enabled == False` (no cache hits, NaN scales, or
no search-pass tokens) receives `max_calibrate` + a neutral all-ones
`pre_quant_scale`, matching the existing behavior for `num_cache_steps
== 0`. Emit a warning so users notice that calibration coverage is
incomplete and accuracy may degrade.

## Test plan

- [x] `pytest tests/unit/torch/quantization/test_calib.py -k 'awq'` → 5
passed
- [x] End-to-end on `Qwen/Qwen3-30B-A3B` with `NVFP4_AWQ_LITE_CFG` and a
small calib set that leaves many experts uncalibrated:
- All 6144 gate_proj/up_proj/down_proj expert linears report `nvfp4_awq`
(no mismatch)
  - `export_hf_checkpoint` succeeds with no `AssertionError`
- The new "Forcing pre_quant_scale=1 ... may degrade accuracy" warning
fires for each affected expert
- [x] Re-run via `examples/llm_ptq/hf_ptq.py` with the bug-report CLI
(cnn_dailymail, batch_size=8, calib_size=64 — scaled down from 512 to
fit budget) on B200:
- 36 "the second time did not forward data through
..experts.X.{gate,up,down}_proj" warnings — i.e. the exact
bug-triggering condition from the original NVBug log naturally
reproduces
- 2058 "Forcing pre_quant_scale=1" warnings — fix path activates for
uncalibrated/disabled experts
  - 0 `AssertionError`s — export completes
- `Quantized model exported to: /tmp/test_plan_qwen3-30b-a3b-nvfp4_awq`
and post-PTQ generation works

---------

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Bug fix

Previously, the regex required a dot after the integer layer number, but
the dot is not always present.

### Usage


### Testing
<!-- Mention how have you tested your change if applicable. -->

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain
why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
<!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory
for new features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes
or backward incompatible changes. -->

### Additional Information
<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved detection and handling of pipeline-parallel layer indices in
model names to correctly support layer identifiers positioned at the end
of submodule names, enhancing compatibility with various naming
conventions in distillation workflows.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Bug fix

The config
`examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/pruning/attn_pruning.yaml`
was missing the keys required to use attention pruning in the example
`examples/puzzletron/main.py`.

### Usage

### Testing

In
`examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/Llama-3_1-8B.yaml`
change `ffn_pruning` to `attn_pruning`

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A

### Additional Information
<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Updated pruning configuration for improved KV-head pruning support,
including enhanced importance hook settings and attention output
handling for memory optimization.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Bug fix

- Enable EP (expert parallelism) import for HF to MCore when using TE
Spec
- Fix bug in mamba moe config which doesn't skip attention layers
properly in MCore (Mcore uses different naming for attention layers than
HF)
- Add getter for Quant Config (used in MLM modelopt examples to get
quant cfg fields)

### Usage

```shell
# In Megatron-LM/examples/post_training/modelopt
MLM_EXTRA_ARGS="--export-default-te-spec --trust-remote-code --moe-router-dtype fp32" EP=4 HF_MODEL_CKPT=</path/to/hf> MLM_MODEL_SAVE=<save/path> ./convert.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
```

### Testing
<!-- Mention how have you tested your change if applicable. -->

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain
why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
<!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory
for new features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes
or backward incompatible changes. -->

### Additional Information
<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Corrected expert-slice assignment so each expert-parallel rank loads
the proper expert slice.
* Improved detection of pipeline-parallel layer indices in submodule
names.

* **Improvements**
* Relaxed constraints between local and global expert counts for
grouped-local-expert imports.
* Added typed helpers for managing quantization configuration entries
and expanded quantizer disable patterns.
  * Exporter now accepts an additional hybrid model type when available.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…ributeError on custom configs (#1324)

### What does this PR do?

Type of change: Bug fix <!-- Use one of the following: Bug fix, new
feature, new example, new tests, documentation. -->

<!-- Details about the change. -->

- Summary: Running hf_ptq.py on stepfun-ai/Step-3.5-Flash (and any model
whose custom HF config doesn't assign use_cache) crashed in
get_max_batch_size() with AttributeError: 'Step3p5Config' object has no
attribute 'use_cache' before calibration could start.
- Extract the existing "disable KV cache during calibration" logic into
a _disable_use_cache(model) context manager, apply it to both
get_max_batch_size and _forward_loop. The CM sets config.use_cache =
False unconditionally (not only when the attribute exists) and restores
the prior value on exit if one was set.
- Behavior unchanged for normal configs; the NemotronH hybrid-cache
correctness guarantee from #1251 is preserved.

### Usage


### Testing
<!-- Mention how have you tested your change if applicable. -->

Step-3.5-Flash PTQ now passes get_max_batch_size

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain
why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
<!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory
for new features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes
or backward incompatible changes. -->

### Additional Information
<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Improved memory handling during model evaluation and calibration by
consistently disabling KV cache for both single-batch probes and full
dataloader runs, simplifying and stabilizing inference flow and ensuring
cache state is managed reliably.

* **Tests**
* Added unit tests verifying cache-state handling across models with and
without cache settings, including correct restoration behavior even when
errors occur.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…(Qwen3.5-MoE VLM) (#1340)

### What does this PR do?

Type of change: Bug fix

Fixes a 4-bug cascade that caused silent PTQ failure on Qwen3.5-MoE VLMs
(Qwen3.6-35B-A3B): calibration appeared to succeed but produced
token-salad at inference. Root cause: HF's `@use_experts_implementation`
dispatches expert forward to `torch._grouped_mm` / `torch.bmm`,
bypassing the `F.linear` hook that captures activations — so
`gate_up_proj_input_quantizer` / `down_proj_input_quantizer` never
calibrated and no `input_scale` tensors were emitted.

Changes:

- `examples/llm_ptq/hf_ptq.py` — force
`config._experts_implementation = "eager"` (recursing into `text_config`
/ `vision_config` / …) so per-expert `F.linear` calls are visible to the
calibration hook.
- `modelopt/torch/quantization/conversion.py` — normalize plural
ModuleList quantizer names (`weight_quantizers.N` → `weight_quantizer`)
before fnmatch, so wildcards like `*mlp.experts*weight_quantizer` match
fused-expert quantizers.
- `modelopt/torch/export/unified_export_hf.py` — hoist the
`_QuantFusedExperts` export branch above the `get_quantization_format()`
gate so `_export_fused_experts()` runs even when the top-level format
query returns `QUANTIZATION_NONE` (happens for experts-only recipes).
- `modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml` — set
`layerwise: false` (the VLM nested layer structure breaks the layerwise
walker).
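The name-normalization step in `conversion.py` can be sketched as follows (helper names here are illustrative; the real normalization may cover more quantizer kinds):

```python
import fnmatch
import re

def _normalize_quantizer_name(name: str) -> str:
    """Fold plural ModuleList quantizer entries (e.g. 'weight_quantizers.3')
    into the singular form before wildcard matching."""
    return re.sub(r"_quantizers\.\d+", "_quantizer", name)

def quantizer_matches(pattern: str, name: str) -> bool:
    """Match a user wildcard against the normalized quantizer name."""
    return fnmatch.fnmatch(_normalize_quantizer_name(name), pattern)
```

Without the normalization, `*mlp.experts*weight_quantizer` cannot match `...weight_quantizers.17`, so fused-expert quantizers silently escape the disable patterns.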

<!-- Details about the change. -->

### Usage

```shell
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path Qwen/Qwen3.6-35B-A3B \
    --qformat nvfp4 \
    --kv_cache_qformat fp8 \
    --calib_size 512 \
    --export_path Qwen3.6-35B-A3B-NVFP4
```

### Testing
<!-- Mention how have you tested your change if applicable. -->

End-to-end PTQ → vLLM deploy → NEL eval on Qwen3.6-35B-A3B (256 experts
× 40 layers, 35B params):

Hook-call diagnostic: 0 → 6720 per-expert `F.linear` calls during
calibration after the fix; 0 → 30720 `input_scale` tensors emitted in
the exported checkpoint.

FP8 fused-MoE path still produces gibberish — separate follow-up (vLLM
per-expert weight_scale handling).
* vLLM full-FP8: the FlashInfer TRTLLM Fp8MoE loader doesn't stack the
256 per-expert scalar weight_scale tensors
into a [num_experts] per-expert vector — it ends up applying one
expert's scale across all 256, so every
routed expert dequants with the wrong amplitude → coherent token stream
collapses into multilingual gibberish.

* SGLang full-FP8: qwen3_5.py::_make_packed_weight_loader rejects with
AssertionError: Unexpected scalar for
tuple shard load: loaded_shard_id=(0,1,2), split_sizes=[1,1,1] — its
packed-loader has no path for "N
independent per-tensor source scalars combining into one fused-shard
parameter," so the fused QKV (or
in_proj_qkvz) load is structurally refused and the model never finishes
loading.
### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain
why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
<!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory
for new features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes
or backward incompatible changes. -->

### Additional Information
<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Better fused-expert export flow, a plugin to force eager expert
execution during calibration/export, and a representative quantizer
discovery utility.

* **Bug Fixes**
* Reliable matching/discovery of per-expert indexed quantizers enabling
correct calibration and mixed-precision export; fixes for calibration in
nested decoder layouts.

* **Documentation**
  * Clarified PTQ config guidance on layerwise calibration.

* **Tests**
* Added fused-experts calibration, export, and name-normalization tests.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…n with Cuda13.x (#1368)

### What does this PR do?

Type of change: Documentation

<!-- Details about the change. -->
Update windows documentation for onnxruntime quantization with Cuda13.x

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?:  N/A <!--- If ❌, explain why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A <!---
Mandatory -->
- Did you write any new necessary tests?: N/A <!--- Mandatory for new
features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A <!--- Only for new features, API changes, critical bug fixes or
backward incompatible changes. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Updated Windows installation guide with CUDA 13.x-specific setup
instructions for GPU-accelerated dependencies, including CuPy and ONNX
Runtime configuration with nightly builds.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: ynankani <ynankani@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Type of change: Bug fix <!-- Use one of the following: Bug fix, new
feature, new example, new tests, documentation. -->

<!-- Details about the change. -->

The offline dflash regression test can be run on 1 or 2 GPUs. With 2
GPUs, the total number of steps is half that of the 1-GPU run. This PR
relaxes the failure threshold for the 2-GPU tests.
<!-- Mention how have you tested your change if applicable. -->

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain
why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
<!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory
for new features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes
or backward incompatible changes. -->
<!-- E.g. related issue. -->

Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Minor bug fix

- Put quantization steps inside try-finally to ensure removal of temp
files on error in ONNX INT4 quantization.
- To avoid redundancy between awq_lite() and awq_clip() methods, created
a utility _remove_augmented_onnx() for exception-handling based removal
of augmented onnx file and its data file.
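The try-finally structure can be sketched like this (a simplified stand-in: `quantize_int4` and the `.augmented.onnx` name are hypothetical, and file creation stands in for the real augmentation step):

```python
import logging
import os

def _remove_augmented_onnx(onnx_path: str) -> None:
    """Best-effort removal of the augmented ONNX file and its external-data
    companion; failures are logged rather than raised."""
    for path in (onnx_path, onnx_path + "_data"):
        try:
            if os.path.exists(path):
                os.remove(path)
        except OSError as exc:
            logging.warning("Could not remove %s: %s", path, exc)

def quantize_int4(onnx_path: str) -> None:
    augmented = onnx_path + ".augmented.onnx"  # hypothetical temp artifact
    try:
        open(augmented, "w").close()           # stand-in for augmentation + calibration
        # ... awq_lite / awq_clip quantization would run here ...
    finally:
        _remove_augmented_onnx(augmented)      # cleanup runs even if quantization raises
```

Sharing `_remove_augmented_onnx()` between `awq_lite()` and `awq_clip()` keeps the cleanup behavior identical on both paths.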

### Testing

- Locally performed ONNX INT4 awq-lite and awq-clip quantization with
Llama 1B model.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain
why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
<!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory
for new features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes
or backward incompatible changes. -->

### Additional Information
<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Improved reliability of the quantization pipeline by ensuring
temporary conversion artifacts are always removed, making cleanup more
robust.
* Consolidated handling of external-data companions and added safer
deletion behavior that logs failures instead of raising errors.
* Ensured consistent session teardown and forced memory collection to
reduce resource leakage and intermittent errors during model conversion.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: vipandya <vipandya@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Removes a return statement from the code that checks the remote
auto-tuning config arguments, since returning there skipped adding the
actual remote tuning config to the trtexec command.

**Root cause**: The necessary flags do not get added to
`self._base_cmd.extend(trtexec_args)` when remote autotuning is enabled.

**Before fix**:
```
['trtexec', '--avgRuns=100', '--iterations=100', '--warmUp=50', '--stronglyTyped', \
    '--saveEngine=engine.trt', '--timingCacheFile=trtexec_timing.cache', \
    '--onnx=baseline.onnx']
```
**After fix**:
```
['trtexec', '--avgRuns=100', '--iterations=100', '--warmUp=50', '--stronglyTyped', \
    '--saveEngine=engine.trt', '--timingCacheFile=trtexec_timing.cache', \
    '--remoteAutoTuningConfig=$CONFIG', '--safe', '--skipInference', \
    '--onnx=baseline.onnx']
```
Notice that the remote autotuning and related flags are now included in
the `trtexec` command.

**Related PR**: #1259

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Bug Fixes
* Fixed an issue where remote autotuning configuration arguments were
not being properly included in benchmark commands, ensuring all remote
autotuning settings are now correctly applied during execution.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: dmoodie <dmoodie@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…val + Quantize + vLLM deployment (#1325)

## Summary

End-to-end optimization walkthrough for Nemotron-Nano-9B-v2 showing how
ModelOpt techniques stack:

- **Pruning** — Minitron structured pruning 9B → 7B
- **Distillation** — Megatron-Bridge knowledge distillation up to 80B
tokens; near-parity with official 9B on MMLU Pro, GPQA, LCB, AIME, Math
500, IFEval, SciCode
- **Evaluation** - using nemo-evaluator
- **Quantization** — FP8 PTQ via `hf_ptq.py`; checkpoint deployable on
vLLM/TRT-LLM/SGLang with no extra flags (quantization auto-detected from
`config.json`)
- **vLLM Throughput** — BF16 vs FP8 benchmark on single H100

<img width="2085" height="1740" alt="image"
src="https://github.com/user-attachments/assets/8620a019-5c09-4a6b-a5d2-ca164aaa5d87"
/>

<img width="2085" height="810" alt="image"
src="https://github.com/user-attachments/assets/742c8035-f1fb-4394-b11b-0c6c3ac4e843"
/>

### Files changed

- `examples/pruning/minitron/README.md` — index page for Minitron
end-to-end tutorials
- `examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md` —
full repro doc with 6 sections: data prep, pruning, distillation,
evaluation, FP8 quantization, vLLM benchmarking
-
`examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/nemo_evaluator.yaml`
— NeMo Evaluator config used for all benchmark numbers
- `examples/pruning/puzzletron/README.md` — index page for Puzzletron
distillation results
- `examples/pruning/puzzletron/Llama-3.1-8B-Instruct.md` — Puzzletron
distillation results (renamed from puzzletron.md)
- `examples/pruning/README.md` — updated Results section with direct
links to new locations
- `examples/megatron_bridge/README.md` — updated results link to point
to `examples/pruning/`
- `examples/puzzletron/README.md` — updated distillation results link
- `examples/dataset/MEGATRON_DATA_PREP.md` — tokenization commands for
all datasets used in the data blend

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Documentation

* **New end-to-end tutorial** for model optimization covering Minitron
pruning, knowledge distillation, FP8 quantization, and vLLM deployment
with reproducibility steps and benchmark results
* **Dataset preparation guide** with ready-to-run tokenization templates
for Nemotron HuggingFace datasets
* **Evaluation configuration** and results documentation including
ablation studies across multiple benchmarks
* **Updated navigation** across pruning, distillation, and dataset
examples to streamline user workflows

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Bug fix

cuDNN 9.21 added a new DLL dependency, cudnn_engines_tensor_ir64_9.dll,
that ort.preload_dlls() has not been updated to load on Windows, so
loading cuDNN fails when only the nvidia-cudnn-cu12>9.20 package is
installed. This PR adds code to load any extra DLLs from the
site-packages folder that the preload function misses.
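The approach can be sketched roughly as follows. This is a hypothetical illustration, not the actual `ort_utils.py` implementation: the function names `find_missing_dlls` and `load_extra_dlls` are made up here, and only the idea (scan the cuDNN bin directory, load whatever the preload step skipped) is taken from the PR.

```python
import ctypes
import platform
from pathlib import Path


def find_missing_dlls(cudnn_bin_dir, already_loaded):
    """Return DLL paths in a cuDNN bin directory that a preload step skipped.

    Sketch of the idea described above: newer cuDNN wheels may ship DLLs
    (e.g. cudnn_engines_tensor_ir64_9.dll) that ort.preload_dlls() does
    not yet know about.
    """
    loaded = {name.lower() for name in already_loaded}
    return sorted(
        p for p in Path(cudnn_bin_dir).glob("*.dll") if p.name.lower() not in loaded
    )


def load_extra_dlls(dll_paths):
    """Best-effort load of the leftover DLLs; a no-op off Windows."""
    if platform.system() != "Windows":  # ctypes.WinDLL exists only on Windows
        return
    for dll in dll_paths:
        try:
            ctypes.WinDLL(str(dll))
        except OSError as exc:
            print(f"warning: could not load {dll.name}: {exc}")
```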

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Improved Windows cuDNN detection and loading for ONNX Runtime with
CUDA by scanning installed cuDNN packages and attempting to load any
missing DLLs to reduce startup failures.
* Enhanced logging and diagnostics: preload output is now surfaced as
warnings and individual DLL load successes/failures are logged to aid
troubleshooting.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
## Problem
When `training.mode == "medusa"` is used in `main.py`, the `data_module`
variable is never assigned because line 344 only covered `eagle3` and
`dflash` modes. This causes an `UnboundLocalError` when the trainer is
constructed with `**data_module`.

Fixes OMNIML-4147

## Fix
Add `"medusa"` to the `training_args.mode in ("eagle3", "dflash")`
condition so `data_module` is correctly populated for medusa training.
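A minimal sketch of the bug and the fix (function name and dict contents are placeholders, not the actual `main.py` code):

```python
def build_data_module(mode):
    """Before the fix, only ("eagle3", "dflash") bound data_module, so
    mode == "medusa" left it unassigned and trainer construction with
    **data_module raised UnboundLocalError."""
    if mode in ("eagle3", "dflash", "medusa"):  # "medusa" was missing
        data_module = {"train_dataset": "...", "eval_dataset": "..."}
    else:
        data_module = {}  # explicit default keeps the variable bound
    return data_module
```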

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Fixed speculative decoding example to properly handle "medusa" mode
alongside existing "eagle3" and "dflash" modes.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…1371)

## Problem
When training with a model that has no `chat_template` in its tokenizer
(e.g. base Llama-3.2 models), `_post_process_chat_template()` crashes:
```
AttributeError: 'NoneType' object has no attribute 'replace'
```
The DeepSeek WAR at the top of `_post_process_chat_template` called
`.replace()` directly on `self.tokenizer.chat_template` without checking
for `None` first.

Fixes NVBug 6120958

## Fix
Add an early return when `chat_template is None`. The existing check at
line 164 (`if self.tokenizer.chat_template is None: raise ValueError`)
still provides a clear error message if no valid template is available
after post-processing.
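The guard can be sketched like this; the `.replace()` call below is only a stand-in for the DeepSeek workaround, not the real post-processing logic:

```python
def post_process_chat_template(chat_template):
    """Sketch of the early-return fix: tokenizers without a chat template
    (e.g. base Llama-3.2) previously crashed with AttributeError because
    .replace() ran on None."""
    if chat_template is None:
        return None  # a later check still raises a clear ValueError
    return chat_template.replace("{% raw %}", "")  # placeholder fixup
```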

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Fixed a crash in chat template processing that occurred when a chat
template configuration was not set, improving system stability and
reliability during initialization.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 requested review from a team as code owners May 4, 2026 19:53
@kevalmorabia97 kevalmorabia97 requested review from yeyu-nvidia and removed request for a team May 4, 2026 19:53
@kevalmorabia97 kevalmorabia97 requested review from cjluo-nv and removed request for a team May 4, 2026 19:53
@coderabbitai

coderabbitai Bot commented May 4, 2026

📝 Walkthrough


This PR introduces Python 3.14 support, adds a Claude AI skill for automated cherry-picking of PRs to release branches, documents an end-to-end Minitron pruning-to-deployment tutorial, enhances MoE expert quantization handling across export and plugins, adds distributed checkpoint saving for large models, introduces new ONNX export utilities, improves dataset utilities with KV cache management, and fixes Windows ONNX Runtime cuDNN DLL loading.

Changes

Python 3.14 Support

Layer / File(s) Summary
Version Constraints
pyproject.toml, noxfile.py
requires-python upper bound and Nox unit session matrix expanded to include Python 3.14.
CI/Documentation
.github/workflows/unit_tests.yml, .github/workflows/gpu_tests.yml, docs/source/getting_started/_installation_for_Linux.rst
Workflow job matrices and installation docs updated to reflect Python 3.10–3.14 (capped at 3.15) support; GPU test timeout increased from 60 to 75 minutes.

Release Infrastructure & Cherry-Pick Skill

Layer / File(s) Summary
Release Cherry-Pick Skill
.claude/skills/release-cherry-pick/SKILL.md
New Claude skill automating PR cherry-picking from main into release/<version> branches via label-based selection, merge-commit retrieval, and automatic cherry-pick-done label application.
Release Notes
CHANGELOG.rst
Added entries for Minitron tutorial, FP8 vision-transformer MHA quantization, nvfp4_awq MoE export fix, and Python 3.14 experimental support.

Minitron Tutorial & Example Documentation

Layer / File(s) Summary
Example READMEs
examples/pruning/minitron/README.md, examples/pruning/puzzletron/README.md
New documentation pages linking end-to-end Minitron and Puzzletron pruning/distillation/quantization workflows and results.
Comprehensive Minitron Guide
examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md
Full reproduction guide covering data blending, pruning, knowledge distillation (with multiple token budgets), FP8 quantization, evaluation setup (NeMo Evaluator + Slurm), and vLLM throughput benchmarking with per-H100 results.
Ablation Study & Evaluation Config
examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/ABLATIONS.md, examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/nemo_evaluator.yaml
Ablation study documenting blend experiments (30% pretraining / 70% post-training) across Nemotron variants; NeMo Evaluator YAML configuration for Slurm deployment with task-specific settings (MMLU Pro, GPQA, LiveCodeBench, AIME, SciCode).
Dataset Tokenization Guide
examples/dataset/MEGATRON_DATA_PREP.md, examples/dataset/README.md, examples/pruning/README.md
Extracted Megatron pre-tokenization documentation with JSONL/HF dataset commands, distillation blend hyperparameter guidance, and cross-links between pruning/distillation tutorials.
Dependency Constraints
examples/gpt-oss/requirements.txt
Tightened version ranges for kernels and trackio to avoid breaking changes.

MoE Expert Quantization & Export Enhancements

Layer / File(s) Summary
Core Quantizer Utilities
modelopt/torch/quantization/utils/core_utils.py
Added representative_weight_quantizer() helper to support both singular and plural (fused-experts ModuleList) quantizer layouts; updated weight_attr_names() to detect weights with representative quantizers.
Quantization Config & Conversion
modelopt/torch/quantization/config.py, modelopt/torch/quantization/conversion.py
Extended Mamba-MoE disabled quantizers to include Mcore self-attention naming patterns; added regex-based normalization of fused-experts quantizer names to enable wildcard matching across plural ModuleList forms.
Calibration & Model Export
modelopt/torch/quantization/model_calib.py
Enhanced awq_lite post-processing to detect and handle experts missed during search pass, unify disable fallback paths, and apply neutral pre_quant_scale=1 for consistent export across uncalibrated/disabled experts.
Plugin & Folding
modelopt/torch/quantization/plugins/huggingface.py
New fold_weight() override for _QuantFusedExperts to apply per-expert fake-quant to 3-D weights; added force_eager_experts_impl_on_the_fly() to recursively set _experts_implementation="eager" on models containing fused experts.
Export Utils & Wiring
modelopt/torch/export/plugins/vllm_fakequant_hf.py, modelopt/torch/export/quant_utils.py, modelopt/torch/export/unified_export_hf.py
Broadened weight-quantizer regex to match fused-experts plural forms; added _fakequant_fused_experts_weights() for per-expert fake-quant; updated export quantization format detection to use representative_weight_quantizer(); refactored _process_quantized_modules() to handle fused-experts earlier in the export pipeline.
Quantization Utilities Export
modelopt/torch/quantization/utils/__init__.py
Added representative_weight_quantizer to public API exports.
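A rough sketch of how such a helper might cover both quantizer layouts. The real `representative_weight_quantizer()` in modelopt may have a different signature and semantics; duck-typed containers stand in for `nn.ModuleList` here:

```python
def representative_weight_quantizer(module, prefix="weight"):
    """Return one quantizer representing a module's weight quantization.

    Sketch only: prefer the singular `<prefix>_quantizer` attribute used
    by ordinary layers; fall back to the first entry of a plural
    `<prefix>_quantizers` container used by fused-experts ModuleLists,
    assuming all experts share one quantization format.
    """
    singular = getattr(module, f"{prefix}_quantizer", None)
    if singular is not None:
        return singular
    plural = getattr(module, f"{prefix}_quantizers", None)
    if plural is not None and len(plural) > 0:
        return plural[0]
    return None
```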

Distributed Checkpoint Saving & Model Export Improvements

Layer / File(s) Summary
Distributed Checkpoint Utility
modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py
New save_checkpoint_from_shards() function that gathers per-rank state dicts to rank 0, merges sharded weights, writes checkpoint (with safetensors index and subblocks directory), broadcasts any save errors, and synchronizes across all ranks.
Checkpoint Saving Integration
modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py
Updated call from save_checkpoint() to save_checkpoint_from_shards() for distributed model save support.
Megatron Importer Enhancements
modelopt/torch/export/plugins/megatron_importer.py
Updated grouped-MoE weight loading to compute per-rank global expert starting index via get_expert_model_parallel_rank(), iterate by local expert ID, and write to local weight slots; added divisibility assertion for expert distribution.
Model Support Expansion
modelopt/torch/export/unified_export_megatron.py
Added optional import and support for HybridModel in Megatron exporter validation.
Test Coverage
tests/gpu/torch/puzzletron/tools/test_save_ckpt_from_shards.py
New test suite validating single-process (safetensors index/subblocks creation, weight tie exclusion, numerical correctness) and multi-process distributed saves with rank-sharded output verification.

ONNX Runtime, Export Utilities & Quantization Enhancements

Layer / File(s) Summary
Windows cuDNN DLL Loading
modelopt/onnx/quantization/ort_utils.py
Added helpers to locate and load cuDNN DLLs from nvidia-cudnn-cu12/cu13 site-packages on Windows; enhanced _check_for_libcudnn() to invoke extra DLL loader after ort.preload_dlls() succeeds.
AWQ Cleanup Refactoring
modelopt/onnx/quantization/int4.py
Introduced shared _remove_augmented_onnx() helper; refactored both _quantize_awq_clip() and _quantize_awq_lite() to use try/finally pattern with explicit session clearing, garbage collection, and cleanup.
Remote TensorRT Autotuning
modelopt/onnx/quantization/autotune/benchmark.py
Enhanced remote autotuning to emit debug message for TensorRT version check and append missing --skipInference flag with warning.
Zero-Scale QDQ Patching
modelopt/onnx/quantization/qdq_utils.py
Expanded replace_zero_scale_with_smallest_nonzero() to handle multiple Q/DQ op types (QuantizeLinear, DequantizeLinear, TRT_INT4*) via graph.initializer tensor updates and legacy Constant node paths.
New LLM Export Utilities
modelopt/onnx/llm_export_utils/export_utils.py, modelopt/onnx/llm_export_utils/quantization_utils.py, modelopt/onnx/llm_export_utils/surgeon_utils.py
New modules for HuggingFace LLM ONNX export: RopeType enum, ModelLoader for HF model loading, WrapperModelForCausalLM for legacy KV-cache format conversion, llm_to_onnx() for ONNX export with dynamic axes, quantize() for calibration-based quantization with configurable precision (FP8/NVFP4/INT4), and fold_fp8_qdq_to_dq() for FP8 weight transformation on TRT quantized graphs.
Legacy API Deprecation
modelopt/onnx/llm_export_utils/__init__.py
Added deprecation shim warning users to migrate to modelopt.onnx.export, modelopt.onnx.graph_surgery, or TensorRT-Edge-LLM.
Test Coverage
tests/unit/onnx/quantization/test_qdq_utils.py
New regression tests for replace_zero_scale_with_smallest_nonzero() validating both initializer and legacy Constant-node QDQ paths.

Dataset Utilities, KV Cache Management & Miscellaneous Improvements

Layer / File(s) Summary
KV Cache Context Manager
modelopt/torch/utils/dataset_utils.py
Introduced _disable_use_cache() context manager to temporarily disable KV caching; updated get_max_batch_size() and _forward_loop() to use context manager for cache-disabled inference.
Preprocessing Enhancements
modelopt/torch/utils/plugins/megatron_preprocess_data.py
Added error logging for apply_chat_template() failures; removed fallback to non-streaming mode when --hf_streaming lacks --hf_max_samples_per_split (streaming now remains enabled).
Chat Template Safety
modelopt/torch/utils/plugins/transformers_dataset.py
Added guard to exit early when chat_template is None before attempting string operations.
Distillation Plugin Refinement
modelopt/torch/distill/plugins/megatron.py
Refined layer-index regex in _adjust_layer_index_for_pp() to match indices at end-of-string and replace full dotted segments.
Puzzletron Utilities
modelopt/torch/puzzletron/replacement_library/build_replacement_library.py, modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py
Enhanced NaN-handling when checking for missing checkpoint directories; updated gpt_oss_20b descriptor mapping to gpt_oss for consistency.
Speculative Decoding
examples/speculative_decoding/main.py
Extended data-module construction condition to include medusa mode alongside eagle3 and dflash.
Quantization Recipe
modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml
Disabled layerwise calibration for VLM decoder layers nested under model.language_model.layers.
Test Coverage & Cleanup
tests/unit/torch/quantization/plugins/test_fused_experts.py, tests/unit/torch/utils/test_dataset_utils.py, tests/gpu/torch/puzzletron/test_puzzletron.py
Expanded fused-experts tests (eager-impl forcing, calibration, mixed-precision export, quantizer naming); added _disable_use_cache and _forward_loop dataset utility tests; removed optional mip import guard from puzzletron tests.
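The `_disable_use_cache()` context manager mentioned above can be sketched as follows. This is a hypothetical illustration of the pattern (temporarily force `use_cache = False` and restore it afterwards), not the actual modelopt code, and a plain namespace stands in for a HuggingFace model config:

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def disable_use_cache(config):
    """Turn off KV caching for calibration-style forward passes and
    restore the previous setting afterwards, even on error."""
    previous = getattr(config, "use_cache", None)
    config.use_cache = False
    try:
        yield config
    finally:
        config.use_cache = previous


config = SimpleNamespace(use_cache=True)
with disable_use_cache(config):
    pass  # cache-free inference would run here
# config.use_cache is restored to True afterwards
```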

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes


Possibly related PRs

  • NVIDIA/Model-Optimizer#1352: Adds the same .claude/skills/release-cherry-pick/SKILL.md release-cherry-pick Claude skill for automated PR cherry-picking workflow.

Suggested labels

documentation, features, quantization, moe-experts, export, checkpoint-saving, python-3.14, onnx


Suggested reviewers

  • AAnoosheh
  • jenchen13
  • cjluo-nv
  • ChenhanYu

@github-actions

github-actions Bot commented May 4, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-05-05 05:00 UTC


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py (1)

175-192: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Potential NameError when realizable_as_symlinks=True and skip_validation=True.

If both conditions are true, the model variable is never assigned (line 175-176 condition is False), but line 192 still attempts to use it, causing a NameError.

This appears to be pre-existing behavior, but since the line is modified, worth noting. Consider adding model to the condition or guarding line 192.

🛡️ One possible fix
         if args.save_models:
             checkpoint_dir = (
                 args.solutions_path.with_name(f"{args.solutions_path.stem}--checkpoints")
                 / f"solution_{i_solution}"
             )

             model_config.dtype = resolve_torch_dtype(getattr(args, "model_dtype", "torch.bfloat16"))
             Converter.copy_checkpoint_files(args.teacher_dir, checkpoint_dir)
             if realizable_as_symlinks:
                 if dist.is_master():
                     # TODO: Loo into internal Puzzleron code to see how to save as symlinks
                     # save_checkpoint_as_symlinks is currently not supported
                     pass
-            save_checkpoint_from_shards(model, checkpoint_dir, descriptor)
+            if not realizable_as_symlinks:
+                save_checkpoint_from_shards(model, checkpoint_dir, descriptor)

Or ensure model is always loaded when args.save_models is True:

-        if (args.save_models and not realizable_as_symlinks) or (not args.skip_validation):
+        if args.save_models or (not args.skip_validation):
             model = replacement_library.load_model(layer_replacements)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py`
around lines 175 - 192, The code can raise NameError because model is only set
when (args.save_models and not realizable_as_symlinks) or (not
args.skip_validation) but save_checkpoint_from_shards(model, ...) always runs;
ensure model exists before use by either: move/recompute model assignment so
replacement_library.load_model(layer_replacements) runs whenever
args.save_models is True (e.g., include realizable_as_symlinks branch), or guard
the call to save_checkpoint_from_shards behind the same condition that sets
model (use the same boolean logic involving args.save_models,
realizable_as_symlinks, and args.skip_validation) so save_checkpoint_from_shards
only receives a valid model; update related model_config assignment
(model_config.dtype) to match the chosen approach.
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

632-649: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Disable and restore the plural fused-expert weight quantizers too.

The new fused-expert path strips *_weight_quantizers.<idx> from saved state, but Lines 632-649 still only disable singular *_weight_quantizer children. nn.ModuleList-backed fused-expert quantizers therefore remain enabled when modelopt_state() is captured, so the reload metadata can disagree with the fakequantized weights, and _check_all_weight_quantizers_disabled() will miss it for the same reason.

🧩 Suggested fix
         for _, module in model.named_modules():
             if isinstance(module, QuantModule):
                 for attr_name, quantizer in module.named_children():
-                    if not (attr_name.endswith("weight_quantizer") and quantizer.is_enabled):
-                        continue
-                    if isinstance(quantizer, SequentialQuantizer):
-                        quantizer.disable()
-                        for sub in quantizer:
-                            orig_rotate = sub._rotate
-                            if sub.rotate_is_enabled:
-                                sub._rotate = disable_rotate(sub)
-                            wqs_to_restore.append((sub, orig_rotate))
-                    elif isinstance(quantizer, TensorQuantizer):
-                        quantizer.disable()
-                        orig_rotate = quantizer._rotate
-                        if quantizer.rotate_is_enabled:
-                            quantizer._rotate = disable_rotate(quantizer)
-                        wqs_to_restore.append((quantizer, orig_rotate))
+                    quantizers: list[TensorQuantizer | SequentialQuantizer] = []
+                    if attr_name.endswith("weight_quantizers") and isinstance(quantizer, nn.ModuleList):
+                        quantizers = [
+                            q
+                            for q in quantizer
+                            if isinstance(q, (TensorQuantizer, SequentialQuantizer)) and q.is_enabled
+                        ]
+                    elif (
+                        attr_name.endswith("weight_quantizer")
+                        and isinstance(quantizer, (TensorQuantizer, SequentialQuantizer))
+                        and quantizer.is_enabled
+                    ):
+                        quantizers = [quantizer]
+
+                    for q in quantizers:
+                        if isinstance(q, SequentialQuantizer):
+                            q.disable()
+                            for sub in q:
+                                orig_rotate = sub._rotate
+                                if sub.rotate_is_enabled:
+                                    sub._rotate = disable_rotate(sub)
+                                wqs_to_restore.append((sub, orig_rotate))
+                        else:
+                            q.disable()
+                            orig_rotate = q._rotate
+                            if q.rotate_is_enabled:
+                                q._rotate = disable_rotate(q)
+                            wqs_to_restore.append((q, orig_rotate))

Please mirror the same plural/singular test in _check_all_weight_quantizers_disabled().

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 632 - 649,
The loop that disables per-module weight quantizers only checks for singular
child names ending with "weight_quantizer" and misses plural fused-expert
ModuleList children (e.g., "weight_quantizers"), so update the disable logic in
the block that iterates model.named_modules() to also detect attr_name ending
with "weight_quantizers" (or the ModuleList-typed plural container), iterate its
elements, disable each element (handling SequentialQuantizer and TensorQuantizer
cases the same as the singular branch), call disable_rotate when
rotate_is_enabled, and append the (quantizer, orig_rotate) tuples to
wqs_to_restore; then mirror the same plural-vs-singular checks in
_check_all_weight_quantizers_disabled() so it verifies both single
"weight_quantizer" children and ModuleList "weight_quantizers" entries when
asserting all weight quantizers are disabled.
🧹 Nitpick comments (3)
modelopt/onnx/quantization/ort_utils.py (1)

146-167: 💤 Low value

Consider adding a platform guard for defensive programming.

The function uses ctypes.windll (line 174) which only exists on Windows. While currently this function is only called within a Windows check (line 245-246), adding an early guard would prevent potential AttributeError if the function is ever called directly from elsewhere.

🛡️ Optional defensive guard
 def _load_extra_cudnn_dlls():
     """Load any cuDNN DLLs from site-packages that ORT's preload_dlls() missed.
     ...
     """
+    if platform.system() != "Windows":
+        return
+
     import ctypes
     import ctypes.wintypes
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/quantization/ort_utils.py` around lines 146 - 167, The
_load_extra_cudnn_dlls function should defensively return early on non-Windows
platforms to avoid AttributeError from using ctypes.windll; add a platform guard
at the top of _load_extra_cudnn_dlls (e.g., check sys.platform or os.name for
Windows) so the function exits immediately when not on Windows, while keeping
the existing _find_cudnn_bin_dir check and logging behavior unchanged.
tests/gpu/torch/puzzletron/tools/test_save_ckpt_from_shards.py (1)

63-63: 💤 Low value

Minor: Redundant model instantiation.

get_tiny_llama() is called again just to access config.num_hidden_layers. Reuse the model from line 41 instead.

♻️ Suggested fix
-        assert cfg["num_hidden_layers"] == get_tiny_llama().config.num_hidden_layers
+        assert cfg["num_hidden_layers"] == model.config.num_hidden_layers
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/gpu/torch/puzzletron/tools/test_save_ckpt_from_shards.py` at line 63,
The assertion redundantly calls get_tiny_llama() again to read
config.num_hidden_layers; instead reuse the already-instantiated model (variable
named model) created earlier (from get_tiny_llama()) and change the check to
compare cfg["num_hidden_layers"] with model.config.num_hidden_layers so you
remove the extra get_tiny_llama() call.
examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/nemo_evaluator.yaml (1)

18-23: ⚡ Quick win

Avoid requiring unrelated API secrets in this template.

Line 18–23 and Line 110–113 ask for API/OpenAI credentials even though this workflow’s listed tasks rely on HF-backed evals. Prefer scoping extra secrets only to tasks that need them to reduce unnecessary secret exposure.

Proposed tightening
-#   # Set additional unused but required environment variables:
-#   export API_KEY=xxxxxx
-#   export INFERENCE_API_KEY=xxxxxx
-#   export OPENAI_CLIENT_ID=xxxxxx
-#   export OPENAI_CLIENT_SECRET=xxxxxx
+#   # If a specific task/provider requires extra credentials, add them per-task.
   env_vars:
     HF_TOKEN: HF_TOKEN
     HF_HOME: HF_HOME
     VLLM_CACHE_ROOT: VLLM_CACHE_ROOT
-    API_KEY: API_KEY
-    INFERENCE_API_KEY: INFERENCE_API_KEY
-    OPENAI_CLIENT_ID: OPENAI_CLIENT_ID
-    OPENAI_CLIENT_SECRET: OPENAI_CLIENT_SECRET

Also applies to: 106-113

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/nemo_evaluator.yaml`
around lines 18 - 23, Remove the globally listed unrelated secret environment
variables (API_KEY, INFERENCE_API_KEY, OPENAI_CLIENT_ID, OPENAI_CLIENT_SECRET)
from the top-level template and instead scope them only to the specific tasks
that require them (or mark them as optional) so HF-backed evals don’t prompt for
unnecessary secrets; update the commented blocks around those env vars (the
comments containing API_KEY / INFERENCE_API_KEY / OPENAI_CLIENT_ID /
OPENAI_CLIENT_SECRET) and any duplicated occurrences later in the file to either
remove them or relocate them into the relevant task-level env sections.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/release-cherry-pick/SKILL.md:
- Around line 20-24: The gh api command in the SKILL.md snippet uses per_page=50
which will truncate results when more than 50 matching PRs exist; update the
command that constructs the search (the gh api "search/issues?q=...+per_page=50"
invocation) to handle pagination—either increase per_page to the GitHub max
(100) and/or add --paginate to the gh api call so all pages are returned, and
ensure the pipeline (the jq and sort usage) still consumes streamed results
correctly.
- Around line 28-35: The current text says "creating it from the remote if it
doesn't exist locally" but the shown git fetch/checkout will fail if
release/<VERSION> doesn't exist on the remote; update the Step 3 wording to
reflect that, or replace the commands with logic that first git fetch origin,
then test for origin/release/<VERSION> (e.g., using git rev-parse --verify
origin/release/<VERSION>) and if it exists run git checkout release/<VERSION>,
otherwise create the branch locally from main (git checkout -b release/<VERSION>
origin/main) and push it upstream (git push -u origin release/<VERSION>).

In `@examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md`:
- Line 146: Fix the user-facing typos in the README: change "atmost" to "at
most" and replace the shorthand "hparams" with "hyperparameters" (e.g., the
sentence "Only considering atmost 40% for width and 20% for depth pruning
hparams" should read "Only considering at most 40% for width and 20% for depth
pruning hyperparameters"); apply the same wording corrections to the equivalent
lines around Line 184 and Line 228 so all tutorial sentences use "at most" and
"hyperparameters" consistently.

In `@examples/pruning/minitron/README.md`:
- Line 3: The intro sentence in README.md has a missing space after the comma
("evaluation,and"); update the sentence in the README (the Minitron tutorial
intro line) to read "evaluation, and" by inserting a space after the comma so it
becomes "evaluation, and vLLM deployment."

In `@examples/pruning/README.md`:
- Around line 295-300: The "Data Composition" guidance mixes the spellings
"pre-training" and "pretraining"; pick one consistent spelling (e.g.,
"pretraining") and update every occurrence in this section including the entries
under "Data Composition" and the related lines noted (previously 295-300 and
307-310) so all user-facing text uses the same term; search for both
"pre-training" and "pretraining" in the README.md and replace them consistently
(references: the "Data Composition" row and the adjacent bullet lines).

In `@modelopt/onnx/quantization/int4.py`:
- Around line 1322-1324: The bug is that new_tensor.get() is called
unconditionally which fails when cupy is unavailable; update the conversion
before calling numpy_helper.from_array to use np.asnumpy(new_tensor) when
has_cupy is True and otherwise pass the numpy array directly (i.e., use the same
pattern as other conversions: conditionally call np.asnumpy on new_tensor before
numpy_helper.from_array), referencing the variables/new calls new_tensor,
has_cupy, np.asnumpy and numpy_helper.from_array so the fix is applied where
new_tensor is created and passed into numpy_helper.from_array.
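The guarded conversion pattern the prompt refers to can be sketched generically; `to_host_array` is a made-up name, and a duck-typed check on `asnumpy` stands in for the `has_cupy` flag:

```python
def to_host_array(arr, xp):
    """Return a host-side array regardless of the backing array module.

    Mirrors the `np.asnumpy(new_tensor) if has_cupy else new_tensor` fix:
    cupy exposes asnumpy() for the device-to-host copy, while plain numpy
    has no such attribute, so the array passes through unchanged.
    """
    return xp.asnumpy(arr) if hasattr(xp, "asnumpy") else arr
```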

In `@modelopt/torch/distill/plugins/megatron.py`:
- Line 175: The current line uses submodule_name.replace(f".{match.group(0)}",
f".{new_layer_idx}") which replaces every identical ".<idx>" token; instead,
only replace the specific match instance. Locate the code referencing
submodule_name, match, new_layer_idx and change it to perform a single-instance
replacement using the match span (e.g., compute start,end = match.span(0) and
build new_submodule_name = submodule_name[:start] + f".{new_layer_idx}" +
submodule_name[end:]) or use re.sub with count=1 anchored to the exact match;
ensure only the matched token is changed.
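The span-based single-instance replacement the prompt describes can be sketched like this (helper name is ours, not the repo's):

```python
import re


def replace_match_once(text, match, replacement):
    """Replace only the matched token: str.replace() on ".<idx>" would
    rewrite every identical occurrence, while slicing on match.span()
    touches just the one that matched."""
    start, end = match.span(0)
    return text[:start] + replacement + text[end:]


name = "decoder.layers.3.mlp.3"
m = re.search(r"(?<=\.)\d+$", name)  # layer index at end-of-string
new_name = replace_match_once(name, m, "7")  # "decoder.layers.3.mlp.7"
```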

In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 653-659: The current check sends any module with
gate_up_proj_weight_quantizers to _export_fused_experts regardless of whether
quantization is actually enabled; update the condition to only call
_export_fused_experts when the module both exposes
gate_up_proj_weight_quantizers and is actually quantized (e.g., change the if to
require hasattr(sub_module, "gate_up_proj_weight_quantizers") and
get_quantization_format(sub_module) != QUANTIZATION_NONE), or alternatively
inspect the items in sub_module.gate_up_proj_weight_quantizers for an
enabled/active flag before entering the
fsdp2_aware_weight_update/_export_fused_experts path so non-quantized
_QuantFusedExperts are not rewritten. Ensure you adjust/remove the following
branches accordingly: the if that references gate_up_proj_weight_quantizers, the
subsequent elif using get_quantization_format, and keep calls to
_export_fused_experts and fsdp2_aware_weight_update only when quantization is
active.

In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py`:
- Around line 231-235: The merge loop using full_sd.update(shard_sd) can
silently overwrite real tensors with placeholder zeros from other ranks; change
the merge to iterate each shard_sd's items and for each key k: if k not in
full_sd set full_sd[k]=v, else verify torch.equal(full_sd[k], v) (and raise a
ValueError on mismatch) or skip if you prefer keeping the first-seen owner—apply
this logic where full_sd and gathered are used to build the combined state_dict
(the shard_sd aggregation loop).
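The collision-checking merge suggested above can be sketched as follows; `merge_shards` is a hypothetical helper, and the `equal` parameter (pass `torch.equal` for tensors) keeps the sketch framework-agnostic:

```python
def merge_shards(gathered, equal=lambda a, b: a == b):
    """Merge per-rank state dicts while detecting conflicts.

    Plain dict.update() could silently overwrite a real tensor with a
    placeholder from a later rank; here duplicate keys must agree.
    """
    full_sd = {}
    for shard_sd in gathered:
        for key, value in shard_sd.items():
            if key not in full_sd:
                full_sd[key] = value
            elif not equal(full_sd[key], value):
                raise ValueError(f"conflicting values for duplicate key {key!r}")
    return full_sd
```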

In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 522-537: The probe tensor target_input is only built once before
the retry loop, so after an OOM and reducing target_data_batch you must rebuild
target_input using sample_input_single_batch.expand(...) with the new
target_data_batch before calling infer_method; move the expand logic (or
recreate target_input) inside the while loop (referencing
sample_input_single_batch, target_data_batch, and infer_method) so each retry
tests the correctly sized probe batch under the
torch.set_grad_enabled(enable_grad) context.
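
The shape of that fix can be sketched without torch; `make_probe` stands in for `sample_input_single_batch.expand(...)`, `try_infer` for `infer_method`, and OOM is simulated with a plain exception, so every name here is illustrative.

```python
class FakeOOM(Exception):
    """Stand-in for torch.cuda.OutOfMemoryError."""

def find_fitting_batch(target_data_batch, max_fitting):
    def make_probe(batch):
        # Stands in for sample_input_single_batch.expand([batch, *dims]).
        return batch

    def try_infer(probe):
        # Stands in for infer_method(target_input); "OOMs" past max_fitting.
        if probe > max_fitting:
            raise FakeOOM

    while target_data_batch > 1:
        probe = make_probe(target_data_batch)  # rebuilt on every retry
        try:
            try_infer(probe)
            break
        except FakeOOM:
            target_data_batch //= 2
    return target_data_batch
```

Because the probe is rebuilt inside the loop, `find_fitting_batch(64, max_fitting=20)` settles at 16 instead of collapsing toward 1 with a stale oversized probe.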

In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py`:
- Around line 195-202: The exception handler in the try/except around
_Encoder.tokenizer.apply_chat_template currently prints the entire data payload
(variable data), which risks leaking sensitive content; replace the print of
json.dumps(data, ...) with a safe, minimal identifier and summary: log a stable
id if present (e.g., data.get("id") or data.get("row_id")), or compute and log a
short hash/fingerprint (e.g., SHA256 hex prefix) of the serialized row plus a
small sanitized field summary (e.g., length and first 100 chars of a sanitized
"text" field or list of keys), and include the exception message and context
(function name/tokenizer call) instead of the full payload in the raise/LOG call
in the except block that follows the call to
_Encoder.tokenizer.apply_chat_template.
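
One shape such a safe summary could take, as a sketch; the field names (`id`, `row_id`) and the 12-character digest prefix are assumptions, not taken from the actual module.

```python
import hashlib
import json

def describe_row(data):
    """Return a log-safe identifier for a failing row (no payload contents)."""
    row_id = data.get("id") or data.get("row_id")
    if row_id is None:
        serialized = json.dumps(data, sort_keys=True, default=str)
        row_id = "sha256:" + hashlib.sha256(serialized.encode()).hexdigest()[:12]
    return f"sample={row_id!r} keys={sorted(data.keys())}"
```

The except block would then print `f"apply_chat_template failed: {e} ({describe_row(data)})"` instead of the full JSON dump.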

In `@tests/gpu/torch/puzzletron/tools/test_save_ckpt_from_shards.py`:
- Around line 120-132: The test_distributed_save_creates_valid_checkpoint
currently only verifies keys; update it to also load the saved tensors
referenced by SAFE_WEIGHTS_INDEX_NAME/weight_map (and shards in
SAFETENSORS_SUBBLOCKS_DIR_NAME) and compare each tensor's data to the original
model from get_tiny_llama() (use the same key mapping) using an element-wise
comparison (e.g., torch.equal or torch.allclose with a tolerance) to ensure
values match—mirror the approach used in test_saved_weights_match_original and
perform comparisons for every entry in index["weight_map"] to detect silent
corruption from _distributed_save_worker.

In `@tests/unit/torch/quantization/plugins/test_fused_experts.py`:
- Around line 386-441: The test is flaky because routing depends on random model
init; seed the model and make the forward_loop drive tokens to every expert
deterministically: call torch.manual_seed(...) before creating _TinyMoEModel()
(so weights are deterministic) and replace the current forward_loop with one
that, for idx in range(NUM_EXPERTS), constructs inputs that target expert idx
(e.g., unique per-expert input vectors or one-hot-like patterns) and runs m(x)
so each expert in model.moe.experts is exercised at least once before
calibration assertions.

---

Outside diff comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 632-649: The loop that disables per-module weight quantizers only
checks for singular child names ending with "weight_quantizer" and misses plural
fused-expert ModuleList children (e.g., "weight_quantizers"), so update the
disable logic in the block that iterates model.named_modules() to also detect
attr_name ending with "weight_quantizers" (or the ModuleList-typed plural
container), iterate its elements, disable each element (handling
SequentialQuantizer and TensorQuantizer cases the same as the singular branch),
call disable_rotate when rotate_is_enabled, and append the (quantizer,
orig_rotate) tuples to wqs_to_restore; then mirror the same plural-vs-singular
checks in _check_all_weight_quantizers_disabled() so it verifies both single
"weight_quantizer" children and ModuleList "weight_quantizers" entries when
asserting all weight quantizers are disabled.
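
The singular-vs-plural dispatch can be sketched with minimal stand-ins (`FakeQuantizer` plays the role of `TensorQuantizer`, and a plain list plays the `ModuleList`); the rotate/restore bookkeeping the comment describes is omitted here.

```python
class FakeQuantizer:
    """Minimal stand-in for TensorQuantizer."""
    def __init__(self):
        self.enabled = True

    def disable(self):
        self.enabled = False

def disable_weight_quantizers(named_children):
    """Disable both `*weight_quantizer` children and `*weight_quantizers` lists."""
    disabled = []
    for attr_name, child in named_children:
        if attr_name.endswith("weight_quantizer"):  # singular child
            child.disable()
            disabled.append(child)
        elif attr_name.endswith("weight_quantizers"):  # plural container
            for quantizer in child:
                quantizer.disable()
                disabled.append(quantizer)
    return disabled
```

The real fix would additionally record `(quantizer, orig_rotate)` pairs for restoration and mirror the same check in `_check_all_weight_quantizers_disabled()`.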

In `@modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py`:
- Around line 175-192: The code can raise NameError because model is only set
when (args.save_models and not realizable_as_symlinks) or (not
args.skip_validation) but save_checkpoint_from_shards(model, ...) always runs;
ensure model exists before use by either: move/recompute model assignment so
replacement_library.load_model(layer_replacements) runs whenever
args.save_models is True (e.g., include realizable_as_symlinks branch), or guard
the call to save_checkpoint_from_shards behind the same condition that sets
model (use the same boolean logic involving args.save_models,
realizable_as_symlinks, and args.skip_validation) so save_checkpoint_from_shards
only receives a valid model; update related model_config assignment
(model_config.dtype) to match the chosen approach.
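
The guard option can be sketched with plain booleans; `load_model` and `save_ckpt` are hypothetical stand-in callables for the real replacement-library and checkpoint functions.

```python
def run_pipeline(save_models, realizable_as_symlinks, skip_validation,
                 load_model, save_ckpt):
    """Only call save_ckpt under the same condition that sets `model`."""
    needs_model = (save_models and not realizable_as_symlinks) or not skip_validation
    model = load_model() if needs_model else None
    if save_models and model is not None:
        save_ckpt(model)  # never reached with an unbound/None model
    return model
```

With both branches tied to one `needs_model` flag, the `NameError` path (save requested but model never loaded) becomes unreachable.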

---

Nitpick comments:
In `@examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/nemo_evaluator.yaml`:
- Around line 18-23: Remove the globally listed unrelated secret environment
variables (API_KEY, INFERENCE_API_KEY, OPENAI_CLIENT_ID, OPENAI_CLIENT_SECRET)
from the top-level template and instead scope them only to the specific tasks
that require them (or mark them as optional) so HF-backed evals don’t prompt for
unnecessary secrets; update the commented blocks around those env vars (the
comments containing API_KEY / INFERENCE_API_KEY / OPENAI_CLIENT_ID /
OPENAI_CLIENT_SECRET) and any duplicated occurrences later in the file to either
remove them or relocate them into the relevant task-level env sections.

In `@modelopt/onnx/quantization/ort_utils.py`:
- Around line 146-167: The _load_extra_cudnn_dlls function should defensively
return early on non-Windows platforms to avoid AttributeError from using
ctypes.windll; add a platform guard at the top of _load_extra_cudnn_dlls (e.g.,
check sys.platform or os.name for Windows) so the function exits immediately
when not on Windows, while keeping the existing _find_cudnn_bin_dir check and
logging behavior unchanged.

In `@tests/gpu/torch/puzzletron/tools/test_save_ckpt_from_shards.py`:
- Line 63: The assertion redundantly calls get_tiny_llama() again to read
config.num_hidden_layers; instead reuse the already-instantiated model (variable
named model) created earlier (from get_tiny_llama()) and change the check to
compare cfg["num_hidden_layers"] with model.config.num_hidden_layers so you
remove the extra get_tiny_llama() call.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c29df816-dd02-41f0-81cf-28e64d80103b

📥 Commits

Reviewing files that changed from the base of the PR and between b1ec471 and 1b2f029.

⛔ Files ignored due to path filters (1)
  • examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/figures/learning_curves.png is excluded by !**/*.png
📒 Files selected for processing (44)
  • .claude/skills/release-cherry-pick/SKILL.md
  • CHANGELOG.rst
  • docs/source/getting_started/windows/_installation_standalone.rst
  • examples/dataset/MEGATRON_DATA_PREP.md
  • examples/dataset/README.md
  • examples/megatron_bridge/README.md
  • examples/pruning/README.md
  • examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/ABLATIONS.md
  • examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md
  • examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/nemo_evaluator.yaml
  • examples/pruning/minitron/README.md
  • examples/pruning/puzzletron/Llama-3.1-8B-Instruct.md
  • examples/pruning/puzzletron/README.md
  • examples/puzzletron/README.md
  • examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/pruning/attn_pruning.yaml
  • examples/speculative_decoding/main.py
  • modelopt/onnx/quantization/autotune/benchmark.py
  • modelopt/onnx/quantization/int4.py
  • modelopt/onnx/quantization/ort_utils.py
  • modelopt/torch/distill/plugins/megatron.py
  • modelopt/torch/export/plugins/megatron_importer.py
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py
  • modelopt/torch/export/quant_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/export/unified_export_megatron.py
  • modelopt/torch/puzzletron/anymodel/model_descriptor/model_descriptor_factory.py
  • modelopt/torch/puzzletron/replacement_library/build_replacement_library.py
  • modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py
  • modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/conversion.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/plugins/huggingface.py
  • modelopt/torch/quantization/utils/__init__.py
  • modelopt/torch/quantization/utils/core_utils.py
  • modelopt/torch/utils/dataset_utils.py
  • modelopt/torch/utils/plugins/megatron_preprocess_data.py
  • modelopt/torch/utils/plugins/transformers_dataset.py
  • modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml
  • tests/gpu/torch/puzzletron/test_puzzletron.py
  • tests/gpu/torch/puzzletron/tools/test_save_ckpt_from_shards.py
  • tests/regression/torch/speculative/test_dflash_offline.py
  • tests/unit/torch/quantization/plugins/test_fused_experts.py
  • tests/unit/torch/utils/test_dataset_utils.py
💤 Files with no reviewable changes (2)
  • tests/gpu/torch/puzzletron/test_puzzletron.py
  • modelopt/onnx/quantization/autotune/benchmark.py

Comment on lines +20 to +24
```bash
gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=50" \
--jq '.items[] | [.number, .title, .pull_request.merged_at] | @tsv' \
| sort -t$'\t' -k3
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle pagination for repositories with many pending cherry-picks.

The GitHub API query uses per_page=50, which limits results to 50 PRs. If more than 50 PRs are labeled cherry-pick-<VERSION> without cherry-pick-done, later PRs will be silently omitted from the cherry-pick batch.

📄 Suggested approaches

Option 1: Increase the page size to a safer limit:

-gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=50" \
+gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=100" \

Option 2: Add pagination using --paginate:

-gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=50" \
+gh api --paginate "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=100" \
   --jq '.items[] | [.number, .title, .pull_request.merged_at] | @tsv' \
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```bash
gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=50" \
--jq '.items[] | [.number, .title, .pull_request.merged_at] | @tsv' \
| sort -t$'\t' -k3
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/release-cherry-pick/SKILL.md around lines 20 - 24, The gh api
command in the SKILL.md snippet uses per_page=50 which will truncate results
when more than 50 matching PRs exist; update the command that constructs the
search (the gh api "search/issues?q=...+per_page=50" invocation) to handle
pagination—either increase per_page to the GitHub max (100) and/or add
--paginate to the gh api call so all pages are returned, and ensure the pipeline
(the jq and sort usage) still consumes streamed results correctly.

Comment on lines +28 to +35
## Step 3 — Set up the release branch

Check out `release/<VERSION>`, creating it from the remote if it doesn't exist locally:

```bash
git fetch origin release/<VERSION>
git checkout release/<VERSION>
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify release branch creation behavior.

The description states "creating it from the remote if it doesn't exist locally," but the provided commands will fail if the release/<VERSION> branch doesn't exist on the remote. The commands only check out an existing remote branch—they don't create it.

📝 Suggested clarification

Either update the description to reflect the actual behavior:

-Check out `release/<VERSION>`, creating it from the remote if it doesn't exist locally:
+Check out the existing `release/<VERSION>` branch from the remote:

Or provide commands that handle both cases:

+If the release branch doesn't exist yet, create it from main:
+
+```bash
+git fetch origin
+if git rev-parse --verify origin/release/<VERSION> >/dev/null 2>&1; then
+  git checkout release/<VERSION>
+else
+  git checkout -b release/<VERSION> origin/main
+  git push -u origin release/<VERSION>
+fi
+```
+
+Otherwise, check out the existing branch:
+
 ```bash
 git fetch origin release/<VERSION>
 git checkout release/<VERSION>

 ```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/release-cherry-pick/SKILL.md around lines 28 - 35, The
current text says "creating it from the remote if it doesn't exist locally" but
the shown git fetch/checkout will fail if release/<VERSION> doesn't exist on
the remote; update the Step 3 wording to reflect that, or replace the commands
with logic that first runs git fetch origin, then tests for
origin/release/<VERSION> (e.g., using git rev-parse --verify
origin/release/<VERSION>) and, if it exists, runs git checkout
release/<VERSION>; otherwise create the branch locally from main (git checkout
-b release/<VERSION> origin/main) and push it upstream (git push -u origin
release/<VERSION>).


Important pruning logs:

```text
Only considering atmost 40% for width and 20% for depth pruning hparams
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix small typos in user-facing instructions.

There are a few wording typos in key tutorial lines (Line 146, Line 184, Line 228) that are quick to clean up.

Suggested edits
-Only considering atmost 40% for width and 20% for depth pruning hparams
+Only considering at most 40% for width and 20% for depth pruning hparams
-> While we use 96 nodes here for faster training, you can also run with 1 node. If you dont want to do full distillation run, you can stop earlier and take intermediate checkpoints as well.
+> While we use 96 nodes here for faster training, you can also run with 1 node. If you don't want to do a full distillation run, you can stop earlier and take intermediate checkpoints as well.
-The eval config xin [nemo_evaluator.yaml](nemo_evaluator.yaml) is for Slurm-based evaluation — it submits a vLLM serving job and runs evals against it.
+The eval config in [nemo_evaluator.yaml](nemo_evaluator.yaml) is for Slurm-based evaluation — it submits a vLLM serving job and runs evals against it.

Also applies to: 184-184, 228-228

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md` at line 146,
Fix the user-facing typos in the README: change "atmost" to "at most" and
replace the shorthand "hparams" with "hyperparameters" (e.g., the sentence "Only
considering atmost 40% for width and 20% for depth pruning hparams" should read
"Only considering at most 40% for width and 20% for depth pruning
hyperparameters"); apply the same wording corrections to the equivalent lines
around Line 184 and Line 228 so all tutorial sentences use "at most" and
"hyperparameters" consistently.

@@ -0,0 +1,11 @@
# Minitron Pruning — End-to-End Tutorials

End-to-end tutorials for [Minitron](https://arxiv.org/abs/2407.14679) structured pruning followed by knowledge distillation, quantization, evaluation,and vLLM deployment.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the missing space in the intro sentence.

`evaluation,and` should read `evaluation, and`.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/pruning/minitron/README.md` at line 3, The intro sentence in
README.md has a missing space after the comma ("evaluation,and"); update the
sentence in the README (the Minitron tutorial intro line) to read "evaluation,
and" by inserting a space after the comma so it becomes "evaluation, and vLLM
deployment."

Comment on lines +295 to 300
| **Global Batch Size (GBS)** | same as the original training or 768 if unsure |
| **Micro Batch Size (MBS)** | As large as your GPU memory can accommodate |
| **Learning Rate (LR)** | 1e-4 → 1e-5 (linear decay) for 30-50% pruning<br>• More compression → higher LR<br>• Less compression → lower LR<br>• As model gets larger → reduce LR to avoid divergence |
| **Warmup Steps** | 100 |
| **Training Max Steps** | Num training tokens / (Seq len × GBS)<br>• Recommended: 80-100B tokens |
| **Training Max Steps** | Num training tokens / (Seq len × GBS)<br>• Recommended: 80-100B tokens for best results. |
| **Data Composition** | • Standard models: 100% pre-training data<br>• Reasoning models: 70% reasoning data + 30% pre-training data |

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use one spelling of “pretraining” in this section.

The edited guidance now mixes pre-training and pretraining, which reads like an edit artifact in user-facing docs.

Also applies to: 307-310

🧰 Tools
🪛 LanguageTool

[uncategorized] ~300-~300: Do not mix variants of the same word (‘pre-train’ and ‘pretrain’) within a single text.
Context: ...Composition** | • Standard models: 100% pre-training data
• Reasoning models: 70% reasoni...

(EN_WORD_COHERENCY)


[uncategorized] ~300-~300: Do not mix variants of the same word (‘pre-train’ and ‘pretrain’) within a single text.
Context: ...soning models: 70% reasoning data + 30% pre-training data | > [!TIP] > If you know the maxi...

(EN_WORD_COHERENCY)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/pruning/README.md` around lines 295 - 300, The "Data Composition"
guidance mixes the spellings "pre-training" and "pretraining"; pick one
consistent spelling (e.g., "pretraining") and update every occurrence in this
section including the entries under "Data Composition" and the related lines
noted (previously 295-300 and 307-310) so all user-facing text uses the same
term; search for both "pre-training" and "pretraining" in the README.md and
replace them consistently (references: the "Data Composition" row and the
adjacent bullet lines).

Comment on lines +231 to +235
full_sd: dict[str, torch.Tensor] = {}
for shard_sd in gathered:
if shard_sd is None:
continue
full_sd.update(shard_sd)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Merge logic is fragile with overlapping keys across ranks.

The full_sd.update(shard_sd) approach silently overwrites values when multiple ranks have the same key. If a rank provides a "placeholder" value (e.g., zeros) for keys it doesn't own, later ranks will overwrite earlier ranks' real values.

This is exactly what the distributed test does—each rank zeroes keys it doesn't own, so the final merged state will have zeros from the last rank overwriting real values from earlier ranks for those keys.

Consider either:

  1. Only updating keys that are missing from full_sd, or
  2. Documenting that ranks must provide disjoint key sets (no overlapping keys).
🛡️ Proposed fix to avoid overwriting existing values
             full_sd: dict[str, torch.Tensor] = {}
             for shard_sd in gathered:
                 if shard_sd is None:
                     continue
-                full_sd.update(shard_sd)
+                for k, v in shard_sd.items():
+                    if k not in full_sd:
+                        full_sd[k] = v

Alternatively, if overlapping keys with identical values are expected, you could add a check:

for k, v in shard_sd.items():
    if k in full_sd:
        if not torch.equal(full_sd[k], v):
            raise ValueError(f"Key {k} has conflicting values across ranks")
    else:
        full_sd[k] = v
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py` around lines 231 -
235, The merge loop using full_sd.update(shard_sd) can silently overwrite real
tensors with placeholder zeros from other ranks; change the merge to iterate
each shard_sd's items and for each key k: if k not in full_sd set full_sd[k]=v,
else verify torch.equal(full_sd[k], v) (and raise a ValueError on mismatch) or
skip if you prefer keeping the first-seen owner—apply this logic where full_sd
and gathered are used to build the combined state_dict (the shard_sd aggregation
loop).

Comment on lines +522 to +537
target_input = sample_input_single_batch.expand(
[
target_data_batch if index == 0 else dim
for index, dim in enumerate(sample_input_single_batch.shape)
]
)
target_data_batch = 1
else:
target_data_batch = max(int(free_mem_before / mem_diff_per_data_batch), 1)
target_input = sample_input_single_batch.expand(
[
target_data_batch if index == 0 else dim
for index, dim in enumerate(sample_input_single_batch.shape)
]
)

# For some models on multi GPU, we observe the memory per batch is not a constant.
# So we just test the target batch size and make sure we do not go OOM.
while target_data_batch > 1:
with torch.set_grad_enabled(enable_grad):
try:
infer_method(target_input)
break
except torch.cuda.OutOfMemoryError:
target_data_batch = target_data_batch // 2
# For some models on multi GPU, we observe the memory per batch is not a constant.
# So we just test the target batch size and make sure we do not go OOM.
while target_data_batch > 1:
with torch.set_grad_enabled(enable_grad):
try:
infer_method(target_input)
break
except torch.cuda.OutOfMemoryError: # pragma: no cover - GPU OOM retry path
target_data_batch = target_data_batch // 2 # pragma: no cover

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Rebuild the probe tensor after each OOM retry.

target_input is expanded once before the retry loop, so every retry still uses the original oversized batch. If the first estimate OOMs, this path just counts target_data_batch down to 1 without ever validating a smaller candidate, which can collapse auto-batching to 1 even when a larger batch would fit.

Suggested fix
-        target_input = sample_input_single_batch.expand(
-            [
-                target_data_batch if index == 0 else dim
-                for index, dim in enumerate(sample_input_single_batch.shape)
-            ]
-        )
-
         # For some models on multi GPU, we observe the memory per batch is not a constant.
         # So we just test the target batch size and make sure we do not go OOM.
         while target_data_batch > 1:
+            target_input = sample_input_single_batch.expand(
+                [
+                    target_data_batch if index == 0 else dim
+                    for index, dim in enumerate(sample_input_single_batch.shape)
+                ]
+            )
             with torch.set_grad_enabled(enable_grad):
                 try:
                     infer_method(target_input)
                     break
                 except torch.cuda.OutOfMemoryError:  # pragma: no cover - GPU OOM retry path
+                    torch.cuda.empty_cache()  # pragma: no cover
                     target_data_batch = target_data_batch // 2  # pragma: no cover
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/utils/dataset_utils.py` around lines 522 - 537, The probe
tensor target_input is only built once before the retry loop, so after an OOM
and reducing target_data_batch you must rebuild target_input using
sample_input_single_batch.expand(...) with the new target_data_batch before
calling infer_method; move the expand logic (or recreate target_input) inside
the while loop (referencing sample_input_single_batch, target_data_batch, and
infer_method) so each retry tests the correctly sized probe batch under the
torch.set_grad_enabled(enable_grad) context.

Comment on lines +195 to +202
try:
text = _Encoder.tokenizer.apply_chat_template(value, tokenize=False, **kwargs)
except Exception as e:
print(
f"apply_chat_template failed: {e}\nData:\n{json.dumps(data, indent=2, default=str)}",
flush=True,
)
raise

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid dumping the full failing sample into stdout.

Line 199 prints the entire JSON row when apply_chat_template() fails. On real SFT/pretraining data that can leak prompts, tool payloads, PII, or secrets into shared logs, and very large rows will also blow up log volume. Log a stable identifier / field summary instead of the full payload.

🔒 Suggested fix
                 try:
                     text = _Encoder.tokenizer.apply_chat_template(value, tokenize=False, **kwargs)
                 except Exception as e:
+                    sample_id = data.get("id") or data.get("uuid") or data.get("conversation_id")
                     print(
-                        f"apply_chat_template failed: {e}\nData:\n{json.dumps(data, indent=2, default=str)}",
+                        "apply_chat_template failed "
+                        f"for sample={sample_id!r}: {e}. "
+                        f"Available keys: {sorted(data.keys())}",
                         flush=True,
                     )
                     raise

As per coding guidelines, SECURITY.md says to treat all model/weight/config artifacts and data as untrusted.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py` around lines 195 -
202, The exception handler in the try/except around
_Encoder.tokenizer.apply_chat_template currently prints the entire data payload
(variable data), which risks leaking sensitive content; replace the print of
json.dumps(data, ...) with a safe, minimal identifier and summary: log a stable
id if present (e.g., data.get("id") or data.get("row_id")), or compute and log a
short hash/fingerprint (e.g., SHA256 hex prefix) of the serialized row plus a
small sanitized field summary (e.g., length and first 100 chars of a sanitized
"text" field or list of keys), and include the exception message and context
(function name/tokenizer call) instead of the full payload in the raise/LOG call
in the except block that follows the call to
_Encoder.tokenizer.apply_chat_template.

Comment on lines +120 to +132
def test_distributed_save_creates_valid_checkpoint(self, tmp_path):
spawn_multiprocess_job(2, partial(_distributed_save_worker, checkpoint_dir=tmp_path))

index_path = tmp_path / SAFE_WEIGHTS_INDEX_NAME
assert index_path.exists()
index = json.loads(index_path.read_text())

model = get_tiny_llama()
expected_keys = set(model.state_dict().keys())
assert set(index["weight_map"].keys()) == expected_keys

shard_files = list((tmp_path / SAFETENSORS_SUBBLOCKS_DIR_NAME).glob("*.safetensors"))
assert len(shard_files) > 0

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Multi-process test should verify saved tensor values, not just keys.

The test only checks that weight_map.keys() match expected keys, but doesn't verify the actual tensor values like test_saved_weights_match_original does. Given the merge logic concern in the main implementation, verifying tensor correctness would catch silent data corruption.

💚 Proposed enhancement to verify tensor values
         shard_files = list((tmp_path / SAFETENSORS_SUBBLOCKS_DIR_NAME).glob("*.safetensors"))
         assert len(shard_files) > 0
+
+        # Verify saved values match original model
+        reloaded_sd = {}
+        for shard in shard_files:
+            reloaded_sd.update(safe_load_file(str(shard)))
+
+        original_sd = {k: v.cpu() for k, v in model.state_dict().items()}
+        for key in expected_keys:
+            torch.testing.assert_close(reloaded_sd[key], original_sd[key])
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/gpu/torch/puzzletron/tools/test_save_ckpt_from_shards.py` around lines
120 - 132, The test_distributed_save_creates_valid_checkpoint currently only
verifies keys; update it to also load the saved tensors referenced by
SAFE_WEIGHTS_INDEX_NAME/weight_map (and shards in
SAFETENSORS_SUBBLOCKS_DIR_NAME) and compare each tensor's data to the original
model from get_tiny_llama() (use the same key mapping) using an element-wise
comparison (e.g., torch.equal or torch.allclose with a tolerance) to ensure
values match—mirror the approach used in test_saved_weights_match_original and
perform comparisons for every entry in index["weight_map"] to detect silent
corruption from _distributed_save_worker.

Comment on lines +386 to +441
def test_calibration_populates_all_expert_quantizers(self):
"""After PTQ, every input/weight quantizer on the fused-experts module has amax set."""
import modelopt.torch.quantization as mtq

model = _TinyMoEModel()
expert_type = type(model.moe.experts)
self._cleanup_registry(expert_type)

quant_cfg = {
"quant_cfg": [
{"quantizer_name": "*", "enable": False},
{
"quantizer_name": "*gate_up_proj_input_quantizer",
"cfg": {"num_bits": 8, "axis": None},
},
{
"quantizer_name": "*down_proj_input_quantizer",
"cfg": {"num_bits": 8, "axis": None},
},
{
"quantizer_name": "*gate_up_proj_weight_quantizer",
"cfg": {"num_bits": 8, "axis": 0},
},
{
"quantizer_name": "*down_proj_weight_quantizer",
"cfg": {"num_bits": 8, "axis": 0},
},
],
"algorithm": "max",
}

def forward_loop(m):
torch.manual_seed(0)
for _ in range(2):
x = torch.randn(1, 4, HIDDEN_DIM)
m(x)

mtq.quantize(model, quant_cfg, forward_loop=forward_loop)

experts = model.moe.experts
assert experts.gate_up_proj_input_quantizer.amax is not None, (
"Shared gate_up_proj input quantizer was not calibrated — "
"F.linear hook likely bypassed by non-eager experts_implementation."
)
assert experts.down_proj_input_quantizer.amax is not None, (
"Shared down_proj input quantizer was not calibrated."
)
for idx in range(NUM_EXPERTS):
assert experts.gate_up_proj_weight_quantizers[idx].amax is not None, (
f"gate_up_proj_weight_quantizers[{idx}].amax is None — "
"plural ModuleList name normalization in _match_quantizer likely broken."
)
assert experts.down_proj_weight_quantizers[idx].amax is not None, (
f"down_proj_weight_quantizers[{idx}].amax is None."
)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make expert coverage deterministic in this calibration test.

This assertion requires every expert to receive tokens, but _TinyMoEModel() is created from unseeded random weights and the router path is data-dependent. That makes the test flaky: a different initialization can leave one or more experts untouched and keep their amax at None even when the fused-experts path is working correctly. Please force a deterministic routing pattern that exercises all experts before asserting on every per-expert quantizer.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/quantization/plugins/test_fused_experts.py` around lines 386
- 441, The test is flaky because routing depends on random model init; seed the
model and make the forward_loop drive tokens to every expert deterministically:
call torch.manual_seed(...) before creating _TinyMoEModel() (so weights are
deterministic) and replace the current forward_loop with one that, for idx in
range(NUM_EXPERTS), constructs inputs that target expert idx (e.g., unique
per-expert input vectors or one-hot-like patterns) and runs m(x) so each expert
in model.moe.experts is exercised at least once before calibration assertions.
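
A deterministic variant of the loop could look like this; it is a sketch, and the one-hot bias steering the router toward each expert is an assumption about `_TinyMoEModel`'s routing (`NUM_EXPERTS`/`HIDDEN_DIM` mirror the test's constants):

```python
import torch

NUM_EXPERTS = 4  # illustrative; the test defines its own constants
HIDDEN_DIM = 8

def deterministic_forward_loop(m):
    torch.manual_seed(0)  # also seed *before* constructing _TinyMoEModel()
    for idx in range(NUM_EXPERTS):
        # One-hot-like pattern intended to steer the router toward expert idx,
        # so every per-expert weight quantizer observes at least one token.
        x = torch.zeros(1, 4, HIDDEN_DIM)
        x[..., idx % HIDDEN_DIM] = 10.0
        m(x)
```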

@codecov

codecov Bot commented May 4, 2026

Codecov Report

❌ Patch coverage is 41.40401% with 409 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.10%. Comparing base (b1ec471) to head (6b9f370).
⚠️ Report is 1 commits behind head on release/0.44.0.

Files with missing lines Patch % Lines
modelopt/onnx/quantization/int4.py 64.77% 93 Missing ⚠️
modelopt/onnx/llm_export_utils/export_utils.py 0.00% 77 Missing ⚠️
modelopt/onnx/llm_export_utils/surgeon_utils.py 0.00% 53 Missing ⚠️
...delopt/onnx/llm_export_utils/quantization_utils.py 0.00% 51 Missing ⚠️
modelopt/onnx/quantization/ort_utils.py 15.09% 45 Missing ⚠️
modelopt/torch/export/plugins/vllm_fakequant_hf.py 7.14% 26 Missing ⚠️
...lopt/torch/puzzletron/tools/checkpoint_utils_hf.py 12.50% 21 Missing ⚠️
modelopt/torch/quantization/plugins/huggingface.py 46.66% 16 Missing ⚠️
modelopt/torch/export/plugins/megatron_importer.py 0.00% 6 Missing ⚠️
...pt/torch/utils/plugins/megatron_preprocess_data.py 0.00% 5 Missing ⚠️
... and 7 more
Additional details and impacted files
@@                 Coverage Diff                 @@
##           release/0.44.0    #1385       +/-   ##
===================================================
- Coverage           75.41%   65.10%   -10.32%     
===================================================
  Files                 463      465        +2     
  Lines               50208    50478      +270     
===================================================
- Hits                37865    32863     -5002     
- Misses              12343    17615     +5272     
Flag Coverage Δ
examples 32.95% <11.31%> (-7.76%) ⬇️
gpu 26.69% <6.30%> (-32.44%) ⬇️
regression 14.69% <2.00%> (-0.11%) ⬇️
unit 52.66% <38.39%> (-0.22%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner May 4, 2026 20:31
@kevalmorabia97 kevalmorabia97 requested review from AAnoosheh and removed request for a team, Edwardf0t1, cjluo-nv, realAsma and yeyu-nvidia May 4, 2026 20:32
@kevalmorabia97 kevalmorabia97 changed the title [Cherry-pick] PRs #1351 #1330 #1354 #1355 #1360 #1342 #1324 #1340 #1368 #1373 #1359 #1361 #1325 #1369 #1370 #1371 [Cherry-pick] PRs #1352 #1351 #1330 #1354 #1355 #1360 #1342 #1324 #1340 #1368 #1373 #1359 #1361 #1325 #1369 #1370 #1371 May 4, 2026
kaix-nv and others added 5 commits May 4, 2026 21:23
### What does this PR do?

Type of change: ? <!-- Use one of the following: Bug fix, new feature,
new example, new tests, documentation. -->

<!-- Details about the change. -->
Bug fix. Fix sparsity-only export writing `hf_quant_config.json` with
null `quant_algo`.

### Testing
<!-- Mention how have you tested your change if applicable. -->

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain
why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
<!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory
for new features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes
or backward incompatible changes. -->

### Additional Information
<!-- E.g. related issue. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved quantization metadata handling in model export to correctly
identify quantized checkpoints based on algorithm presence.

* **Style**
  * Reorganized imports across example files for consistency.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…ting on Ubuntu 26.04 + Python 3.14 (#1386)

Ubuntu 26.04 is here, and NVIDIA PyTorch containers will soon ship with Python 3.14, requiring us to enable (as yet untested) support to unblock them.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

* **New Features**
  * DFlash offline speculative decoding training
  * MXFP4→NVFP4 weight conversion support
  * Shared hidden-state dump utilities
  * Updated DeepSeek PTQ calibration defaults

* **Chores**
  * Added Python 3.14 support; updated Python requirement to <3.15

* **Documentation**
  * Updated installation documentation for Python version compatibility

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Bug fix

Fix `replace_zero_scale_with_smallest_nonzero()` in
`modelopt/onnx/quantization/qdq_utils.py` so the FP16-scale sanitizer
actually runs for INT4_AWQ ONNX exports.

The function is supposed to ensure all FP16 scales are strictly positive
before the model reaches TensorRT, since `trtexec --stronglyTyped`
asserts `scaleAllPositive`. It had two latent bugs that made it a
complete no-op for INT4_AWQ:

1. It only collected scales from `QuantizeLinear` consumers — but
INT4_AWQ exports use `DequantizeLinear` (default domain) and
`TRT_INT4DequantizeLinear` (trt:: domain). There are zero
`QuantizeLinear` nodes in such graphs, so the collected set was empty.
2. It only patched scales emitted by `Constant` nodes — but INT4_AWQ
stores scales as graph initializers.

When the FP32→FP16 cast in `_convert_fp32_init_to_fp16()` underflowed
small amax values to `0.0` (FP16 min subnormal is 5.96e-8), those zeros
sailed through into the exported ONNX. TRT then rejected the model with:

```
Assertion failed: (scaleAllPositive || allowNegativeScale): Scale coefficients must all be positive
```

This PR extends the sanitizer to:
- Walk `QuantizeLinear` / `DequantizeLinear` / `TRT_INT4QuantizeLinear`
/ `TRT_INT4DequantizeLinear` nodes when collecting scale tensor names.
- Patch zero entries in float-typed graph initializers in addition to
`Constant`-node values, preserving the original dtype.
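
The initializer-patching half of the fix boils down to replacing exact zeros with the smallest positive value representable in the tensor's own dtype; a standalone sketch of that idea (not the actual `qdq_utils` code):

```python
import numpy as np

def patch_zero_scales(scales: np.ndarray) -> np.ndarray:
    """Replace exact-zero scale entries with the smallest positive subnormal
    of the array's own dtype (~5.96e-8 for FP16), preserving that dtype so
    TensorRT's scaleAllPositive assertion passes."""
    assert np.issubdtype(scales.dtype, np.floating)
    patched = scales.copy()
    patched[patched == 0] = np.finfo(scales.dtype).smallest_subnormal
    return patched
```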

Files modified:
- `modelopt/onnx/quantization/qdq_utils.py` — fix.
- `tests/unit/onnx/quantization/test_qdq_utils.py` —
`TestReplaceZeroScaleWithSmallestNonzero` regression tests.
- `CHANGELOG.rst` — bug-fix entry under 0.45.

### Usage

```bash
# Repro that previously failed and now succeeds:
python torch_quant_to_onnx.py \
    --quantize_mode=int4_awq \
    --timm_model_name=vit_base_patch16_224 \
    --onnx_save_path=/tmp/vit_base_patch16_224.int4_awq.onnx \
    --calibration_data_size=32

trtexec --onnx=/tmp/vit_base_patch16_224.int4_awq.onnx --stronglyTyped --skipInference
```

### Testing

- New tests pass: `pytest
tests/unit/onnx/quantization/test_qdq_utils.py::TestReplaceZeroScaleWithSmallestNonzero
-v` (3 passed).
- Full file: `pytest tests/unit/onnx/quantization/test_qdq_utils.py` (25
passed).
- Broader sanity: `pytest tests/unit/onnx/quantization/` (288 passed).
- Smoke test on a synthetic model with explicit zero scales: 3 zeros → 0
zeros, all positive, FP16 dtype preserved.
- `pre-commit run --files <changed>` clean.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅

### Additional Information

- NVBug: 6110209
- Reproduces with `nvidia-modelopt==0.44.0rc1`, TensorRT 10.15.1.29, on
B200 / H20.
- Root cause introduced when the `TRT_INT4DequantizeLinear` export path
was added (PR #575, commit `0a4f0a8b`); that PR didn't update the
sanitizer to handle the new node type or scale storage.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Zero scale values in quantization/dequantization ops (including
additional operator variants) are now replaced with the smallest nonzero
fp16 scale for matching tensors; replacements preserve the original
tensor data type and handle scales provided via initializers or constant
nodes.

* **Tests**
* Added regression tests covering initializer- and constant-backed scale
tensors across multiple operator configurations to ensure zeros are
eliminated and dtype is preserved.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
….1 compat (#1356)

## What does this PR do?

**Type of change:** Bug fix (backward-compat restoration)

#1210 (b3feebf, "Replace in-repo LLM ONNX export with
TensorRT-Edge-LLM") removed `modelopt/onnx/llm_export_utils/` from
0.44.0rc1 and pointed users at
[TensorRT-Edge-LLM](https://github.com/NVIDIA/TensorRT-Edge-LLM) as the
migration target. The PR description openly flagged the change as not
backward compatible.

The catch: **TensorRT-Edge-LLM 0.6.1 itself imports the deleted
symbol**:

```python
# tensorrt_edgellm/onnx_export/onnx_utils.py:24
from modelopt.onnx.llm_export_utils.surgeon_utils import fold_fp8_qdq_to_dq
```

Because that import is at module load time, every `tensorrt-edgellm-*`
CLI — including `tensorrt-edgellm-quantize-llm --help` — fails
immediately with `ModuleNotFoundError`. QA reports a 5/5 (100%) failure
rate on `tests/test_onnx_ptq/test_onnx_ptq_edge_llm.py`, reproducible
without a GPU. The "unused" framing in the original removal commit
(`d89138a6`) only held inside this repo; the public API surface had an
external consumer.

## What this PR does

- Restores the four original submodules under
`modelopt/onnx/llm_export_utils/` verbatim from `d89138a6^`:
  - `__init__.py`
- `surgeon_utils.py` (contains the missing `fold_fp8_qdq_to_dq`, plus
`clear_inputs`, `clear_outputs`, `extract_layer_id`, `no_none_elements`)
- `export_utils.py` (`ModelLoader`, `WrapperModelForCausalLM`,
`RopeType`, `llm_to_onnx`, `torch_to_onnx`)
  - `quantization_utils.py` (`get_quant_config`, `quantize`)
- `__init__.py` emits a `DeprecationWarning` on import directing users
to `modelopt.onnx.export`, `modelopt.onnx.graph_surgery`, or
TensorRT-Edge-LLM.

The new `modelopt.onnx.export` and `modelopt.onnx.graph_surgery`
packages do **not** expose `fold_fp8_qdq_to_dq` (verified by `grep`), so
a pure import-redirect shim wouldn't have worked — the function itself
has to come back.

## Why a shim instead of fixing edgellm

We should still ship `tensorrt-edgellm` 0.6.2 that inlines the helper
and drops the import, but every existing 0.6.1 install (and the version
pin in its wheel — `nvidia-modelopt[torch,onnx]==0.39.0`) is broken in
the meantime, and pip doesn't downgrade modelopt when 0.44.0rc1 is
already installed. The shim unblocks them on the modelopt side
immediately.

## Usage

No new usage surface. External consumers continue to import the same
paths as before; they get a `DeprecationWarning` pointing them at the
successor APIs.

## Testing

Verified the failing import succeeds and the warning fires:

```bash
$ python -W default::DeprecationWarning -c \
    "from modelopt.onnx.llm_export_utils.surgeon_utils import fold_fp8_qdq_to_dq; print('OK')"
<string>:1: DeprecationWarning: modelopt.onnx.llm_export_utils is deprecated and will be removed in a future release. Use modelopt.onnx.export and modelopt.onnx.graph_surgery, or migrate to TensorRT-Edge-LLM (https://github.com/NVIDIA/TensorRT-Edge-LLM).
OK
```

The four submodules (`surgeon_utils`, `export_utils`, `quantization_utils`) import cleanly. All pre-commit hooks pass (ruff, mypy, license headers, bandit, RST formatting).

## Before your PR is "*Ready for review*"

- [x] Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).
- [x] Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors).
- Is this change backward compatible?: ✅ — restores a previously public
API; new code should still migrate.
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
(restoration of code from this repo's own history).
- Did you write any new necessary tests?: ❌ — restoration of removed
code; the failing edgellm test path is the integration test.
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
❌ — intentionally omitted; this restores a previously public API as a
deprecated shim, and the prior removal wasn't itself called out in the
changelog.

## Additional Information

- Original removal: #1210 (`b3feebfe`) and prep commit `d89138a6`
"Remove unused llm_export_utils package".
- Followup: track the edgellm-side fix (drop the modelopt import / inline
`fold_fp8_qdq_to_dq`) so the shim can be removed in a future major release.
- Suggested cherry-pick to `release/0.44.0` so 0.44.0 GA ships without
the regression.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Deprecated**
* Legacy LLM ONNX export pipeline is now deprecated; users are directed
to newer alternatives for model export.

* **New Features**
* Added support for exporting HuggingFace causal language models to ONNX
format with dynamic shape support.
* Added quantization utilities supporting FP8, INT4-AWQ, and NVFP4
precisions for LLM models.
  * Added ONNX graph optimization utilities for quantized operations.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?

Type of change: Bug fix <!-- Use one of the following: Bug fix, new
feature, new example, new tests, documentation. -->

<!-- Details about the change. -->
Cap kernels<0.13 and trackio<0.21 in examples/gpt-oss/requirements.txt.
Both newer versions require huggingface_hub>=1.x, but the example's
transformers pins huggingface_hub<1.0, so a fresh install breaks on
import (Unsupported type for field 'import_name': str | None from
kernels; cannot import name 'Volume' from trackio).

### Usage
No API change. On transformers<5.0, override the config's warmup_steps
with --warmup_ratio 0.03 --warmup_steps 0 (or edit the YAML), as already
noted by the comment in configs/sft_*.yaml.

```bash
accelerate launch --config_file configs/zero3.yaml sft.py --config configs/sft_full.yaml --model_name_or_path openai/gpt-oss-20b --quant_cfg MXFP4_MLP_WEIGHT_ONLY_CFG --output_dir gpt-oss-20b-qat --warmup_steps 0 --warmup_ratio 0.03
```

### Testing
1. pip install -r examples/gpt-oss/requirements.txt
    pip install transformers==4.57.3

    ```bash
    accelerate launch --config_file configs/zero3.yaml sft.py --config configs/sft_full.yaml --model_name_or_path openai/gpt-oss-20b --quant_cfg MXFP4_MLP_WEIGHT_ONLY_CFG --output_dir gpt-oss-20b-qat --warmup_steps 0 --warmup_ratio 0.03
    ```

2. pip install -r examples/gpt-oss/requirements.txt
    pip install --upgrade transformers

    ```bash
    accelerate launch --config_file configs/zero3.yaml sft.py --config configs/sft_full.yaml --model_name_or_path openai/gpt-oss-20b --quant_cfg MXFP4_MLP_WEIGHT_ONLY_CFG --output_dir gpt-oss-20b-qat
    ```

<!-- Mention how have you tested your change if applicable. -->

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A

### Additional Information
<!-- E.g. related issue. -->

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 changed the title [Cherry-pick] PRs #1352 #1351 #1330 #1354 #1355 #1360 #1342 #1324 #1340 #1368 #1373 #1359 #1361 #1325 #1369 #1370 #1371 [Cherry-pick] PRs #1352 #1351 #1330 #1354 #1355 #1360 #1342 #1324 #1340 #1368 #1373 #1359 #1361 #1325 #1369 #1370 #1371 #1375 #1386 #1353 #1356 #1390 May 5, 2026
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (1)
tests/unit/onnx/quantization/test_qdq_utils.py (1)

1058-1059: ⚡ Quick win

Extend parametrization to cover Quantize op variants too.

The production logic now handles four op types; this test currently validates only the two DQ variants. Adding the Q variants would close the regression gap with a very small change.

Suggested diff
-    @pytest.mark.parametrize("dq_op_type", ["DequantizeLinear", "TRT_INT4DequantizeLinear"])
+    @pytest.mark.parametrize(
+        "dq_op_type",
+        [
+            "QuantizeLinear",
+            "DequantizeLinear",
+            "TRT_INT4QuantizeLinear",
+            "TRT_INT4DequantizeLinear",
+        ],
+    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/onnx/quantization/test_qdq_utils.py` around lines 1058 - 1059, The
test test_zero_scale_initializer_fed_to_dq_is_patched only parametrizes
dq_op_type with the two Dequantize variants; update its `@pytest.mark.parametrize`
to include the corresponding Quantize variants ("QuantizeLinear" and
"TRT_INT4QuantizeLinear") so the test covers all four op types the production
logic handles (reference the dq_op_type parameter and the
test_zero_scale_initializer_fed_to_dq_is_patched function to locate the change).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/onnx/llm_export_utils/export_utils.py`:
- Around line 79-91: The forward implementation (e.g.,
WrapperModelForCausalLM.forward) currently only accepts input_ids and
past_key_values and builds a DynamicCache, so any advertised extra VL inputs
never get forwarded to the underlying model; change the signature to accept and
propagate arbitrary extra inputs (e.g., **extra_inputs or **kwargs) and pass
them through when calling self.model (outputs = self.model(input_ids=input_ids,
past_key_values=past_key_values, use_cache=True, **extra_inputs)), and do the
same update for the similar forward block at lines 132-139 so both wrappers
accept and forward extra inputs to the wrapped model.
- Around line 39-45: The constructor currently hard-sets self.rope_type to
RopeType.K_ROPE_ROTATE_NEOX causing get_rope_type()/rope classification to
always return NEOX; change __init__ in export_utils.ModelLoader to derive
rope_type from the model config instead (e.g., call the existing get_rope_type()
or parse self.config_path) and assign its result to self.rope_type rather than
the constant; also ensure any later initialization code around the same area
(the block referenced at lines 61-63) does not override this derived value and
uses the derived self.rope_type for downstream rope-specific export handling.
- Around line 23-24: Replace the module-level import of transformers with a lazy
import: keep "import torch" as-is but wrap "from transformers import
AutoModelForCausalLM, DynamicCache" inside a with import_plugin("transformers"):
block so transformers is only loaded when needed; use the import_plugin context
manager and import the symbols AutoModelForCausalLM and DynamicCache inside that
block.

In `@modelopt/onnx/llm_export_utils/quantization_utils.py`:
- Around line 58-113: The lm_head_precision handling in get_quant_config
(function get_quant_config) is inconsistent with quantize's validation: either
expand quantize's allowed lm_head_precision values to include "fp8" and "nvfp4"
or restrict get_quant_config to only accept "fp16"; update the validation logic
in the quantize function (or the caller that enforces lm_head_precision) so it
permits "fp8" and "nvfp4" when those branches in get_quant_config are intended
to be used, and ensure errors use the same set of supported strings in both
places to keep behavior consistent.
- Around line 132-135: Replace the incorrect pad-token logic that uses a Bandit
bypass; instead of "if tokenizer.pad_token != '<unk>': ... # nosec", change to
set tokenizer.pad_token to tokenizer.eos_token only when it is missing or equals
the placeholder "<unk>" (i.e., if tokenizer.pad_token is None or
tokenizer.pad_token == "<unk>"), and remove the "# nosec B105" comment; update
the code referencing tokenizer.pad_token and tokenizer.eos_token accordingly.

In `@modelopt/onnx/llm_export_utils/surgeon_utils.py`:
- Around line 83-115: The code only rewrites the first consumer dq_op
(node.outputs[0].outputs[0]) and then clears node.outputs, leaving other
consumers pointing to the old tensor; update the implementation so you iterate
all consumers of node.outputs[0] (e.g., for consumer in node.outputs[0].outputs)
and replace each consumer's corresponding input that references the original
quantized output with the new onnx_weights_fp8 (created with LazyValues) and
adjust op/type as needed (for the DequantizeLinear case set consumer.op =
"DequantizeLinear" and adjust consumer.outputs[0].dtype =
consumer.inputs[1].dtype); only after all consumers are updated call
node.outputs.clear(), ensuring TRT_FP8QuantizeLinear, onnx_weights_fp8,
LazyValues, dq_op, and node.outputs are the reference points when making these
changes.
- Around line 27-40: clear_inputs and clear_outputs currently wipe the entire
neighbor adjacency (i.outputs.clear() / o.inputs.clear()) which severs unrelated
edges; instead, for clear_inputs(node) iterate over node.inputs and remove only
this node from each input's outputs (e.g., i.outputs.remove(node) or filter it
out) before clearing node.inputs, and for clear_outputs(node) iterate over
node.outputs and remove only this node from each output's inputs before clearing
node.outputs; update both functions (clear_inputs, clear_outputs) to perform
targeted removal and handle cases where the neighbor list may not contain the
node (safe remove).

---

Nitpick comments:
In `@tests/unit/onnx/quantization/test_qdq_utils.py`:
- Around line 1058-1059: The test
test_zero_scale_initializer_fed_to_dq_is_patched only parametrizes dq_op_type
with the two Dequantize variants; update its `@pytest.mark.parametrize` to include
the corresponding Quantize variants ("QuantizeLinear" and
"TRT_INT4QuantizeLinear") so the test covers all four op types the production
logic handles (reference the dq_op_type parameter and the
test_zero_scale_initializer_fed_to_dq_is_patched function to locate the change).
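
The targeted-removal fix suggested above for `clear_inputs`/`clear_outputs` can be sketched on a minimal graph-surgeon-like structure; the `Tensor`/`Node` classes below are stand-ins for illustration, not the real ONNX GraphSurgeon types:

```python
class Tensor:
    def __init__(self, name):
        self.name = name
        self.inputs = []   # producer nodes
        self.outputs = []  # consumer nodes

class Node:
    def __init__(self, name):
        self.name = name
        self.inputs = []   # input tensors
        self.outputs = []  # output tensors

def clear_inputs(node):
    """Detach node from its input tensors without severing other consumers."""
    for t in node.inputs:
        while node in t.outputs:  # safe, targeted removal
            t.outputs.remove(node)
    node.inputs.clear()

def clear_outputs(node):
    """Detach node from its output tensors without severing other producers."""
    for t in node.outputs:
        while node in t.inputs:
            t.inputs.remove(node)
    node.outputs.clear()
```

The key difference from the buggy version: `t.outputs.remove(node)` removes only this node's edge, instead of `t.outputs.clear()` wiping every consumer of the shared tensor.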

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d226101f-3236-4f96-b9df-be10220f82b4

📥 Commits

Reviewing files that changed from the base of the PR and between 3720a7a and 6b9f370.

📒 Files selected for processing (13)
  • .github/workflows/unit_tests.yml
  • CHANGELOG.rst
  • docs/source/getting_started/_installation_for_Linux.rst
  • examples/gpt-oss/requirements.txt
  • modelopt/onnx/llm_export_utils/__init__.py
  • modelopt/onnx/llm_export_utils/export_utils.py
  • modelopt/onnx/llm_export_utils/quantization_utils.py
  • modelopt/onnx/llm_export_utils/surgeon_utils.py
  • modelopt/onnx/quantization/qdq_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • noxfile.py
  • pyproject.toml
  • tests/unit/onnx/quantization/test_qdq_utils.py
✅ Files skipped from review due to trivial changes (4)
  • docs/source/getting_started/_installation_for_Linux.rst
  • pyproject.toml
  • examples/gpt-oss/requirements.txt
  • modelopt/onnx/llm_export_utils/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • CHANGELOG.rst

Comment on lines +23 to +24
import torch
from transformers import AutoModelForCausalLM, DynamicCache

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Compare this file against existing optional-dependency patterns in the repo.
rg -n -C2 'import_plugin|optional dependenc|lazy import' modelopt
rg -n -C2 'from transformers import|import transformers' modelopt
rg -n -C2 'llm_export_utils' modelopt
```

Repository: NVIDIA/Model-Optimizer

Length of output: 50380


Lazy-load transformers using import_plugin() context manager.

The file imports transformers at module level (lines 23-24), which breaks importability for users without the [hf] extra. Following the established pattern throughout the codebase, wrap these imports within with import_plugin("transformers"): to defer loading until the module is actually needed.

```python
with import_plugin("transformers"):
    from transformers import AutoModelForCausalLM, DynamicCache
```

This aligns with the guideline: "Avoid hard imports of optional dependencies at module level; features should be gated by install extras ([onnx], [hf], [all]) and loaded lazily via import_plugin()."
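
For readers unfamiliar with the pattern, the context manager amounts to swallowing the `ImportError` of an optional dependency; this is a generic sketch of the idea, not modelopt's actual `import_plugin` implementation:

```python
import contextlib
import warnings

@contextlib.contextmanager
def import_plugin(name: str):
    """Suppress ImportError for an optional dependency so the module stays
    importable without it (illustrative sketch only)."""
    try:
        yield
    except ImportError:
        warnings.warn(f"Optional dependency {name!r} not installed; skipping.")
```

With this in place, `with import_plugin("transformers"): from transformers import ...` simply skips the body when the extra is missing instead of failing at module load.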

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/llm_export_utils/export_utils.py` around lines 23 - 24, Replace
the module-level import of transformers with a lazy import: keep "import torch"
as-is but wrap "from transformers import AutoModelForCausalLM, DynamicCache"
inside a with import_plugin("transformers"): block so transformers is only
loaded when needed; use the import_plugin context manager and import the symbols
AutoModelForCausalLM and DynamicCache inside that block.

Comment on lines +39 to +45
def __init__(self, hf_model_path: str, config_path: str):
"""Initialize the ModelLoader."""
self.config_path = config_path
self.hf_model_path = hf_model_path
self.model_type = self.get_model_type()
self.hf_model = None
self.rope_type = RopeType.K_ROPE_ROTATE_NEOX

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't hard-code every model to NEOX rope.

get_rope_type() currently returns K_ROPE_ROTATE_NEOX for every model because self.rope_type is never derived from the config. That will misclassify GPT-J and M-RoPE models before any rope-specific export handling runs.

Also applies to: 61-63

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/llm_export_utils/export_utils.py` around lines 39 - 45, The
constructor currently hard-sets self.rope_type to RopeType.K_ROPE_ROTATE_NEOX
causing get_rope_type()/rope classification to always return NEOX; change
__init__ in export_utils.ModelLoader to derive rope_type from the model config
instead (e.g., call the existing get_rope_type() or parse self.config_path) and
assign its result to self.rope_type rather than the constant; also ensure any
later initialization code around the same area (the block referenced at lines
61-63) does not override this derived value and uses the derived self.rope_type
for downstream rope-specific export handling.

Comment on lines +79 to +91
def forward(self, input_ids: torch.Tensor | None, past_key_values: tuple):
"""Forward pass."""
# Convert tuple cache to DynamicCache for models that require it (e.g., Qwen3)
cache = DynamicCache(config=self.config)
cache.key_cache = [kv[0] for kv in past_key_values]
cache.value_cache = [kv[1] for kv in past_key_values]
past_key_values = cache

outputs = self.model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
hidden_states = outputs[0]
past_key_values = outputs.past_key_values.to_legacy_cache()
logits = self.lm_head(hidden_states)
return logits, past_key_values

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

extra_inputs cannot reach the wrapped model today.

llm_to_onnx() advertises extra VL inputs, but WrapperModelForCausalLM.forward() only accepts input_ids and past_key_values and never forwards arbitrary kwargs to self.model. As soon as extra_inputs is non-empty, this helper cannot exercise the path it claims to support.

🛠️ Suggested fix
-    def forward(self, input_ids: torch.Tensor | None, past_key_values: tuple):
+    def forward(
+        self,
+        input_ids: torch.Tensor | None,
+        past_key_values: tuple,
+        **extra_inputs,
+    ):
         """Forward pass."""
         # Convert tuple cache to DynamicCache for models that require it (e.g., Qwen3)
         cache = DynamicCache(config=self.config)
         cache.key_cache = [kv[0] for kv in past_key_values]
         cache.value_cache = [kv[1] for kv in past_key_values]
         past_key_values = cache
 
-        outputs = self.model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
+        outputs = self.model(
+            input_ids=input_ids,
+            past_key_values=past_key_values,
+            use_cache=True,
+            **extra_inputs,
+        )

Also applies to: 132-139

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/llm_export_utils/export_utils.py` around lines 79 - 91, The
forward implementation (e.g., WrapperModelForCausalLM.forward) currently only
accepts input_ids and past_key_values and builds a DynamicCache, so any
advertised extra VL inputs never get forwarded to the underlying model; change
the signature to accept and propagate arbitrary extra inputs (e.g.,
**extra_inputs or **kwargs) and pass them through when calling self.model
(outputs = self.model(input_ids=input_ids, past_key_values=past_key_values,
use_cache=True, **extra_inputs)), and do the same update for the similar forward
block at lines 132-139 so both wrappers accept and forward extra inputs to the
wrapped model.

Comment on lines +58 to +113
def get_quant_config(precision, lm_head_precision="fp16"):
"""Get the quantization configuration."""
if precision == "fp8":
quant_cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)

elif precision == "nvfp4":
quant_cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)

elif precision == "int4_awq":
quant_cfg = copy.deepcopy(mtq.INT4_AWQ_CFG) # type: ignore[arg-type]

else:
raise ValueError(f"Unsupported precision: {precision}")

quant_cfg_list: list = [
e for e in quant_cfg["quant_cfg"] if isinstance(e, dict) and "quantizer_name" in e
]

if lm_head_precision == "fp8":
quant_cfg_list.append(
{
"quantizer_name": "*lm_head.input_quantizer",
"cfg": {"num_bits": (4, 3), "axis": None},
}
)
quant_cfg_list.append(
{
"quantizer_name": "*lm_head.weight_quantizer",
"cfg": {"num_bits": (4, 3), "axis": None},
}
)
elif lm_head_precision == "nvfp4":
quant_cfg_list.append(
{
"quantizer_name": "*lm_head.input_quantizer",
"cfg": {
"num_bits": (2, 1),
"block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
"axis": None,
},
"enable": True,
}
)
quant_cfg_list.append(
{
"quantizer_name": "*lm_head.weight_quantizer",
"cfg": {
"num_bits": (2, 1),
"block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
"axis": None,
},
"enable": True,
}
)
quant_cfg["quant_cfg"] = quant_cfg_list
return quant_cfg

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make lm_head_precision validation match the supported configs.

get_quant_config() has explicit fp8 and nvfp4 branches for lm_head_precision, but quantize() rejects everything except "fp16". Right now those branches are dead code and callers cannot use the advertised modes.

🛠️ Suggested fix
-    assert lm_head_precision in ["fp16"], (
-        f"Only fp16(unquantized) is supported for lm_head. You passed an unsupported precision: {lm_head_precision}."
+    assert lm_head_precision in ["fp16", "fp8", "nvfp4"], (
+        f"Only fp16, fp8, and nvfp4 are supported for lm_head. You passed an unsupported precision: {lm_head_precision}."
     )

Also applies to: 116-130

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/llm_export_utils/quantization_utils.py` around lines 58 - 113,
The lm_head_precision handling in get_quant_config (function get_quant_config)
is inconsistent with quantize's validation: either expand quantize's allowed
lm_head_precision values to include "fp8" and "nvfp4" or restrict
get_quant_config to only accept "fp16"; update the validation logic in the
quantize function (or the caller that enforces lm_head_precision) so it permits
"fp8" and "nvfp4" when those branches in get_quant_config are intended to be
used, and ensure errors use the same set of supported strings in both places to
keep behavior consistent.

Comment on lines +132 to +135
if tokenizer.pad_token != "<unk>": # nosec B105
tokenizer.pad_token = tokenizer.eos_token
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Remove the Bandit bypass and flip this pad-token fallback.

The current condition overwrites any existing pad token except "<unk>" with eos_token, which is the opposite of the intended fallback. The # nosec suppression is also blocked by SECURITY.md.

🛠️ Suggested fix
-    if tokenizer.pad_token != "<unk>":  # nosec B105
-        tokenizer.pad_token = tokenizer.eos_token
-    if tokenizer.pad_token is None:
+    if tokenizer.pad_token in {None, tokenizer.unk_token}:
         tokenizer.pad_token = tokenizer.eos_token

As per coding guidelines, "Any use of '# nosec' comments to bypass Bandit security checks is not allowed."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/llm_export_utils/quantization_utils.py` around lines 132 - 135,
Replace the incorrect pad-token logic that uses a Bandit bypass; instead of "if
tokenizer.pad_token != '<unk>': ... # nosec", change to set tokenizer.pad_token
to tokenizer.eos_token only when it is missing or equals the placeholder "<unk>"
(i.e., if tokenizer.pad_token is None or tokenizer.pad_token == "<unk>"), and
remove the "# nosec B105" comment; update the code referencing
tokenizer.pad_token and tokenizer.eos_token accordingly.

Comment on lines +27 to +40
def clear_inputs(node: gs.Node | gs.Tensor):
"""Clear all inputs for a node or tensor in ONNX."""
for i in node.inputs:
i.outputs.clear()
node.inputs.clear()
return node


def clear_outputs(node: gs.Node | gs.Tensor):
"""Clear all outputs for a node or tensor in ONNX."""
for o in node.outputs:
o.inputs.clear()
node.outputs.clear()
return node

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Detach only the current edge here.

Both helpers clear the neighbor's entire adjacency list, not just the connection to node. On shared tensors that will disconnect unrelated producers/consumers and corrupt the graph.

🛠️ Suggested fix
 def clear_inputs(node: gs.Node | gs.Tensor):
     """Clear all inputs for a node or tensor in ONNX."""
-    for i in node.inputs:
-        i.outputs.clear()
+    for producer in list(node.inputs):
+        if node in producer.outputs:
+            producer.outputs.remove(node)
     node.inputs.clear()
     return node
 
 
 def clear_outputs(node: gs.Node | gs.Tensor):
     """Clear all outputs for a node or tensor in ONNX."""
-    for o in node.outputs:
-        o.inputs.clear()
+    for consumer in list(node.outputs):
+        if node in consumer.inputs:
+            consumer.inputs.remove(node)
     node.outputs.clear()
     return node
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/llm_export_utils/surgeon_utils.py` around lines 27 - 40,
clear_inputs and clear_outputs currently wipe the entire neighbor adjacency
(i.outputs.clear() / o.inputs.clear()) which severs unrelated edges; instead,
for clear_inputs(node) iterate over node.inputs and remove only this node from
each input's outputs (e.g., i.outputs.remove(node) or filter it out) before
clearing node.inputs, and for clear_outputs(node) iterate over node.outputs and
remove only this node from each output's inputs before clearing node.outputs;
update both functions (clear_inputs, clear_outputs) to perform targeted removal
and handle cases where the neighbor list may not contain the node (safe remove).

Comment on lines +83 to +115
for node in graph.nodes:
if node.op == "TRT_FP8QuantizeLinear":
# Should not remove input QDQ
if not isinstance(node.inputs[0], gs.Constant):
continue

weights = node.inputs[0]
scale = node.inputs[1]
torch_weights = torch.from_numpy(weights.values)
torch_scale = torch.from_numpy(scale.values)
quantizer_name = scale.name.rsplit("/", 1)[0]
dq_op = node.outputs[0].outputs[0]
assert dq_op.op == "TRT_FP8DequantizeLinear", (
f"QDQ does not occur in pairs. You reached {dq_op.op}"
)

# Replace it with Dequantize with FP8 weights. This is a WAR because numpy does not support fp8.
numpy_weights = (
(torch_weights / torch_scale).to(torch.float8_e4m3fn).view(torch.uint8).numpy()
)
tensor = onnx.TensorProto()
tensor.data_type = onnx.TensorProto.FLOAT8E4M3FN
tensor.dims.extend(numpy_weights.shape)
tensor.raw_data = numpy_weights.tobytes()
values = LazyValues(tensor)
onnx_weights_fp8 = gs.Constant(quantizer_name + "/fp8_weights", values)

node.outputs.clear()
# DQ Op is separated out
dq_op.inputs[0] = onnx_weights_fp8
dq_op.op = "DequantizeLinear"
dq_op.outputs[0].dtype = dq_op.inputs[1].dtype


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle multi-consumer Q outputs before rewriting them.

This path only rewrites node.outputs[0].outputs[0]. If the quantized tensor feeds more than one consumer, the remaining consumers still reference the old tensor after node.outputs.clear().

🛠️ Suggested fix
-            dq_op = node.outputs[0].outputs[0]
+            consumers = list(node.outputs[0].outputs)
+            if len(consumers) != 1:
+                raise ValueError(
+                    f"Expected exactly one consumer for {node.name or node.op}, found {len(consumers)}"
+                )
+            dq_op = consumers[0]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/llm_export_utils/surgeon_utils.py` around lines 83 - 115, The
code only rewrites the first consumer dq_op (node.outputs[0].outputs[0]) and
then clears node.outputs, leaving other consumers pointing to the old tensor;
update the implementation so you iterate all consumers of node.outputs[0] (e.g.,
for consumer in node.outputs[0].outputs) and replace each consumer's
corresponding input that references the original quantized output with the new
onnx_weights_fp8 (created with LazyValues) and adjust op/type as needed (for the
DequantizeLinear case set consumer.op = "DequantizeLinear" and adjust
consumer.outputs[0].dtype = consumer.inputs[1].dtype); only after all consumers
are updated call node.outputs.clear(), ensuring TRT_FP8QuantizeLinear,
onnx_weights_fp8, LazyValues, dq_op, and node.outputs are the reference points
when making these changes.

@kevalmorabia97 kevalmorabia97 merged commit cc06062 into release/0.44.0 May 5, 2026
39 of 45 checks passed
@kevalmorabia97 kevalmorabia97 deleted the cherry-picks/release-0.44.0 branch May 5, 2026 04:58