Merged · 24 commits
5d50c73
Add release-cherry-pick Claude Code skill (#1352)
kevalmorabia97 Apr 27, 2026
57eb6b7
chore: Move FP8 MHA quantization entry from 0.45 to 0.44 in CHANGELOG…
ajrasane Apr 27, 2026
816da0f
fix incomplete mapping of safetensors in generated puzzletron checkpo…
grzegorz-k-karch Apr 27, 2026
8abe394
[NVBUG: 6103846] Fix nvfp4_awq export for uncalibrated MoE experts (#…
cjluo-nv Apr 27, 2026
3383a20
Fix regex capture for Megatron KD PP layer renaming (#1355)
AAnoosheh Apr 27, 2026
34d554f
Add required keys to attention pruning config (#1360)
grzegorz-k-karch Apr 28, 2026
484acbf
Support EP mcore import for TE Spec and Fix mamba moe config (#1342)
jenchen13 Apr 28, 2026
e906fb6
[NVBug 6102977] Add _disable_use_cache context manager to fix PTQ Att…
meenchen Apr 28, 2026
82a856d
[NVBug 6108145] Fix PTQ calibration and export for fused-experts MoE …
meenchen Apr 29, 2026
f06190b
[BUG6108338] Update windows documentation for onnxruntime quantizatio…
ynankani Apr 29, 2026
4879789
[Fix]: Relax Dflash Rregression Test Threshold fo 2GPUs (#1373)
h-guo18 Apr 29, 2026
1aa5775
Ensure removal of temp files on error in ONNX INT4 quantization (#1359)
vishalpandya1990 Apr 30, 2026
e88857d
[6034518] Remove return statement preventing remote auto tuning (#1361)
dthienan-nv Apr 30, 2026
bc89749
Add Nemotron-Nano-9B-v2 → Pruned 7B e2e tutorial: Prune + Distill + E…
kevalmorabia97 Apr 30, 2026
0f9ef85
Added fallback to load extra cudnn dlls in the site packages (#1369)
hthadicherla May 4, 2026
d3d519d
fix: include medusa in data_module assignment in main.py (#1370)
yeyu-nvidia May 4, 2026
1b2f029
fix: guard against None chat_template in _post_process_chat_template …
yeyu-nvidia May 4, 2026
3720a7a
Increase gpu_tests CI timeout from 60 to 75 mins
kevalmorabia97 May 4, 2026
1f619ce
Fix sparsity-only export emitting empty hf_quant_config.json (#1375)
kaix-nv May 4, 2026
c07841a
Enable Python 3.14 wheel support to unblock NGC PyTorch container tes…
kevalmorabia97 May 4, 2026
55c338f
[6110209] Patch zero FP16 scales in INT4_AWQ ONNX export (#1353)
ajrasane May 5, 2026
fc24813
[6106576] Restore llm_export_utils as deprecated shim for edgellm 0.6…
ajrasane May 5, 2026
e3d3321
Fix gpt-oss examples trl import error (#1390)
sugunav14 May 5, 2026
6b9f370
Remove test_dflash_offline.py regression test
kevalmorabia97 May 5, 2026
89 changes: 89 additions & 0 deletions .claude/skills/release-cherry-pick/SKILL.md
@@ -0,0 +1,89 @@
---
name: release-cherry-pick
description: Cherry-pick merged PRs labeled for a release branch into that branch, then open a PR and apply the cherry-pick-done label. Use when asked to "cherry-pick PRs for release/X.Y.Z", "pick PRs to release branch", or "cherry-pick labeled PRs".
---

# Cherry-pick PRs to a Release Branch

Cherry-pick all merged `main` PRs labeled `cherry-pick-<version>` (but not `cherry-pick-done`) into the corresponding `release/<version>` branch, one by one in merge order.

## Step 1 — Identify the target version

Ask the user for the release version (e.g. `0.44.0`) if not already provided.

Set `VERSION=<version>` for use in subsequent steps.

## Step 2 — Fetch pending PRs

Use the GitHub search API to list PRs that have the cherry-pick label but not cherry-pick-done, sorted by merge date ascending:

```bash
gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=50" \
--jq '.items[] | [.number, .title, .pull_request.merged_at] | @tsv' \
| sort -t$'\t' -k3
```
Review comment on lines +20 to +24:

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

**Handle pagination for repositories with many pending cherry-picks.**

The GitHub API query uses `per_page=50`, which limits results to 50 PRs. If more than 50 PRs are labeled `cherry-pick-<VERSION>` without `cherry-pick-done`, later PRs will be silently omitted from the cherry-pick batch.

Option 1: increase the page size to the GitHub maximum of 100:

```diff
-gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=50" \
+gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=100" \
```

Option 2: add pagination with `--paginate` so all pages are returned:

```diff
-gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=50" \
+gh api --paginate "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=100" \
    --jq '.items[] | [.number, .title, .pull_request.merged_at] | @tsv' \
```
Present the list to the user before proceeding.
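The `sort -t$'\t' -k3` step orders rows by the third tab-separated field, `merged_at`, so PRs come out in merge order. A quick local check of that behavior with made-up rows (the PR numbers, titles, and timestamps below are hypothetical sample data, not real query output):

```shell
# Two fake rows in the pipeline's format: number<TAB>title<TAB>merged_at
printf '1360\tFix attention pruning config\t2026-04-28T10:00:00Z\n' >  /tmp/pending.tsv
printf '1352\tAdd release-cherry-pick skill\t2026-04-27T09:00:00Z\n' >> /tmp/pending.tsv

# Sort by merge date ascending, same as the pipeline above;
# the row merged Apr 27 now comes before the one merged Apr 28
sort -t$'\t' -k3 /tmp/pending.tsv
```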

## Step 3 — Set up the release branch

Check out `release/<VERSION>`, creating it from the remote if it doesn't exist locally:

```bash
git fetch origin release/<VERSION>
git checkout release/<VERSION>
```
Review comment on lines +28 to +35 (auto-generated by CodeRabbit):

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

**Clarify release branch creation behavior.**

The description states "creating it from the remote if it doesn't exist locally," but the provided commands will fail if the `release/<VERSION>` branch doesn't exist on the remote. The commands only check out an existing remote branch; they don't create it.

Either update the description to reflect the actual behavior:

```diff
-Check out `release/<VERSION>`, creating it from the remote if it doesn't exist locally:
+Check out the existing `release/<VERSION>` branch from the remote:
```

Or provide commands that handle both cases:

```bash
git fetch origin
if git rev-parse --verify origin/release/<VERSION> >/dev/null 2>&1; then
  git checkout release/<VERSION>
else
  git checkout -b release/<VERSION> origin/main
  git push -u origin release/<VERSION>
fi
```


## Step 4 — Get merge commit SHAs

All PRs are squash-merged, so each has a single-parent commit. Retrieve the SHA for each PR:

```bash
gh pr view <NUM> --repo NVIDIA/Model-Optimizer --json mergeCommit --jq '.mergeCommit.oid'
```
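Running Step 4 over every pending PR from the Step 2 list can be sketched as a loop over the TSV. The sample rows below are hypothetical, and the real `gh` call needs network access and authentication, so it is left commented:

```shell
# Sample pending list (hypothetical PR numbers), in the Step 2 TSV format
printf '1352\tAdd skill\t2026-04-27T09:00:00Z\n1360\tFix config\t2026-04-28T10:00:00Z\n' > /tmp/pending2.tsv

while IFS=$'\t' read -r num title merged_at; do
  echo "PR #$num: $title"
  # Real run (needs gh auth + network):
  # gh pr view "$num" --repo NVIDIA/Model-Optimizer --json mergeCommit --jq '.mergeCommit.oid'
done < /tmp/pending2.tsv
```

Collecting the printed SHAs into a shell array keeps them available for the cherry-pick loop in Step 5.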

## Step 5 — Cherry-pick in merge order

Cherry-pick each commit with `-s` (DCO sign-off). GPG signing is handled automatically by the repo's git config.

```bash
git cherry-pick -s <SHA>
```

**On conflict:** Tell the user which PR caused the conflict and ask them to fix it, then continue:

```bash
git cherry-pick --continue
```
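The effect of `-s` (a DCO `Signed-off-by` trailer appended to the picked commit) can be seen in a throwaway repo. Everything below is local scratch under `mktemp`; no real branches or remotes are touched, and the branch and commit names are made up:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b main
git config user.email "dev@example.com" && git config user.name "Dev"

echo base > f.txt && git add f.txt && git commit -qm "base"
git checkout -qb feature
echo fix > fix.txt && git add fix.txt && git commit -qm "fix something (#1359)"
sha=$(git rev-parse HEAD)

git checkout -q main
git cherry-pick -s "$sha"          # -s appends the Signed-off-by trailer
git log -1 --pretty=%B | grep Signed-off-by
```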

## Step 6 — Create a PR to the release branch

Push the cherry-picks to a new branch and open a PR targeting `release/<VERSION>`. The PR title lists every cherry-picked PR number. The body uses `## Cherry-picked PRs` as the only heading with one `- #<NUM>` bullet per PR — no titles, no links, no extra text.

```bash
git checkout -B cherry-picks/release-<VERSION>
git push -u origin cherry-picks/release-<VERSION>

gh pr create \
--title "[Cherry-pick] PRs #<NUM1> #<NUM2> ..." \
--base release/<VERSION> \
--head cherry-picks/release-<VERSION> \
--body "$(cat <<'EOF'
## Cherry-picked PRs

- #<NUM1>
- #<NUM2>
...
EOF
)"
```

## Step 7 — Apply cherry-pick-done label

Add the `cherry-pick-done` label to every PR that was successfully cherry-picked:

```bash
for pr in <NUM1> <NUM2> ...; do
gh pr edit $pr --repo NVIDIA/Model-Optimizer --add-label "cherry-pick-done"
done
```
2 changes: 1 addition & 1 deletion .github/workflows/gpu_tests.yml
@@ -39,7 +39,7 @@ jobs:
matrix:
include:
- example: gpu
timeout: 60
timeout: 75
container_image: pytorch:26.01-py3
# tests/gpu/_extensions/test_onnx_extensions.py fails for newer containers until https://github.com/tbenthompson/cppimport/pull/98
- example: gpu_megatron
1 change: 1 addition & 0 deletions .github/workflows/unit_tests.yml
@@ -99,6 +99,7 @@ jobs:
- {nox_session: "unit-3.10(torch_211, tf_latest)", python_version: "3.10"}
- {nox_session: "unit-3.11(torch_211, tf_latest)", python_version: "3.11"}
- {nox_session: "unit-3.13(torch_211, tf_latest)", python_version: "3.13"}
- {nox_session: "unit-3.14(torch_211, tf_latest)", python_version: "3.14"}
- {nox_session: "unit-3.12(torch_28, tf_latest)", python_version: "3.12"}
- {nox_session: "unit-3.12(torch_29, tf_latest)", python_version: "3.12"}
- {nox_session: "unit-3.12(torch_210, tf_latest)", python_version: "3.12"}
4 changes: 4 additions & 0 deletions CHANGELOG.rst
@@ -7,6 +7,7 @@ Changelog
**New Features**

- Support full Transformer Engine spec for Minitron pruning (``mcore_minitron``). Now we no longer need to use a custom ModelOpt spec. Note that this does not affect the usage of the pruning workflow, but it makes pruning slightly faster and may result in a slightly different pruned model because of different kernels and numerics.
- Add end-to-end tutorial for Minitron pruning + distillation + quantization + evaluation + vLLM deployment for Nemotron-Nano-9B-v2 → Pruned 7B along with data blend preparation steps (and ablation study). See `examples/pruning/minitron/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning/minitron/>`_ for details.
- Add Puzzletron - a new algorithm for heterogeneous pruning of LLM and VLM models. See `examples/puzzletron/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/puzzletron>`_ for more details.
- Added iterator interface using CalibrationDataReader in ONNX quantization workflow.
- Add N:M sparse softmax support to the Triton flash attention kernel (``modelopt.torch.kernels.triton_fa``). See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
@@ -17,6 +18,7 @@ Changelog
- [Early Testing] Add Claude Code PTQ skill (``.claude/skills/ptq/``) for agent-assisted post-training quantization. The skill guides the agent through environment detection, model support checking, format selection, and execution via the launcher or manual SLURM/Docker/bare GPU paths. Includes handling for unlisted models with custom module patching. This feature is in early testing — use with caution.
- Add performant layerwise calibration for large models that don't fit on GPU (e.g. DeepSeek-R1, Kimi-K2). See `modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml>`_ for usage. Layerwise calibration also supports PTQ with intermediate progress saving — useful when long PTQ runs get hit with Slurm timeouts. See `modelopt_recipes/general/ptq/nvfp4_default-none_kv_gptq.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_default-none_kv_gptq.yaml>`_ for usage.
- Add implicit GEMM CUDA kernel for Conv3D with fused NVFP4 fake quantization (``modelopt.torch.quantization.src.conv``). When NVFP4 quantization is applied to an ``nn.Conv3d`` layer via ModelOpt PTQ, the implicit GEMM path is used automatically instead of cuDNN. Uses BF16 WMMA tensor cores (SM80+) with FP32 accumulation and in-kernel FP4 (E2M1) activation quantization. Grouped convolution (``groups > 1``) falls back to the default cuDNN path. Inference only — training mode falls back to cuDNN with a warning.
- Add FP8 MHA quantization support for vision transformers. Adds an attention-aware ONNX post-processing pass (scale Mul / K-transpose move before Q, Q→DQ insertion on softmax output) in :class:`FP8QuantExporter <modelopt.onnx.export.fp8_exporter.FP8QuantExporter>`, per-instance nested-attention-wrapper skipping in the HF plugin, and ``nn.LayerNorm`` registration in ``QuantModuleRegistry`` so BMM input quantizers and LayerNorm output quantizers defined in FP8_DEFAULT_CFG are honored end-to-end. See `examples/torch_onnx/torch_quant_to_onnx.py <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/torch_onnx/torch_quant_to_onnx.py>`_ for the general timm-model quantize→ONNX workflow.

**Backward Breaking Changes**

@@ -29,13 +31,15 @@ Changelog
- Fix Minitron pruning (``mcore_minitron``) for MoE models. Importance estimation hooks were incorrectly registered for MoE modules and NAS step was hanging before this.
- Fix TRT support for remote autotuning in ONNX Autotune from 10.16+ to 10.15+ and fix TRT versioning check to the ``trtexec`` version instead of the TRT Python API when using ``trtexec`` backend.
- Exclude MatMul/Gemm nodes with K or N < 16 from ONNX INT8 and FP8 quantization. Such small-dimension GEMMs cannot efficiently use INT8/FP8 Tensor Cores and the added Q/DQ layers cause perf regressions in TensorRT. Honors Gemm ``transB`` when deriving K.
- Fix ``nvfp4_awq`` export ``AssertionError: Modules have different quantization formats`` for MoE models (e.g. Qwen3-30B-A3B) when some experts are not exercised by the calibration data. ``awq_lite`` now applies a neutral all-ones ``pre_quant_scale`` to any expert that ends up disabled (no cache-pass tokens, NaN scales, or no search-pass tokens) so its format remains ``nvfp4_awq``, consistent with the rest of the MoE block. A warning is emitted whenever this fallback fires.

**Misc**

- [Security] Changed the default of ``weights_only`` to ``True`` in ``torch.load`` for secure checkpoint loading. If you need to load a checkpoint that requires unpickling arbitrary objects, first register the class in ``torch.serialization.add_safe_globals([cls])`` before loading. Added :meth:`safe_save <modelopt.torch.utils.serialization.safe_save>` and :meth:`safe_load <modelopt.torch.utils.serialization.safe_load>` API to save and load checkpoints securely.
- Bump minimum required PyTorch version to 2.8.
- [Experimental] Add support for transformers>=5.0, including generic PTQ and unified HF checkpoint export for fused MoE expert modules (Mixtral, Qwen2-MoE, Qwen3-MoE, Qwen3.5-MoE, DeepSeek-V3, Jamba, OLMoE, etc.).
- Improve ``megatron_preprocess_data``: add ``--reasoning_content`` support for Nemotron v3 datasets, eliminate intermediate JSONL for HuggingFace datasets, return output file prefixes from the Python API, add gzip input support (``.jsonl.gz``), add ``--strip_newlines`` flag for plain-text pretraining data, add ``--hf_streaming`` for very large datasets (only consumed rows downloaded), and auto-shuffle when ``--hf_max_samples_per_split`` is set to avoid biased sampling.
- Add installation support for Python 3.14. Only basic unit tests are verified for now. Production usage still defaults to Python 3.12. Python 3.10 support will be dropped in the next release.

0.43 (2026-04-16)
^^^^^^^^^^^^^^^^^
2 changes: 1 addition & 1 deletion docs/source/getting_started/_installation_for_Linux.rst
@@ -12,7 +12,7 @@ Latest Model Optimizer (``nvidia-modelopt``) currently has the following system
+-------------------------+-----------------------------+
| Architecture | x86_64, aarch64 (SBSA) |
+-------------------------+-----------------------------+
| Python | >=3.10,<3.14 |
| Python | >=3.10,<3.15 |
+-------------------------+-----------------------------+
| CUDA | 12.x, 13.x |
+-------------------------+-----------------------------+
16 changes: 16 additions & 0 deletions docs/source/getting_started/windows/_installation_standalone.rst
@@ -64,6 +64,22 @@ If you need to use any other EP for calibration, you can uninstall the existing

By default, ModelOpt-Windows utilizes the `cupy-cuda12x <https://cupy.dev//>`_ tool for GPU acceleration during the INT4 ONNX quantization process. This is compatible with CUDA 12.x.

If you are using CUDA 13.x, update CUDA-dependent packages manually:

For official ONNX Runtime guidance, see `Nightly builds for CUDA 13.x <https://onnxruntime.ai/docs/install/#nightly-for-cuda-13x>`_.

1. Uninstall ``cupy-cuda12x`` and install ``cupy-cuda13x``.
2. Uninstall ``onnxruntime-genai-cuda`` and ``onnxruntime-gpu``.
3. Install ONNX Runtime CUDA 13 nightly and the pre-release ``onnxruntime-genai-cuda`` package.

.. code-block:: bash

pip uninstall -y cupy-cuda12x onnxruntime-genai-cuda onnxruntime-gpu
pip install cupy-cuda13x
pip install coloredlogs flatbuffers numpy packaging protobuf sympy
pip install --pre --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ort-cuda-13-nightly/pypi/simple/ onnxruntime-gpu
pip install --pre onnxruntime-genai-cuda

**6. Verify Installation**

Ensure the following steps are verified: