Merged

Commits (24)
- 5d50c73: Add release-cherry-pick Claude Code skill (#1352) (kevalmorabia97, Apr 27, 2026)
- 57eb6b7: chore: Move FP8 MHA quantization entry from 0.45 to 0.44 in CHANGELOG… (ajrasane, Apr 27, 2026)
- 816da0f: fix incomplete mapping of safetensors in generated puzzletron checkpo… (grzegorz-k-karch, Apr 27, 2026)
- 8abe394: [NVBUG: 6103846] Fix nvfp4_awq export for uncalibrated MoE experts (#… (cjluo-nv, Apr 27, 2026)
- 3383a20: Fix regex capture for Megatron KD PP layer renaming (#1355) (AAnoosheh, Apr 27, 2026)
- 34d554f: Add required keys to attention pruning config (#1360) (grzegorz-k-karch, Apr 28, 2026)
- 484acbf: Support EP mcore import for TE Spec and Fix mamba moe config (#1342) (jenchen13, Apr 28, 2026)
- e906fb6: [NVBug 6102977] Add _disable_use_cache context manager to fix PTQ Att… (meenchen, Apr 28, 2026)
- 82a856d: [NVBug 6108145] Fix PTQ calibration and export for fused-experts MoE … (meenchen, Apr 29, 2026)
- f06190b: [BUG6108338] Update windows documentation for onnxruntime quantizatio… (ynankani, Apr 29, 2026)
- 4879789: [Fix]: Relax Dflash Rregression Test Threshold fo 2GPUs (#1373) (h-guo18, Apr 29, 2026)
- 1aa5775: Ensure removal of temp files on error in ONNX INT4 quantization (#1359) (vishalpandya1990, Apr 30, 2026)
- e88857d: [6034518] Remove return statement preventing remote auto tuning (#1361) (dthienan-nv, Apr 30, 2026)
- bc89749: Add Nemotron-Nano-9B-v2 → Pruned 7B e2e tutorial: Prune + Distill + E… (kevalmorabia97, Apr 30, 2026)
- 0f9ef85: Added fallback to load extra cudnn dlls in the site packages (#1369) (hthadicherla, May 4, 2026)
- d3d519d: fix: include medusa in data_module assignment in main.py (#1370) (yeyu-nvidia, May 4, 2026)
- 1b2f029: fix: guard against None chat_template in _post_process_chat_template … (yeyu-nvidia, May 4, 2026)
- 3720a7a: Increase gpu_tests CI timeout from 60 to 75 mins (kevalmorabia97, May 4, 2026)
- 1f619ce: Fix sparsity-only export emitting empty hf_quant_config.json (#1375) (kaix-nv, May 4, 2026)
- c07841a: Enable Python 3.14 wheel support to unblock NGC PyTorch container tes… (kevalmorabia97, May 4, 2026)
- 55c338f: [6110209] Patch zero FP16 scales in INT4_AWQ ONNX export (#1353) (ajrasane, May 5, 2026)
- fc24813: [6106576] Restore llm_export_utils as deprecated shim for edgellm 0.6… (ajrasane, May 5, 2026)
- e3d3321: Fix gpt-oss examples trl import error (#1390) (sugunav14, May 5, 2026)
- 6b9f370: Remove test_dflash_offline.py regression test (kevalmorabia97, May 5, 2026)
89 changes: 89 additions & 0 deletions .claude/skills/release-cherry-pick/SKILL.md
@@ -0,0 +1,89 @@
---
name: release-cherry-pick
description: Cherry-pick merged PRs labeled for a release branch into that branch, then open a PR and apply the cherry-pick-done label. Use when asked to "cherry-pick PRs for release/X.Y.Z", "pick PRs to release branch", or "cherry-pick labeled PRs".
---

# Cherry-pick PRs to a Release Branch

Cherry-pick all merged `main` PRs labeled `cherry-pick-<version>` (but not `cherry-pick-done`) into the corresponding `release/<version>` branch, one by one in merge order.

## Step 1 — Identify the target version

Ask the user for the release version (e.g. `0.44.0`) if not already provided.

Set `VERSION=<version>` for use in subsequent steps.
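
For example (`0.44.0` here is just a placeholder version):

```bash
VERSION=0.44.0
```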

## Step 2 — Fetch pending PRs

Use the GitHub search API to list PRs that have the cherry-pick label but not cherry-pick-done, sorted by merge date ascending:

```bash
gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=50" \
--jq '.items[] | [.number, .title, .pull_request.merged_at] | @tsv' \
| sort -t$'\t' -k3
```
Comment on lines +20 to +24 (Contributor)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle pagination for repositories with many pending cherry-picks.

The GitHub API query uses per_page=50, which limits results to 50 PRs. If more than 50 PRs are labeled cherry-pick-<VERSION> without cherry-pick-done, later PRs will be silently omitted from the cherry-pick batch.

📄 Suggested approaches

Option 1: Increase the page size to a safer limit:

-gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=50" \
+gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=100" \

Option 2: Add pagination using --paginate:

-gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=50" \
+gh api --paginate "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=100" \
--jq '.items[] | [.number, .title, .pull_request.merged_at] | @tsv' \

Present the list to the user before proceeding.

## Step 3 — Set up the release branch

Check out `release/<VERSION>`, creating it from the remote if it doesn't exist locally:

```bash
git fetch origin release/<VERSION>
git checkout release/<VERSION>
```
Comment on lines +28 to +35 (Contributor; auto-generated by CodeRabbit)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify release branch creation behavior.

The description states "creating it from the remote if it doesn't exist locally," but the provided commands will fail if the release/<VERSION> branch doesn't exist on the remote. The commands only check out an existing remote branch—they don't create it.

📝 Suggested clarification

Either update the description to reflect the actual behavior:

-Check out `release/<VERSION>`, creating it from the remote if it doesn't exist locally:
+Check out the existing `release/<VERSION>` branch from the remote:

Or provide commands that handle both cases:

+If the release branch doesn't exist yet, create it from main:
+
+```bash
+git fetch origin
+if git rev-parse --verify origin/release/<VERSION> >/dev/null 2>&1; then
+  git checkout release/<VERSION>
+else
+  git checkout -b release/<VERSION> origin/main
+  git push -u origin release/<VERSION>
+fi
+```
+
+Otherwise, check out the existing branch:
+
 ```bash
 git fetch origin release/<VERSION>
 git checkout release/<VERSION>
```

## Step 4 — Get merge commit SHAs

All PRs are squash-merged, so each has a single-parent commit. Retrieve the SHA for each PR:

```bash
gh pr view <NUM> --repo NVIDIA/Model-Optimizer --json mergeCommit --jq '.mergeCommit.oid'
```
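
To collect the SHAs for all pending PRs in one pass, a minimal sketch (PR numbers are placeholders, as elsewhere in this skill):

```bash
SHAS=()
for pr in <NUM1> <NUM2> ...; do
  # Append each PR's squash-merge commit SHA in merge order
  SHAS+=("$(gh pr view "$pr" --repo NVIDIA/Model-Optimizer --json mergeCommit --jq '.mergeCommit.oid')")
done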

## Step 5 — Cherry-pick in merge order

Cherry-pick each commit with `-s` (DCO sign-off). GPG signing is handled automatically by the repo's git config.

```bash
git cherry-pick -s <SHA>
```

**On conflict:** Tell the user which PR caused the conflict and ask them to fix it, then continue:

```bash
git cherry-pick --continue
```
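
A sketch of the full loop, assuming the SHAs from Step 4 are stored in merge order in a `SHAS` array; it stops on the first conflict so the user can resolve it:

```bash
for sha in "${SHAS[@]}"; do
  if ! git cherry-pick -s "$sha"; then
    echo "Conflict while cherry-picking $sha; resolve it, then run: git cherry-pick --continue"
    break
  fi
done
```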

## Step 6 — Create a PR to the release branch

Push the cherry-picks to a new branch and open a PR targeting `release/<VERSION>`. The PR title lists every cherry-picked PR number. The body uses `## Cherry-picked PRs` as the only heading with one `- #<NUM>` bullet per PR — no titles, no links, no extra text.

```bash
git checkout -B cherry-picks/release-<VERSION>
git push -u origin cherry-picks/release-<VERSION>

gh pr create \
--title "[Cherry-pick] PRs #<NUM1> #<NUM2> ..." \
--base release/<VERSION> \
--head cherry-picks/release-<VERSION> \
--body "$(cat <<'EOF'
## Cherry-picked PRs

- #<NUM1>
- #<NUM2>
...
EOF
)"
```

## Step 7 — Apply cherry-pick-done label

Add the `cherry-pick-done` label to every PR that was successfully cherry-picked:

```bash
for pr in <NUM1> <NUM2> ...; do
gh pr edit $pr --repo NVIDIA/Model-Optimizer --add-label "cherry-pick-done"
done
```
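
As an optional check, list the labels on each PR afterwards to confirm the label was applied:

```bash
gh pr view <NUM1> --repo NVIDIA/Model-Optimizer --json labels --jq '.labels[].name'
```
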
3 changes: 3 additions & 0 deletions CHANGELOG.rst
@@ -7,6 +7,7 @@ Changelog
**New Features**

- Support full Transformer Engine spec for Minitron pruning (``mcore_minitron``), so the custom ModelOpt spec is no longer needed. This does not affect the usage of the pruning workflow, but it makes pruning slightly faster and may produce a slightly different pruned model because of different kernels and numerics.
- Add end-to-end tutorial for Minitron pruning + distillation + quantization + evaluation + vLLM deployment for Nemotron-Nano-9B-v2 → Pruned 7B along with data blend preparation steps (and ablation study). See `examples/pruning/minitron/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning/minitron/>`_ for details.
- Add Puzzletron - a new algorithm for heterogeneous pruning of LLM and VLM models. See `examples/puzzletron/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/puzzletron>`_ for more details.
- Added iterator interface using CalibrationDataReader in ONNX quantization workflow.
- Add N:M sparse softmax support to the Triton flash attention kernel (``modelopt.torch.kernels.triton_fa``). See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
@@ -17,6 +18,7 @@ Changelog
- [Early Testing] Add Claude Code PTQ skill (``.claude/skills/ptq/``) for agent-assisted post-training quantization. The skill guides the agent through environment detection, model support checking, format selection, and execution via the launcher or manual SLURM/Docker/bare GPU paths. Includes handling for unlisted models with custom module patching. This feature is in early testing — use with caution.
- Add performant layerwise calibration for large models that don't fit on GPU (e.g. DeepSeek-R1, Kimi-K2). See `modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml>`_ for usage. Layerwise calibration also supports PTQ with intermediate progress saving — useful when long PTQ runs get hit with Slurm timeouts. See `modelopt_recipes/general/ptq/nvfp4_default-none_kv_gptq.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_default-none_kv_gptq.yaml>`_ for usage.
- Add implicit GEMM CUDA kernel for Conv3D with fused NVFP4 fake quantization (``modelopt.torch.quantization.src.conv``). When NVFP4 quantization is applied to an ``nn.Conv3d`` layer via ModelOpt PTQ, the implicit GEMM path is used automatically instead of cuDNN. Uses BF16 WMMA tensor cores (SM80+) with FP32 accumulation and in-kernel FP4 (E2M1) activation quantization. Grouped convolution (``groups > 1``) falls back to the default cuDNN path. Inference only — training mode falls back to cuDNN with a warning.
- Add FP8 MHA quantization support for vision transformers. Adds an attention-aware ONNX post-processing pass (scale Mul / K-transpose move before Q, Q→DQ insertion on softmax output) in :class:`FP8QuantExporter <modelopt.onnx.export.fp8_exporter.FP8QuantExporter>`, per-instance nested-attention-wrapper skipping in the HF plugin, and ``nn.LayerNorm`` registration in ``QuantModuleRegistry`` so BMM input quantizers and LayerNorm output quantizers defined in FP8_DEFAULT_CFG are honored end-to-end. See `examples/torch_onnx/torch_quant_to_onnx.py <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/torch_onnx/torch_quant_to_onnx.py>`_ for the general timm-model quantize→ONNX workflow.

**Backward Breaking Changes**

@@ -29,6 +31,7 @@ Changelog
- Fix Minitron pruning (``mcore_minitron``) for MoE models. Previously, importance estimation hooks were incorrectly registered for MoE modules and the NAS step would hang.
- Fix TRT support for remote autotuning in ONNX Autotune from 10.16+ to 10.15+ and fix TRT versioning check to the ``trtexec`` version instead of the TRT Python API when using ``trtexec`` backend.
- Exclude MatMul/Gemm nodes with K or N < 16 from ONNX INT8 and FP8 quantization. Such small-dimension GEMMs cannot efficiently use INT8/FP8 Tensor Cores and the added Q/DQ layers cause perf regressions in TensorRT. Honors Gemm ``transB`` when deriving K.
- Fix ``nvfp4_awq`` export ``AssertionError: Modules have different quantization formats`` for MoE models (e.g. Qwen3-30B-A3B) when some experts are not exercised by the calibration data. ``awq_lite`` now applies a neutral all-ones ``pre_quant_scale`` to any expert that ends up disabled (no cache-pass tokens, NaN scales, or no search-pass tokens) so its format remains ``nvfp4_awq``, consistent with the rest of the MoE block. A warning is emitted whenever this fallback fires.

**Misc**

16 changes: 16 additions & 0 deletions docs/source/getting_started/windows/_installation_standalone.rst
@@ -64,6 +64,22 @@ If you need to use any other EP for calibration, you can uninstall the existing

By default, ModelOpt-Windows utilizes the `cupy-cuda12x <https://cupy.dev//>`_ tool for GPU acceleration during the INT4 ONNX quantization process. This is compatible with CUDA 12.x.

If you are using CUDA 13.x, update CUDA-dependent packages manually:

For official ONNX Runtime guidance, see `Nightly builds for CUDA 13.x <https://onnxruntime.ai/docs/install/#nightly-for-cuda-13x>`_.

1. Uninstall ``cupy-cuda12x`` and install ``cupy-cuda13x``.
2. Uninstall ``onnxruntime-genai-cuda`` and ``onnxruntime-gpu``.
3. Install ONNX Runtime CUDA 13 nightly and the pre-release ``onnxruntime-genai-cuda`` package.

.. code-block:: bash

pip uninstall -y cupy-cuda12x onnxruntime-genai-cuda onnxruntime-gpu
pip install cupy-cuda13x
pip install coloredlogs flatbuffers numpy packaging protobuf sympy
pip install --pre --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ort-cuda-13-nightly/pypi/simple/ onnxruntime-gpu
pip install --pre onnxruntime-genai-cuda
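
To sanity-check the reinstall, you can confirm that the CUDA execution provider is visible to ONNX Runtime (a minimal check; the exact provider list depends on your build and driver):

.. code-block:: bash

    python -c "import onnxruntime; print(onnxruntime.get_available_providers())"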

**6. Verify Installation**

Ensure the following steps are verified:
242 changes: 242 additions & 0 deletions examples/dataset/MEGATRON_DATA_PREP.md
@@ -0,0 +1,242 @@
# Tokenizing for Megatron Frameworks

| **Section** | **Description** | **Link** |
| :---: | :---: | :---: |
| From JSONL files | Tokenize local JSONL files | \[[Link](#from-jsonl-files)\] |
| From Hugging Face Hub | Stream or download HF datasets and tokenize | \[[Link](#from-hugging-face-hub)\] |
| `reasoning_content` for Post-Training v3 | Control how chain-of-thought traces are handled | \[[Link](#reasoning_content-for-post-training-v3-datasets)\] |
| Nemotron Pre/Post-Training Datasets | Ready-to-run commands for all Nemotron datasets | \[[Link](#ready-to-run-tokenization-commands)\] |

The distillation and pre-training scripts in Megatron-Bridge or Megatron-LM expect data pre-tokenized in Megatron's binary indexed format (`.bin` / `.idx`).
Use the `megatron_preprocess_data` utility to tokenize any JSONL or Hugging Face dataset.
The tokenization scripts below print the list of output prefixes (e.g. `tokenized_qwen3/data1_text`) that you can use for the `data_paths` argument (with relative weights on different files) in Megatron training scripts.
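
For example, a hypothetical fragment of a Megatron-style training invocation that blends two tokenized prefixes with 70/30 relative weights (the exact argument name depends on the training script; the weights and prefixes below are illustrative only):

```bash
--data-path 0.7 tokenized_qwen3/data1_text 0.3 tokenized_qwen3/data2_text
```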

**Important Notes:**

- For Pretraining / raw-text data (`text` key) — use `--append_eod` so Megatron can tell where documents end when concatenating them into long sequences.
- For Post-training chat data (`messages` key) — omit `--append_eod`; the chat template already appends EOS at the end of each conversation.
- Set `--max_sequence_length 256_000` to avoid rare OOM errors if some text is very long.

## From JSONL files

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--jsonl_paths /path/to/data1.jsonl /path/to/data2.jsonl ... \
--json_keys text \
--tokenizer Qwen/Qwen3-0.6B \
--output_dir tokenized_qwen3 \
--workers 32 \
--append_eod
```

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--jsonl_paths /path/to/sft_data.jsonl \
--json_keys messages \
--tokenizer Qwen/Qwen3-0.6B \
--output_dir tokenized_qwen3 \
--workers 32
```

Instead of `--jsonl_paths`, pass `--input_dir /path/to/dir` to tokenize all JSONL files in a directory (`.jsonl` and `.jsonl.gz` are both supported).
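
For example, a directory-based variant of the pretraining command above (same flags; only the input selection changes, and `/path/to/dir` is a placeholder):

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --input_dir /path/to/dir \
    --json_keys text \
    --tokenizer Qwen/Qwen3-0.6B \
    --output_dir tokenized_qwen3 \
    --workers 32 \
    --append_eod
```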

## From Hugging Face Hub

To tokenize a dataset directly from Hugging Face Hub:

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_dataset nvidia/Nemotron-Pretraining-SFT-v1 \
--hf_name Nemotron-SFT-Code \
--hf_split train \
--hf_max_samples_per_split 10_000_000 \
--json_keys text \
--tokenizer Qwen/Qwen3-0.6B \
--output_dir tokenized_qwen3 \
--workers 32 \
--append_eod
```

Omit `--hf_name` to process all subsets, `--hf_split` for all splits, or `--hf_max_samples_per_split` for all samples.
To quickly test, use [nvidia/Nemotron-Pretraining-Dataset-sample](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample).

For very large datasets (tens of millions of documents), or datasets with complex nested message schemas (e.g. `tool_calls`, `function_call` fields) that cause Arrow type-cast errors in non-streaming mode, add `--hf_streaming` to avoid downloading the full dataset — only the rows actually consumed are fetched. Optionally pair with `--hf_max_samples_per_split <num_samples>` to cap the row count; without it streaming still works but re-downloads on every run with no disk cache.

> **Performance note:** Non-streaming mode downloads all Parquet shards once and caches them as Arrow files on disk.
> Re-runs read from cache and are much faster.
> Streaming re-downloads on every run with no cache, so it is slower for full-dataset processing.

## `reasoning_content` for Post-Training v3 Datasets

v3 datasets include a `reasoning_content` field in assistant messages (chain-of-thought separate from
the final answer). Use `--reasoning_content` to control how it is handled:

| Value | Behaviour |
| --- | --- |
| `strip` (default) | Field is discarded before `apply_chat_template`. Safe for any tokenizer. |
| `inline` | Wrapped as `<think>…</think>` and prepended to `content`. Preserves reasoning in a tokenizer-agnostic way. |
| `native` | Passed unchanged. Requires the tokenizer's chat template to handle the field (e.g. Qwen3). |

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_dataset nvidia/Nemotron-Math-v2 \
--hf_split high_part00 \
--json_keys messages \
--tokenizer nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
--output_dir tokenized_nemotron_v2 \
--workers 32 \
--reasoning_content inline
```

---

## Ready-to-run tokenization commands

This section provides tokenization commands for all Nemotron Pre-Training and Post-Training datasets used in Megatron-Bridge distillation experiments.

Two parameters vary by model — set them before running the commands below:

```bash
TOKENIZER=nvidia/NVIDIA-Nemotron-Nano-9B-v2 # HuggingFace tokenizer (or local path)
OUTPUT_DIR=tokenized_nemotron_v2 # Output directory for tokenized files
```

> [!TIP]
> Token count for a `.bin` file = file size in bytes ÷ 4. This is also printed by the tokenization script on completion.
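
For instance, a quick shell computation of the token count using GNU `stat` on Linux (the file name is just an example from the expected output below):

```bash
BIN=${OUTPUT_DIR}/nvidia--Nemotron-Math-v2_default_high_part00_messages.bin
echo $(( $(stat -c%s "$BIN") / 4 ))  # approximate token count
```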

> [!NOTE]
> Tokenizing each of the datasets below takes anywhere from 10 minutes to a few hours. You can tokenize them all in parallel to speed up the process.
>
> You may tokenize more datasets or skip some datasets depending on your needs.

### Nemotron Pretraining dataset

**[nvidia/Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1)** — raw text; omitting `--hf_name` tokenizes all 3 subsets (Code, General, MATH) in one command, producing a separate output file per subset named after each:

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_dataset nvidia/Nemotron-Pretraining-SFT-v1 \
--hf_split train \
--hf_streaming \
--hf_max_samples_per_split 10_000_000 \
--json_keys text \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000 \
--append_eod \
--strip_newlines
```

---

### Nemotron Post-training v1 dataset

**[nvidia/Nemotron-Post-Training-Dataset-v1](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1)** — STEM subset, capped at 5M samples. v1 data does not contain reasoning traces:

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_dataset nvidia/Nemotron-Post-Training-Dataset-v1 \
--hf_name default \
--hf_split stem \
--hf_streaming \
--hf_max_samples_per_split 5_000_000 \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000
```

---

### Nemotron Post-training v3 collection

Datasets below are from the [Nemotron Post-Training v3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3). All use `--reasoning_content inline` to preserve `<think>…</think>` traces. The collection contains many more datasets — if you care about benchmarks not covered here (e.g. multilingual, agentic/tool use, SWE, safety), pick the relevant datasets from the collection and tokenize them the same way.

**[nvidia/Nemotron-Math-v2](https://huggingface.co/datasets/nvidia/Nemotron-Math-v2)** — tokenize `high_part00` and `high_part01` separately:

```bash
for SPLIT in high_part00 high_part01; do
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_dataset nvidia/Nemotron-Math-v2 \
--hf_split ${SPLIT} \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000 \
--reasoning_content inline
done
```

**[nvidia/Nemotron-SFT-Competitive-Programming-v2](https://huggingface.co/datasets/nvidia/Nemotron-SFT-Competitive-Programming-v2)** — stored as raw JSONL on HuggingFace, download before tokenizing:

```bash
hf download nvidia/Nemotron-SFT-Competitive-Programming-v2 \
--repo-type dataset \
--local-dir datasets/Nemotron-SFT-Competitive-Programming-v2/
for FILE in competitive_programming_python_00 competitive_programming_cpp_00; do
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--jsonl_paths datasets/Nemotron-SFT-Competitive-Programming-v2/data/${FILE}.jsonl \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000 \
--reasoning_content inline
done
```

**[nvidia/Nemotron-Science-v1](https://huggingface.co/datasets/nvidia/Nemotron-Science-v1)** — stored as raw JSONL on HuggingFace, download before tokenizing:

```bash
hf download nvidia/Nemotron-Science-v1 \
--repo-type dataset \
--local-dir datasets/Nemotron-Science-v1/
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--input_dir datasets/Nemotron-Science-v1/data/ \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000 \
--reasoning_content inline
```

**[nvidia/Nemotron-SFT-Instruction-Following-Chat-v2](https://huggingface.co/datasets/nvidia/Nemotron-SFT-Instruction-Following-Chat-v2)** — stored as raw JSONL on HuggingFace, download before tokenizing:

```bash
hf download nvidia/Nemotron-SFT-Instruction-Following-Chat-v2 \
--repo-type dataset \
--local-dir datasets/Nemotron-SFT-Instruction-Following-Chat-v2/
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--input_dir datasets/Nemotron-SFT-Instruction-Following-Chat-v2/data/ \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000 \
--reasoning_content inline
```

---

### Expected output

After running all commands above, `${OUTPUT_DIR}/` should contain the following `.bin` / `.idx` file pairs:

```text
nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-Code_train_text_max10000000.{bin,idx}
nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-General_train_text_max10000000.{bin,idx}
nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-MATH_train_text_max10000000.{bin,idx}
nvidia--Nemotron-Post-Training-Dataset-v1_default_stem_messages_max5000000.{bin,idx}
nvidia--Nemotron-Math-v2_default_high_part00_messages.{bin,idx}
nvidia--Nemotron-Math-v2_default_high_part01_messages.{bin,idx}
competitive_programming_python_00_messages.{bin,idx}
competitive_programming_cpp_00_messages.{bin,idx}
MCQ_messages.{bin,idx}
RQA_messages.{bin,idx}
reasoning_off_messages.{bin,idx}
reasoning_on_messages.{bin,idx}
```