Merged

Commits (24)
- 5d50c73: Add release-cherry-pick Claude Code skill (#1352) (kevalmorabia97, Apr 27, 2026)
- 57eb6b7: chore: Move FP8 MHA quantization entry from 0.45 to 0.44 in CHANGELOG… (ajrasane, Apr 27, 2026)
- 816da0f: fix incomplete mapping of safetensors in generated puzzletron checkpo… (grzegorz-k-karch, Apr 27, 2026)
- 8abe394: [NVBUG: 6103846] Fix nvfp4_awq export for uncalibrated MoE experts (#… (cjluo-nv, Apr 27, 2026)
- 3383a20: Fix regex capture for Megatron KD PP layer renaming (#1355) (AAnoosheh, Apr 27, 2026)
- 34d554f: Add required keys to attention pruning config (#1360) (grzegorz-k-karch, Apr 28, 2026)
- 484acbf: Support EP mcore import for TE Spec and Fix mamba moe config (#1342) (jenchen13, Apr 28, 2026)
- e906fb6: [NVBug 6102977] Add _disable_use_cache context manager to fix PTQ Att… (meenchen, Apr 28, 2026)
- 82a856d: [NVBug 6108145] Fix PTQ calibration and export for fused-experts MoE … (meenchen, Apr 29, 2026)
- f06190b: [BUG6108338] Update windows documentation for onnxruntime quantizatio… (ynankani, Apr 29, 2026)
- 4879789: [Fix]: Relax Dflash Rregression Test Threshold fo 2GPUs (#1373) (h-guo18, Apr 29, 2026)
- 1aa5775: Ensure removal of temp files on error in ONNX INT4 quantization (#1359) (vishalpandya1990, Apr 30, 2026)
- e88857d: [6034518] Remove return statement preventing remote auto tuning (#1361) (dthienan-nv, Apr 30, 2026)
- bc89749: Add Nemotron-Nano-9B-v2 → Pruned 7B e2e tutorial: Prune + Distill + E… (kevalmorabia97, Apr 30, 2026)
- 0f9ef85: Added fallback to load extra cudnn dlls in the site packages (#1369) (hthadicherla, May 4, 2026)
- d3d519d: fix: include medusa in data_module assignment in main.py (#1370) (yeyu-nvidia, May 4, 2026)
- 1b2f029: fix: guard against None chat_template in _post_process_chat_template … (yeyu-nvidia, May 4, 2026)
- 3720a7a: Increase gpu_tests CI timeout from 60 to 75 mins (kevalmorabia97, May 4, 2026)
- 1f619ce: Fix sparsity-only export emitting empty hf_quant_config.json (#1375) (kaix-nv, May 4, 2026)
- c07841a: Enable Python 3.14 wheel support to unblock NGC PyTorch container tes… (kevalmorabia97, May 4, 2026)
- 55c338f: [6110209] Patch zero FP16 scales in INT4_AWQ ONNX export (#1353) (ajrasane, May 5, 2026)
- fc24813: [6106576] Restore llm_export_utils as deprecated shim for edgellm 0.6… (ajrasane, May 5, 2026)
- e3d3321: Fix gpt-oss examples trl import error (#1390) (sugunav14, May 5, 2026)
- 6b9f370: Remove test_dflash_offline.py regression test (kevalmorabia97, May 5, 2026)
89 changes: 89 additions & 0 deletions .claude/skills/release-cherry-pick/SKILL.md
@@ -0,0 +1,89 @@
---
name: release-cherry-pick
description: Cherry-pick merged PRs labeled for a release branch into that branch, then open a PR and apply the cherry-pick-done label. Use when asked to "cherry-pick PRs for release/X.Y.Z", "pick PRs to release branch", or "cherry-pick labeled PRs".
---

# Cherry-pick PRs to a Release Branch

Cherry-pick all merged `main` PRs labeled `cherry-pick-<version>` (but not `cherry-pick-done`) into the corresponding `release/<version>` branch, one by one in merge order.

## Step 1 — Identify the target version

Ask the user for the release version (e.g. `0.44.0`) if not already provided.

Set `VERSION=<version>` for use in subsequent steps.
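
For example (`0.44.0` here is just a placeholder version):

```bash
VERSION=0.44.0
```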

## Step 2 — Fetch pending PRs

Use the GitHub search API to list PRs that have the cherry-pick label but not cherry-pick-done, sorted by merge date ascending:

```bash
gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=50" \
--jq '.items[] | [.number, .title, .pull_request.merged_at] | @tsv' \
| sort -t$'\t' -k3
```
Comment on lines +20 to +24 (Contributor)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle pagination for repositories with many pending cherry-picks.

The GitHub API query uses per_page=50, which limits results to 50 PRs. If more than 50 PRs are labeled cherry-pick-<VERSION> without cherry-pick-done, later PRs will be silently omitted from the cherry-pick batch.

📄 Suggested approaches

Option 1: Increase the page size to a safer limit:

-gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=50" \
+gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=100" \

Option 2: Add pagination using --paginate:

-gh api "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=50" \
+gh api --paginate "search/issues?q=repo:NVIDIA/Model-Optimizer+is:pr+is:merged+base:main+label:cherry-pick-<VERSION>+-label:cherry-pick-done&sort=updated&order=asc&per_page=100" \
--jq '.items[] | [.number, .title, .pull_request.merged_at] | @tsv' \

Present the list to the user before proceeding.

## Step 3 — Set up the release branch

Check out `release/<VERSION>`, creating it from the remote if it doesn't exist locally:

```bash
git fetch origin release/<VERSION>
git checkout release/<VERSION>
```
Comment on lines +28 to +35 (Contributor; auto-generated by CodeRabbit)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify release branch creation behavior.

The description states "creating it from the remote if it doesn't exist locally," but the provided commands will fail if the release/<VERSION> branch doesn't exist on the remote. The commands only check out an existing remote branch—they don't create it.

📝 Suggested clarification

Either update the description to reflect the actual behavior:

-Check out `release/<VERSION>`, creating it from the remote if it doesn't exist locally:
+Check out the existing `release/<VERSION>` branch from the remote:

Or provide commands that handle both cases:

+If the release branch doesn't exist yet, create it from main:
+
+```bash
+git fetch origin
+if git rev-parse --verify origin/release/<VERSION> >/dev/null 2>&1; then
+  git checkout release/<VERSION>
+else
+  git checkout -b release/<VERSION> origin/main
+  git push -u origin release/<VERSION>
+fi
+```
+
+Otherwise, check out the existing branch:
+
 ```bash
 git fetch origin release/<VERSION>
 git checkout release/<VERSION>
```

## Step 4 — Get merge commit SHAs

All PRs are squash-merged, so each has a single-parent commit. Retrieve the SHA for each PR:

```bash
gh pr view <NUM> --repo NVIDIA/Model-Optimizer --json mergeCommit --jq '.mergeCommit.oid'
```
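
To collect the SHAs for all pending PRs in one pass, a minimal sketch (PR numbers are placeholders, as elsewhere in this skill):

```bash
SHAS=()
for pr in <NUM1> <NUM2> ...; do
  # Append each PR's squash-merge commit SHA in merge order
  SHAS+=("$(gh pr view "$pr" --repo NVIDIA/Model-Optimizer --json mergeCommit --jq '.mergeCommit.oid')")
done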

## Step 5 — Cherry-pick in merge order

Cherry-pick each commit with `-s` (DCO sign-off). GPG signing is handled automatically by the repo's git config.

```bash
git cherry-pick -s <SHA>
```

**On conflict:** Tell the user which PR caused the conflict and ask them to fix it, then continue:

```bash
git cherry-pick --continue
```
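
A sketch of the full loop, assuming the SHAs from Step 4 are stored in merge order in a `SHAS` array; it stops on the first conflict so the user can resolve it:

```bash
for sha in "${SHAS[@]}"; do
  if ! git cherry-pick -s "$sha"; then
    echo "Conflict while cherry-picking $sha; resolve it, then run: git cherry-pick --continue"
    break
  fi
done
```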

## Step 6 — Create a PR to the release branch

Push the cherry-picks to a new branch and open a PR targeting `release/<VERSION>`. The PR title lists every cherry-picked PR number. The body uses `## Cherry-picked PRs` as the only heading with one `- #<NUM>` bullet per PR — no titles, no links, no extra text.

```bash
git checkout -B cherry-picks/release-<VERSION>
git push -u origin cherry-picks/release-<VERSION>

gh pr create \
--title "[Cherry-pick] PRs #<NUM1> #<NUM2> ..." \
--base release/<VERSION> \
--head cherry-picks/release-<VERSION> \
--body "$(cat <<'EOF'
## Cherry-picked PRs

- #<NUM1>
- #<NUM2>
...
EOF
)"
```

## Step 7 — Apply cherry-pick-done label

Add the `cherry-pick-done` label to every PR that was successfully cherry-picked:

```bash
for pr in <NUM1> <NUM2> ...; do
gh pr edit $pr --repo NVIDIA/Model-Optimizer --add-label "cherry-pick-done"
done
```
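
As an optional check, list the labels on each PR afterwards to confirm the label was applied:

```bash
gh pr view <NUM1> --repo NVIDIA/Model-Optimizer --json labels --jq '.labels[].name'
```
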
3 changes: 3 additions & 0 deletions CHANGELOG.rst
@@ -7,6 +7,7 @@ Changelog
**New Features**

- Support full Transformer Engine spec for Minitron pruning (``mcore_minitron``), so the custom ModelOpt spec is no longer needed. This does not affect the usage of the pruning workflow, but it makes pruning slightly faster and may produce a slightly different pruned model because of different kernels and numerics.
- Add end-to-end tutorial for Minitron pruning + distillation + quantization + evaluation + vLLM deployment for Nemotron-Nano-9B-v2 → Pruned 7B along with data blend preparation steps (and ablation study). See `examples/pruning/minitron/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning/minitron/>`_ for details.
- Add Puzzletron - a new algorithm for heterogeneous pruning of LLM and VLM models. See `examples/puzzletron/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/puzzletron>`_ for more details.
- Added iterator interface using CalibrationDataReader in ONNX quantization workflow.
- Add N:M sparse softmax support to the Triton flash attention kernel (``modelopt.torch.kernels.triton_fa``). See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
@@ -17,6 +18,7 @@ Changelog
- [Early Testing] Add Claude Code PTQ skill (``.claude/skills/ptq/``) for agent-assisted post-training quantization. The skill guides the agent through environment detection, model support checking, format selection, and execution via the launcher or manual SLURM/Docker/bare GPU paths. Includes handling for unlisted models with custom module patching. This feature is in early testing — use with caution.
- Add performant layerwise calibration for large models that don't fit on GPU (e.g. DeepSeek-R1, Kimi-K2). See `modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml>`_ for usage. Layerwise calibration also supports PTQ with intermediate progress saving — useful when long PTQ runs get hit with Slurm timeouts. See `modelopt_recipes/general/ptq/nvfp4_default-none_kv_gptq.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_default-none_kv_gptq.yaml>`_ for usage.
- Add implicit GEMM CUDA kernel for Conv3D with fused NVFP4 fake quantization (``modelopt.torch.quantization.src.conv``). When NVFP4 quantization is applied to an ``nn.Conv3d`` layer via ModelOpt PTQ, the implicit GEMM path is used automatically instead of cuDNN. Uses BF16 WMMA tensor cores (SM80+) with FP32 accumulation and in-kernel FP4 (E2M1) activation quantization. Grouped convolution (``groups > 1``) falls back to the default cuDNN path. Inference only — training mode falls back to cuDNN with a warning.
- Add FP8 MHA quantization support for vision transformers. Adds an attention-aware ONNX post-processing pass (scale Mul / K-transpose move before Q, Q→DQ insertion on softmax output) in :class:`FP8QuantExporter <modelopt.onnx.export.fp8_exporter.FP8QuantExporter>`, per-instance nested-attention-wrapper skipping in the HF plugin, and ``nn.LayerNorm`` registration in ``QuantModuleRegistry`` so BMM input quantizers and LayerNorm output quantizers defined in FP8_DEFAULT_CFG are honored end-to-end. See `examples/torch_onnx/torch_quant_to_onnx.py <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/torch_onnx/torch_quant_to_onnx.py>`_ for the general timm-model quantize→ONNX workflow.

**Backward Breaking Changes**

@@ -29,6 +31,7 @@ Changelog
- Fix Minitron pruning (``mcore_minitron``) for MoE models. Previously, importance estimation hooks were incorrectly registered for MoE modules and the NAS step would hang.
- Fix TRT support for remote autotuning in ONNX Autotune from 10.16+ to 10.15+ and fix TRT versioning check to the ``trtexec`` version instead of the TRT Python API when using ``trtexec`` backend.
- Exclude MatMul/Gemm nodes with K or N < 16 from ONNX INT8 and FP8 quantization. Such small-dimension GEMMs cannot efficiently use INT8/FP8 Tensor Cores and the added Q/DQ layers cause perf regressions in TensorRT. Honors Gemm ``transB`` when deriving K.
- Fix ``nvfp4_awq`` export ``AssertionError: Modules have different quantization formats`` for MoE models (e.g. Qwen3-30B-A3B) when some experts are not exercised by the calibration data. ``awq_lite`` now applies a neutral all-ones ``pre_quant_scale`` to any expert that ends up disabled (no cache-pass tokens, NaN scales, or no search-pass tokens) so its format remains ``nvfp4_awq``, consistent with the rest of the MoE block. A warning is emitted whenever this fallback fires.

**Misc**

16 changes: 16 additions & 0 deletions docs/source/getting_started/windows/_installation_standalone.rst
@@ -64,6 +64,22 @@ If you need to use any other EP for calibration, you can uninstall the existing

By default, ModelOpt-Windows utilizes the `cupy-cuda12x <https://cupy.dev//>`_ tool for GPU acceleration during the INT4 ONNX quantization process. This is compatible with CUDA 12.x.

If you are using CUDA 13.x, update CUDA-dependent packages manually:

For official ONNX Runtime guidance, see `Nightly builds for CUDA 13.x <https://onnxruntime.ai/docs/install/#nightly-for-cuda-13x>`_.

1. Uninstall ``cupy-cuda12x`` and install ``cupy-cuda13x``.
2. Uninstall ``onnxruntime-genai-cuda`` and ``onnxruntime-gpu``.
3. Install ONNX Runtime CUDA 13 nightly and the pre-release ``onnxruntime-genai-cuda`` package.

.. code-block:: bash

pip uninstall -y cupy-cuda12x onnxruntime-genai-cuda onnxruntime-gpu
pip install cupy-cuda13x
pip install coloredlogs flatbuffers numpy packaging protobuf sympy
pip install --pre --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ort-cuda-13-nightly/pypi/simple/ onnxruntime-gpu
pip install --pre onnxruntime-genai-cuda
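
To sanity-check the reinstall, you can confirm that the CUDA execution provider is visible to ONNX Runtime (a minimal check; the exact provider list depends on your build and driver):

.. code-block:: bash

    python -c "import onnxruntime; print(onnxruntime.get_available_providers())"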

**6. Verify Installation**

Ensure the following steps are verified:
242 changes: 242 additions & 0 deletions examples/dataset/MEGATRON_DATA_PREP.md
@@ -0,0 +1,242 @@
# Tokenizing for Megatron Frameworks

| **Section** | **Description** | **Link** |
| :---: | :---: | :---: |
| From JSONL files | Tokenize local JSONL files | \[[Link](#from-jsonl-files)\] |
| From Hugging Face Hub | Stream or download HF datasets and tokenize | \[[Link](#from-hugging-face-hub)\] |
| `reasoning_content` for Post-Training v3 | Control how chain-of-thought traces are handled | \[[Link](#reasoning_content-for-post-training-v3-datasets)\] |
| Nemotron Pre/Post-Training Datasets | Ready-to-run commands for all Nemotron datasets | \[[Link](#ready-to-run-tokenization-commands)\] |

The distillation and pre-training scripts in Megatron-Bridge or Megatron-LM expect data pre-tokenized in Megatron's binary indexed format (`.bin` / `.idx`).
Use the `megatron_preprocess_data` utility to tokenize any JSONL or Hugging Face dataset.
The tokenization scripts below print the list of output prefixes (e.g. `tokenized_qwen3/data1_text`) that you can use for the `data_paths` argument (with relative weights on different files) in Megatron training scripts.
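
For example, a hypothetical fragment of a Megatron-style training invocation that blends two tokenized prefixes with 70/30 relative weights (the exact argument name depends on the training script; the weights and prefixes below are illustrative only):

```bash
--data-path 0.7 tokenized_qwen3/data1_text 0.3 tokenized_qwen3/data2_text
```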

**Important Notes:**

- For Pretraining / raw-text data (`text` key) — use `--append_eod` so Megatron can tell where documents end when concatenating them into long sequences.
- For Post-training chat data (`messages` key) — omit `--append_eod`; the chat template already appends EOS at the end of each conversation.
- Set `--max_sequence_length 256_000` to avoid rare OOM errors if some text is very long.

## From JSONL files

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--jsonl_paths /path/to/data1.jsonl /path/to/data2.jsonl ... \
--json_keys text \
--tokenizer Qwen/Qwen3-0.6B \
--output_dir tokenized_qwen3 \
--workers 32 \
--append_eod
```

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--jsonl_paths /path/to/sft_data.jsonl \
--json_keys messages \
--tokenizer Qwen/Qwen3-0.6B \
--output_dir tokenized_qwen3 \
--workers 32
```

Instead of `--jsonl_paths`, pass `--input_dir /path/to/dir` to tokenize all JSONL files in a directory (`.jsonl` and `.jsonl.gz` are both supported).
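
For example, a directory-based variant of the pretraining command above (same flags; only the input selection changes, and `/path/to/dir` is a placeholder):

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --input_dir /path/to/dir \
    --json_keys text \
    --tokenizer Qwen/Qwen3-0.6B \
    --output_dir tokenized_qwen3 \
    --workers 32 \
    --append_eod
```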

## From Hugging Face Hub

To tokenize a dataset directly from Hugging Face Hub:

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_dataset nvidia/Nemotron-Pretraining-SFT-v1 \
--hf_name Nemotron-SFT-Code \
--hf_split train \
--hf_max_samples_per_split 10_000_000 \
--json_keys text \
--tokenizer Qwen/Qwen3-0.6B \
--output_dir tokenized_qwen3 \
--workers 32 \
--append_eod
```

Omit `--hf_name` to process all subsets, `--hf_split` for all splits, or `--hf_max_samples_per_split` for all samples.
To quickly test, use [nvidia/Nemotron-Pretraining-Dataset-sample](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample).

For very large datasets (tens of millions of documents), or datasets with complex nested message schemas (e.g. `tool_calls`, `function_call` fields) that cause Arrow type-cast errors in non-streaming mode, add `--hf_streaming` to avoid downloading the full dataset — only the rows actually consumed are fetched. Optionally pair with `--hf_max_samples_per_split <num_samples>` to cap the row count; without it streaming still works but re-downloads on every run with no disk cache.

> **Performance note:** Non-streaming mode downloads all Parquet shards once and caches them as Arrow files on disk.
> Re-runs read from cache and are much faster.
> Streaming re-downloads on every run with no cache, so it is slower for full-dataset processing.

## `reasoning_content` for Post-Training v3 Datasets

v3 datasets include a `reasoning_content` field in assistant messages (chain-of-thought separate from
the final answer). Use `--reasoning_content` to control how it is handled:

| Value | Behaviour |
| --- | --- |
| `strip` (default) | Field is discarded before `apply_chat_template`. Safe for any tokenizer. |
| `inline` | Wrapped as `<think>…</think>` and prepended to `content`. Preserves reasoning in a tokenizer-agnostic way. |
| `native` | Passed unchanged. Requires the tokenizer's chat template to handle the field (e.g. Qwen3). |

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_dataset nvidia/Nemotron-Math-v2 \
--hf_split high_part00 \
--json_keys messages \
--tokenizer nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
--output_dir tokenized_nemotron_v2 \
--workers 32 \
--reasoning_content inline
```

---

## Ready-to-run tokenization commands

This section provides tokenization commands for all Nemotron Pre-Training and Post-Training datasets used in Megatron-Bridge distillation experiments.

Two parameters vary by model — set them before running the commands below:

```bash
TOKENIZER=nvidia/NVIDIA-Nemotron-Nano-9B-v2 # HuggingFace tokenizer (or local path)
OUTPUT_DIR=tokenized_nemotron_v2 # Output directory for tokenized files
```

> [!TIP]
> Token count for a `.bin` file = file size in bytes ÷ 4. This is also printed by the tokenization script on completion.
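
For instance, a quick shell computation of the token count using GNU `stat` on Linux (the file name is just an example from the expected output below):

```bash
BIN=${OUTPUT_DIR}/nvidia--Nemotron-Math-v2_default_high_part00_messages.bin
echo $(( $(stat -c%s "$BIN") / 4 ))  # approximate token count
```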

> [!NOTE]
> Tokenizing each of the datasets below takes anywhere from 10 minutes to a few hours. You can tokenize them all in parallel to speed up the process.
>
> You may tokenize more datasets or skip some datasets depending on your needs.

### Nemotron Pretraining dataset

**[nvidia/Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1)** — raw text; omitting `--hf_name` tokenizes all 3 subsets (Code, General, MATH) in one command, producing a separate output file per subset named after each:

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_dataset nvidia/Nemotron-Pretraining-SFT-v1 \
--hf_split train \
--hf_streaming \
--hf_max_samples_per_split 10_000_000 \
--json_keys text \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000 \
--append_eod \
--strip_newlines
```

---

### Nemotron Post-training v1 dataset

**[nvidia/Nemotron-Post-Training-Dataset-v1](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1)** — STEM subset, capped at 5M samples. v1 data does not contain reasoning traces:

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_dataset nvidia/Nemotron-Post-Training-Dataset-v1 \
--hf_name default \
--hf_split stem \
--hf_streaming \
--hf_max_samples_per_split 5_000_000 \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000
```

---

### Nemotron Post-training v3 collection

Datasets below are from the [Nemotron Post-Training v3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3). All use `--reasoning_content inline` to preserve `<think>…</think>` traces. The collection contains many more datasets — if you care about benchmarks not covered here (e.g. multilingual, agentic/tool use, SWE, safety), pick the relevant datasets from the collection and tokenize them the same way.

**[nvidia/Nemotron-Math-v2](https://huggingface.co/datasets/nvidia/Nemotron-Math-v2)** — tokenize `high_part00` and `high_part01` separately:

```bash
for SPLIT in high_part00 high_part01; do
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_dataset nvidia/Nemotron-Math-v2 \
--hf_split ${SPLIT} \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000 \
--reasoning_content inline
done
```

**[nvidia/Nemotron-SFT-Competitive-Programming-v2](https://huggingface.co/datasets/nvidia/Nemotron-SFT-Competitive-Programming-v2)** — stored as raw JSONL on HuggingFace, download before tokenizing:

```bash
hf download nvidia/Nemotron-SFT-Competitive-Programming-v2 \
--repo-type dataset \
--local-dir datasets/Nemotron-SFT-Competitive-Programming-v2/
for FILE in competitive_programming_python_00 competitive_programming_cpp_00; do
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--jsonl_paths datasets/Nemotron-SFT-Competitive-Programming-v2/data/${FILE}.jsonl \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000 \
--reasoning_content inline
done
```

**[nvidia/Nemotron-Science-v1](https://huggingface.co/datasets/nvidia/Nemotron-Science-v1)** — stored as raw JSONL on HuggingFace, download before tokenizing:

```bash
hf download nvidia/Nemotron-Science-v1 \
--repo-type dataset \
--local-dir datasets/Nemotron-Science-v1/
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--input_dir datasets/Nemotron-Science-v1/data/ \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000 \
--reasoning_content inline
```

**[nvidia/Nemotron-SFT-Instruction-Following-Chat-v2](https://huggingface.co/datasets/nvidia/Nemotron-SFT-Instruction-Following-Chat-v2)** — stored as raw JSONL on HuggingFace, download before tokenizing:

```bash
hf download nvidia/Nemotron-SFT-Instruction-Following-Chat-v2 \
--repo-type dataset \
--local-dir datasets/Nemotron-SFT-Instruction-Following-Chat-v2/
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--input_dir datasets/Nemotron-SFT-Instruction-Following-Chat-v2/data/ \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000 \
--reasoning_content inline
```

---

### Expected output

After running all commands above, `${OUTPUT_DIR}/` should contain the following `.bin` / `.idx` file pairs:

```text
nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-Code_train_text_max10000000.{bin,idx}
nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-General_train_text_max10000000.{bin,idx}
nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-MATH_train_text_max10000000.{bin,idx}
nvidia--Nemotron-Post-Training-Dataset-v1_default_stem_messages_max5000000.{bin,idx}
nvidia--Nemotron-Math-v2_default_high_part00_messages.{bin,idx}
nvidia--Nemotron-Math-v2_default_high_part01_messages.{bin,idx}
competitive_programming_python_00_messages.{bin,idx}
competitive_programming_cpp_00_messages.{bin,idx}
MCQ_messages.{bin,idx}
RQA_messages.{bin,idx}
reasoning_off_messages.{bin,idx}
reasoning_on_messages.{bin,idx}
```