diff --git a/examples/dataset/MEGATRON_DATA_PREP.md b/examples/dataset/MEGATRON_DATA_PREP.md
index c3904d2a0f..7d9ad60e79 100644
--- a/examples/dataset/MEGATRON_DATA_PREP.md
+++ b/examples/dataset/MEGATRON_DATA_PREP.md
@@ -97,8 +97,8 @@ Tokenization commands for all Nemotron Pre-Training and Post-Training datasets u
Two parameters vary by model — set them before running the commands below:
```bash
-TOKENIZER=nvidia/NVIDIA-Nemotron-Nano-9B-v2 # HuggingFace tokenizer (or local path)
-OUTPUT_DIR=tokenized_nemotron_v2 # Output directory for tokenized files
+TOKENIZER=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 # HuggingFace tokenizer (or local path)
+OUTPUT_DIR=tokenized_nemotron_3 # Output directory for tokenized files
```
> [!TIP]
@@ -154,13 +154,14 @@ python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
Datasets below are from the [Nemotron Post-Training v3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3). All use `--reasoning_content inline` to preserve `…` traces. The collection contains many more datasets — if you care about benchmarks not covered here (e.g. multilingual, agentic/tool use, SWE, safety), pick the relevant datasets from the collection and tokenize them the same way.
-**[nvidia/Nemotron-Math-v2](https://huggingface.co/datasets/nvidia/Nemotron-Math-v2)** — tokenize `high_part00` and `high_part01` separately:
+**[nvidia/Nemotron-Math-v2](https://huggingface.co/datasets/nvidia/Nemotron-Math-v2)** — tokenize `high_part00` and `high_part01` separately. `--hf_streaming` is required because the messages contain extra fields (e.g. `tool_calls`) that cause Arrow type-cast errors in non-streaming mode when using tokenizers with complex chat templates (such as Nemotron v3):
```bash
for SPLIT in high_part00 high_part01; do
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_dataset nvidia/Nemotron-Math-v2 \
--hf_split ${SPLIT} \
+ --hf_streaming \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
@@ -170,6 +171,26 @@ for SPLIT in high_part00 high_part01; do
done
```
+**[nvidia/Nemotron-SFT-Math-v3](https://huggingface.co/datasets/nvidia/Nemotron-SFT-Math-v3)** — stored as raw JSONL on HuggingFace, download before tokenizing (more reliable than streaming for this dataset due to complex nested `tool_calls` fields):
+
+```bash
+hf download nvidia/Nemotron-SFT-Math-v3 \
+ --repo-type dataset \
+ --local-dir datasets/Nemotron-SFT-Math-v3/
+python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
+ --jsonl_paths datasets/Nemotron-SFT-Math-v3/data/train.jsonl \
+ --json_keys messages \
+ --tokenizer ${TOKENIZER} \
+ --output_dir ${OUTPUT_DIR} \
+ --workers 96 \
+ --max_sequence_length 256_000 \
+ --reasoning_content inline
+
+# Rename to avoid generic file name
+mv ${OUTPUT_DIR}/train_messages.bin ${OUTPUT_DIR}/nvidia--Nemotron-SFT-Math-v3_default_train_messages.bin
+mv ${OUTPUT_DIR}/train_messages.idx ${OUTPUT_DIR}/nvidia--Nemotron-SFT-Math-v3_default_train_messages.idx
+```
+
**[nvidia/Nemotron-SFT-Competitive-Programming-v2](https://huggingface.co/datasets/nvidia/Nemotron-SFT-Competitive-Programming-v2)** — stored as raw JSONL on HuggingFace, download before tokenizing:
```bash
@@ -233,6 +254,7 @@ nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-MATH_train_text_max10000000.{bi
nvidia--Nemotron-Post-Training-Dataset-v1_default_stem_messages_max5000000.{bin,idx}
nvidia--Nemotron-Math-v2_default_high_part00_messages.{bin,idx}
nvidia--Nemotron-Math-v2_default_high_part01_messages.{bin,idx}
+nvidia--Nemotron-SFT-Math-v3_default_train_messages.{bin,idx}
competitive_programming_python_00_messages.{bin,idx}
competitive_programming_cpp_00_messages.{bin,idx}
MCQ_messages.{bin,idx}
diff --git a/examples/pruning/README.md b/examples/pruning/README.md
index 3f0e4c3e33..31e5ff6eee 100644
--- a/examples/pruning/README.md
+++ b/examples/pruning/README.md
@@ -307,6 +307,7 @@ After pruning, distillation is required to recover model accuracy. Below are rec
End-to-end distillation results with Megatron-Bridge after Minitron and Puzzletron pruning:
- **[Minitron — Nemotron-Nano-9B-v2](minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md)**: End-to-end tutorial of structured pruning for Nemotron-Nano-9B-v2 to 7B followed by knowledge distillation up to 80B tokens, quantization, and vLLM deployment. Achieves near-parity with the official 9B model across popular pretraining and reasoning benchmarks.
+- **[Minitron — Nemotron-3-Nano-30B-A3B-BF16](minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md)**: End-to-end tutorial of structured pruning for Nemotron-3-Nano-30B-A3B-BF16 from 31.6B/A3.6B to 22B total / 3.0B active parameters, followed by knowledge distillation up to 100B tokens, quantization, and vLLM deployment. Achieves near-parity with the official 30B model across popular pretraining and reasoning benchmarks.
- **[Puzzletron — Qwen3-8B and Llama-3.1-8B-Instruct](puzzletron/Llama-3.1-8B-Instruct.md)**: MIP-based compression followed by short distillation runs on WikiText-103. Shows MMLU recovery and illustrates the importance of using larger datasets to avoid overfitting.
## Resources
diff --git a/examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/ABLATIONS.md b/examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/ABLATIONS.md
new file mode 100644
index 0000000000..c36ad43a9d
--- /dev/null
+++ b/examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/ABLATIONS.md
@@ -0,0 +1,193 @@
+# Ablations: Nemotron-3-Nano-30B-A3B-BF16
+
+## Pruning
+
+> [!NOTE]
+> The search space analysis below is specific to Nemotron hybrid (Mamba + Attention + MoE) models. Standard transformers expose only layers/hidden/attention/FFN dimensions, but these models add Mamba-specific dimensions (`mamba_num_heads`, `mamba_head_dim`) and MoE dimensions (`num_moe_experts`, `moe_ffn_hidden_size`, `moe_shared_expert_intermediate_size`). The resulting search space is significantly larger, and the default 40% width / 20% depth constraints include many dead-zone architectures that waste scoring compute. The tighter constraints recommended here were derived from this analysis.
+
+- Score function: `mmlu_10pct_bs32` (zero-shot MMLU on 10% subset, no distillation applied)
+- Candidates analyzed: ~50 across multiple NAS runs, all with `--prune_target_active_params 3e9`
+- Random baseline (5-way MMLU): ~0.25
+- Search space note: `num_attention_heads` was skipped in all runs.
+
+### Key Findings Summary
+
+| Dimension | Good range | Avoid | Key finding |
+| --- | --- | --- | --- |
+| `num_layers` | ≥ 48 | ≤ 46 | 42L fails universally; 46L avg MMLU 0.261 |
+| `mamba_state_dim` | ≥ 3072 (56×56 or 64×64) | < 3072; asymmetric pairs (56×64) | Symmetric reduction — both heads and head_dim must shrink together |
+| `hidden_size` | 2304–2560 | 2688 (original); 2048 | Original hidden_size is actively harmful when other dims are pruned |
+| MoE dims | experts 96–128; shared 3072–3712 | experts = 88 | Weak independent signal after controlling for dominant dims |
+
+---
+
+### Original Model Dimensions
+
+Params: 31.6B, Active: 3.6B
+
+| Dimension | Value |
+| --- | --- |
+| `num_hidden_layers` | 52 |
+| `hidden_size` | 2688 |
+| `mamba_num_heads` | 64 |
+| `mamba_head_dim` | 64 |
+| `num_moe_experts` | 128 |
+| `moe_ffn_hidden_size` | 1856 |
+| `moe_shared_expert_intermediate_size` | 3712 |
+
+---
+
+### Top Candidates (best seen across all runs)
+
+All candidates below have `active_params = 3.00B`.
+
+| Score | Layers | Hidden | Heads | HeadDim | Experts | FFN | Shared | Total Parameters |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| **0.4783** | 52 | 2304 | 64 | 64 | 104 | 1856 | 3072 | 22.3B |
+| **0.4727** | 52 | 2560 | 64 | 64 | 80 | 1280 | 3712 | 14.2B |
+| **0.4650** | 48 | 2560 | 56 | 56 | 112 | 1792 | 3072 | 25.4B |
+| **0.4608** | 52 | 2304 | 64 | 64 | 96 | 1856 | 3072 | 20.7B |
+| **0.4329** | 48 | 2560 | 56 | 56 | 104 | 1792 | 3072 | 23.7B |
+| 0.4119 | 50 | 2304 | 64 | 64 | 80 | 1856 | 3712 | 17.6B |
+| 0.3762 | 52 | 2560 | 48 | 64 | 96 | 1536 | 3712 | 19.3B |
+
+Two architecture families dominate the top results — see [Design Recipe](#design-recipe) below.
+
+---
+
+### Dimension Sensitivity
+
+Sensitivity is ranked by strength of signal across all candidates.
+
+#### 1. `mamba_state_dim` = `mamba_num_heads × mamba_head_dim` — strongest width signal
+
+<details>
+<summary>mamba_state_dim sensitivity: threshold ≥3072, best families 56×56 and 64×64 (click to expand)</summary>
+
+Analyzing num_heads and head_dim jointly is more predictive than either alone:
+
+| state_dim | Formula | Avg MMLU | Max MMLU | n |
+| --- | --- | --- | --- | --- |
+| 1920 | 48×40 | 0.254 | 0.257 | 2 |
+| 2304 | 48×48 | 0.249 | 0.260 | 8 |
+| 2688 | 48×56 or 56×48 | 0.264 | 0.310 | 7 |
+| 3072 | 48×64 | 0.342 | 0.376 | 3 |
+| **3136** | **56×56** | **0.400** | **0.465** | 4 |
+| 3584 | 56×64 or 64×56 | 0.261 | 0.340 | 9 |
+| **4096** | **64×64** | **0.399** | **0.478** | 7 |
+
+**Threshold:** `state_dim ≥ 3072` to escape the near-random zone. Below 3072, no candidate has ever exceeded 0.31 regardless of other settings. The two reliable good families are symmetric configurations: **56×56 = 3136** and **64×64 = 4096** (original).
+
+**Why 3584 has a poor average despite being large:** All 3584 candidates in the data are at 46L (depth-limited). The low average is a depth confound, not an inherent failure of 3584.
+
+**Asymmetric reductions hurt:** Reducing only one of {num_heads, head_dim} while keeping the other at 64 (giving 3584) performs worse than symmetric reduction of both to 56 (giving 3136). The 56×56 pattern is consistently more reliable.
+
+**`head_dim=48` is uniformly bad:** Across all candidates with head_dim=48, every single one scored 0.240–0.260. This holds across varying layers (48L–52L) and hidden sizes. head_dim=48 is the effective lower bound under 30% width pruning and it is never viable.
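+
+The joint analysis is easy to reproduce. A minimal sketch enumerating the head and head-dim choices available under the recommended 30% width cap (48/56/64 each) and flagging the ≥3072 threshold; the preference for symmetric pairs is the empirical finding above:
+
+```bash
+# Enumerate (mamba_num_heads, mamba_head_dim) pairs and flag state_dim >= 3072.
+# Only the symmetric pairs 56x56 and 64x64 were reliably good in our runs;
+# asymmetric pairs above the threshold (e.g. 56x64) still underperformed.
+for heads in 48 56 64; do
+  for dim in 48 56 64; do
+    state_dim=$((heads * dim))
+    if [ ${state_dim} -ge 3072 ]; then zone="above threshold"; else zone="dead zone"; fi
+    echo "${heads} x ${dim} -> state_dim=${state_dim} (${zone})"
+  done
+done
+```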
+
+</details>
+
+---
+
+#### 2. `num_layers` — hard lower bound
+
+<details>
+<summary>num_layers sensitivity: hard floor at 48L, 42L universally fails (click to expand)</summary>
+
+| Layers | Avg MMLU | Max MMLU | n |
+| --- | --- | --- | --- |
+| 42 | 0.232 | 0.234 | 7 |
+| 46 | 0.261 | 0.340 | 12 |
+| 48 | 0.409 | 0.465 | 4 |
+| 50 | 0.257 | 0.412 | 9 |
+| 52 | 0.336 | 0.478 | 19 |
+
+**42L is a universal failure** — 7/7 candidates at 42L scored near-random, with no other dimension able to compensate. Eliminated by the 15% depth constraint.
+
+**46L is still suboptimal** — avg 0.261, no candidate above 0.340. The effective floor for good performance is **48L**. The 15% depth constraint (min 45L) is correct but 46L candidates still appear in results and are reliably mediocre.
+
+**50L avg is pulled down by head_dim=48 candidates** — when controlling for head_dim≥56, 50L performs comparably to 52L. The 50L failure is a head_dim confound, not a genuine depth issue.
+
+</details>
+
+---
+
+#### 3. `hidden_size` — bad at both extremes, joint constraint with depth
+
+<details>
+<summary>hidden_size sensitivity: 2304–2560 good, original 2688 consistently bad (click to expand)</summary>
+
+| hidden_size | Avg MMLU | Max MMLU | n |
+| --- | --- | --- | --- |
+| 2304 | 0.445 | 0.478 | 4 |
+| 2560 | 0.307 | 0.473 | 26 |
+| **2688** | **0.258** | 0.268 | 10 |
+
+**`hidden_size=2688` (the original) is definitively bad in pruned configurations.** This is confirmed by candidates at 52L with hidden=2688 — sufficient depth — all scoring 0.255–0.260. It is not a depth confound. Keeping hidden size un-pruned while reducing other dimensions means the active param budget is consumed inefficiently, leaving too little capacity in the MoE and Mamba layers.
+
+**`hidden_size=2304` requires `num_layers ≥ 48`.** When paired with 42L, it scores 0.232. Paired with 48–52L, it produces the best candidates. This is a joint constraint.
+
+**`hidden_size=2048`** (seen in runs without an active param constraint): consistently undershoots the 3B active target, capping MMLU at ~0.30. Not a viable option.
+
+</details>
+
+---
+
+#### 4. `num_moe_experts`, `moe_ffn_hidden_size`, `moe_shared_expert_intermediate_size` — weak independent signal
+
+<details>
+<summary>MoE dimension sensitivity: weak signal, experts=88 bad, shared 3072–3712 preferred (click to expand)</summary>
+
+No strong monotonic pattern after controlling for the dominant dimensions above.
+
+- `experts=88`: consistently bad (avg 0.260)
+- `moe_shared_expert_intermediate_size` in 3072–3712: preferred; 2560 and 3328 tend to be worse
+- `moe_ffn_hidden_size=1280`: produced the most parameter-efficient good architecture (0.4727, 14.15B total) but is a 31% reduction from the original 1856 — just outside the 30% width constraint. Use **32% width** to include this family if total-param efficiency matters.
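+
+The 31% figure in the last bullet is simple arithmetic; a quick check of how the width cap maps onto the original `moe_ffn_hidden_size` of 1856 (approximate integer math):
+
+```bash
+# moe_ffn_hidden_size=1280 relative to the original 1856
+echo "reduction:      $(( (1856 - 1280) * 100 / 1856 ))%"   # ~31%, outside a 0.30 cap
+echo "0.30 cap floor: $(( 1856 * 70 / 100 ))"               # ~1299, so 1280 is excluded
+echo "0.32 cap floor: $(( 1856 * 68 / 100 ))"               # ~1262, so 1280 is included
+```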
+
+</details>
+
+---
+
+### Pruning Constraint Recommendations
+
+```bash
+--max_depth_pruning 0.15 # eliminates 42L dead zone; no good arch needs >7 layers removed
+--max_width_pruning 0.30 # eliminates head_dim≤40, shared=2560; 0.32 to also include ffn=1280
+--prune_target_params 28e9 # not a critical constraint since active params are the primary target
+--prune_target_active_params 3e9 # required; omitting this causes active params to undershoot, capping MMLU at ~0.30
+--hparams_to_skip num_attention_heads
+```
+
+**Remaining dead zones within 15%/30% search space** — the following are still reachable by the NAS but consistently fail:
+
+- `hidden_size=2688` — all 4 candidates at 52L scored 0.255–0.260; consider hardcoding min hidden to 2304
+- `num_layers=46` — avg 0.261 across 12 candidates, none above 0.340
+- `mamba_head_dim=48` — all 7 candidates scored 0.240–0.260; consider hardcoding min head_dim to 56
+
+---
+
+### Design Recipe
+
+Two confirmed high-quality architecture families based on all candidates:
+
+**Family 1 — Best MMLU** (`hidden=2304`, `state_dim=4096`):
+
+```text
+52L | hidden=2304 | mamba_num_heads=64 | mamba_head_dim=64 | num_moe_experts=96–104 | moe_ffn_hidden_size=1856 | shared=3072
+active=3.00B, total=20.7–22.3B, MMLU=0.461–0.478
+```
+
+**Family 2 — Good MMLU, larger total params** (`hidden=2560`, `state_dim=3136`):
+
+```text
+48L | hidden=2560 | mamba_num_heads=56 | mamba_head_dim=56 | num_moe_experts=104–112 | moe_ffn_hidden_size=1792 | shared=3072
+active=3.00B, total=23.7–25.4B, MMLU=0.433–0.465
+```
+
+**Required conditions for any good candidate:**
+
+| Condition | Threshold | Failure rate below threshold |
+| --- | --- | --- |
+| `num_layers` | ≥ 48 | 42L: 7/7 fail; 46L: 12/12 below 0.35 |
+| `mamba_state_dim` | ≥ 3072 | 0/24 candidates below 3072 exceed 0.31 |
+| `hidden_size` | 2304–2560 | 2688: 10/10 below 0.27 |
+| `mamba_head_dim` | ≥ 56 | 15/15 candidates with head_dim≤48 score 0.24–0.26 |
diff --git a/examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md b/examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md
new file mode 100644
index 0000000000..bca89c71de
--- /dev/null
+++ b/examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md
@@ -0,0 +1,378 @@
+# Nemotron-3-Nano-30B-A3B: Prune + Distill + Quantize + vLLM Deployment
+
+End-to-end optimization of [NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) demonstrating how ModelOpt techniques stack: Minitron structured pruning → Megatron-Bridge knowledge distillation to recover accuracy → FP8 quantization → vLLM deployment and throughput benchmarking. This document covers:
+
+1. **[Data Preparation](#1-data-preparation)** — tokenizing the training blend for distillation
+2. **[Pruning](#2-pruning)** — Minitron structured pruning
+3. **[Distillation](#3-distillation)** — recovering accuracy via Megatron-Bridge knowledge distillation
+4. **[Evaluation](#4-evaluation)** — benchmarking with NeMo Evaluator across MMLU Pro, GPQA Diamond, AIME, and more
+5. **[Quantization](#5-quantization)** — FP8 PTQ on the distilled checkpoint using ModelOpt's `examples/llm_ptq/hf_ptq.py` script
+6. **[vLLM Inference Benchmarking](#6-vllm-inference-benchmarking)** — throughput comparison of BF16 vs FP8 on a single H100
+
+**Environment:** Container `nvcr.io/nvidia/nemo:26.04`, ModelOpt 0.45.0. See the [Megatron-Bridge README](../../../megatron_bridge/README.md) for environment setup (including ModelOpt mount path) and container usage.
+
+## Results
+
+
+
+| Model | MMLU | MMLU Pro | GPQA Diamond | LiveCodeBench v6 | AIME 2025 | IFBench | SciCode (Subtask) | Average |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| Pruned 22B/A3.0B (no distillation) | 53.4 | 47.1 | 33.5 | 27.4 | 15.5 | 37.2 | 11.4 | 32.2 |
+| Distill @ 2.5B tokens (100 iters) | 68.6 | 73.6 | 62.5 | 57.5 | 79.1 | 58.0 | 21.6 | 60.1 |
+| Distill @ 20B tokens (800 iters) | 70.8 | 74.6 | 65.3 | 61.0 | 79.8 | 63.5 | 21.2 | 62.3 |
+| Distill @ 40B tokens (1600 iters) | 71.6 | 75.7 | 64.5 | 61.6 | 76.8 | 67.2 | 27.0 | 63.5 |
+| Distill @ 60B tokens (2400 iters) | 71.5 | 76.0 | 67.5 | 63.0 | 77.5 | 68.0 | 28.5 | 64.6 |
+| Distill @ 80B tokens (3200 iters) | 71.7 | 76.5 | 68.4 | 64.2 | 80.2 | 66.1 | 27.0 | 64.9 |
+| Distill @ 100B tokens (4000 iters) | 71.8 | 76.6 | 68.4 | 64.5 | 81.0 | 68.5 | 26.8 | 65.4 |
+| NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 (official, 31.6B/A3.6B) | 73.5 | 78.0 | 70.3 | 67.9 | 87.1 | 68.9 | 33.6 | 68.5 |
+
+> [!NOTE]
+> Exact numbers may vary depending on deployment and evaluation setup. All models above (including the official model) were evaluated once with the same [evaluation setup](#4-evaluation) for fair comparison. These numbers may differ from those reported on the official [Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) HuggingFace model card.
+
+---
+
+## Steps to Reproduce
+
+### 1. Data Preparation
+
+See [examples/dataset/MEGATRON_DATA_PREP.md](../../../dataset/MEGATRON_DATA_PREP.md) for tokenization commands for all datasets used in this blend.
+
+For this experiment: `TOKENIZER=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16`, `OUTPUT_DIR=tokenized_nemotron_3`.
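+
+For convenience, set these once before running the tokenization commands from that guide:
+
+```bash
+TOKENIZER=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16  # HuggingFace tokenizer (or local path)
+OUTPUT_DIR=tokenized_nemotron_3                       # Output directory for tokenized files
+```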
+
+> [!NOTE]
+> Compared to experiments in [NVIDIA-Nemotron-Nano-9B-v2](../NVIDIA-Nemotron-Nano-9B-v2/README.md), we use `Nemotron-SFT-Math-v3` instead of `Nemotron-Math-v2 / high_part01` since it is higher quality with full reasoning traces.
+
+#### Data Blend
+
+**30% Pretraining (Code 5, General 20, MATH 5) + 70% Post-training v1/v3 (Math 30, Coding 20, Science 15, IF 5)**
+
+| Dataset | Tokens | Weight | Notes |
+| ----------------------------------------------------- | ------ | ------ | ---------------------------------------------- |
+| Nemotron-Pretraining-SFT-v1 / Code (10M samples) | 7B | 5 | Pretraining code |
+| Nemotron-Pretraining-SFT-v1 / General (10M samples) | 16B | 20 | Upweighted to close MMLU gap |
+| Nemotron-Pretraining-SFT-v1 / MATH (10M samples) | 13B | 5 | Pretraining math |
+| Nemotron-Math-v2 / high_part00 | 13B | 10 | Hard math reasoning |
+| Nemotron-SFT-Math-v3 / train | 52B | 20 | Hard math reasoning with full reasoning traces |
+| Nemotron-SFT-Competitive-Programming-v2 / python_00 | 7B | 15 | Python reasoning traces |
+| Nemotron-SFT-Competitive-Programming-v2 / cpp_00 | 7B | 5 | C++ reasoning traces |
+| Nemotron-Post-Training-Dataset-v1 / stem (5M samples) | 22B | 10 | Broad STEM |
+| Nemotron-Science-v1 / MCQ | 0.5B | 3 | GPQA MCQ format alignment |
+| Nemotron-Science-v1 / RQA | 0.3B | 2 | GPQA format diversity |
+| Nemotron-SFT-IF-Chat-v2 / reasoning_on | 2B | 3 | Instruction following (thinking on) |
+| Nemotron-SFT-IF-Chat-v2 / reasoning_off | 1B | 2 | Instruction following (thinking off) |
+
+<details>
+<summary>Data blend for distillation (click to expand)</summary>
+
+```bash
+DATA_BLEND=" \
+5 tokenized_nemotron_3/nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-Code_train_text_max10000000 \
+20 tokenized_nemotron_3/nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-General_train_text_max10000000 \
+5 tokenized_nemotron_3/nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-MATH_train_text_max10000000 \
+10 tokenized_nemotron_3/nvidia--Nemotron-Math-v2_default_high_part00_messages \
+20 tokenized_nemotron_3/nvidia--Nemotron-SFT-Math-v3_default_train_messages \
+15 tokenized_nemotron_3/competitive_programming_python_00_messages \
+5 tokenized_nemotron_3/competitive_programming_cpp_00_messages \
+10 tokenized_nemotron_3/nvidia--Nemotron-Post-Training-Dataset-v1_default_stem_messages_max5000000 \
+3 tokenized_nemotron_3/MCQ_messages \
+2 tokenized_nemotron_3/RQA_messages \
+3 tokenized_nemotron_3/reasoning_on_messages \
+2 tokenized_nemotron_3/reasoning_off_messages \
+"
+```
+
+</details>
+
+#### General Guidelines
+
+In our experiments, a blend of roughly 30% pretraining and 70% post-training data worked best, though exact proportions may vary depending on the benchmarks you care about. The blend above was designed to maximize recovery on popular General Knowledge, Reasoning, Instruction Following, and Tool Calling benchmarks. The key design decisions were:
+
+- **30% pretraining data** closes the MMLU gap that arises from training exclusively on reasoning-heavy post-training data. The General split (20%) is upweighted specifically to recover general knowledge recall.
+- **Math (30%)** is the largest post-training category because AIME and MMLU Pro respond strongly to more math reasoning tokens. We use a mix of `Nemotron-Math-v2` and `Nemotron-SFT-Math-v3` for higher quality math reasoning signal with full reasoning traces.
+- **Science (15%)** uses `Nemotron-Post-Training-Dataset-v1 / stem` as the primary source for volume and GPQA stability, with small allocations to `Nemotron-Science-v1` MCQ/RQA subsets for format alignment with GPQA's multiple-choice structure.
+- **Instruction following (5%)** saturates quickly so a small allocation is sufficient.
+
+This blend intentionally omits capabilities not targeted in this experiment (e.g. long context and multilingual benchmarks). Depending on what benchmarks matter for your use case, you can substitute or add datasets from the [Nemotron Post-Training v3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3), for example:
+
+| Capability | Relevant datasets |
+| --- | --- |
+| Multilingual | `Nemotron-SFT-Multilingual-v1` |
+| Agentic / tool use | `Nemotron-SFT-Tool-Call-v1`, `Nemotron-SFT-Tool-Call-v2` |
+| Software engineering (SWE) | `Nemotron-SFT-SWE-v2` |
+| Safety / alignment | `Nemotron-SFT-Safety-v1` |
+| Long context | `Nemotron-SFT-Long-Context-v1` |
+
+When adding new datasets, reduce weights of lower-priority categories proportionally to keep the total at 100%.
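+
+A quick sanity check before launching distillation: the weights (every other field in `DATA_BLEND`) should sum to 100. A minimal sketch, assuming `DATA_BLEND` is set as above:
+
+```bash
+# Sum the weight fields of the weight/path pairs in DATA_BLEND; expect 100.
+echo ${DATA_BLEND} | awk '{ for (i = 1; i <= NF; i += 2) s += $i; print "total blend weight:", s }'
+```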
+
+---
+
+### 2. Pruning
+
+Here we prune the model from 31.6B/A3.6B to 3.0B active parameters.
+
+Run on **1 node with 8x H100** (~1 hour)
+
+<details>
+<summary>Pruning command (click to expand)</summary>
+
+```bash
+torchrun --nproc_per_node 8 /opt/Model-Optimizer/examples/megatron_bridge/prune_minitron.py \
+ --pp_size 8 \
+ --hf_model_name_or_path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
+ --trust_remote_code \
+ --prune_target_params 28e9 \
+ --prune_target_active_params 3e9 \
+ --hparams_to_skip num_attention_heads \
+ --seq_length 8192 \
+ --output_hf_path /path/to/Nemotron-3-Nano-30B-A3B-Pruned-A3.0B \
+ --top_k 20 \
+ --max_depth_pruning 0.15 \
+ --max_width_pruning 0.30 \
+ --prune_score_func mmlu_10pct_bs32 \
+ --num_layers_in_first_pipeline_stage 5 \
+ --num_layers_in_last_pipeline_stage 5
+```
+
+Non-default arguments:
+
+- `--hparams_to_skip num_attention_heads` (default: none) — pruning attention heads is harder to recover from, hence skipped
+- `--seq_length 8192` (default: 4096) — dataset has longer sequences
+- `--prune_target_active_params 3e9` — MoE-specific; the **primary** pruning constraint — targets active params rather than total params, which is what matters for MoE inference cost
+- `--prune_target_params 28e9` — upper bound on total params only; the actual pruned model total can range anywhere from ~20B to 28B depending on which architecture wins (see the pruning logs below for the top 20 candidates). You may also skip this argument altogether for simplicity.
+- `--top_k 20` (default: 10) — larger candidate pool for better architecture search
+- `--max_depth_pruning 0.15` (default: 0.20) — tighter constraint since candidates with 42–46 layers universally fail for this model
+- `--max_width_pruning 0.30` (default: 0.40) — tighter constraint to prevent head_dim≤48 and hidden=2048 dead zones
+- `--prune_score_func mmlu_10pct_bs32` (default: `mmlu_10pct_bs1`) — batch_size=32 for ~3–4× faster candidate scoring
+- `--num_layers_in_first_pipeline_stage 5 --num_layers_in_last_pipeline_stage 5` — uneven pipeline split since 52 layers do not divide evenly across 8 pipeline stages (see the check below)
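+
+A quick arithmetic check of that layer split (illustration only):
+
+```bash
+# 52 layers over PP=8 with 5 layers in the first and last stages:
+# the remaining 6 middle stages get (52 - 5 - 5) / 6 = 7 layers each.
+echo "middle-stage layers: $(( (52 - 5 - 5) / (8 - 2) ))"   # -> 7
+```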
+
+> [!NOTE]
+> The tighter search space constraints here (`--max_depth_pruning`, `--max_width_pruning`) are specific to Nemotron hybrid models (Mamba + Attention + MoE). Unlike standard transformers, which expose only layers/hidden/attention/FFN dimensions, these models add Mamba-specific dimensions (`mamba_num_heads`, `mamba_head_dim`) and MoE dimensions (`num_moe_experts`, `moe_ffn_hidden_size`, `moe_shared_expert_intermediate_size`), making the combined search space much larger. The default 40%/20% bounds cast too wide a net and waste compute on dead-zone architectures.
+
+See [ABLATIONS.md](ABLATIONS.md#pruning) for the full architecture search analysis across various candidates.
+
+</details>
+
+<details>
+<summary>Pruning logs (top 20 candidates, best subnet, layer patterns) (click to expand)</summary>
+
+```text
+╭──────────────────────────────────────────────────── Original Model Stats ─────────────────────────────────────────────────────╮
+│ Total Parameters 31.58B │
+│ Active Parameters 3.58B │
+│ Memory (BF16, seq_length=8192, batch_size=1) weights: 60230.1 MB, kv_cache: 48.0 MB, mamba_state: 23.8 MB, Total: 60301.9 MB │
+╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
+
+ Search Space
+ (≤30% width / ≤15% depth pruning)
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Hyperparameter ┃ Choices ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ num_layers │ [46, 48, 50, 52] │
+│ hidden_size │ [2048, 2304, 2560, 2688] │
+│ mamba_num_heads │ [48, 56, 64] │
+│ mamba_head_dim │ [48, 56, 64] │
+│ num_moe_experts │ [96, 104, 112, 120, 128] │
+│ moe_ffn_hidden_size │ [1536, 1792, 1856] │
+│ moe_shared_expert_intermediate_size │ [2816, 3072, 3328, 3584, 3712] │
+├─────────────────────────────────────┼────────────────────────────────┤
+│ Search space size │ 10800 │
+└─────────────────────────────────────┴────────────────────────────────┘
+
+Top 20 Candidates with Scores
+┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
+┃ # ┃ export_config ┃ active_params ┃ params ┃ score ┃
+┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
+│ 1 │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 64, 'num_moe_experts': 120, │ 3.00B │ 27.06B │ 0.3399 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 2 │ {'num_layers': 48, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 56, 'num_moe_experts': 112, │ 3.00B │ 25.37B │ 0.4650 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 3 │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 64, 'mamba_head_dim': 56, 'num_moe_experts': 112, │ 3.00B │ 25.37B │ 0.2343 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 4 │ {'num_layers': 52, 'hidden_size': 2688, 'mamba_num_heads': 56, 'mamba_head_dim': 48, 'num_moe_experts': 96, │ 3.00B │ 20.09B │ 0.2552 │
+│ │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 5 │ {'num_layers': 52, 'hidden_size': 2688, 'mamba_num_heads': 48, 'mamba_head_dim': 56, 'num_moe_experts': 104, │ 3.00B │ 21.61B │ 0.2601 │
+│ │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 6 │ {'num_layers': 52, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 64, 'num_moe_experts': 96, │ 3.00B │ 19.28B │ 0.3762 │
+│ │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 3712} │ │ │ │
+│ 7 │ {'num_layers': 52, 'hidden_size': 2304, 'mamba_num_heads': 64, 'mamba_head_dim': 64, 'num_moe_experts': 104, │ 3.00B │ 22.28B │ 0.4783 │
+│ │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 8 │ {'num_layers': 52, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 96, │ 3.00B │ 21.99B │ 0.2420 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3328} │ │ │ │
+│ 9 │ {'num_layers': 50, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 112, │ 3.00B │ 25.37B │ 0.2399 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3712} │ │ │ │
+│ 10 │ {'num_layers': 50, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 112, │ 3.00B │ 26.17B │ 0.2601 │
+│ │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3328} │ │ │ │
+│ 11 │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 64, 'num_moe_experts': 112, │ 3.00B │ 25.37B │ 0.2503 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 12 │ {'num_layers': 48, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 56, 'num_moe_experts': 104, │ 3.00B │ 23.68B │ 0.4329 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 13 │ {'num_layers': 46, 'hidden_size': 2688, 'mamba_num_heads': 64, 'mamba_head_dim': 64, 'num_moe_experts': 128, │ 3.00B │ 26.17B │ 0.2587 │
+│ │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 2816} │ │ │ │
+│ 14 │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 64, 'mamba_head_dim': 56, 'num_moe_experts': 104, │ 3.00B │ 23.68B │ 0.2336 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 15 │ {'num_layers': 52, 'hidden_size': 2688, 'mamba_num_heads': 48, 'mamba_head_dim': 56, 'num_moe_experts': 96, │ 3.00B │ 20.09B │ 0.2559 │
+│ │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 16 │ {'num_layers': 52, 'hidden_size': 2304, 'mamba_num_heads': 64, 'mamba_head_dim': 64, 'num_moe_experts': 96, │ 3.00B │ 20.70B │ 0.4608 │
+│ │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 17 │ {'num_layers': 50, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 104, │ 3.00B │ 23.68B │ 0.2455 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3712} │ │ │ │
+│ 18 │ {'num_layers': 50, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 104, │ 3.00B │ 24.42B │ 0.2503 │
+│ │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3328} │ │ │ │
+│ 19 │ {'num_layers': 48, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 120, │ 3.00B │ 27.92B │ 0.2587 │
+│ │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3712} │ │ │ │
+│ 20 │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 64, 'num_moe_experts': 104, │ 3.00B │ 23.68B │ 0.2469 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+└────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────────┴────────┴────────┘
+
+╭──────────────────────────────────────────────────────────────────────── Best Subnet ─────────────────────────────────────────────────────────────────────────╮
+│ export_config {'num_layers': 52, 'hidden_size': 2304, 'mamba_num_heads': 64, 'mamba_head_dim': 64, 'num_moe_experts': 104, 'moe_ffn_hidden_size': 1856, │
+│ 'moe_shared_expert_intermediate_size': 3072} │
+│ active_params 3.00B │
+│ params 22.28B │
+│ score 0.4783 │
+╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
+
+Original hybrid_layer_pattern: MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME
+Pruned hybrid_layer_pattern: MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME
+
+╭───────────────────────────────────────────────────── Pruned Model Stats ──────────────────────────────────────────────────────╮
+│ Total Parameters 22.28B │
+│ Active Parameters 3.00B │
+│ Memory (BF16, seq_length=8192, batch_size=1) weights: 42489.7 MB, kv_cache: 48.0 MB, mamba_state: 23.8 MB, Total: 42561.6 MB │
+╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
+```
+
+</details>
+
+> [!TIP]
+> Here we skip the Knowledge Distillation (KD) step for candidates for simplicity. If you want to find a better pruned model, you can take a few top candidates' `export_config` from the logs above (where the score is in a similar range as the best subnet), export each as a separate model, and perform KD for ~2B tokens on each of them before selecting the best subnet based on your desired metrics.
+
+> [!NOTE]
+> Copy the `nano_v3_reasoning_parser.py` file from the original HuggingFace checkpoint to the pruned model for evaluation with tool-calling below.
+
+---
+
+### 3. Distillation
+
+Minimum hardware: **4 nodes × 8x H100 (32 GPUs)** — required by `TP=4 × EP=8`. On **96 nodes × 8x H100 (768 GPUs total)**, it takes ~900 H100 GPU-hours per 10B tokens (400 iters), i.e. ~70 min wall-clock per 10B tokens on 96 nodes. Full 100B token run (4k steps) takes ~9k H100 GPU-hours (~12 hours wall-clock).
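+
+The token and GPU-hour figures above follow from the batch configuration used below (a back-of-envelope sketch):
+
+```bash
+# Tokens per iteration and total tokens for the run configured below.
+GBS=3072; SEQ_LENGTH=8192; TRAIN_ITERS=4000
+echo "tokens/iter:  $(( GBS * SEQ_LENGTH ))"                 # ~25.2M
+echo "total tokens: $(( GBS * SEQ_LENGTH * TRAIN_ITERS ))"   # ~100B
+# At ~70 min wall-clock per 400 iters (10B tokens) on 768 GPUs:
+echo "GPU-hours per 10B tokens: $(( 768 * 70 / 60 ))"        # ~900
+```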
+
+<details>
+<summary>Distillation command (click to expand)</summary>
+
+```bash
+python -u /opt/Model-Optimizer/examples/megatron_bridge/distill.py \
+ --teacher_hf_path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
+ --student_hf_path /path/to/Nemotron-3-Nano-30B-A3B-Pruned-A3.0B \
+ --trust_remote_code \
+ --tp_size 4 \
+ --pp_size 1 \
+ --ep_size 8 \
+ --etp_size 1 \
+ --data_paths "${DATA_BLEND}" \
+ --data_path_to_cache /path/to/cache \
+ --seq_length 8192 \
+ --mbs 1 \
+ --gbs 3072 \
+ --train_iters 4000 \
+ --lr 1e-4 \
+ --min_lr 1e-5 \
+ --lr_warmup_iters 25 \
+ --eval_interval 200 \
+ --eval_iters 8 \
+ --log_interval 10 \
+ --output_dir /path/to/distill_output
+
+# Optional: Weights & Biases logging
+# --wandb_project \
+# --wandb_entity \
+# --wandb_exp_name
+```
+
+Non-default arguments:
+
+- `--seq_length 8192` (default: 4096)
+- `--gbs 3072` (default: 768) — matches the original Nemotron-3-Nano-30B training GBS from the paper, kept to preserve the training distribution
+- `--train_iters 4000` — ~100B tokens; can stop earlier and take intermediate checkpoints
+- `--lr_warmup_iters 25` (default: 50)
+- `--eval_interval 200` (default: 100) — less frequent eval to save compute
+- `--eval_iters 8` (default: 32) — since GBS is 4× larger than the default
+
+All other arguments use defaults.
+
+</details>
+
+For multi-node Slurm runs, see the [Megatron-Bridge README](../../../megatron_bridge/README.md#slurm-usage) for details.
+
+Distillation saves checkpoints in Megatron distributed format under `/checkpoints/iter_XXXXXXX`. You can convert any intermediate checkpoint to HuggingFace format using the Megatron-Bridge conversion script (see [Megatron Bridge README](../../../megatron_bridge/README.md) for full details):
+
+```bash
+python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py export \
+ --hf-model /path/to/Nemotron-3-Nano-30B-A3B-Pruned-A3.0B \
+ --megatron-path /checkpoints/iter_ \
+ --hf-path /checkpoints/hf_iter_
+```
+
+> [!NOTE]
+> This is pure SFT-style distillation — no RL or online reward signal is used. Adding an RL-based post-training step after distillation is a natural next step that could further improve some of these benchmarks.
+
+---
+
+### 4. Evaluation
+
+The eval config in [nemo_evaluator.yaml](nemo_evaluator.yaml) is for Slurm-based evaluation — it submits a vLLM serving job and runs evals against it. For local model execution and evaluation, refer to the [NeMo Evaluator documentation](https://docs.nvidia.com/nemo/evaluator/latest/) or this [blog](https://huggingface.co/blog/nvidia/nemotron-3-nano-evaluation-recipe).
+
+Before running, update the following fields in the yaml or overwrite them in the command line with `-o