diff --git a/examples/dataset/MEGATRON_DATA_PREP.md b/examples/dataset/MEGATRON_DATA_PREP.md
index c3904d2a0f..7d9ad60e79 100644
--- a/examples/dataset/MEGATRON_DATA_PREP.md
+++ b/examples/dataset/MEGATRON_DATA_PREP.md
@@ -97,8 +97,8 @@ Tokenization commands for all Nemotron Pre-Training and Post-Training datasets u
Two parameters vary by model — set them before running the commands below:
```bash
-TOKENIZER=nvidia/NVIDIA-Nemotron-Nano-9B-v2 # HuggingFace tokenizer (or local path)
-OUTPUT_DIR=tokenized_nemotron_v2 # Output directory for tokenized files
+TOKENIZER=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 # HuggingFace tokenizer (or local path)
+OUTPUT_DIR=tokenized_nemotron_3 # Output directory for tokenized files
```
> [!TIP]
@@ -154,13 +154,14 @@ python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
Datasets below are from the [Nemotron Post-Training v3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3). All use `--reasoning_content inline` to preserve `…` traces. The collection contains many more datasets — if you care about benchmarks not covered here (e.g. multilingual, agentic/tool use, SWE, safety), pick the relevant datasets from the collection and tokenize them the same way.
-**[nvidia/Nemotron-Math-v2](https://huggingface.co/datasets/nvidia/Nemotron-Math-v2)** — tokenize `high_part00` and `high_part01` separately:
+**[nvidia/Nemotron-Math-v2](https://huggingface.co/datasets/nvidia/Nemotron-Math-v2)** — tokenize `high_part00` and `high_part01` separately. `--hf_streaming` is required because the messages contain extra fields (e.g. `tool_calls`) that cause Arrow type-cast errors in non-streaming mode when using tokenizers with complex chat templates (such as Nemotron v3):
```bash
for SPLIT in high_part00 high_part01; do
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_dataset nvidia/Nemotron-Math-v2 \
--hf_split ${SPLIT} \
+ --hf_streaming \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
@@ -170,6 +171,26 @@ for SPLIT in high_part00 high_part01; do
done
```
+**[nvidia/Nemotron-SFT-Math-v3](https://huggingface.co/datasets/nvidia/Nemotron-SFT-Math-v3)** — stored as raw JSONL on HuggingFace, download before tokenizing (more reliable than streaming for this dataset due to complex nested `tool_calls` fields):
+
+```bash
+hf download nvidia/Nemotron-SFT-Math-v3 \
+ --repo-type dataset \
+ --local-dir datasets/Nemotron-SFT-Math-v3/
+python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
+ --jsonl_paths datasets/Nemotron-SFT-Math-v3/data/train.jsonl \
+ --json_keys messages \
+ --tokenizer ${TOKENIZER} \
+ --output_dir ${OUTPUT_DIR} \
+ --workers 96 \
+ --max_sequence_length 256_000 \
+ --reasoning_content inline
+
+# Rename to avoid generic file name
+mv ${OUTPUT_DIR}/train_messages.bin ${OUTPUT_DIR}/nvidia--Nemotron-SFT-Math-v3_default_train_messages.bin
+mv ${OUTPUT_DIR}/train_messages.idx ${OUTPUT_DIR}/nvidia--Nemotron-SFT-Math-v3_default_train_messages.idx
+```
+
**[nvidia/Nemotron-SFT-Competitive-Programming-v2](https://huggingface.co/datasets/nvidia/Nemotron-SFT-Competitive-Programming-v2)** — stored as raw JSONL on HuggingFace, download before tokenizing:
```bash
@@ -233,6 +254,7 @@ nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-MATH_train_text_max10000000.{bi
nvidia--Nemotron-Post-Training-Dataset-v1_default_stem_messages_max5000000.{bin,idx}
nvidia--Nemotron-Math-v2_default_high_part00_messages.{bin,idx}
nvidia--Nemotron-Math-v2_default_high_part01_messages.{bin,idx}
+nvidia--Nemotron-SFT-Math-v3_default_train_messages.{bin,idx}
competitive_programming_python_00_messages.{bin,idx}
competitive_programming_cpp_00_messages.{bin,idx}
MCQ_messages.{bin,idx}
diff --git a/examples/pruning/README.md b/examples/pruning/README.md
index 3f0e4c3e33..31e5ff6eee 100644
--- a/examples/pruning/README.md
+++ b/examples/pruning/README.md
@@ -307,6 +307,7 @@ After pruning, distillation is required to recover model accuracy. Below are rec
End-to-end distillation results with Megatron-Bridge after Minitron and Puzzletron pruning:
- **[Minitron — Nemotron-Nano-9B-v2](minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md)**: End-to-end tutorial of structured pruning for Nemotron-Nano-9B-v2 to 7B followed by knowledge distillation up to 80B tokens, quantization, and vLLM deployment. Achieves near-parity with the official 9B model across popular pretraining and reasoning benchmarks.
+- **[Minitron — Nemotron-3-Nano-30B-A3B-BF16](minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md)**: End-to-end tutorial of structured pruning for Nemotron-3-Nano-30B-A3B-BF16 from 31.6B/A3.6B to 22B total / 3.0B active parameters, followed by knowledge distillation up to 100B tokens, quantization, and vLLM deployment. Achieves near-parity with the official 30B model across popular pretraining and reasoning benchmarks.
- **[Puzzletron — Qwen3-8B and Llama-3.1-8B-Instruct](puzzletron/Llama-3.1-8B-Instruct.md)**: MIP-based compression followed by short distillation runs on WikiText-103. Shows MMLU recovery and illustrates the importance of using larger datasets to avoid overfitting.
## Resources
diff --git a/examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/ABLATIONS.md b/examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/ABLATIONS.md
new file mode 100644
index 0000000000..c36ad43a9d
--- /dev/null
+++ b/examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/ABLATIONS.md
@@ -0,0 +1,193 @@
+# Ablations: Nemotron-3-Nano-30B-A3B-BF16
+
+## Pruning
+
+> [!NOTE]
+> The search space analysis below is specific to Nemotron hybrid (Mamba + Attention + MoE) models. Standard transformers expose only layers/hidden/attention/FFN dimensions, but these models add Mamba-specific dimensions (`mamba_num_heads`, `mamba_head_dim`) and MoE dimensions (`num_moe_experts`, `moe_ffn_hidden_size`, `moe_shared_expert_intermediate_size`). The resulting search space is significantly larger, and the default 40% width / 20% depth constraints include many dead-zone architectures that waste scoring compute. The tighter constraints recommended here were derived from this analysis.
+
+- Score function: `mmlu_10pct_bs32` (zero-shot MMLU on 10% subset, no distillation applied)
+- Candidates analyzed: ~50 across multiple NAS runs, all with `--prune_target_active_params 3e9`
+- Random baseline (5-way MMLU): ~0.25
+- Search space note: `num_attention_heads` was skipped in all runs.
+
+### Key Findings Summary
+
+| Dimension | Good range | Avoid | Key finding |
+| --- | --- | --- | --- |
+| `num_layers` | ≥ 48 | ≤ 46 | 42L fails universally; 46L avg MMLU 0.261 |
+| `mamba_state_dim` | ≥ 3072 (56×56 or 64×64) | < 3072; asymmetric pairs (56×64) | Symmetric reduction — both heads and head_dim must shrink together |
+| `hidden_size` | 2304–2560 | 2688 (original); 2048 | Original hidden_size is actively harmful when other dims are pruned |
+| MoE dims | experts 96–128; shared 3072–3712 | experts = 88 | Weak independent signal after controlling for dominant dims |
+
+---
+
+### Original Model Dimensions
+
+Params: 31.6B, Active: 3.6B
+
+| Dimension | Value |
+| --- | --- |
+| `num_hidden_layers` | 52 |
+| `hidden_size` | 2688 |
+| `mamba_num_heads` | 64 |
+| `mamba_head_dim` | 64 |
+| `num_moe_experts` | 128 |
+| `moe_ffn_hidden_size` | 1856 |
+| `moe_shared_expert_intermediate_size` | 3712 |
+
+---
+
+### Top Candidates (best seen across all runs)
+
+All candidates below have `active_params = 3.00B`.
+
+| Score | Layers | Hidden | Heads | HeadDim | Experts | FFN | Shared | Total Parameters |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| **0.4783** | 52 | 2304 | 64 | 64 | 104 | 1856 | 3072 | 22.3B |
+| **0.4727** | 52 | 2560 | 64 | 64 | 80 | 1280 | 3712 | 14.2B |
+| **0.4650** | 48 | 2560 | 56 | 56 | 112 | 1792 | 3072 | 25.4B |
+| **0.4608** | 52 | 2304 | 64 | 64 | 96 | 1856 | 3072 | 20.7B |
+| **0.4329** | 48 | 2560 | 56 | 56 | 104 | 1792 | 3072 | 23.7B |
+| 0.4119 | 50 | 2304 | 64 | 64 | 80 | 1856 | 3712 | 17.6B |
+| 0.3762 | 52 | 2560 | 48 | 64 | 96 | 1536 | 3712 | 19.3B |
+
+Two architecture families dominate the top results — see [Design Recipe](#design-recipe) below.
+
+---
+
+### Dimension Sensitivity
+
+Sensitivity is ranked by strength of signal across all candidates.
+
+#### 1. `mamba_state_dim` = `mamba_num_heads × mamba_head_dim` — strongest width signal
+
+<details>
+<summary>mamba_state_dim sensitivity: threshold ≥3072, best families 56×56 and 64×64 (click to expand)</summary>
+
+Analyzing num_heads and head_dim jointly is more predictive than either alone:
+
+| state_dim | Formula | Avg MMLU | Max MMLU | n |
+| --- | --- | --- | --- | --- |
+| 1920 | 48×40 | 0.254 | 0.257 | 2 |
+| 2304 | 48×48 | 0.249 | 0.260 | 8 |
+| 2688 | 48×56 or 56×48 | 0.264 | 0.310 | 7 |
+| 3072 | 48×64 | 0.342 | 0.376 | 3 |
+| **3136** | **56×56** | **0.400** | **0.465** | 4 |
+| 3584 | 56×64 or 64×56 | 0.261 | 0.340 | 9 |
+| **4096** | **64×64** | **0.399** | **0.478** | 7 |
+
+**Threshold:** `state_dim ≥ 3072` to escape the near-random zone. Below 3072, no candidate has ever exceeded 0.31 regardless of other settings. The two reliable good families are symmetric configurations: **56×56 = 3136** and **64×64 = 4096** (original).
+
+**Why 3584 has a poor average despite being large:** All 3584 candidates in the data are at 46L (depth-limited). The low average is a depth confound, not an inherent failure of 3584.
+
+**Asymmetric reductions hurt:** Reducing only one of {num_heads, head_dim} while keeping the other at 64 (giving 3584) performs worse than symmetric reduction of both to 56 (giving 3136). The 56×56 pattern is consistently more reliable.
+
+**`head_dim=48` is uniformly bad:** Across all candidates with head_dim=48, every single one scored 0.240–0.260. This holds across varying layers (48L–52L) and hidden sizes. head_dim=48 is the effective lower bound under 30% width pruning and it is never viable.
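+
+The joint analysis is easy to reproduce. A minimal sketch enumerating the head and head-dim choices available under the recommended 30% width cap (48/56/64 each) and flagging the ≥3072 threshold; the preference for symmetric pairs is the empirical finding above:
+
+```bash
+# Enumerate (mamba_num_heads, mamba_head_dim) pairs and flag state_dim >= 3072.
+# Only the symmetric pairs 56x56 and 64x64 were reliably good in our runs;
+# asymmetric pairs above the threshold (e.g. 56x64) still underperformed.
+for heads in 48 56 64; do
+  for dim in 48 56 64; do
+    state_dim=$((heads * dim))
+    if [ ${state_dim} -ge 3072 ]; then zone="above threshold"; else zone="dead zone"; fi
+    echo "${heads} x ${dim} -> state_dim=${state_dim} (${zone})"
+  done
+done
+```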
+
+</details>
+
+---
+
+#### 2. `num_layers` — hard lower bound
+
+<details>
+<summary>num_layers sensitivity: hard floor at 48L, 42L universally fails (click to expand)</summary>
+
+| Layers | Avg MMLU | Max MMLU | n |
+| --- | --- | --- | --- |
+| 42 | 0.232 | 0.234 | 7 |
+| 46 | 0.261 | 0.340 | 12 |
+| 48 | 0.409 | 0.465 | 4 |
+| 50 | 0.257 | 0.412 | 9 |
+| 52 | 0.336 | 0.478 | 19 |
+
+**42L is a universal failure** — 7/7 candidates at 42L scored near-random, with no other dimension able to compensate. Eliminated by the 15% depth constraint.
+
+**46L is still suboptimal** — avg 0.261, no candidate above 0.340. The effective floor for good performance is **48L**. The 15% depth constraint (min 45L) is correct but 46L candidates still appear in results and are reliably mediocre.
+
+**50L avg is pulled down by head_dim=48 candidates** — when controlling for head_dim≥56, 50L performs comparably to 52L. The 50L failure is a head_dim confound, not a genuine depth issue.
+
+</details>
+
+---
+
+#### 3. `hidden_size` — bad at both extremes, joint constraint with depth
+
+<details>
+<summary>hidden_size sensitivity: 2304–2560 good, original 2688 consistently bad (click to expand)</summary>
+
+| hidden_size | Avg MMLU | Max MMLU | n |
+| --- | --- | --- | --- |
+| 2304 | 0.445 | 0.478 | 4 |
+| 2560 | 0.307 | 0.473 | 26 |
+| **2688** | **0.258** | 0.268 | 10 |
+
+**`hidden_size=2688` (the original) is definitively bad in pruned configurations.** This is confirmed by candidates at 52L with hidden=2688 — sufficient depth — all scoring 0.255–0.260. It is not a depth confound. Keeping hidden size un-pruned while reducing other dimensions means the active param budget is consumed inefficiently, leaving too little capacity in the MoE and Mamba layers.
+
+**`hidden_size=2304` requires `num_layers ≥ 48`.** When paired with 42L, it scores 0.232. Paired with 48–52L, it produces the best candidates. This is a joint constraint.
+
+**`hidden_size=2048`** (seen in runs without an active param constraint): consistently undershoots the 3B active target, capping MMLU at ~0.30. Not a viable option.
+
+</details>
+
+---
+
+#### 4. `num_moe_experts`, `moe_ffn_hidden_size`, `moe_shared_expert_intermediate_size` — weak independent signal
+
+<details>
+<summary>MoE dimension sensitivity: weak signal, experts=88 bad, shared 3072–3712 preferred (click to expand)</summary>
+
+No strong monotonic pattern after controlling for the dominant dimensions above.
+
+- `experts=88`: consistently bad (avg 0.260)
+- `moe_shared_expert_intermediate_size` in 3072–3712: preferred; 2560 and 3328 tend to be worse
+- `moe_ffn_hidden_size=1280`: produced the most parameter-efficient good architecture (0.4727, 14.15B total) but is a 31% reduction from the original 1856 — just outside the 30% width constraint. Use **32% width** to include this family if total-param efficiency matters.
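+
+The 31% figure in the last bullet is simple arithmetic; a quick check of how the width cap maps onto the original `moe_ffn_hidden_size` of 1856 (approximate integer math):
+
+```bash
+# moe_ffn_hidden_size=1280 relative to the original 1856
+echo "reduction:      $(( (1856 - 1280) * 100 / 1856 ))%"   # ~31%, outside a 0.30 cap
+echo "0.30 cap floor: $(( 1856 * 70 / 100 ))"               # ~1299, so 1280 is excluded
+echo "0.32 cap floor: $(( 1856 * 68 / 100 ))"               # ~1262, so 1280 is included
+```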
+
+</details>
+
+---
+
+### Pruning Constraint Recommendations
+
+```bash
+--max_depth_pruning 0.15 # eliminates 42L dead zone; no good arch needs >7 layers removed
+--max_width_pruning 0.30 # eliminates head_dim≤40, shared=2560; 0.32 to also include ffn=1280
+--prune_target_params 28e9 # not a critical constraint since active params are the primary target
+--prune_target_active_params 3e9 # required; omitting this causes active params to undershoot, capping MMLU at ~0.30
+--hparams_to_skip num_attention_heads
+```
+
+**Remaining dead zones within 15%/30% search space** — the following are still reachable by the NAS but consistently fail:
+
+- `hidden_size=2688` — all 4 candidates at 52L scored 0.255–0.260; consider hardcoding min hidden to 2304
+- `num_layers=46` — avg 0.261 across 12 candidates, none above 0.340
+- `mamba_head_dim=48` — all 7 candidates scored 0.240–0.260; consider hardcoding min head_dim to 56
+
+---
+
+### Design Recipe
+
+Two confirmed high-quality architecture families based on all candidates:
+
+**Family 1 — Best MMLU** (`hidden=2304`, `state_dim=4096`):
+
+```text
+52L | hidden=2304 | mamba_num_heads=64 | mamba_head_dim=64 | num_moe_experts=96–104 | moe_ffn_hidden_size=1856 | shared=3072
+active=3.00B, total=20.7–22.3B, MMLU=0.461–0.478
+```
+
+**Family 2 — Good MMLU, larger total params** (`hidden=2560`, `state_dim=3136`):
+
+```text
+48L | hidden=2560 | mamba_num_heads=56 | mamba_head_dim=56 | num_moe_experts=104–112 | moe_ffn_hidden_size=1792 | shared=3072
+active=3.00B, total=23.7–25.4B, MMLU=0.433–0.465
+```
+
+**Required conditions for any good candidate:**
+
+| Condition | Threshold | Failure rate below threshold |
+| --- | --- | --- |
+| `num_layers` | ≥ 48 | 42L: 7/7 fail; 46L: 12/12 below 0.35 |
+| `mamba_state_dim` | ≥ 3072 | 0/24 candidates below 3072 exceed 0.31 |
+| `hidden_size` | 2304–2560 | 2688: 10/10 below 0.27 |
+| `mamba_head_dim` | ≥ 56 | 15/15 candidates with head_dim≤48 score 0.24–0.26 |
diff --git a/examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md b/examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md
new file mode 100644
index 0000000000..bca89c71de
--- /dev/null
+++ b/examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md
@@ -0,0 +1,378 @@
+# Nemotron-3-Nano-30B-A3B: Prune + Distill + Quantize + vLLM Deployment
+
+End-to-end optimization of [NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) demonstrating how ModelOpt techniques stack: Minitron structured pruning → Megatron-Bridge knowledge distillation to recover accuracy → FP8 quantization → vLLM deployment and throughput benchmarking. This document covers:
+
+1. **[Data Preparation](#1-data-preparation)** — tokenizing the training blend for distillation
+2. **[Pruning](#2-pruning)** — Minitron structured pruning
+3. **[Distillation](#3-distillation)** — recovering accuracy via Megatron-Bridge knowledge distillation
+4. **[Evaluation](#4-evaluation)** — benchmarking with NeMo Evaluator across MMLU Pro, GPQA Diamond, AIME, and more
+5. **[Quantization](#5-quantization)** — FP8 PTQ on the distilled checkpoint using ModelOpt's `examples/llm_ptq/hf_ptq.py` script
+6. **[vLLM Inference Benchmarking](#6-vllm-inference-benchmarking)** — throughput comparison of BF16 vs FP8 on a single H100
+
+**Environment:** Container `nvcr.io/nvidia/nemo:26.04`, ModelOpt 0.45.0. See the [Megatron-Bridge README](../../../megatron_bridge/README.md) for environment setup (including ModelOpt mount path) and container usage.
+
+## Results
+
+
+
+| Model | MMLU | MMLU Pro | GPQA Diamond | LiveCodeBench v6 | AIME 2025 | IFBench | SciCode (Subtask) | Average |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| Pruned 22B/A3.0B (no distillation) | 53.4 | 47.1 | 33.5 | 27.4 | 15.5 | 37.2 | 11.4 | 32.2 |
+| Distill @ 2.5B tokens (100 iters) | 68.6 | 73.6 | 62.5 | 57.5 | 79.1 | 58.0 | 21.6 | 60.1 |
+| Distill @ 20B tokens (800 iters) | 70.8 | 74.6 | 65.3 | 61.0 | 79.8 | 63.5 | 21.2 | 62.3 |
+| Distill @ 40B tokens (1600 iters) | 71.6 | 75.7 | 64.5 | 61.6 | 76.8 | 67.2 | 27.0 | 63.5 |
+| Distill @ 60B tokens (2400 iters) | 71.5 | 76.0 | 67.5 | 63.0 | 77.5 | 68.0 | 28.5 | 64.6 |
+| Distill @ 80B tokens (3200 iters) | 71.7 | 76.5 | 68.4 | 64.2 | 80.2 | 66.1 | 27.0 | 64.9 |
+| Distill @ 100B tokens (4000 iters) | 71.8 | 76.6 | 68.4 | 64.5 | 81.0 | 68.5 | 26.8 | 65.4 |
+| NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 (official, 31.6B/A3.6B) | 73.5 | 78.0 | 70.3 | 67.9 | 87.1 | 68.9 | 33.6 | 68.5 |
+
+> [!NOTE]
+> Exact numbers may vary depending on deployment and evaluation setup. All models above (including the official model) were evaluated once with the same [evaluation setup](#4-evaluation) for fair comparison. These numbers may differ from those reported on the official [Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) HuggingFace model card.
+
+---
+
+## Steps to Reproduce
+
+### 1. Data Preparation
+
+See [examples/dataset/MEGATRON_DATA_PREP.md](../../../dataset/MEGATRON_DATA_PREP.md) for tokenization commands for all datasets used in this blend.
+
+For this experiment: `TOKENIZER=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16`, `OUTPUT_DIR=tokenized_nemotron_3`.
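+
+For convenience, set these once before running the tokenization commands from that guide:
+
+```bash
+TOKENIZER=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16  # HuggingFace tokenizer (or local path)
+OUTPUT_DIR=tokenized_nemotron_3                       # Output directory for tokenized files
+```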
+
+> [!NOTE]
+> Compared to experiments in [NVIDIA-Nemotron-Nano-9B-v2](../NVIDIA-Nemotron-Nano-9B-v2/README.md), we use `Nemotron-SFT-Math-v3` instead of `Nemotron-Math-v2 / high_part01` since it is higher quality with full reasoning traces.
+
+#### Data Blend
+
+**30% Pretraining (Code 5, General 20, MATH 5) + 70% Post-training v1/v3 (Math 30, Coding 20, Science 15, IF 5)**
+
+| Dataset | Tokens | Weight | Notes |
+| ----------------------------------------------------- | ------ | ------ | ---------------------------------------------- |
+| Nemotron-Pretraining-SFT-v1 / Code (10M samples) | 7B | 5 | Pretraining code |
+| Nemotron-Pretraining-SFT-v1 / General (10M samples) | 16B | 20 | Upweighted to close MMLU gap |
+| Nemotron-Pretraining-SFT-v1 / MATH (10M samples) | 13B | 5 | Pretraining math |
+| Nemotron-Math-v2 / high_part00 | 13B | 10 | Hard math reasoning |
+| Nemotron-SFT-Math-v3 / train | 52B | 20 | Hard math reasoning with full reasoning traces |
+| Nemotron-SFT-Competitive-Programming-v2 / python_00 | 7B | 15 | Python reasoning traces |
+| Nemotron-SFT-Competitive-Programming-v2 / cpp_00 | 7B | 5 | C++ reasoning traces |
+| Nemotron-Post-Training-Dataset-v1 / stem (5M samples) | 22B | 10 | Broad STEM |
+| Nemotron-Science-v1 / MCQ | 0.5B | 3 | GPQA MCQ format alignment |
+| Nemotron-Science-v1 / RQA | 0.3B | 2 | GPQA format diversity |
+| Nemotron-SFT-IF-Chat-v2 / reasoning_on | 2B | 3 | Instruction following (thinking on) |
+| Nemotron-SFT-IF-Chat-v2 / reasoning_off | 1B | 2 | Instruction following (thinking off) |
+
+<details>
+<summary>Data blend for distillation (click to expand)</summary>
+
+```bash
+DATA_BLEND=" \
+5 tokenized_nemotron_3/nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-Code_train_text_max10000000 \
+20 tokenized_nemotron_3/nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-General_train_text_max10000000 \
+5 tokenized_nemotron_3/nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-MATH_train_text_max10000000 \
+10 tokenized_nemotron_3/nvidia--Nemotron-Math-v2_default_high_part00_messages \
+20 tokenized_nemotron_3/nvidia--Nemotron-SFT-Math-v3_default_train_messages \
+15 tokenized_nemotron_3/competitive_programming_python_00_messages \
+5 tokenized_nemotron_3/competitive_programming_cpp_00_messages \
+10 tokenized_nemotron_3/nvidia--Nemotron-Post-Training-Dataset-v1_default_stem_messages_max5000000 \
+3 tokenized_nemotron_3/MCQ_messages \
+2 tokenized_nemotron_3/RQA_messages \
+3 tokenized_nemotron_3/reasoning_on_messages \
+2 tokenized_nemotron_3/reasoning_off_messages \
+"
+```
+
+</details>
+
+#### General Guidelines
+
+In our experiments, a blend of roughly 30% pretraining and 70% post-training data worked best, though exact proportions may vary depending on the benchmarks you care about. The blend above was designed to maximize recovery on popular General Knowledge, Reasoning, Instruction Following, and Tool Calling benchmarks. The key design decisions were:
+
+- **30% pretraining data** closes the MMLU gap that arises from training exclusively on reasoning-heavy post-training data. The General split (20%) is upweighted specifically to recover general knowledge recall.
+- **Math (30%)** is the largest post-training category because AIME and MMLU Pro respond strongly to more math reasoning tokens. We use a mix of `Nemotron-Math-v2` and `Nemotron-SFT-Math-v3` for higher quality math reasoning signal with full reasoning traces.
+- **Science (15%)** uses `Nemotron-Post-Training-Dataset-v1 / stem` as the primary source for volume and GPQA stability, with small allocations to `Nemotron-Science-v1` MCQ/RQA subsets for format alignment with GPQA's multiple-choice structure.
+- **Instruction following (5%)** saturates quickly so a small allocation is sufficient.
+
+This blend intentionally omits capabilities not targeted in this experiment (e.g. long context and multilingual benchmarks). Depending on what benchmarks matter for your use case, you can substitute or add datasets from the [Nemotron Post-Training v3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3), for example:
+
+| Capability | Relevant datasets |
+| --- | --- |
+| Multilingual | `Nemotron-SFT-Multilingual-v1` |
+| Agentic / tool use | `Nemotron-SFT-Tool-Call-v1`, `Nemotron-SFT-Tool-Call-v2` |
+| Software engineering (SWE) | `Nemotron-SFT-SWE-v2` |
+| Safety / alignment | `Nemotron-SFT-Safety-v1` |
+| Long context | `Nemotron-SFT-Long-Context-v1` |
+
+When adding new datasets, reduce weights of lower-priority categories proportionally to keep the total at 100%.
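+
+A quick sanity check before launching distillation: the weights (every other field in `DATA_BLEND`) should sum to 100. A minimal sketch, assuming `DATA_BLEND` is set as above:
+
+```bash
+# Sum the weight fields of the weight/path pairs in DATA_BLEND; expect 100.
+echo ${DATA_BLEND} | awk '{ for (i = 1; i <= NF; i += 2) s += $i; print "total blend weight:", s }'
+```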
+
+---
+
+### 2. Pruning
+
+Here we prune the model from 31.6B/A3.6B to 3.0B active parameters.
+
+Run on **1 node with 8x H100** (~1 hour)
+
+<details>
+<summary>Pruning command (click to expand)</summary>
+
+```bash
+torchrun --nproc_per_node 8 /opt/Model-Optimizer/examples/megatron_bridge/prune_minitron.py \
+ --pp_size 8 \
+ --hf_model_name_or_path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
+ --trust_remote_code \
+ --prune_target_params 28e9 \
+ --prune_target_active_params 3e9 \
+ --hparams_to_skip num_attention_heads \
+ --seq_length 8192 \
+ --output_hf_path /path/to/Nemotron-3-Nano-30B-A3B-Pruned-A3.0B \
+ --top_k 20 \
+ --max_depth_pruning 0.15 \
+ --max_width_pruning 0.30 \
+ --prune_score_func mmlu_10pct_bs32 \
+ --num_layers_in_first_pipeline_stage 5 \
+ --num_layers_in_last_pipeline_stage 5
+```
+
+Non-default arguments:
+
+- `--hparams_to_skip num_attention_heads` (default: none) — pruning attention heads is harder to recover from, hence skipped
+- `--seq_length 8192` (default: 4096) — dataset has longer sequences
+- `--prune_target_active_params 3e9` — MoE-specific; the **primary** pruning constraint — targets active params rather than total params, which is what matters for MoE inference cost
+- `--prune_target_params 28e9` — upper bound on total params only; the actual pruned model total can range anywhere from ~20B to 28B depending on which architecture wins (see the pruning logs below for the top 20 candidates). You may also skip this argument altogether for simplicity.
+- `--top_k 20` (default: 10) — larger candidate pool for better architecture search
+- `--max_depth_pruning 0.15` (default: 0.20) — tighter constraint since candidates with 42–46 layers universally fail for this model
+- `--max_width_pruning 0.30` (default: 0.40) — tighter constraint to prevent head_dim≤48 and hidden=2048 dead zones
+- `--prune_score_func mmlu_10pct_bs32` (default: `mmlu_10pct_bs1`) — batch_size=32 for ~3–4× faster candidate scoring
+- `--num_layers_in_first_pipeline_stage 5 --num_layers_in_last_pipeline_stage 5` — uneven pipeline split since 52 layers do not divide evenly across 8 pipeline stages (see the check below)
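+
+A quick arithmetic check of that layer split (illustration only):
+
+```bash
+# 52 layers over PP=8 with 5 layers in the first and last stages:
+# the remaining 6 middle stages get (52 - 5 - 5) / 6 = 7 layers each.
+echo "middle-stage layers: $(( (52 - 5 - 5) / (8 - 2) ))"   # -> 7
+```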
+
+> [!NOTE]
+> The tighter search space constraints here (`--max_depth_pruning`, `--max_width_pruning`) are specific to Nemotron hybrid models (Mamba + Attention + MoE). Unlike standard transformers, which expose only layers/hidden/attention/FFN dimensions, these models add Mamba-specific dimensions (`mamba_num_heads`, `mamba_head_dim`) and MoE dimensions (`num_moe_experts`, `moe_ffn_hidden_size`, `moe_shared_expert_intermediate_size`), making the combined search space much larger. The default 40%/20% bounds cast too wide a net and waste compute on dead-zone architectures.
+
+See [ABLATIONS.md](ABLATIONS.md#pruning) for the full architecture search analysis across various candidates.
+
+</details>
+
+<details>
+<summary>Pruning logs (top 20 candidates, best subnet, layer patterns) (click to expand)</summary>
+
+```text
+╭──────────────────────────────────────────────────── Original Model Stats ─────────────────────────────────────────────────────╮
+│ Total Parameters 31.58B │
+│ Active Parameters 3.58B │
+│ Memory (BF16, seq_length=8192, batch_size=1) weights: 60230.1 MB, kv_cache: 48.0 MB, mamba_state: 23.8 MB, Total: 60301.9 MB │
+╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
+
+ Search Space
+ (≤30% width / ≤15% depth pruning)
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Hyperparameter ┃ Choices ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ num_layers │ [46, 48, 50, 52] │
+│ hidden_size │ [2048, 2304, 2560, 2688] │
+│ mamba_num_heads │ [48, 56, 64] │
+│ mamba_head_dim │ [48, 56, 64] │
+│ num_moe_experts │ [96, 104, 112, 120, 128] │
+│ moe_ffn_hidden_size │ [1536, 1792, 1856] │
+│ moe_shared_expert_intermediate_size │ [2816, 3072, 3328, 3584, 3712] │
+├─────────────────────────────────────┼────────────────────────────────┤
+│ Search space size │ 10800 │
+└─────────────────────────────────────┴────────────────────────────────┘
+
+Top 20 Candidates with Scores
+┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
+┃ # ┃ export_config ┃ active_params ┃ params ┃ score ┃
+┡━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
+│ 1 │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 64, 'num_moe_experts': 120, │ 3.00B │ 27.06B │ 0.3399 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 2 │ {'num_layers': 48, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 56, 'num_moe_experts': 112, │ 3.00B │ 25.37B │ 0.4650 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 3 │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 64, 'mamba_head_dim': 56, 'num_moe_experts': 112, │ 3.00B │ 25.37B │ 0.2343 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 4 │ {'num_layers': 52, 'hidden_size': 2688, 'mamba_num_heads': 56, 'mamba_head_dim': 48, 'num_moe_experts': 96, │ 3.00B │ 20.09B │ 0.2552 │
+│ │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 5 │ {'num_layers': 52, 'hidden_size': 2688, 'mamba_num_heads': 48, 'mamba_head_dim': 56, 'num_moe_experts': 104, │ 3.00B │ 21.61B │ 0.2601 │
+│ │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 6 │ {'num_layers': 52, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 64, 'num_moe_experts': 96, │ 3.00B │ 19.28B │ 0.3762 │
+│ │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 3712} │ │ │ │
+│ 7 │ {'num_layers': 52, 'hidden_size': 2304, 'mamba_num_heads': 64, 'mamba_head_dim': 64, 'num_moe_experts': 104, │ 3.00B │ 22.28B │ 0.4783 │
+│ │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 8 │ {'num_layers': 52, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 96, │ 3.00B │ 21.99B │ 0.2420 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3328} │ │ │ │
+│ 9 │ {'num_layers': 50, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 112, │ 3.00B │ 25.37B │ 0.2399 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3712} │ │ │ │
+│ 10 │ {'num_layers': 50, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 112, │ 3.00B │ 26.17B │ 0.2601 │
+│ │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3328} │ │ │ │
+│ 11 │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 64, 'num_moe_experts': 112, │ 3.00B │ 25.37B │ 0.2503 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 12 │ {'num_layers': 48, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 56, 'num_moe_experts': 104, │ 3.00B │ 23.68B │ 0.4329 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 13 │ {'num_layers': 46, 'hidden_size': 2688, 'mamba_num_heads': 64, 'mamba_head_dim': 64, 'num_moe_experts': 128, │ 3.00B │ 26.17B │ 0.2587 │
+│ │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 2816} │ │ │ │
+│ 14 │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 64, 'mamba_head_dim': 56, 'num_moe_experts': 104, │ 3.00B │ 23.68B │ 0.2336 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 15 │ {'num_layers': 52, 'hidden_size': 2688, 'mamba_num_heads': 48, 'mamba_head_dim': 56, 'num_moe_experts': 96, │ 3.00B │ 20.09B │ 0.2559 │
+│ │ 'moe_ffn_hidden_size': 1536, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 16 │ {'num_layers': 52, 'hidden_size': 2304, 'mamba_num_heads': 64, 'mamba_head_dim': 64, 'num_moe_experts': 96, │ 3.00B │ 20.70B │ 0.4608 │
+│ │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+│ 17 │ {'num_layers': 50, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 104, │ 3.00B │ 23.68B │ 0.2455 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3712} │ │ │ │
+│ 18 │ {'num_layers': 50, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 104, │ 3.00B │ 24.42B │ 0.2503 │
+│ │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3328} │ │ │ │
+│ 19 │ {'num_layers': 48, 'hidden_size': 2560, 'mamba_num_heads': 48, 'mamba_head_dim': 48, 'num_moe_experts': 120, │ 3.00B │ 27.92B │ 0.2587 │
+│ │ 'moe_ffn_hidden_size': 1856, 'moe_shared_expert_intermediate_size': 3712} │ │ │ │
+│ 20 │ {'num_layers': 46, 'hidden_size': 2560, 'mamba_num_heads': 56, 'mamba_head_dim': 64, 'num_moe_experts': 104, │ 3.00B │ 23.68B │ 0.2469 │
+│ │ 'moe_ffn_hidden_size': 1792, 'moe_shared_expert_intermediate_size': 3072} │ │ │ │
+└────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────────┴────────┴────────┘
+
+╭──────────────────────────────────────────────────────────────────────── Best Subnet ─────────────────────────────────────────────────────────────────────────╮
+│ export_config {'num_layers': 52, 'hidden_size': 2304, 'mamba_num_heads': 64, 'mamba_head_dim': 64, 'num_moe_experts': 104, 'moe_ffn_hidden_size': 1856, │
+│ 'moe_shared_expert_intermediate_size': 3072} │
+│ active_params 3.00B │
+│ params 22.28B │
+│ score 0.4783 │
+╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
+
+Original hybrid_layer_pattern: MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME
+Pruned hybrid_layer_pattern: MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME
+
+╭───────────────────────────────────────────────────── Pruned Model Stats ──────────────────────────────────────────────────────╮
+│ Total Parameters 22.28B │
+│ Active Parameters 3.00B │
+│ Memory (BF16, seq_length=8192, batch_size=1) weights: 42489.7 MB, kv_cache: 48.0 MB, mamba_state: 23.8 MB, Total: 42561.6 MB │
+╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
+```
+
+</details>
+
+> [!TIP]
+> Here we skip the Knowledge Distillation (KD) step for candidates for simplicity. If you want to find a better pruned model, you can take a few top candidates' `export_config` from the logs above (where the score is in a similar range as the best subnet), export each as a separate model, and perform KD for ~2B tokens on each of them before selecting the best subnet based on your desired metrics.
+
+> [!NOTE]
+> Copy the `nano_v3_reasoning_parser.py` file from the original HuggingFace checkpoint to the pruned model for evaluation with tool-calling below.
+
+---
+
+### 3. Distillation
+
+Minimum hardware: **4 nodes × 8x H100 (32 GPUs)** — required by `TP=4 × EP=8`. On **96 nodes × 8x H100 (768 GPUs total)**, it takes ~900 H100 GPU-hours per 10B tokens (400 iters), i.e. ~70 min wall-clock per 10B tokens on 96 nodes. Full 100B token run (4k steps) takes ~9k H100 GPU-hours (~12 hours wall-clock).
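+
+The token and GPU-hour figures above follow from the batch configuration used below (a back-of-envelope sketch):
+
+```bash
+# Tokens per iteration and total tokens for the run configured below.
+GBS=3072; SEQ_LENGTH=8192; TRAIN_ITERS=4000
+echo "tokens/iter:  $(( GBS * SEQ_LENGTH ))"                 # ~25.2M
+echo "total tokens: $(( GBS * SEQ_LENGTH * TRAIN_ITERS ))"   # ~100B
+# At ~70 min wall-clock per 400 iters (10B tokens) on 768 GPUs:
+echo "GPU-hours per 10B tokens: $(( 768 * 70 / 60 ))"        # ~900
+```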
+
+<details>
+<summary>Distillation command (click to expand)</summary>
+
+```bash
+python -u /opt/Model-Optimizer/examples/megatron_bridge/distill.py \
+ --teacher_hf_path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
+ --student_hf_path /path/to/Nemotron-3-Nano-30B-A3B-Pruned-A3.0B \
+ --trust_remote_code \
+ --tp_size 4 \
+ --pp_size 1 \
+ --ep_size 8 \
+ --etp_size 1 \
+ --data_paths "${DATA_BLEND}" \
+ --data_path_to_cache /path/to/cache \
+ --seq_length 8192 \
+ --mbs 1 \
+ --gbs 3072 \
+ --train_iters 4000 \
+ --lr 1e-4 \
+ --min_lr 1e-5 \
+ --lr_warmup_iters 25 \
+ --eval_interval 200 \
+ --eval_iters 8 \
+ --log_interval 10 \
+ --output_dir /path/to/distill_output
+
+# Optional: Weights & Biases logging
+# --wandb_project \
+# --wandb_entity \
+# --wandb_exp_name
+```
+
+Non-default arguments:
+
+- `--seq_length 8192` (default: 4096)
+- `--gbs 3072` (default: 768) — matches the original Nemotron-3-Nano-30B training GBS from the paper, kept to preserve the training distribution
+- `--train_iters 4000` — ~100B tokens; can stop earlier and take intermediate checkpoints
+- `--lr_warmup_iters 25` (default: 50)
+- `--eval_interval 200` (default: 100) — less frequent eval to save compute
+- `--eval_iters 8` (default: 32) — since GBS is 4× larger than the default
+
+All other arguments use defaults.
+
+</details>
+
+For multi-node Slurm runs, see the [Megatron-Bridge README](../../../megatron_bridge/README.md#slurm-usage) for details.
+
+Distillation saves checkpoints in Megatron distributed format under `/checkpoints/iter_XXXXXXX`. You can convert any intermediate checkpoint to HuggingFace format using the Megatron-Bridge conversion script (see [Megatron Bridge README](../../../megatron_bridge/README.md) for full details):
+
+```bash
+python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py export \
+ --hf-model /path/to/Nemotron-3-Nano-30B-A3B-Pruned-A3.0B \
+ --megatron-path /checkpoints/iter_ \
+ --hf-path /checkpoints/hf_iter_
+```
+
+> [!NOTE]
+> This is pure SFT-style distillation — no RL or online reward signal is used. Adding an RL-based post-training step after distillation is a natural next step that could further improve some of these benchmarks.
+
+---
+
+### 4. Evaluation
+
+The eval config in [nemo_evaluator.yaml](nemo_evaluator.yaml) is for Slurm-based evaluation — it submits a vLLM serving job and runs evals against it. For local model execution and evaluation, refer to the [NeMo Evaluator documentation](https://docs.nvidia.com/nemo/evaluator/latest/) or this [blog](https://huggingface.co/blog/nvidia/nemotron-3-nano-evaluation-recipe).
+
+Before running, update the following fields in the yaml or overwrite them in the command line with `-o