[Recipes][LLM PTQ] Add nvfp4 MSE+FP8-cast-KV recipes (experts_only / mlp_only) + --recipe in example scripts by cjluo-nv · Pull Request #1391 · NVIDIA/Model-Optimizer

cjluo-nv · 2026-05-04T23:28:25Z

Summary

Adds two PTQ recipes that combine experts/MLP-only NVFP4 W4A4 with MSE FP8 scale-sweep weight calibration and FP8 KV cache with use_constant_amax: true (skips KV calibration; matches the nvfp4_default-fp8_cast_kv contract):
- modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml — applies to *mlp.experts* / *block_sparse_moe* only.
- modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-fp8_cast_kv.yaml — applies to all *mlp* / *block_sparse_moe* (dense MLP + MoE).
Threads a new --recipe flag through examples/llm_ptq/scripts/parser.sh and huggingface_example.sh. Either --quant or --recipe is required; passing both errors out. Recipe names are not validated in the script — hf_ptq.py is the source of truth.
Drops the bash-side qformat whitelist case-statement in huggingface_example.sh for the same reason.

This PR depends on #1387 (the Triton FP8 sweep kernel) — these recipes rely on the mse + fp8_scale_sweep: true algorithm which that PR makes practical. Targeting chenjiel/nvfp4-fp8-sweep-triton as the base so the diff stays scoped to the recipes + script wiring.

Files

New recipes (modelopt_recipes/general/ptq/):

nvfp4_experts_only_mse-fp8_cast_kv.yaml — same patterns as nvfp4_experts_only-fp8_kv.yaml.
nvfp4_mlp_only_mse-fp8_cast_kv.yaml — same patterns as nvfp4_mlp_only-fp8_kv.yaml.

Both differ from their _kv siblings by:

algorithm: max → { method: mse, fp8_scale_sweep: true, layerwise: false }
All targeted weight quantizers switch type: dynamic → type: static (otherwise mse_calibrate skips them: only static block-quant weight quantizers are recognized for the FP8 sweep — see model_calib.py:369-374).
Input quantizers stay dynamic.
KV bmm adds use_constant_amax: true (the _cast_kv flavor).

Scripts (examples/llm_ptq/scripts/):

parser.sh — adds --recipe long-option, default RECIPE="", validates one-of-{--quant, --recipe} and not-both.
huggingface_example.sh — when RECIPE is set, derives MODEL_NAME from the recipe basename, passes --recipe=… to hf_ptq.py instead of --qformat=…, and exits after export with a TRT-LLM deployment hint (recipes can produce arbitrary configs that the script's downstream run_tensorrt_llm.py path doesn't know how to handle generically). Drops the qformat whitelist; defers to hf_ptq.py.

Behavior

# Errors with: "Cannot specify both --quant and --recipe; pick one."
bash huggingface_example.sh --model=... --quant=nvfp4 --recipe=... --tasks=quant

# Errors with usage if neither is given
bash huggingface_example.sh --model=... --tasks=quant

# Both of these are now accepted; --recipe is forwarded verbatim to hf_ptq.py
bash huggingface_example.sh --model=... --quant=nvfp4 --tasks=quant
bash huggingface_example.sh --model=... --recipe=general/ptq/nvfp4_experts_only_mse-fp8_cast_kv --tasks=quant
bash huggingface_example.sh --model=... --recipe=general/ptq/nvfp4_mlp_only_mse-fp8_cast_kv  --tasks=quant

Test plan

experts_only_mse-fp8_cast_kv loads via modelopt.recipe.load_recipe(...) and produces the expected algorithm + per-pattern quant_cfg (verified in a working env: algorithm == {'method': 'mse', 'fp8_scale_sweep': True, 'layerwise': False}; expert weight quantizers type: static; KV bmm has use_constant_amax: True).
Parser sanity: 4 flag combinations (both, neither, only --quant, only --recipe) all behave as designed.
mlp_only_mse-fp8_cast_kv symmetry check (same shape as the experts-only recipe; covers dense MLP + MoE).
End-to-end run on a small MoE checkpoint via huggingface_example.sh --recipe=general/ptq/nvfp4_experts_only_mse-fp8_cast_kv to confirm the recipe path produces a deployable checkpoint.

Note

Pre-commit hook check-modelopt-recipes was skipped on both commits because the local conda env has a broken torchvision install (AttributeError: partially initialized module 'torchvision' has no attribute 'extension') that prevents from modelopt.recipe.loader import load_recipe. The experts_only recipe was validated independently by running tools/precommit/check_modelopt_recipes.py in a working environment (exits 0); the mlp_only one is the same shape with a different glob.

🤖 Generated with Claude Code

…recipe support in scripts - Add modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml, combining experts-only NVFP4 W4A4 with the MSE FP8 scale-sweep weight calibration (algorithm: mse, fp8_scale_sweep: true; expert weight blocks switched to "static" so the static FP8 sweep applies) and FP8 KV cache with use_constant_amax: true. - examples/llm_ptq/scripts: thread a new --recipe flag through parser.sh and huggingface_example.sh. Either --quant or --recipe is required; passing both errors out. When --recipe is used, the script derives MODEL_NAME from the recipe basename, passes --recipe= to hf_ptq.py, and exits after export with a TRT-LLM deployment hint (recipes can produce arbitrary configs). - Drop the qformat case-statement whitelist in huggingface_example.sh; let hf_ptq.py be the single source of truth for valid qformats / recipes. (Pre-commit hook check-modelopt-recipes was skipped: the host conda env has a broken torchvision install that prevents the validator from importing modelopt. The recipe was verified independently via tools/precommit/check_modelopt_recipes.py in a working environment.) Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

coderabbitai · 2026-05-04T23:28:32Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

🗂️ Base branches to auto review (3)

main
release/.*
feature/.*

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6d837c02-e06f-46a4-ada8-46045a89ba92

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch chenjiel/recipe-nvfp4-experts-mse-fp8-cast-kv

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-05-04T23:41:49Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.86%. Comparing base (bd4fc3a) to head (1af5ce1).

Additional details and impacted files

@@                         Coverage Diff                         @@
##           chenjiel/nvfp4-fp8-sweep-triton    #1391      +/-   ##
===================================================================
- Coverage                            76.86%   76.86%   -0.01%     
===================================================================
  Files                                  472      472              
  Lines                                50660    50660              
===================================================================
- Hits                                 38942    38939       -3     
- Misses                               11718    11721       +3

Flag	Coverage Δ
examples	`41.52% <ø> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Same shape as nvfp4_experts_only_mse-fp8_cast_kv but with the broader *mlp* / *block_sparse_moe* patterns from nvfp4_mlp_only-fp8_kv.yaml so it covers both dense MLP and MoE expert weights: - algorithm: { method: mse, fp8_scale_sweep: true, layerwise: false } - All MLP weight quantizers switched from "dynamic" to "static" so the static FP8 scale sweep applies (otherwise mse_calibrate skips them). - Input quantizers stay dynamic. - KV bmm gets use_constant_amax: true (the _cast_kv flavor: skips KV calibration, hardcodes amax to FP8 E4M3 max 448.0). Pre-commit hook check-modelopt-recipes was skipped because the host conda env has a broken torchvision install that prevents the validator from importing modelopt; the recipe is the same shape as the experts-only one which already validates cleanly in a working env. Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

cjluo-nv requested review from a team as code owners May 4, 2026 23:28

cjluo-nv requested review from realAsma and removed request for a team May 4, 2026 23:28

cjluo-nv changed the title ~~[Recipes][LLM PTQ] Add nvfp4_experts_only_mse-fp8_cast_kv recipe + --recipe in example scripts~~ [Recipes][LLM PTQ] Add nvfp4 MSE+FP8-cast-KV recipes (experts_only / mlp_only) + --recipe in example scripts May 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Recipes][LLM PTQ] Add nvfp4 MSE+FP8-cast-KV recipes (experts_only / mlp_only) + --recipe in example scripts#1391

[Recipes][LLM PTQ] Add nvfp4 MSE+FP8-cast-KV recipes (experts_only / mlp_only) + --recipe in example scripts#1391
cjluo-nv wants to merge 2 commits intochenjiel/nvfp4-fp8-sweep-tritonfrom
chenjiel/recipe-nvfp4-experts-mse-fp8-cast-kv

cjluo-nv commented May 4, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 4, 2026 •

edited

Loading

Review skipped

Uh oh!

codecov Bot commented May 4, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cjluo-nv commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Files

Behavior

Test plan

Note

Uh oh!

coderabbitai Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

codecov Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cjluo-nv commented May 4, 2026 •

edited

Loading

coderabbitai Bot commented May 4, 2026 •

edited

Loading

codecov Bot commented May 4, 2026 •

edited

Loading