[Recipes][LLM PTQ] Add nvfp4 MSE+FP8-cast-KV recipes (experts_only / mlp_only) + --recipe in example scripts #1391
Conversation
…recipe support in scripts

- Add modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml, combining experts-only NVFP4 W4A4 with the MSE FP8 scale-sweep weight calibration (algorithm: mse, fp8_scale_sweep: true; expert weight blocks switched to "static" so the static FP8 sweep applies) and FP8 KV cache with use_constant_amax: true. A sketch of the resulting recipe shape follows below.
- examples/llm_ptq/scripts: thread a new --recipe flag through parser.sh and huggingface_example.sh. Either --quant or --recipe is required; passing both errors out. When --recipe is used, the script derives MODEL_NAME from the recipe basename, passes --recipe=… to hf_ptq.py, and exits after export with a TRT-LLM deployment hint (recipes can produce arbitrary configs).
- Drop the qformat case-statement whitelist in huggingface_example.sh; let hf_ptq.py be the single source of truth for valid qformats / recipes.

(Pre-commit hook check-modelopt-recipes was skipped: the host conda env has a broken torchvision install that prevents the validator from importing modelopt. The recipe was verified independently via tools/precommit/check_modelopt_recipes.py in a working environment.)

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
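For concreteness, here is a rough sketch of the shape such a recipe could take, assembled from the commit message above. The schema and key names (`algorithm`, `quant_cfg`, the glob patterns, `block_sizes`) are assumptions modeled on ModelOpt-style quantization configs and the sibling `nvfp4_experts_only-fp8_kv.yaml`, not the file's verbatim contents:

```yaml
# Hypothetical sketch of nvfp4_experts_only_mse-fp8_cast_kv.yaml; key names
# are assumed, values follow the commit message above.
algorithm:
  method: mse            # MSE weight calibration
  fp8_scale_sweep: true  # static FP8 scale sweep (from the base PR's Triton kernel)
  layerwise: false
quant_cfg:
  "*mlp.experts*weight_quantizer":      # experts-only NVFP4 weights
    num_bits: [2, 1]                    # FP4 E2M1
    block_sizes: {-1: 16, type: static, scale_bits: [4, 3]}  # static, so the sweep applies
  "*mlp.experts*input_quantizer":       # W4A4: activations also NVFP4
    num_bits: [2, 1]
    block_sizes: {-1: 16, type: dynamic, scale_bits: [4, 3]} # inputs stay dynamic
  # (a matching pair of "*block_sparse_moe*" blocks would cover MoE naming)
  "*bmm_quantizer":
    num_bits: [4, 3]                    # FP8 E4M3 KV cache
    use_constant_amax: true             # cast-KV: skip calibration, amax = 448.0
  default: {enable: false}              # everything else stays unquantized
```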
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```diff
@@                 Coverage Diff                 @@
##  chenjiel/nvfp4-fp8-sweep-triton    #1391      +/-   ##
===================================================================
- Coverage   76.86%   76.86%   -0.01%
===================================================================
  Files         472      472
  Lines       50660    50660
===================================================================
- Hits        38942    38939       -3
- Misses      11718    11721       +3
```
Same shape as nvfp4_experts_only_mse-fp8_cast_kv but with the broader
*mlp* / *block_sparse_moe* patterns from nvfp4_mlp_only-fp8_kv.yaml so it
covers both dense MLP and MoE expert weights:
- algorithm: { method: mse, fp8_scale_sweep: true, layerwise: false }
- All MLP weight quantizers switched from "dynamic" to "static" so the
static FP8 scale sweep applies (otherwise mse_calibrate skips them).
- Input quantizers stay dynamic.
- KV bmm gets use_constant_amax: true (the _cast_kv flavor: skips KV
  calibration, hardcodes amax to the FP8 E4M3 max, 448.0). The widened
  globs are sketched below.
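Per the note above, the intended delta relative to the experts-only recipe is just the widened globs; roughly (pattern strings from the text above, surrounding schema assumed as in the earlier sketch):

```yaml
# Illustrative: same recipe shape as the experts-only sketch, but the weight
# quantizer blocks match any MLP weight, not just experts.
quant_cfg:
  "*mlp*weight_quantizer":               # dense MLP layers and MoE experts
    block_sizes: {-1: 16, type: static, scale_bits: [4, 3]}
  "*block_sparse_moe*weight_quantizer":  # Mixtral-style MoE module naming
    block_sizes: {-1: 16, type: static, scale_bits: [4, 3]}
  # input quantizers stay dynamic; the KV bmm entry is unchanged
  # (use_constant_amax: true).
```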
Pre-commit hook check-modelopt-recipes was skipped because the host conda
env has a broken torchvision install that prevents the validator from
importing modelopt; the recipe is the same shape as the experts-only one
which already validates cleanly in a working env.
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Summary
- Two new recipes pair NVFP4 weight quantization with the MSE FP8 scale-sweep calibration and an FP8 KV cache using `use_constant_amax: true` (skips KV calibration; matches the `nvfp4_default-fp8_cast_kv` contract):
  - `modelopt_recipes/general/ptq/nvfp4_experts_only_mse-fp8_cast_kv.yaml`: applies to `*mlp.experts*` / `*block_sparse_moe*` only.
  - `modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-fp8_cast_kv.yaml`: applies to all `*mlp*` / `*block_sparse_moe*` (dense MLP + MoE).
- Threads a new `--recipe` flag through `examples/llm_ptq/scripts/parser.sh` and `huggingface_example.sh`. Either `--quant` or `--recipe` is required; passing both errors out. Recipe names are not validated in the script; `hf_ptq.py` is the source of truth.
- Drops the `qformat` whitelist case-statement in `huggingface_example.sh` for the same reason.

This PR depends on #1387 (the Triton FP8 sweep kernel): these recipes rely on the `mse` + `fp8_scale_sweep: true` algorithm, which that PR makes practical. Targeting `chenjiel/nvfp4-fp8-sweep-triton` as the base keeps the diff scoped to the recipes + script wiring.

Files
New recipes (`modelopt_recipes/general/ptq/`):

- `nvfp4_experts_only_mse-fp8_cast_kv.yaml`: same patterns as `nvfp4_experts_only-fp8_kv.yaml`.
- `nvfp4_mlp_only_mse-fp8_cast_kv.yaml`: same patterns as `nvfp4_mlp_only-fp8_kv.yaml`.

Both differ from their `_kv` siblings by:

- `algorithm: max` → `{ method: mse, fp8_scale_sweep: true, layerwise: false }`.
- Weight quantizer `type: dynamic` → `type: static` (otherwise `mse_calibrate` skips them: only static block-quant weight quantizers are recognized for the FP8 sweep; see `model_calib.py:369-374`).
- KV bmm gains `use_constant_amax: true` (the `_cast_kv` flavor).
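Rendered as a rough before/after, one YAML document per delta (key names assumed as in the sketches above; this is not a literal diff of the files):

```yaml
# Delta 1: calibration algorithm
# before (the -fp8_kv siblings):
algorithm: max
---
# after:
algorithm: {method: mse, fp8_scale_sweep: true, layerwise: false}
---
# Delta 2: weight blocks flip dynamic -> static so mse_calibrate's FP8 sweep
# recognizes them (cf. model_calib.py:369-374):
quant_cfg:
  "*mlp*weight_quantizer":
    block_sizes: {-1: 16, type: static, scale_bits: [4, 3]}  # was type: dynamic
---
# Delta 3: the KV bmm quantizer skips calibration entirely:
quant_cfg:
  "*bmm_quantizer":
    use_constant_amax: true  # amax hardcoded to the FP8 E4M3 max, 448.0
```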
Scripts (`examples/llm_ptq/scripts/`):

- `parser.sh`: adds the `--recipe` long option (default `RECIPE=""`) and validates one-of-{`--quant`, `--recipe`}, and not both.
- `huggingface_example.sh`: when `RECIPE` is set, derives `MODEL_NAME` from the recipe basename, passes `--recipe=…` to `hf_ptq.py` instead of `--qformat=…`, and exits after export with a TRT-LLM deployment hint (recipes can produce arbitrary configs that the script's downstream `run_tensorrt_llm.py` path doesn't know how to handle generically). Drops the `qformat` whitelist; defers to `hf_ptq.py`.

Behavior
Test plan
- `experts_only_mse-fp8_cast_kv` loads via `modelopt.recipe.load_recipe(...)` and produces the expected algorithm + per-pattern `quant_cfg` (verified in a working env: `algorithm == {'method': 'mse', 'fp8_scale_sweep': True, 'layerwise': False}`; expert weight quantizers `type: static`; KV bmm has `use_constant_amax: True`).
- Flag validation in the scripts (neither flag, both flags, only `--quant`, only `--recipe`) all behaves as designed.
- `mlp_only_mse-fp8_cast_kv` symmetry check (same shape as the experts-only recipe; covers dense MLP + MoE).
- Run `huggingface_example.sh --recipe=general/ptq/nvfp4_experts_only_mse-fp8_cast_kv` end to end to confirm the recipe path produces a deployable checkpoint.

Note
Pre-commit hook `check-modelopt-recipes` was skipped on both commits because the local conda env has a broken `torchvision` install (`AttributeError: partially initialized module 'torchvision' has no attribute 'extension'`) that prevents `from modelopt.recipe.loader import load_recipe`. The `experts_only` recipe was validated independently by running `tools/precommit/check_modelopt_recipes.py` in a working environment (exits 0); the `mlp_only` one is the same shape with a different glob.
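(For context, a local pre-commit entry for the skipped hook would conventionally look something like the following. This is a hypothetical sketch of a standard `.pre-commit-config.yaml` local hook; the `entry`, `files` pattern, and other fields are assumed rather than read from the repo.)

```yaml
# Hypothetical .pre-commit-config.yaml entry for the skipped hook; only the
# hook id and script path appear in the note above, the rest is assumed.
- repo: local
  hooks:
    - id: check-modelopt-recipes
      name: check-modelopt-recipes
      entry: python tools/precommit/check_modelopt_recipes.py
      language: system   # runs in the developer's env, hence the torchvision failure
      files: ^modelopt_recipes/.*\.ya?ml$
```

🤖 Generated with Claude Code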