33 changes: 32 additions & 1 deletion .claude/skills/evaluation/SKILL.md
@@ -40,6 +40,8 @@ Test that `nel` is installed with `nel --version`. If not, instruct the user to

If the user already has a config file (e.g., "run this config", "evaluate with my-config.yaml"), skip to Step 8. Optionally review it for common issues (missing `???` values, quantization flags) before running.
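
A quick pre-flight check can be as simple as the following sketch (the config filename is a placeholder):

```bash
# Surface unfilled required values and quantization flags before running.
grep -n '???' my-config.yaml            # required values still unset
grep -n 'quantization' my-config.yaml   # verify flags if the checkpoint is quantized
```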

**Shortcut: use a pre-built recipe.** If the user asks for a specific benchmark (e.g., "run MMLU-Pro", "evaluate with AIME"), check `recipes/tasks/` (relative to this skill's directory) for a matching recipe. Available: mmlu, mmlu_pro, gpqa, aime2025, livecodebench, ifbench, scicode. If found, skip Steps 2 and 5 (config generation and task confirmation), but still do Step 3 (auto-detect model settings from checkpoint) and Step 4 (fill in required `???` values from user input), then proceed to Step 7.5/8.
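
A minimal lookup sketch (recipe filenames are assumed to mirror the benchmark names above):

```bash
# Check for a pre-built recipe matching the requested benchmark.
ls recipes/tasks/
# expected: mmlu.yaml mmlu_pro.yaml gpqa.yaml aime2025.yaml livecodebench.yaml ifbench.yaml scicode.yaml
```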

**Step 2: Build the base config file**

Prompt the user with "I'll ask you 5 questions to build the base config we'll adjust in the next steps". Guide the user through the 5 questions using AskUserQuestion:
@@ -123,6 +125,29 @@ If no `hf_quant_config.json`, also check `config.json` for a `quantization_config`
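
A hedged probe for both locations (assumes `jq` is available; `<checkpoint_path>` is the same placeholder used below):

```bash
# Prefer hf_quant_config.json; fall back to quantization_config inside config.json.
cat <checkpoint_path>/hf_quant_config.json 2>/dev/null \
  || jq '.quantization_config // empty' <checkpoint_path>/config.json
```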

> **Note:** Some models require additional env vars for deployment (e.g., `VLLM_NVFP4_GEMM_BACKEND=marlin` for Nemotron Super). These are not in `hf_quant_config.json` — they are discovered during model card research below.
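
If such an env var turns up, it can be passed through the deployment section. A sketch using the recipes' override style (the `++` prefix for a new key is an assumption based on the other examples in this skill):

```bash
# Illustrative: inject a deployment env var discovered during model card research.
nel run --config recipes/tasks/gpqa.yaml \
  -o '++deployment.env_vars.VLLM_NVFP4_GEMM_BACKEND=marlin'
```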

**Auto-detect deployment settings from checkpoint:**

Read `config.json` from the checkpoint (or HF model card) and build `deployment.extra_args` dynamically:

```bash
cat <checkpoint_path>/config.json 2>/dev/null
```

| Field in `config.json` | What to set | Example |
| --- | --- | --- |
| `max_position_embeddings` | `--max-model-len <value>` | `131072` → `--max-model-len 131072` |
| `auto_map` exists | `--trust-remote-code` | Only add if model has custom code |

Comment on lines +142 to +146
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In vLLM, what are the security implications of using --trust-remote-code, and is it recommended to enable it by default based only on Hugging Face config.json auto_map?

💡 Result:

Using --trust-remote-code in vLLM allows execution of untrusted Python code from Hugging Face model repositories during model loading and tokenizer initialization. This poses severe security risks, including remote code execution (RCE), as attackers can craft malicious models that execute arbitrary code on the host system when loaded. Multiple CVEs highlight this danger, even when the flag is intended to be False:

- CVE-2025-66448 and GHSA-8fr4-5q9j-m8gm: `auto_map` in config.json triggers `get_class_from_dynamic_module`, fetching and executing remote code bypassing `trust_remote_code=False` (fixed in v0.11.1+ with domain validation).
- CVE-2026-27893 and GHSA-7972-pg2x-xr59: hardcoded `trust_remote_code=True` in model files (e.g., Nemotron-VL, Kimi-K25) overrides the user setting (fixed in v0.18.0+).
- GHSA-2pc9-4j83-qjmr: unconditional `auto_map` resolution without a `trust_remote_code` check (fixed in v0.14.0).

Official vLLM docs confirm the flag defaults to False in both the LLM class and the vllm serve CLI, explicitly for trusting remote code from Hugging Face. It is not recommended to enable --trust-remote-code by default, especially based solely on config.json `auto_map`, as this field has been exploited for RCE. Only enable it explicitly for trusted models requiring custom code, after verifying the repository. Use the latest vLLM version (e.g., v0.20.0 as of 2026-04-27) with patches applied, and prefer models without remote code needs.


Don't auto-enable --trust-remote-code based on auto_map without explicit user confirmation.

Automatically enabling this flag when auto_map exists in config.json creates a remote code execution (RCE) risk. The vLLM security advisories (CVE-2025-66448, CVE-2026-27893, GHSA-8fr4-5q9j-m8gm) document multiple instances where attackers exploited auto_map to execute arbitrary code during model loading. Official vLLM documentation explicitly recommends keeping this flag disabled by default. Only enable after explicit user confirmation and verification that the model is from a trusted source.

Suggested wording adjustment
-| `auto_map` exists | `--trust-remote-code` | Only add if model has custom code |
+| `auto_map` exists | Ask user to explicitly confirm `--trust-remote-code` | Explain this allows execution of model-provided remote code; add only after user confirms and verifies model trustworthiness |
📝 Committable suggestion


Suggested change
-| Field in `config.json` | What to set | Example |
-| --- | --- | --- |
-| `max_position_embeddings` | `--max-model-len <value>` | `131072` → `--max-model-len 131072` |
-| `auto_map` exists | `--trust-remote-code` | Only add if model has custom code |
+| Field in `config.json` | What to set | Example |
+| --- | --- | --- |
+| `max_position_embeddings` | `--max-model-len <value>` | `131072` → `--max-model-len 131072` |
+| `auto_map` exists | Ask user to explicitly confirm `--trust-remote-code` | Explain this allows execution of model-provided remote code; add only after user confirms and verifies model trustworthiness |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/SKILL.md around lines 136-140: the doc currently
implies automatically enabling --trust-remote-code when auto_map exists; change
this to explicitly warn against auto-enabling and instruct readers to require
explicit user confirmation before setting --trust-remote-code. Update the table
row referencing `auto_map` and `--trust-remote-code` to state "Do not enable by
default; require explicit confirmation and verification of model provenance
(trusted source)", and add a short advisory note referencing vLLM security best
practices and the RCE risks (auto_map) so maintainers/users must opt in after
verification.
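
Taken together, the detection logic might look like this minimal sketch (it treats `auto_map` as a prompt for user confirmation, per the review above; `jq` availability and the exact flag assembly are assumptions):

```bash
# Hedged sketch: derive deployment.extra_args from the config.json fields above.
CFG=<checkpoint_path>/config.json   # placeholder path, as in the earlier cat example
EXTRA_ARGS="--max-model-len $(jq -r '.max_position_embeddings' "$CFG")"
if jq -e '.auto_map' "$CFG" >/dev/null; then
  # Do NOT add --trust-remote-code automatically: it executes model-provided code (RCE risk).
  echo "auto_map present: ask the user to confirm --trust-remote-code for this checkpoint"
fi
echo "deployment.extra_args: $EXTRA_ARGS"
```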

Then use WebSearch to check the model card (HuggingFace page) for deployment-specific settings:

| Model card signal | What to set |
| --- | --- |
| Reasoning model (thinking/CoT) | `--reasoning-parser` and `--reasoning-parser-plugin` if a custom parser is provided |
| Tool-calling support | `--enable-auto-tool-choice --tool-call-parser <parser>` |
| Custom vLLM flags documented | Add as specified (e.g., `--mamba_ssm_cache_dtype float32`) |

Combine all detected flags into a single `deployment.extra_args` override. The recipe's default `--max-model-len 32768` is a fallback — always prefer the value from `config.json`.
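
For example, a combined override could look like the following (flag values are illustrative; the actual parser names and context length depend on what the checkpoint and model card specify):

```bash
# Illustrative only: substitute the values detected above.
nel run --config recipes/examples/example_eval.yaml \
  -o 'deployment.extra_args=--max-model-len 131072 --enable-auto-tool-choice --tool-call-parser hermes'
```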

**Quantization-aware benchmark defaults:**

When a quantized checkpoint is detected, read `references/quantization-benchmarks.md` for benchmark sensitivity rankings and recommended sets. Present recommendations to the user and ask which to include.
@@ -218,7 +243,13 @@ ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"

Print the following commands to the user. Propose to execute them in order to confirm the config works as expected before the full run.

**Important**: Export required environment variables based on your config. If any tokens or keys are missing (e.g. `HF_TOKEN`, `NGC_API_KEY`, `api_key_name` from the config), ask the user to put them in a `.env` file in the project root so you can run `set -a && source .env && set +a` (or equivalent) before executing `nel run` commands.
**Important**: Export required environment variables based on your config. If any tokens or keys are missing, point the user to `recipes/env.example` — it lists all possible keys with notes on which tasks need them. Ask the user to copy it, fill in their keys, and source it:

```bash
cp recipes/env.example .env
# Edit .env with your keys
set -a && source .env && set +a
```

```bash
# If using pre_cmd or post_cmd (review pre_cmd content before enabling — it runs arbitrary commands):
# … (diff truncated)
```
29 changes: 29 additions & 0 deletions .claude/skills/evaluation/recipes/env.example
@@ -0,0 +1,29 @@
# Evaluation API Keys
#
# Copy this file and fill in the keys you need:
# cp recipes/env.example .env
# # Edit .env with your keys
# set -a && source .env && set +a
#
# Not all keys are required — only fill in what your tasks need.

# Required for all tasks (model/dataset downloads)
HF_TOKEN=hf_...

# Required for nemo_skills.* tasks (dummy value, not a real key)
DUMMY_API_KEY=dummy

# Required for NEL pre_cmd execution
NEMO_EVALUATOR_TRUST_PRE_CMD=1

# --- Optional: task-specific keys ---

# AIME 2025 (simple_evals variant only, not ns_aime2025)
# JUDGE_API_KEY=

# tau2_bench_telecom (LLM judge)
# JUDGE_API_KEY_NVDEV_QWEN235B=

# terminal-bench-hard (AWS sandbox)
# AWS_ACCESS_KEY_ID=
# AWS_SECRET_ACCESS_KEY=
108 changes: 108 additions & 0 deletions .claude/skills/evaluation/recipes/examples/example_eval.yaml
@@ -0,0 +1,108 @@
# Example: Quantization Validation Suite
Collaborator

NEL config YAML may change quite frequently. Is this YAML for demo purposes or for day-0 model evals (internal usage)?

Also, some of the evals require a pinned eval Docker image and specific settings for an apples-to-apples comparison.

Contributor Author

These recipes are for demo purposes: task snippets that the agent composes into a working config. If NEL configs change and something breaks, the agent will diagnose and fix the incompatibility at runtime.

Contributor Author

For evals that require pinned Docker images and specific settings for apples-to-apples comparison, users can override the container.

#
# A balanced set of benchmarks for validating quantized model quality.
# Copy this file and customize for your needs.
#
# Includes:
# - MMLU-Pro (knowledge, completions)
# - GPQA Diamond (reasoning, chat, 5 repeats)
# - LiveCodeBench v6 (code, chat, 3 repeats)
# - IFBench (instruction following, chat, 8 repeats)
#
# Usage:
# nel run --config recipes/examples/example_eval.yaml \
# -o deployment.checkpoint_path=/path/to/quantized/checkpoint \
# -o deployment.served_model_name=my-model-nvfp4 \
# -o execution.hostname=<slurm_host> \
# -o execution.account=<slurm_account> \
# -o execution.output_dir=/path/to/output
#
# For quantized checkpoints, also add the quantization flag:
# -o 'deployment.extra_args=--max-model-len 32768 --trust-remote-code --quantization modelopt_fp4'
#
# Run a single task:
# nel run --config ... -t ns_gpqa
#
# Smoke test (2 samples):
# nel run --config ... -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=2
defaults:
- execution: slurm/default
- deployment: vllm
- _self_
execution:
hostname: ???
username: ${oc.env:USER}
account: ???
output_dir: ???
walltime: "04:00:00"
mounts:
mount_home: false
deployment:
env_vars:
HF_TOKEN: host:HF_TOKEN
checkpoint_path: ???
hf_model_handle:
served_model_name: ???
tensor_parallel_size: 1
data_parallel_size: 1
# For models with custom code, add: --trust-remote-code
extra_args: --max-model-len 32768
Contributor

Can we let the agent decide the required extra_args based on the model card/config of the checkpoint? e.g., max-model-len, tool-call-parser, reasoning-parser, ...

Contributor Author

I've added a section for auto-detecting deployment settings from the checkpoint.

evaluation:
env_vars:
HF_TOKEN: host:HF_TOKEN
nemo_evaluator_config:
config:
params:
request_timeout: 3600
max_retries: 10
parallelism: 16
target:
api_endpoint:
api_key_name: DUMMY_API_KEY
tasks:
# Knowledge (completions endpoint, short)
- name: adlr_mmlu_pro_5_shot_base

# Reasoning (chat endpoint, 5 repeats, short)
- name: ns_gpqa
nemo_evaluator_config:
config:
params:
extra:
args: ++prompt_config=eval/aai/mcq-4choices
num_repeats: 5
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens

# Code (chat endpoint, 3 repeats, medium)
- name: ns_livecodebench
nemo_evaluator_config:
config:
params:
extra:
dataset_split: test_v6_2408_2505
num_repeats: 3
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens

# Instruction following (chat endpoint, 8 repeats, super short)
- name: ns_ifbench
nemo_evaluator_config:
config:
params:
extra:
num_repeats: 8
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens
65 changes: 65 additions & 0 deletions .claude/skills/evaluation/recipes/tasks/aime2025.yaml
@@ -0,0 +1,65 @@
# AIME 2025 (NeMo Skills, chat)
#
# Math competition benchmark. Uses the chat endpoint.
# Primary metric: pass@1[avg-of-16] symbolic_correct
# Run time: Long (reasoning models generate lengthy thinking traces)
# Repeats: 16
#
# Note: The AA variant (simple_evals.AIME_2025) requires JUDGE_API_KEY
# for LLM-based scoring. This NeMo Skills variant uses symbolic scoring
# and does not require external API keys.
#
# Usage:
# nel run --config recipes/tasks/aime2025.yaml \
# -o deployment.checkpoint_path=/path/to/checkpoint \
# -o execution.hostname=<slurm_host> \
# -o execution.account=<slurm_account> \
# -o execution.output_dir=/path/to/output \
# -o deployment.served_model_name=<model_name>
defaults:
- execution: slurm/default
- deployment: vllm
- _self_
execution:
hostname: ???
username: ${oc.env:USER}
account: ???
output_dir: ???
walltime: "04:00:00"
mounts:
mount_home: false
deployment:
env_vars:
HF_TOKEN: host:HF_TOKEN
checkpoint_path: ???
hf_model_handle:
served_model_name: ???
tensor_parallel_size: 1
data_parallel_size: 1
# For models with custom code, add: --trust-remote-code
extra_args: --max-model-len 32768
evaluation:
env_vars:
HF_TOKEN: host:HF_TOKEN
nemo_evaluator_config:
config:
params:
request_timeout: 100000
max_retries: 10
parallelism: 16
target:
api_endpoint:
api_key_name: DUMMY_API_KEY
tasks:
Contributor

What's the reason for keeping one YAML file per task? Can we put them together and let the AI agent compose the target set of benchmarks? Also, what about other benchmarks like tau2? Can the AI agent compose a working config without an example?

Contributor Author

I agree we need to let the agent compose the target set of benchmarks. It's more flexible to run a single task directly or to compose tasks into a suite by copying recipes/examples/example_eval.yaml. Keeping one monolithic working config may not be flexible, since some of its settings aren't needed by all users.

Contributor

What do you think about keeping only the tasks part in these tasks/<benchmark>.yaml files, since the other config should be the same across benchmarks? It would reduce token usage and keep the rest of the setup consistent across benchmarks.

Contributor Author

This is a good suggestion. I've stripped each task file down to just the task config and created one shared base config.

- name: ns_aime2025
nemo_evaluator_config:
config:
params:
extra:
num_repeats: 16
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens
62 changes: 62 additions & 0 deletions .claude/skills/evaluation/recipes/tasks/gpqa.yaml
@@ -0,0 +1,62 @@
# GPQA Diamond (NeMo Skills, chat)
#
# Graduate-level reasoning benchmark. Uses the chat endpoint.
# Primary metric: pass@1[avg-of-5] symbolic_correct
# Run time: Short
# Repeats: 5
#
# Usage:
# nel run --config recipes/tasks/gpqa.yaml \
# -o deployment.checkpoint_path=/path/to/checkpoint \
# -o execution.hostname=<slurm_host> \
# -o execution.account=<slurm_account> \
# -o execution.output_dir=/path/to/output \
# -o deployment.served_model_name=<model_name>
defaults:
- execution: slurm/default
- deployment: vllm
- _self_
execution:
hostname: ???
username: ${oc.env:USER}
account: ???
output_dir: ???
walltime: "02:00:00"
mounts:
mount_home: false
deployment:
env_vars:
HF_TOKEN: host:HF_TOKEN
checkpoint_path: ???
hf_model_handle:
served_model_name: ???
tensor_parallel_size: 1
data_parallel_size: 1
# For models with custom code, add: --trust-remote-code
extra_args: --max-model-len 32768
evaluation:
env_vars:
HF_TOKEN: host:HF_TOKEN
nemo_evaluator_config:
config:
params:
request_timeout: 3600
max_retries: 5
parallelism: 16
target:
api_endpoint:
api_key_name: DUMMY_API_KEY
tasks:
- name: ns_gpqa
nemo_evaluator_config:
config:
params:
extra:
args: ++prompt_config=eval/aai/mcq-4choices
num_repeats: 5
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens
61 changes: 61 additions & 0 deletions .claude/skills/evaluation/recipes/tasks/ifbench.yaml
@@ -0,0 +1,61 @@
# IFBench (NeMo Skills, chat)
#
# Instruction following benchmark. Uses the chat endpoint.
# Primary metric: pass@1[avg-of-8] prompt_strict_accuracy
# Run time: Super Short
# Repeats: 8
#
# Usage:
# nel run --config recipes/tasks/ifbench.yaml \
# -o deployment.checkpoint_path=/path/to/checkpoint \
# -o execution.hostname=<slurm_host> \
# -o execution.account=<slurm_account> \
# -o execution.output_dir=/path/to/output \
# -o deployment.served_model_name=<model_name>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Usage example misses required environment setup.

The command snippet omits HF_TOKEN/DUMMY_API_KEY setup even though this recipe depends on them (Lines 29, 39, 48). A direct copy-paste run can fail on auth.

Suggested usage block update
 # Usage:
+#   cp recipes/env.example .env
+#   # Edit .env with required keys (HF_TOKEN, DUMMY_API_KEY, ...)
+#   set -a && source .env && set +a
 #   nel run --config recipes/tasks/ifbench.yaml \
 #     -o deployment.checkpoint_path=/path/to/checkpoint \
 #     -o execution.hostname=<slurm_host> \
📝 Committable suggestion


Suggested change
-# Usage:
-#   nel run --config recipes/tasks/ifbench.yaml \
-#     -o deployment.checkpoint_path=/path/to/checkpoint \
-#     -o execution.hostname=<slurm_host> \
-#     -o execution.account=<slurm_account> \
-#     -o execution.output_dir=/path/to/output \
-#     -o deployment.served_model_name=<model_name>
+# Usage:
+#   cp recipes/env.example .env
+#   # Edit .env with required keys (HF_TOKEN, DUMMY_API_KEY, ...)
+#   set -a && source .env && set +a
+#   nel run --config recipes/tasks/ifbench.yaml \
+#     -o deployment.checkpoint_path=/path/to/checkpoint \
+#     -o execution.hostname=<slurm_host> \
+#     -o execution.account=<slurm_account> \
+#     -o execution.output_dir=/path/to/output \
+#     -o deployment.served_model_name=<model_name>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/recipes/tasks/ifbench.yaml around lines 8-14:
the usage snippet omits required auth environment variables (HF_TOKEN, read at
lines ~29 and ~39, and DUMMY_API_KEY, read at ~48); update the Usage example
to show setting these before running (either export HF_TOKEN=... and export
DUMMY_API_KEY=..., or prefix the nel run command with HF_TOKEN=...
DUMMY_API_KEY=...), so users have the required credentials available when
invoking the recipe.

defaults:
- execution: slurm/default
- deployment: vllm
- _self_
execution:
hostname: ???
username: ${oc.env:USER}
account: ???
output_dir: ???
walltime: "02:00:00"
mounts:
mount_home: false
deployment:
env_vars:
HF_TOKEN: host:HF_TOKEN
checkpoint_path: ???
hf_model_handle:
served_model_name: ???
tensor_parallel_size: 1
data_parallel_size: 1
# For models with custom code, add: --trust-remote-code
extra_args: --max-model-len 32768
evaluation:
env_vars:
HF_TOKEN: host:HF_TOKEN
nemo_evaluator_config:
config:
params:
request_timeout: 3600
max_retries: 5
parallelism: 16
target:
api_endpoint:
api_key_name: DUMMY_API_KEY
tasks:
- name: ns_ifbench
nemo_evaluator_config:
config:
params:
extra:
num_repeats: 8
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens