33 changes: 32 additions & 1 deletion .claude/skills/evaluation/SKILL.md
@@ -40,6 +40,8 @@ Test that `nel` is installed with `nel --version`. If not, instruct the user to

If the user already has a config file (e.g., "run this config", "evaluate with my-config.yaml"), skip to Step 8. Optionally review it for common issues (missing `???` values, quantization flags) before running.
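
A quick pre-flight check can be as simple as the following sketch (the config filename is a placeholder):

```bash
# Surface unfilled required values and quantization flags before running.
grep -n '???' my-config.yaml            # required values still unset
grep -n 'quantization' my-config.yaml   # verify flags if the checkpoint is quantized
```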

**Shortcut: use a pre-built recipe.** If the user asks for a specific benchmark (e.g., "run MMLU-Pro", "evaluate with AIME"), check `recipes/tasks/` (relative to this skill's directory) for a matching recipe. Available: mmlu, mmlu_pro, gpqa, aime2025, livecodebench, ifbench, scicode. If found, skip Steps 2 and 5 (config generation and task confirmation), but still do Step 3 (auto-detect model settings from checkpoint) and Step 4 (fill in required `???` values from user input), then proceed to Step 7.5/8.
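
A minimal lookup sketch (recipe filenames are assumed to mirror the benchmark names above):

```bash
# Check for a pre-built recipe matching the requested benchmark.
ls recipes/tasks/
# expected: mmlu.yaml mmlu_pro.yaml gpqa.yaml aime2025.yaml livecodebench.yaml ifbench.yaml scicode.yaml
```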

**Step 2: Build the base config file**

Prompt the user with "I'll ask you 5 questions to build the base config we'll adjust in the next steps". Guide the user through the 5 questions using AskUserQuestion:
@@ -123,6 +125,29 @@ If no `hf_quant_config.json`, also check `config.json` for a `quantization_config`
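
A hedged probe for both locations (assumes `jq` is available; `<checkpoint_path>` is the same placeholder used below):

```bash
# Prefer hf_quant_config.json; fall back to quantization_config inside config.json.
cat <checkpoint_path>/hf_quant_config.json 2>/dev/null \
  || jq '.quantization_config // empty' <checkpoint_path>/config.json
```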

> **Note:** Some models require additional env vars for deployment (e.g., `VLLM_NVFP4_GEMM_BACKEND=marlin` for Nemotron Super). These are not in `hf_quant_config.json` — they are discovered during model card research below.
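
If such an env var turns up, it can be passed through the deployment section. A sketch using the recipes' override style (the `++` prefix for a new key is an assumption based on the other examples in this skill):

```bash
# Illustrative: inject a deployment env var discovered during model card research.
nel run --config recipes/tasks/gpqa.yaml \
  -o '++deployment.env_vars.VLLM_NVFP4_GEMM_BACKEND=marlin'
```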

**Auto-detect deployment settings from checkpoint:**

Read `config.json` from the checkpoint (or HF model card) and build `deployment.extra_args` dynamically:

```bash
cat <checkpoint_path>/config.json 2>/dev/null
```

| Field in `config.json` | What to set | Example |
| --- | --- | --- |
| `max_position_embeddings` | `--max-model-len <value>` | `131072` → `--max-model-len 131072` |
| `auto_map` exists | `--trust-remote-code` | Only add if model has custom code |

Comment on lines +142 to +146
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In vLLM, what are the security implications of using --trust-remote-code, and is it recommended to enable it by default based only on Hugging Face config.json auto_map?

💡 Result:

Using --trust-remote-code in vLLM allows execution of untrusted Python code from Hugging Face model repositories during model loading and tokenizer initialization. This poses severe security risks, including remote code execution (RCE), as attackers can craft malicious models that execute arbitrary code on the host system when loaded. Multiple CVEs highlight this danger, even when the flag is intended to be False:

- CVE-2025-66448 and GHSA-8fr4-5q9j-m8gm: `auto_map` in config.json triggers `get_class_from_dynamic_module`, fetching and executing remote code bypassing `trust_remote_code=False` (fixed in v0.11.1+ with domain validation).
- CVE-2026-27893 and GHSA-7972-pg2x-xr59: hardcoded `trust_remote_code=True` in model files (e.g., Nemotron-VL, Kimi-K25) overrides the user setting (fixed in v0.18.0+).
- GHSA-2pc9-4j83-qjmr: unconditional `auto_map` resolution without a `trust_remote_code` check (fixed in v0.14.0).

Official vLLM docs confirm the flag defaults to False in both the LLM class and the vllm serve CLI, explicitly for trusting remote code from Hugging Face. It is not recommended to enable --trust-remote-code by default, especially based solely on config.json `auto_map`, as this field has been exploited for RCE. Only enable it explicitly for trusted models requiring custom code, after verifying the repository. Use the latest vLLM version (e.g., v0.20.0 as of 2026-04-27) with patches applied, and prefer models without remote code needs.


Don't auto-enable --trust-remote-code based on auto_map without explicit user confirmation.

Automatically enabling this flag when auto_map exists in config.json creates a remote code execution (RCE) risk. The vLLM security advisories (CVE-2025-66448, CVE-2026-27893, GHSA-8fr4-5q9j-m8gm) document multiple instances where attackers exploited auto_map to execute arbitrary code during model loading. Official vLLM documentation explicitly recommends keeping this flag disabled by default. Only enable after explicit user confirmation and verification that the model is from a trusted source.

Suggested wording adjustment
-| `auto_map` exists | `--trust-remote-code` | Only add if model has custom code |
+| `auto_map` exists | Ask user to explicitly confirm `--trust-remote-code` | Explain this allows execution of model-provided remote code; add only after user confirms and verifies model trustworthiness |
📝 Committable suggestion


Suggested change
-| Field in `config.json` | What to set | Example |
-| --- | --- | --- |
-| `max_position_embeddings` | `--max-model-len <value>` | `131072` → `--max-model-len 131072` |
-| `auto_map` exists | `--trust-remote-code` | Only add if model has custom code |
+| Field in `config.json` | What to set | Example |
+| --- | --- | --- |
+| `max_position_embeddings` | `--max-model-len <value>` | `131072` → `--max-model-len 131072` |
+| `auto_map` exists | Ask user to explicitly confirm `--trust-remote-code` | Explain this allows execution of model-provided remote code; add only after user confirms and verifies model trustworthiness |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/SKILL.md around lines 136-140: the doc currently
implies automatically enabling --trust-remote-code when auto_map exists; change
this to explicitly warn against auto-enabling and instruct readers to require
explicit user confirmation before setting --trust-remote-code. Update the table
row referencing `auto_map` and `--trust-remote-code` to state "Do not enable by
default; require explicit confirmation and verification of model provenance
(trusted source)", and add a short advisory note referencing vLLM security best
practices and the RCE risks (auto_map) so maintainers/users must opt in after
verification.
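
Taken together, the detection logic might look like this minimal sketch (it treats `auto_map` as a prompt for user confirmation, per the review above; `jq` availability and the exact flag assembly are assumptions):

```bash
# Hedged sketch: derive deployment.extra_args from the config.json fields above.
CFG=<checkpoint_path>/config.json   # placeholder path, as in the earlier cat example
EXTRA_ARGS="--max-model-len $(jq -r '.max_position_embeddings' "$CFG")"
if jq -e '.auto_map' "$CFG" >/dev/null; then
  # Do NOT add --trust-remote-code automatically: it executes model-provided code (RCE risk).
  echo "auto_map present: ask the user to confirm --trust-remote-code for this checkpoint"
fi
echo "deployment.extra_args: $EXTRA_ARGS"
```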

Then use WebSearch to check the model card (HuggingFace page) for deployment-specific settings:

| Model card signal | What to set |
| --- | --- |
| Reasoning model (thinking/CoT) | `--reasoning-parser` and `--reasoning-parser-plugin` if a custom parser is provided |
| Tool-calling support | `--enable-auto-tool-choice --tool-call-parser <parser>` |
| Custom vLLM flags documented | Add as specified (e.g., `--mamba_ssm_cache_dtype float32`) |

Combine all detected flags into a single `deployment.extra_args` override. The recipe's default `--max-model-len 32768` is a fallback — always prefer the value from `config.json`.
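
For example, a combined override could look like the following (flag values are illustrative; the actual parser names and context length depend on what the checkpoint and model card specify):

```bash
# Illustrative only: substitute the values detected above.
nel run --config recipes/examples/example_eval.yaml \
  -o 'deployment.extra_args=--max-model-len 131072 --enable-auto-tool-choice --tool-call-parser hermes'
```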

**Quantization-aware benchmark defaults:**

When a quantized checkpoint is detected, read `references/quantization-benchmarks.md` for benchmark sensitivity rankings and recommended sets. Present recommendations to the user and ask which to include.
@@ -218,7 +243,13 @@ ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"

Print the following commands to the user. Propose to execute them in order to confirm the config works as expected before the full run.

**Important**: Export required environment variables based on your config. If any tokens or keys are missing (e.g. `HF_TOKEN`, `NGC_API_KEY`, `api_key_name` from the config), ask the user to put them in a `.env` file in the project root so you can run `set -a && source .env && set +a` (or equivalent) before executing `nel run` commands.
**Important**: Export required environment variables based on your config. If any tokens or keys are missing, point the user to `recipes/env.example` — it lists all possible keys with notes on which tasks need them. Ask the user to copy it, fill in their keys, and source it:

```bash
cp recipes/env.example .env
# Edit .env with your keys
set -a && source .env && set +a
```

```bash
# If using pre_cmd or post_cmd (review pre_cmd content before enabling — it runs arbitrary commands):
# … (diff truncated)
```
29 changes: 29 additions & 0 deletions .claude/skills/evaluation/recipes/env.example
@@ -0,0 +1,29 @@
# Evaluation API Keys
#
# Copy this file and fill in the keys you need:
# cp recipes/env.example .env
# # Edit .env with your keys
# set -a && source .env && set +a
#
# Not all keys are required — only fill in what your tasks need.

# Required for all tasks (model/dataset downloads)
HF_TOKEN=hf_...

# Required for nemo_skills.* tasks (dummy value, not a real key)
DUMMY_API_KEY=dummy

# Required for NEL pre_cmd execution
NEMO_EVALUATOR_TRUST_PRE_CMD=1

# --- Optional: task-specific keys ---

# AIME 2025 (simple_evals variant only, not ns_aime2025)
# JUDGE_API_KEY=

# tau2_bench_telecom (LLM judge)
# JUDGE_API_KEY_NVDEV_QWEN235B=

# terminal-bench-hard (AWS sandbox)
# AWS_ACCESS_KEY_ID=
# AWS_SECRET_ACCESS_KEY=
108 changes: 108 additions & 0 deletions .claude/skills/evaluation/recipes/examples/example_eval.yaml
@@ -0,0 +1,108 @@
# Example: Quantization Validation Suite
Collaborator

NEL config YAML may change quite frequently. Is this YAML for demo purposes or for day-0 model evals (internal usage)?

Also, some of the evals require a pinned eval Docker image and specific settings for an apples-to-apples comparison.

Contributor Author

These recipes are for demo purposes: task snippets that the agent composes into a working config. If NEL configs change and something breaks, the agent will diagnose and fix the incompatibility at runtime.

Contributor Author

For evals that require pinned Docker images and specific settings for apples-to-apples comparison, users can override the container.

#
# A balanced set of benchmarks for validating quantized model quality.
# Copy this file and customize for your needs.
#
# Includes:
# - MMLU-Pro (knowledge, completions)
# - GPQA Diamond (reasoning, chat, 5 repeats)
# - LiveCodeBench v6 (code, chat, 3 repeats)
# - IFBench (instruction following, chat, 8 repeats)
#
# Usage:
# nel run --config recipes/examples/example_eval.yaml \
# -o deployment.checkpoint_path=/path/to/quantized/checkpoint \
# -o deployment.served_model_name=my-model-nvfp4 \
# -o execution.hostname=<slurm_host> \
# -o execution.account=<slurm_account> \
# -o execution.output_dir=/path/to/output
#
# For quantized checkpoints, also add the quantization flag:
# -o 'deployment.extra_args=--max-model-len 32768 --trust-remote-code --quantization modelopt_fp4'
#
# Run a single task:
# nel run --config ... -t ns_gpqa
#
# Smoke test (2 samples):
# nel run --config ... -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=2
defaults:
- execution: slurm/default
- deployment: vllm
- _self_
execution:
hostname: ???
username: ${oc.env:USER}
account: ???
output_dir: ???
walltime: "04:00:00"
mounts:
mount_home: false
deployment:
env_vars:
HF_TOKEN: host:HF_TOKEN
checkpoint_path: ???
hf_model_handle:
served_model_name: ???
tensor_parallel_size: 1
data_parallel_size: 1
# For models with custom code, add: --trust-remote-code
extra_args: --max-model-len 32768
Contributor

Can we let the agent decide the required extra_args based on the model card/config of the checkpoint? e.g., max-model-len, tool-call-parser, reasoning-parser, ...

Contributor Author

I've added a section for auto-detecting deployment settings from the checkpoint.

evaluation:
env_vars:
HF_TOKEN: host:HF_TOKEN
nemo_evaluator_config:
config:
params:
request_timeout: 3600
max_retries: 10
parallelism: 16
target:
api_endpoint:
api_key_name: DUMMY_API_KEY
tasks:
# Knowledge (completions endpoint, short)
- name: adlr_mmlu_pro_5_shot_base

# Reasoning (chat endpoint, 5 repeats, short)
- name: ns_gpqa
nemo_evaluator_config:
config:
params:
extra:
args: ++prompt_config=eval/aai/mcq-4choices
num_repeats: 5
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens

# Code (chat endpoint, 3 repeats, medium)
- name: ns_livecodebench
nemo_evaluator_config:
config:
params:
extra:
dataset_split: test_v6_2408_2505
num_repeats: 3
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens

# Instruction following (chat endpoint, 8 repeats, super short)
- name: ns_ifbench
nemo_evaluator_config:
config:
params:
extra:
num_repeats: 8
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens
65 changes: 65 additions & 0 deletions .claude/skills/evaluation/recipes/tasks/aime2025.yaml
@@ -0,0 +1,65 @@
# AIME 2025 (NeMo Skills, chat)
#
# Math competition benchmark. Uses the chat endpoint.
# Primary metric: pass@1[avg-of-16] symbolic_correct
# Run time: Long (reasoning models generate lengthy thinking traces)
# Repeats: 16
#
# Note: The AA variant (simple_evals.AIME_2025) requires JUDGE_API_KEY
# for LLM-based scoring. This NeMo Skills variant uses symbolic scoring
# and does not require external API keys.
#
# Usage:
# nel run --config recipes/tasks/aime2025.yaml \
# -o deployment.checkpoint_path=/path/to/checkpoint \
# -o execution.hostname=<slurm_host> \
# -o execution.account=<slurm_account> \
# -o execution.output_dir=/path/to/output \
# -o deployment.served_model_name=<model_name>
defaults:
- execution: slurm/default
- deployment: vllm
- _self_
execution:
hostname: ???
username: ${oc.env:USER}
account: ???
output_dir: ???
walltime: "04:00:00"
mounts:
mount_home: false
deployment:
env_vars:
HF_TOKEN: host:HF_TOKEN
checkpoint_path: ???
hf_model_handle:
served_model_name: ???
tensor_parallel_size: 1
data_parallel_size: 1
# For models with custom code, add: --trust-remote-code
extra_args: --max-model-len 32768
evaluation:
env_vars:
HF_TOKEN: host:HF_TOKEN
nemo_evaluator_config:
config:
params:
request_timeout: 100000
max_retries: 10
parallelism: 16
target:
api_endpoint:
api_key_name: DUMMY_API_KEY
tasks:
Contributor

What's the reason for keeping one YAML file per task? Can we put them together and let the AI agent compose the target set of benchmarks? Also, what about other benchmarks like tau2? Can the AI agent compose a working config without an example?

Contributor Author

I agree we need to let the agent compose the target set of benchmarks. It's more flexible to run a single task directly or to compose tasks into a suite by copying recipes/examples/example_eval.yaml. Keeping one monolithic working config may not be flexible, since some of its settings aren't needed by all users.

Contributor

What do you think about keeping only the tasks part in these tasks/<benchmark>.yaml files, since the other config should be the same across benchmarks? It would reduce token usage and keep the rest of the setup consistent across benchmarks.

Contributor Author

This is a good suggestion. I've stripped each task file down to just the task config and created one shared base config.

- name: ns_aime2025
nemo_evaluator_config:
config:
params:
extra:
num_repeats: 16
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens
62 changes: 62 additions & 0 deletions .claude/skills/evaluation/recipes/tasks/gpqa.yaml
@@ -0,0 +1,62 @@
# GPQA Diamond (NeMo Skills, chat)
#
# Graduate-level reasoning benchmark. Uses the chat endpoint.
# Primary metric: pass@1[avg-of-5] symbolic_correct
# Run time: Short
# Repeats: 5
#
# Usage:
# nel run --config recipes/tasks/gpqa.yaml \
# -o deployment.checkpoint_path=/path/to/checkpoint \
# -o execution.hostname=<slurm_host> \
# -o execution.account=<slurm_account> \
# -o execution.output_dir=/path/to/output \
# -o deployment.served_model_name=<model_name>
defaults:
- execution: slurm/default
- deployment: vllm
- _self_
execution:
hostname: ???
username: ${oc.env:USER}
account: ???
output_dir: ???
walltime: "02:00:00"
mounts:
mount_home: false
deployment:
env_vars:
HF_TOKEN: host:HF_TOKEN
checkpoint_path: ???
hf_model_handle:
served_model_name: ???
tensor_parallel_size: 1
data_parallel_size: 1
# For models with custom code, add: --trust-remote-code
extra_args: --max-model-len 32768
evaluation:
env_vars:
HF_TOKEN: host:HF_TOKEN
nemo_evaluator_config:
config:
params:
request_timeout: 3600
max_retries: 5
parallelism: 16
target:
api_endpoint:
api_key_name: DUMMY_API_KEY
tasks:
- name: ns_gpqa
nemo_evaluator_config:
config:
params:
extra:
args: ++prompt_config=eval/aai/mcq-4choices
num_repeats: 5
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens
61 changes: 61 additions & 0 deletions .claude/skills/evaluation/recipes/tasks/ifbench.yaml
@@ -0,0 +1,61 @@
# IFBench (NeMo Skills, chat)
#
# Instruction following benchmark. Uses the chat endpoint.
# Primary metric: pass@1[avg-of-8] prompt_strict_accuracy
# Run time: Super Short
# Repeats: 8
#
# Usage:
# nel run --config recipes/tasks/ifbench.yaml \
# -o deployment.checkpoint_path=/path/to/checkpoint \
# -o execution.hostname=<slurm_host> \
# -o execution.account=<slurm_account> \
# -o execution.output_dir=/path/to/output \
# -o deployment.served_model_name=<model_name>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Usage example misses required environment setup.

The command snippet omits HF_TOKEN/DUMMY_API_KEY setup even though this recipe depends on them (Lines 29, 39, 48). A direct copy-paste run can fail on auth.

Suggested usage block update
 # Usage:
+#   cp recipes/env.example .env
+#   # Edit .env with required keys (HF_TOKEN, DUMMY_API_KEY, ...)
+#   set -a && source .env && set +a
 #   nel run --config recipes/tasks/ifbench.yaml \
 #     -o deployment.checkpoint_path=/path/to/checkpoint \
 #     -o execution.hostname=<slurm_host> \
📝 Committable suggestion


Suggested change
-# Usage:
-#   nel run --config recipes/tasks/ifbench.yaml \
-#     -o deployment.checkpoint_path=/path/to/checkpoint \
-#     -o execution.hostname=<slurm_host> \
-#     -o execution.account=<slurm_account> \
-#     -o execution.output_dir=/path/to/output \
-#     -o deployment.served_model_name=<model_name>
+# Usage:
+#   cp recipes/env.example .env
+#   # Edit .env with required keys (HF_TOKEN, DUMMY_API_KEY, ...)
+#   set -a && source .env && set +a
+#   nel run --config recipes/tasks/ifbench.yaml \
+#     -o deployment.checkpoint_path=/path/to/checkpoint \
+#     -o execution.hostname=<slurm_host> \
+#     -o execution.account=<slurm_account> \
+#     -o execution.output_dir=/path/to/output \
+#     -o deployment.served_model_name=<model_name>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/recipes/tasks/ifbench.yaml around lines 8-14:
the usage snippet omits required auth environment variables (HF_TOKEN, read at
lines ~29 and ~39, and DUMMY_API_KEY, read at ~48); update the Usage example
to show setting these before running (either export HF_TOKEN=... and export
DUMMY_API_KEY=..., or prefix the nel run command with HF_TOKEN=...
DUMMY_API_KEY=...), so users have the required credentials available when
invoking the recipe.

defaults:
- execution: slurm/default
- deployment: vllm
- _self_
execution:
hostname: ???
username: ${oc.env:USER}
account: ???
output_dir: ???
walltime: "02:00:00"
mounts:
mount_home: false
deployment:
env_vars:
HF_TOKEN: host:HF_TOKEN
checkpoint_path: ???
hf_model_handle:
served_model_name: ???
tensor_parallel_size: 1
data_parallel_size: 1
# For models with custom code, add: --trust-remote-code
extra_args: --max-model-len 32768
evaluation:
env_vars:
HF_TOKEN: host:HF_TOKEN
nemo_evaluator_config:
config:
params:
request_timeout: 3600
max_retries: 5
parallelism: 16
target:
api_endpoint:
api_key_name: DUMMY_API_KEY
tasks:
- name: ns_ifbench
nemo_evaluator_config:
config:
params:
extra:
num_repeats: 8
target:
api_endpoint:
adapter_config:
params_to_remove:
- max_new_tokens
- max_completion_tokens