29 changes: 29 additions & 0 deletions .github/workflows/integration-tests-claude.yml
@@ -0,0 +1,29 @@
name: Claude Integration Tests
on:
  schedule:
    - cron: "0 6 * * 1" # Weekly Monday 6am UTC
  workflow_dispatch:
  push:
    paths:
      - "bionemo-recipes/claude-plugin/**"
      - "bionemo-recipes/integration-tests/**"

jobs:
  test:
    runs-on: linux-amd64-gpu-l4-latest-1
    container:
      image: nvcr.io/nvidia/pytorch:25.06-py3
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v4

      - name: Install Claude Code CLI
        run: npm install -g @anthropic-ai/claude-code

      - name: Install test dependencies
        run: pip install pytest pytest-timeout

      - name: Run integration tests
        run: cd bionemo-recipes/integration-tests && pytest -v --timeout=600
        timeout-minutes: 30
Comment on lines +13 to +29

Check warning — Code scanning / CodeQL: Workflow does not contain permissions (Medium)

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

Copilot Autofix (AI, 28 days ago)

In general, the fix is to explicitly define a permissions block in the workflow or job to restrict the GITHUB_TOKEN to the least privileges needed. This job only checks out code and runs tests, so it should only require read access to repository contents.

The best fix without changing functionality is to add a top-level permissions section (so it applies to all jobs) immediately after the name: declaration in .github/workflows/integration-tests-claude.yml, specifying contents: read. This matches the minimal suggestion from CodeQL and GitHub, and does not interfere with the existing steps (actions/checkout, npm install, pip install, pytest, all of which run locally in the container). No new imports or external dependencies are required; only the YAML configuration of the workflow changes.

Concretely:

  • Edit .github/workflows/integration-tests-claude.yml.
  • Insert:

        permissions:
          contents: read

    after line 1 (name: Claude Integration Tests) and before the on: block. No other lines need to be modified.

Suggested changeset: .github/workflows/integration-tests-claude.yml

Run the following command in your local git repository to apply this patch:

    cat << 'EOF' | git apply
    diff --git a/.github/workflows/integration-tests-claude.yml b/.github/workflows/integration-tests-claude.yml
    --- a/.github/workflows/integration-tests-claude.yml
    +++ b/.github/workflows/integration-tests-claude.yml
    @@ -1,4 +1,6 @@
     name: Claude Integration Tests
    +permissions:
    +  contents: read
     on:
       schedule:
         - cron: "0 6 * * 1" # Weekly Monday 6am UTC
    EOF
9 changes: 9 additions & 0 deletions bionemo-recipes/claude-plugin/.claude-plugin/plugin.json
@@ -0,0 +1,9 @@
{
  "name": "bionemo-recipes",
  "version": "0.1.0",
  "description": "Convert HuggingFace models to TransformerEngine, add FP8 support, set up distributed training — using NVIDIA BioNeMo Recipes as reference.",
  "author": { "name": "NVIDIA BioNeMo Team" },
  "repository": "https://github.com/NVIDIA/bionemo-framework",
  "license": "Apache-2.0",
  "keywords": ["transformerengine", "fp8", "fsdp", "distributed-training", "nvidia"]
}
73 changes: 73 additions & 0 deletions bionemo-recipes/claude-plugin/README.md
@@ -0,0 +1,73 @@
# BioNeMo Recipes Claude Plugin

A Claude Code plugin for converting HuggingFace models to NVIDIA TransformerEngine,
adding FP8/FP4 quantization support, writing golden value tests, and setting up
FSDP distributed training. All skills use real BioNeMo Recipes as reference implementations.

## Installation

```bash
claude --add-dir /path/to/bionemo-recipes/claude-plugin
```

## Available Skills

| Skill | Description |
| ---------------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `/te-convert-model` | Convert a HuggingFace `PreTrainedModel` to use TransformerEngine layers with bidirectional weight conversion (HF \<-> TE). |
| `/add-fp8-support` | Add FP8 or FP4 quantized training support to an existing TransformerEngine model. |
| `/write-golden-tests` | Generate golden value tests that verify a TE model produces identical outputs to the original HF reference model. |
| `/setup-fsdp-training` | Scaffold a complete FSDP training recipe with Hydra configs, distributed launcher, and Docker environment. |
| `/export-to-hf-hub` | Create an export script that bundles model weights, tokenizer, and config for publishing to the Hugging Face Hub. |

## Usage Examples

### Convert a HuggingFace model to TransformerEngine

```
/te-convert-model facebook/esm2_t33_650M_UR50D
```

Generates a TE-backed `PreTrainedModel` class with `convert_hf_to_te()` and
`convert_te_to_hf()` functions, following the pattern in `bionemo-recipes/models/`.

### Add FP8 quantized training

```
/add-fp8-support --precision fp8
```

Adds FP8 recipe configuration, `DelayedScaling` setup, and the `fp8_autocast`
context manager to your training loop.

### Write golden value tests

```
/write-golden-tests --model esm2 --reference facebook/esm2_t33_650M_UR50D
```

Creates pytest tests that load both the HF reference and TE model, run a forward
pass with fixed inputs, and assert outputs match within tolerance.
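The core of such a test is a tolerance check between the two forward passes. A minimal sketch, with plain Python lists standing in for the output tensors (`hf_logits` and `te_logits` are hypothetical stand-ins; real tests would compare torch tensors, e.g. with `torch.testing.assert_close`):

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat outputs."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical flattened logits from the two forward passes.
hf_logits = [0.101, -2.340, 1.576]      # reference (HF) model
te_logits = [0.1011, -2.3401, 1.5758]   # converted (TE) model

# Golden test assertion: outputs must agree within tolerance.
assert max_abs_diff(hf_logits, te_logits) < 1e-3
```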

### Set up FSDP distributed training

```
/setup-fsdp-training --model esm2 --framework native_te
```

Scaffolds a self-contained recipe directory with a Dockerfile, training script,
Hydra configs, and a sample data loader.

### Export model to Hugging Face Hub

```
/export-to-hf-hub --model esm2
```

Generates an `export.py` script that packages weights, config, and tokenizer
files for upload to Hugging Face Hub.

## Links

- [BioNeMo Framework](https://github.com/NVIDIA/bionemo-framework)
- [BioNeMo Recipes README](../README.md)
136 changes: 136 additions & 0 deletions bionemo-recipes/claude-plugin/skills/add-fp8-support/SKILL.md
@@ -0,0 +1,136 @@
---
name: add-fp8-support
description: >
Add FP8, MXFP8, or NVFP4 quantization support to a TransformerEngine model.
Triggers when user asks about FP8, FP4, quantization, mixed precision,
or low-precision training.
allowed-tools: Read, Grep, Glob, Write, Edit, Bash, Agent
argument-hint: '[fp8|mxfp8|nvfp4]'
---

# Add FP8/FP4 Quantization Support

You are adding quantization support to a TransformerEngine model. Read the reference files first.

## Reference Files

- `reference/quantization.py` — Layer-wise precision assignment
- `reference/fp8_config_example.py` — FP8 recipe setup in training

## Steps

### 1. Add Config Fields

Add these fields to the NV config class:

- `layer_precision: list[str | None] | None = None` — Per-layer precision ("fp8", "fp4", None)
- `use_quantized_model_init: bool = False` — Initialize weights directly in quantized format

Validate in `__init__`:

```python
if layer_precision is not None:
    assert len(layer_precision) == self.num_hidden_layers
    for p in layer_precision:
        assert p in {"fp8", "fp4", None}
```

### 2. Pad Vocabulary Size

FP8 requires tensor dimensions divisible by 16. Pad vocab:

```python
self.padded_vocab_size = padded_vocab_size or self.vocab_size
# Round up to next multiple of 16
if self.padded_vocab_size % 16 != 0:
    self.padded_vocab_size = ((self.padded_vocab_size + 15) // 16) * 16
```

Update embedding and LM head to use `padded_vocab_size`. Truncate logits back to `vocab_size` in forward pass.
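The rounding rule above can be isolated into a small helper (a sketch; the name `pad_to_multiple` is illustrative, not part of the recipes):

```python
def pad_to_multiple(size: int, multiple: int = 16) -> int:
    """Round size up to the next multiple (identity if already aligned)."""
    return ((size + multiple - 1) // multiple) * multiple


# pad_to_multiple(33) -> 48; pad_to_multiple(64) -> 64
```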

### 3. Implement `get_autocast_context()`

This method returns the appropriate TE context manager for each layer:

```python
from contextlib import nullcontext
import transformer_engine.pytorch as te


def get_autocast_context(self, layer_number, init=False, outer=False):
    if self.config.layer_precision is None:
        return nullcontext()

    # Outer context wraps entire encoder for recipe post-processing
    if outer:
        if "fp8" not in self.config.layer_precision:
            return nullcontext()
        return te.autocast(enabled=True, recipe=self._fp8_recipe)

    precision = self.config.layer_precision[layer_number]
    recipe = {"fp8": self._fp8_recipe, "fp4": self._fp4_recipe}.get(precision)

    # During init: use quantized_model_init for weight initialization
    if init and self.config.use_quantized_model_init:
        if precision in ("fp8", "fp4"):
            return te.quantized_model_init(recipe=recipe)
        return nullcontext()

    # During forward: use autocast for precision control
    if precision in ("fp8", "fp4"):
        return te.autocast(enabled=True, recipe=recipe)
    return te.autocast(enabled=False)  # Explicitly disable for BF16 layers
```

### 4. Use Contexts in Model

During layer creation:

```python
for i in range(config.num_hidden_layers):
    with self.get_autocast_context(i, init=True):
        layers.append(te.TransformerLayer(...))
```

During forward pass:

```python
with self.get_autocast_context(None, outer=True):
    for layer_idx, layer in enumerate(self.layers):
        with self.get_autocast_context(layer_idx):
            hidden_states = layer(hidden_states, ...)
```

### 5. Keep LM Head in Higher Precision

```python
with te.autocast(enabled=False):
    logits = self.lm_head(hidden_states)
```

### 6. Set Up FP8 Recipes

In training script:

```python
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID)
model = MyTEModel(config, fp8_recipe=fp8_recipe)
```

Available recipes:

- `DelayedScaling` — Classic FP8, computes scaling factors with delay
- `Float8CurrentScaling` — Per-tensor current scaling
- `Float8BlockScaling` — Block-wise scaling (MXFP8)
- `NVFP4BlockScaling` — 4-bit quantization

### 7. Layer-wise Precision Assignment

Use `resolve_layer_precision()` from reference to assign layers:

```python
# In config: fp8_layers=[1,2,3], fp4_layers=[4,5,6] (1-indexed)
# Returns: ["fp8","fp8","fp8","fp4","fp4","fp4"] (0-indexed)
```
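As a condensed sketch of that mapping (the reference implementation additionally validates overlap and auto-fills unassigned layers; `assign_precision` is an illustrative name):

```python
def assign_precision(num_layers, fp8_layers=(), fp4_layers=()):
    """Map 1-indexed layer lists to a 0-indexed per-layer precision list."""
    fp8, fp4 = set(fp8_layers), set(fp4_layers)
    return ["fp8" if i in fp8 else "fp4" if i in fp4 else None
            for i in range(1, num_layers + 1)]


assign_precision(6, fp8_layers=[1, 2, 3], fp4_layers=[4, 5, 6])
# -> ["fp8", "fp8", "fp8", "fp4", "fp4", "fp4"]
```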
66 changes: 66 additions & 0 deletions bionemo-recipes/claude-plugin/skills/add-fp8-support/reference/fp8_config_example.py
@@ -0,0 +1,66 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: LicenseRef-Apache2
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Reference: FP8 recipe setup in a training script.

Shows how to create and use FP8/FP4 recipes with TransformerEngine models.
"""

from transformer_engine.common.recipe import (
    DelayedScaling,
    Float8BlockScaling,
    Float8CurrentScaling,
    Format,
    NVFP4BlockScaling,
)


def create_fp8_recipe(recipe_name: str = "DelayedScaling", **kwargs):
    """Create an FP8 recipe by name.

    Available recipes:
    - DelayedScaling: Classic FP8, scaling factors computed with delay
    - Float8CurrentScaling: Per-tensor scaling computed each step
    - Float8BlockScaling: Block-wise scaling (MXFP8)
    - NVFP4BlockScaling: 4-bit quantization
    """
    recipes = {
        "DelayedScaling": DelayedScaling,
        "Float8CurrentScaling": Float8CurrentScaling,
        "Float8BlockScaling": Float8BlockScaling,
        "NVFP4BlockScaling": NVFP4BlockScaling,
    }
    recipe_cls = recipes[recipe_name]

    # NOTE: Format.HYBRID uses E4M3 for forward, E5M2 for backward
    if "fp8_format" not in kwargs and recipe_name != "NVFP4BlockScaling":
        kwargs["fp8_format"] = Format.HYBRID
    if "fp4_format" not in kwargs and recipe_name == "NVFP4BlockScaling":
        kwargs["fp4_format"] = Format.E2M1

    return recipe_cls(**kwargs)


# Example usage in training script:
def setup_model_with_fp8(config, layer_precision):
    """Example of setting up a TE model with FP8 quantization."""
    config.layer_precision = layer_precision

    fp8_recipe = create_fp8_recipe("DelayedScaling")

    # NOTE: Pass recipe to model constructor, not as global state
    # model = NVModelForMaskedLM(config, fp8_recipe=fp8_recipe)

    return config, fp8_recipe
69 changes: 69 additions & 0 deletions bionemo-recipes/claude-plugin/skills/add-fp8-support/reference/quantization.py
@@ -0,0 +1,69 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: LicenseRef-Apache2
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Reference: Layer-wise quantization assignment utilities.

Demonstrates how to resolve user-specified layer lists into per-layer precision assignments.
"""


def resolve_layer_precision(
    num_layers: int,
    fp8_enabled: bool,
    fp4_enabled: bool,
    fp8_layers: list[int] | None,
    fp4_layers: list[int] | None,
) -> list[str | None]:
    """Resolve layer-wise quantization from user config.

    Takes 1-indexed layer lists and returns 0-indexed precision list.

    Examples:
        # All layers FP8
        resolve_layer_precision(6, fp8_enabled=True, fp4_enabled=False, fp8_layers=None, fp4_layers=None)
        # -> ["fp8", "fp8", "fp8", "fp8", "fp8", "fp8"]

        # Mixed: layers 1-3 FP8, layers 4-6 FP4
        resolve_layer_precision(6, True, True, [1, 2, 3], [4, 5, 6])
        # -> ["fp8", "fp8", "fp8", "fp4", "fp4", "fp4"]
    """
    all_layers = set(range(1, num_layers + 1))

    if fp8_enabled and fp4_enabled and fp8_layers is None and fp4_layers is None:
        raise ValueError("Both fp8 and fp4 enabled but no layer lists specified. Provide explicit layer assignments.")

    # Auto-fill: if one format has explicit layers, the other gets the remaining layers
    if fp8_enabled and fp8_layers is None:
        claimed = set(fp4_layers) if fp4_layers else set()
        fp8_layers = sorted(all_layers - claimed)

    if fp4_enabled and fp4_layers is None:
        claimed = set(fp8_layers) if fp8_layers else set()
        fp4_layers = sorted(all_layers - claimed)

    if not fp8_enabled:
        fp8_layers = None
    if not fp4_enabled:
        fp4_layers = None

    # Validate no overlap
    if fp8_layers and fp4_layers:
        overlap = set(fp8_layers) & set(fp4_layers)
        if overlap:
            raise ValueError(f"Overlapping layers: {overlap}")

    fp8_set = set(fp8_layers) if fp8_layers else set()
    fp4_set = set(fp4_layers) if fp4_layers else set()
    return ["fp8" if i in fp8_set else "fp4" if i in fp4_set else None for i in range(1, num_layers + 1)]