29 changes: 29 additions & 0 deletions .github/workflows/integration-tests-claude.yml
@@ -0,0 +1,29 @@
name: Claude Integration Tests
on:
  schedule:
    - cron: "0 6 * * 1" # Weekly Monday 6am UTC
  workflow_dispatch:
  push:
    paths:
      - "bionemo-recipes/claude-plugin/**"
      - "bionemo-recipes/integration-tests/**"

jobs:
  test:
    runs-on: linux-amd64-gpu-l4-latest-1
    container:
      image: nvcr.io/nvidia/pytorch:25.06-py3
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v4

      - name: Install Claude Code CLI
        run: npm install -g @anthropic-ai/claude-code

      - name: Install test dependencies
        run: pip install pytest pytest-timeout

      - name: Run integration tests
        run: cd bionemo-recipes/integration-tests && pytest -v --timeout=600
        timeout-minutes: 30
Comment on lines +13 to +29

Check warning — Code scanning / CodeQL: Workflow does not contain permissions (Medium)

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

Copilot Autofix (AI, 28 days ago)

In general, the fix is to explicitly define a permissions block in the workflow or job to restrict the GITHUB_TOKEN to the least privileges needed. This job only checks out code and runs tests, so it should only require read access to repository contents.

The best fix without changing functionality is to add a top-level permissions section (so it applies to all jobs) immediately after the name: declaration in .github/workflows/integration-tests-claude.yml, specifying contents: read. This matches the minimal suggestion from CodeQL and GitHub, and does not interfere with the existing steps (actions/checkout, npm install, pip install, pytest, all of which run locally in the container). No new imports or external dependencies are required; only the YAML configuration of the workflow changes.

Concretely:

  • Edit .github/workflows/integration-tests-claude.yml.
  • Insert:

        permissions:
          contents: read

    after line 1 (name: Claude Integration Tests) and before the on: block. No other lines need to be modified.

Suggested changeset: .github/workflows/integration-tests-claude.yml

Run the following command in your local git repository to apply this patch:

    cat << 'EOF' | git apply
    diff --git a/.github/workflows/integration-tests-claude.yml b/.github/workflows/integration-tests-claude.yml
    --- a/.github/workflows/integration-tests-claude.yml
    +++ b/.github/workflows/integration-tests-claude.yml
    @@ -1,4 +1,6 @@
     name: Claude Integration Tests
    +permissions:
    +  contents: read
     on:
       schedule:
         - cron: "0 6 * * 1" # Weekly Monday 6am UTC
    EOF
9 changes: 9 additions & 0 deletions bionemo-recipes/claude-plugin/.claude-plugin/plugin.json
@@ -0,0 +1,9 @@
{
  "name": "bionemo-recipes",
  "version": "0.1.0",
  "description": "Convert HuggingFace models to TransformerEngine, add FP8 support, set up distributed training — using NVIDIA BioNeMo Recipes as reference.",
  "author": { "name": "NVIDIA BioNeMo Team" },
  "repository": "https://github.com/NVIDIA/bionemo-framework",
  "license": "Apache-2.0",
  "keywords": ["transformerengine", "fp8", "fsdp", "distributed-training", "nvidia"]
}
73 changes: 73 additions & 0 deletions bionemo-recipes/claude-plugin/README.md
@@ -0,0 +1,73 @@
# BioNeMo Recipes Claude Plugin

A Claude Code plugin for converting HuggingFace models to NVIDIA TransformerEngine,
adding FP8/FP4 quantization support, writing golden value tests, and setting up
FSDP distributed training. All skills use real BioNeMo Recipes as reference implementations.

## Installation

```bash
claude --add-dir /path/to/bionemo-recipes/claude-plugin
```

## Available Skills

| Skill | Description |
| ---------------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `/te-convert-model` | Convert a HuggingFace `PreTrainedModel` to use TransformerEngine layers with bidirectional weight conversion (HF \<-> TE). |
| `/add-fp8-support` | Add FP8 or FP4 quantized training support to an existing TransformerEngine model. |
| `/write-golden-tests` | Generate golden value tests that verify a TE model produces identical outputs to the original HF reference model. |
| `/setup-fsdp-training` | Scaffold a complete FSDP training recipe with Hydra configs, distributed launcher, and Docker environment. |
| `/export-to-hf-hub` | Create an export script that bundles model weights, tokenizer, and config for publishing to the Hugging Face Hub. |

## Usage Examples

### Convert a HuggingFace model to TransformerEngine

```
/te-convert-model facebook/esm2_t33_650M_UR50D
```

Generates a TE-backed `PreTrainedModel` class with `convert_hf_to_te()` and
`convert_te_to_hf()` functions, following the pattern in `bionemo-recipes/models/`.

### Add FP8 quantized training

```
/add-fp8-support --precision fp8
```

Adds FP8 recipe configuration, `DelayedScaling` setup, and the `fp8_autocast`
context manager to your training loop.

### Write golden value tests

```
/write-golden-tests --model esm2 --reference facebook/esm2_t33_650M_UR50D
```

Creates pytest tests that load both the HF reference and TE model, run a forward
pass with fixed inputs, and assert outputs match within tolerance.
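The core of such a test is a tolerance check between the two forward passes. A minimal sketch, with plain Python lists standing in for the output tensors (`hf_logits` and `te_logits` are hypothetical stand-ins; real tests would compare torch tensors, e.g. with `torch.testing.assert_close`):

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat outputs."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical flattened logits from the two forward passes.
hf_logits = [0.101, -2.340, 1.576]      # reference (HF) model
te_logits = [0.1011, -2.3401, 1.5758]   # converted (TE) model

# Golden test assertion: outputs must agree within tolerance.
assert max_abs_diff(hf_logits, te_logits) < 1e-3
```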

### Set up FSDP distributed training

```
/setup-fsdp-training --model esm2 --framework native_te
```

Scaffolds a self-contained recipe directory with a Dockerfile, training script,
Hydra configs, and a sample data loader.

### Export model to Hugging Face Hub

```
/export-to-hf-hub --model esm2
```

Generates an `export.py` script that packages weights, config, and tokenizer
files for upload to Hugging Face Hub.

## Links

- [BioNeMo Framework](https://github.com/NVIDIA/bionemo-framework)
- [BioNeMo Recipes README](../README.md)
136 changes: 136 additions & 0 deletions bionemo-recipes/claude-plugin/skills/add-fp8-support/SKILL.md
@@ -0,0 +1,136 @@
---
name: add-fp8-support
description: >
Add FP8, MXFP8, or NVFP4 quantization support to a TransformerEngine model.
Triggers when user asks about FP8, FP4, quantization, mixed precision,
or low-precision training.
allowed-tools: Read, Grep, Glob, Write, Edit, Bash, Agent
argument-hint: '[fp8|mxfp8|nvfp4]'
---

# Add FP8/FP4 Quantization Support

You are adding quantization support to a TransformerEngine model. Read the reference files first.

## Reference Files

- `reference/quantization.py` — Layer-wise precision assignment
- `reference/fp8_config_example.py` — FP8 recipe setup in training

## Steps

### 1. Add Config Fields

Add these fields to the NV config class:

- `layer_precision: list[str | None] | None = None` — Per-layer precision ("fp8", "fp4", None)
- `use_quantized_model_init: bool = False` — Initialize weights directly in quantized format

Validate in `__init__`:

```python
if layer_precision is not None:
    assert len(layer_precision) == self.num_hidden_layers
    for p in layer_precision:
        assert p in {"fp8", "fp4", None}
```

### 2. Pad Vocabulary Size

FP8 requires tensor dimensions divisible by 16. Pad vocab:

```python
self.padded_vocab_size = padded_vocab_size or self.vocab_size
# Round up to next multiple of 16
if self.padded_vocab_size % 16 != 0:
    self.padded_vocab_size = ((self.padded_vocab_size + 15) // 16) * 16
```

Update embedding and LM head to use `padded_vocab_size`. Truncate logits back to `vocab_size` in forward pass.
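The rounding rule above can be isolated into a small helper (a sketch; the name `pad_to_multiple` is illustrative, not part of the recipes):

```python
def pad_to_multiple(size: int, multiple: int = 16) -> int:
    """Round size up to the next multiple (identity if already aligned)."""
    return ((size + multiple - 1) // multiple) * multiple


# pad_to_multiple(33) -> 48; pad_to_multiple(64) -> 64
```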

### 3. Implement `get_autocast_context()`

This method returns the appropriate TE context manager for each layer:

```python
from contextlib import nullcontext
import transformer_engine.pytorch as te


def get_autocast_context(self, layer_number, init=False, outer=False):
    if self.config.layer_precision is None:
        return nullcontext()

    # Outer context wraps entire encoder for recipe post-processing
    if outer:
        if "fp8" not in self.config.layer_precision:
            return nullcontext()
        return te.autocast(enabled=True, recipe=self._fp8_recipe)

    precision = self.config.layer_precision[layer_number]
    recipe = {"fp8": self._fp8_recipe, "fp4": self._fp4_recipe}.get(precision)

    # During init: use quantized_model_init for weight initialization
    if init and self.config.use_quantized_model_init:
        if precision in ("fp8", "fp4"):
            return te.quantized_model_init(recipe=recipe)
        return nullcontext()

    # During forward: use autocast for precision control
    if precision in ("fp8", "fp4"):
        return te.autocast(enabled=True, recipe=recipe)
    return te.autocast(enabled=False)  # Explicitly disable for BF16 layers
```

### 4. Use Contexts in Model

During layer creation:

```python
for i in range(config.num_hidden_layers):
    with self.get_autocast_context(i, init=True):
        layers.append(te.TransformerLayer(...))
```

During forward pass:

```python
with self.get_autocast_context(None, outer=True):
    for layer_idx, layer in enumerate(self.layers):
        with self.get_autocast_context(layer_idx):
            hidden_states = layer(hidden_states, ...)
```

### 5. Keep LM Head in Higher Precision

```python
with te.autocast(enabled=False):
    logits = self.lm_head(hidden_states)
```

### 6. Set Up FP8 Recipes

In training script:

```python
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID)
model = MyTEModel(config, fp8_recipe=fp8_recipe)
```

Available recipes:

- `DelayedScaling` — Classic FP8, computes scaling factors with delay
- `Float8CurrentScaling` — Per-tensor current scaling
- `Float8BlockScaling` — Block-wise scaling (MXFP8)
- `NVFP4BlockScaling` — 4-bit quantization

### 7. Layer-wise Precision Assignment

Use `resolve_layer_precision()` from reference to assign layers:

```python
# In config: fp8_layers=[1,2,3], fp4_layers=[4,5,6] (1-indexed)
# Returns: ["fp8","fp8","fp8","fp4","fp4","fp4"] (0-indexed)
```
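As a condensed sketch of that mapping (the reference implementation additionally validates overlap and auto-fills unassigned layers; `assign_precision` is an illustrative name):

```python
def assign_precision(num_layers, fp8_layers=(), fp4_layers=()):
    """Map 1-indexed layer lists to a 0-indexed per-layer precision list."""
    fp8, fp4 = set(fp8_layers), set(fp4_layers)
    return ["fp8" if i in fp8 else "fp4" if i in fp4 else None
            for i in range(1, num_layers + 1)]


assign_precision(6, fp8_layers=[1, 2, 3], fp4_layers=[4, 5, 6])
# -> ["fp8", "fp8", "fp8", "fp4", "fp4", "fp4"]
```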
66 changes: 66 additions & 0 deletions bionemo-recipes/claude-plugin/skills/add-fp8-support/reference/fp8_config_example.py
@@ -0,0 +1,66 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: LicenseRef-Apache2
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Reference: FP8 recipe setup in a training script.

Shows how to create and use FP8/FP4 recipes with TransformerEngine models.
"""

from transformer_engine.common.recipe import (
    DelayedScaling,
    Float8BlockScaling,
    Float8CurrentScaling,
    Format,
    NVFP4BlockScaling,
)


def create_fp8_recipe(recipe_name: str = "DelayedScaling", **kwargs):
    """Create an FP8 recipe by name.

    Available recipes:
    - DelayedScaling: Classic FP8, scaling factors computed with delay
    - Float8CurrentScaling: Per-tensor scaling computed each step
    - Float8BlockScaling: Block-wise scaling (MXFP8)
    - NVFP4BlockScaling: 4-bit quantization
    """
    recipes = {
        "DelayedScaling": DelayedScaling,
        "Float8CurrentScaling": Float8CurrentScaling,
        "Float8BlockScaling": Float8BlockScaling,
        "NVFP4BlockScaling": NVFP4BlockScaling,
    }
    recipe_cls = recipes[recipe_name]

    # NOTE: Format.HYBRID uses E4M3 for forward, E5M2 for backward
    if "fp8_format" not in kwargs and recipe_name != "NVFP4BlockScaling":
        kwargs["fp8_format"] = Format.HYBRID
    if "fp4_format" not in kwargs and recipe_name == "NVFP4BlockScaling":
        kwargs["fp4_format"] = Format.E2M1

    return recipe_cls(**kwargs)


# Example usage in training script:
def setup_model_with_fp8(config, layer_precision):
    """Example of setting up a TE model with FP8 quantization."""
    config.layer_precision = layer_precision

    fp8_recipe = create_fp8_recipe("DelayedScaling")

    # NOTE: Pass recipe to model constructor, not as global state
    # model = NVModelForMaskedLM(config, fp8_recipe=fp8_recipe)

    return config, fp8_recipe
69 changes: 69 additions & 0 deletions bionemo-recipes/claude-plugin/skills/add-fp8-support/reference/quantization.py
@@ -0,0 +1,69 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: LicenseRef-Apache2
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Reference: Layer-wise quantization assignment utilities.

Demonstrates how to resolve user-specified layer lists into per-layer precision assignments.
"""


def resolve_layer_precision(
    num_layers: int,
    fp8_enabled: bool,
    fp4_enabled: bool,
    fp8_layers: list[int] | None,
    fp4_layers: list[int] | None,
) -> list[str | None]:
    """Resolve layer-wise quantization from user config.

    Takes 1-indexed layer lists and returns 0-indexed precision list.

    Examples:
        # All layers FP8
        resolve_layer_precision(6, fp8_enabled=True, fp4_enabled=False, fp8_layers=None, fp4_layers=None)
        # -> ["fp8", "fp8", "fp8", "fp8", "fp8", "fp8"]

        # Mixed: layers 1-3 FP8, layers 4-6 FP4
        resolve_layer_precision(6, True, True, [1, 2, 3], [4, 5, 6])
        # -> ["fp8", "fp8", "fp8", "fp4", "fp4", "fp4"]
    """
    all_layers = set(range(1, num_layers + 1))

    if fp8_enabled and fp4_enabled and fp8_layers is None and fp4_layers is None:
        raise ValueError("Both fp8 and fp4 enabled but no layer lists specified. Provide explicit layer assignments.")

    # Auto-fill: if one format has explicit layers, the other gets the remaining layers
    if fp8_enabled and fp8_layers is None:
        claimed = set(fp4_layers) if fp4_layers else set()
        fp8_layers = sorted(all_layers - claimed)

    if fp4_enabled and fp4_layers is None:
        claimed = set(fp8_layers) if fp8_layers else set()
        fp4_layers = sorted(all_layers - claimed)

    if not fp8_enabled:
        fp8_layers = None
    if not fp4_enabled:
        fp4_layers = None

    # Validate no overlap
    if fp8_layers and fp4_layers:
        overlap = set(fp8_layers) & set(fp4_layers)
        if overlap:
            raise ValueError(f"Overlapping layers: {overlap}")

    fp8_set = set(fp8_layers) if fp8_layers else set()
    fp4_set = set(fp4_layers) if fp4_layers else set()
    return ["fp8" if i in fp8_set else "fp4" if i in fp4_set else None for i in range(1, num_layers + 1)]