Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
pstjohn reviewed on Apr 4, 2026:
```python
if rank == 0:
    print("Counting FLOPs with HF model (meta device)...")
    hf_config_meta = LlamaConfig.from_pretrained(args.config_path)
    hf_config_meta._attn_implementation = "eager"
```
This is going to be the slowest possible attention, can you use fa2 or sdpa?
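A minimal sketch of the suggested fix, assuming the `_attn_implementation` attribute from the quoted snippet; `"sdpa"` and `"flash_attention_2"` are the standard Hugging Face implementation identifiers, and the helper name here is hypothetical:

```python
def pick_attn_implementation(flash_available: bool) -> str:
    """Choose a faster HF attention backend than "eager".

    FlashAttention-2 is typically fastest when the flash-attn package is
    installed; SDPA is a solid default that ships with PyTorch >= 2.0.
    The returned string would be assigned to config._attn_implementation.
    """
    return "flash_attention_2" if flash_available else "sdpa"
```

For FLOPs counting on the meta device the backend choice does not change the count itself, which is why switching away from eager is safe.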
- Add compare_mfu_validate.py: golden-value tests comparing CP vs. non-CP execution for both TE and HF models, validating loss, logits (cosine similarity > 0.99), and gradients (cosine similarity > 0.8), following the pattern from models/llama3/tests/test_cp_bshd.py
- Fix silent RoPE bug in HF CP: position_ids were not passed to the context_parallel buffers, so each rank auto-generated [0..S/CP-1] instead of the correct global positions
- Switch FLOPs counting from eager to SDPA attention for consistency with the actual training implementation (identical counts on the meta device)
- Add exact FLOPs output (full integers with commas) alongside the abbreviated values in both single-GPU and multi-GPU scripts
- Switch the bandwidth measurement from all-reduce to all-gather for a more accurate pure data-movement measurement matching CP ring attention
- Add position_ids and max_length_q/k support to measure_step_time

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
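The RoPE fix described above can be illustrated with a small sketch of the global positions each CP rank should see. Contiguous sharding is an illustrative assumption (load-balanced ring-attention splits differ), and the helper name is hypothetical:

```python
def global_position_ids(seq_len: int, cp_size: int, cp_rank: int) -> list[int]:
    """Positions a CP rank owns under simple contiguous sequence sharding.

    Without passing position_ids into the context-parallel buffers, each rank
    regenerates [0 .. seq_len/cp_size - 1] locally, so RoPE rotates every
    shard as if it started at position 0, which is wrong for every rank > 0.
    """
    chunk = seq_len // cp_size
    start = cp_rank * chunk
    return list(range(start, start + chunk))
```

With seq_len=8 and cp_size=2, rank 1 should see positions [4, 5, 6, 7], not the auto-generated [0, 1, 2, 3].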
Replaces all-gather with explicit send/recv between rank 0 and rank 1, matching CP ring attention's actual communication pattern. Measures 6.6 GB/s unidirectional on PCIe Gen 3 x8 (vs. 3.2 GB/s all-gather and 4.0 GB/s all-reduce).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
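A hedged sketch of that send/recv measurement pattern: the two-rank loop requires torchrun, the function names are hypothetical, and `dist.send`/`dist.recv` are PyTorch's blocking point-to-point primitives. Only the bandwidth arithmetic runs standalone.

```python
import time


def measured_bandwidth_gbs(num_bytes: int, elapsed_s: float) -> float:
    """Unidirectional bandwidth in GB/s (decimal, as PCIe figures are quoted)."""
    return num_bytes / elapsed_s / 1e9


def run_p2p(rank: int, tensor, iters: int = 20) -> float:
    """Time rank 0 -> rank 1 transfers; launch with torchrun --nproc-per-node=2."""
    import torch.distributed as dist

    dist.barrier()  # align both ranks before timing starts
    t0 = time.perf_counter()
    for _ in range(iters):
        if rank == 0:
            dist.send(tensor, dst=1)  # rank 0 pushes the payload
        else:
            dist.recv(tensor, src=0)  # rank 1 receives it
    dist.barrier()  # wait until the last transfer has landed
    elapsed = time.perf_counter() - t0
    total = tensor.numel() * tensor.element_size() * iters
    return measured_bandwidth_gbs(total, elapsed)
```

Unlike all-reduce or all-gather, this measures a single directed link, which is why it reports the higher 6.6 GB/s figure quoted above.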
Adds subcommands for using the MFU utilities outside of full training scripts:

- gpu-info: print the GPU name, detected peak TFLOPS, and the known-GPU table
- flops: compute FLOPs from a model config (README + first-principles)
- cp-comm: estimate CP ring attention communication volume
- bandwidth: measure unidirectional P2P bandwidth via send/recv (torchrun)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
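As a rough illustration of the kind of first-principles estimate a flops subcommand can make, here is the widely used 6·N·T rule plus a quadratic attention term. The formula and function name are assumptions for illustration, not the script's actual implementation:

```python
def estimate_train_flops(num_params: int, tokens: int,
                         num_layers: int, seq_len: int, hidden: int) -> int:
    """Training-FLOPs estimate: 6*N per token for the dense matmuls
    (forward + backward), plus 12*L*hidden*seq_len per token for the
    QK^T and attention-weighted-value matmuls."""
    dense = 6 * num_params * tokens
    attn = 12 * num_layers * hidden * seq_len * tokens
    return dense + attn
```

Dividing such an estimate by measured step time and the GPU's peak FLOPS gives the MFU figure these utilities report.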
Description
Usage
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline. For more details, see CONTRIBUTING.
Note
By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources. Once authorized, commits will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123).
An /ok to test comment on the pull request will trigger CI. This will need to be done for each new commit.

Triggering CodeRabbit AI Review
To trigger a code review from CodeRabbit, comment on the pull request with one of these commands:
See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.
Pre-submit Checklist