Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
pstjohn reviewed on Apr 4, 2026:
```python
if rank == 0:
    print("Counting FLOPs with HF model (meta device)...")
    hf_config_meta = LlamaConfig.from_pretrained(args.config_path)
    hf_config_meta._attn_implementation = "eager"
```
This is going to be the slowest possible attention, can you use fa2 or sdpa?
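A minimal sketch of the suggested fix, assuming the `_attn_implementation` attribute from the quoted snippet; `"sdpa"` and `"flash_attention_2"` are the standard Hugging Face implementation identifiers, and the helper name here is hypothetical:

```python
def pick_attn_implementation(flash_available: bool) -> str:
    """Choose a faster HF attention backend than "eager".

    FlashAttention-2 is typically fastest when the flash-attn package is
    installed; SDPA is a solid default that ships with PyTorch >= 2.0.
    The returned string would be assigned to config._attn_implementation.
    """
    return "flash_attention_2" if flash_available else "sdpa"
```

For FLOPs counting on the meta device the backend choice does not change the count itself, which is why switching away from eager is safe.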
- Add compare_mfu_validate.py: golden-value tests comparing CP vs. non-CP execution for both TE and HF models, validating loss, logits (cosine similarity > 0.99), and gradients (cosine similarity > 0.8), following the pattern from models/llama3/tests/test_cp_bshd.py
- Fix silent RoPE bug in HF CP: position_ids were not passed to the context_parallel buffers, so each rank auto-generated [0..S/CP-1] instead of the correct global positions
- Switch FLOPs counting from eager to SDPA attention for consistency with the actual training implementation (identical counts on the meta device)
- Add exact FLOPs output (full integers with commas) alongside the abbreviated values in both single-GPU and multi-GPU scripts
- Switch the bandwidth measurement from all-reduce to all-gather for a more accurate pure data-movement measurement matching CP ring attention
- Add position_ids and max_length_q/k support to measure_step_time

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
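The RoPE fix described above can be illustrated with a small sketch of the global positions each CP rank should see. Contiguous sharding is an illustrative assumption (load-balanced ring-attention splits differ), and the helper name is hypothetical:

```python
def global_position_ids(seq_len: int, cp_size: int, cp_rank: int) -> list[int]:
    """Positions a CP rank owns under simple contiguous sequence sharding.

    Without passing position_ids into the context-parallel buffers, each rank
    regenerates [0 .. seq_len/cp_size - 1] locally, so RoPE rotates every
    shard as if it started at position 0, which is wrong for every rank > 0.
    """
    chunk = seq_len // cp_size
    start = cp_rank * chunk
    return list(range(start, start + chunk))
```

With seq_len=8 and cp_size=2, rank 1 should see positions [4, 5, 6, 7], not the auto-generated [0, 1, 2, 3].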
Replaces all-gather with explicit send/recv between rank 0 and rank 1, matching CP ring attention's actual communication pattern. Measures 6.6 GB/s unidirectional on PCIe Gen 3 x8 (vs. 3.2 GB/s all-gather and 4.0 GB/s all-reduce).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
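A hedged sketch of that send/recv measurement pattern: the two-rank loop requires torchrun, the function names are hypothetical, and `dist.send`/`dist.recv` are PyTorch's blocking point-to-point primitives. Only the bandwidth arithmetic runs standalone.

```python
import time


def measured_bandwidth_gbs(num_bytes: int, elapsed_s: float) -> float:
    """Unidirectional bandwidth in GB/s (decimal, as PCIe figures are quoted)."""
    return num_bytes / elapsed_s / 1e9


def run_p2p(rank: int, tensor, iters: int = 20) -> float:
    """Time rank 0 -> rank 1 transfers; launch with torchrun --nproc-per-node=2."""
    import torch.distributed as dist

    dist.barrier()  # align both ranks before timing starts
    t0 = time.perf_counter()
    for _ in range(iters):
        if rank == 0:
            dist.send(tensor, dst=1)  # rank 0 pushes the payload
        else:
            dist.recv(tensor, src=0)  # rank 1 receives it
    dist.barrier()  # wait until the last transfer has landed
    elapsed = time.perf_counter() - t0
    total = tensor.numel() * tensor.element_size() * iters
    return measured_bandwidth_gbs(total, elapsed)
```

Unlike all-reduce or all-gather, this measures a single directed link, which is why it reports the higher 6.6 GB/s figure quoted above.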
Adds subcommands for using the MFU utilities outside of full training scripts:

- gpu-info: print the GPU name, detected peak TFLOPS, and the known-GPU table
- flops: compute FLOPs from a model config (README + first-principles)
- cp-comm: estimate CP ring attention communication volume
- bandwidth: measure unidirectional P2P bandwidth via send/recv (torchrun)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
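As a rough illustration of the kind of first-principles estimate a flops subcommand can make, here is the widely used 6·N·T rule plus a quadratic attention term. The formula and function name are assumptions for illustration, not the script's actual implementation:

```python
def estimate_train_flops(num_params: int, tokens: int,
                         num_layers: int, seq_len: int, hidden: int) -> int:
    """Training-FLOPs estimate: 6*N per token for the dense matmuls
    (forward + backward), plus 12*L*hidden*seq_len per token for the
    QK^T and attention-weighted-value matmuls."""
    dense = 6 * num_params * tokens
    attn = 12 * num_layers * hidden * seq_len * tokens
    return dense + attn
```

Dividing such an estimate by measured step time and the GPU's peak FLOPS gives the MFU figure these utilities report.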
Description
Usage
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline. For more details, see CONTRIBUTING.
Note
By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources. Once authorized, commits will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123).
An /ok to test comment on the pull request will trigger CI. This will need to be done for each new commit.

Triggering CodeRabbit AI Review
To trigger a code review from CodeRabbit, comment on the pull request with one of these commands:
See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.
Pre-submit Checklist