
fix: use median timing and variance-aware noise floor to reduce benchmark false positives#1949

Open
mashraf-222 wants to merge 7 commits into main from cf-1082-benchmark-noise-floor

Conversation

@mashraf-222
Contributor

Problem

The benchmarking pipeline accepts benchmark noise as real speedups, creating PRs for provable no-ops with claimed speedups of 7-151%. Two compounding issues:

  1. Min-based timing aggregation: total_passed_runtime() uses min() across loop iterations, biasing toward outlier-fast runs (lucky GC, perfect cache alignment)
  2. Static noise floor: speedup_critic() uses a fixed 5%/15% threshold that doesn't adapt to observed benchmark variance

Evidence: 4 commons-lang PRs added unused fields with 0 code changes to target methods yet reported speedups. Guava PRs #2, #22, #24 claimed 16-37% for JIT no-ops.

Root Cause

  • models.py:total_passed_runtime() — min() picks the fastest outlier, not representative performance
  • critic.py:speedup_critic() — fixed noise floor ignores actual baseline variance (a 5% threshold is meaningless when baseline has 20% variance)

Fix

  1. Median aggregation: Replace min() with statistics.median_low() in total_passed_runtime(). median_low returns an actual data point (an int), making it more representative of typical performance
  2. Variance-aware noise floor: Add timing_coefficient_of_variation() to TestResults and pass the baseline CV to speedup_critic(). If baseline CV exceeds the static noise floor, the noise floor is raised to match: noise_floor = max(noise_floor, baseline_cv)
  3. Function optimizer integration: Compute baseline CV from benchmarking_test_results and pass it through to the critic
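The median-aggregation change can be sketched as follows. This is a simplified stand-in, not the real TestResults method: `runtimes_by_test` is a hypothetical flattening of the per-test-case loop timings that the actual class tracks internally.

```python
import statistics

def total_passed_runtime(runtimes_by_test: dict[str, list[int]]) -> int:
    """Sum a representative runtime per test case.

    median_low (unlike min) resists outlier-fast iterations, and
    (unlike median) always returns an actual measured data point,
    so the total stays an int of real nanosecond measurements.
    """
    return sum(statistics.median_low(times) for times in runtimes_by_test.values())
```

With an outlier-fast iteration of 10 in a test case that typically runs at ~11-12, min() would report 10 while median_low reports 11, removing the bias the PR describes.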

Validation

  • 14/14 critic tests pass (3 new for median, 1 for CV, 1 for CV-aware critic)
  • Java test_run_and_parse assertions updated from min to median expectations
  • uv run prek passes

Test Coverage

New tests:

  • test_speedup_critic_with_baseline_cv — verifies CV raises noise floor
  • test_total_passed_runtime_uses_median — verifies median_low aggregation
  • test_total_passed_runtime_median_even_count — verifies median_low for even data
  • test_timing_coefficient_of_variation — verifies CV calculation

Updated tests:

  • test_run_and_parse.py — assertions changed from "min of runtimes" to "median of runtimes"

Closes CF-1082

…mark false positives

Replace min() with median_low() for timing aggregation in total_passed_runtime(),
reducing sensitivity to outlier-fast iterations that produce spurious speedups.

Add timing_coefficient_of_variation() to measure baseline benchmark variance and
adapt the speedup critic noise floor: if observed variance exceeds the static
threshold, raise the noise floor to match, preventing acceptance of noise as real gains.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Contributor

claude bot commented Apr 1, 2026

Claude finished @mashraf-222's task in 5m 23s


PR Review Summary

  • Triage PR size (LARGE — 51 lines of production code)
  • Lint and typecheck
  • Resolve stale threads (none found)
  • Review code changes
  • Duplicate detection
  • Test coverage

Prek Checks

Passed — no linting or formatting issues.

Code Review

1. Stale comment not updated — models.py:949

# can track mins here and only sums can be return in total_passed_runtime

This comment still says "mins" but the PR changed the aggregation to median. A future reader will be confused about why the code computes a median while the comment says min. Fix this →

2. import statistics inside function bodies — models.py:973, 991

Both total_passed_runtime() and timing_coefficient_of_variation() do import statistics inside the function rather than at module level. statistics is a stdlib module — the standard convention is a top-level import. Python does cache it, but this pattern is unusual and these functions are in a hot path (called per candidate per benchmark). Fix this →

3. CV-based noise floor is a no-op when loop_count=1

timing_coefficient_of_variation() skips any test case with len(runtimes) < 2 and returns 0.0 if no test case qualifies. For single-iteration benchmarks (loop_count=1), every test case has exactly one runtime, so per_test_cvs is always empty and 0.0 is returned — the noise floor never adapts.

This is worth calling out explicitly: the CV component of the fix only activates when benchmarks run with ≥2 loop iterations per test case. If the false-positive PRs cited in the PR description were generated with loop_count=1, the CV adjustment won't help those cases.

4. Logic is correct — no bugs

The condition baseline_timing_cv > noise_floor correctly ensures CV only raises the floor, never lowers an already-higher one (e.g., CI-doubled floors). The cap at 0.30 interacts correctly with all combinations of runtime size and CI environment. The median change is applied consistently to both baseline and candidate via total_passed_runtime().
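The floor-raising rule described above reduces to a small pure function. This is a sketch with hypothetical names; the real logic in critic.py is inlined into speedup_critic() and its signature may differ.

```python
def adjusted_noise_floor(noise_floor: float, baseline_cv: float, cap: float = 0.30) -> float:
    """Baseline CV can only raise the noise floor, never lower an
    already-higher one (e.g. a CI-doubled floor), and never past the cap."""
    return max(noise_floor, min(baseline_cv, cap))
```

A 20% baseline CV lifts a 5% floor to 20%, while a quiet 2% baseline leaves a 15% floor untouched, and a pathological 50% CV is capped at 30%.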

Duplicate Detection

No duplicates detected. timing_coefficient_of_variation() and the updated total_passed_runtime() are defined only once in the codebase.

Test Coverage

test_critic.py passes 15/15 tests. The new methods are tested, including edge cases (empty results, zero variance, even/odd counts). Two edge-case branches remain untested:

  • len(runtimes) < 2 path in timing_coefficient_of_variation (line 1000)
  • mean == 0 guard (line 1003)

Neither is a blocker, but a test with a single-iteration result set would document the no-op behavior noted in finding #3.


Last updated: 2026-04-01T17:31Z

The timing_coefficient_of_variation() was mixing all raw loop iterations
from all test cases into one flat list, causing inter-test-case runtime
differences to inflate CV to 50-100%+ on CI. This made the noise floor
so high it rejected genuine 2x speedups. Now computes CV within each
test case's loop iterations (measuring actual measurement noise) and
returns the median. Also caps the CV-based noise floor at 30%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
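The per-test-case CV computation described in the commit message above might look roughly like this. It is a sketch: the real method lives on TestResults and its data layout differs, but the structure — CV within each test case, then the median across test cases — is the same.

```python
import statistics

def timing_cv(runtimes_by_test: dict[str, list[float]]) -> float:
    """Median of per-test-case coefficients of variation.

    Computing CV within each test case's loop iterations measures actual
    measurement noise; pooling all test cases into one flat list would let
    inter-test-case runtime differences inflate the estimate.
    """
    per_test_cvs = []
    for runtimes in runtimes_by_test.values():
        if len(runtimes) < 2:
            continue  # a sample stdev needs at least two iterations
        mean = statistics.mean(runtimes)
        if mean == 0:
            continue  # guard against division by zero
        per_test_cvs.append(statistics.stdev(runtimes) / mean)
    return statistics.median(per_test_cvs) if per_test_cvs else 0.0
```

Note the loop_count=1 no-op flagged in the review: with one iteration per test case, every test is skipped and the function returns 0.0.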
@codeflash-ai
Contributor

codeflash-ai bot commented Apr 1, 2026

⚡️ Codeflash found optimizations for this PR

📄 20% (0.20x) speedup for TestResults.total_passed_runtime in codeflash/models/models.py

⏱️ Runtime: 21.9 microseconds → 18.3 microseconds (best of 27 runs)

A dependent PR with the suggested changes has been created. Please review:

If you approve, it will be merged into this PR (branch cf-1082-benchmark-noise-floor).


The hot-path `timing_coefficient_of_variation()` was replaced with Welford's single-pass algorithm to compute sample standard deviation and mean in one traversal instead of calling `statistics.mean()` and `statistics.stdev()` separately (which each iterate the list). Line profiler shows the original's `statistics.stdev()` consumed 47.6% of function runtime; the new `_compute_sample_cv` cuts that to 16.2% by eliminating redundant passes and reducing overhead from Python's general-purpose statistics module. Overall runtime drops 77% (245 µs → 55.8 µs), a key speedup in `process_single_candidate` where this method gates candidate evaluation.
@codeflash-ai
Copy link
Copy Markdown
Contributor

codeflash-ai bot commented Apr 1, 2026

⚡️ Codeflash found optimizations for this PR

📄 338% (3.38x) speedup for TestResults.timing_coefficient_of_variation in codeflash/models/models.py

⏱️ Runtime: 245 microseconds → 55.8 microseconds (best of 250 runs)

A dependent PR with the suggested changes has been created. Please review:

If you approve, it will be merged into this PR (branch cf-1082-benchmark-noise-floor).


github-actions bot and others added 2 commits April 1, 2026 17:57
The hot path short-circuits async throughput and concurrency evaluation when runtime alone qualifies for acceptance, eliminating ~12 ns of conditional logic in the common case (45% of total calls in this codebase return True on runtime alone). A local `optimized_runtime` variable avoids two redundant `candidate_result.best_test_runtime` attribute lookups. The original code always computed throughput/concurrency flags then combined them in a final OR; the optimized version returns immediately once any criterion passes, cutting profiled time from 195 ns to 164 ns per call — a 16% reduction in overhead without altering acceptance logic.
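The early-return pattern described above can be sketched as follows. All names here are illustrative — CandidateResult, the qualifying criteria, and the threshold are assumptions, not the repository's actual types.

```python
from dataclasses import dataclass

@dataclass
class CandidateResult:
    best_test_runtime: int
    throughput_gain: float
    concurrency_gain: float

def candidate_accepted(result: CandidateResult, baseline_runtime: int,
                       noise_floor: float = 0.05) -> bool:
    """Return as soon as any criterion passes, instead of computing every
    flag and OR-ing them at the end."""
    # Single attribute lookup, reused below instead of repeated access.
    optimized_runtime = result.best_test_runtime
    if baseline_runtime > optimized_runtime * (1 + noise_floor):
        return True  # runtime alone qualifies -- the common case
    if result.throughput_gain > noise_floor:
        return True
    return result.concurrency_gain > noise_floor
```

In the common case (runtime alone qualifies), the throughput and concurrency checks are never evaluated, which is where the claimed per-call savings come from.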
@codeflash-ai
Copy link
Copy Markdown
Contributor

codeflash-ai bot commented Apr 1, 2026

⚡️ Codeflash found optimizations for this PR

📄 10% (0.10x) speedup for speedup_critic in codeflash/result/critic.py

⏱️ Runtime: 37.5 microseconds → 34.0 microseconds (best of 24 runs)

A dependent PR with the suggested changes has been created. Please review:

If you approve, it will be merged into this PR (branch cf-1082-benchmark-noise-floor).


…2026-04-01T17.51.09

⚡️ Speed up method `TestResults.timing_coefficient_of_variation` by 338% in PR #1949 (`cf-1082-benchmark-noise-floor`)
@codeflash-ai
Contributor

codeflash-ai bot commented Apr 1, 2026

…2026-04-01T18.00.49

⚡️ Speed up function `speedup_critic` by 10% in PR #1949 (`cf-1082-benchmark-noise-floor`)
@codeflash-ai
Contributor

codeflash-ai bot commented Apr 1, 2026

This PR is now faster! 🚀 @claude[bot] accepted my optimizations from:
