
fix: use median timing and variance-aware noise floor to reduce benchmark false positives#1949

Open
mashraf-222 wants to merge 7 commits into main from cf-1082-benchmark-noise-floor

Conversation

@mashraf-222
Contributor

Problem

The benchmarking pipeline accepts benchmark noise as real speedups, creating PRs for provable no-ops with claimed speedups of 7-151%. Two compounding issues:

  1. Min-based timing aggregation: total_passed_runtime() uses min() across loop iterations, biasing toward outlier-fast runs (lucky GC, perfect cache alignment)
  2. Static noise floor: speedup_critic() uses a fixed 5%/15% threshold that doesn't adapt to observed benchmark variance

Evidence: 4 commons-lang PRs added unused fields with 0 code changes to target methods yet reported speedups. Guava PRs #2, #22, #24 claimed 16-37% for JIT no-ops.

Root Cause

  • models.py:total_passed_runtime() — min() picks the fastest outlier, not representative performance
  • critic.py:speedup_critic() — fixed noise floor ignores actual baseline variance (a 5% threshold is meaningless when baseline has 20% variance)

Fix

  1. Median aggregation: Replace min() with statistics.median_low() in total_passed_runtime(). median_low returns an actual data point (an int), making it more representative of typical performance
  2. Variance-aware noise floor: Add timing_coefficient_of_variation() to TestResults and pass the baseline CV to speedup_critic(). If baseline CV exceeds the static noise floor, the noise floor is raised to match: noise_floor = max(noise_floor, baseline_cv)
  3. Function optimizer integration: Compute baseline CV from benchmarking_test_results and pass it through to the critic
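The median-aggregation change can be sketched as follows. This is a simplified stand-in, not the real TestResults method: `runtimes_by_test` is a hypothetical flattening of the per-test-case loop timings that the actual class tracks internally.

```python
import statistics

def total_passed_runtime(runtimes_by_test: dict[str, list[int]]) -> int:
    """Sum a representative runtime per test case.

    median_low (unlike min) resists outlier-fast iterations, and
    (unlike median) always returns an actual measured data point,
    so the total stays an int of real nanosecond measurements.
    """
    return sum(statistics.median_low(times) for times in runtimes_by_test.values())
```

With an outlier-fast iteration of 10 in a test case that typically runs at ~11-12, min() would report 10 while median_low reports 11, removing the bias the PR describes.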

Validation

  • 14/14 critic tests pass (3 new for median, 1 for CV, 1 for CV-aware critic)
  • Java test_run_and_parse assertions updated from min to median expectations
  • uv run prek passes

Test Coverage

New tests:

  • test_speedup_critic_with_baseline_cv — verifies CV raises noise floor
  • test_total_passed_runtime_uses_median — verifies median_low aggregation
  • test_total_passed_runtime_median_even_count — verifies median_low for even data
  • test_timing_coefficient_of_variation — verifies CV calculation

Updated tests:

  • test_run_and_parse.py — assertions changed from "min of runtimes" to "median of runtimes"

Closes CF-1082

…mark false positives

Replace min() with median_low() for timing aggregation in total_passed_runtime(),
reducing sensitivity to outlier-fast iterations that produce spurious speedups.

Add timing_coefficient_of_variation() to measure baseline benchmark variance and
adapt the speedup critic noise floor: if observed variance exceeds the static
threshold, raise the noise floor to match, preventing acceptance of noise as real gains.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Contributor

claude bot commented Apr 1, 2026

Claude finished @mashraf-222's task in 5m 23s


PR Review Summary

  • Triage PR size (LARGE — 51 lines of production code)
  • Lint and typecheck
  • Resolve stale threads (none found)
  • Review code changes
  • Duplicate detection
  • Test coverage

Prek Checks

Passed — no linting or formatting issues.

Code Review

1. Stale comment not updated — models.py:949

# can track mins here and only sums can be return in total_passed_runtime

This comment still says "mins" but the PR changed the aggregation to median. A future reader will be confused about why the code computes a median while the comment says min. Fix this →

2. import statistics inside function bodies — models.py:973, 991

Both total_passed_runtime() and timing_coefficient_of_variation() do import statistics inside the function rather than at module level. statistics is a stdlib module — the standard convention is a top-level import. Python does cache it, but this pattern is unusual and these functions are in a hot path (called per candidate per benchmark). Fix this →

3. CV-based noise floor is a no-op when loop_count=1

timing_coefficient_of_variation() skips any test case with len(runtimes) < 2 and returns 0.0 if no test case qualifies. For single-iteration benchmarks (loop_count=1), every test case has exactly one runtime, so per_test_cvs is always empty and 0.0 is returned — the noise floor never adapts.

This is worth calling out explicitly: the CV component of the fix only activates when benchmarks run with ≥2 loop iterations per test case. If the false-positive PRs cited in the PR description were generated with loop_count=1, the CV adjustment won't help those cases.

4. Logic is correct — no bugs

The condition baseline_timing_cv > noise_floor correctly ensures CV only raises the floor, never lowers an already-higher one (e.g., CI-doubled floors). The cap at 0.30 interacts correctly with all combinations of runtime size and CI environment. The median change is applied consistently to both baseline and candidate via total_passed_runtime().
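The floor-raising rule described above reduces to a small pure function. This is a sketch with hypothetical names; the real logic in critic.py is inlined into speedup_critic() and its signature may differ.

```python
def adjusted_noise_floor(noise_floor: float, baseline_cv: float, cap: float = 0.30) -> float:
    """Baseline CV can only raise the noise floor, never lower an
    already-higher one (e.g. a CI-doubled floor), and never past the cap."""
    return max(noise_floor, min(baseline_cv, cap))
```

A 20% baseline CV lifts a 5% floor to 20%, while a quiet 2% baseline leaves a 15% floor untouched, and a pathological 50% CV is capped at 30%.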

Duplicate Detection

No duplicates detected. timing_coefficient_of_variation() and the updated total_passed_runtime() are defined only once in the codebase.

Test Coverage

test_critic.py passes 15/15 tests. The new methods are tested, including edge cases (empty results, zero variance, even/odd counts). Two edge-case branches remain untested:

  • len(runtimes) < 2 path in timing_coefficient_of_variation (line 1000)
  • mean == 0 guard (line 1003)

Neither is a blocker, but a test with a single-iteration result set would document the no-op behavior noted in finding #3.


Last updated: 2026-04-01T17:31Z

The timing_coefficient_of_variation() was mixing all raw loop iterations
from all test cases into one flat list, causing inter-test-case runtime
differences to inflate CV to 50-100%+ on CI. This made the noise floor
so high it rejected genuine 2x speedups. Now computes CV within each
test case's loop iterations (measuring actual measurement noise) and
returns the median. Also caps the CV-based noise floor at 30%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
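The per-test-case CV computation described in the commit message above might look roughly like this. It is a sketch: the real method lives on TestResults and its data layout differs, but the structure — CV within each test case, then the median across test cases — is the same.

```python
import statistics

def timing_cv(runtimes_by_test: dict[str, list[float]]) -> float:
    """Median of per-test-case coefficients of variation.

    Computing CV within each test case's loop iterations measures actual
    measurement noise; pooling all test cases into one flat list would let
    inter-test-case runtime differences inflate the estimate.
    """
    per_test_cvs = []
    for runtimes in runtimes_by_test.values():
        if len(runtimes) < 2:
            continue  # a sample stdev needs at least two iterations
        mean = statistics.mean(runtimes)
        if mean == 0:
            continue  # guard against division by zero
        per_test_cvs.append(statistics.stdev(runtimes) / mean)
    return statistics.median(per_test_cvs) if per_test_cvs else 0.0
```

Note the loop_count=1 no-op flagged in the review: with one iteration per test case, every test is skipped and the function returns 0.0.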
@codeflash-ai
Contributor

codeflash-ai bot commented Apr 1, 2026

⚡️ Codeflash found optimizations for this PR

📄 20% (0.20x) speedup for TestResults.total_passed_runtime in codeflash/models/models.py

⏱️ Runtime: 21.9 microseconds → 18.3 microseconds (best of 27 runs)

A dependent PR with the suggested changes has been created. Please review:

If you approve, it will be merged into this PR (branch cf-1082-benchmark-noise-floor).


The hot-path `timing_coefficient_of_variation()` was replaced with Welford's single-pass algorithm to compute sample standard deviation and mean in one traversal instead of calling `statistics.mean()` and `statistics.stdev()` separately (which each iterate the list). Line profiler shows the original's `statistics.stdev()` consumed 47.6% of function runtime; the new `_compute_sample_cv` cuts that to 16.2% by eliminating redundant passes and reducing overhead from Python's general-purpose statistics module. Overall runtime drops 77% (245 µs → 55.8 µs), a key speedup in `process_single_candidate` where this method gates candidate evaluation.
@codeflash-ai
Copy link
Copy Markdown
Contributor

codeflash-ai bot commented Apr 1, 2026

⚡️ Codeflash found optimizations for this PR

📄 338% (3.38x) speedup for TestResults.timing_coefficient_of_variation in codeflash/models/models.py

⏱️ Runtime: 245 microseconds → 55.8 microseconds (best of 250 runs)

A dependent PR with the suggested changes has been created. Please review:

If you approve, it will be merged into this PR (branch cf-1082-benchmark-noise-floor).


github-actions bot and others added 2 commits April 1, 2026 17:57
The hot path short-circuits async throughput and concurrency evaluation when runtime alone qualifies for acceptance, eliminating ~12 ns of conditional logic in the common case (45% of total calls in this codebase return True on runtime alone). A local `optimized_runtime` variable avoids two redundant `candidate_result.best_test_runtime` attribute lookups. The original code always computed throughput/concurrency flags then combined them in a final OR; the optimized version returns immediately once any criterion passes, cutting profiled time from 195 ns to 164 ns per call — a 16% reduction in overhead without altering acceptance logic.
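The early-return pattern described above can be sketched as follows. All names here are illustrative — CandidateResult, the qualifying criteria, and the threshold are assumptions, not the repository's actual types.

```python
from dataclasses import dataclass

@dataclass
class CandidateResult:
    best_test_runtime: int
    throughput_gain: float
    concurrency_gain: float

def candidate_accepted(result: CandidateResult, baseline_runtime: int,
                       noise_floor: float = 0.05) -> bool:
    """Return as soon as any criterion passes, instead of computing every
    flag and OR-ing them at the end."""
    # Single attribute lookup, reused below instead of repeated access.
    optimized_runtime = result.best_test_runtime
    if baseline_runtime > optimized_runtime * (1 + noise_floor):
        return True  # runtime alone qualifies -- the common case
    if result.throughput_gain > noise_floor:
        return True
    return result.concurrency_gain > noise_floor
```

In the common case (runtime alone qualifies), the throughput and concurrency checks are never evaluated, which is where the claimed per-call savings come from.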
@codeflash-ai
Copy link
Copy Markdown
Contributor

codeflash-ai bot commented Apr 1, 2026

⚡️ Codeflash found optimizations for this PR

📄 10% (0.10x) speedup for speedup_critic in codeflash/result/critic.py

⏱️ Runtime: 37.5 microseconds → 34.0 microseconds (best of 24 runs)

A dependent PR with the suggested changes has been created. Please review:

If you approve, it will be merged into this PR (branch cf-1082-benchmark-noise-floor).


…2026-04-01T17.51.09

⚡️ Speed up method `TestResults.timing_coefficient_of_variation` by 338% in PR #1949 (`cf-1082-benchmark-noise-floor`)
@codeflash-ai
Contributor

codeflash-ai bot commented Apr 1, 2026

…2026-04-01T18.00.49

⚡️ Speed up function `speedup_critic` by 10% in PR #1949 (`cf-1082-benchmark-noise-floor`)
@codeflash-ai
Contributor

codeflash-ai bot commented Apr 1, 2026

This PR is now faster! 🚀 @claude[bot] accepted my optimizations from:
