fix: use median timing and variance-aware noise floor to reduce benchmark false positives #1949

mashraf-222 wants to merge 7 commits into `main` from `cf-1082-benchmark-noise-floor`
Conversation
fix: use median timing and variance-aware noise floor to reduce benchmark false positives

Replace `min()` with `median_low()` for timing aggregation in `total_passed_runtime()`, reducing sensitivity to outlier-fast iterations that produce spurious speedups. Add `timing_coefficient_of_variation()` to measure baseline benchmark variance and adapt the speedup critic noise floor: if observed variance exceeds the static threshold, raise the noise floor to match, preventing acceptance of noise as real gains.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
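The two changes described in this commit message can be sketched roughly as follows. This is a minimal illustration of the idea, not the actual codeflash implementation; the function shapes and the flat-list input are assumptions.

```python
import statistics

def total_passed_runtime(timings: list[int]) -> int:
    # median_low() instead of min(): returns an actual data point and is
    # far less sensitive to a single outlier-fast loop iteration.
    return statistics.median_low(timings)

def effective_noise_floor(static_floor: float, baseline_cv: float) -> float:
    # If observed baseline variance exceeds the static threshold, raise
    # the noise floor to match, so noise cannot be accepted as a gain.
    return max(static_floor, baseline_cv)
```

For example, with a 5% static floor and a baseline CV of 20%, the effective floor becomes 20%, so only speedups above the observed noise level are accepted.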
Claude finished @mashraf-222's task in 5m 23s

PR Review Summary

Prek Checks: Passed — no linting or formatting issues.

Code Review

1. Stale comment not updated — the comment `# can track mins here and only sums can be return` in `total_passed_runtime` still says "mins", but the PR changed the aggregation to median. A future reader will be confused about why the code computes a median while the comment says min.
2. Both
3. CV-based noise floor is a no-op when test cases have only one loop iteration. This is worth calling out explicitly: the CV component of the fix only activates when benchmarks run with ≥2 loop iterations per test case. If the false-positive PRs cited in the PR description were generated with
4. Logic is correct — no bugs. The condition

Duplicate Detection: No duplicates detected.

Test Coverage: Neither is a blocker, but a test with a single-iteration result set would document the no-op behavior noted in finding #3.

Last updated: 2026-04-01T17:31Z
`timing_coefficient_of_variation()` was mixing all raw loop iterations from all test cases into one flat list, causing inter-test-case runtime differences to inflate CV to 50-100%+ on CI. This made the noise floor so high it rejected genuine 2x speedups. Now computes CV within each test case's loop iterations (measuring actual measurement noise) and returns the median. Also caps the CV-based noise floor at 30%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
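A sketch of the corrected per-test-case computation. The function name, the `dict` input shape, and the 30% cap constant are illustrative assumptions; only the approach (CV within each test case, median across cases, capped floor) comes from the commit message above.

```python
import statistics

def timing_cv_per_test_case(iterations_by_case: dict[str, list[float]]) -> float:
    """Compute CV within each test case's loop iterations (pure
    measurement noise), then take the median across cases, instead of
    pooling all iterations into one flat list."""
    cvs = []
    for timings in iterations_by_case.values():
        if len(timings) < 2:
            continue  # CV is undefined for a single iteration
        mean = statistics.mean(timings)
        if mean > 0:
            cvs.append(statistics.stdev(timings) / mean)
    if not cvs:
        return 0.0
    return min(statistics.median(cvs), 0.30)  # cap the CV-based noise floor at 30%
```

Two test cases with very different absolute runtimes but tight per-case timings now yield a small CV, whereas pooling them into one list would report huge inter-case "variance".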
⚡️ Codeflash found optimizations for this PR — 📄 20% (0.20x) speedup for
The hot-path `timing_coefficient_of_variation()` was replaced with Welford's single-pass algorithm to compute sample standard deviation and mean in one traversal instead of calling `statistics.mean()` and `statistics.stdev()` separately (which each iterate the list). Line profiler shows the original's `statistics.stdev()` consumed 47.6% of function runtime; the new `_compute_sample_cv` cuts that to 16.2% by eliminating redundant passes and reducing overhead from Python's general-purpose statistics module. Overall runtime drops 77% (245 µs → 55.8 µs), a key speedup in `process_single_candidate` where this method gates candidate evaluation.
⚡️ Codeflash found optimizations for this PR — 📄 338% (3.38x) speedup for
The hot path short-circuits async throughput and concurrency evaluation when runtime alone qualifies for acceptance, eliminating ~12 ns of conditional logic in the common case (45% of total calls in this codebase return True on runtime alone). A local `optimized_runtime` variable avoids two redundant `candidate_result.best_test_runtime` attribute lookups. The original code always computed throughput/concurrency flags then combined them in a final OR; the optimized version returns immediately once any criterion passes, cutting profiled time from 195 ns to 164 ns per call — a 16% reduction in overhead without altering acceptance logic.
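The short-circuit pattern described above can be sketched as follows. All parameter names and the acceptance criteria are hypothetical stand-ins; only the structure (return as soon as runtime alone qualifies, evaluate secondary criteria lazily) reflects the optimization.

```python
def accepts(original_runtime: float, optimized_runtime: float,
            noise_floor: float,
            throughput_gain: float = 0.0,
            concurrency_gain: float = 0.0) -> bool:
    # Common case: runtime alone clears the noise floor, so return
    # immediately without evaluating throughput/concurrency at all.
    if optimized_runtime > 0:
        speedup = original_runtime / optimized_runtime - 1.0
        if speedup > noise_floor:
            return True
    # Secondary criteria are only computed/compared when needed.
    return throughput_gain > noise_floor or concurrency_gain > noise_floor
```

Compare with the original shape, which always computed every flag and OR-ed them at the end; both versions accept exactly the same inputs.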
⚡️ Codeflash found optimizations for this PR — 📄 10% (0.10x) speedup for
…2026-04-01T17.51.09 ⚡️ Speed up method `TestResults.timing_coefficient_of_variation` by 338% in PR #1949 (`cf-1082-benchmark-noise-floor`)
This PR is now faster! 🚀 @claude[bot] accepted my optimizations from:
…2026-04-01T18.00.49 ⚡️ Speed up function `speedup_critic` by 10% in PR #1949 (`cf-1082-benchmark-noise-floor`)
This PR is now faster! 🚀 @claude[bot] accepted my optimizations from:
Problem
The benchmarking pipeline accepts benchmark noise as real speedups, creating PRs for provable no-ops with claimed speedups of 7-151%. Two compounding issues:

- `total_passed_runtime()` uses `min()` across loop iterations, biasing toward outlier-fast runs (lucky GC, perfect cache alignment)
- `speedup_critic()` uses a fixed 5%/15% threshold that doesn't adapt to observed benchmark variance

Evidence: 4 commons-lang PRs added unused fields with 0 code changes to target methods yet reported speedups. Guava PRs #2, #22, #24 claimed 16-37% for JIT no-ops.
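The `min()` bias is easy to demonstrate with made-up numbers (these timings are illustrative, not from the PR): a single lucky outlier-fast iteration is enough for `min()` to report a large spurious speedup, while `median_low()` stays on a representative data point.

```python
import statistics

# Five loop iterations each; the candidate has no real code change,
# just one outlier-fast iteration (warm cache, no GC pause).
baseline  = [100, 101, 99, 100, 100]
candidate = [100, 101, 60, 100, 100]  # 60 = lucky outlier

min_speedup = min(baseline) / min(candidate)  # 99/60 = 1.65x "speedup"
median_speedup = statistics.median_low(baseline) / statistics.median_low(candidate)  # 1.0x
```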
Root Cause
- `models.py:total_passed_runtime()` — `min()` picks the fastest outlier, not representative performance
- `critic.py:speedup_critic()` — fixed noise floor ignores actual baseline variance (a 5% threshold is meaningless when the baseline has 20% variance)

Fix
- Replace `min()` with `statistics.median_low()` in `total_passed_runtime()` — `median_low` returns an actual data point (int), more representative of typical performance
- Add `timing_coefficient_of_variation()` to `TestResults` and pass the baseline CV to `speedup_critic()`. If baseline CV exceeds the static noise floor, the noise floor is raised to match: `noise_floor = max(noise_floor, baseline_cv)`
- Compute the baseline CV from `benchmarking_test_results` and pass it through to the critic

Validation
- `uv run prek` passes

Test Coverage
New tests:
- `test_speedup_critic_with_baseline_cv` — verifies CV raises noise floor
- `test_total_passed_runtime_uses_median` — verifies median_low aggregation
- `test_total_passed_runtime_median_even_count` — verifies median_low for even data
- `test_timing_coefficient_of_variation` — verifies CV calculation

Updated tests:
- `test_run_and_parse.py` — assertions changed from "min of runtimes" to "median of runtimes"

Closes CF-1082