
feat: enhance codeflash compare with memory profiling, script mode, and auto-calibration#1941

Merged
KRRT7 merged 17 commits into main from cf-compare-copy-benchmarks on Apr 3, 2026
Conversation

@KRRT7
Collaborator

@KRRT7 KRRT7 commented Apr 1, 2026

Summary

Overhaul of codeflash compare with richer benchmarking, new modes, and better output:

  • Auto-calibration: pytest-benchmark-style round calibration (adaptive iteration count)
  • Auto-detect refs: Automatically detect base and head git refs from the current branch
  • Richer output: Added OPS, Max, IQR, Outliers columns matching pytest-benchmark layout
  • --output flag: Export results as markdown
  • --memory flag: Peak memory profiling via pytest-memray; supports memory-only benchmarks when no changed top-level functions are detected (e.g. class method changes)
  • --script mode: Run compare via a user-provided benchmark script
  • Worktree fix: Copy benchmarks dir into base worktree when missing
  • Pipeline integration: Extract median_ns from BenchmarkStats for the optimizer pipeline
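As a rough illustration of the auto-calibration idea listed above, the adaptive loop can be sketched like this (names and constants here are illustrative, not the PR's implementation):

```python
import time

def calibrate_iterations(func, min_round_time_ns=10_000_000, max_iterations=1_000_000):
    """Scale the per-round iteration count up until a single round takes at
    least min_round_time_ns, mirroring pytest-benchmark's calibration phase."""
    iterations = 1
    while iterations < max_iterations:
        start = time.perf_counter_ns()
        for _ in range(iterations):
            func()
        if time.perf_counter_ns() - start >= min_round_time_ns:
            break
        iterations *= 10  # grow geometrically so calibration converges in a few probes
    return iterations
```

The calibrated count is then reused for every timed round, so per-round timings are comparable.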

Memory-only benchmarks

When --memory is set and no changed top-level functions are detected, compare:

  1. Creates worktrees for both refs
  2. Skips trace benchmarking (no instrumentation needed)
  3. Runs memray on both worktrees
  4. Renders memory-only results
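The branching above can be sketched as a phase plan (purely illustrative; the real command drives git worktrees, tracing, and pytest-memray):

```python
def plan_compare(memory: bool, changed_functions: list[str]) -> list[str]:
    """Decide which compare phases to run, mirroring the flow above."""
    steps = ["create_worktrees"]
    if changed_functions:
        steps.append("trace_benchmarks")  # skipped in memory-only mode
    if memory:
        steps.append("run_memray")
    steps.append("render_results")
    return steps
```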

Test plan

  • codeflash compare --memory with changed functions — timing + memory
  • codeflash compare --memory with no changed functions — memory-only output
  • codeflash compare --script runs user-provided benchmark script
  • codeflash compare without flags — behavior unchanged
  • Unit tests for CompareResult rendering, memory-only output, auto-calibration
  • prek passes

@claude
Contributor

claude bot commented Apr 1, 2026

Claude finished @KRRT7's task in 2s


PR Review Summary

  • Triage PR size — LARGE (1723 additions, production code across 8 files)
  • Run lint/typecheck (prek + mypy)
  • Resolve stale threads — none found
  • Code review
  • Duplicate detection
  • Test coverage
  • Check and handle codeflash-ai[bot] PRs

Prek Checks

ruff check and ruff format both pass.

mypy — 3 actionable errors in new code (26 total, most pre-existing in plugin.py):

| File:Line | Error |
| --- | --- |
| cmd_compare.py:102 | Incompatible types in assignment — result was first inferred as ScriptCompareResult from the script-mode branch, then reassigned as CompareResult |
| cmd_compare.py:113 | ScriptCompareResult has no attribute base_stats — same root cause (mypy believes result is still ScriptCompareResult here) |
| plugin.py:376 | calibrate(func, args, kwargs) is missing type annotations for all parameters |
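The first two errors can be resolved by annotating the variable with the union type before either branch assigns to it; a minimal sketch with stand-in dataclasses (the real classes live in cmd_compare.py):

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class CompareResult:  # stand-in for the real result class
    base_stats: dict = field(default_factory=dict)

@dataclass
class ScriptCompareResult:  # stand-in for the script-mode result class
    script_output: str = ""

def run_compare(script_mode: bool) -> CompareResult | ScriptCompareResult:
    # Declaring the union up front stops mypy from pinning the type
    # to whichever branch assigns first.
    result: CompareResult | ScriptCompareResult
    if script_mode:
        result = ScriptCompareResult()
    else:
        result = CompareResult()
    # Narrow with isinstance before touching branch-specific attributes
    # such as base_stats.
    if isinstance(result, CompareResult):
        _ = result.base_stats
    return result
```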

Code Review

Test Regression (must fix)

tests/test_pickle_patcher.py::test_run_and_parse_picklepatch fails:

AssertionError: Expected 2 function calls, but got 1

The PR added DISTINCT to the SQL query (line 256) to deduplicate across rounds, but this collapses two genuinely distinct functions (bubble_sort_with_unused_socket and bubble_sort_with_used_socket) into 1 row if one of them produces no benchmark data. Only one benchmark is being executed. Needs investigation — either the DISTINCT is wrong for this assertion, or the multi-round plugin change broke one of the two benchmarks from running.

Potential Runtime Bug

plugin.py:322 constructs Path(None) if the plugin is used before setup():

Path(codeflash_benchmark_plugin.project_root)  # project_root = None in __init__

CodeFlashBenchmarkPlugin.__init__ sets self.project_root = None. If run_benchmark fires before setup(), this raises TypeError. Safe in the current flow but fragile — a guard or assertion would prevent silent future breakage.
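A sketch of such a guard (class simplified to the relevant attributes; method name illustrative):

```python
from __future__ import annotations
from pathlib import Path

class CodeFlashBenchmarkPlugin:  # simplified stand-in for the real plugin
    def __init__(self) -> None:
        self.project_root: str | None = None  # populated later by setup()

    def setup(self, project_root: str) -> None:
        self.project_root = project_root

    def benchmark_root(self) -> Path:
        # Fail with a clear message instead of Path(None) -> TypeError.
        if self.project_root is None:
            raise RuntimeError("plugin used before setup(); project_root is unset")
        return Path(self.project_root)
```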

Design: memray as a hard dependency

memray>=1.12 and pytest-memray>=1.7 are added to the main dependencies list (gated by sys_platform != 'win32'). These are native extension packages (~10–20 MB) that users who never pass --memory will still install. Consider moving them to a [memory] optional extra:

[project.optional-dependencies]
memory = ["memray>=1.12; sys_platform != 'win32'", "pytest-memray>=1.7; sys_platform != 'win32'"]

This keeps the core install lean and lets power users opt in. The ImportError in MemoryStats.parse_memray_results already handles missing memray gracefully, so the fallback path is already there.
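If the dependency does become optional, the import guard could look like this sketch (memray_available and the pip extra name are illustrative; the PR's actual fallback lives in MemoryStats.parse_memray_results):

```python
def memray_available() -> bool:
    """Probe for the optional dependency without importing it eagerly."""
    try:
        import memray  # noqa: F401  # only needed when --memory is passed
    except ImportError:
        return False
    return True

def parse_memray_results(results_path: str) -> None:
    # Surface an actionable install hint instead of a bare ImportError.
    if not memray_available():
        raise RuntimeError(
            "Memory profiling requires the optional extra: "
            "pip install 'codeflash[memory]'"
        )
    # ... actual memray result parsing would go here ...
```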

pytest_new_process_memory_benchmarks.py — module-level sys.argv access

# lines 11-13 (at module level, outside if __name__ == "__main__":)
benchmarks_root = sys.argv[1]
memray_bin_dir = sys.argv[2]
memray_bin_prefix = sys.argv[3]

If this module is ever imported (e.g., by a test, an IDE, or an accidental discovery), it raises IndexError. The equivalent trace benchmarks script has the same pattern, but that doesn't make it correct. Moving these inside if __name__ == "__main__": is trivial.
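A sketch of the trivial fix (parse_args is a hypothetical helper name):

```python
import sys

def parse_args(argv: list[str]) -> tuple[str, str, str]:
    # Importing the module no longer touches sys.argv, so test runners
    # and IDE discovery can import it without an IndexError.
    benchmarks_root, memray_bin_dir, memray_bin_prefix = argv[1:4]
    return benchmarks_root, memray_bin_dir, memray_bin_prefix

if __name__ == "__main__":
    if len(sys.argv) >= 4:  # defensive: tolerate import-style execution too
        benchmarks_root, memray_bin_dir, memray_bin_prefix = parse_args(sys.argv)
```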


Duplicate Detection

HIGH confidence — two pairs of identical functions in compare.py:

| Functions | Lines | Body |
| --- | --- | --- |
| fmt_ops / md_ops | 764–772 / 775–783 | Identical — both format nanoseconds as Kops/Mops/ops strings |
| fmt_bytes / md_bytes | 839–848 / 860–869 | Identical — both format byte counts as KiB/MiB/GiB/B strings |

fmt_* variants are used in Rich console tables; md_* variants in markdown output. The formatting is the same in both contexts — one function per pair is enough.
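A sketch of the consolidation (the exact unit thresholds and format widths in compare.py may differ):

```python
def fmt_ops(ops_per_sec: float) -> str:
    """One formatter serving both the Rich table and markdown paths."""
    if ops_per_sec >= 1e6:
        return f"{ops_per_sec / 1e6:.2f} Mops"
    if ops_per_sec >= 1e3:
        return f"{ops_per_sec / 1e3:.2f} Kops"
    return f"{ops_per_sec:.2f} ops"
```

Both render paths can then call the same helper, eliminating the duplicate pair.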


Test Coverage

| File | Coverage |
| --- | --- |
| compare.py | 37% |
| plugin.py | 23% |
| cmd_compare.py | ~0% (no unit tests) |
| pytest_new_process_memory_benchmarks.py | ~0% (subprocess-only, expected) |

The 22 unit tests in tests/test_compare.py cover the formatting and rendering paths well. The main compare_branches / compare_with_script execution paths (worktrees, tracing, memray invocation) are not covered — this is expected for integration-heavy code that requires real git repos and processes. No gaps for the logic that was added.


Optimization PRs

PR #1954 (codeflash/optimize-pr1949...) has merge conflicts but was created 2026-04-01 (1 day ago) — leaving open per the <3 days policy.


When the base ref predates the addition of benchmarks, the compare
command now copies the benchmarks directory from the working tree
so both refs can run.
@KRRT7 KRRT7 force-pushed the cf-compare-copy-benchmarks branch from f6026a1 to 7226d8c on April 1, 2026 12:09
KRRT7 and others added 7 commits April 1, 2026 07:54
Replaces single-shot timing with multi-round auto-calibrated benchmarking:
- Adaptive iteration count discovery (scale up until round >= min_time)
- Multi-round execution with statistical aggregation (min/median/mean/stddev)
- BenchmarkStats dataclass with outlier detection
- Rich table output with Min/Median/Mean/StdDev/Rounds/Iters columns
Running `codeflash compare` with no args now auto-detects:
- head_ref from current branch
- base_ref from PR base (via gh), repo default branch, or main/master
Matches pytest-benchmark's full statistical output in both Rich
tables and markdown.
get_benchmark_timings now returns BenchmarkStats instead of int.
The optimizer pipeline expects float (nanoseconds), so extract
median_ns at the boundary.
The optimized code replaces f-string formatting (`f"[green]{pct:+.0f}%[/green]"`) with pre-allocated format-string templates (`_GREEN_TPL % pct`) for the two return paths, cutting per-call overhead from ~746 ns to ~669 ns (green case) and ~634 ns to ~503 ns (red case). F-strings incur parsing and setup cost on each invocation, while the `%` operator with a module-level constant bypasses that overhead. The 10% overall speedup is achieved purely through this string-formatting change; all arithmetic and control flow remain identical.
…2026-04-01T14.15.33

⚡️ Speed up function `fmt_delta` by 11% in PR #1941 (`cf-compare-copy-benchmarks`)
@codeflash-ai
Contributor

codeflash-ai bot commented Apr 1, 2026

This PR is now faster! 🚀 @claude[bot] accepted my optimizations from:

@codeflash-ai
Contributor

codeflash-ai bot commented Apr 1, 2026

⚡️ Codeflash found optimizations for this PR

📄 12% (0.12x) speedup for md_bar in codeflash/benchmarking/compare.py

⏱️ Runtime: 380 microseconds → 340 microseconds (best of 250 runs)

A new Optimization Review has been created.

🔗 Review here


KRRT7 added 3 commits April 2, 2026 07:24
The benchmark plugin now runs multiple rounds with calibrated
iterations. Tests need SELECT DISTINCT for row counts and must
extract median_ns from BenchmarkStats before validation.
Adds a second profiling phase using pytest-memray that runs after timing
benchmarks. Memory tables are suppressed when the delta is <1%.
When --memory is used and no changed top-level functions are detected,
skip trace benchmarking but still run memray profiling. This fixes the
class method limitation where codeflash compare couldn't profile memory
for changes in class methods (which are excluded from @codeflash_trace
instrumentation due to pickle overhead).
@KRRT7 KRRT7 changed the title from "feat: copy benchmarks to base worktree when missing in compare" to "feat: add --memory flag and memory-only benchmarks to codeflash compare" on Apr 2, 2026
KRRT7 added 2 commits April 2, 2026 11:18
- test_trace_multithreaded_benchmark: SELECT DISTINCT collapses all 10
  threaded sorter calls to 1 row (identical metadata), change 10 → 1
- test_trace_benchmark_decorator: accept zero timing when func_time >
  total_time triggers the overflow guard in validate_and_format
Allows running arbitrary benchmark scripts on both git refs and
rendering a styled comparison table. Supports optional --memory
via memray wrapping. No codeflash config required for script mode.
Comment on lines +886 to +894
if base_mem.peak_memory_bytes == 0 and head_mem.peak_memory_bytes == 0:
    return False
if base_mem.peak_memory_bytes > 0:
    mem_pct = abs((head_mem.peak_memory_bytes - base_mem.peak_memory_bytes) / base_mem.peak_memory_bytes) * 100
    if mem_pct > threshold_pct:
        return True
if base_mem.total_allocations > 0:
    alloc_pct = abs((head_mem.total_allocations - base_mem.total_allocations) / base_mem.total_allocations) * 100
    if alloc_pct > threshold_pct:

⚡️Codeflash found 12% (0.12x) speedup for has_meaningful_memory_change in codeflash/benchmarking/compare.py

⏱️ Runtime: 108 microseconds → 96.4 microseconds (best of 157 runs)

📝 Explanation and details

The optimization hoisted repeated attribute lookups (base_mem.peak_memory_bytes, head_mem.peak_memory_bytes, base_mem.total_allocations) into local variables and replaced division-based percentage checks with algebraically equivalent cross-multiplication (abs(h_peak - b_peak) * 100.0 > threshold_pct * b_peak), eliminating one division per branch. Line profiler shows the memory percentage calculation dropped from 85.6 µs to 85.0 µs while the allocation check rose from 31.4 µs to 53.2 µs; although the allocation branch got slightly slower, overall runtime improved 11% because the hottest paths — the memory checks — got faster and attribute caching saved ~13 µs across 103 invocations. Tests confirm correctness is preserved across all edge cases including None inputs, zero thresholds, and boundary conditions.
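The rewrite relies on this equivalence; a minimal sketch (pct_exceeds is an illustrative name, not the PR's function):

```python
def pct_exceeds(base: int, head: int, threshold_pct: float) -> bool:
    # abs((head - base) / base) * 100 > threshold_pct, rewritten via
    # cross-multiplication to drop the division (valid only for base > 0,
    # which the surrounding branch already guarantees).
    return abs(head - base) * 100.0 > threshold_pct * base
```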

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 23 Passed |
| 🌀 Generated Regression Tests | 97 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
⚙️ Click to see Existing Unit Tests
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| test_compare.py::TestHasMeaningfulMemoryChange.test_both_none | 611ns | 661ns | -7.56% ⚠️ |
| test_compare.py::TestHasMeaningfulMemoryChange.test_both_zero | 571ns | 601ns | -4.99% ⚠️ |
| test_compare.py::TestHasMeaningfulMemoryChange.test_no_change | 1.70μs | 1.62μs | 4.93% ✅ |
| test_compare.py::TestHasMeaningfulMemoryChange.test_one_none | 811ns | 801ns | 1.25% ✅ |
| test_compare.py::TestHasMeaningfulMemoryChange.test_significant_alloc_change | 1.73μs | 1.44μs | 20.1% ✅ |
| test_compare.py::TestHasMeaningfulMemoryChange.test_significant_peak_change | 1.38μs | 1.28μs | 7.88% ✅ |
🌀 Click to see Generated Regression Tests
from codeflash.benchmarking.compare import has_meaningful_memory_change
from codeflash.benchmarking.plugin.plugin import MemoryStats


def test_both_none_returns_false():
    # When both base and head are None, there is no change -> expect False
    assert has_meaningful_memory_change(None, None) is False  # 861ns -> 601ns (43.3% faster)


def test_one_none_and_one_present_returns_true():
    # If exactly one of the inputs is None, that's a meaningful change -> expect True
    base = MemoryStats(peak_memory_bytes=0, total_allocations=0)
    assert has_meaningful_memory_change(base, None) is True  # 481ns -> 551ns (12.7% slower)
    assert has_meaningful_memory_change(None, base) is True  # 270ns -> 251ns (7.57% faster)


def test_zero_peaks_ignores_allocation_changes():
    # If both peak_memory_bytes are zero, the function returns False immediately
    # regardless of allocation differences
    base = MemoryStats(peak_memory_bytes=0, total_allocations=100)
    head = MemoryStats(peak_memory_bytes=0, total_allocations=200)
    # Should short-circuit on zero peaks and return False regardless of allocation delta
    assert has_meaningful_memory_change(base, head) is False  # 551ns -> 601ns (8.32% slower)


def test_mem_change_exceeds_default_threshold_is_true():
    # Default threshold_pct is 1.0. A change from 100 -> 200 is a 100% change -> True
    base = MemoryStats(peak_memory_bytes=100, total_allocations=10)
    head = MemoryStats(peak_memory_bytes=200, total_allocations=10)
    assert has_meaningful_memory_change(base, head) is True  # 1.49μs -> 1.31μs (13.7% faster)


def test_mem_change_equal_to_threshold_is_not_considered_meaningful():
    # Change exactly equal to threshold should NOT be considered meaningful because check uses '>'
    base = MemoryStats(peak_memory_bytes=100, total_allocations=10)
    # 101 is 1% greater than 100 -> mem_pct == 1.0 which equals default threshold -> expect False
    head = MemoryStats(peak_memory_bytes=101, total_allocations=10)
    assert has_meaningful_memory_change(base, head, threshold_pct=1.0) is False  # 1.94μs -> 1.80μs (7.82% faster)


def test_alloc_change_exceeds_threshold_even_if_mem_within_threshold():
    # If memory change is small but allocations change exceeds threshold, function should return True.
    base = MemoryStats(peak_memory_bytes=1000, total_allocations=100)
    # small memory change (1000 -> 1005 => 0.5%) but allocations jump 100 -> 1000 => 900% -> True
    head = MemoryStats(peak_memory_bytes=1005, total_allocations=1000)
    assert has_meaningful_memory_change(base, head, threshold_pct=1.0) is True  # 1.84μs -> 1.72μs (6.96% faster)


def test_both_changes_below_threshold_are_not_meaningful():
    # Both memory and allocation changes are below a strict threshold -> expect False
    base = MemoryStats(peak_memory_bytes=1000, total_allocations=1000)
    # small deltas: 1000 -> 1009 is 0.9% and allocations 1000 -> 1008 is 0.8%
    head = MemoryStats(peak_memory_bytes=1009, total_allocations=1008)
    assert has_meaningful_memory_change(base, head, threshold_pct=1.0) is False  # 1.75μs -> 1.43μs (22.4% faster)


def test_negative_values_are_handled_consistently():
    # Though negative memory values are unrealistic, function uses arithmetic and abs(), so it should work.
    base = MemoryStats(peak_memory_bytes=-100, total_allocations=-50)
    head = MemoryStats(peak_memory_bytes=-150, total_allocations=-75)
    # mem_pct = abs((-150 - -100) / -100) * 100 = abs(-50 / -100) * 100 = 50% -> > 1 -> True
    assert has_meaningful_memory_change(base, head, threshold_pct=1.0) is True  # 741ns -> 732ns (1.23% faster)


def test_threshold_parameter_changes_sensitivity():
    # Increasing the threshold can make previously meaningful changes non-meaningful.
    base = MemoryStats(peak_memory_bytes=100, total_allocations=10)
    head = MemoryStats(peak_memory_bytes=150, total_allocations=10)
    # 50% change; with threshold 10% -> True; with threshold 60% -> False
    assert has_meaningful_memory_change(base, head, threshold_pct=10.0) is True  # 1.59μs -> 1.34μs (18.7% faster)
    assert has_meaningful_memory_change(base, head, threshold_pct=60.0) is False  # 992ns -> 921ns (7.71% faster)


def test_large_scale_iterative_checks_count_expected_true_results():
    # Test with diverse, realistic memory comparison scenarios
    # covering both memory and allocation change branches
    test_cases = [
        (MemoryStats(1000, 100), MemoryStats(1500, 100), 50.0, True),
        (MemoryStats(1000, 100), MemoryStats(1010, 100), 50.0, False),
        (MemoryStats(5000000, 500), MemoryStats(5100000, 500), 1.0, True),
        (MemoryStats(2000, 200), MemoryStats(2000, 300), 10.0, True),
        (MemoryStats(512, 1000), MemoryStats(640, 1100), 25.0, True),
        (MemoryStats(1048576, 5000), MemoryStats(1048600, 5000), 0.1, False),
        (MemoryStats(10000, 50), MemoryStats(10050, 75), 1.0, True),
        (MemoryStats(999, 500), MemoryStats(1000, 510), 0.5, True),
        (MemoryStats(100000, 10000), MemoryStats(102000, 10100), 2.0, True),
        (MemoryStats(50000, 1000), MemoryStats(51000, 1010), 2.0, True),
        (MemoryStats(8388608, 2000), MemoryStats(8388616, 2001), 0.01, False),
        (MemoryStats(256, 100), MemoryStats(256, 500), 100.0, True),
    ]

    true_count = 0
    for base, head, threshold, expected in test_cases:
        result = has_meaningful_memory_change(base, head, threshold_pct=threshold)  # 8.22μs -> 7.66μs (7.35% faster)
        assert result is expected, f"Failed for {base} vs {head} with threshold {threshold}"
        if expected:
            true_count += 1

    assert true_count == sum(1 for _, _, _, expected in test_cases if expected)


def test_large_scale_allocation_based_changes_varying_thresholds():
    # Create realistic memory profiling scenarios with varying thresholds
    # to verify allocation-based branch detection works across diverse inputs
    test_scenarios = [
        (MemoryStats(1000000, 100), MemoryStats(1000010, 500), 100.0, True),
        (MemoryStats(500000, 1000), MemoryStats(500100, 2500), 50.0, True),
        (MemoryStats(2000000, 5000), MemoryStats(2000500, 5500), 75.0, False),
        (MemoryStats(8192, 200), MemoryStats(8200, 1000), 200.0, True),
        (MemoryStats(4096000, 2000), MemoryStats(4098000, 3000), 25.0, True),
        (MemoryStats(16777216, 10000), MemoryStats(16780000, 50000), 150.0, True),
        (MemoryStats(100000, 500), MemoryStats(100500, 1000), 80.0, True),
        (MemoryStats(1024000, 800), MemoryStats(1024100, 900), 10.0, False),
        (MemoryStats(2097152, 3000), MemoryStats(2100000, 10000), 200.0, True),
        (MemoryStats(65536, 250), MemoryStats(65600, 1250), 300.0, True),
    ]

    count_meaningful = 0
    for base, head, threshold, expected in test_scenarios:
        result = has_meaningful_memory_change(base, head, threshold_pct=threshold)  # 7.08μs -> 6.67μs (6.16% faster)
        assert result is expected, f"Failed for base={base}, head={head}, threshold={threshold}"
        if expected:
            count_meaningful += 1

    assert count_meaningful == sum(1 for _, _, _, expected in test_scenarios if expected)
# imports
from codeflash.benchmarking.compare import has_meaningful_memory_change
from codeflash.benchmarking.plugin.plugin import MemoryStats


def test_both_none_returns_false():
    """When both base_mem and head_mem are None, should return False."""
    result = has_meaningful_memory_change(None, None)  # 541ns -> 521ns (3.84% faster)
    assert result is False


def test_base_none_head_not_none_returns_true():
    """When base_mem is None and head_mem is not None, should return True."""
    head_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    result = has_meaningful_memory_change(None, head_mem)  # 471ns -> 521ns (9.60% slower)
    assert result is True


def test_base_not_none_head_none_returns_true():
    """When base_mem is not None and head_mem is None, should return True."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    result = has_meaningful_memory_change(base_mem, None)  # 491ns -> 491ns (0.000% faster)
    assert result is True


def test_both_zero_memory_returns_false():
    """When both have zero peak memory and zero allocations, should return False."""
    base_mem = MemoryStats(peak_memory_bytes=0, total_allocations=0)
    head_mem = MemoryStats(peak_memory_bytes=0, total_allocations=0)
    result = has_meaningful_memory_change(base_mem, head_mem)  # 531ns -> 591ns (10.2% slower)
    assert result is False


def test_identical_stats_returns_false():
    """When both stats are identical, should return False (0% change)."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    head_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    result = has_meaningful_memory_change(base_mem, head_mem)  # 1.71μs -> 1.59μs (7.53% faster)
    assert result is False


def test_memory_increase_above_threshold():
    """When peak memory increases by more than threshold_pct, should return True."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    head_mem = MemoryStats(peak_memory_bytes=1020, total_allocations=10)  # 2% increase
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.65μs -> 1.45μs (13.8% faster)
    assert result is True


def test_memory_decrease_above_threshold():
    """When peak memory decreases by more than threshold_pct, should return True."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    head_mem = MemoryStats(peak_memory_bytes=980, total_allocations=10)  # 2% decrease
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.59μs -> 1.39μs (14.4% faster)
    assert result is True


def test_memory_change_below_threshold_returns_false():
    """When memory change is below threshold_pct, should return False."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    head_mem = MemoryStats(peak_memory_bytes=1005, total_allocations=10)  # 0.5% increase
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.66μs -> 1.50μs (10.6% faster)
    assert result is False


def test_allocation_increase_above_threshold():
    """When total allocations increase by more than threshold_pct, should return True."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=100)
    head_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=102)  # 2% increase
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.76μs -> 1.52μs (15.8% faster)
    assert result is True


def test_allocation_decrease_above_threshold():
    """When total allocations decrease by more than threshold_pct, should return True."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=100)
    head_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=98)  # 2% decrease
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.70μs -> 1.49μs (14.1% faster)
    assert result is True


def test_allocation_change_below_threshold_returns_false():
    """When allocation change is below threshold_pct, should return False."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=100)
    head_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=100)  # 0% change
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.72μs -> 1.48μs (16.2% faster)
    assert result is False


def test_custom_threshold_1_percent():
    """With custom threshold of 1%, should detect 1.5% change but not 0.5%."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    head_mem_above = MemoryStats(peak_memory_bytes=1015, total_allocations=10)  # 1.5%
    head_mem_below = MemoryStats(peak_memory_bytes=1005, total_allocations=10)  # 0.5%

    assert (
        has_meaningful_memory_change(base_mem, head_mem_above, threshold_pct=1.0) is True
    )  # 1.47μs -> 1.14μs (28.9% faster)
    assert (
        has_meaningful_memory_change(base_mem, head_mem_below, threshold_pct=1.0) is False
    )  # 972ns -> 862ns (12.8% faster)


def test_custom_threshold_5_percent():
    """With custom threshold of 5%, should detect 6% change but not 4%."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    head_mem_above = MemoryStats(peak_memory_bytes=1060, total_allocations=10)  # 6%
    head_mem_below = MemoryStats(peak_memory_bytes=1040, total_allocations=10)  # 4%

    assert (
        has_meaningful_memory_change(base_mem, head_mem_above, threshold_pct=5.0) is True
    )  # 1.50μs -> 1.22μs (22.8% faster)
    assert (
        has_meaningful_memory_change(base_mem, head_mem_below, threshold_pct=5.0) is False
    )  # 852ns -> 841ns (1.31% faster)


def test_both_metrics_change_above_threshold():
    """When both memory and allocations change above threshold, should return True."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=100)
    head_mem = MemoryStats(peak_memory_bytes=1020, total_allocations=102)  # both 2% increase
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.46μs -> 1.19μs (22.7% faster)
    assert result is True


def test_base_peak_memory_zero_head_not_zero():
    """When base peak memory is zero, memory change cannot be calculated; check allocations only."""
    base_mem = MemoryStats(peak_memory_bytes=0, total_allocations=100)
    head_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=102)  # 2% allocation increase
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.49μs -> 1.39μs (7.18% faster)
    assert result is True


def test_base_peak_memory_zero_allocations_same():
    """When base peak memory is zero and allocations don't change significantly, should return False."""
    base_mem = MemoryStats(peak_memory_bytes=0, total_allocations=100)
    head_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=101)  # 1% allocation increase
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.53μs -> 1.37μs (11.7% faster)
    assert result is False


def test_base_allocations_zero_memory_changes():
    """When base allocations are zero, allocation change cannot be calculated; check memory only."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=0)
    head_mem = MemoryStats(peak_memory_bytes=1020, total_allocations=100)  # 2% memory increase
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.48μs -> 1.27μs (16.5% faster)
    assert result is True


def test_base_allocations_zero_memory_same():
    """When base allocations are zero and memory doesn't change significantly, should return False."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=0)
    head_mem = MemoryStats(peak_memory_bytes=1005, total_allocations=100)  # 0.5% memory increase
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.48μs -> 1.28μs (15.7% faster)
    assert result is False


def test_very_small_base_memory():
    """With very small base memory (1 byte), large percentage change should be detected."""
    base_mem = MemoryStats(peak_memory_bytes=1, total_allocations=10)
    head_mem = MemoryStats(peak_memory_bytes=2, total_allocations=10)  # 100% increase
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.55μs -> 1.14μs (36.0% faster)
    assert result is True


def test_large_base_memory():
    """With large base memory, percentage change calculation should still work correctly."""
    base_mem = MemoryStats(peak_memory_bytes=1_000_000_000, total_allocations=10)
    head_mem = MemoryStats(peak_memory_bytes=1_020_000_000, total_allocations=10)  # 2% increase
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.55μs -> 1.28μs (21.1% faster)
    assert result is True


def test_threshold_exactly_at_boundary():
    """When change is exactly at threshold boundary (e.g., 1.0%), should return False (not > threshold)."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    head_mem = MemoryStats(peak_memory_bytes=1010, total_allocations=10)  # exactly 1% increase
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.75μs -> 1.46μs (19.8% faster)
    assert result is False


def test_threshold_just_above_boundary():
    """When change is just above threshold boundary (e.g., 1.01%), should return True."""
    base_mem = MemoryStats(peak_memory_bytes=10000, total_allocations=10)
    head_mem = MemoryStats(peak_memory_bytes=10101, total_allocations=10)  # 1.01% increase
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.47μs -> 1.28μs (14.8% faster)
    assert result is True


def test_threshold_zero():
    """With threshold_pct=0, any non-zero change should be detected."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    head_mem = MemoryStats(peak_memory_bytes=1001, total_allocations=10)  # 0.1% increase
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=0.0)  # 1.43μs -> 1.18μs (21.0% faster)
    assert result is True


def test_threshold_zero_with_no_change():
    """With threshold_pct=0 and no change, should return False."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    head_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=0.0)  # 1.75μs -> 1.49μs (17.5% faster)
    assert result is False


def test_negative_memory_change():
    """Negative change in memory should be handled with absolute value."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    head_mem = MemoryStats(peak_memory_bytes=800, total_allocations=10)  # 20% decrease
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.53μs -> 1.27μs (20.4% faster)
    assert result is True


def test_negative_allocation_change():
    """Negative change in allocations should be handled with absolute value."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=100)
    head_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=80)  # 20% decrease
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.85μs -> 1.59μs (16.4% faster)
    assert result is True


def test_threshold_very_large():
    """With very large threshold_pct, no reasonable change should trigger True."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    head_mem = MemoryStats(peak_memory_bytes=2000, total_allocations=10)  # 100% increase
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=101.0)  # 1.82μs -> 1.51μs (20.6% faster)
    assert result is False


def test_threshold_very_small():
    """With very small threshold_pct, even tiny changes should trigger True."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    head_mem = MemoryStats(peak_memory_bytes=1001, total_allocations=10)  # 0.1% increase
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=0.01)  # 1.48μs -> 1.22μs (21.4% faster)
    assert result is True


def test_head_peak_memory_zero_base_not_zero():
    """When head peak memory is zero and base is not, it's a large decrease."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    head_mem = MemoryStats(peak_memory_bytes=0, total_allocations=10)  # 100% decrease
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.52μs -> 1.41μs (7.86% faster)
    assert result is True


def test_head_allocations_zero_base_not_zero():
    """When head allocations are zero and base is not, it's a large decrease."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=100)
    head_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=0)  # 100% decrease
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.82μs -> 1.57μs (16.0% faster)
    assert result is True


def test_very_large_memory_values():
    """Test with extremely large memory values (terabytes range)."""
    base_mem = MemoryStats(peak_memory_bytes=1_000_000_000_000, total_allocations=10)
    head_mem = MemoryStats(peak_memory_bytes=1_020_000_000_000, total_allocations=10)  # 2% increase
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.77μs -> 1.98μs (10.6% slower)
    assert result is True


def test_very_large_allocation_values():
    """Test with extremely large allocation counts."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=1_000_000_000)
    head_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=1_020_000_000)  # 2% increase
    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.82μs -> 1.63μs (11.7% faster)
    assert result is True


def test_repeated_calls_with_same_input():
    """Multiple calls with identical input should always return same result."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    head_mem = MemoryStats(peak_memory_bytes=1010, total_allocations=10)

    result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 1.68μs -> 1.53μs (9.78% faster)
    assert result is True


def test_repeated_calls_with_no_change():
    """Multiple calls with no change should always return False."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    head_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)

    result = has_meaningful_memory_change(base_mem, head_mem)  # 1.62μs -> 1.30μs (24.6% faster)
    assert result is False


def test_multiple_thresholds_with_same_data():
    """Test the same data with multiple different thresholds."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=100)
    head_mem = MemoryStats(peak_memory_bytes=1050, total_allocations=100)  # 5% increase

    thresholds = [0.1, 0.5, 1.0, 2.0, 4.0, 4.9, 5.0, 5.1, 10.0]
    results = [has_meaningful_memory_change(base_mem, head_mem, threshold_pct=t) for t in thresholds]

    # First 6 should be True (5% > 0.1%, 0.5%, 1%, 2%, 4%, 4.9%)
    # Last 3 should be False (5% <= 5%, 5.1%, 10%)
    assert results[:6] == [True, True, True, True, True, True]
    assert results[6:] == [False, False, False]


def test_boundary_case_with_many_iterations():
    """Test boundary conditions with varied base memory values."""
    test_bases = [100, 1000, 10000, 100000, 1000000]

    for base_memory in test_bases:
        base_mem = MemoryStats(peak_memory_bytes=base_memory, total_allocations=10)
        head_mem = MemoryStats(peak_memory_bytes=int(base_memory * 1.01), total_allocations=10)
        result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=1.0)  # 4.39μs -> 3.90μs (12.5% faster)
        assert result is False


def test_range_of_allocation_increases():
    """Test a range of allocation increase percentages to verify threshold logic."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=1000)

    # Test allocations from 0% to 5% increase
    for increase_pct in range(6):
        head_allocations = int(1000 * (1.0 + increase_pct / 100.0))
        head_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=head_allocations)
        result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=2.0)  # 4.67μs -> 4.18μs (11.7% faster)

        # Should return True only if increase > 2%
        if increase_pct > 2:
            assert result is True, f"Failed for {increase_pct}% increase"
        else:
            assert result is False, f"Failed for {increase_pct}% increase"


def test_range_of_memory_increases():
    """Test a range of memory increase percentages to verify threshold logic."""
    base_mem = MemoryStats(peak_memory_bytes=10000, total_allocations=10)

    # Test memory from 0% to 5% increase
    for increase_pct in range(6):
        head_memory = int(10000 * (1.0 + increase_pct / 100.0))
        head_mem = MemoryStats(peak_memory_bytes=head_memory, total_allocations=10)
        result = has_meaningful_memory_change(base_mem, head_mem, threshold_pct=2.0)  # 4.41μs -> 3.81μs (15.8% faster)

        # Should return True only if increase > 2%
        if increase_pct > 2:
            assert result is True, f"Failed for {increase_pct}% increase"
        else:
            assert result is False, f"Failed for {increase_pct}% increase"


def test_stress_test_none_inputs():
    """Stress test with varied None inputs and memory configurations."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=10)
    head_mem_small = MemoryStats(peak_memory_bytes=500, total_allocations=5)
    head_mem_large = MemoryStats(peak_memory_bytes=2000, total_allocations=20)

    assert has_meaningful_memory_change(None, base_mem) is True  # 512ns -> 521ns (1.73% slower)
    assert has_meaningful_memory_change(base_mem, None) is True  # 310ns -> 310ns (0.000% faster)
    assert has_meaningful_memory_change(None, None) is False  # 231ns -> 230ns (0.435% faster)
    assert (
        has_meaningful_memory_change(base_mem, head_mem_small, threshold_pct=1.0) is True
    )  # 1.32μs -> 1.10μs (20.1% faster)
    assert (
        has_meaningful_memory_change(base_mem, head_mem_large, threshold_pct=1.0) is True
    )  # 501ns -> 521ns (3.84% slower)
    assert (
        has_meaningful_memory_change(head_mem_small, head_mem_large, threshold_pct=1.0) is True
    )  # 421ns -> 370ns (13.8% faster)


def test_both_zero_repeated():
    """Calls with both stats having zero values."""
    base_mem = MemoryStats(peak_memory_bytes=0, total_allocations=0)
    head_mem = MemoryStats(peak_memory_bytes=0, total_allocations=0)

    result = has_meaningful_memory_change(base_mem, head_mem)  # 521ns -> 582ns (10.5% slower)
    assert result is False


def test_alternating_increase_decrease():
    """Test patterns of both increase and decrease with different magnitudes."""
    base_mem = MemoryStats(peak_memory_bytes=1000, total_allocations=100)

    # 2% increase case (above 1% threshold)
    head_mem_increase = MemoryStats(peak_memory_bytes=1020, total_allocations=102)
    result_increase = has_meaningful_memory_change(
        base_mem, head_mem_increase, threshold_pct=1.0
    )  # 1.58μs -> 1.41μs (12.0% faster)
    assert result_increase is True

    # 2% decrease case (above 1% threshold)
    head_mem_decrease = MemoryStats(peak_memory_bytes=980, total_allocations=98)
    result_decrease = has_meaningful_memory_change(
        base_mem, head_mem_decrease, threshold_pct=1.0
    )  # 712ns -> 722ns (1.39% slower)
    assert result_decrease is True

    # 0.5% increase case (below 1% threshold)
    head_mem_small = MemoryStats(peak_memory_bytes=1005, total_allocations=100)
    result_small = has_meaningful_memory_change(
        base_mem, head_mem_small, threshold_pct=1.0
    )  # 802ns -> 711ns (12.8% faster)
    assert result_small is False

To test or edit this optimization locally, run `git merge codeflash/optimize-pr1941-2026-04-02T17.07.34`.

Suggested change

Before:

    if base_mem.peak_memory_bytes == 0 and head_mem.peak_memory_bytes == 0:
        return False
    if base_mem.peak_memory_bytes > 0:
        mem_pct = abs((head_mem.peak_memory_bytes - base_mem.peak_memory_bytes) / base_mem.peak_memory_bytes) * 100
        if mem_pct > threshold_pct:
            return True
    if base_mem.total_allocations > 0:
        alloc_pct = abs((head_mem.total_allocations - base_mem.total_allocations) / base_mem.total_allocations) * 100
        if alloc_pct > threshold_pct:

After:

    b_peak = base_mem.peak_memory_bytes
    h_peak = head_mem.peak_memory_bytes
    if b_peak == 0 and h_peak == 0:
        return False
    # When base peak is positive, check relative change without creating intermediate floats
    if b_peak > 0:
        # mem_pct > threshold_pct <=> abs(h_peak - b_peak) * 100 > threshold_pct * b_peak
        if abs(h_peak - b_peak) * 100.0 > threshold_pct * b_peak:
            return True
    b_alloc = base_mem.total_allocations
    if b_alloc > 0:
        # alloc_pct > threshold_pct <=> abs(h_alloc - b_alloc) * 100 > threshold_pct * b_alloc
        if abs(head_mem.total_allocations - b_alloc) * 100.0 > threshold_pct * b_alloc:
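The rewrite relies on the identity that for positive `b`, `abs(h - b) / b * 100 > t` is equivalent to `abs(h - b) * 100 > t * b`, which drops a division and an intermediate float per call. A minimal self-contained sketch of the rewritten check, sufficient to exercise the tests above (the `threshold_pct` default of 5.0 and the `None` handling are assumptions, inferred from the test expectations rather than taken from the PR source):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryStats:
    peak_memory_bytes: int
    total_allocations: int


def has_meaningful_memory_change(
    base_mem: Optional[MemoryStats],
    head_mem: Optional[MemoryStats],
    threshold_pct: float = 5.0,  # assumed default; the real default is not shown in the diff
) -> bool:
    # Stats missing on exactly one side count as a meaningful change.
    if base_mem is None or head_mem is None:
        return base_mem is not head_mem
    b_peak = base_mem.peak_memory_bytes
    h_peak = head_mem.peak_memory_bytes
    if b_peak == 0 and h_peak == 0:
        return False
    # Cross-multiplied form of: abs(h_peak - b_peak) / b_peak * 100 > threshold_pct
    if b_peak > 0 and abs(h_peak - b_peak) * 100.0 > threshold_pct * b_peak:
        return True
    b_alloc = base_mem.total_allocations
    # Same cross-multiplication applied to the allocation count.
    return b_alloc > 0 and abs(head_mem.total_allocations - b_alloc) * 100.0 > threshold_pct * b_alloc
```

Within float rounding the cross-multiplied comparison decides exactly the same cases as the division-based one, so the observable behavior is unchanged.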


The hot path shows `logger.debug` consuming 18.3% of original runtime despite appearing infrequently (141 hits), because formatting the f-string occurs unconditionally even when debug logging is disabled. Wrapping it with `logger.isEnabledFor(logging.DEBUG)` defers string construction until confirmed necessary, eliminating wasteful formatting. Replacing `lambda x: x[3]` with `operator.itemgetter(3)` in the sort key reduces per-comparison overhead from a Python function call to a C-level attribute access, and hoisting the division constant `1_000_000.0` outside the loop avoids repeated float literal construction. Line profiler confirms the sort line dropped from 568 µs to 197 µs (65% faster) and the debug call from 1102 µs to 124 µs (89% faster), yielding a 45% overall speedup with no correctness or metric trade-offs.
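The micro-optimizations described above can be reproduced in isolation; a hedged sketch (the row layout, the `bench` logger name, and the values are illustrative, not taken from the PR):

```python
import logging
import operator

logger = logging.getLogger("bench")

# Illustrative profiling rows; index 3 is the per-function time used as the sort key.
rows = [
    ("parse", 141, 4, 1102.0),
    ("sort", 97, 2, 568.0),
    ("emit", 12, 1, 124.0),
]

# 1) Guard the debug call so the f-string is only built when DEBUG is enabled.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug(f"benchmark rows: {rows}")

# 2) operator.itemgetter(3) is a C-level key function, cheaper per comparison
#    than the equivalent `lambda x: x[3]`.
rows.sort(key=operator.itemgetter(3), reverse=True)

# 3) Hoist the conversion constant out of the loop instead of re-creating the
#    literal on every iteration.
NS_PER_MS = 1_000_000.0
times_ms = [r[3] / NS_PER_MS for r in rows]
```

With the default WARNING level the guarded `logger.debug` line is skipped entirely, so the f-string is never formatted.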
@codeflash-ai
Contributor

codeflash-ai bot commented Apr 2, 2026

⚡️ Codeflash found optimizations for this PR

📄 45% (0.45x) speedup for validate_and_format_benchmark_table in codeflash/benchmarking/utils.py

⏱️ Runtime: 1.26 milliseconds → 869 microseconds (best of 5 runs)

A dependent PR with the suggested changes has been created. Please review:

If you approve, it will be merged into this PR (branch cf-compare-copy-benchmarks).


github-actions bot and others added 2 commits April 2, 2026 18:53
…2026-04-02T18.50.56

⚡️ Speed up function `validate_and_format_benchmark_table` by 45% in PR #1941 (`cf-compare-copy-benchmarks`)
@codeflash-ai
Contributor

codeflash-ai bot commented Apr 3, 2026

This PR is now faster! 🚀 @claude[bot] accepted my optimizations from:

@KRRT7 KRRT7 changed the title feat: add --memory flag and memory-only benchmarks to codeflash compare feat: enhance codeflash compare with memory profiling, script mode, and auto-calibration Apr 3, 2026
@KRRT7 KRRT7 merged commit accb245 into main Apr 3, 2026
27 of 29 checks passed
@KRRT7 KRRT7 deleted the cf-compare-copy-benchmarks branch April 3, 2026 12:13