
Improve backend benchmarking with hierarchical names, expanded workloads, and split CI suites#238

Open
PhilippGrulich wants to merge 1 commit into main from claude/improve-backend-benchmarking-HjDRt

Conversation

@PhilippGrulich
Member

  • Add BenchmarkUtil.hpp with shared getEnabledBackends() to eliminate
    duplicated #ifdef blocks across all benchmark files
  • Adopt hierarchical benchmark names (e.g., tracing/exception/add,
    pipeline/compile/mlir/fibonacci, execution/bc/collatz) so the
    benchmark-action UI can group metrics by category
  • Add Catch2 tags ([tracing], [pipeline], [execution], [tiered]) to
    enable running benchmark subsets independently
  • Expand ExecutionBenchmark from 3 to 8 workloads (add, fibonacci,
    sumLoop, collatz, nestedSumLoop, ifThenElse, gcd, arraySum) covering
    arithmetic, loops, branching, memory, and nested patterns
  • Add native C++ baselines (execution/native/*) for direct comparison
    against JIT-compiled backends
  • Add interpreted baselines (execution/interpreted/*) for overhead
    measurement
  • Split benchmark.yml into 4 named benchmark-action suites (Tracing
    Overhead, Compilation Pipeline, Execution Throughput, Tiered
    Compilation) so each gets its own chart on GitHub Pages
  • Increase chart history from 20 to 50 data points
  • Add 150% alert threshold for regression detection
  • Upload benchmark results as CI artifacts for debugging
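The shared helper described above might look roughly like this. This is a sketch: the macro names, backend identifiers, and the two `#define`s at the top are illustrative assumptions (in the real tree the guards would come from CMake options), not the patch's exact code.

```cpp
#include <string>
#include <vector>

// Hypothetical feature macros standing in for the project's real build flags;
// defined here only so the sketch is self-contained.
#define NES_ENABLE_MLIR_BACKEND 1
#define NES_ENABLE_BC_BACKEND 1

// Shared across all benchmark files, replacing per-file #ifdef blocks:
// returns the identifiers of the backends compiled into this build.
inline std::vector<std::string> getEnabledBackends() {
    std::vector<std::string> backends;
#ifdef NES_ENABLE_MLIR_BACKEND
    backends.push_back("mlir");
#endif
#ifdef NES_ENABLE_CPP_BACKEND
    backends.push_back("cpp");
#endif
#ifdef NES_ENABLE_BC_BACKEND
    backends.push_back("bc");
#endif
    return backends;
}
```

Each benchmark file can then loop over `getEnabledBackends()` instead of repeating the conditional-compilation blocks.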

https://claude.ai/code/session_01He1xQnUZMRThAb4wvQtiN1

@PhilippGrulich PhilippGrulich force-pushed the claude/improve-backend-benchmarking-HjDRt branch 2 times, most recently from 53f3830 to 0238c53 on April 12, 2026 at 13:44
…stom dashboard

Overhaul the benchmark infrastructure to enable meaningful backend
comparison, add plugin/intrinsic coverage, and provide a richer UI.

Benchmark infrastructure:
- Add BenchmarkUtil.hpp with shared getEnabledBackends() eliminating
  duplicated #ifdef blocks across all benchmark files
- Adopt hierarchical benchmark names (e.g., execution/mlir/fibonacci,
  pipeline/compile/cpp/gcd, plugins/simd/native/dotProduct)
- Add Catch2 tags ([tracing], [pipeline], [execution], [tiered],
  [plugins]) for running benchmark subsets independently
- Pre-compile all JIT functions once before Catch2 benchmark loops
  (fixes timeout caused by recompilation on every sample)
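For the hierarchical names to be useful, a dashboard or the benchmark-action UI must be able to recover the category, backend, and workload segments from a name like `execution/mlir/fibonacci`. A minimal splitter (a hypothetical helper, not part of this patch) illustrates the convention:

```cpp
#include <string>
#include <vector>

// Split a hierarchical benchmark name such as "execution/mlir/fibonacci"
// into its slash-separated segments (category, backend, workload).
std::vector<std::string> splitName(const std::string& name) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (std::size_t pos = name.find('/'); pos != std::string::npos;
         pos = name.find('/', start)) {
        parts.push_back(name.substr(start, pos - start));
        start = pos + 1;
    }
    parts.push_back(name.substr(start));  // trailing segment (the workload)
    return parts;
}
```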

Execution benchmarks (ExecutionBenchmark.cpp):
- Expand from 3 to 7 workloads: add, fibonacci, sumLoop, collatz,
  ifThenElse, gcd, arraySum
- Add native C++ baselines (execution/native/*)
- Add interpreted baselines (execution/interpreted/*)
- Fix collatz int32_t overflow by using int64_t
- Skip arraySum on bc backend (too slow for 1M interpreted elements)
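The overflow fix is easiest to see in a sketch of the workload: intermediate values of `3n + 1` can exceed `INT32_MAX` even for modest starting points, so the counter and operand are widened to `int64_t`. This version is illustrative, not the benchmark's exact code.

```cpp
#include <cstdint>

// Collatz step count: number of iterations until n reaches 1.
// int64_t avoids the int32_t overflow of 3n + 1 on large intermediates.
int64_t collatzSteps(int64_t n) {
    int64_t steps = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        ++steps;
    }
    return steps;
}
```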

Pipeline benchmarks (TracingBenchmark.cpp):
- Pre-compute trace/SSA/IR once per function, cache IR for reuse
  across backends (eliminates redundant tracing per sample)
- Rename tracing contexts: trace -> exception, completing_trace -> lazy

Plugin intrinsic benchmarks (new):
- PluginBenchmark.cpp: math (sqrt, sin, cos, exp, log, pow, fma, floor,
  ceil, composite expr), bit (popcount, countl_zero, countr_zero,
  byteswap, rotl, composite bit-mix), memory (memcpy/memset at
  64B/4KB/1MB) — all across backends with native baselines
- SimdBenchmark.cpp: SIMD vector ops (vectorAdd, vectorMul, vectorFma,
  reduceAdd, dotProduct, distanceSquared, vectorAddInt, reduceAddInt)
  with native scalar baselines
- Separate nautilus-plugin-benchmarks executable linking nautilus-std
  and nautilus-simd
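The native scalar baselines the SIMD benchmarks compare against can be very small; a sketch of a `dotProduct` baseline might look like the following (the signature is an assumption, not the file's actual interface):

```cpp
#include <cstddef>

// Plain scalar dot product: the "native" baseline against which the
// SIMD plugin's vectorized dotProduct is measured.
double dotProductScalar(const double* a, const double* b, std::size_t n) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}
```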

CI workflow (benchmark.yml):
- Split into 5 named benchmark-action suites (Tracing Overhead,
  Compilation Pipeline, Execution Throughput, Tiered Compilation,
  Plugin Intrinsics) — each gets its own chart on GitHub Pages
- First step fetches pages branch; steps 2-5 use skip-fetch-gh-pages
  to avoid non-fast-forward conflicts on the local ref
- Single push at the end with all data + custom dashboard
- Increase chart history from 20 to 50 data points
- Add 150% alert threshold for regression detection
- Upload benchmark results as CI artifacts
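One of the five suite steps in benchmark.yml could look roughly like this, assuming benchmark-action/github-action-benchmark is the action in use; the step name and output file path are placeholders, and only the first suite would omit `skip-fetch-gh-pages`:

```yaml
- name: Store execution benchmark result
  uses: benchmark-action/github-action-benchmark@v1
  with:
    name: Execution Throughput
    tool: 'catch2'
    output-file-path: execution-benchmark.txt   # placeholder path
    max-items-in-chart: 50       # chart history raised from 20 to 50
    alert-threshold: '150%'      # flag regressions over 1.5x
    skip-fetch-gh-pages: true    # first suite already fetched the pages branch
    auto-push: false             # one push at the end with all data
```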

Custom dashboard (docs/benchmark/dashboard.html):
- Self-contained Chart.js page deployed to GitHub Pages
- Overview: grouped bar chart + speedup-vs-native table
- Per-category tabs: execution, pipeline, tracing, tiered, plugins
- Plugin intrinsics tab: math/bit/simd/memory charts + speedup table
- Historical trend charts across all suites

https://claude.ai/code/session_01He1xQnUZMRThAb4wvQtiN1
@PhilippGrulich PhilippGrulich force-pushed the claude/improve-backend-benchmarking-HjDRt branch from 0238c53 to 3f90273 on April 14, 2026 at 01:19
