Improve backend benchmarking with hierarchical names, expanded workloads, and split CI suites #238
Open
PhilippGrulich wants to merge 1 commit into main from
53f3830 to 0238c53
…stom dashboard

Overhaul the benchmark infrastructure to enable meaningful backend comparison, add plugin/intrinsic coverage, and provide a richer UI.

Benchmark infrastructure:
- Add BenchmarkUtil.hpp with shared getEnabledBackends() eliminating duplicated #ifdef blocks across all benchmark files
- Adopt hierarchical benchmark names (e.g., execution/mlir/fibonacci, pipeline/compile/cpp/gcd, plugins/simd/native/dotProduct)
- Add Catch2 tags ([tracing], [pipeline], [execution], [tiered], [plugins]) for running benchmark subsets independently
- Pre-compile all JIT functions once before Catch2 benchmark loops (fixes timeout caused by recompilation on every sample)

Execution benchmarks (ExecutionBenchmark.cpp):
- Expand from 3 to 7 workloads: add, fibonacci, sumLoop, collatz, ifThenElse, gcd, arraySum
- Add native C++ baselines (execution/native/*)
- Add interpreted baselines (execution/interpreted/*)
- Fix collatz int32_t overflow by using int64_t
- Skip arraySum on bc backend (too slow for 1M interpreted elements)

Pipeline benchmarks (TracingBenchmark.cpp):
- Pre-compute trace/SSA/IR once per function, cache IR for reuse across backends (eliminates redundant tracing per sample)
- Rename tracing contexts: trace -> exception, completing_trace -> lazy

Plugin intrinsic benchmarks (new):
- PluginBenchmark.cpp: math (sqrt, sin, cos, exp, log, pow, fma, floor, ceil, composite expr), bit (popcount, countl_zero, countr_zero, byteswap, rotl, composite bit-mix), memory (memcpy/memset at 64B/4KB/1MB), all across backends with native baselines
- SimdBenchmark.cpp: SIMD vector ops (vectorAdd, vectorMul, vectorFma, reduceAdd, dotProduct, distanceSquared, vectorAddInt, reduceAddInt) with native scalar baselines
- Separate nautilus-plugin-benchmarks executable linking nautilus-std and nautilus-simd

CI workflow (benchmark.yml):
- Split into 5 named benchmark-action suites (Tracing Overhead, Compilation Pipeline, Execution Throughput, Tiered Compilation, Plugin Intrinsics); each gets its own chart on GitHub Pages
- First step fetches pages branch; steps 2-5 use skip-fetch-gh-pages to avoid non-fast-forward conflicts on the local ref
- Single push at the end with all data + custom dashboard
- Increase chart history from 20 to 50 data points
- Add 150% alert threshold for regression detection
- Upload benchmark results as CI artifacts

Custom dashboard (docs/benchmark/dashboard.html):
- Self-contained Chart.js page deployed to GitHub Pages
- Overview: grouped bar chart + speedup-vs-native table
- Per-category tabs: execution, pipeline, tracing, tiered, plugins
- Plugin intrinsics tab: math/bit/simd/memory charts + speedup table
- Historical trend charts across all suites

https://claude.ai/code/session_01He1xQnUZMRThAb4wvQtiN1
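For illustration, the shared backend list and the hierarchical naming scheme described in the commit message could be sketched roughly as follows. This is not the contents of BenchmarkUtil.hpp, which is not shown here: the `EXAMPLE_ENABLE_*` macros and the `benchmarkName` helper are hypothetical, and only the function name `getEnabledBackends()` and the `category/backend/workload` name shape come from the commit message.

```cpp
#include <string>
#include <vector>

// Hypothetical feature macros, defined here only so the sketch compiles on
// its own; in a real build the build system would set flags like these.
#define EXAMPLE_ENABLE_MLIR
#define EXAMPLE_ENABLE_BC

// One place that turns compile-time flags into a runtime list, so the
// individual benchmark files need no #ifdef blocks of their own.
inline std::vector<std::string> getEnabledBackends() {
    std::vector<std::string> backends;
#ifdef EXAMPLE_ENABLE_MLIR
    backends.emplace_back("mlir");
#endif
#ifdef EXAMPLE_ENABLE_CPP
    backends.emplace_back("cpp");
#endif
#ifdef EXAMPLE_ENABLE_BC
    backends.emplace_back("bc");
#endif
    return backends;
}

// Hypothetical helper that builds a hierarchical name such as
// "execution/mlir/fibonacci" from its three components.
inline std::string benchmarkName(const std::string& category,
                                 const std::string& backend,
                                 const std::string& workload) {
    return category + "/" + backend + "/" + workload;
}
```

A benchmark file would then loop over `getEnabledBackends()` and register one benchmark per backend under the name returned by `benchmarkName(...)`.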
0238c53 to 3f90273
… duplicated #ifdef blocks across all benchmark files
… pipeline/compile/mlir/fibonacci, execution/bc/collatz) so the benchmark-action UI can group metrics by category
… enable running benchmark subsets independently
… sumLoop, collatz, nestedSumLoop, ifThenElse, gcd, arraySum) covering arithmetic, loops, branching, memory, and nested patterns
… against JIT-compiled backends
… measurement
… Overhead, Compilation Pipeline, Execution Throughput, Tiered Compilation) so each gets its own chart on GitHub Pages
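To illustrate why the collatz workload needs a 64-bit type, here is a minimal plain C++ sketch. It is not the benchmark's actual implementation (the real workload goes through the JIT backends, and its input values are not shown in this PR); `collatzSteps`, `collatzPeak`, and `exceedsInt32` are hypothetical names. Intermediate values of the 3n + 1 step can exceed the int32_t range even for modest starting values, for example 113383.

```cpp
#include <cstdint>
#include <limits>

// Number of steps until the sequence reaches 1. The running value is kept
// in 64 bits because 3 * n + 1 can leave the int32_t range.
inline std::int64_t collatzSteps(std::int64_t n) {
    std::int64_t steps = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        ++steps;
    }
    return steps;
}

// Largest value reached along the trajectory starting at n.
inline std::int64_t collatzPeak(std::int64_t n) {
    std::int64_t peak = n;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        if (n > peak) {
            peak = n;
        }
    }
    return peak;
}

// True when a value would not fit into a signed 32-bit integer.
inline bool exceedsInt32(std::int64_t value) {
    return value > std::numeric_limits<std::int32_t>::max();
}
```

With a 32-bit accumulator, a start value such as 113383 would hit signed overflow, which is undefined behavior in C++.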
https://claude.ai/code/session_01He1xQnUZMRThAb4wvQtiN1
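As a rough sketch of what the 150% alert threshold from the commit message means for a smaller-is-better metric such as time per iteration: a run is flagged when it takes more than 1.5 times as long as the previous one. The function below is hypothetical and only illustrates the arithmetic; how the CI action treats exact equality or missing history is an assumption here.

```cpp
// Returns true when `current` is worse than `previous` by more than the
// given percentage threshold, for metrics where smaller is better.
// Assumption: a ratio exactly at the threshold does not trigger an alert.
inline bool isRegression(double previous, double current,
                         double thresholdPercent = 150.0) {
    if (previous <= 0.0) {
        return false;  // no usable baseline to compare against
    }
    return (current / previous) * 100.0 > thresholdPercent;
}
```

For example, going from 100 ns to 151 ns per iteration would be flagged, while 100 ns to 120 ns would not.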