Add nautilus-profiling plugin for runtime performance analysis by PhilippGrulich · Pull Request #244 · nebulastream/nautilus

PhilippGrulich · 2026-04-12T13:12:48Z

New plugin that provides cross-backend function profiling for Nautilus:

- ProfiledFunction wrapper with zero overhead when profiling inactive
- Thread-local call stacks and ring buffers for lock-free hot path
- Chrome Trace Format export (viewable in Perfetto/chrome://tracing)
- Folded stacks export (for Brendan Gregg's flamegraph.pl)
- Perf map writer (/tmp/perf-.map) for JIT symbol resolution
- Sampling support to reduce overhead for high-frequency calls
- Optional CPU cycle collection via rdtsc (x86)
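
To make the thread-local ring buffer, sampling, and Chrome Trace export ideas above more concrete, here is a minimal sketch. It is not the plugin's code: `Event`, `RingBuffer`, `localBuffer`, `shouldSample`, and `toChromeTrace` are hypothetical names chosen for illustration, and the buffer capacity and fixed `pid`/`tid` values are assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical event record; the plugin's real layout is not shown in this PR.
struct Event {
    const char* name;     // function name, must outlive the buffer
    uint64_t startNs;     // start timestamp in nanoseconds
    uint64_t durationNs;  // duration in nanoseconds
};

// Fixed-capacity ring buffer: once full, a push overwrites the oldest entry.
template <std::size_t N>
class RingBuffer {
public:
    void push(const Event& e) {
        events_[head_ % N] = e;
        ++head_;
    }
    std::size_t size() const { return head_ < N ? head_ : N; }
    // Copies the surviving events out, oldest first.
    std::vector<Event> snapshot() const {
        std::vector<Event> out;
        const std::size_t n = size();
        out.reserve(n);
        for (std::size_t i = head_ - n; i < head_; ++i) out.push_back(events_[i % N]);
        return out;
    }

private:
    std::array<Event, N> events_{};
    std::size_t head_ = 0;
};

// One buffer per thread, so recording an event takes no lock.
inline RingBuffer<1024>& localBuffer() {
    thread_local RingBuffer<1024> buffer;
    return buffer;
}

// Returns true on every `rate`-th call made by the current thread.
inline bool shouldSample(uint32_t rate) {
    assert(rate > 0);
    thread_local uint32_t counter = 0;
    return (counter++ % rate) == 0;
}

// Serializes events as Chrome Trace "complete" events (ph = "X").
// Timestamps are written in whole microseconds; names are assumed to
// contain no characters that need JSON escaping.
inline std::string toChromeTrace(const std::vector<Event>& events) {
    std::ostringstream os;
    os << "{\"traceEvents\":[";
    for (std::size_t i = 0; i < events.size(); ++i) {
        const Event& e = events[i];
        if (i != 0) os << ",";
        os << "{\"name\":\"" << e.name << "\",\"ph\":\"X\",\"pid\":1,\"tid\":1"
           << ",\"ts\":" << e.startNs / 1000 << ",\"dur\":" << e.durationNs / 1000 << "}";
    }
    os << "]}";
    return os.str();
}
```

A caller would typically guard the hot path with something like `if (shouldSample(16)) localBuffer().push(event);` and serialize `localBuffer().snapshot()` once profiling stops. Keeping the buffer thread-local trades a merge step at export time for a lock-free recording path.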

https://claude.ai/code/session_01SLkb9NXtrjb4QAkiUTTMf9

github-actions

Tracing Benchmark

Benchmark suite	Current: `b889260`	Previous: `018a921`	Ratio
`trace_add`	`2.3814` us (`± 231.732`)	`2.40287` us (`± 287.781`)	`0.99`
`completing_trace_add`	`2.50615` us (`± 433.253`)	`2.37871` us (`± 348.935`)	`1.05`
`trace_ifThenElse`	`11.4709` us (`± 1.69924`)	`11.0567` us (`± 1.31609`)	`1.04`
`completing_trace_ifThenElse`	`5.08671` us (`± 642.17`)	`4.96633` us (`± 492.677`)	`1.02`
`trace_deeplyNestedIfElse`	`33.907` us (`± 6.2274`)	`33.3283` us (`± 3.72331`)	`1.02`
`completing_trace_deeplyNestedIfElse`	`22.5332` us (`± 3.04543`)	`14.7389` us (`± 2.36618`)	`1.53`
`trace_loop`	`10.9989` us (`± 1.52619`)	`11.4174` us (`± 1.84516`)	`0.96`
`completing_trace_loop`	`5.35757` us (`± 941.043`)	`5.07072` us (`± 688.523`)	`1.06`
`trace_ifInsideLoop`	`22.3528` us (`± 3.77478`)	`21.8217` us (`± 3.36396`)	`1.02`
`completing_trace_ifInsideLoop`	`9.28304` us (`± 1.11385`)	`9.37943` us (`± 1.64726`)	`0.99`
`trace_loopDirectCall`	`11.2732` us (`± 1.70474`)	`11.2594` us (`± 1.69256`)	`1.00`
`completing_trace_loopDirectCall`	`5.28109` us (`± 698.269`)	`5.01636` us (`± 613.598`)	`1.05`
`trace_pointerLoop`	`16.346` us (`± 2.94285`)	`16.9883` us (`± 3.6558`)	`0.96`
`completing_trace_pointerLoop`	`11.1447` us (`± 1.96757`)	`10.3234` us (`± 1.48734`)	`1.08`
`trace_staticLoop`	`9.47546` us (`± 1.16878`)	`11.4179` us (`± 1.50262`)	`0.83`
`completing_trace_staticLoop`	`9.45217` us (`± 1.54618`)	`10.2118` us (`± 1.40382`)	`0.93`
`trace_fibonacci`	`12.6897` us (`± 2.27286`)	`12.6525` us (`± 2.05563`)	`1.00`
`completing_trace_fibonacci`	`6.60695` us (`± 1.02724`)	`6.53429` us (`± 804.12`)	`1.01`
`trace_gcd`	`10.3277` us (`± 1.60794`)	`10.5287` us (`± 1.85354`)	`0.98`
`completing_trace_gcd`	`4.53296` us (`± 586.245`)	`4.43612` us (`± 660.368`)	`1.02`
`trace_nestedIf10`	`59.3288` us (`± 7.69771`)	`62.141` us (`± 13.7125`)	`0.95`
`completing_trace_nestedIf10`	`60.0057` us (`± 7.40374`)	`59.6267` us (`± 7.9754`)	`1.01`
`trace_nestedIf100`	`1.73314` ms (`± 60.5668`)	`1.72247` ms (`± 66.8355`)	`1.01`
`completing_trace_nestedIf100`	`1.79125` ms (`± 62.83`)	`1.77427` ms (`± 61.2359`)	`1.01`
`trace_chainedIf10`	`137.845` us (`± 12.2125`)	`137.168` us (`± 17.8157`)	`1.00`
`completing_trace_chainedIf10`	`72.2144` us (`± 13.1599`)	`70.1303` us (`± 10.357`)	`1.03`
`trace_chainedIf100`	`5.1015` ms (`± 77.4143`)	`5.08999` ms (`± 63.2729`)	`1.00`
`completing_trace_chainedIf100`	`2.66348` ms (`± 61.8628`)	`2.6717` ms (`± 62.2542`)	`1.00`
`comp_mlir_add`	`8.33335` ms (`± 159.834`)	`8.48512` ms (`± 244.662`)	`0.98`
`comp_mlir_ifThenElse`	`8.92078` ms (`± 156.007`)	`9.09679` ms (`± 190.418`)	`0.98`
`comp_mlir_deeplyNestedIfElse`	`7.81353` ms (`± 135.917`)	`7.89905` ms (`± 173.383`)	`0.99`
`comp_mlir_loop`	`9.96608` ms (`± 311.028`)	`10.1496` ms (`± 230.42`)	`0.98`
`comp_mlir_ifInsideLoop`	`32.0703` ms (`± 482.108`)	`32.388` ms (`± 435.684`)	`0.99`
`comp_mlir_loopDirectCall`	`14.7127` ms (`± 269.935`)	`14.9402` ms (`± 206.957`)	`0.98`
`comp_mlir_pointerLoop`	`30.9575` ms (`± 534.228`)	`31.3524` ms (`± 408.061`)	`0.99`
`comp_mlir_staticLoop`	`7.80078` ms (`± 157.875`)	`7.81756` ms (`± 161.29`)	`1.00`
`comp_mlir_fibonacci`	`13.3058` ms (`± 188.162`)	`13.8811` ms (`± 631.852`)	`0.96`
`comp_mlir_gcd`	`12.2529` ms (`± 170.578`)	`12.203` ms (`± 173.002`)	`1.00`
`comp_mlir_nestedIf10`	`13.3595` ms (`± 260.402`)	`13.2028` ms (`± 184.419`)	`1.01`
`comp_mlir_nestedIf100`	`27.685` ms (`± 323.049`)	`27.9993` ms (`± 545.846`)	`0.99`
`comp_mlir_chainedIf10`	`12.3306` ms (`± 253.886`)	`12.6389` ms (`± 303.089`)	`0.98`
`comp_mlir_chainedIf100`	`23.0489` ms (`± 318.557`)	`23.744` ms (`± 344.023`)	`0.97`
`comp_cpp_add`	`25.0292` ms (`± 559.584`)	`25.8209` ms (`± 628.014`)	`0.97`
`comp_cpp_ifThenElse`	`25.9164` ms (`± 533.82`)	`26.4155` ms (`± 565.394`)	`0.98`
`comp_cpp_deeplyNestedIfElse`	`27.1399` ms (`± 618.445`)	`26.8696` ms (`± 486.563`)	`1.01`
`comp_cpp_loop`	`26.0096` ms (`± 484.183`)	`26.0429` ms (`± 576.43`)	`1.00`
`comp_cpp_ifInsideLoop`	`27.1924` ms (`± 667.049`)	`26.3645` ms (`± 519.597`)	`1.03`
`comp_cpp_loopDirectCall`	`26.5718` ms (`± 617.311`)	`25.7393` ms (`± 438.859`)	`1.03`
`comp_cpp_pointerLoop`	`26.4797` ms (`± 507.878`)	`25.7937` ms (`± 569.907`)	`1.03`
`comp_cpp_staticLoop`	`25.6188` ms (`± 417.155`)	`25.5524` ms (`± 573.482`)	`1.00`
`comp_cpp_fibonacci`	`26.1646` ms (`± 694.59`)	`25.6873` ms (`± 669.172`)	`1.02`
`comp_cpp_gcd`	`26.434` ms (`± 660.491`)	`25.3463` ms (`± 478.98`)	`1.04`
`comp_cpp_nestedIf10`	`28.8039` ms (`± 477.691`)	`28.3012` ms (`± 454.98`)	`1.02`
`comp_cpp_nestedIf100`	`62.1752` ms (`± 605.21`)	`62.0761` ms (`± 1.27738`)	`1.00`
`comp_cpp_chainedIf10`	`31.2185` ms (`± 490.661`)	`30.9295` ms (`± 747.458`)	`1.01`
`comp_cpp_chainedIf100`	`91.9772` ms (`± 519.699`)	`91.2488` ms (`± 719.382`)	`1.01`
`comp_bc_add`	`14.4368` us (`± 2.54866`)	`14.3004` us (`± 2.32964`)	`1.01`
`comp_bc_ifThenElse`	`17.1303` us (`± 2.79817`)	`18.0949` us (`± 4.4491`)	`0.95`
`comp_bc_deeplyNestedIfElse`	`22.5405` us (`± 3.17576`)	`22.198` us (`± 3.26148`)	`1.02`
`comp_bc_loop`	`18.1043` us (`± 3.27463`)	`18.4165` us (`± 3.64301`)	`0.98`
`comp_bc_ifInsideLoop`	`20.8547` us (`± 3.5444`)	`20.6663` us (`± 2.96796`)	`1.01`
`comp_bc_loopDirectCall`	`18.8152` us (`± 2.87151`)	`18.7535` us (`± 4.45535`)	`1.00`
`comp_bc_pointerLoop`	`19.7746` us (`± 2.94881`)	`19.2815` us (`± 2.58919`)	`1.03`
`comp_bc_staticLoop`	`16.712` us (`± 3.11049`)	`16.5766` us (`± 3.64641`)	`1.01`
`comp_bc_fibonacci`	`18.6845` us (`± 3.57593`)	`18.0191` us (`± 2.73672`)	`1.04`
`comp_bc_gcd`	`16.9706` us (`± 2.76578`)	`17.8876` us (`± 3.47981`)	`0.95`
`comp_bc_nestedIf10`	`35.6601` us (`± 4.05481`)	`35.8101` us (`± 4.87022`)	`1.00`
`comp_bc_nestedIf100`	`183.04` us (`± 10.8907`)	`183.906` us (`± 10.5337`)	`1.00`
`comp_bc_chainedIf10`	`51.2047` us (`± 7.09329`)	`51.6162` us (`± 7.77583`)	`0.99`
`comp_bc_chainedIf100`	`282.947` us (`± 15.505`)	`295.087` us (`± 13.9286`)	`0.96`
`comp_asmjit_add`	`21.6908` us (`± 4.18288`)	`21.9139` us (`± 3.98743`)	`0.99`
`comp_asmjit_ifThenElse`	`34.9692` us (`± 5.60785`)	`35.0363` us (`± 5.16342`)	`1.00`
`comp_asmjit_deeplyNestedIfElse`	`60.4437` us (`± 13.5721`)	`59.9185` us (`± 9.08255`)	`1.01`
`comp_asmjit_loop`	`36.2086` us (`± 4.46217`)	`42.6871` us (`± 8.19297`)	`0.85`
`comp_asmjit_ifInsideLoop`	`63.2648` us (`± 13.0771`)	`59.8357` us (`± 9.79118`)	`1.06`
`comp_asmjit_loopDirectCall`	`46.524` us (`± 7.93802`)	`47.437` us (`± 8.15676`)	`0.98`
`comp_asmjit_pointerLoop`	`49.8837` us (`± 9.41894`)	`50.3213` us (`± 9.40177`)	`0.99`
`comp_asmjit_staticLoop`	`29.0133` us (`± 4.47306`)	`28.9683` us (`± 4.73376`)	`1.00`
`comp_asmjit_fibonacci`	`45.0761` us (`± 8.60741`)	`45.5753` us (`± 8.89015`)	`0.99`
`comp_asmjit_gcd`	`36.4401` us (`± 5.74717`)	`36.4937` us (`± 5.22977`)	`1.00`
`comp_asmjit_nestedIf10`	`109.118` us (`± 12.9306`)	`113.761` us (`± 17.05`)	`0.96`
`comp_asmjit_nestedIf100`	`1.13697` ms (`± 27.5682`)	`1.17881` ms (`± 31.5389`)	`0.96`
`comp_asmjit_chainedIf10`	`163.653` us (`± 14.1508`)	`167.036` us (`± 13.8454`)	`0.98`
`comp_asmjit_chainedIf100`	`2.26125` ms (`± 48.699`)	`2.28964` ms (`± 33.8526`)	`0.99`
`ir_add`	`915.503` ns (`± 133.434`)	`864.571` ns (`± 91.557`)	`1.06`
`ir_ifThenElse`	`2.61747` us (`± 376.662`)	`2.57735` us (`± 319.73`)	`1.02`
`ir_deeplyNestedIfElse`	`7.08122` us (`± 1.12608`)	`6.89755` us (`± 897.996`)	`1.03`
`ir_loop`	`3.11804` us (`± 380.98`)	`3.10713` us (`± 431.249`)	`1.00`
`ir_ifInsideLoop`	`5.87999` us (`± 652.3`)	`5.92666` us (`± 746.789`)	`0.99`
`ir_loopDirectCall`	`3.44648` us (`± 496.331`)	`3.30004` us (`± 301.857`)	`1.04`
`ir_pointerLoop`	`4.01372` us (`± 610.745`)	`3.85236` us (`± 474.74`)	`1.04`
`ir_staticLoop`	`2.44905` us (`± 454.408`)	`2.30272` us (`± 223.316`)	`1.06`
`ir_fibonacci`	`3.34773` us (`± 518.912`)	`3.27638` us (`± 425.306`)	`1.02`
`ir_gcd`	`2.77866` us (`± 381.841`)	`2.65842` us (`± 228.618`)	`1.05`
`ir_nestedIf10`	`16.7884` us (`± 1.8008`)	`16.4102` us (`± 2.05049`)	`1.02`
`ir_nestedIf100`	`191.626` us (`± 8.21905`)	`193.63` us (`± 6.82735`)	`0.99`
`ir_chainedIf10`	`30.0649` us (`± 2.99335`)	`29.6755` us (`± 2.89174`)	`1.01`
`ir_chainedIf100`	`369.322` us (`± 11.2126`)	`365.381` us (`± 11.6014`)	`1.01`
`ssa_add`	`205.072` ns (`± 33.8899`)	`196.22` ns (`± 22.809`)	`1.05`
`ssa_ifThenElse`	`512.556` ns (`± 77.9121`)	`475.7` ns (`± 40.9818`)	`1.08`
`ssa_deeplyNestedIfElse`	`1.23927` us (`± 123.27`)	`1.22973` us (`± 127.069`)	`1.01`
`ssa_loop`	`535.921` ns (`± 75.3232`)	`505.673` ns (`± 57.5818`)	`1.06`
`ssa_ifInsideLoop`	`965.397` ns (`± 99.8991`)	`978.531` ns (`± 117.327`)	`0.99`
`ssa_loopDirectCall`	`529.281` ns (`± 63.8081`)	`500.025` ns (`± 44.5004`)	`1.06`
`ssa_pointerLoop`	`633.994` ns (`± 69.0839`)	`623.746` ns (`± 62.4338`)	`1.02`
`ssa_staticLoop`	`552.406` ns (`± 77.1713`)	`550.173` ns (`± 89.5966`)	`1.00`
`ssa_fibonacci`	`531.615` ns (`± 54.0431`)	`555.672` ns (`± 94.9023`)	`0.96`
`ssa_gcd`	`462.974` ns (`± 34.9472`)	`520.723` ns (`± 168.769`)	`0.89`
`tiered_compile_addOne`	`41.3136` us (`± 10.6995`)	`41.6425` us (`± 11.2354`)	`0.99`
`single_compile_mlir_addOne`	`6.52105` ms (`± 216.617`)	`6.48998` ms (`± 329.093`)	`1.00`
`single_compile_cpp_addOne`	`26.1465` ms (`± 557.56`)	`26.1906` ms (`± 780.711`)	`1.00`
`single_compile_bc_addOne`	`42.5415` us (`± 12.4946`)	`42.3836` us (`± 11.0238`)	`1.00`
`tiered_compile_sumLoop`	`61.71` us (`± 13.0839`)	`61.1521` us (`± 12.1078`)	`1.01`
`single_compile_mlir_sumLoop`	`8.57714` ms (`± 236.982`)	`8.67139` ms (`± 439.258`)	`0.99`
`single_compile_cpp_sumLoop`	`26.7041` ms (`± 861.181`)	`26.6711` ms (`± 606.446`)	`1.00`
`single_compile_bc_sumLoop`	`62.3097` us (`± 13.6451`)	`62.8708` us (`± 14.5141`)	`0.99`
`e2e_tiered_bc_to_mlir`	`42.889` us (`± 12.4874`)	`43.2149` us (`± 13.5364`)	`0.99`
`e2e_single_mlir`	`8.35564` ms (`± 208.594`)	`8.24677` ms (`± 145.888`)	`1.01`
`exec_mlir_add`	`9.44482` ns (`± 0.537266`)	`9.73747` ns (`± 0.898217`)	`0.97`
`exec_mlir_fibonacci`	`13.5948` us (`± 1.26605`)	`13.6152` us (`± 1.45369`)	`1.00`
`exec_mlir_sum`	`575.442` us (`± 22.2906`)	`571.848` us (`± 25.1486`)	`1.01`
`exec_cpp_add`	`4.66317` ns (`± 0.530745`)	`4.84943` ns (`± 1.04744`)	`0.96`
`exec_cpp_fibonacci`	`96.4577` us (`± 8.40976`)	`96.8868` us (`± 9.72657`)	`1.00`
`exec_cpp_sum`	`35.9518` ms (`± 96.2803`)	`36.1302` ms (`± 440.313`)	`1.00`
`exec_bc_add`	`44.2286` ns (`± 5.80522`)	`42.1954` ns (`± 3.82737`)	`1.05`
`exec_bc_fibonacci`	`900.068` us (`± 7.13396`)	`906.298` us (`± 30.0336`)	`0.99`
`exec_bc_sum`	`190.755` ms (`± 388.855`)	`190.715` ms (`± 498.184`)	`1.00`
`exec_asmjit_add`	`3.20099` ns (`± 0.290477`)	`3.27822` ns (`± 0.499463`)	`0.98`
`exec_asmjit_fibonacci`	`21.4862` us (`± 2.2355`)	`21.9394` us (`± 3.5088`)	`0.98`
`exec_asmjit_sum`	`4.59414` ms (`± 15.061`)	`4.61489` ms (`± 20.3506`)	`1.00`
`exec_bc_addOne`	`39.0611` ns (`± 8.686`)	`35.4491` ns (`± 6.68627`)	`1.10`
`exec_mlir_addOne`	`283.751` ns (`± 5.97818`)	`291.212` ns (`± 10.1709`)	`0.97`
`exec_cpp_addOne`	`4.05063` ns (`± 0.719371`)	`3.94796` ns (`± 0.514023`)	`1.03`
`exec_interpreted_addOne`	`37.3876` ns (`± 1.81838`)	`38.1409` ns (`± 2.13248`)	`0.98`

This comment was automatically generated by workflow using github-action-benchmark.
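
Judging by the rows above, the Ratio column is Current divided by Previous, rounded to two decimals; for example, 22.5332 / 14.7389 ≈ 1.53 for `completing_trace_deeplyNestedIfElse`, the only row that stands out. The following is a small sketch for recomputing the column and listing rows above a threshold. The `Row` type, the helper names, and the threshold of 1.5 are assumptions for illustration, not settings taken from this workflow, and both timings must be expressed in the same unit.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// One table row; `current` and `previous` must use the same time unit.
struct Row {
    std::string name;
    double current;
    double previous;
};

// Ratio as it appears in the table: values above 1.0 mean the current run is slower.
inline double ratio(const Row& r) {
    assert(r.previous > 0.0);
    return r.current / r.previous;
}

// Rounds to two decimals, matching the precision shown in the table.
inline double roundTo2(double x) { return std::round(x * 100.0) / 100.0; }

// Returns the names of all rows whose ratio exceeds `threshold`.
inline std::vector<std::string> slowerThan(const std::vector<Row>& rows, double threshold) {
    std::vector<std::string> out;
    for (const Row& r : rows) {
        if (ratio(r) > threshold) out.push_back(r.name);
    }
    return out;
}
```

Given the spread reported next to each measurement, a single run above the threshold is a hint to re-run the benchmark rather than proof of a regression.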

New plugin that profiles code segments inside compiled Nautilus functions. ProfileRegion::start()/stop() calls compile into the generated code via invoke(), working across all backends (MLIR, C++, Bytecode, AsmJit).

Core API - ProfileRegion:

    static ProfileRegion normalize("normalize");

    val<int32_t> pipeline(val<int32_t*> data, val<int32_t> size) {
        normalize.start();
        // code + invoke(native_func, ...)
        ...
        normalize.stop();
        return size;
    }

Features:
- Thread-safe: entry timestamps in thread-local storage, atomics for aggregates
- Hardware counters via perf_event_open (Linux): cycles, instructions, cache/branch misses
- Perf correlation: records [start,stop] intervals, correlates with `perf script` samples to produce flame charts with full native call stacks scoped to regions
- Programmatic perf: Profiler::start()/stop() auto-forks perf record (Linux) or sample (macOS), parses output, and correlates on stop
- Export: Chrome Trace JSON (Perfetto), folded stacks (flamegraph.pl)
- Graceful degradation when perf/sample unavailable (containers, VMs)

Tests: 10 test cases, 475 assertions across all backends

https://claude.ai/code/session_01SLkb9NXtrjb4QAkiUTTMf9
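
The thread-safety approach described above (entry timestamps in thread-local storage, atomics for aggregates) might look roughly like the sketch below. `RegionSketch` is a hypothetical stand-in, not the plugin's `ProfileRegion`: it is called directly from C++ rather than compiled into generated code via invoke(), and it assumes a region is not re-entered on the same thread before it is stopped.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <unordered_map>

// Hypothetical region type illustrating the TLS-plus-atomics split.
class RegionSketch {
public:
    explicit RegionSketch(const char* name) : name_(name) { assert(name != nullptr); }

    // Stores the entry timestamp in a map owned by the calling thread.
    void start() { entryTimes()[this] = nowNs(); }

    // Folds the elapsed time into shared aggregates using atomic adds.
    void stop() {
        auto& entries = entryTimes();
        auto it = entries.find(this);
        if (it == entries.end()) return;  // stop() without start() on this thread
        const uint64_t elapsed = nowNs() - it->second;
        entries.erase(it);
        calls_.fetch_add(1, std::memory_order_relaxed);
        totalNs_.fetch_add(elapsed, std::memory_order_relaxed);
    }

    const char* name() const { return name_; }
    uint64_t calls() const { return calls_.load(std::memory_order_relaxed); }
    uint64_t totalNs() const { return totalNs_.load(std::memory_order_relaxed); }

private:
    static uint64_t nowNs() {
        using namespace std::chrono;
        return static_cast<uint64_t>(
            duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
    }

    // Per-thread entry timestamps, keyed by region, so threads never contend here.
    static std::unordered_map<const RegionSketch*, uint64_t>& entryTimes() {
        thread_local std::unordered_map<const RegionSketch*, uint64_t> map;
        return map;
    }

    const char* name_;
    std::atomic<uint64_t> calls_{0};
    std::atomic<uint64_t> totalNs_{0};
};
```

Relaxed ordering is sufficient in this sketch because the counters are independent totals that are only read for reporting; a reader that needs a consistent pair of `calls()` and `totalNs()` would have to stop all profiled threads first.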

github-actions bot reviewed Apr 12, 2026

PhilippGrulich force-pushed the claude/nautilus-profiling-plugin-AGwCD branch from 1577316 to b889260 on April 12, 2026 13:45

PhilippGrulich wants to merge 1 commit into main from claude/nautilus-profiling-plugin-AGwCD
