Add nautilus-profiling plugin for runtime performance analysis #244

Open
PhilippGrulich wants to merge 1 commit into main from claude/nautilus-profiling-plugin-AGwCD

@PhilippGrulich
Member

New plugin that provides cross-backend function profiling for Nautilus:

  • ProfiledFunction wrapper with zero overhead when profiling is inactive
  • Thread-local call stacks and ring buffers for a lock-free hot path
  • Chrome Trace Format export (viewable in Perfetto/chrome://tracing)
  • Folded stacks export (for Brendan Gregg's flamegraph.pl)
  • Perf map writer (/tmp/perf-.map) for JIT symbol resolution
  • Sampling support to reduce overhead for high-frequency calls
  • Optional CPU cycle collection via rdtsc (x86)
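
To make the Chrome Trace Format export concrete, here is a minimal sketch of how completed calls could be serialized as "complete" events (`ph` = `"X"`, with `ts` and `dur` in microseconds). The `TraceEvent` struct and the `escapeJson` and `toChromeTrace` helpers are hypothetical names chosen for illustration and are not taken from the plugin.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record of one completed call; not the plugin's actual type.
struct TraceEvent {
    std::string name;     // function or region name
    std::uint64_t tsUs;   // start timestamp in microseconds
    std::uint64_t durUs;  // duration in microseconds
    std::uint32_t pid;    // process id
    std::uint32_t tid;    // thread id
};

// Escapes quotes and backslashes so the name stays a valid JSON string.
// Assumption: names contain no control characters.
std::string escapeJson(const std::string& s) {
    std::string out;
    for (char c : s) {
        if (c == '"' || c == '\\') out += '\\';
        out += c;
    }
    return out;
}

// Serializes events in the JSON object form of the Chrome Trace Event Format.
std::string toChromeTrace(const std::vector<TraceEvent>& events) {
    std::ostringstream os;
    os << "{\"traceEvents\":[";
    for (std::size_t i = 0; i < events.size(); ++i) {
        const TraceEvent& e = events[i];
        if (i != 0) os << ',';
        os << "{\"name\":\"" << escapeJson(e.name) << "\",\"ph\":\"X\""
           << ",\"ts\":" << e.tsUs << ",\"dur\":" << e.durUs
           << ",\"pid\":" << e.pid << ",\"tid\":" << e.tid << '}';
    }
    os << "]}";
    return os.str();
}
```

Writing the returned string to a `.json` file should give something that Perfetto or chrome://tracing can load, with one bar per event on the track for its `tid`.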

https://claude.ai/code/session_01SLkb9NXtrjb4QAkiUTTMf9

Contributor

@github-actions bot left a comment

Tracing Benchmark

| Benchmark suite | Current: b889260 | Previous: 018a921 | Ratio |
|---|---|---|---|
| trace_add | 2.3814 us (± 231.732) | 2.40287 us (± 287.781) | 0.99 |
| completing_trace_add | 2.50615 us (± 433.253) | 2.37871 us (± 348.935) | 1.05 |
| trace_ifThenElse | 11.4709 us (± 1.69924) | 11.0567 us (± 1.31609) | 1.04 |
| completing_trace_ifThenElse | 5.08671 us (± 642.17) | 4.96633 us (± 492.677) | 1.02 |
| trace_deeplyNestedIfElse | 33.907 us (± 6.2274) | 33.3283 us (± 3.72331) | 1.02 |
| completing_trace_deeplyNestedIfElse | 22.5332 us (± 3.04543) | 14.7389 us (± 2.36618) | 1.53 |
| trace_loop | 10.9989 us (± 1.52619) | 11.4174 us (± 1.84516) | 0.96 |
| completing_trace_loop | 5.35757 us (± 941.043) | 5.07072 us (± 688.523) | 1.06 |
| trace_ifInsideLoop | 22.3528 us (± 3.77478) | 21.8217 us (± 3.36396) | 1.02 |
| completing_trace_ifInsideLoop | 9.28304 us (± 1.11385) | 9.37943 us (± 1.64726) | 0.99 |
| trace_loopDirectCall | 11.2732 us (± 1.70474) | 11.2594 us (± 1.69256) | 1.00 |
| completing_trace_loopDirectCall | 5.28109 us (± 698.269) | 5.01636 us (± 613.598) | 1.05 |
| trace_pointerLoop | 16.346 us (± 2.94285) | 16.9883 us (± 3.6558) | 0.96 |
| completing_trace_pointerLoop | 11.1447 us (± 1.96757) | 10.3234 us (± 1.48734) | 1.08 |
| trace_staticLoop | 9.47546 us (± 1.16878) | 11.4179 us (± 1.50262) | 0.83 |
| completing_trace_staticLoop | 9.45217 us (± 1.54618) | 10.2118 us (± 1.40382) | 0.93 |
| trace_fibonacci | 12.6897 us (± 2.27286) | 12.6525 us (± 2.05563) | 1.00 |
| completing_trace_fibonacci | 6.60695 us (± 1.02724) | 6.53429 us (± 804.12) | 1.01 |
| trace_gcd | 10.3277 us (± 1.60794) | 10.5287 us (± 1.85354) | 0.98 |
| completing_trace_gcd | 4.53296 us (± 586.245) | 4.43612 us (± 660.368) | 1.02 |
| trace_nestedIf10 | 59.3288 us (± 7.69771) | 62.141 us (± 13.7125) | 0.95 |
| completing_trace_nestedIf10 | 60.0057 us (± 7.40374) | 59.6267 us (± 7.9754) | 1.01 |
| trace_nestedIf100 | 1.73314 ms (± 60.5668) | 1.72247 ms (± 66.8355) | 1.01 |
| completing_trace_nestedIf100 | 1.79125 ms (± 62.83) | 1.77427 ms (± 61.2359) | 1.01 |
| trace_chainedIf10 | 137.845 us (± 12.2125) | 137.168 us (± 17.8157) | 1.00 |
| completing_trace_chainedIf10 | 72.2144 us (± 13.1599) | 70.1303 us (± 10.357) | 1.03 |
| trace_chainedIf100 | 5.1015 ms (± 77.4143) | 5.08999 ms (± 63.2729) | 1.00 |
| completing_trace_chainedIf100 | 2.66348 ms (± 61.8628) | 2.6717 ms (± 62.2542) | 1.00 |
| comp_mlir_add | 8.33335 ms (± 159.834) | 8.48512 ms (± 244.662) | 0.98 |
| comp_mlir_ifThenElse | 8.92078 ms (± 156.007) | 9.09679 ms (± 190.418) | 0.98 |
| comp_mlir_deeplyNestedIfElse | 7.81353 ms (± 135.917) | 7.89905 ms (± 173.383) | 0.99 |
| comp_mlir_loop | 9.96608 ms (± 311.028) | 10.1496 ms (± 230.42) | 0.98 |
| comp_mlir_ifInsideLoop | 32.0703 ms (± 482.108) | 32.388 ms (± 435.684) | 0.99 |
| comp_mlir_loopDirectCall | 14.7127 ms (± 269.935) | 14.9402 ms (± 206.957) | 0.98 |
| comp_mlir_pointerLoop | 30.9575 ms (± 534.228) | 31.3524 ms (± 408.061) | 0.99 |
| comp_mlir_staticLoop | 7.80078 ms (± 157.875) | 7.81756 ms (± 161.29) | 1.00 |
| comp_mlir_fibonacci | 13.3058 ms (± 188.162) | 13.8811 ms (± 631.852) | 0.96 |
| comp_mlir_gcd | 12.2529 ms (± 170.578) | 12.203 ms (± 173.002) | 1.00 |
| comp_mlir_nestedIf10 | 13.3595 ms (± 260.402) | 13.2028 ms (± 184.419) | 1.01 |
| comp_mlir_nestedIf100 | 27.685 ms (± 323.049) | 27.9993 ms (± 545.846) | 0.99 |
| comp_mlir_chainedIf10 | 12.3306 ms (± 253.886) | 12.6389 ms (± 303.089) | 0.98 |
| comp_mlir_chainedIf100 | 23.0489 ms (± 318.557) | 23.744 ms (± 344.023) | 0.97 |
| comp_cpp_add | 25.0292 ms (± 559.584) | 25.8209 ms (± 628.014) | 0.97 |
| comp_cpp_ifThenElse | 25.9164 ms (± 533.82) | 26.4155 ms (± 565.394) | 0.98 |
| comp_cpp_deeplyNestedIfElse | 27.1399 ms (± 618.445) | 26.8696 ms (± 486.563) | 1.01 |
| comp_cpp_loop | 26.0096 ms (± 484.183) | 26.0429 ms (± 576.43) | 1.00 |
| comp_cpp_ifInsideLoop | 27.1924 ms (± 667.049) | 26.3645 ms (± 519.597) | 1.03 |
| comp_cpp_loopDirectCall | 26.5718 ms (± 617.311) | 25.7393 ms (± 438.859) | 1.03 |
| comp_cpp_pointerLoop | 26.4797 ms (± 507.878) | 25.7937 ms (± 569.907) | 1.03 |
| comp_cpp_staticLoop | 25.6188 ms (± 417.155) | 25.5524 ms (± 573.482) | 1.00 |
| comp_cpp_fibonacci | 26.1646 ms (± 694.59) | 25.6873 ms (± 669.172) | 1.02 |
| comp_cpp_gcd | 26.434 ms (± 660.491) | 25.3463 ms (± 478.98) | 1.04 |
| comp_cpp_nestedIf10 | 28.8039 ms (± 477.691) | 28.3012 ms (± 454.98) | 1.02 |
| comp_cpp_nestedIf100 | 62.1752 ms (± 605.21) | 62.0761 ms (± 1.27738) | 1.00 |
| comp_cpp_chainedIf10 | 31.2185 ms (± 490.661) | 30.9295 ms (± 747.458) | 1.01 |
| comp_cpp_chainedIf100 | 91.9772 ms (± 519.699) | 91.2488 ms (± 719.382) | 1.01 |
| comp_bc_add | 14.4368 us (± 2.54866) | 14.3004 us (± 2.32964) | 1.01 |
| comp_bc_ifThenElse | 17.1303 us (± 2.79817) | 18.0949 us (± 4.4491) | 0.95 |
| comp_bc_deeplyNestedIfElse | 22.5405 us (± 3.17576) | 22.198 us (± 3.26148) | 1.02 |
| comp_bc_loop | 18.1043 us (± 3.27463) | 18.4165 us (± 3.64301) | 0.98 |
| comp_bc_ifInsideLoop | 20.8547 us (± 3.5444) | 20.6663 us (± 2.96796) | 1.01 |
| comp_bc_loopDirectCall | 18.8152 us (± 2.87151) | 18.7535 us (± 4.45535) | 1.00 |
| comp_bc_pointerLoop | 19.7746 us (± 2.94881) | 19.2815 us (± 2.58919) | 1.03 |
| comp_bc_staticLoop | 16.712 us (± 3.11049) | 16.5766 us (± 3.64641) | 1.01 |
| comp_bc_fibonacci | 18.6845 us (± 3.57593) | 18.0191 us (± 2.73672) | 1.04 |
| comp_bc_gcd | 16.9706 us (± 2.76578) | 17.8876 us (± 3.47981) | 0.95 |
| comp_bc_nestedIf10 | 35.6601 us (± 4.05481) | 35.8101 us (± 4.87022) | 1.00 |
| comp_bc_nestedIf100 | 183.04 us (± 10.8907) | 183.906 us (± 10.5337) | 1.00 |
| comp_bc_chainedIf10 | 51.2047 us (± 7.09329) | 51.6162 us (± 7.77583) | 0.99 |
| comp_bc_chainedIf100 | 282.947 us (± 15.505) | 295.087 us (± 13.9286) | 0.96 |
| comp_asmjit_add | 21.6908 us (± 4.18288) | 21.9139 us (± 3.98743) | 0.99 |
| comp_asmjit_ifThenElse | 34.9692 us (± 5.60785) | 35.0363 us (± 5.16342) | 1.00 |
| comp_asmjit_deeplyNestedIfElse | 60.4437 us (± 13.5721) | 59.9185 us (± 9.08255) | 1.01 |
| comp_asmjit_loop | 36.2086 us (± 4.46217) | 42.6871 us (± 8.19297) | 0.85 |
| comp_asmjit_ifInsideLoop | 63.2648 us (± 13.0771) | 59.8357 us (± 9.79118) | 1.06 |
| comp_asmjit_loopDirectCall | 46.524 us (± 7.93802) | 47.437 us (± 8.15676) | 0.98 |
| comp_asmjit_pointerLoop | 49.8837 us (± 9.41894) | 50.3213 us (± 9.40177) | 0.99 |
| comp_asmjit_staticLoop | 29.0133 us (± 4.47306) | 28.9683 us (± 4.73376) | 1.00 |
| comp_asmjit_fibonacci | 45.0761 us (± 8.60741) | 45.5753 us (± 8.89015) | 0.99 |
| comp_asmjit_gcd | 36.4401 us (± 5.74717) | 36.4937 us (± 5.22977) | 1.00 |
| comp_asmjit_nestedIf10 | 109.118 us (± 12.9306) | 113.761 us (± 17.05) | 0.96 |
| comp_asmjit_nestedIf100 | 1.13697 ms (± 27.5682) | 1.17881 ms (± 31.5389) | 0.96 |
| comp_asmjit_chainedIf10 | 163.653 us (± 14.1508) | 167.036 us (± 13.8454) | 0.98 |
| comp_asmjit_chainedIf100 | 2.26125 ms (± 48.699) | 2.28964 ms (± 33.8526) | 0.99 |
| ir_add | 915.503 ns (± 133.434) | 864.571 ns (± 91.557) | 1.06 |
| ir_ifThenElse | 2.61747 us (± 376.662) | 2.57735 us (± 319.73) | 1.02 |
| ir_deeplyNestedIfElse | 7.08122 us (± 1.12608) | 6.89755 us (± 897.996) | 1.03 |
| ir_loop | 3.11804 us (± 380.98) | 3.10713 us (± 431.249) | 1.00 |
| ir_ifInsideLoop | 5.87999 us (± 652.3) | 5.92666 us (± 746.789) | 0.99 |
| ir_loopDirectCall | 3.44648 us (± 496.331) | 3.30004 us (± 301.857) | 1.04 |
| ir_pointerLoop | 4.01372 us (± 610.745) | 3.85236 us (± 474.74) | 1.04 |
| ir_staticLoop | 2.44905 us (± 454.408) | 2.30272 us (± 223.316) | 1.06 |
| ir_fibonacci | 3.34773 us (± 518.912) | 3.27638 us (± 425.306) | 1.02 |
| ir_gcd | 2.77866 us (± 381.841) | 2.65842 us (± 228.618) | 1.05 |
| ir_nestedIf10 | 16.7884 us (± 1.8008) | 16.4102 us (± 2.05049) | 1.02 |
| ir_nestedIf100 | 191.626 us (± 8.21905) | 193.63 us (± 6.82735) | 0.99 |
| ir_chainedIf10 | 30.0649 us (± 2.99335) | 29.6755 us (± 2.89174) | 1.01 |
| ir_chainedIf100 | 369.322 us (± 11.2126) | 365.381 us (± 11.6014) | 1.01 |
| ssa_add | 205.072 ns (± 33.8899) | 196.22 ns (± 22.809) | 1.05 |
| ssa_ifThenElse | 512.556 ns (± 77.9121) | 475.7 ns (± 40.9818) | 1.08 |
| ssa_deeplyNestedIfElse | 1.23927 us (± 123.27) | 1.22973 us (± 127.069) | 1.01 |
| ssa_loop | 535.921 ns (± 75.3232) | 505.673 ns (± 57.5818) | 1.06 |
| ssa_ifInsideLoop | 965.397 ns (± 99.8991) | 978.531 ns (± 117.327) | 0.99 |
| ssa_loopDirectCall | 529.281 ns (± 63.8081) | 500.025 ns (± 44.5004) | 1.06 |
| ssa_pointerLoop | 633.994 ns (± 69.0839) | 623.746 ns (± 62.4338) | 1.02 |
| ssa_staticLoop | 552.406 ns (± 77.1713) | 550.173 ns (± 89.5966) | 1.00 |
| ssa_fibonacci | 531.615 ns (± 54.0431) | 555.672 ns (± 94.9023) | 0.96 |
| ssa_gcd | 462.974 ns (± 34.9472) | 520.723 ns (± 168.769) | 0.89 |
| tiered_compile_addOne | 41.3136 us (± 10.6995) | 41.6425 us (± 11.2354) | 0.99 |
| single_compile_mlir_addOne | 6.52105 ms (± 216.617) | 6.48998 ms (± 329.093) | 1.00 |
| single_compile_cpp_addOne | 26.1465 ms (± 557.56) | 26.1906 ms (± 780.711) | 1.00 |
| single_compile_bc_addOne | 42.5415 us (± 12.4946) | 42.3836 us (± 11.0238) | 1.00 |
| tiered_compile_sumLoop | 61.71 us (± 13.0839) | 61.1521 us (± 12.1078) | 1.01 |
| single_compile_mlir_sumLoop | 8.57714 ms (± 236.982) | 8.67139 ms (± 439.258) | 0.99 |
| single_compile_cpp_sumLoop | 26.7041 ms (± 861.181) | 26.6711 ms (± 606.446) | 1.00 |
| single_compile_bc_sumLoop | 62.3097 us (± 13.6451) | 62.8708 us (± 14.5141) | 0.99 |
| e2e_tiered_bc_to_mlir | 42.889 us (± 12.4874) | 43.2149 us (± 13.5364) | 0.99 |
| e2e_single_mlir | 8.35564 ms (± 208.594) | 8.24677 ms (± 145.888) | 1.01 |
| exec_mlir_add | 9.44482 ns (± 0.537266) | 9.73747 ns (± 0.898217) | 0.97 |
| exec_mlir_fibonacci | 13.5948 us (± 1.26605) | 13.6152 us (± 1.45369) | 1.00 |
| exec_mlir_sum | 575.442 us (± 22.2906) | 571.848 us (± 25.1486) | 1.01 |
| exec_cpp_add | 4.66317 ns (± 0.530745) | 4.84943 ns (± 1.04744) | 0.96 |
| exec_cpp_fibonacci | 96.4577 us (± 8.40976) | 96.8868 us (± 9.72657) | 1.00 |
| exec_cpp_sum | 35.9518 ms (± 96.2803) | 36.1302 ms (± 440.313) | 1.00 |
| exec_bc_add | 44.2286 ns (± 5.80522) | 42.1954 ns (± 3.82737) | 1.05 |
| exec_bc_fibonacci | 900.068 us (± 7.13396) | 906.298 us (± 30.0336) | 0.99 |
| exec_bc_sum | 190.755 ms (± 388.855) | 190.715 ms (± 498.184) | 1.00 |
| exec_asmjit_add | 3.20099 ns (± 0.290477) | 3.27822 ns (± 0.499463) | 0.98 |
| exec_asmjit_fibonacci | 21.4862 us (± 2.2355) | 21.9394 us (± 3.5088) | 0.98 |
| exec_asmjit_sum | 4.59414 ms (± 15.061) | 4.61489 ms (± 20.3506) | 1.00 |
| exec_bc_addOne | 39.0611 ns (± 8.686) | 35.4491 ns (± 6.68627) | 1.10 |
| exec_mlir_addOne | 283.751 ns (± 5.97818) | 291.212 ns (± 10.1709) | 0.97 |
| exec_cpp_addOne | 4.05063 ns (± 0.719371) | 3.94796 ns (± 0.514023) | 1.03 |
| exec_interpreted_addOne | 37.3876 ns (± 1.81838) | 38.1409 ns (± 2.13248) | 0.98 |

This comment was automatically generated by workflow using github-action-benchmark.

New plugin that profiles code segments inside compiled Nautilus functions.
ProfileRegion::start()/stop() calls compile into the generated code via
invoke(), working across all backends (MLIR, C++, Bytecode, AsmJit).

Core API - ProfileRegion:
  static ProfileRegion normalize("normalize");
  val<int32_t> pipeline(val<int32_t*> data, val<int32_t> size) {
      normalize.start();
      // code + invoke(native_func, ...) ...
      normalize.stop();
      return size;
  }

Features:
- Thread-safe: entry timestamps in thread-local storage, atomics for aggregates
- Hardware counters via perf_event_open (Linux): cycles, instructions, cache/branch misses
- Perf correlation: records [start,stop] intervals, correlates with `perf script`
  samples to produce flame charts with full native call stacks scoped to regions
- Programmatic perf: Profiler::start()/stop() auto-forks perf record (Linux)
  or sample (macOS), parses output, and correlates on stop
- Export: Chrome Trace JSON (Perfetto), folded stacks (flamegraph.pl)
- Graceful degradation when perf/sample unavailable (containers, VMs)
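
The perf correlation step can be pictured as an interval lookup. The sketch below assigns each timestamped sample to the region interval that contains it and emits folded-stack lines (`frame;frame;... count`) in the form flamegraph.pl reads. The `Interval` and `Sample` types and the `correlate` function are hypothetical. The sketch assumes intervals do not overlap (one thread, no nested regions) and that sample stacks are already parsed into root-first, semicolon-separated strings. How the plugin actually parses `perf script` output is not shown in this PR description.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical types for illustration only.
struct Interval {
    std::uint64_t startNs;
    std::uint64_t stopNs;
    std::string region;
};

struct Sample {
    std::uint64_t timeNs;
    std::string stack;  // root-first frames, e.g. "main;pipeline;native_func"
};

// Returns folded-stack text with the region name prepended as the root frame.
// Samples that fall outside every interval are dropped.
std::string correlate(std::vector<Interval> intervals,
                      const std::vector<Sample>& samples) {
    std::sort(intervals.begin(), intervals.end(),
              [](const Interval& a, const Interval& b) { return a.startNs < b.startNs; });

    std::map<std::string, std::uint64_t> counts;
    for (const Sample& s : samples) {
        // Find the first interval that starts after the sample, then step back one.
        auto it = std::upper_bound(
            intervals.begin(), intervals.end(), s.timeNs,
            [](std::uint64_t t, const Interval& iv) { return t < iv.startNs; });
        if (it == intervals.begin()) continue;  // sample precedes all intervals
        --it;
        if (s.timeNs > it->stopNs) continue;    // sample falls in a gap
        ++counts[it->region + ";" + s.stack];
    }

    std::string out;
    for (const auto& kv : counts) {
        out += kv.first + " " + std::to_string(kv.second) + "\n";
    }
    return out;
}
```

Sorting once and using binary search per sample keeps the cost at O((n + m) log n) for n intervals and m samples, which matters when a hot region records many intervals.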

Tests: 10 test cases, 475 assertions across all backends

https://claude.ai/code/session_01SLkb9NXtrjb4QAkiUTTMf9
@PhilippGrulich force-pushed the claude/nautilus-profiling-plugin-AGwCD branch from 1577316 to b889260 on April 12, 2026 13:45