Add nautilus-profiling plugin for runtime performance analysis #244
Open
PhilippGrulich wants to merge 1 commit into main
Tracing Benchmark
| Benchmark suite | Current: b889260 | Previous: 018a921 | Ratio |
|---|---|---|---|
| trace_add | 2.3814 us (± 231.732) | 2.40287 us (± 287.781) | 0.99 |
| completing_trace_add | 2.50615 us (± 433.253) | 2.37871 us (± 348.935) | 1.05 |
| trace_ifThenElse | 11.4709 us (± 1.69924) | 11.0567 us (± 1.31609) | 1.04 |
| completing_trace_ifThenElse | 5.08671 us (± 642.17) | 4.96633 us (± 492.677) | 1.02 |
| trace_deeplyNestedIfElse | 33.907 us (± 6.2274) | 33.3283 us (± 3.72331) | 1.02 |
| completing_trace_deeplyNestedIfElse | 22.5332 us (± 3.04543) | 14.7389 us (± 2.36618) | 1.53 |
| trace_loop | 10.9989 us (± 1.52619) | 11.4174 us (± 1.84516) | 0.96 |
| completing_trace_loop | 5.35757 us (± 941.043) | 5.07072 us (± 688.523) | 1.06 |
| trace_ifInsideLoop | 22.3528 us (± 3.77478) | 21.8217 us (± 3.36396) | 1.02 |
| completing_trace_ifInsideLoop | 9.28304 us (± 1.11385) | 9.37943 us (± 1.64726) | 0.99 |
| trace_loopDirectCall | 11.2732 us (± 1.70474) | 11.2594 us (± 1.69256) | 1.00 |
| completing_trace_loopDirectCall | 5.28109 us (± 698.269) | 5.01636 us (± 613.598) | 1.05 |
| trace_pointerLoop | 16.346 us (± 2.94285) | 16.9883 us (± 3.6558) | 0.96 |
| completing_trace_pointerLoop | 11.1447 us (± 1.96757) | 10.3234 us (± 1.48734) | 1.08 |
| trace_staticLoop | 9.47546 us (± 1.16878) | 11.4179 us (± 1.50262) | 0.83 |
| completing_trace_staticLoop | 9.45217 us (± 1.54618) | 10.2118 us (± 1.40382) | 0.93 |
| trace_fibonacci | 12.6897 us (± 2.27286) | 12.6525 us (± 2.05563) | 1.00 |
| completing_trace_fibonacci | 6.60695 us (± 1.02724) | 6.53429 us (± 804.12) | 1.01 |
| trace_gcd | 10.3277 us (± 1.60794) | 10.5287 us (± 1.85354) | 0.98 |
| completing_trace_gcd | 4.53296 us (± 586.245) | 4.43612 us (± 660.368) | 1.02 |
| trace_nestedIf10 | 59.3288 us (± 7.69771) | 62.141 us (± 13.7125) | 0.95 |
| completing_trace_nestedIf10 | 60.0057 us (± 7.40374) | 59.6267 us (± 7.9754) | 1.01 |
| trace_nestedIf100 | 1.73314 ms (± 60.5668) | 1.72247 ms (± 66.8355) | 1.01 |
| completing_trace_nestedIf100 | 1.79125 ms (± 62.83) | 1.77427 ms (± 61.2359) | 1.01 |
| trace_chainedIf10 | 137.845 us (± 12.2125) | 137.168 us (± 17.8157) | 1.00 |
| completing_trace_chainedIf10 | 72.2144 us (± 13.1599) | 70.1303 us (± 10.357) | 1.03 |
| trace_chainedIf100 | 5.1015 ms (± 77.4143) | 5.08999 ms (± 63.2729) | 1.00 |
| completing_trace_chainedIf100 | 2.66348 ms (± 61.8628) | 2.6717 ms (± 62.2542) | 1.00 |
| comp_mlir_add | 8.33335 ms (± 159.834) | 8.48512 ms (± 244.662) | 0.98 |
| comp_mlir_ifThenElse | 8.92078 ms (± 156.007) | 9.09679 ms (± 190.418) | 0.98 |
| comp_mlir_deeplyNestedIfElse | 7.81353 ms (± 135.917) | 7.89905 ms (± 173.383) | 0.99 |
| comp_mlir_loop | 9.96608 ms (± 311.028) | 10.1496 ms (± 230.42) | 0.98 |
| comp_mlir_ifInsideLoop | 32.0703 ms (± 482.108) | 32.388 ms (± 435.684) | 0.99 |
| comp_mlir_loopDirectCall | 14.7127 ms (± 269.935) | 14.9402 ms (± 206.957) | 0.98 |
| comp_mlir_pointerLoop | 30.9575 ms (± 534.228) | 31.3524 ms (± 408.061) | 0.99 |
| comp_mlir_staticLoop | 7.80078 ms (± 157.875) | 7.81756 ms (± 161.29) | 1.00 |
| comp_mlir_fibonacci | 13.3058 ms (± 188.162) | 13.8811 ms (± 631.852) | 0.96 |
| comp_mlir_gcd | 12.2529 ms (± 170.578) | 12.203 ms (± 173.002) | 1.00 |
| comp_mlir_nestedIf10 | 13.3595 ms (± 260.402) | 13.2028 ms (± 184.419) | 1.01 |
| comp_mlir_nestedIf100 | 27.685 ms (± 323.049) | 27.9993 ms (± 545.846) | 0.99 |
| comp_mlir_chainedIf10 | 12.3306 ms (± 253.886) | 12.6389 ms (± 303.089) | 0.98 |
| comp_mlir_chainedIf100 | 23.0489 ms (± 318.557) | 23.744 ms (± 344.023) | 0.97 |
| comp_cpp_add | 25.0292 ms (± 559.584) | 25.8209 ms (± 628.014) | 0.97 |
| comp_cpp_ifThenElse | 25.9164 ms (± 533.82) | 26.4155 ms (± 565.394) | 0.98 |
| comp_cpp_deeplyNestedIfElse | 27.1399 ms (± 618.445) | 26.8696 ms (± 486.563) | 1.01 |
| comp_cpp_loop | 26.0096 ms (± 484.183) | 26.0429 ms (± 576.43) | 1.00 |
| comp_cpp_ifInsideLoop | 27.1924 ms (± 667.049) | 26.3645 ms (± 519.597) | 1.03 |
| comp_cpp_loopDirectCall | 26.5718 ms (± 617.311) | 25.7393 ms (± 438.859) | 1.03 |
| comp_cpp_pointerLoop | 26.4797 ms (± 507.878) | 25.7937 ms (± 569.907) | 1.03 |
| comp_cpp_staticLoop | 25.6188 ms (± 417.155) | 25.5524 ms (± 573.482) | 1.00 |
| comp_cpp_fibonacci | 26.1646 ms (± 694.59) | 25.6873 ms (± 669.172) | 1.02 |
| comp_cpp_gcd | 26.434 ms (± 660.491) | 25.3463 ms (± 478.98) | 1.04 |
| comp_cpp_nestedIf10 | 28.8039 ms (± 477.691) | 28.3012 ms (± 454.98) | 1.02 |
| comp_cpp_nestedIf100 | 62.1752 ms (± 605.21) | 62.0761 ms (± 1.27738) | 1.00 |
| comp_cpp_chainedIf10 | 31.2185 ms (± 490.661) | 30.9295 ms (± 747.458) | 1.01 |
| comp_cpp_chainedIf100 | 91.9772 ms (± 519.699) | 91.2488 ms (± 719.382) | 1.01 |
| comp_bc_add | 14.4368 us (± 2.54866) | 14.3004 us (± 2.32964) | 1.01 |
| comp_bc_ifThenElse | 17.1303 us (± 2.79817) | 18.0949 us (± 4.4491) | 0.95 |
| comp_bc_deeplyNestedIfElse | 22.5405 us (± 3.17576) | 22.198 us (± 3.26148) | 1.02 |
| comp_bc_loop | 18.1043 us (± 3.27463) | 18.4165 us (± 3.64301) | 0.98 |
| comp_bc_ifInsideLoop | 20.8547 us (± 3.5444) | 20.6663 us (± 2.96796) | 1.01 |
| comp_bc_loopDirectCall | 18.8152 us (± 2.87151) | 18.7535 us (± 4.45535) | 1.00 |
| comp_bc_pointerLoop | 19.7746 us (± 2.94881) | 19.2815 us (± 2.58919) | 1.03 |
| comp_bc_staticLoop | 16.712 us (± 3.11049) | 16.5766 us (± 3.64641) | 1.01 |
| comp_bc_fibonacci | 18.6845 us (± 3.57593) | 18.0191 us (± 2.73672) | 1.04 |
| comp_bc_gcd | 16.9706 us (± 2.76578) | 17.8876 us (± 3.47981) | 0.95 |
| comp_bc_nestedIf10 | 35.6601 us (± 4.05481) | 35.8101 us (± 4.87022) | 1.00 |
| comp_bc_nestedIf100 | 183.04 us (± 10.8907) | 183.906 us (± 10.5337) | 1.00 |
| comp_bc_chainedIf10 | 51.2047 us (± 7.09329) | 51.6162 us (± 7.77583) | 0.99 |
| comp_bc_chainedIf100 | 282.947 us (± 15.505) | 295.087 us (± 13.9286) | 0.96 |
| comp_asmjit_add | 21.6908 us (± 4.18288) | 21.9139 us (± 3.98743) | 0.99 |
| comp_asmjit_ifThenElse | 34.9692 us (± 5.60785) | 35.0363 us (± 5.16342) | 1.00 |
| comp_asmjit_deeplyNestedIfElse | 60.4437 us (± 13.5721) | 59.9185 us (± 9.08255) | 1.01 |
| comp_asmjit_loop | 36.2086 us (± 4.46217) | 42.6871 us (± 8.19297) | 0.85 |
| comp_asmjit_ifInsideLoop | 63.2648 us (± 13.0771) | 59.8357 us (± 9.79118) | 1.06 |
| comp_asmjit_loopDirectCall | 46.524 us (± 7.93802) | 47.437 us (± 8.15676) | 0.98 |
| comp_asmjit_pointerLoop | 49.8837 us (± 9.41894) | 50.3213 us (± 9.40177) | 0.99 |
| comp_asmjit_staticLoop | 29.0133 us (± 4.47306) | 28.9683 us (± 4.73376) | 1.00 |
| comp_asmjit_fibonacci | 45.0761 us (± 8.60741) | 45.5753 us (± 8.89015) | 0.99 |
| comp_asmjit_gcd | 36.4401 us (± 5.74717) | 36.4937 us (± 5.22977) | 1.00 |
| comp_asmjit_nestedIf10 | 109.118 us (± 12.9306) | 113.761 us (± 17.05) | 0.96 |
| comp_asmjit_nestedIf100 | 1.13697 ms (± 27.5682) | 1.17881 ms (± 31.5389) | 0.96 |
| comp_asmjit_chainedIf10 | 163.653 us (± 14.1508) | 167.036 us (± 13.8454) | 0.98 |
| comp_asmjit_chainedIf100 | 2.26125 ms (± 48.699) | 2.28964 ms (± 33.8526) | 0.99 |
| ir_add | 915.503 ns (± 133.434) | 864.571 ns (± 91.557) | 1.06 |
| ir_ifThenElse | 2.61747 us (± 376.662) | 2.57735 us (± 319.73) | 1.02 |
| ir_deeplyNestedIfElse | 7.08122 us (± 1.12608) | 6.89755 us (± 897.996) | 1.03 |
| ir_loop | 3.11804 us (± 380.98) | 3.10713 us (± 431.249) | 1.00 |
| ir_ifInsideLoop | 5.87999 us (± 652.3) | 5.92666 us (± 746.789) | 0.99 |
| ir_loopDirectCall | 3.44648 us (± 496.331) | 3.30004 us (± 301.857) | 1.04 |
| ir_pointerLoop | 4.01372 us (± 610.745) | 3.85236 us (± 474.74) | 1.04 |
| ir_staticLoop | 2.44905 us (± 454.408) | 2.30272 us (± 223.316) | 1.06 |
| ir_fibonacci | 3.34773 us (± 518.912) | 3.27638 us (± 425.306) | 1.02 |
| ir_gcd | 2.77866 us (± 381.841) | 2.65842 us (± 228.618) | 1.05 |
| ir_nestedIf10 | 16.7884 us (± 1.8008) | 16.4102 us (± 2.05049) | 1.02 |
| ir_nestedIf100 | 191.626 us (± 8.21905) | 193.63 us (± 6.82735) | 0.99 |
| ir_chainedIf10 | 30.0649 us (± 2.99335) | 29.6755 us (± 2.89174) | 1.01 |
| ir_chainedIf100 | 369.322 us (± 11.2126) | 365.381 us (± 11.6014) | 1.01 |
| ssa_add | 205.072 ns (± 33.8899) | 196.22 ns (± 22.809) | 1.05 |
| ssa_ifThenElse | 512.556 ns (± 77.9121) | 475.7 ns (± 40.9818) | 1.08 |
| ssa_deeplyNestedIfElse | 1.23927 us (± 123.27) | 1.22973 us (± 127.069) | 1.01 |
| ssa_loop | 535.921 ns (± 75.3232) | 505.673 ns (± 57.5818) | 1.06 |
| ssa_ifInsideLoop | 965.397 ns (± 99.8991) | 978.531 ns (± 117.327) | 0.99 |
| ssa_loopDirectCall | 529.281 ns (± 63.8081) | 500.025 ns (± 44.5004) | 1.06 |
| ssa_pointerLoop | 633.994 ns (± 69.0839) | 623.746 ns (± 62.4338) | 1.02 |
| ssa_staticLoop | 552.406 ns (± 77.1713) | 550.173 ns (± 89.5966) | 1.00 |
| ssa_fibonacci | 531.615 ns (± 54.0431) | 555.672 ns (± 94.9023) | 0.96 |
| ssa_gcd | 462.974 ns (± 34.9472) | 520.723 ns (± 168.769) | 0.89 |
| tiered_compile_addOne | 41.3136 us (± 10.6995) | 41.6425 us (± 11.2354) | 0.99 |
| single_compile_mlir_addOne | 6.52105 ms (± 216.617) | 6.48998 ms (± 329.093) | 1.00 |
| single_compile_cpp_addOne | 26.1465 ms (± 557.56) | 26.1906 ms (± 780.711) | 1.00 |
| single_compile_bc_addOne | 42.5415 us (± 12.4946) | 42.3836 us (± 11.0238) | 1.00 |
| tiered_compile_sumLoop | 61.71 us (± 13.0839) | 61.1521 us (± 12.1078) | 1.01 |
| single_compile_mlir_sumLoop | 8.57714 ms (± 236.982) | 8.67139 ms (± 439.258) | 0.99 |
| single_compile_cpp_sumLoop | 26.7041 ms (± 861.181) | 26.6711 ms (± 606.446) | 1.00 |
| single_compile_bc_sumLoop | 62.3097 us (± 13.6451) | 62.8708 us (± 14.5141) | 0.99 |
| e2e_tiered_bc_to_mlir | 42.889 us (± 12.4874) | 43.2149 us (± 13.5364) | 0.99 |
| e2e_single_mlir | 8.35564 ms (± 208.594) | 8.24677 ms (± 145.888) | 1.01 |
| exec_mlir_add | 9.44482 ns (± 0.537266) | 9.73747 ns (± 0.898217) | 0.97 |
| exec_mlir_fibonacci | 13.5948 us (± 1.26605) | 13.6152 us (± 1.45369) | 1.00 |
| exec_mlir_sum | 575.442 us (± 22.2906) | 571.848 us (± 25.1486) | 1.01 |
| exec_cpp_add | 4.66317 ns (± 0.530745) | 4.84943 ns (± 1.04744) | 0.96 |
| exec_cpp_fibonacci | 96.4577 us (± 8.40976) | 96.8868 us (± 9.72657) | 1.00 |
| exec_cpp_sum | 35.9518 ms (± 96.2803) | 36.1302 ms (± 440.313) | 1.00 |
| exec_bc_add | 44.2286 ns (± 5.80522) | 42.1954 ns (± 3.82737) | 1.05 |
| exec_bc_fibonacci | 900.068 us (± 7.13396) | 906.298 us (± 30.0336) | 0.99 |
| exec_bc_sum | 190.755 ms (± 388.855) | 190.715 ms (± 498.184) | 1.00 |
| exec_asmjit_add | 3.20099 ns (± 0.290477) | 3.27822 ns (± 0.499463) | 0.98 |
| exec_asmjit_fibonacci | 21.4862 us (± 2.2355) | 21.9394 us (± 3.5088) | 0.98 |
| exec_asmjit_sum | 4.59414 ms (± 15.061) | 4.61489 ms (± 20.3506) | 1.00 |
| exec_bc_addOne | 39.0611 ns (± 8.686) | 35.4491 ns (± 6.68627) | 1.10 |
| exec_mlir_addOne | 283.751 ns (± 5.97818) | 291.212 ns (± 10.1709) | 0.97 |
| exec_cpp_addOne | 4.05063 ns (± 0.719371) | 3.94796 ns (± 0.514023) | 1.03 |
| exec_interpreted_addOne | 37.3876 ns (± 1.81838) | 38.1409 ns (± 2.13248) | 0.98 |
This comment was automatically generated by workflow using github-action-benchmark.
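
For readers scanning the table, the Ratio column appears to be the current mean divided by the previous mean, so values above 1.00 mean the current commit is slower. The sketch below recomputes it and flags rows above a threshold. `BenchRow`, `ratio`, and `slowerThan` are hypothetical names, the threshold is an assumption, and times are assumed to be converted to nanoseconds by hand first.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One table row; both times are assumed to be normalized to nanoseconds.
struct BenchRow {
    std::string name;
    double currentNs;
    double previousNs;
};

// Ratio as it appears in the table: current time divided by previous time.
// Values above 1.0 mean the current commit is slower than the previous one.
inline double ratio(const BenchRow& row) {
    assert(row.previousNs > 0.0);
    return row.currentNs / row.previousNs;
}

// Names of all rows whose ratio exceeds a caller-chosen (hypothetical) threshold.
inline std::vector<std::string> slowerThan(const std::vector<BenchRow>& rows,
                                           double threshold) {
    std::vector<std::string> names;
    for (const auto& row : rows) {
        if (ratio(row) > threshold) names.push_back(row.name);
    }
    return names;
}
```

With a threshold of 1.5, the only row this would flag in the table above is `completing_trace_deeplyNestedIfElse` (22.5332 us against 14.7389 us, ratio 1.53).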
New plugin that profiles code segments inside compiled Nautilus functions.
ProfileRegion::start()/stop() calls compile into the generated code via
invoke(), working across all backends (MLIR, C++, Bytecode, AsmJit).
Core API - ProfileRegion:

    static ProfileRegion normalize("normalize");

    val<int32_t> pipeline(val<int32_t*> data, val<int32_t> size) {
        normalize.start();
        // code + invoke(native_func, ...) ...
        normalize.stop();
        return size;
    }
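
To make the runtime side of `start()`/`stop()` more concrete, here is a minimal sketch of how such a region could accumulate timings, assuming per-thread entry timestamps and atomic aggregates as the feature list describes. `SketchRegion`, `sketchStart`, and `sketchStop` are hypothetical names for illustration. They are not the plugin's actual classes, and the real implementation may differ.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for a profiling region; NOT the plugin's ProfileRegion.
class SketchRegion {
public:
    // Remember when the calling thread entered this region.
    void start() { entries()[this] = nowNs(); }

    // Fold the elapsed time since this thread's start() into the shared totals.
    void stop() {
        auto& perThread = entries();
        auto it = perThread.find(this);
        if (it == perThread.end()) return;  // stop() without a matching start()
        const uint64_t now = nowNs();
        assert(now >= it->second);          // steady_clock never goes backwards
        totalNs_.fetch_add(now - it->second, std::memory_order_relaxed);
        calls_.fetch_add(1, std::memory_order_relaxed);
        perThread.erase(it);
    }

    uint64_t calls() const { return calls_.load(std::memory_order_relaxed); }
    uint64_t totalNs() const { return totalNs_.load(std::memory_order_relaxed); }

private:
    static uint64_t nowNs() {
        using namespace std::chrono;
        return static_cast<uint64_t>(
            duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
    }

    // One map per thread, keyed by region, so threads never share entry timestamps.
    static std::unordered_map<const SketchRegion*, uint64_t>& entries() {
        thread_local std::unordered_map<const SketchRegion*, uint64_t> map;
        return map;
    }

    std::atomic<uint64_t> totalNs_{0};
    std::atomic<uint64_t> calls_{0};
};

// Plain functions of the kind generated code could call at runtime (assumption).
inline void sketchStart(SketchRegion* region) { region->start(); }
inline void sketchStop(SketchRegion* region) { region->stop(); }
```

Keeping the entry timestamp thread-local means concurrent executions of the same compiled function cannot overwrite each other's start times, and only the final accumulation touches shared state.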
Features:
- Thread-safe: entry timestamps in thread-local storage, atomics for aggregates
- Hardware counters via perf_event_open (Linux): cycles, instructions, cache/branch misses
- Perf correlation: records [start,stop] intervals, correlates with `perf script`
samples to produce flame charts with full native call stacks scoped to regions
- Programmatic perf: Profiler::start()/stop() auto-forks perf record (Linux)
or sample (macOS), parses output, and correlates on stop
- Export: Chrome Trace JSON (Perfetto), folded stacks (flamegraph.pl)
- Graceful degradation when perf/sample unavailable (containers, VMs)
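
The perf correlation and folded-stack export listed above can be pictured as follows: each sample whose timestamp falls inside a recorded [start, stop] interval is attributed to that region, and identical stacks are counted. This is only a sketch under stated assumptions. `Interval`, `Sample`, `foldSamples`, and `toFoldedText` are hypothetical names, samples are assumed to be already parsed (for example from `perf script` output), and both timestamps are assumed to come from the same clock.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// A recorded [start, stop] interval of one region execution (hypothetical type).
struct Interval {
    std::string region;
    uint64_t startNs;
    uint64_t stopNs;
};

// One parsed profiler sample; frames are ordered outermost first (assumption).
struct Sample {
    uint64_t timeNs;
    std::vector<std::string> frames;
};

// Attribute each sample to the first interval containing its timestamp and
// count identical "region;frame;frame" stacks.
inline std::map<std::string, uint64_t> foldSamples(
        const std::vector<Interval>& intervals,
        const std::vector<Sample>& samples) {
    std::map<std::string, uint64_t> folded;
    for (const auto& sample : samples) {
        for (const auto& interval : intervals) {
            assert(interval.startNs <= interval.stopNs);
            if (sample.timeNs < interval.startNs || sample.timeNs > interval.stopNs) {
                continue;  // sample was taken outside this interval
            }
            std::string key = interval.region;
            for (const auto& frame : sample.frames) {
                key += ';';
                key += frame;
            }
            ++folded[key];
            break;  // count each sample at most once
        }
    }
    return folded;
}

// Render the counts in the "stack count" line format that flamegraph.pl reads.
inline std::string toFoldedText(const std::map<std::string, uint64_t>& folded) {
    std::string out;
    for (const auto& entry : folded) {
        out += entry.first + " " + std::to_string(entry.second) + "\n";
    }
    return out;
}
```

The linear scan over intervals keeps the sketch short. With many intervals, sorting them by start time and using a binary search would be the more natural choice.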
Tests: 10 test cases, 475 assertions across all backends
https://claude.ai/code/session_01SLkb9NXtrjb4QAkiUTTMf9
New plugin that provides cross-backend function profiling for Nautilus.