Add ValueProfile and specialize() helper for argument value specialization #231

Open
PhilippGrulich wants to merge 1 commit into main from claude/function-specialization-plugin-Hb3Ds

Conversation

@PhilippGrulich
Member

Introduces a header-only API in the specialization plugin that lets a
traced kernel mark arguments for value specialization. specialize(arg, p)
emits a profile-update proxy call when the profile is empty, and a
traced 'if (arg == c) { nautilus_assume(arg == c); arg = c; }'
dispatcher once the profile has stabilized so that downstream uses of
the argument can be const-folded by ConstantPropagationPhase + LLVM.
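
The two behaviours described above can be sketched in plain C++. This is only an illustration under stated assumptions: `ValueProfileSketch`, `specializeSketch`, and the stability threshold of three observations are hypothetical names and choices, not the plugin's API, and the sketch makes its decision at run time, whereas the PR describes emitting either the profile update or the guard into the trace.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for a value profile; the plugin's ValueProfile<T> may differ.
template <typename T>
class ValueProfileSketch {
public:
    // Record one observed argument value.
    void observe(T v) {
        if (count_ == 0) {
            candidate_ = v;
        } else if (!(v == candidate_)) {
            polymorphic_ = true; // more than one distinct value was seen
        }
        ++count_;
    }

    // Assumption: a value is "stable" once it was seen `threshold` times
    // and no other value was ever observed.
    std::optional<T> stableValue(std::uint64_t threshold = 3) const {
        if (!polymorphic_ && count_ >= threshold) {
            return candidate_;
        }
        return std::nullopt;
    }

private:
    T candidate_{};
    std::uint64_t count_ = 0;
    bool polymorphic_ = false;
};

// While no stable value exists, only update the profile. Once one exists,
// guard on it: inside the guard the argument equals a known constant, which
// is the fact that constant propagation can exploit in a real compiler.
template <typename T>
T specializeSketch(T arg, ValueProfileSketch<T>& profile) {
    if (auto c = profile.stableValue()) {
        if (arg == *c) {
            arg = *c; // same value, but now tied to the constant c
        }
        return arg; // other values fall through unchanged (generic path)
    }
    profile.observe(arg);
    return arg;
}
```

The guard never changes the observable value of the argument; it only gives the optimizer a path on which the value is a constant.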

@PhilippGrulich PhilippGrulich force-pushed the claude/function-specialization-plugin-Hb3Ds branch from cd019a1 to 29801bc Compare April 7, 2026 20:59
Contributor

@github-actions github-actions bot left a comment

Tracing Benchmark

| Benchmark suite | Current: 20bc8ad | Previous: 4ddc690 | Ratio |
| --- | --- | --- | --- |
| ssa_add | 190.88 ns (± 17.6943) | 196.852 ns (± 18.2072) | 0.97 |
| ssa_ifThenElse | 476.925 ns (± 44.4928) | 486.861 ns (± 70.2234) | 0.98 |
| ssa_deeplyNestedIfElse | 1.16644 us (± 87.9243) | 1.1809 us (± 106.412) | 0.99 |
| ssa_loop | 494.202 ns (± 41.4552) | 514.039 ns (± 54.5865) | 0.96 |
| ssa_ifInsideLoop | 902.565 ns (± 58.5923) | 943.877 ns (± 70.5854) | 0.96 |
| ssa_loopDirectCall | 500.529 ns (± 44.2476) | 506.317 ns (± 59.915) | 0.99 |
| ssa_pointerLoop | 604.171 ns (± 44.4348) | 614.184 ns (± 51.7513) | 0.98 |
| ssa_staticLoop | 501.13 ns (± 24.2694) | 541.031 ns (± 83.2034) | 0.93 |
| ssa_fibonacci | 511.478 ns (± 37.3635) | 527.105 ns (± 56.114) | 0.97 |
| ssa_gcd | 486.642 ns (± 60.6796) | 487.012 ns (± 85.6455) | 1.00 |
| comp_mlir_add | 8.19096 ms (± 369.505) | 8.38057 ms (± 192.605) | 0.98 |
| comp_mlir_ifThenElse | 8.68063 ms (± 101.675) | 9.14395 ms (± 187.901) | 0.95 |
| comp_mlir_deeplyNestedIfElse | 7.58213 ms (± 86.16) | 7.7788 ms (± 165.806) | 0.97 |
| comp_mlir_loop | 9.63447 ms (± 169.805) | 9.9357 ms (± 189.58) | 0.97 |
| comp_mlir_ifInsideLoop | 30.7378 ms (± 242.396) | 31.4149 ms (± 339.44) | 0.98 |
| comp_mlir_loopDirectCall | 14.15 ms (± 137.355) | 14.7687 ms (± 880.984) | 0.96 |
| comp_mlir_pointerLoop | 29.8684 ms (± 236.03) | 30.4114 ms (± 282.375) | 0.98 |
| comp_mlir_staticLoop | 7.52574 ms (± 91.2368) | 7.69672 ms (± 127.901) | 0.98 |
| comp_mlir_fibonacci | 12.8877 ms (± 287.091) | 13.2546 ms (± 177.381) | 0.97 |
| comp_mlir_gcd | 11.8734 ms (± 159.612) | 12.4331 ms (± 330.253) | 0.95 |
| comp_mlir_nestedIf10 | 12.9695 ms (± 152.78) | 13.5025 ms (± 306.971) | 0.96 |
| comp_mlir_nestedIf100 | 27.6422 ms (± 325.914) | 27.4857 ms (± 419.81) | 1.01 |
| comp_mlir_chainedIf10 | 11.9553 ms (± 210.644) | 12.7826 ms (± 270.187) | 0.94 |
| comp_mlir_chainedIf100 | 22.6107 ms (± 177.664) | 24.0902 ms (± 540.118) | 0.94 |
| comp_cpp_add | 24.5071 ms (± 355.538) | 25.6204 ms (± 845.684) | 0.96 |
| comp_cpp_ifThenElse | 25.0888 ms (± 297.095) | 25.9943 ms (± 587.031) | 0.97 |
| comp_cpp_deeplyNestedIfElse | 26.1689 ms (± 382.855) | 27.8513 ms (± 1.22198) | 0.94 |
| comp_cpp_loop | 25.2702 ms (± 308.487) | 26.5132 ms (± 801.141) | 0.95 |
| comp_cpp_ifInsideLoop | 26.0768 ms (± 263.189) | 27.3091 ms (± 703.368) | 0.95 |
| comp_cpp_loopDirectCall | 25.469 ms (± 319.87) | 27.1496 ms (± 557.475) | 0.94 |
| comp_cpp_pointerLoop | 25.6961 ms (± 268.471) | 27.0546 ms (± 527.374) | 0.95 |
| comp_cpp_staticLoop | 24.8914 ms (± 254.397) | 25.7821 ms (± 833.768) | 0.97 |
| comp_cpp_fibonacci | 25.325 ms (± 248.615) | 26.1467 ms (± 746.022) | 0.97 |
| comp_cpp_gcd | 25.1555 ms (± 277.306) | 26.283 ms (± 1.10201) | 0.96 |
| comp_cpp_nestedIf10 | 27.9958 ms (± 277.645) | 28.2778 ms (± 475.46) | 0.99 |
| comp_cpp_nestedIf100 | 61.5579 ms (± 370.596) | 62.5057 ms (± 680.203) | 0.98 |
| comp_cpp_chainedIf10 | 30.5678 ms (± 285.832) | 30.8186 ms (± 570.568) | 0.99 |
| comp_cpp_chainedIf100 | 91.1343 ms (± 844.182) | 94.8954 ms (± 3.63672) | 0.96 |
| comp_bc_add | 14.2364 us (± 1.75678) | 14.5932 us (± 2.09862) | 0.98 |
| comp_bc_ifThenElse | 17.8769 us (± 2.40526) | 17.5758 us (± 2.73918) | 1.02 |
| comp_bc_deeplyNestedIfElse | 22.1569 us (± 2.56091) | 24.34 us (± 5.64301) | 0.91 |
| comp_bc_loop | 17.8982 us (± 2.32555) | 18.7744 us (± 4.12069) | 0.95 |
| comp_bc_ifInsideLoop | 20.4437 us (± 2.44517) | 20.7524 us (± 2.69374) | 0.99 |
| comp_bc_loopDirectCall | 18.9932 us (± 2.85576) | 19.1342 us (± 2.85075) | 0.99 |
| comp_bc_pointerLoop | 19.7144 us (± 2.77798) | 20.412 us (± 3.05277) | 0.97 |
| comp_bc_staticLoop | 16.8452 us (± 2.86847) | 17.3441 us (± 3.25445) | 0.97 |
| comp_bc_fibonacci | 18.3064 us (± 2.43635) | 19.1498 us (± 2.86824) | 0.96 |
| comp_bc_gcd | 18.0169 us (± 2.93763) | 18.3923 us (± 3.5189) | 0.98 |
| comp_bc_nestedIf10 | 35.0084 us (± 3.47404) | 35.96 us (± 4.1926) | 0.97 |
| comp_bc_nestedIf100 | 176.776 us (± 7.80642) | 201.72 us (± 33.1657) | 0.88 |
| comp_bc_chainedIf10 | 50.2744 us (± 8.30082) | 55.8101 us (± 13.0197) | 0.90 |
| comp_bc_chainedIf100 | 277.791 us (± 10.7667) | 302.553 us (± 23.62) | 0.92 |
| comp_asmjit_add | 21.2692 us (± 4.01024) | 22.3937 us (± 4.48756) | 0.95 |
| comp_asmjit_ifThenElse | 34.2081 us (± 4.59567) | 34.8408 us (± 4.85456) | 0.98 |
| comp_asmjit_deeplyNestedIfElse | 58.8082 us (± 8.78841) | 60.2774 us (± 9.17425) | 0.98 |
| comp_asmjit_loop | 35.8823 us (± 3.84668) | 42.025 us (± 7.05867) | 0.85 |
| comp_asmjit_ifInsideLoop | 57.5709 us (± 6.86521) | 60.0379 us (± 10.8082) | 0.96 |
| comp_asmjit_loopDirectCall | 46.399 us (± 6.97435) | 47.8621 us (± 9.77711) | 0.97 |
| comp_asmjit_pointerLoop | 48.3917 us (± 6.45883) | 50.4541 us (± 11.156) | 0.96 |
| comp_asmjit_staticLoop | 28.1258 us (± 4.00051) | 28.8697 us (± 4.41559) | 0.97 |
| comp_asmjit_fibonacci | 44.1299 us (± 6.61175) | 45.3356 us (± 9.25608) | 0.97 |
| comp_asmjit_gcd | 36.4854 us (± 6.43279) | 35.7476 us (± 5.30607) | 1.02 |
| comp_asmjit_nestedIf10 | 109.963 us (± 11.6552) | 111.032 us (± 14.5965) | 0.99 |
| comp_asmjit_nestedIf100 | 1.1385 ms (± 12.9426) | 1.14527 ms (± 20.1439) | 0.99 |
| comp_asmjit_chainedIf10 | 162.08 us (± 11.4284) | 166.561 us (± 15.152) | 0.97 |
| comp_asmjit_chainedIf100 | 2.28201 ms (± 25.9004) | 2.27493 ms (± 36.9212) | 1.00 |
| ir_add | 854.808 ns (± 71.7311) | 899.987 ns (± 116.357) | 0.95 |
| ir_ifThenElse | 2.40972 us (± 165.045) | 2.50916 us (± 309.896) | 0.96 |
| ir_deeplyNestedIfElse | 6.86063 us (± 931.152) | 6.62996 us (± 689.783) | 1.03 |
| ir_loop | 2.97028 us (± 455.568) | 3.15779 us (± 491.542) | 0.94 |
| ir_ifInsideLoop | 5.55614 us (± 390.397) | 5.78225 us (± 651.21) | 0.96 |
| ir_loopDirectCall | 3.13579 us (± 224.141) | 3.43855 us (± 555.66) | 0.91 |
| ir_pointerLoop | 3.6634 us (± 305.335) | 3.81923 us (± 449.197) | 0.96 |
| ir_staticLoop | 2.20591 us (± 164.315) | 2.28226 us (± 197.398) | 0.97 |
| ir_fibonacci | 3.09834 us (± 252.729) | 3.15796 us (± 298.215) | 0.98 |
| ir_gcd | 2.66002 us (± 246.515) | 2.71337 us (± 324.906) | 0.98 |
| ir_nestedIf10 | 15.4621 us (± 1.21789) | 16.2159 us (± 1.70068) | 0.95 |
| ir_nestedIf100 | 187.069 us (± 6.41364) | 195.124 us (± 18.2958) | 0.96 |
| ir_chainedIf10 | 28.3291 us (± 1.73383) | 29.4978 us (± 3.21451) | 0.96 |
| ir_chainedIf100 | 358.614 us (± 9.74037) | 372.238 us (± 13.6387) | 0.96 |
| tiered_compile_addOne | 41.4017 us (± 7.66667) | 36.1323 us (± 10.3872) | 1.15 |
| single_compile_mlir_addOne | 6.07201 ms (± 89.374) | 6.69496 ms (± 416.394) | 0.91 |
| single_compile_cpp_addOne | 24.3658 ms (± 268.6) | 26.6951 ms (± 481.83) | 0.91 |
| single_compile_bc_addOne | 41.6916 us (± 5.95621) | 42.6381 us (± 11.2121) | 0.98 |
| tiered_compile_sumLoop | 59.9794 us (± 8.56729) | 64.5505 us (± 12.6459) | 0.93 |
| single_compile_mlir_sumLoop | 8.09942 ms (± 115.888) | 9.04856 ms (± 286.628) | 0.90 |
| single_compile_cpp_sumLoop | 24.8437 ms (± 263.17) | 27.4246 ms (± 571.506) | 0.91 |
| single_compile_bc_sumLoop | 60.3862 us (± 6.61555) | 61.7793 us (± 14.2547) | 0.98 |
| trace_add | 2.55804 us (± 199.179) | 2.56751 us (± 312.946) | 1.00 |
| completing_trace_add | 2.67806 us (± 590.93) | 2.62229 us (± 366.074) | 1.02 |
| trace_ifThenElse | 11.673 us (± 1.37586) | 12.1875 us (± 2.1946) | 0.96 |
| completing_trace_ifThenElse | 5.35939 us (± 507.912) | 5.66397 us (± 836.071) | 0.95 |
| trace_deeplyNestedIfElse | 35.4159 us (± 4.26678) | 35.7092 us (± 7.02971) | 0.99 |
| completing_trace_deeplyNestedIfElse | 15.5455 us (± 1.61045) | 15.7523 us (± 3.59692) | 0.99 |
| trace_loop | 11.5549 us (± 1.57806) | 11.3827 us (± 1.74982) | 1.02 |
| completing_trace_loop | 5.30657 us (± 472.132) | 5.34366 us (± 671.493) | 0.99 |
| trace_ifInsideLoop | 22.2748 us (± 2.05627) | 23.2593 us (± 3.89688) | 0.96 |
| completing_trace_ifInsideLoop | 10.1716 us (± 1.08231) | 10.3211 us (± 1.4055) | 0.99 |
| trace_loopDirectCall | 11.3351 us (± 1.0882) | 11.5762 us (± 1.87128) | 0.98 |
| completing_trace_loopDirectCall | 5.37777 us (± 658.769) | 5.55179 us (± 776.341) | 0.97 |
| trace_pointerLoop | 17.439 us (± 2.75791) | 17.7094 us (± 3.40857) | 0.98 |
| completing_trace_pointerLoop | 11.3092 us (± 1.06916) | 11.8254 us (± 1.97681) | 0.96 |
| trace_staticLoop | 9.16521 us (± 769.186) | 9.39437 us (± 1.13559) | 0.98 |
| completing_trace_staticLoop | 8.67855 us (± 784.69) | 9.27847 us (± 1.29479) | 0.94 |
| trace_fibonacci | 12.7072 us (± 1.14803) | 13.8422 us (± 3.27903) | 0.92 |
| completing_trace_fibonacci | 6.68079 us (± 888.898) | 7.03076 us (± 996.33) | 0.95 |
| trace_gcd | 10.6098 us (± 1.22677) | 10.8536 us (± 2.31146) | 0.98 |
| completing_trace_gcd | 4.47704 us (± 327.454) | 4.59746 us (± 532.736) | 0.97 |
| trace_nestedIf10 | 55.4052 us (± 5.78169) | 56.6039 us (± 9.13368) | 0.98 |
| completing_trace_nestedIf10 | 54.5952 us (± 5.8429) | 57.1039 us (± 8.66097) | 0.96 |
| trace_nestedIf100 | 1.7427 ms (± 48.1858) | 1.75731 ms (± 48.612) | 0.99 |
| completing_trace_nestedIf100 | 1.79803 ms (± 38.0432) | 1.80528 ms (± 61.1031) | 1.00 |
| trace_chainedIf10 | 136.125 us (± 6.8899) | 142.906 us (± 20.5703) | 0.95 |
| completing_trace_chainedIf10 | 70.1246 us (± 10.3851) | 71.253 us (± 10.5092) | 0.98 |
| trace_chainedIf100 | 5.10896 ms (± 42.3906) | 5.11334 ms (± 59.9812) | 1.00 |
| completing_trace_chainedIf100 | 2.74956 ms (± 29.7741) | 2.77679 ms (± 53.4349) | 0.99 |
| e2e_tiered_bc_to_mlir | 43.1178 us (± 9.90148) | 42.742 us (± 12.6185) | 1.01 |
| e2e_single_mlir | 7.92751 ms (± 76.5823) | 8.36387 ms (± 191.751) | 0.95 |
| exec_mlir_add | 9.86577 ns (± 0.881544) | 10.2965 ns (± 1.03277) | 0.96 |
| exec_mlir_fibonacci | 13.1156 us (± 888.427) | 15.9968 us (± 2.35392) | 0.82 |
| exec_mlir_sum | 530.327 us (± 19.1346) | 594.472 us (± 29.2638) | 0.89 |
| exec_cpp_add | 4.70183 ns (± 0.794904) | 4.73691 ns (± 0.799371) | 0.99 |
| exec_cpp_fibonacci | 96.4755 us (± 9.99885) | 99.3508 us (± 15.1486) | 0.97 |
| exec_cpp_sum | 35.9154 ms (± 112.524) | 35.9916 ms (± 159.868) | 1.00 |
| exec_bc_add | 42.9882 ns (± 4.07597) | 44.2275 ns (± 6.98237) | 0.97 |
| exec_bc_fibonacci | 899.318 us (± 7.32116) | 902.474 us (± 10.0672) | 1.00 |
| exec_bc_sum | 190.626 ms (± 291.854) | 190.799 ms (± 397.461) | 1.00 |
| exec_asmjit_add | 3.21068 ns (± 0.278466) | 3.24248 ns (± 0.494265) | 0.99 |
| exec_asmjit_fibonacci | 22.6468 us (± 5.57331) | 21.3141 us (± 1.36548) | 1.06 |
| exec_asmjit_sum | 4.61201 ms (± 28.2952) | 4.62116 ms (± 62.4678) | 1.00 |
| exec_bc_addOne | 36.1161 ns (± 4.8241) | 37.2873 ns (± 7.38083) | 0.97 |
| exec_mlir_addOne | 274.094 ns (± 3.33802) | 299.046 ns (± 11.2662) | 0.92 |
| exec_cpp_addOne | 3.9852 ns (± 0.689287) | 4.03145 ns (± 0.734027) | 0.99 |
| exec_interpreted_addOne | 38.0425 ns (± 1.89879) | 38.1182 ns (± 2.30459) | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@PhilippGrulich PhilippGrulich force-pushed the claude/function-specialization-plugin-Hb3Ds branch 3 times, most recently from 4c65973 to 30c1e3b Compare April 8, 2026 04:00
Adds the nautilus specialization plugin providing:
- ValueProfile<T> for runtime argument value profiling
- SpecializedNautilusFunction wrapper that behaves like NautilusFunction
  but emits a nested dispatcher function which routes calls to a
  specialized or generic compiled variant based on a stable profile
- assume() intrinsic plumbing (existing)
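
The routing idea behind the dispatcher can be sketched without any Nautilus types. Everything in this block is hypothetical (`DispatcherSketch`, `installSpecialized`, `powGeneric`, `powSquared`); it only shows a wrapper that forwards to a specialized variant when the profiled argument matches the stable constant and to the generic variant otherwise, and it does not reproduce how SpecializedNautilusFunction emits its nested dispatcher.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>

// Hypothetical illustration only; not the plugin's SpecializedNautilusFunction.
class DispatcherSketch {
public:
    using Fn = std::function<std::int64_t(std::int64_t, std::int64_t)>;

    explicit DispatcherSketch(Fn generic) : generic_(std::move(generic)) {}

    // Install a variant that was built assuming `second argument == constant`.
    void installSpecialized(std::int64_t constant, Fn specialized) {
        constant_ = constant;
        specialized_ = std::move(specialized);
    }

    // Route the call: specialized variant on a match, generic otherwise.
    std::int64_t operator()(std::int64_t x, std::int64_t n) const {
        if (constant_ && n == *constant_) {
            return specialized_(x, n);
        }
        return generic_(x, n);
    }

private:
    Fn generic_;
    Fn specialized_;
    std::optional<std::int64_t> constant_;
};

// Generic kernel: x raised to the n-th power with a run-time loop.
inline std::int64_t powGeneric(std::int64_t x, std::int64_t n) {
    std::int64_t result = 1;
    for (std::int64_t i = 0; i < n; ++i) {
        result *= x;
    }
    return result;
}

// What the same kernel can collapse to once n is known to be 2.
inline std::int64_t powSquared(std::int64_t x, std::int64_t /*n*/) {
    return x * x;
}
```

A caller would construct `DispatcherSketch d(powGeneric)`, later call `d.installSpecialized(2, powSquared)`, and keep invoking `d(x, n)` unchanged; both variants return the same results, so the switch is invisible apart from speed.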

Includes backend-parameterized behavioural tests across all enabled
code-gen backends (mlir/cpp/bc/asmjit) plus the interpreter, MLIR
IR-shape inspection tests, and CI integration for both the regular
and LLVM IR test executables.

Also fixes an MLIRLoweringProvider bug where blockMapping was not
cleared between successive generateFunction calls, which caused
'reference to block defined in another region' errors when multiple
functions with identically-named blocks coexisted in one module.
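
A toy model of this failure mode, assuming the mapping is keyed by block name and kept as a member across calls. Only the name `blockMapping` comes from the description above; `LoweringSketch`, `BlockHandle`, and the `clearMapping` flag are hypothetical and exist only to show the stale and fixed behaviour side by side. This is not MLIRLoweringProvider's code.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical model: a block handle remembers which function (region) owns it.
struct BlockHandle {
    std::string owner;
    std::string name;
};

class LoweringSketch {
public:
    // Lowers one function and returns the handles its branches would reference.
    std::vector<BlockHandle> generateFunction(const std::string& fn,
                                              const std::vector<std::string>& blocks,
                                              bool clearMapping) {
        if (clearMapping) {
            blockMapping_.clear(); // the fix: per-function state starts empty
        }
        std::vector<BlockHandle> referenced;
        for (const auto& b : blocks) {
            // emplace keeps an existing entry, so without the clear a handle
            // left over from a previous function is reused for the same name.
            auto it = blockMapping_.emplace(b, BlockHandle{fn, b}).first;
            referenced.push_back(it->second);
        }
        return referenced;
    }

private:
    std::unordered_map<std::string, BlockHandle> blockMapping_;
};
```

With `clearMapping` set to false, lowering `f` and then `g`, both containing a block named `entry`, makes `g` reference the handle owned by `f`, which corresponds to the cross-region reference in the error message; with the clear, each function only sees its own blocks.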
@PhilippGrulich PhilippGrulich force-pushed the claude/function-specialization-plugin-Hb3Ds branch from 30c1e3b to 20bc8ad Compare April 8, 2026 20:26