Add Python bindings for Nautilus val types via pybind11 #213
Open
PhilippGrulich wants to merge 15 commits into main
Introduces a Python API for Nautilus val objects with pybind11 bindings:

- Val types: ValInt32, ValInt64, ValFloat32, ValFloat64, ValBool with full operator overloading (arithmetic, bitwise, comparison, logical)
- Engine API: NautilusEngine wrapper with @engine.compile decorator that infers function signatures from Python type annotations
- Native control flow: Python if/else works via __bool__() → traceBool()
- select() for branchless ternary operations
- Type casting between val types (cast_to_float64, cast_to_int32, etc.)
- 63 passing tests covering operators, booleans, casting, compilation, and simple if/else control flow

Build with: cmake -DENABLE_PYTHON_BINDINGS=ON ..

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
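To illustrate how a Python if/else can be observed through __bool__(), here is a minimal pure-Python sketch. It does not use the Nautilus bindings: `Tracer`, `ToyVal`, `step`, and `trace_both_paths` are hypothetical names invented for this example, and the real traceBool() logic in the C++ tracer is certainly more involved.

```python
# Toy illustration only: none of these classes are part of the Nautilus API.
class Tracer:
    """Replays a fixed list of branch decisions and logs each branch it sees."""

    def __init__(self, decisions):
        self.decisions = list(decisions)
        self.log = []

    def trace_bool(self, expr):
        # Take the next scripted decision and record which condition was asked.
        taken = self.decisions.pop(0)
        self.log.append(f"branch {expr} -> {taken}")
        return taken


class ToyVal:
    """Symbolic value: operators build expression strings instead of computing."""

    def __init__(self, tracer, expr):
        self.tracer = tracer
        self.expr = expr

    def _binary(self, op, other):
        rhs = other.expr if isinstance(other, ToyVal) else repr(other)
        return ToyVal(self.tracer, f"({self.expr} {op} {rhs})")

    def __gt__(self, other):
        return self._binary(">", other)

    def __add__(self, other):
        return self._binary("+", other)

    def __sub__(self, other):
        return self._binary("-", other)

    def __bool__(self):
        # Python calls this when the value is used in `if`; the tracer decides.
        return self.tracer.trace_bool(self.expr)


def step(x):
    if x > 0:
        return x + 1
    return x - 1


def trace_both_paths(fn):
    """Run fn once per branch outcome and collect (log, result expression)."""
    paths = []
    for decision in (True, False):
        tracer = Tracer([decision])
        result = fn(ToyVal(tracer, "x"))
        paths.append((tracer.log, result.expr))
    return paths


paths = trace_both_paths(step)
```

Running `step` once per branch outcome yields one recorded path for each side of the `if`, which is the general idea behind exploring control flow by re-executing ordinary Python code.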
Three root causes prevented Python-traced functions from handling nested control flow:

1. Stack-based tags cannot distinguish Python call sites: The tracer uses __builtin_return_address to build unique tags per code position, but all Python calls through the same pybind11 dispatcher (e.g. two different __gt__ comparisons) produce identical stack backtraces. This caused false snapshot collisions in globalTagMap during the first trace iteration, triggering premature control-flow merge detection and corrupting the trace.
   Fix: Add ExternalPositionHint, a thread-local callback that the Python bindings set to return PyFrame_GetLasti() (the current Python bytecode offset). This value is XOR'd into the snapshot hash in recordSnapshot(), making snapshots from different Python source positions distinct while remaining stable across trace iterations.
2. Compound assignment created copies breaking loop detection: __iadd__ and similar operators returned val objects by value, creating new traced objects each iteration. This changed the AliveVariableHash across loop iterations, preventing snapshot matching for loop back-edge detection.
   Fix: Return by reference with py::return_value_policy::reference_internal.
3. TraceTerminationException couldn't propagate through Python: pybind11 converted the C++ exception to a generic Python RuntimeError, which couldn't be re-caught by the trace loop.
   Fix: Register TraceTerminationException as a Python exception type and re-throw it in each registration wrapper's catch handler.

All 71 Python tests and 129 C++ tests pass.

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
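The first fix relies on the bytecode offset differing between Python call sites even when they enter the same operator. The sketch below shows that property from pure Python using the CPython-specific `sys._getframe`; `Probe`, `two_comparisons`, `collect`, `mix`, and the constant `STACK_HASH` are hypothetical names for illustration, and the XOR step only mimics the idea described above rather than reproducing recordSnapshot().

```python
import sys


class Probe:
    """Records the caller's bytecode offset every time `probe > x` is evaluated."""

    def __init__(self):
        self.offsets = []

    def __gt__(self, other):
        # Frame 1 is the Python frame that executed the comparison (CPython only).
        self.offsets.append(sys._getframe(1).f_lasti)
        return False


def two_comparisons(p):
    p > 1  # first call site
    p > 2  # second call site, same __gt__ entry point


def collect():
    p = Probe()
    two_comparisons(p)
    return p.offsets


def mix(stack_hash, offset):
    # Assumed stand-in for "XOR the position hint into the snapshot hash".
    return stack_hash ^ offset


STACK_HASH = 0x5EED  # pretend both call sites produced this same native stack hash

first_run = collect()
second_run = collect()
mixed = [mix(STACK_HASH, off) for off in first_run]
```

The two comparisons report different offsets, so the mixed values differ even though the simulated stack hash is identical, and repeating the run gives the same offsets, matching the requirement that the hint stays stable across trace iterations.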
Relocate all Python-related code (pybind11 bindings, Python package, and tests) to the project-root plugins/python directory. Update CMake add_subdirectory path, include directories, and test sys.path entries to match the new layout. https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
- Add NautilusModule bindings for batch-compiling multiple named functions into a single compilation unit (Module/CompiledModule Python wrappers)
- Add NautilusFunction bindings for first-class callable functions that are intercepted during tracing (nautilus_function decorator)
- Support 6 module signatures and 4 NautilusFunction signatures
- Add tests for multiple independent functions on same engine (3 tests)
- Add tests for NautilusModule batch compilation (6 tests)
- Add tests for NautilusFunction tracing interception (4 tests)
- Update __init__.py with Module, CompiledModule, and nautilus_function API

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
Tracing Benchmark
| Benchmark suite | Current: f2fd897 | Previous: a3f3e9c | Ratio |
|---|---|---|---|
| trace_add | 3.59653 us (± 576.485) | 2.57955 us (± 268.144) | 1.39 |
| completing_trace_add | 3.16848 us (± 666.515) | 2.54844 us (± 288.334) | 1.24 |
| trace_ifThenElse | 15.2576 us (± 3.03806) | 12.1076 us (± 2.12239) | 1.26 |
| completing_trace_ifThenElse | 7.83688 us (± 1.35849) | 5.30708 us (± 652.345) | 1.48 |
| trace_deeplyNestedIfElse | 44.8904 us (± 9.07869) | 36.6173 us (± 7.37154) | 1.23 |
| completing_trace_deeplyNestedIfElse | 20.8928 us (± 3.91357) | 15.8437 us (± 3.45161) | 1.32 |
| trace_loop | 15.7337 us (± 2.45831) | 11.8298 us (± 1.86512) | 1.33 |
| completing_trace_loop | 7.44274 us (± 1.22451) | 5.39947 us (± 682.77) | 1.38 |
| trace_ifInsideLoop | 30.9589 us (± 5.49375) | 23.4552 us (± 3.32729) | 1.32 |
| completing_trace_ifInsideLoop | 14.2496 us (± 2.62911) | 10.2411 us (± 1.82447) | 1.39 |
| trace_loopDirectCall | 15.7997 us (± 2.7565) | 11.9955 us (± 1.81973) | 1.32 |
| completing_trace_loopDirectCall | 8.11704 us (± 1.22071) | 5.34824 us (± 671.043) | 1.52 |
| trace_pointerLoop | 23.5504 us (± 4.28354) | 18.3703 us (± 3.33774) | 1.28 |
| completing_trace_pointerLoop | 15.7292 us (± 2.47179) | 11.493 us (± 1.73451) | 1.37 |
| trace_staticLoop | 16.9551 us (± 2.32025) | 9.20622 us (± 1.0699) | 1.84 |
| completing_trace_staticLoop | 15.9759 us (± 2.27471) | 8.9735 us (± 1.10899) | 1.78 |
| trace_fibonacci | 17.6814 us (± 2.78306) | 14.1719 us (± 2.66331) | 1.25 |
| completing_trace_fibonacci | 9.59973 us (± 1.42366) | 6.59787 us (± 812.253) | 1.45 |
| trace_gcd | 14.9798 us (± 2.18011) | 11.0463 us (± 1.73258) | 1.36 |
| completing_trace_gcd | 6.92007 us (± 1.15669) | 4.65348 us (± 511.873) | 1.49 |
| trace_nestedIf10 | 79.8917 us (± 18.5807) | 55.4514 us (± 8.08936) | 1.44 |
| completing_trace_nestedIf10 | 82.1429 us (± 20.8319) | 55.6325 us (± 7.83651) | 1.48 |
| trace_nestedIf100 | 1.81277 ms (± 186.288) | 1.81343 ms (± 205.134) | 1.00 |
| completing_trace_nestedIf100 | 1.81595 ms (± 162.212) | 1.82224 ms (± 39.2657) | 1.00 |
| trace_chainedIf10 | 160.256 us (± 30.8356) | 141.567 us (± 11.4945) | 1.13 |
| completing_trace_chainedIf10 | 89.3013 us (± 22.2154) | 71.1739 us (± 10.1789) | 1.25 |
| trace_chainedIf100 | 5.2123 ms (± 175.106) | 5.13347 ms (± 45.2127) | 1.02 |
| completing_trace_chainedIf100 | 2.69159 ms (± 165.37) | 2.80508 ms (± 38.7944) | 0.96 |
| exec_mlir_add | 3.9377 ns (± 0.468665) | 3.47518 ns (± 0.405246) | 1.13 |
| exec_mlir_fibonacci | 5.77976 us (± 686.495) | 5.04737 us (± 1.08062) | 1.15 |
| exec_mlir_sum | 508.806 us (± 13.6253) | 578.145 us (± 64.0147) | 0.88 |
| exec_cpp_add | 6.2782 ns (± 1.13576) | 5.33205 ns (± 0.543434) | 1.18 |
| exec_cpp_fibonacci | 95.6696 us (± 8.16666) | 95.6188 us (± 9.36912) | 1.00 |
| exec_cpp_sum | 35.9899 ms (± 256.502) | 35.9484 ms (± 76.5862) | 1.00 |
| exec_bc_add | 67.1796 ns (± 9.42535) | 44.2215 ns (± 6.47502) | 1.52 |
| exec_bc_fibonacci | 844.33 us (± 95.3422) | 822.327 us (± 12.6536) | 1.03 |
| exec_bc_sum | 175.959 ms (± 4.38406) | 176.028 ms (± 659.013) | 1.00 |
| exec_asmjit_add | 4.5988 ns (± 0.958504) | 3.60814 ns (± 0.578975) | 1.27 |
| exec_asmjit_fibonacci | 24.8318 us (± 4.65585) | 21.6288 us (± 3.24613) | 1.15 |
| exec_asmjit_sum | 4.77003 ms (± 366.309) | 4.45697 ms (± 45.4201) | 1.07 |
| ssa_add | 318.404 ns (± 52.9742) | 194.671 ns (± 16.1002) | 1.64 |
| ssa_ifThenElse | 716.596 ns (± 134.003) | 481.598 ns (± 33.58) | 1.49 |
| ssa_deeplyNestedIfElse | 1.44808 us (± 296.088) | 1.25854 us (± 168.561) | 1.15 |
| ssa_loop | 720.961 ns (± 142.979) | 523.862 ns (± 48.3283) | 1.38 |
| ssa_ifInsideLoop | 1243.65 ns (± 268836) | 967.22 ns (± 92.4438) | 1.29 |
| ssa_loopDirectCall | 622.47 ns (± 139.459) | 521.524 ns (± 35.9294) | 1.19 |
| ssa_pointerLoop | 769.924 ns (± 158.652) | 645.002 ns (± 54.8434) | 1.19 |
| ssa_staticLoop | 776.283 ns (± 102.914) | 510.093 ns (± 48.6925) | 1.52 |
| ssa_fibonacci | 583.81 ns (± 102.567) | 547.878 ns (± 46.9578) | 1.07 |
| ssa_gcd | 702.155 ns (± 163.612) | 498.361 ns (± 58.7311) | 1.41 |
| comp_mlir_add | 8.25743 ms (± 99.0941) | 8.81596 ms (± 534.547) | 0.94 |
| comp_mlir_ifThenElse | 8.8621 ms (± 161.185) | 9.38068 ms (± 535.218) | 0.94 |
| comp_mlir_deeplyNestedIfElse | 7.67862 ms (± 75.9376) | 8.13313 ms (± 485.529) | 0.94 |
| comp_mlir_loop | 9.76778 ms (± 108.885) | 10.0294 ms (± 376.951) | 0.97 |
| comp_mlir_ifInsideLoop | 31.1872 ms (± 397.64) | 32.1179 ms (± 321.227) | 0.97 |
| comp_mlir_loopDirectCall | 14.3121 ms (± 163.719) | 15.1037 ms (± 329.851) | 0.95 |
| comp_mlir_pointerLoop | 30.0807 ms (± 236.948) | 31.5576 ms (± 340.952) | 0.95 |
| comp_mlir_staticLoop | 7.68311 ms (± 115.074) | 7.7594 ms (± 256.495) | 0.99 |
| comp_mlir_fibonacci | 13.2116 ms (± 673.477) | 13.4276 ms (± 339.291) | 0.98 |
| comp_mlir_gcd | 12.0756 ms (± 135.092) | 12.254 ms (± 311.131) | 0.99 |
| comp_mlir_nestedIf10 | 13.0973 ms (± 121.297) | 13.5831 ms (± 902.405) | 0.96 |
| comp_mlir_nestedIf100 | 27.5084 ms (± 466.113) | 28.6773 ms (± 317.1) | 0.96 |
| comp_mlir_chainedIf10 | 12.1837 ms (± 165.951) | 12.6642 ms (± 379.543) | 0.96 |
| comp_mlir_chainedIf100 | 23.1921 ms (± 323.82) | 23.8916 ms (± 324.958) | 0.97 |
| comp_cpp_add | 25.42 ms (± 820.564) | 25.8794 ms (± 368.535) | 0.98 |
| comp_cpp_ifThenElse | 25.8234 ms (± 831.268) | 27.5994 ms (± 1.11786) | 0.94 |
| comp_cpp_deeplyNestedIfElse | 26.6592 ms (± 832.736) | 28.1188 ms (± 667.155) | 0.95 |
| comp_cpp_loop | 25.7914 ms (± 708.893) | 26.7929 ms (± 361.011) | 0.96 |
| comp_cpp_ifInsideLoop | 26.884 ms (± 747.379) | 27.855 ms (± 437.402) | 0.97 |
| comp_cpp_loopDirectCall | 26.4368 ms (± 623.294) | 27.3539 ms (± 383.478) | 0.97 |
| comp_cpp_pointerLoop | 26.5562 ms (± 578.606) | 27.1963 ms (± 330.599) | 0.98 |
| comp_cpp_staticLoop | 25.6837 ms (± 603.659) | 26.1362 ms (± 340.064) | 0.98 |
| comp_cpp_fibonacci | 26.049 ms (± 529.325) | 26.6565 ms (± 383.856) | 0.98 |
| comp_cpp_gcd | 25.5785 ms (± 756.878) | 26.3853 ms (± 522.799) | 0.97 |
| comp_cpp_nestedIf10 | 28.4524 ms (± 505.847) | 29.3989 ms (± 349.145) | 0.97 |
| comp_cpp_nestedIf100 | 61.6504 ms (± 503.445) | 62.5396 ms (± 487.132) | 0.99 |
| comp_cpp_chainedIf10 | 30.6603 ms (± 499.985) | 32.1014 ms (± 671.871) | 0.96 |
| comp_cpp_chainedIf100 | 91.7875 ms (± 1.76442) | 92.7786 ms (± 1.07619) | 0.99 |
| comp_bc_add | 14.8319 us (± 2.10792) | 14.781 us (± 2.37726) | 1.00 |
| comp_bc_ifThenElse | 17.9507 us (± 2.66861) | 18.3066 us (± 3.31947) | 0.98 |
| comp_bc_deeplyNestedIfElse | 22.8133 us (± 3.46982) | 21.973 us (± 3.41507) | 1.04 |
| comp_bc_loop | 18.5566 us (± 3.50124) | 18.2333 us (± 3.69041) | 1.02 |
| comp_bc_ifInsideLoop | 20.8252 us (± 2.73671) | 20.8739 us (± 4.33151) | 1.00 |
| comp_bc_loopDirectCall | 19.1656 us (± 3.29112) | 19.2007 us (± 3.78413) | 1.00 |
| comp_bc_pointerLoop | 20.1164 us (± 3.16482) | 19.8644 us (± 3.77911) | 1.01 |
| comp_bc_staticLoop | 17.1578 us (± 2.85176) | 16.7879 us (± 3.38173) | 1.02 |
| comp_bc_fibonacci | 18.6145 us (± 2.77741) | 18.3663 us (± 3.27136) | 1.01 |
| comp_bc_gcd | 25.3225 us (± 6.83165) | 18.0152 us (± 2.76342) | 1.41 |
| comp_bc_nestedIf10 | 35.6392 us (± 4.26413) | 35.8522 us (± 4.28134) | 0.99 |
| comp_bc_nestedIf100 | 177.235 us (± 7.4267) | 181.097 us (± 11.8962) | 0.98 |
| comp_bc_chainedIf10 | 51.0544 us (± 6.53563) | 49.3702 us (± 7.50218) | 1.03 |
| comp_bc_chainedIf100 | 286.479 us (± 12.7824) | 290.818 us (± 17.3544) | 0.99 |
| comp_asmjit_add | 21.0391 us (± 2.63754) | 21.188 us (± 4.87693) | 0.99 |
| comp_asmjit_ifThenElse | 34.7083 us (± 5.39809) | 33.5501 us (± 5.38552) | 1.03 |
| comp_asmjit_deeplyNestedIfElse | 65.5515 us (± 11.3756) | 57.7212 us (± 9.17995) | 1.14 |
| comp_asmjit_loop | 38.8615 us (± 5.93744) | 35.6318 us (± 5.27654) | 1.09 |
| comp_asmjit_ifInsideLoop | 60.8 us (± 10.0116) | 59.4218 us (± 11.9057) | 1.02 |
| comp_asmjit_loopDirectCall | 48.8647 us (± 10.0422) | 46.0328 us (± 9.68253) | 1.06 |
| comp_asmjit_pointerLoop | 53.2866 us (± 9.62254) | 48.8339 us (± 10.9407) | 1.09 |
| comp_asmjit_staticLoop | 30.821 us (± 5.15104) | 28.3494 us (± 4.25178) | 1.09 |
| comp_asmjit_fibonacci | 45.5484 us (± 6.69808) | 43.6687 us (± 8.65062) | 1.04 |
| comp_asmjit_gcd | 38.5195 us (± 6.24653) | 35.1428 us (± 5.70744) | 1.10 |
| comp_asmjit_nestedIf10 | 123.537 us (± 23.2415) | 109.174 us (± 12.5732) | 1.13 |
| comp_asmjit_nestedIf100 | 1.17893 ms (± 42.9443) | 1.14775 ms (± 40.6487) | 1.03 |
| comp_asmjit_chainedIf10 | 181.791 us (± 25.0632) | 165.37 us (± 18.1415) | 1.10 |
| comp_asmjit_chainedIf100 | 2.39446 ms (± 81.7349) | 2.26744 ms (± 44.0847) | 1.06 |
| ir_add | 1307.34 ns (± 237174) | 854.33 ns (± 81.6744) | 1.53 |
| ir_ifThenElse | 3.53007 us (± 754.951) | 2.671 us (± 456.984) | 1.32 |
| ir_deeplyNestedIfElse | 8.01014 us (± 1.93109) | 6.88012 us (± 823.252) | 1.16 |
| ir_loop | 3.74549 us (± 851.188) | 3.04024 us (± 373.687) | 1.23 |
| ir_ifInsideLoop | 6.9274 us (± 1.7336) | 5.81858 us (± 746.136) | 1.19 |
| ir_loopDirectCall | 4.35151 us (± 907.141) | 3.2659 us (± 376.995) | 1.33 |
| ir_pointerLoop | 4.51983 us (± 935.667) | 3.7749 us (± 569.99) | 1.20 |
| ir_staticLoop | 2.88822 us (± 623.168) | 2.25725 us (± 158.869) | 1.28 |
| ir_fibonacci | 4.26741 us (± 881.408) | 3.17196 us (± 234.985) | 1.35 |
| ir_gcd | 3.6699 us (± 765.129) | 2.66647 us (± 302.248) | 1.38 |
| ir_nestedIf10 | 17.204 us (± 4.31678) | 16.1584 us (± 1.86538) | 1.06 |
| ir_nestedIf100 | 184.761 us (± 11.7132) | 193.558 us (± 8.17045) | 0.95 |
| ir_chainedIf10 | 31.4909 us (± 6.07932) | 29.8691 us (± 1.67191) | 1.05 |
| ir_chainedIf100 | 365.465 us (± 14.7672) | 370.333 us (± 13.0525) | 0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
Users can now write `def f(x: int) -> int` instead of `def f(x: ValInt32) -> ValInt32`. Python types are automatically mapped: int->ValInt32, float->ValFloat64, bool->ValBool. Type aliases (nautilus.int64, nautilus.float32, etc.) provide explicit bit-width control. Existing Val-type annotations still work. https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
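As a rough sketch of the annotation mapping described above, the following resolves a function's type hints to val type names. It is an assumption-laden illustration: `PY_TO_VAL` and `infer_signature` are hypothetical, the val types are represented as plain strings rather than the real classes, and the explicit aliases (nautilus.int64 and so on) are left out.

```python
import inspect
from typing import get_type_hints

# Mapping taken from the commit message; values are names only, not real classes.
PY_TO_VAL = {int: "ValInt32", float: "ValFloat64", bool: "ValBool"}


def infer_signature(fn):
    """Return ([parameter val types], return val type) from fn's annotations."""
    hints = get_type_hints(fn)
    params = []
    for name in inspect.signature(fn).parameters:
        if name not in hints:
            raise TypeError(f"parameter {name!r} has no type annotation")
        params.append(PY_TO_VAL[hints[name]])
    return params, PY_TO_VAL[hints["return"]]


def add(x: int, y: int) -> int:
    return x + y


def is_positive(x: float) -> bool:
    return x > 0


add_sig = infer_signature(add)
pred_sig = infer_signature(is_positive)
```

The dictionary lookup is keyed on the exact type object, so `bool` resolves to ValBool even though `bool` is a subclass of `int` in Python.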
All three APIs (Engine, Module, NautilusFunction) now support the complete set of 10 signatures: i32, i32x2, i64, i64x2, f64, f64x2, f32, f32x2, bool, and i32->bool predicate. Previously Module was missing f32, f32x2, bool, i32->bool and NautilusFunction was missing i64x2, f64x2, f32, f32x2, bool, i32->bool. https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
Wrap arbitrary Python objects as val<void*> and forward all operations (arithmetic, comparison, bool conversion) to the Python runtime via invoke() trampolines that call the Python C API. A thread-local arena tracks intermediate PyObject* references for correct refcounting. https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
Implement .call("method", args...) and .getattr("name") on ValObject,
enabling compiled Python code to invoke arbitrary methods on generic
objects. Uses invoke() with GIL-acquiring trampolines that call
PyObject_GetAttr + PyObject_Call. Supports 0-3 arguments.
https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
Bug fixes:
- Fix use-after-free in all 4 object call sites: reinterpret_borrow did not incref, but arenaClear decreffed, leaving a dangling pointer. Now Py_XINCREF result before arenaClear + use reinterpret_steal.
- Fix PyObject_RichCompareBool error handling: -1 (error) was silently mapped to false; now calls PyErr_Clear on error.
- Fix PyTuple_Pack null check in call() trampolines: null would segfault on PyObject_Call.

Simplification:
- Extract 15 named trampoline functions (pyAdd, pySub, pyEq, etc.) to py_object_helpers.hpp, replacing ~65 inline lambdas.
- Template makeCppFunc and PyNautilusFunc, replacing 12 copy-paste functions and 12 copy-paste structs with one template each.
- Use REGISTER_NAUTILUS_FUNC macro for pybind11 class registrations.

Added missing operator overloads for ValObject consistency:
- __rtruediv__, __rfloordiv__, __rmod__ (reverse operators)
- __ne__, __le__, __ge__ with py::object overloads
- __isub__, __imul__ with py::object overloads
- __floordiv__, __mod__ with py::object overloads

Net: -327 lines across 3 files.

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
- Add pyproject.toml enabling `pip install -e .` for the Python plugin
- Add tests/conftest.py that auto-discovers the CMake build directory
- Remove hardcoded sys.path hacks from all 7 test files

Usage after CMake build:
  cd plugins/python && pip install -e .
  pytest tests/

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
Generated by pip install -e . and should not be tracked. https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
- Add Options.set() method that auto-detects Python types (bool, int, float, str) and Options.set_double() for completeness
- Engine now accepts **kwargs for compiler options with underscore-to-dot conversion: Engine(dump_console=True) → dump.console=True
- Engine also accepts an Options object: Engine(options=opts)
- Add 5 tests covering kwargs, Options object, type detection, invalid type rejection, and dump options

Usage:
  engine = Engine(backend="mlir", mlir_optimizationLevel=2)
  engine = Engine(dump_console=True, dump_after_tracing=True)

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
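The kwargs handling could look roughly like the sketch below. It is not the plugin's code: `normalize_option_key`, `detect_kind`, and `collect_options` are hypothetical helpers, and the sketch assumes that every underscore in a keyword becomes a dot, which the commit message only confirms for the dump_console example.

```python
def normalize_option_key(name):
    # Assumption: all underscores map to dots (dump_console -> dump.console).
    return name.replace("_", ".")


def detect_kind(value):
    """Classify a Python value as one of the four supported option kinds."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "str"
    raise TypeError(f"unsupported option type: {type(value).__name__}")


def collect_options(**kwargs):
    """Turn keyword arguments into {dotted key: (kind, value)}."""
    return {normalize_option_key(k): (detect_kind(v), v) for k, v in kwargs.items()}


opts = collect_options(dump_console=True, mlir_optimizationLevel=2)
```

Checking `bool` first matters here: without it, `True` would be classified as an integer option.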
Extends the Python plugin to support functions with 3 parameters for all numeric types (int32, int64, float32, float64) and generic objects. Adds C++ bindings (CallableFunction, register, Module, CompiledModule, ModuleFunction, NautilusFunction) and corresponding Python registry entries for each 3-arg signature. https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
Mixed-type signatures (e.g. def f(x: int, y: float) -> float) are now auto-promoted to the widest numeric type so they map to an existing same-type C++ binding. Promotion order: bool < int32 < int64 < f32 < f64.

While loops with val conditions work correctly when loop variables use compound assignment (+=, -=, *=) to preserve C++ object identity for the tracer's loop back-edge detection.

static_range() provides a for-loop helper that yields ValInt32 values, unrolling the loop at trace time.

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
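The promotion rule can be expressed in a few lines. This is a minimal sketch using the order stated above; `PROMOTION_ORDER` and `promote` are hypothetical names, and the short type labels stand in for whatever representation the plugin uses internally.

```python
# Narrowest to widest, as listed in the commit message.
PROMOTION_ORDER = ["bool", "int32", "int64", "f32", "f64"]


def promote(*type_names):
    """Return the widest of the given type names under PROMOTION_ORDER."""
    # list.index raises ValueError for names that are not in the order.
    return max(type_names, key=PROMOTION_ORDER.index)


mixed = promote("int32", "f64")    # def f(x: int, y: float) -> float
narrow = promote("bool", "int32")
cross = promote("int64", "f32")
```

Note that under this order an int64 paired with an f32 is promoted to f32, so large integer values may lose precision in such signatures.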