Add Python bindings for Nautilus val types via pybind11#213

Open
PhilippGrulich wants to merge 15 commits into main from claude/nautilus-val-api-oD8R6

@PhilippGrulich
Member

Introduces a Python API for Nautilus val objects with pybind11 bindings:

  • Val types: ValInt32, ValInt64, ValFloat32, ValFloat64, ValBool with full
    operator overloading (arithmetic, bitwise, comparison, logical)
  • Engine API: NautilusEngine wrapper with @engine.compile decorator that
    infers function signatures from Python type annotations
  • Native control flow: Python if/else works via __bool__() → traceBool()
  • select() for branchless ternary operations
  • Type casting between val types (cast_to_float64, cast_to_int32, etc.)
  • 63 passing tests covering operators, booleans, casting, compilation,
    and simple if/else control flow

Build with: cmake -DENABLE_PYTHON_BINDINGS=ON ..

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
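
To illustrate the `__bool__()` → `traceBool()` idea from the description, here is a minimal pure-Python sketch. It does not use the real bindings: `ToyTracer`, `ToyVal`, and `explore` are hypothetical names invented for this example, and the real tracer presumably decides branch outcomes quite differently. The sketch only shows why overriding `__bool__` lets an ordinary Python `if` be observed and steered by a tracer.

```python
class ToyTracer:
    """Hypothetical stand-in for a tracer: replays a fixed list of branch decisions."""

    def __init__(self, decisions):
        self.decisions = list(decisions)
        self.log = []  # (condition, branch taken) pairs seen during this run

    def trace_bool(self, condition):
        taken = self.decisions.pop(0)
        self.log.append((condition, taken))
        return taken


class ToyVal:
    """Hypothetical symbolic value; comparisons build an expression string."""

    def __init__(self, tracer, name):
        self.tracer = tracer
        self.name = name

    def __gt__(self, other):
        return ToyVal(self.tracer, f"({self.name} > {other})")

    def __bool__(self):
        # Python calls this when the value is used in `if`; the tracer picks the branch.
        return self.tracer.trace_bool(self.name)


def sign_label(x):
    if x > 0:
        return "then"
    return "else"


def explore(fn):
    """Run fn once per branch outcome and collect what the tracer observed."""
    paths = []
    for decision in (True, False):
        tracer = ToyTracer([decision])
        result = fn(ToyVal(tracer, "x"))
        paths.append((tracer.log, result))
    return paths


paths = explore(sign_label)
```

Each entry in `paths` pairs the recorded condition with the branch that was executed, so both sides of the `if` are visited without rewriting the user's function.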

claude added 5 commits March 31, 2026 12:28
Introduces a Python API for Nautilus val objects with pybind11 bindings:

- Val types: ValInt32, ValInt64, ValFloat32, ValFloat64, ValBool with full
  operator overloading (arithmetic, bitwise, comparison, logical)
- Engine API: NautilusEngine wrapper with @engine.compile decorator that
  infers function signatures from Python type annotations
- Native control flow: Python if/else works via __bool__() → traceBool()
- select() for branchless ternary operations
- Type casting between val types (cast_to_float64, cast_to_int32, etc.)
- 63 passing tests covering operators, booleans, casting, compilation,
  and simple if/else control flow

Build with: cmake -DENABLE_PYTHON_BINDINGS=ON ..

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
Three root causes prevented Python-traced functions from handling nested
control flow:

1. Stack-based tags cannot distinguish Python call sites: The tracer
   uses __builtin_return_address to build unique tags per code position,
   but all Python calls through the same pybind11 dispatcher (e.g. two
   different __gt__ comparisons) produce identical stack backtraces.
   This caused false snapshot collisions in globalTagMap during the
   first trace iteration, triggering premature control-flow merge
   detection and corrupting the trace.

   Fix: Add ExternalPositionHint — a thread-local callback that the
   Python bindings set to return PyFrame_GetLasti() (the current Python
   bytecode offset). This value is XOR'd into the snapshot hash in
   recordSnapshot(), making snapshots from different Python source
   positions distinct while remaining stable across trace iterations.

2. Compound assignment created copies breaking loop detection: __iadd__
   and similar operators returned val objects by value, creating new
   traced objects each iteration. This changed the AliveVariableHash
   across loop iterations, preventing snapshot matching for loop
   back-edge detection.

   Fix: Return by reference with py::return_value_policy::reference_internal.

3. TraceTerminationException couldn't propagate through Python: pybind11
   converted the C++ exception to a generic Python RuntimeError, which
   couldn't be re-caught by the trace loop.

   Fix: Register TraceTerminationException as a Python exception type
   and re-throw it in each registration wrapper's catch handler.

All 71 Python tests and 129 C++ tests pass.

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
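
The first fix above (mixing a Python bytecode offset into the snapshot hash) can be sketched in pure Python. This is only an analogy under stated assumptions: `DISPATCHER_TAG` and both functions are hypothetical, and `sys._getframe(...).f_lasti` is used as the Python-level counterpart of `PyFrame_GetLasti()`; it is CPython-specific.

```python
import sys

# Hypothetical constant: stands for the tag derived from the native stack, which is
# identical for every call routed through the same dispatcher.
DISPATCHER_TAG = 0x5EED


def tag_with_hint():
    """XOR the caller's current bytecode offset into the stack-only tag."""
    caller = sys._getframe(1)  # frame of the Python code that called us
    return DISPATCHER_TAG ^ caller.f_lasti


def traced_function():
    first = tag_with_hint()   # call site 1
    second = tag_with_hint()  # call site 2, at a different bytecode offset
    return first, second


first_tag, second_tag = traced_function()
repeat = traced_function()
```

The two call sites get different tags, while re-running the same function reproduces the same pair, which matches the stated goal of being distinct per source position but stable across trace iterations.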
Relocate all Python-related code (pybind11 bindings, Python package,
and tests) to the project-root plugins/python directory. Update CMake
add_subdirectory path, include directories, and test sys.path entries
to match the new layout.

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
- Add NautilusModule bindings for batch-compiling multiple named functions
  into a single compilation unit (Module/CompiledModule Python wrappers)
- Add NautilusFunction bindings for first-class callable functions that
  are intercepted during tracing (nautilus_function decorator)
- Support 6 module signatures and 4 NautilusFunction signatures
- Add tests for multiple independent functions on same engine (3 tests)
- Add tests for NautilusModule batch compilation (6 tests)
- Add tests for NautilusFunction tracing interception (4 tests)
- Update __init__.py with Module, CompiledModule, and nautilus_function API

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
Contributor

@github-actions github-actions bot left a comment

Tracing Benchmark

Benchmark suite Current: f2fd897 Previous: a3f3e9c Ratio
trace_add 3.59653 us (± 576.485) 2.57955 us (± 268.144) 1.39
completing_trace_add 3.16848 us (± 666.515) 2.54844 us (± 288.334) 1.24
trace_ifThenElse 15.2576 us (± 3.03806) 12.1076 us (± 2.12239) 1.26
completing_trace_ifThenElse 7.83688 us (± 1.35849) 5.30708 us (± 652.345) 1.48
trace_deeplyNestedIfElse 44.8904 us (± 9.07869) 36.6173 us (± 7.37154) 1.23
completing_trace_deeplyNestedIfElse 20.8928 us (± 3.91357) 15.8437 us (± 3.45161) 1.32
trace_loop 15.7337 us (± 2.45831) 11.8298 us (± 1.86512) 1.33
completing_trace_loop 7.44274 us (± 1.22451) 5.39947 us (± 682.77) 1.38
trace_ifInsideLoop 30.9589 us (± 5.49375) 23.4552 us (± 3.32729) 1.32
completing_trace_ifInsideLoop 14.2496 us (± 2.62911) 10.2411 us (± 1.82447) 1.39
trace_loopDirectCall 15.7997 us (± 2.7565) 11.9955 us (± 1.81973) 1.32
completing_trace_loopDirectCall 8.11704 us (± 1.22071) 5.34824 us (± 671.043) 1.52
trace_pointerLoop 23.5504 us (± 4.28354) 18.3703 us (± 3.33774) 1.28
completing_trace_pointerLoop 15.7292 us (± 2.47179) 11.493 us (± 1.73451) 1.37
trace_staticLoop 16.9551 us (± 2.32025) 9.20622 us (± 1.0699) 1.84
completing_trace_staticLoop 15.9759 us (± 2.27471) 8.9735 us (± 1.10899) 1.78
trace_fibonacci 17.6814 us (± 2.78306) 14.1719 us (± 2.66331) 1.25
completing_trace_fibonacci 9.59973 us (± 1.42366) 6.59787 us (± 812.253) 1.45
trace_gcd 14.9798 us (± 2.18011) 11.0463 us (± 1.73258) 1.36
completing_trace_gcd 6.92007 us (± 1.15669) 4.65348 us (± 511.873) 1.49
trace_nestedIf10 79.8917 us (± 18.5807) 55.4514 us (± 8.08936) 1.44
completing_trace_nestedIf10 82.1429 us (± 20.8319) 55.6325 us (± 7.83651) 1.48
trace_nestedIf100 1.81277 ms (± 186.288) 1.81343 ms (± 205.134) 1.00
completing_trace_nestedIf100 1.81595 ms (± 162.212) 1.82224 ms (± 39.2657) 1.00
trace_chainedIf10 160.256 us (± 30.8356) 141.567 us (± 11.4945) 1.13
completing_trace_chainedIf10 89.3013 us (± 22.2154) 71.1739 us (± 10.1789) 1.25
trace_chainedIf100 5.2123 ms (± 175.106) 5.13347 ms (± 45.2127) 1.02
completing_trace_chainedIf100 2.69159 ms (± 165.37) 2.80508 ms (± 38.7944) 0.96
exec_mlir_add 3.9377 ns (± 0.468665) 3.47518 ns (± 0.405246) 1.13
exec_mlir_fibonacci 5.77976 us (± 686.495) 5.04737 us (± 1.08062) 1.15
exec_mlir_sum 508.806 us (± 13.6253) 578.145 us (± 64.0147) 0.88
exec_cpp_add 6.2782 ns (± 1.13576) 5.33205 ns (± 0.543434) 1.18
exec_cpp_fibonacci 95.6696 us (± 8.16666) 95.6188 us (± 9.36912) 1.00
exec_cpp_sum 35.9899 ms (± 256.502) 35.9484 ms (± 76.5862) 1.00
exec_bc_add 67.1796 ns (± 9.42535) 44.2215 ns (± 6.47502) 1.52
exec_bc_fibonacci 844.33 us (± 95.3422) 822.327 us (± 12.6536) 1.03
exec_bc_sum 175.959 ms (± 4.38406) 176.028 ms (± 659.013) 1.00
exec_asmjit_add 4.5988 ns (± 0.958504) 3.60814 ns (± 0.578975) 1.27
exec_asmjit_fibonacci 24.8318 us (± 4.65585) 21.6288 us (± 3.24613) 1.15
exec_asmjit_sum 4.77003 ms (± 366.309) 4.45697 ms (± 45.4201) 1.07
ssa_add 318.404 ns (± 52.9742) 194.671 ns (± 16.1002) 1.64
ssa_ifThenElse 716.596 ns (± 134.003) 481.598 ns (± 33.58) 1.49
ssa_deeplyNestedIfElse 1.44808 us (± 296.088) 1.25854 us (± 168.561) 1.15
ssa_loop 720.961 ns (± 142.979) 523.862 ns (± 48.3283) 1.38
ssa_ifInsideLoop 1243.65 ns (± 268836) 967.22 ns (± 92.4438) 1.29
ssa_loopDirectCall 622.47 ns (± 139.459) 521.524 ns (± 35.9294) 1.19
ssa_pointerLoop 769.924 ns (± 158.652) 645.002 ns (± 54.8434) 1.19
ssa_staticLoop 776.283 ns (± 102.914) 510.093 ns (± 48.6925) 1.52
ssa_fibonacci 583.81 ns (± 102.567) 547.878 ns (± 46.9578) 1.07
ssa_gcd 702.155 ns (± 163.612) 498.361 ns (± 58.7311) 1.41
comp_mlir_add 8.25743 ms (± 99.0941) 8.81596 ms (± 534.547) 0.94
comp_mlir_ifThenElse 8.8621 ms (± 161.185) 9.38068 ms (± 535.218) 0.94
comp_mlir_deeplyNestedIfElse 7.67862 ms (± 75.9376) 8.13313 ms (± 485.529) 0.94
comp_mlir_loop 9.76778 ms (± 108.885) 10.0294 ms (± 376.951) 0.97
comp_mlir_ifInsideLoop 31.1872 ms (± 397.64) 32.1179 ms (± 321.227) 0.97
comp_mlir_loopDirectCall 14.3121 ms (± 163.719) 15.1037 ms (± 329.851) 0.95
comp_mlir_pointerLoop 30.0807 ms (± 236.948) 31.5576 ms (± 340.952) 0.95
comp_mlir_staticLoop 7.68311 ms (± 115.074) 7.7594 ms (± 256.495) 0.99
comp_mlir_fibonacci 13.2116 ms (± 673.477) 13.4276 ms (± 339.291) 0.98
comp_mlir_gcd 12.0756 ms (± 135.092) 12.254 ms (± 311.131) 0.99
comp_mlir_nestedIf10 13.0973 ms (± 121.297) 13.5831 ms (± 902.405) 0.96
comp_mlir_nestedIf100 27.5084 ms (± 466.113) 28.6773 ms (± 317.1) 0.96
comp_mlir_chainedIf10 12.1837 ms (± 165.951) 12.6642 ms (± 379.543) 0.96
comp_mlir_chainedIf100 23.1921 ms (± 323.82) 23.8916 ms (± 324.958) 0.97
comp_cpp_add 25.42 ms (± 820.564) 25.8794 ms (± 368.535) 0.98
comp_cpp_ifThenElse 25.8234 ms (± 831.268) 27.5994 ms (± 1.11786) 0.94
comp_cpp_deeplyNestedIfElse 26.6592 ms (± 832.736) 28.1188 ms (± 667.155) 0.95
comp_cpp_loop 25.7914 ms (± 708.893) 26.7929 ms (± 361.011) 0.96
comp_cpp_ifInsideLoop 26.884 ms (± 747.379) 27.855 ms (± 437.402) 0.97
comp_cpp_loopDirectCall 26.4368 ms (± 623.294) 27.3539 ms (± 383.478) 0.97
comp_cpp_pointerLoop 26.5562 ms (± 578.606) 27.1963 ms (± 330.599) 0.98
comp_cpp_staticLoop 25.6837 ms (± 603.659) 26.1362 ms (± 340.064) 0.98
comp_cpp_fibonacci 26.049 ms (± 529.325) 26.6565 ms (± 383.856) 0.98
comp_cpp_gcd 25.5785 ms (± 756.878) 26.3853 ms (± 522.799) 0.97
comp_cpp_nestedIf10 28.4524 ms (± 505.847) 29.3989 ms (± 349.145) 0.97
comp_cpp_nestedIf100 61.6504 ms (± 503.445) 62.5396 ms (± 487.132) 0.99
comp_cpp_chainedIf10 30.6603 ms (± 499.985) 32.1014 ms (± 671.871) 0.96
comp_cpp_chainedIf100 91.7875 ms (± 1.76442) 92.7786 ms (± 1.07619) 0.99
comp_bc_add 14.8319 us (± 2.10792) 14.781 us (± 2.37726) 1.00
comp_bc_ifThenElse 17.9507 us (± 2.66861) 18.3066 us (± 3.31947) 0.98
comp_bc_deeplyNestedIfElse 22.8133 us (± 3.46982) 21.973 us (± 3.41507) 1.04
comp_bc_loop 18.5566 us (± 3.50124) 18.2333 us (± 3.69041) 1.02
comp_bc_ifInsideLoop 20.8252 us (± 2.73671) 20.8739 us (± 4.33151) 1.00
comp_bc_loopDirectCall 19.1656 us (± 3.29112) 19.2007 us (± 3.78413) 1.00
comp_bc_pointerLoop 20.1164 us (± 3.16482) 19.8644 us (± 3.77911) 1.01
comp_bc_staticLoop 17.1578 us (± 2.85176) 16.7879 us (± 3.38173) 1.02
comp_bc_fibonacci 18.6145 us (± 2.77741) 18.3663 us (± 3.27136) 1.01
comp_bc_gcd 25.3225 us (± 6.83165) 18.0152 us (± 2.76342) 1.41
comp_bc_nestedIf10 35.6392 us (± 4.26413) 35.8522 us (± 4.28134) 0.99
comp_bc_nestedIf100 177.235 us (± 7.4267) 181.097 us (± 11.8962) 0.98
comp_bc_chainedIf10 51.0544 us (± 6.53563) 49.3702 us (± 7.50218) 1.03
comp_bc_chainedIf100 286.479 us (± 12.7824) 290.818 us (± 17.3544) 0.99
comp_asmjit_add 21.0391 us (± 2.63754) 21.188 us (± 4.87693) 0.99
comp_asmjit_ifThenElse 34.7083 us (± 5.39809) 33.5501 us (± 5.38552) 1.03
comp_asmjit_deeplyNestedIfElse 65.5515 us (± 11.3756) 57.7212 us (± 9.17995) 1.14
comp_asmjit_loop 38.8615 us (± 5.93744) 35.6318 us (± 5.27654) 1.09
comp_asmjit_ifInsideLoop 60.8 us (± 10.0116) 59.4218 us (± 11.9057) 1.02
comp_asmjit_loopDirectCall 48.8647 us (± 10.0422) 46.0328 us (± 9.68253) 1.06
comp_asmjit_pointerLoop 53.2866 us (± 9.62254) 48.8339 us (± 10.9407) 1.09
comp_asmjit_staticLoop 30.821 us (± 5.15104) 28.3494 us (± 4.25178) 1.09
comp_asmjit_fibonacci 45.5484 us (± 6.69808) 43.6687 us (± 8.65062) 1.04
comp_asmjit_gcd 38.5195 us (± 6.24653) 35.1428 us (± 5.70744) 1.10
comp_asmjit_nestedIf10 123.537 us (± 23.2415) 109.174 us (± 12.5732) 1.13
comp_asmjit_nestedIf100 1.17893 ms (± 42.9443) 1.14775 ms (± 40.6487) 1.03
comp_asmjit_chainedIf10 181.791 us (± 25.0632) 165.37 us (± 18.1415) 1.10
comp_asmjit_chainedIf100 2.39446 ms (± 81.7349) 2.26744 ms (± 44.0847) 1.06
ir_add 1307.34 ns (± 237174) 854.33 ns (± 81.6744) 1.53
ir_ifThenElse 3.53007 us (± 754.951) 2.671 us (± 456.984) 1.32
ir_deeplyNestedIfElse 8.01014 us (± 1.93109) 6.88012 us (± 823.252) 1.16
ir_loop 3.74549 us (± 851.188) 3.04024 us (± 373.687) 1.23
ir_ifInsideLoop 6.9274 us (± 1.7336) 5.81858 us (± 746.136) 1.19
ir_loopDirectCall 4.35151 us (± 907.141) 3.2659 us (± 376.995) 1.33
ir_pointerLoop 4.51983 us (± 935.667) 3.7749 us (± 569.99) 1.20
ir_staticLoop 2.88822 us (± 623.168) 2.25725 us (± 158.869) 1.28
ir_fibonacci 4.26741 us (± 881.408) 3.17196 us (± 234.985) 1.35
ir_gcd 3.6699 us (± 765.129) 2.66647 us (± 302.248) 1.38
ir_nestedIf10 17.204 us (± 4.31678) 16.1584 us (± 1.86538) 1.06
ir_nestedIf100 184.761 us (± 11.7132) 193.558 us (± 8.17045) 0.95
ir_chainedIf10 31.4909 us (± 6.07932) 29.8691 us (± 1.67191) 1.05
ir_chainedIf100 365.465 us (± 14.7672) 370.333 us (± 13.0525) 0.99

This comment was automatically generated by workflow using github-action-benchmark.
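
The rows above are consistent with the Ratio column being current time divided by previous time, rounded to two decimals, so values above 1.00 mean the current commit is slower. A small check against two rows (the helper name `ratio` is made up for this example, and both inputs must be in the same unit):

```python
def ratio(current, previous):
    """Current / previous, rounded to two decimals."""
    return round(current / previous, 2)


trace_add = ratio(3.59653, 2.57955)           # both in us
trace_static_loop = ratio(16.9551, 9.20622)   # both in us
```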

claude added 10 commits March 31, 2026 21:26
Users can now write `def f(x: int) -> int` instead of
`def f(x: ValInt32) -> ValInt32`. Python types are automatically
mapped: int->ValInt32, float->ValFloat64, bool->ValBool.

Type aliases (nautilus.int64, nautilus.float32, etc.) provide
explicit bit-width control. Existing Val-type annotations still work.

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
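
A rough sketch of how such an annotation mapping could work, using only the standard library. The empty `ValInt32`, `ValFloat64`, and `ValBool` classes below are placeholders for the real bound types, and `resolve_signature` is a hypothetical helper, not the plugin's actual function.

```python
import typing


# Placeholder classes standing in for the real pybind11-bound val types.
class ValInt32: ...
class ValFloat64: ...
class ValBool: ...


# Mapping stated in the commit message: int->ValInt32, float->ValFloat64, bool->ValBool.
PY_TO_VAL = {int: ValInt32, float: ValFloat64, bool: ValBool}


def resolve_signature(fn):
    """Return (parameter types, return type) with plain Python types mapped to val types."""
    hints = typing.get_type_hints(fn)
    ret = hints.pop("return", None)
    # Types that are not in the table (e.g. existing Val annotations) pass through unchanged.
    params = [PY_TO_VAL.get(t, t) for t in hints.values()]
    return params, PY_TO_VAL.get(ret, ret)


def plain(x: int, y: float) -> bool: ...
def explicit(x: ValInt32) -> ValInt32: ...


plain_sig = resolve_signature(plain)
explicit_sig = resolve_signature(explicit)
```

The lookup is keyed on the exact type object, so `bool` maps to `ValBool` even though `bool` is a subclass of `int` in Python.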
All three APIs (Engine, Module, NautilusFunction) now support the
complete set of 10 signatures: i32, i32x2, i64, i64x2, f64, f64x2,
f32, f32x2, bool, and i32->bool predicate.

Previously Module was missing f32, f32x2, bool, i32->bool and
NautilusFunction was missing i64x2, f64x2, f32, f32x2, bool, i32->bool.

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
Wrap arbitrary Python objects as val<void*> and forward all operations
(arithmetic, comparison, bool conversion) to the Python runtime via
invoke() trampolines that call the Python C API. A thread-local arena
tracks intermediate PyObject* references for correct refcounting.

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
Implement .call("method", args...) and .getattr("name") on ValObject,
enabling compiled Python code to invoke arbitrary methods on generic
objects. Uses invoke() with GIL-acquiring trampolines that call
PyObject_GetAttr + PyObject_Call. Supports 0-3 arguments.

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
Bug fixes:
- Fix use-after-free in all 4 object call sites: reinterpret_borrow
  did not incref, but arenaClear decreffed, leaving a dangling pointer.
  Now Py_XINCREF result before arenaClear + use reinterpret_steal.
- Fix PyObject_RichCompareBool error handling: -1 (error) was silently
  mapped to false; now calls PyErr_Clear on error.
- Fix PyTuple_Pack null check in call() trampolines: null would
  segfault on PyObject_Call.

Simplification:
- Extract 15 named trampoline functions (pyAdd, pySub, pyEq, etc.)
  to py_object_helpers.hpp, replacing ~65 inline lambdas.
- Template makeCppFunc and PyNautilusFunc, replacing 12 copy-paste
  functions and 12 copy-paste structs with one template each.
- Use REGISTER_NAUTILUS_FUNC macro for pybind11 class registrations.

Added missing operator overloads for ValObject consistency:
- __rtruediv__, __rfloordiv__, __rmod__ (reverse operators)
- __ne__, __le__, __ge__ with py::object overloads
- __isub__, __imul__ with py::object overloads
- __floordiv__, __mod__ with py::object overloads

Net: -327 lines across 3 files.

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
- Add pyproject.toml enabling `pip install -e .` for the Python plugin
- Add tests/conftest.py that auto-discovers the CMake build directory
- Remove hardcoded sys.path hacks from all 7 test files

Usage after CMake build:
  cd plugins/python && pip install -e .
  pytest tests/

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
Generated by pip install -e . and should not be tracked.

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
- Add Options.set() method that auto-detects Python types (bool, int,
  float, str) and Options.set_double() for completeness
- Engine now accepts **kwargs for compiler options with underscore-to-dot
  conversion: Engine(dump_console=True) → dump.console=True
- Engine also accepts an Options object: Engine(options=opts)
- Add 5 tests covering kwargs, Options object, type detection, invalid
  type rejection, and dump options

Usage:
  engine = Engine(backend="mlir", mlir_optimizationLevel=2)
  engine = Engine(dump_console=True, dump_after_tracing=True)

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
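
A minimal sketch of the two behaviours described in this commit, written without the real `Options` class. Both helpers are hypothetical, and the key conversion assumes that every underscore becomes a dot, which the commit message only demonstrates for `dump_console`.

```python
def kwargs_to_option_keys(**kwargs):
    """Assumed rule: replace each underscore in a keyword name with a dot."""
    return {key.replace("_", "."): value for key, value in kwargs.items()}


def detect_kind(value):
    """Classify a Python value as one of the four supported option types."""
    # bool must be tested before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "str"
    raise TypeError(f"unsupported option type: {type(value).__name__}")


options = kwargs_to_option_keys(dump_console=True, mlir_optimizationLevel=2)
kinds = [detect_kind(v) for v in (True, 2, 1.5, "mlir")]
```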
Extends the Python plugin to support functions with 3 parameters for all
numeric types (int32, int64, float32, float64) and generic objects. Adds
C++ bindings (CallableFunction, register, Module, CompiledModule, ModuleFunction,
NautilusFunction) and corresponding Python registry entries for each 3-arg
signature.

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59
Mixed-type signatures (e.g. def f(x: int, y: float) -> float) are now
auto-promoted to the widest numeric type so they map to an existing
same-type C++ binding. Promotion order: bool < int32 < int64 < f32 < f64.
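
The promotion rule can be sketched as picking the highest-ranked type in the stated order. The string type names and the `promote` helper are illustrative only.

```python
# Promotion order from the commit message, narrowest to widest.
PROMOTION_ORDER = ["bool", "int32", "int64", "f32", "f64"]


def promote(*type_names):
    """Return the widest of the given type names according to PROMOTION_ORDER."""
    return max(type_names, key=PROMOTION_ORDER.index)


mixed = promote("int32", "f64")          # def f(x: int, y: float) -> float
integers = promote("bool", "int32", "int64")
```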

While loops with val conditions work correctly when loop variables use
compound assignment (+=, -=, *=) to preserve C++ object identity for
the tracer's loop back-edge detection.
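
Why compound assignment matters for identity can be seen in plain Python. In this sketch both classes are hypothetical: one `__iadd__` returns a fresh object, the other mutates and returns `self`, which is the behaviour the paragraph above relies on.

```python
class ByValue:
    """__iadd__ builds a new object, so the variable is rebound on every `+=`."""

    def __init__(self, v):
        self.v = v

    def __iadd__(self, other):
        return ByValue(self.v + other)


class InPlace:
    """__iadd__ mutates and returns self, so the object identity survives `+=`."""

    def __init__(self, v):
        self.v = v

    def __iadd__(self, other):
        self.v += other
        return self


a = ByValue(0)
a_before = a
a += 1

b = InPlace(0)
b_before = b
b += 1
```

After the updates, `a` refers to a different object than before, while `b` is still the same object with an updated value.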

static_range() provides a for-loop helper that yields ValInt32 values,
unrolling the loop at trace time.
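
A toy version of the idea, assuming `static_range` is essentially a generator over Python's `range` that wraps each index. `ToyInt32` is a placeholder for `ValInt32`, and the list of strings stands in for recorded trace operations.

```python
class ToyInt32:
    """Placeholder for ValInt32: wraps a constant known at trace time."""

    def __init__(self, value):
        self.value = value


def static_range(*args):
    """Yield one wrapped constant per iteration; arguments are forwarded to range()."""
    for i in range(*args):
        yield ToyInt32(i)


# The Python loop runs during tracing, so the trace contains three straight-line
# operations and no loop construct.
trace = []
for i in static_range(3):
    trace.append(f"acc += {i.value}")
```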

https://claude.ai/code/session_015Nypj18LoyGGvDWQ9amU59