Experiment: Hybrid JIT - Pyston's DynASM JIT + CPython's Stencil JIT #146235
undingen wants to merge 2 commits into python:main
Conversation
Replace the copy-and-patch relocation engine with a DynASM-based pipeline.
Instead of manually copying pre-compiled stencil blobs and patching GOT
entries / trampolines at runtime, Clang-generated assembly is converted at
build time into DynASM .dasc source, which is then compiled into a C header
(jit_stencils_dynasm.h). At runtime the DynASM assembler encodes native
x86-64 directly, resolving all labels, jumps, and data references in a
single pass.
Key changes:
Build pipeline (Tools/jit/):
- _asm_to_dasc.py: New peephole optimizer that converts Clang AT&T asm
to DynASM Intel-syntax .dasc. Uses typed operand classes (Reg, Mem,
Imm) with Python 3.10+ match/case for pattern matching. Includes
15+ optimization patterns (immediate narrowing, test-self elimination,
indexed memory folding, ALU immediate folding, redundant stack reload
elimination, dead label removal, etc.).
- _dasc_writer.py: Generates jit_stencils.h with DynASM preamble,
emit helpers (emit_mov_imm, emit_call_ext, emit_cmp_reg_imm,
emit_test/and/or/xor_reg_imm), and per-stencil emit functions.
- _targets.py: Reworked to drive the DynASM pipeline — compiles
stencils, converts asm, generates .dasc, runs the DynASM preprocessor,
and produces the final header.
- _stencils.py: Adds COLD_CODE HoleValue for hot/cold section splitting.
- _optimizers.py: Extended with stencil frame-size tracking and
frame-group merging infrastructure.
- build.py: Adds --peephole-stats flag for optimization statistics.
- test_peephole.py: unit tests covering peephole patterns and
the line classification infrastructure.
- Lib/test/test_jit_peephole.py: Hooks peephole tests into make test.
Runtime (Python/jit.c):
- Complete rewrite of _PyJIT_Compile: uses DynASM dasm_init / dasm_setup
/ per-stencil emit / dasm_link / dasm_encode instead of memcpy+patch.
- Hot/cold code splitting: cold (error) paths are placed in a separate
DynASM section after the hot code, improving i-cache locality.
- Frame merging: stencils share a single prologue/epilogue, eliminating redundant rsp adjustments.
- SET_IP delta encoding: incremental IP updates avoid redundant full
address loads.
- Hint-based mmap: jit_alloc() places JIT code near the CPython text
segment for short (±2 GB) RIP-relative calls and LEAs.
- jit_shrink(): releases unused pages at the end of each compiled trace.
- emit_call_ext: emits direct RIP-relative call when target is within
±2 GB, otherwise falls back to indirect call through register.
- emit_mov_imm: picks the shortest encoding (xor/mov32/mov64/lea rip)
based on the runtime value.
Freelist inlining (Tools/jit/jit.h + template.c):
- Macro overrides redirect float/int allocation and deallocation to
JIT-inlined versions that directly access the thread-state freelists,
avoiding function call overhead for the most common object types.
- _PyJIT_FloatFromDouble / _PyJIT_FloatDealloc: inline float freelist.
- _PyJIT_LongDealloc / _PyJIT_FastDealloc: inline int/generic dealloc.
- _PyJIT_CompactLong_{Add,Subtract,Multiply}: inline compact long ops.
- PyStackRef_CLOSE / Py_DECREF overrides use the fast dealloc path.
LuaJIT submodule:
- Added as Tools/jit/LuaJIT for the DynASM assembler (dynasm/ only
used at build time; no LuaJIT runtime code is linked).
This is an experimental port, currently tested on x86_64 Linux only.
The approach is a hybrid between Pyston's fully hand-written DynASM JIT
(https://github.com/pyston/pyston/blob/pyston_main/Python/aot_ceval_jit.c)
and CPython's Clang-generated stencils: Clang produces the stencil
assembly, and DynASM handles encoding and relocation at runtime.
…true divide, power, floor divide, modulo
Add a proper specialization-based approach for inplace float modification.
This uses the existing _PyBinaryOpCache.external_cache[0..2] to store
profiling hints from the specializer, which the optimizer reads to select
the best tier2 inplace variant.
The key insight: when a float binary operation produces a result and one
of the input operands has refcount 1 (or 2 when a STORE_FAST targets it),
we can modify that float object in-place instead of allocating a new one.
This eliminates allocation overhead in common patterns like:
x += y # STORE_FAST_LEFT: left operand is the local
x = a + b # generic: whichever operand has refcount 1
total += a * b # chained: intermediate has refcount 1
Specializer (specialize.c):
- binary_op_float_inplace_candidate(): checks if either operand has
refcount 1 at specialization time
- binary_op_float_inplace_store_fast_hint(): checks if next instruction
is STORE_FAST targeting left (source=1) or right (source=2) operand
- Stores hints in external_cache[0]=use_inplace, [1]=source, [2]=local_index
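A hypothetical Python sketch of this hint hand-off (function names and the refcount thresholds are illustrative, not the specialize.c code; the slot layout follows the description above):

```python
USE_INPLACE, SOURCE, LOCAL_INDEX = 0, 1, 2   # external_cache slot layout
GENERIC, LEFT, RIGHT = 0, 1, 2               # values for the `source` slot

def specializer_hints(left_refcnt, right_refcnt, store_fast_target=None, local_index=0):
    """What the specializer would write into external_cache[0..2].
    A STORE_FAST immediately after the op lets us tolerate refcount 2 on
    the targeted operand: the local variable holds the extra reference."""
    if store_fast_target == "left" and left_refcnt <= 2:
        return [1, LEFT, local_index]
    if store_fast_target == "right" and right_refcnt <= 2:
        return [1, RIGHT, local_index]
    if left_refcnt == 1 or right_refcnt == 1:
        return [1, GENERIC, 0]
    return [0, GENERIC, 0]

def optimizer_pick(cache):
    """What the tier-2 optimizer emits after reading the hints back."""
    return "_BINARY_OP_INPLACE_ADD_FLOAT" if cache[USE_INPLACE] else "_BINARY_OP_ADD_FLOAT"
```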
New tier2 ops:
- Float/int mixed: _BINARY_OP_{ADD,SUBTRACT,MULTIPLY,TRUE_DIVIDE}_FLOAT_INT
- Float true divide: _BINARY_OP_TRUE_DIVIDE_FLOAT (with zero check)
- Float power: _BINARY_OP_POWER_FLOAT (positive base, finite exponent)
- Int floor divide: _BINARY_OP_FLOOR_DIVIDE_INT (Python semantics)
- Int modulo: _BINARY_OP_MODULO_INT (Python semantics)
- Inplace float: _BINARY_OP_INPLACE_{TRUE_DIVIDE,POWER}_FLOAT
- Inplace int: _BINARY_OP_INPLACE_{ADD,SUBTRACT,MULTIPLY}_INT
Architecture: The specializer writes JIT hints to external_cache[3] (an enum
indicating which tier2 op to use) and calls unspecialize() — the interpreter
runs the generic BINARY_OP path while the JIT optimizer reads the hints and
emits the specialized tier2 op. This avoids wasting interpreter opcode slots
while still getting full JIT specialization.
Also adds _PyCompactLong_FloorDivide() and _PyCompactLong_Modulo() helpers
in Objects/longobject.c with correct Python floor-division/modulo semantics
(sign correction for negative operands).
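The sign correction those helpers need can be illustrated in Python, building Python's floor-division/modulo semantics on top of C-style truncating division (a sketch of the idea, not the longobject.c code):

```python
def c_div(a: int, b: int) -> int:
    """C-style truncating division, what a hardware idiv gives us."""
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q

def py_floordiv(a: int, b: int) -> int:
    """Python floor division: round toward negative infinity, not zero."""
    q = c_div(a, b)
    r = a - q * b
    if r != 0 and (r < 0) != (b < 0):
        q -= 1   # truncation rounded the wrong way for mixed signs
    return q

def py_mod(a: int, b: int) -> int:
    """Python modulo: the result takes the sign of the divisor."""
    r = a - c_div(a, b) * b
    if r != 0 and (r < 0) != (b < 0):
        r += b
    return r
```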
The following commit authors need to sign the Contributor License Agreement:
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the
@picnixz why did you close this? I think leaving it open just as a read is interesting.
Do not create PRs for PoCs. In addition, this looks too big for a change without an issue, a PEP, and a discussion.
You do not need a PR for PoCs. This wastes CI resources whenever there is a change, and it diverts our attention from work we want to merge.
We could just ask the author to not push any changes to this branch. |
No, please. We already have too many PRs. Or at least create an issue for that. But we really try to avoid this in general, and IIRC we did ask people not to do that in the devguide. If you do want to sponsor this PR, though, please create an issue so that the feedback is not lost in the PR.
No, an issue is for something actionable. This PR is quite valuable in that it teaches a lot, but it isn't an actionable item on its own. I will close the PR next Wednesday, so you don't have to worry about that.
@undingen thanks a lot for running this experiment. It's really valuable info: could you please post the geometric mean speedup you get from
In that case, the real speedups are probably slightly lower, because
Experimental proof-of-concept, currently only working on x86-64 Linux. It is not intended for merging or detailed review, but rather to demonstrate the direction CPython's JIT could take and make a case for why this approach is worth pursuing.
This PR replaces CPython's copy-and-patch JIT with one using DynASM while keeping Clang-compiled C stencil templates as the source of the assembly. No hand-written assembly opcodes are required - we get the best of both worlds: LLVM-quality machine code with DynASM's runtime flexibility for encoding, linking, and layout.
Background
While I haven't worked on Pyston in many years, I always had in the back of my mind that I wanted to port some of the things that worked well over to CPython so everyone could benefit. Pyston's JIT used DynASM with hand-written assembly for every micro-op - it gave precise control over code generation but was a large maintenance/porting burden.
I finally found some time over the last few weeks to explore this, and with the help of Claude Opus 4.6, I was able to carry out this experiment. I mostly described what should be implemented and how, gave it Pyston as a reference, and let it implement things. Before AI assistance this wouldn't have been realistic given the time constraints I have.
One thing I remember that made Pyston fast: stay inside the JIT-generated code as much as possible, but make sure that code size does not explode.
In practice this meant inline caches and inlining, combined with splitting hot and cold code apart. Lots of tiny optimizations which individually don't really show up, but all together they increase performance noticeably.
And I found DynASM was the right tool for that, because it generates code super fast and is flexible and quite easy to use.
It makes, for example, hot/cold splitting super easy, as you will see further down.
A lot has changed over the years in CPython, so I only focused on the things I know sped up Pyston and where I could see an easy way to integrate them into current CPython. There are definitely more things that could be ported, but for this experiment this should be enough.
Also, there are currently quite a few peephole optimizations which work directly on the asm instructions; I think some of them can likely be removed without any performance difference...
I will go into more detail further down. I'm looking forward to feedback, but I don't know how much more time I'll be able to spend on this.
Note: Everything should work normally except that the reference tracer (_PyReftracerTrack) is disabled in the JIT paths.
Performance
Compiled with ./configure --enable-experimental-jit=yes on an Intel Core Ultra 7 155H laptop, turbo boost disabled. pyperformance run sorted by speedup, only showing the significant ones:
Note:
this PR contains two commits; the second one in-place modifies some float objects with refcnt==1 (Pyston had this too) - it's mainly here to show that the new JIT can generate okay code with the hot/cold splitting.
These are the perf changes from the first commit to the second:
How to try it out:
You need to have luajit and llvm-21-dev installed:
I added luajit for DynASM as a git submodule - it comes with minilua.c, which one could use instead of installing luajit, but I did not hook that up in the Makefile, so installing luajit is necessary right now...
Note: lua is only used during CPython build time to run the DynASM dasc preprocessor, which spits out C code; it's not used at runtime.
How It Works
Stencil Build Pipeline
What's new:
The DynASM syntax inside the jit_stencils-x86_64-unknown-linux-gnu.dasc file: lines starting with | emit machine code at runtime; everything else is normal C. E.g. load a value into a register:
This is impossible with copy-and-patch: values aren't known at build time, so every load must be worst-case 10 bytes. With DynASM, we pick the optimal encoding at JIT compile time.
What a Stencil Looks Like
The instruction->oparg * 8 + 80 offset is computed at JIT compile time and folded into a single addressing mode. The current CPython JIT replicates some opcodes to generate optimized code for the most common opargs; DynASM always lets us pick the best encoding.
What Changed vs. Copy-and-Patch
- Runtime GOT/trampoline patching: removed - DynASM handles this automatically
- Patch targets become =>NPC labels
- Sections: .code (hot) / .cold, with automatic placement
- Calls: call rel32 with the correct displacement
- dasm_link() resolves everything in one pass
Key Optimizations
1. Hot/Cold Splitting
Error paths and deoptimization stubs go in the .cold section, which DynASM lays out after all hot code. This keeps hot code compact for better I-cache utilization.
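A toy model of what the section mechanism does (class and method names invented for illustration): emission targets one buffer per section, and linking lays all hot code out before any cold code:

```python
class TwoSectionAsm:
    """Minimal sketch of a two-section assembler (.code / .cold)."""

    def __init__(self):
        self.sections = {"code": bytearray(), "cold": bytearray()}
        self.current = "code"

    def section(self, name: str) -> None:
        # switch the emission target, like DynASM's |.code / |.cold
        self.current = name

    def emit(self, code: bytes) -> None:
        self.sections[self.current] += code

    def link(self) -> bytes:
        # all hot code first, every cold (error/deopt) path after it
        return bytes(self.sections["code"] + self.sections["cold"])
```

Even when the source interleaves hot and cold emission, the linked output keeps the hot path contiguous.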
2. Runtime-Optimal Immediates
Most type objects and runtime functions land in the 5- or 7-byte tier, saving 3–5 bytes each vs. the old fixed 10-byte movabs.
3. LLVM Fold Pass + Peephole Optimizer
An LLVM pass (jit_fold_pass.cpp) folds JIT-time constants directly in IR - tracing SSA chains from _JIT_OPARG, _JIT_OPERAND0/1 and _PyRuntime, folding GEP address arithmetic, and emitting compact inline-asm markers. The downstream peephole optimizer then fuses assembly patterns.
LOAD_SMALL_INT - before (4 instructions): mov + shl + mov + add
After (1 instruction - entire address computed at JIT time):
Things like this happen in a lot of stencils, e.g. emit__LIST_APPEND_r10...
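The folding itself is trivial once oparg is known: at JIT compile time the whole address expression collapses to one constant displacement (a sketch; the base address and the offsets mirror the example above and are illustrative):

```python
def fold_small_int_addr(base: int, oparg: int) -> int:
    """At JIT compile time oparg is a known constant, so the runtime
    mov + shl + mov + add sequence computing `base + oparg * 8 + 80`
    collapses to a single constant the emitter can bake into one
    `mov reg, [addr]`-style instruction."""
    return base + oparg * 8 + 80
```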
Convert the boolean result into Py_False or Py_True:
4. Direct Calls
We pass mmap() a hint so the JIT code is allocated within ±2 GB of CPython's text segment; nearly all calls can then use the shorter direct form.
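The reachability check behind this is small; a sketch of the rel32 range test (the displacement is measured from the end of the 5-byte `call rel32` instruction):

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def can_call_rel32(call_site: int, target: int) -> bool:
    """True when a direct `call rel32` reaches `target` from `call_site`.
    The displacement is relative to the next instruction, i.e. the
    address just past the 5-byte call encoding."""
    disp = target - (call_site + 5)
    return INT32_MIN <= disp <= INT32_MAX
```

When this returns False, the emitter has to fall back to loading the address into a register and doing an indirect call.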
5. Shared Trace Frame
All stencils in a trace share a single stack frame. At trace entry we do push rbp; mov rbp, rsp; sub rsp, N once; individual stencil prologues/epilogues are stripped at build time. Each trace exit inlines mov rsp, rbp; pop rbp; ret. This eliminates per-stencil frame setup overhead on the hottest paths.
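A sketch of the frame-merging arithmetic under simple assumptions: the one shared `sub rsp, N` is sized to the largest frame any stencil in the trace needs, rounded up to the ABI's 16-byte stack alignment (function name invented for illustration):

```python
def shared_frame_size(stencil_frame_sizes, align: int = 16) -> int:
    """Size of the single trace-wide stack frame: the maximum of the
    per-stencil frame sizes, rounded up to the given alignment.
    (The real build-time pass also strips each stencil's own
    prologue/epilogue; this only models the size computation.)"""
    need = max(stencil_frame_sizes, default=0)
    return (need + align - 1) // align * align
```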
Experimental _BINARY_OP_INPLACE_ADD_FLOAT and friends modify float values in-place when the refcount is 1. The hot path reduces to not much more than a single SSE instruction:
I told the AI to implement it in a way that does not use up opcodes (because I assumed they are still very limited), but now I'm not sure whether this was necessary; maybe it could have been implemented in a more straightforward way...
What DynASM Enables for the Future
Beyond the current optimizations, DynASM gives us a flexible code generation foundation that copy-and-patch fundamentally cannot provide. Since we control instruction emission at runtime, future work can:
What This Cannot Solve
This work is purely a backend improvement. Higher-level optimizations must happen elsewhere:
Profiling shows JIT-emitted code accounts for ~25% of runtime on numeric benchmarks. The remaining 75% is spent in C runtime functions (PyFloat_FromDouble, _Py_Dealloc, etc.). Backend optimizations can't produce dramatic speedups alone, but they ensure generated code is as tight as possible and remove the backend as a bottleneck for higher-level improvements.
Testing
test_pyrepl failure)