Pulse v2 #2167

Draft
kali wants to merge 22 commits into main from pulse-v2

Conversation


kali commented Apr 29, 2026

No description provided.

kali added 22 commits April 29, 2026 07:45
Introduce the PulseV2 abstraction alongside the existing pulse code.
Instead of tracking a single streaming axis with fixed pulse size and
startup delay, PulseV2 describes per-axis regions (what data an op
produces at pulse T) with start/end as TDim expressions in T and P.

For now: AxisRegion, PulseV2Region, PulseV2Fact, and a basic
from_batch_input constructor for simple streaming inputs.
Add run_pulse_v2() which naively slices input into pulse-sized chunks,
runs the batch model concretized to pulse size, and stitches output.

Identity test passes. Conv test fails because the naive slicing doesn't
provide the overlap (kernel-1 samples) that the conv needs from the
previous pulse. This is the buffering problem PulseV2 needs to solve.
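The overlap failure can be reproduced with a toy sketch in plain Rust (not the tract API; `conv_valid` is a stand-in moving-sum convolution, all names illustrative):

```rust
// Hypothetical standalone sketch: why naive per-pulse slicing breaks a
// valid convolution at chunk boundaries.

/// Valid 1-D convolution with an all-ones kernel of size `k` (moving sum).
fn conv_valid(input: &[f32], k: usize) -> Vec<f32> {
    if input.len() < k {
        return vec![];
    }
    (0..=input.len() - k)
        .map(|i| input[i..i + k].iter().sum())
        .collect()
}

/// Naive pulsing: slice into pulse-sized chunks, run the conv per chunk.
fn conv_naive_pulsed(input: &[f32], k: usize, pulse: usize) -> Vec<f32> {
    input.chunks(pulse).flat_map(|c| conv_valid(c, k)).collect()
}

fn main() {
    let input: Vec<f32> = (0..8).map(|x| x as f32).collect();
    let batch = conv_valid(&input, 3); // 6 output frames
    let pulsed = conv_naive_pulsed(&input, 3, 4); // 2 + 2 = 4 frames
    // Each pulse is missing the kernel-1 = 2 lookback samples from the
    // previous pulse, so 2 boundary frames are lost.
    assert_eq!(batch.len(), 6);
    assert_eq!(pulsed.len(), 4);
}
```

With kernel=3 and pulse=4 over 8 samples, the batch conv emits 6 frames but the naive pulsed run emits only 4 — the 2 missing frames straddle the chunk boundary.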
PulseV2Model wraps a TypedModel with per-wire region metadata.
PulseV2Model::new() pulsifies a batch model (currently: Source and
trivial pass-through only). into_typed() lowers back to TypedModel.

Test harness now goes through PulseV2Model instead of ad-hoc
concretization. Identity test passes; conv test still fails
(no buffering yet).
PulseV2Buffer is a stateful op that stores previous pulse tensors
in a ring buffer and stitches them with the current input along a
given axis. At T=0, zero-pads the lookback region.

PulseV2Model::new() detects Conv ops and inserts PulseV2Buffer on
the data input with lookback = (kernel-1) * dilation.

Both identity and conv tests pass. Conv test: kernel=3, pulse=4,
input=8 samples, valid padding. Buffer provides 2 samples of
lookback from the previous pulse, conv produces 4 valid frames
per pulse (plus 2 startup frames to skip from zero padding).
PulseV2Buffer output_facts uses Min(T*pulse, lookback) to express the
variable lookback: at T=0, no lookback (output = P); at steady state,
full lookback (output = P + lookback). The runtime resolves T per pulse.

Conv at T=0 produces P-K+1 valid frames, at T≥1 produces P frames.
Every frame is valid — no garbage, no startup delay to discard.

This is the core PulseV2 contract: variable-size increments, no waste.
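The contract can be checked with a small arithmetic sketch (hypothetical helper names, not tract code; buffer output length is P + min(T·P, lookback) and a valid conv consumes K-1 of it):

```rust
// Illustrative sketch of the variable-size per-pulse output contract.

/// Buffer output length at pulse T: input P plus whatever lookback exists.
fn buffer_out_len(t: usize, p: usize, lookback: usize) -> usize {
    p + (t * p).min(lookback)
}

/// Valid conv (stride 1) after the buffer, with lookback = K - 1.
fn conv_out_len(t: usize, p: usize, k: usize) -> usize {
    buffer_out_len(t, p, k - 1) - (k - 1)
}

fn main() {
    // kernel=3, pulse=4: T=0 yields P-K+1 = 2 frames, T>=1 yields P = 4.
    assert_eq!(conv_out_len(0, 4, 3), 2);
    assert_eq!(conv_out_len(1, 4, 3), 4);
    assert_eq!(conv_out_len(5, 4, 3), 4);
}
```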
Introduce PulsifierV2 trait + inventory::collect! for registering
V2 pulsifiers. Each pulsifier is a function that receives a
PulseV2Context (batch model, target typed model, mapping, symbols)
and wires the pulsed version of an op.

Conv pulsifier moved from hardcoded logic in PulseV2Model::new()
to a registered pulsifier in v2_conv.rs via register_pulsifier_v2!.

PulseV2Model::new() now: Sources → try inventory → default pass-through.
Passing tests:
  - Identity (Source → output)
  - Conv kernel=3 pulse=4 (original)
  - Conv kernel=2 pulse=2
  - Conv chain: kernel=2 → kernel=3, pulse=3

Known gaps (ignored):
  - Conv pulse=1 kernel=3 (P < K: no output at T=0, needs empty-output)
  - Slice/Crop (batch coordinates applied to pulse-sized tensor, needs
    region transform)
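The dispatch shape can be sketched with a plain static table standing in for the `inventory` registry the commit describes (all names illustrative, not the real trait):

```rust
// Minimal registry sketch: a static table instead of inventory::collect!.

/// Stand-in for the pulsification context handed to each pulsifier.
struct PulseV2Context {
    op_name: String,
}

/// A pulsifier inspects the context and either wires a pulsed op (Some)
/// or declines (None), in which case the default pass-through applies.
type PulsifierV2 = fn(&PulseV2Context) -> Option<String>;

fn conv_pulsifier(ctx: &PulseV2Context) -> Option<String> {
    (ctx.op_name == "Conv").then(|| "PulseV2Buffer + Conv".to_string())
}

static PULSIFIERS: &[PulsifierV2] = &[conv_pulsifier];

/// Mirrors the dispatch order: Sources -> try the registry -> pass-through.
fn pulsify(ctx: &PulseV2Context) -> String {
    PULSIFIERS
        .iter()
        .find_map(|p| p(ctx))
        .unwrap_or_else(|| "pass-through".to_string())
}

fn main() {
    let conv = PulseV2Context { op_name: "Conv".into() };
    let tanh = PulseV2Context { op_name: "Tanh".into() };
    assert_eq!(pulsify(&conv), "PulseV2Buffer + Conv");
    assert_eq!(pulsify(&tanh), "pass-through");
}
```

The real registration goes through `inventory`, which lets each `v2_*.rs` file submit its pulsifier without a central table; the static slice here is just the simplest stand-in for that lookup.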
PulseV2Pad: stateless op that emits before-padding at T=0 alongside data,
after-padding at the pulse where (T+1)*P > S. output_facts uses TDim
expressions with Eq(T,0), Min, Max — no fresh symbols.

TDim fix: Min/Max inclusive_bound used filter_map which silently dropped
terms with unknown bounds. For Max upper bound, an unknown term means the
result is unknown (not the max of known terms). Same for Min lower bound.
This caused Min(1, Max(0, X)) to incorrectly simplify to Max(0, X).
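The bound bug can be shown in isolation (a toy sketch over `Option<i64>` bounds, not the actual TDim code):

```rust
// Each term's upper bound is Option<i64>, with None meaning "unknown".
// For the upper bound of Max(terms), a single unknown term makes the
// whole bound unknown.

/// Buggy version: filter_map silently drops unknown terms.
fn max_upper_bound_buggy(bounds: &[Option<i64>]) -> Option<i64> {
    bounds.iter().filter_map(|b| *b).max()
}

/// Fixed version: propagate None if any term's bound is unknown.
fn max_upper_bound(bounds: &[Option<i64>]) -> Option<i64> {
    bounds.iter().copied().collect::<Option<Vec<_>>>()?.into_iter().max()
}

fn main() {
    // Max(0, X) with X's bound unknown: the buggy bound claims <= 0,
    // which lets Min(1, Max(0, X)) "simplify" to Max(0, X); the fix
    // keeps the bound unknown so the Min survives.
    let terms = [Some(0), None];
    assert_eq!(max_upper_bound_buggy(&terms), Some(0));
    assert_eq!(max_upper_bound(&terms), None);
}
```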

Runner: now feeds extra pulses past the input until the batch output
length is reached, so after-padding is fully emitted.

6/6 proptests passing: single conv, conv chain, dilation, stride, crop, pad.
Pad + Conv combo works without extra code: the generic pulsifier
inserts PulseV2Pad then PulseV2Buffer + Conv automatically.

Also added full pad_plus_conv proptest from V1 harness (ignored, needs
stride + edge padding support).
MaxPool: shares pool_spec_input_regions with Conv — zero new logic.

Downsample + deconv: added run_v2 and full proptests (both ignored).
- Downsample fails: stride alignment + runner loop issue
- Deconv needs output overlap buffer (new pattern, not just input lookback)

7/7 passing, 4 ignored full proptests.
Buffer output_facts uses a fresh H symbol (0 <= H <= lookback) instead
of min(T*P, lookback). H resolves at runtime from the actual tensor
size. This fixes the intermediate buffer overestimate.

Clamp probes multiple worst cases (T=0/100, P=1/100, H=0/1).
S >= P assertion helps TDim simplification.

7/7 focused proptests pass. Full conv proptest hangs — TDim
simplification of deeply nested H/Min/Max/Eq expressions is too
slow (Mul distribution causes exponential expression growth).
Needs either distribution depth limit or simpler expressions.
Per-node clamping wraps every conv output with max(0, …) which gives the
TDim simplifier nested Max trees that blow up exponentially on deep conv
chains. Intermediate shapes can stay flat: any startup-time negative
value is irrelevant inside the graph since streaming-axis allocation
happens at the sinks. Clamp only the final pulsed outputs.

30-layer dilated conv chain: pulsification drops from 99 s to ~0 ms, and
now scales linearly with depth.
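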
…, hey_snips

dilated_chain_grows and skip_residual_chain_grows are quick-running
synthetic harnesses that print per-iteration time and shape size for
PulseV2Model::new on increasingly deep models. Useful to localise
TDim simplifier blowup signatures.

hey_snips_v2 is the real-model smoke (currently #[ignore]d — hangs at
the first skip-Add, where parallel paths with different H_i chains
trigger broadcast Max nesting).
Drop the H_i symbolic-history mechanism and the variable-pulse-size
contract that came with it. Buffer now emits a constant `input + lookback`
on its buffered axis, matching v1's `Delay`:

- output_facts: shape on the buffered axis is `input_dim + lookback` (a
  concrete usize add). No fresh H symbol, no scope assertions.
- state: a single history tensor of shape `lookback` on the buffered
  axis, initialised to zeros at session start.
- eval: output = concat(history, input) on the buffered axis; new
  history = last `lookback` samples of the output.
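The eval rule above, sketched on a plain `Vec<f32>` (illustrative state struct, not the op itself):

```rust
// Fixed-pulse buffer sketch: output = concat(history, input);
// new history = last `lookback` samples of the output.

struct BufferState {
    lookback: usize,
    history: Vec<f32>, // zero-initialised at session start
}

impl BufferState {
    fn new(lookback: usize) -> Self {
        Self { lookback, history: vec![0.0; lookback] }
    }

    /// Per-pulse eval: always emits `input.len() + lookback` samples.
    fn eval(&mut self, input: &[f32]) -> Vec<f32> {
        let mut out = self.history.clone();
        out.extend_from_slice(input);
        self.history = out[out.len() - self.lookback..].to_vec();
        out
    }
}

fn main() {
    let mut buf = BufferState::new(2);
    // T=0: the conv after the buffer sees two zero-prefilled samples.
    assert_eq!(buf.eval(&[1.0, 2.0, 3.0, 4.0]), vec![0.0, 0.0, 1.0, 2.0, 3.0, 4.0]);
    // T=1: history carries the last two samples of the previous output.
    assert_eq!(buf.eval(&[5.0, 6.0, 7.0, 8.0]), vec![3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
}
```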

This means at T=0 the conv-after-buffer sees `input + lookback` zeros-
prefixed samples and produces a full `(input + lookback - K + 1)/stride`
output — fixed-shape ramp with garbage prefix that's discarded at the
sink, instead of the variable-size ramp the H_i scheme produced.

Single-buffered-axis assumption: the streaming axis is the only one
that gets a non-zero lookback. `buffered_axis()` debug-asserts at most
one such axis. Multi-axis lookback is left for a future revision.

The `depth` field is gone; the ring-buffer-of-tensors representation
goes away with it.

Also: harness `run_and_compare_v2` now computes `total_delay` (sum of
buffer lookbacks on the streaming axis) and skips that many samples
from the front of the pulsed output before comparing — those samples
are zero-prefill garbage. Run loop bumped to drain `batch_len +
total_delay` total samples.

Net on the focused proptests:
- single_conv, conv_chain, conv_dilation, crop, pad, pad_plus_conv: 6/7 pass.
- conv_stride: fails on stride > 1 because total_delay is naive (sums
  raw lookbacks without dividing by stride). Stride-aware delay
  accounting is a follow-up.
Each `PulseV2Buffer` pre-fills `lookback` zeros on the input axis. After
a `Conv(stride=s)`, those garbage samples land on the output axis as
`lookback / s` zero-prefilled output positions (because stride compresses
the output). Generalise to chained ops by walking the pulsed model:

- accumulate `total_lookback` from every PulseV2Buffer
- accumulate `total_stride` as the product of streaming-axis strides on
  every Conv node
- output-axis delay = `total_lookback / total_stride`

Holds for the proptest models (single linear streaming path with conv +
buffers + pad + crop). Real graphs with branching would need per-buffer
downstream-stride accounting; flagging that as a follow-up in the comment.
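The accounting walk can be sketched over a toy op list (illustrative enum, not the tract graph types):

```rust
// Delay accounting for a single linear streaming path.

enum Op {
    Buffer { lookback: usize },
    Conv { stride: usize },
    Other,
}

/// Output-axis delay: sum of buffer lookbacks divided by the product of
/// streaming-axis conv strides (naive rule; assumes one linear path).
fn output_delay(ops: &[Op]) -> usize {
    let mut total_lookback = 0;
    let mut total_stride = 1;
    for op in ops {
        match op {
            Op::Buffer { lookback } => total_lookback += lookback,
            Op::Conv { stride } => total_stride *= stride,
            Op::Other => {}
        }
    }
    total_lookback / total_stride
}

fn main() {
    // Buffer(lookback=2) -> Conv(stride=2): 2/2 = 1 zero-prefilled position.
    let chain = [Op::Buffer { lookback: 2 }, Op::Conv { stride: 2 }];
    assert_eq!(output_delay(&chain), 1);
    // Stride-1 chain: delay is just the summed lookbacks.
    let chain = [
        Op::Buffer { lookback: 2 }, Op::Conv { stride: 1 },
        Op::Buffer { lookback: 4 }, Op::Conv { stride: 1 },
    ];
    assert_eq!(output_delay(&chain), 6);
}
```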

`proptest_v2_conv_stride` recovers; the v2 focused suite is back to 7/7.
`into_optimized()` decomposes Conv into Im2col + OptMatMul and inserts
AxisOp wrappers that pulse-v2's RegionTransform inventory doesn't handle
— those ops are concrete-allocation kernels, not the ops we want to
pulsify against. Drop the call.

Still `#[ignore]`'d — the new fixed-pulse Buffer alone doesn't unblock
hey_snips because PulseV2Slice still emits variable-pulse output that
differs syntactically from the conv side's constant P, so broadcast at
the skip-Add still walks into the Max chain blowup. The ignore message
is updated to point at the actual remaining gap (Slice rewrite).
Same approach as v1's PulsedAxisSlice: on the streaming axis, Slice is
metadata bookkeeping. The runtime tensor passes through unchanged
(constant `P` per pulse); the slice's `start` is the v2-flavoured
equivalent of v1's `delay`, tracked at sink/merge time rather than as a
per-pulse shape.

- `PulseV2Slice::output_facts` returns input fact unchanged.
- `PulseV2Slice::eval_with_session` returns input as-is.
- The `start` and `end` fields are kept on the op for downstream
  inspection (the harness sums slice starts on the streaming axis into
  its `total_delay` so batch-vs-pulsed comparison realigns correctly).
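A toy sketch of the pass-through-plus-metadata shape (illustrative struct and `total_delay` helper, not tract's types):

```rust
// PulseV2Slice as metadata: runtime pass-through, `start`/`end` only read
// by consumers walking the graph.

struct PulseV2Slice {
    start: usize,
    end: usize,
}

impl PulseV2Slice {
    /// On the streaming axis the tensor passes through unchanged.
    fn eval<'a>(&self, input: &'a [f32]) -> &'a [f32] {
        input
    }
}

/// Harness-side realignment: slice starts plus buffer lookbacks.
fn total_delay(slices: &[PulseV2Slice], buffer_lookbacks: &[usize]) -> usize {
    slices.iter().map(|s| s.start).sum::<usize>()
        + buffer_lookbacks.iter().sum::<usize>()
}

fn main() {
    let s = PulseV2Slice { start: 2, end: 10 };
    let pulse = [1.0, 2.0, 3.0, 4.0];
    assert_eq!(s.eval(&pulse), &pulse); // runtime no-op: shape stays P
    assert_eq!(s.end - s.start, 8);     // metadata only consumers read
    assert_eq!(total_delay(&[s], &[2, 2]), 6);
}
```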

Net effect on the wavenet smoke test: `hey_snips_pulsify_v2` now walks
the WaveNet graph end-to-end in 0.16 s. The original blowup at chained
skip-Adds is gone — both sides arrive at the Add with the same
constant-`P` declared shape, broadcast unifies trivially, no Max-tree
nesting. Un-ignored.

Pad on the streaming axis is left as-is for now: wavenet's typed-
decluttered form has none, since `decompose_streaming_padding` splits
Conv-with-padding into Pad on non-streaming axes plus a valid Conv. A
similar fixed-pulse rewrite for PulseV2Pad will be needed once a model
actually uses streaming-axis padding.

7 v2 proptests still pass (proptest_v2_crop exercises the new Slice
pass-through).
Replace the smoke-only test with a real correctness check: run the
batch model and the pulsified v2 model on the same random input,
collect per-pulse outputs, and assert that the steady-state tail of
the pulsed output matches batch within float tolerance.

The trick to avoid worrying about cumulative delay: run pulses just
long enough to consume the full input stream (`stream_len.div_ceil(pulse)`
pulses); the tail `batch_len` samples of the pulsed output are aligned
with the full batch output because the pulsed runtime emits a fixed
size per pulse (post-buffer/post-slice) and the stream is fully
consumed by the last pulse.
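The alignment trick, sketched with hypothetical helpers:

```rust
// Run just enough pulses to consume the stream, then compare tails.

/// Number of pulses needed to consume the whole input stream.
fn pulses_needed(stream_len: usize, pulse: usize) -> usize {
    stream_len.div_ceil(pulse)
}

/// The steady-state tail of the pulsed output, aligned with batch output.
fn pulsed_tail(pulsed: &[f32], batch_len: usize) -> &[f32] {
    &pulsed[pulsed.len() - batch_len..]
}

fn main() {
    assert_eq!(pulses_needed(8, 4), 2);
    assert_eq!(pulses_needed(9, 4), 3);
    // e.g. pulsed output 256 samples vs batch output 74: compare the final
    // 74; the 182-sample prefix is the receptive-field warm-up.
    let pulsed = vec![0.0f32; 256];
    assert_eq!(pulsed_tail(&pulsed, 74).len(), 74);
}
```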

Wavenet inventory after pulsification: buffers=49 convs=49 slices=46.
Pulsed output (256) — batch output (74) = 182 = wavenet's effective
receptive field on the streaming axis. Comparison: 74/74 samples
within `close_enough` tolerance.

This is the wavenet milestone: v2 produces numerically the same
output as the unpulsified model, on the historical test case that
v1 was originally designed for; v2 generalises v1 while keeping its
fixed-pulse streaming-axis semantics.
Mirrors the existing `--pulse [PULSE]` block: a new `--pulse-v2 [SYM]`
flag that takes the streaming axis symbol name (default "S") and
registers a `pulse-v2` stage that runs `PulseV2Model::new` and
`into_typed()`, plus a passthrough `pulse-v2-to-type` stage so users
can target the lowered form via `--pass`.

Lower-side only: the run / compare side (per-pulse driver, P/T/S
symbol resolution at runtime, output stitching) still needs to be
plumbed; today only the dump path is useful for inspection.
PulseV2Slice on the streaming axis is a no-op at runtime — its `start`
and `end` are just metadata for consumers walking the graph during
pulsification. Once the model is declutter-ready, the marker has done
its job. Implement `TypedOp::declutter` as `shunt_one_op`, mirroring
`Identity`. Also fix `slice_transform` to skip the replacement entirely
when the slice's axis isn't the streaming axis (the original `Slice`
runs correctly there — no need for our pass-through wrapper).

Add a `pulse-v2-declutter` CLI stage so `--pass pulse-v2-declutter`
gives the same lowered form `pulse-declutter` does for v1.

Net effect on the wavenet topology comparison:

  | op                              | v1 | v2 pre-declutter | v2 post-declutter |
  | streaming-state ops             |  25|              49  |              25   |
  | streaming-axis Slice markers    |   0|              46  |               0   |
  | EinSum/Conv/Add/Mul/Tanh/...    |    identical across all three             |

Post-declutter, v2 reaches the same op-count topology as v1: 25 state-
bearing streaming ops, no pass-through markers. The decluttering also
collapsed 49 PulseV2Buffer down to 25 — the surrounding declutter rules
deduplicate buffers that share the same upstream source (presumably the
fan-out predecessors that wavenet's residual stacks share).

The `pulse-v2-declutter` stage uses the standard `into_decluttered()`
pass — the magic is just that `PulseV2Slice::declutter` now signals
"shunt me" so the rest of the declutter pipeline can fold around it.

Test status: 7 v2 proptests still pass; hey_snips numerical match
unchanged; pre-declutter graph still has the 46 Slice markers (so
external code that wants to read `start`/`end` metadata still can).
PulseV2Buffer with exactly one buffered axis is semantically equivalent
to v1's Delay: same lookback (= overlap) state, same per-pulse output
(`input + lookback`), same zero-init at session start. Lowering at
declutter time gives us the in-place memmove fast path, NNEF
serialisation round-trip, and all the v1-era plumbing for free.

`PulseV2Buffer::declutter` checks `buffered_axis()`:
- exactly one buffered axis: replace with `Delay::new_typed(input_fact,
  axis, delay=0, overlap=lookback)`.
- zero buffered axes: shunt entirely (it's an identity wrapper).
- multi-axis lookback: keep PulseV2Buffer (no multi-axis Delay yet).
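The three-way decision can be sketched as a match on the buffered axes (illustrative types; tract's real `Delay` op is modelled as a plain variant here):

```rust
// Declutter decision for a buffer given its per-axis lookbacks.

enum Lowered {
    Delay { axis: usize, overlap: usize }, // lower to v1's Delay
    Shunt,                                 // identity wrapper, remove it
    Keep,                                  // no multi-axis Delay yet
}

/// `lookbacks[axis]` is the lookback of a hypothetical buffer on that axis.
fn declutter_buffer(lookbacks: &[usize]) -> Lowered {
    let buffered: Vec<usize> =
        (0..lookbacks.len()).filter(|&a| lookbacks[a] > 0).collect();
    match buffered.as_slice() {
        [] => Lowered::Shunt,
        [axis] => Lowered::Delay { axis: *axis, overlap: lookbacks[*axis] },
        _ => Lowered::Keep,
    }
}

fn main() {
    assert!(matches!(declutter_buffer(&[0, 0]), Lowered::Shunt));
    assert!(matches!(
        declutter_buffer(&[0, 2]),
        Lowered::Delay { axis: 1, overlap: 2 }
    ));
    assert!(matches!(declutter_buffer(&[1, 2]), Lowered::Keep));
}
```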

Wavenet topology comparison after this and the previous PulseV2Slice
shunt-on-declutter:

  | op       | v1 | v2 (pre-declutter) | v2 (post-declutter) |
  | EinSum   | 49 |                49  |                 49  |
  | Conv     | 49 |                49  |                 49  |
  | Add      | 47 |                47  |                 47  |
  | Mul      | 25 |                25  |                 25  |
  | Delay    | 25 |                 0  |                 25  |
  | Tanh     | 24 |                24  |                 24  |
  | Sigmoid  | 24 |                24  |                 24  |
  | AddN     | 23 |                23  |                 23  |
  | Relu     |  2 |                 2  |                  2  |
  | Max      |  2 |                 2  |                  2  |
  | (markers)|  0 | 49 PulseV2Buffer +  46 PulseV2Slice |   0 |

v1 and v2 (post-declutter) are now op-for-op identical on hey_snips.

7 v2 proptests still pass; hey_snips numerical match unchanged.
Under fixed-pulse semantics there are no negative streaming-axis dims to
clamp:
- PulseV2Buffer's output is `input + lookback` (constant positive).
- Conv-after-Buffer's output is `(input + lookback - (K-1)·D) / stride`
  which stays ≥ 0 by construction (lookback = (K-1)·D).
- No `min(P, max(0, …))` symbolic ramps anywhere.
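A quick arithmetic check of the non-negativity claim (hypothetical helper; the lookback is substituted directly, so the conv output length reduces to input/stride):

```rust
// With lookback = (K-1)*D, the conv-after-buffer output length is
// (input + lookback - (K-1)*D) / stride = input / stride, never negative.

fn conv_after_buffer_len(input: usize, k: usize, d: usize, stride: usize) -> usize {
    let lookback = (k - 1) * d;
    (input + lookback - (k - 1) * d) / stride
}

fn main() {
    assert_eq!(conv_after_buffer_len(4, 3, 1, 1), 4);
    assert_eq!(conv_after_buffer_len(0, 3, 2, 1), 0); // even an empty pulse is fine
    assert_eq!(conv_after_buffer_len(4, 3, 1, 2), 2);
}
```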

So the clamp helper never fires in practice; its call site is a no-op,
and the function itself probes for symbol names like `H` and `S` that we
no longer generate. Delete both.

The variable-pulse-era `Min(T·P, lookback)` ramp commit
(ab77d8669) and the H-symbol commit (2541a12a3) and the sink-only-clamp
commit (1179a384a) are still in the foundation history (the cherry-pick
chain that built up to fixed-pulse), but their mechanisms are now
genuinely dead. This commit removes the last in-tree reference to them
in v2.rs. v2_pad.rs still uses the variable-pulse formula but that's a
known follow-up (Pad rewrite) — not in this commit's scope.

7 v2 proptests still pass; hey_snips numerical match unchanged; both
blowup tests stay flat (skip_residual_chain_grows scales linearly,
dilated_chain_grows shape is literally `1,1,P` at n=30).