Pulse v2 #2167

Draft
kali wants to merge 22 commits into main from pulse-v2

Conversation


kali commented Apr 29, 2026

No description provided.

kali added 22 commits April 29, 2026 07:45
Introduce the PulseV2 abstraction alongside the existing pulse code.
Instead of tracking a single streaming axis with fixed pulse size and
startup delay, PulseV2 describes per-axis regions (what data an op
produces at pulse T) with start/end as TDim expressions in T and P.

For now: AxisRegion, PulseV2Region, PulseV2Fact, and a basic
from_batch_input constructor for simple streaming inputs.
Add run_pulse_v2() which naively slices input into pulse-sized chunks,
runs the batch model concretized to pulse size, and stitches output.

Identity test passes. Conv test fails because the naive slicing doesn't
provide the overlap (kernel-1 samples) that the conv needs from the
previous pulse. This is the buffering problem PulseV2 needs to solve.
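The overlap failure can be reproduced with a toy sketch in plain Rust (not the tract API; `conv_valid` is a stand-in moving-sum convolution, all names illustrative):

```rust
// Hypothetical standalone sketch: why naive per-pulse slicing breaks a
// valid convolution at chunk boundaries.

/// Valid 1-D convolution with an all-ones kernel of size `k` (moving sum).
fn conv_valid(input: &[f32], k: usize) -> Vec<f32> {
    if input.len() < k {
        return vec![];
    }
    (0..=input.len() - k)
        .map(|i| input[i..i + k].iter().sum())
        .collect()
}

/// Naive pulsing: slice into pulse-sized chunks, run the conv per chunk.
fn conv_naive_pulsed(input: &[f32], k: usize, pulse: usize) -> Vec<f32> {
    input.chunks(pulse).flat_map(|c| conv_valid(c, k)).collect()
}

fn main() {
    let input: Vec<f32> = (0..8).map(|x| x as f32).collect();
    let batch = conv_valid(&input, 3); // 6 output frames
    let pulsed = conv_naive_pulsed(&input, 3, 4); // 2 + 2 = 4 frames
    // Each pulse is missing the kernel-1 = 2 lookback samples from the
    // previous pulse, so 2 boundary frames are lost.
    assert_eq!(batch.len(), 6);
    assert_eq!(pulsed.len(), 4);
}
```

With kernel=3 and pulse=4 over 8 samples, the batch conv emits 6 frames but the naive pulsed run emits only 4 — the 2 missing frames straddle the chunk boundary.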
PulseV2Model wraps a TypedModel with per-wire region metadata.
PulseV2Model::new() pulsifies a batch model (currently: Source and
trivial pass-through only). into_typed() lowers back to TypedModel.

Test harness now goes through PulseV2Model instead of ad-hoc
concretization. Identity test passes; conv test still fails
(no buffering yet).
PulseV2Buffer is a stateful op that stores previous pulse tensors
in a ring buffer and stitches them with the current input along a
given axis. At T=0, zero-pads the lookback region.

PulseV2Model::new() detects Conv ops and inserts PulseV2Buffer on
the data input with lookback = (kernel-1) * dilation.

Both identity and conv tests pass. Conv test: kernel=3, pulse=4,
input=8 samples, valid padding. Buffer provides 2 samples of
lookback from the previous pulse, conv produces 4 valid frames
per pulse (plus 2 startup frames to skip from zero padding).
PulseV2Buffer output_facts uses Min(T*pulse, lookback) to express the
variable lookback: at T=0, no lookback (output = P); at steady state,
full lookback (output = P + lookback). The runtime resolves T per pulse.

Conv at T=0 produces P-K+1 valid frames, at T≥1 produces P frames.
Every frame is valid — no garbage, no startup delay to discard.

This is the core PulseV2 contract: variable-size increments, no waste.
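The contract can be checked with a small arithmetic sketch (hypothetical helper names, not tract code; buffer output length is P + min(T·P, lookback) and a valid conv consumes K-1 of it):

```rust
// Illustrative sketch of the variable-size per-pulse output contract.

/// Buffer output length at pulse T: input P plus whatever lookback exists.
fn buffer_out_len(t: usize, p: usize, lookback: usize) -> usize {
    p + (t * p).min(lookback)
}

/// Valid conv (stride 1) after the buffer, with lookback = K - 1.
fn conv_out_len(t: usize, p: usize, k: usize) -> usize {
    buffer_out_len(t, p, k - 1) - (k - 1)
}

fn main() {
    // kernel=3, pulse=4: T=0 yields P-K+1 = 2 frames, T>=1 yields P = 4.
    assert_eq!(conv_out_len(0, 4, 3), 2);
    assert_eq!(conv_out_len(1, 4, 3), 4);
    assert_eq!(conv_out_len(5, 4, 3), 4);
}
```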
Introduce PulsifierV2 trait + inventory::collect! for registering
V2 pulsifiers. Each pulsifier is a function that receives a
PulseV2Context (batch model, target typed model, mapping, symbols)
and wires the pulsed version of an op.

Conv pulsifier moved from hardcoded logic in PulseV2Model::new()
to a registered pulsifier in v2_conv.rs via register_pulsifier_v2!.

PulseV2Model::new() now: Sources → try inventory → default pass-through.
Passing tests:
  - Identity (Source → output)
  - Conv kernel=3 pulse=4 (original)
  - Conv kernel=2 pulse=2
  - Conv chain: kernel=2 → kernel=3, pulse=3

Known gaps (ignored):
  - Conv pulse=1 kernel=3 (P < K: no output at T=0, needs empty-output)
  - Slice/Crop (batch coordinates applied to pulse-sized tensor, needs
    region transform)
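The dispatch shape can be sketched with a plain static table standing in for the `inventory` registry the commit describes (all names illustrative, not the real trait):

```rust
// Minimal registry sketch: a static table instead of inventory::collect!.

/// Stand-in for the pulsification context handed to each pulsifier.
struct PulseV2Context {
    op_name: String,
}

/// A pulsifier inspects the context and either wires a pulsed op (Some)
/// or declines (None), in which case the default pass-through applies.
type PulsifierV2 = fn(&PulseV2Context) -> Option<String>;

fn conv_pulsifier(ctx: &PulseV2Context) -> Option<String> {
    (ctx.op_name == "Conv").then(|| "PulseV2Buffer + Conv".to_string())
}

static PULSIFIERS: &[PulsifierV2] = &[conv_pulsifier];

/// Mirrors the dispatch order: Sources -> try the registry -> pass-through.
fn pulsify(ctx: &PulseV2Context) -> String {
    PULSIFIERS
        .iter()
        .find_map(|p| p(ctx))
        .unwrap_or_else(|| "pass-through".to_string())
}

fn main() {
    let conv = PulseV2Context { op_name: "Conv".into() };
    let tanh = PulseV2Context { op_name: "Tanh".into() };
    assert_eq!(pulsify(&conv), "PulseV2Buffer + Conv");
    assert_eq!(pulsify(&tanh), "pass-through");
}
```

The real registration goes through `inventory`, which lets each `v2_*.rs` file submit its pulsifier without a central table; the static slice here is just the simplest stand-in for that lookup.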
PulseV2Pad: stateless op that emits before-padding at T=0 alongside data,
after-padding at the pulse where (T+1)*P > S. output_facts uses TDim
expressions with Eq(T,0), Min, Max — no fresh symbols.

TDim fix: Min/Max inclusive_bound used filter_map which silently dropped
terms with unknown bounds. For Max upper bound, an unknown term means the
result is unknown (not the max of known terms). Same for Min lower bound.
This caused Min(1, Max(0, X)) to incorrectly simplify to Max(0, X).
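The bound bug can be shown in isolation (a toy sketch over `Option<i64>` bounds, not the actual TDim code):

```rust
// Each term's upper bound is Option<i64>, with None meaning "unknown".
// For the upper bound of Max(terms), a single unknown term makes the
// whole bound unknown.

/// Buggy version: filter_map silently drops unknown terms.
fn max_upper_bound_buggy(bounds: &[Option<i64>]) -> Option<i64> {
    bounds.iter().filter_map(|b| *b).max()
}

/// Fixed version: propagate None if any term's bound is unknown.
fn max_upper_bound(bounds: &[Option<i64>]) -> Option<i64> {
    bounds.iter().copied().collect::<Option<Vec<_>>>()?.into_iter().max()
}

fn main() {
    // Max(0, X) with X's bound unknown: the buggy bound claims <= 0,
    // which lets Min(1, Max(0, X)) "simplify" to Max(0, X); the fix
    // keeps the bound unknown so the Min survives.
    let terms = [Some(0), None];
    assert_eq!(max_upper_bound_buggy(&terms), Some(0));
    assert_eq!(max_upper_bound(&terms), None);
}
```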

Runner: now feeds extra pulses past the input until the batch output
length is reached, so after-padding is fully emitted.

6/6 proptests passing: single conv, conv chain, dilation, stride, crop, pad.
Pad + Conv combo works without extra code: the generic pulsifier
inserts PulseV2Pad then PulseV2Buffer + Conv automatically.

Also added full pad_plus_conv proptest from V1 harness (ignored, needs
stride + edge padding support).
MaxPool: shares pool_spec_input_regions with Conv — zero new logic.

Downsample + deconv: added run_v2 and full proptests (both ignored).
- Downsample fails: stride alignment + runner loop issue
- Deconv needs output overlap buffer (new pattern, not just input lookback)

7/7 passing, 4 ignored full proptests.
Buffer output_facts uses a fresh H symbol (0 <= H <= lookback) instead
of min(T*P, lookback). H resolves at runtime from the actual tensor
size. This fixes the intermediate buffer overestimate.

Clamp probes multiple worst cases (T=0/100, P=1/100, H=0/1).
S >= P assertion helps TDim simplification.

7/7 focused proptests pass. Full conv proptest hangs — TDim
simplification of deeply nested H/Min/Max/Eq expressions is too
slow (Mul distribution causes exponential expression growth).
Needs either distribution depth limit or simpler expressions.
Per-node clamping wraps every conv output with max(0, …) which gives the
TDim simplifier nested Max trees that blow up exponentially on deep conv
chains. Intermediate shapes can stay flat: any startup-time negative
value is irrelevant inside the graph since streaming-axis allocation
happens at the sinks. Clamp only the final pulsed outputs.

30-layer dilated conv chain: pulsification drops from 99 s to ~0 ms, and
now scales linearly with depth.
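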
…, hey_snips

dilated_chain_grows and skip_residual_chain_grows are quick-running
synthetic harnesses that print per-iteration time and shape size for
PulseV2Model::new on increasingly deep models. Useful to localise
TDim simplifier blowup signatures.

hey_snips_v2 is the real-model smoke (currently #[ignore]d — hangs at
the first skip-Add, where parallel paths with different H_i chains
trigger broadcast Max nesting).
Drop the H_i symbolic-history mechanism and the variable-pulse-size
contract that came with it. Buffer now emits a constant `input + lookback`
on its buffered axis, matching v1's `Delay`:

- output_facts: shape on the buffered axis is `input_dim + lookback` (a
  concrete usize add). No fresh H symbol, no scope assertions.
- state: a single history tensor of shape `lookback` on the buffered
  axis, initialised to zeros at session start.
- eval: output = concat(history, input) on the buffered axis; new
  history = last `lookback` samples of the output.
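The eval rule above, sketched on a plain `Vec<f32>` (illustrative state struct, not the op itself):

```rust
// Fixed-pulse buffer sketch: output = concat(history, input);
// new history = last `lookback` samples of the output.

struct BufferState {
    lookback: usize,
    history: Vec<f32>, // zero-initialised at session start
}

impl BufferState {
    fn new(lookback: usize) -> Self {
        Self { lookback, history: vec![0.0; lookback] }
    }

    /// Per-pulse eval: always emits `input.len() + lookback` samples.
    fn eval(&mut self, input: &[f32]) -> Vec<f32> {
        let mut out = self.history.clone();
        out.extend_from_slice(input);
        self.history = out[out.len() - self.lookback..].to_vec();
        out
    }
}

fn main() {
    let mut buf = BufferState::new(2);
    // T=0: the conv after the buffer sees two zero-prefilled samples.
    assert_eq!(buf.eval(&[1.0, 2.0, 3.0, 4.0]), vec![0.0, 0.0, 1.0, 2.0, 3.0, 4.0]);
    // T=1: history carries the last two samples of the previous output.
    assert_eq!(buf.eval(&[5.0, 6.0, 7.0, 8.0]), vec![3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
}
```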

This means at T=0 the conv-after-buffer sees `input + lookback` zeros-
prefixed samples and produces a full `(input + lookback - K + 1)/stride`
output — fixed-shape ramp with garbage prefix that's discarded at the
sink, instead of the variable-size ramp the H_i scheme produced.

Single-buffered-axis assumption: the streaming axis is the only one
that gets a non-zero lookback. `buffered_axis()` debug-asserts at most
one such axis. Multi-axis lookback is left for a future revision.

The `depth` field is gone; the ring-buffer-of-tensors representation
goes away with it.

Also: harness `run_and_compare_v2` now computes `total_delay` (sum of
buffer lookbacks on the streaming axis) and skips that many samples
from the front of the pulsed output before comparing — those samples
are zero-prefill garbage. Run loop bumped to drain `batch_len +
total_delay` total samples.

Net on the focused proptests:
- single_conv, conv_chain, conv_dilation, crop, pad, pad_plus_conv: 6/7 pass.
- conv_stride: fails on stride > 1 because total_delay is naive (sums
  raw lookbacks without dividing by stride). Stride-aware delay
  accounting is a follow-up.
Each `PulseV2Buffer` pre-fills `lookback` zeros on the input axis. After
a `Conv(stride=s)`, those garbage samples land on the output axis as
`lookback / s` zero-prefilled output positions (because stride compresses
the output). Generalise to chained ops by walking the pulsed model:

- accumulate `total_lookback` from every PulseV2Buffer
- accumulate `total_stride` as the product of streaming-axis strides on
  every Conv node
- output-axis delay = `total_lookback / total_stride`

Holds for the proptest models (single linear streaming path with conv +
buffers + pad + crop). Real graphs with branching would need per-buffer
downstream-stride accounting; flagging that as a follow-up in the comment.
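The accounting walk can be sketched over a toy op list (illustrative enum, not the tract graph types):

```rust
// Delay accounting for a single linear streaming path.

enum Op {
    Buffer { lookback: usize },
    Conv { stride: usize },
    Other,
}

/// Output-axis delay: sum of buffer lookbacks divided by the product of
/// streaming-axis conv strides (naive rule; assumes one linear path).
fn output_delay(ops: &[Op]) -> usize {
    let mut total_lookback = 0;
    let mut total_stride = 1;
    for op in ops {
        match op {
            Op::Buffer { lookback } => total_lookback += lookback,
            Op::Conv { stride } => total_stride *= stride,
            Op::Other => {}
        }
    }
    total_lookback / total_stride
}

fn main() {
    // Buffer(lookback=2) -> Conv(stride=2): 2/2 = 1 zero-prefilled position.
    let chain = [Op::Buffer { lookback: 2 }, Op::Conv { stride: 2 }];
    assert_eq!(output_delay(&chain), 1);
    // Stride-1 chain: delay is just the summed lookbacks.
    let chain = [
        Op::Buffer { lookback: 2 }, Op::Conv { stride: 1 },
        Op::Buffer { lookback: 4 }, Op::Conv { stride: 1 },
    ];
    assert_eq!(output_delay(&chain), 6);
}
```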

`proptest_v2_conv_stride` recovers; the v2 focused suite is back to 7/7.
`into_optimized()` decomposes Conv into Im2col + OptMatMul and inserts
AxisOp wrappers that pulse-v2's RegionTransform inventory doesn't handle
— those ops are concrete-allocation kernels, not the ops we want to
pulsify against. Drop the call.

Still `#[ignore]`'d — the new fixed-pulse Buffer alone doesn't unblock
hey_snips because PulseV2Slice still emits variable-pulse output that
differs syntactically from the conv side's constant P, so broadcast at
the skip-Add still walks into the Max chain blowup. The ignore message
is updated to point at the actual remaining gap (Slice rewrite).
Same approach as v1's PulsedAxisSlice: on the streaming axis, Slice is
metadata bookkeeping. The runtime tensor passes through unchanged
(constant `P` per pulse); the slice's `start` is the v2-flavoured
equivalent of v1's `delay`, tracked at sink/merge time rather than as a
per-pulse shape.

- `PulseV2Slice::output_facts` returns input fact unchanged.
- `PulseV2Slice::eval_with_session` returns input as-is.
- The `start` and `end` fields are kept on the op for downstream
  inspection (the harness sums slice starts on the streaming axis into
  its `total_delay` so batch-vs-pulsed comparison realigns correctly).
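A toy sketch of the pass-through-plus-metadata shape (illustrative struct and `total_delay` helper, not tract's types):

```rust
// PulseV2Slice as metadata: runtime pass-through, `start`/`end` only read
// by consumers walking the graph.

struct PulseV2Slice {
    start: usize,
    end: usize,
}

impl PulseV2Slice {
    /// On the streaming axis the tensor passes through unchanged.
    fn eval<'a>(&self, input: &'a [f32]) -> &'a [f32] {
        input
    }
}

/// Harness-side realignment: slice starts plus buffer lookbacks.
fn total_delay(slices: &[PulseV2Slice], buffer_lookbacks: &[usize]) -> usize {
    slices.iter().map(|s| s.start).sum::<usize>()
        + buffer_lookbacks.iter().sum::<usize>()
}

fn main() {
    let s = PulseV2Slice { start: 2, end: 10 };
    let pulse = [1.0, 2.0, 3.0, 4.0];
    assert_eq!(s.eval(&pulse), &pulse); // runtime no-op: shape stays P
    assert_eq!(s.end - s.start, 8);     // metadata only consumers read
    assert_eq!(total_delay(&[s], &[2, 2]), 6);
}
```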

Net effect on the wavenet smoke test: `hey_snips_pulsify_v2` now walks
the WaveNet graph end-to-end in 0.16 s. The original blowup at chained
skip-Adds is gone — both sides arrive at the Add with the same
constant-`P` declared shape, broadcast unifies trivially, no Max-tree
nesting. Un-ignored.

Pad on the streaming axis is left as-is for now: wavenet's typed-
decluttered form has none, since `decompose_streaming_padding` splits
Conv-with-padding into Pad on non-streaming axes plus a valid Conv. A
similar fixed-pulse rewrite for PulseV2Pad will be needed once a model
actually uses streaming-axis padding.

7 v2 proptests still pass (proptest_v2_crop exercises the new Slice
pass-through).
Replace the smoke-only test with a real correctness check: run the
batch model and the pulsified v2 model on the same random input,
collect per-pulse outputs, and assert that the steady-state tail of
the pulsed output matches batch within float tolerance.

The trick to avoid worrying about cumulative delay: run pulses just
long enough to consume the full input stream (`stream_len.div_ceil(pulse)`
pulses); the tail `batch_len` samples of the pulsed output are aligned
with the full batch output because the pulsed runtime emits a fixed
size per pulse (post-buffer/post-slice) and the stream is fully
consumed by the last pulse.
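The alignment trick, sketched with hypothetical helpers:

```rust
// Run just enough pulses to consume the stream, then compare tails.

/// Number of pulses needed to consume the whole input stream.
fn pulses_needed(stream_len: usize, pulse: usize) -> usize {
    stream_len.div_ceil(pulse)
}

/// The steady-state tail of the pulsed output, aligned with batch output.
fn pulsed_tail(pulsed: &[f32], batch_len: usize) -> &[f32] {
    &pulsed[pulsed.len() - batch_len..]
}

fn main() {
    assert_eq!(pulses_needed(8, 4), 2);
    assert_eq!(pulses_needed(9, 4), 3);
    // e.g. pulsed output 256 samples vs batch output 74: compare the final
    // 74; the 182-sample prefix is the receptive-field warm-up.
    let pulsed = vec![0.0f32; 256];
    assert_eq!(pulsed_tail(&pulsed, 74).len(), 74);
}
```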

Wavenet inventory after pulsification: buffers=49 convs=49 slices=46.
Pulsed output (256) — batch output (74) = 182 = wavenet's effective
receptive field on the streaming axis. Comparison: 74/74 samples
within `close_enough` tolerance.

This is the wavenet milestone: v2 produces numerically the same
output as the unpulsified model, on the historical test case that
v1 was originally designed for; v2 generalises v1 while keeping its
fixed-pulse streaming-axis semantics.
Mirrors the existing `--pulse [PULSE]` block: a new `--pulse-v2 [SYM]`
flag that takes the streaming axis symbol name (default "S") and
registers a `pulse-v2` stage that runs `PulseV2Model::new` and
`into_typed()`, plus a passthrough `pulse-v2-to-type` stage so users
can target the lowered form via `--pass`.

Lower-side only: the run / compare side (per-pulse driver, P/T/S
symbol resolution at runtime, output stitching) still needs to be
plumbed; today only the dump path is useful for inspection.
PulseV2Slice on the streaming axis is a no-op at runtime — its `start`
and `end` are just metadata for consumers walking the graph during
pulsification. Once the model is declutter-ready, the marker has done
its job. Implement `TypedOp::declutter` as `shunt_one_op`, mirroring
`Identity`. Also fix `slice_transform` to skip the replacement entirely
when the slice's axis isn't the streaming axis (the original `Slice`
runs correctly there — no need for our pass-through wrapper).

Add a `pulse-v2-declutter` CLI stage so `--pass pulse-v2-declutter`
gives the same lowered form `pulse-declutter` does for v1.

Net effect on the wavenet topology comparison:

  | op                              | v1 | v2 pre-declutter | v2 post-declutter |
  | streaming-state ops             |  25|              49  |              25   |
  | streaming-axis Slice markers    |   0|              46  |               0   |
  | EinSum/Conv/Add/Mul/Tanh/...    |    identical across all three             |

Post-declutter, v2 reaches the same op-count topology as v1: 25 state-
bearing streaming ops, no pass-through markers. The decluttering also
collapsed 49 PulseV2Buffer down to 25 — the surrounding declutter rules
deduplicate buffers that share the same upstream source (presumably the
fan-out predecessors that wavenet's residual stacks share).

The `pulse-v2-declutter` stage uses the standard `into_decluttered()`
pass — the magic is just that `PulseV2Slice::declutter` now signals
"shunt me" so the rest of the declutter pipeline can fold around it.

Test status: 7 v2 proptests still pass; hey_snips numerical match
unchanged; pre-declutter graph still has the 46 Slice markers (so
external code that wants to read `start`/`end` metadata still can).
PulseV2Buffer with exactly one buffered axis is semantically equivalent
to v1's Delay: same lookback (= overlap) state, same per-pulse output
(`input + lookback`), same zero-init at session start. Lowering at
declutter time gives us the in-place memmove fast path, NNEF
serialisation round-trip, and all the v1-era plumbing for free.

`PulseV2Buffer::declutter` checks `buffered_axis()`:
- exactly one buffered axis: replace with `Delay::new_typed(input_fact,
  axis, delay=0, overlap=lookback)`.
- zero buffered axes: shunt entirely (it's an identity wrapper).
- multi-axis lookback: keep PulseV2Buffer (no multi-axis Delay yet).
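The three-way decision can be sketched as a match on the buffered axes (illustrative types; tract's real `Delay` op is modelled as a plain variant here):

```rust
// Declutter decision for a buffer given its per-axis lookbacks.

enum Lowered {
    Delay { axis: usize, overlap: usize }, // lower to v1's Delay
    Shunt,                                 // identity wrapper, remove it
    Keep,                                  // no multi-axis Delay yet
}

/// `lookbacks[axis]` is the lookback of a hypothetical buffer on that axis.
fn declutter_buffer(lookbacks: &[usize]) -> Lowered {
    let buffered: Vec<usize> =
        (0..lookbacks.len()).filter(|&a| lookbacks[a] > 0).collect();
    match buffered.as_slice() {
        [] => Lowered::Shunt,
        [axis] => Lowered::Delay { axis: *axis, overlap: lookbacks[*axis] },
        _ => Lowered::Keep,
    }
}

fn main() {
    assert!(matches!(declutter_buffer(&[0, 0]), Lowered::Shunt));
    assert!(matches!(
        declutter_buffer(&[0, 2]),
        Lowered::Delay { axis: 1, overlap: 2 }
    ));
    assert!(matches!(declutter_buffer(&[1, 2]), Lowered::Keep));
}
```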

Wavenet topology comparison after this and the previous PulseV2Slice
shunt-on-declutter:

  | op       | v1 | v2 (pre-declutter) | v2 (post-declutter) |
  | EinSum   | 49 |                49  |                 49  |
  | Conv     | 49 |                49  |                 49  |
  | Add      | 47 |                47  |                 47  |
  | Mul      | 25 |                25  |                 25  |
  | Delay    | 25 |                 0  |                 25  |
  | Tanh     | 24 |                24  |                 24  |
  | Sigmoid  | 24 |                24  |                 24  |
  | AddN     | 23 |                23  |                 23  |
  | Relu     |  2 |                 2  |                  2  |
  | Max      |  2 |                 2  |                  2  |
  | (markers)|  0 | 49 PulseV2Buffer +  46 PulseV2Slice |   0 |

v1 and v2 (post-declutter) are now op-for-op identical on hey_snips.

7 v2 proptests still pass; hey_snips numerical match unchanged.
Under fixed-pulse semantics there are no negative streaming-axis dims to
clamp:
- PulseV2Buffer's output is `input + lookback` (constant positive).
- Conv-after-Buffer's output is `(input + lookback - (K-1)·D) / stride`
  which stays ≥ 0 by construction (lookback = (K-1)·D).
- No `min(P, max(0, …))` symbolic ramps anywhere.
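A quick arithmetic check of the non-negativity claim (hypothetical helper; the lookback is substituted directly, so the conv output length reduces to input/stride):

```rust
// With lookback = (K-1)*D, the conv-after-buffer output length is
// (input + lookback - (K-1)*D) / stride = input / stride, never negative.

fn conv_after_buffer_len(input: usize, k: usize, d: usize, stride: usize) -> usize {
    let lookback = (k - 1) * d;
    (input + lookback - (k - 1) * d) / stride
}

fn main() {
    assert_eq!(conv_after_buffer_len(4, 3, 1, 1), 4);
    assert_eq!(conv_after_buffer_len(0, 3, 2, 1), 0); // even an empty pulse is fine
    assert_eq!(conv_after_buffer_len(4, 3, 1, 2), 2);
}
```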

So the clamp helper never fires in practice; its call site is a no-op,
and the function itself probes for symbol names like `H` and `S` that we
no longer generate. Delete both.

The variable-pulse-era `Min(T·P, lookback)` ramp commit
(ab77d8669) and the H-symbol commit (2541a12a3) and the sink-only-clamp
commit (1179a384a) are still in the foundation history (the cherry-pick
chain that built up to fixed-pulse), but their mechanisms are now
genuinely dead. This commit removes the last in-tree reference to them
in v2.rs. v2_pad.rs still uses the variable-pulse formula but that's a
known follow-up (Pad rewrite) — not in this commit's scope.

7 v2 proptests still pass; hey_snips numerical match unchanged; both
blowup tests stay flat (skip_residual_chain_grows scales linearly,
dilated_chain_grows shape is literally `1,1,P` at n=30).