Encoder pulsification#2097

Open
kali wants to merge 58 commits into main from encoder-pulsification

Conversation

@kali
Collaborator

@kali kali commented Apr 2, 2026

No description provided.

kali added 30 commits April 8, 2026 15:18
…timised away

ChangeAxes legitimately squeezes singleton batch/stream dims from intermediate
EinSum outputs during pulse-declutter. When that happens, slicing along the
expected output_axis no longer makes sense. Add a streaming_axis_ok guard in
handle_stream that skips such nodes instead of crashing or producing garbage.

Also add harness/sdpa-pulse/block-l-eq-p: block-diagonal bidirectional attention
(L=P=2, left_chunks=0) as a first pulsification smoke-test. Uses
--allow-random-input so both handle_stream and compare() share the same
fixed-seed input.
When a slice has symbolic begin/end (e.g. end=S on a [S+1,...] tensor),
the min()-clamping graph ops in the NNEF slice deserializer produced
min(S,S+1)-min(0,S+1) for the DynSlice `len` instead of S.  This caused
concat shape mismatches at model-build and declutter time.

Fix: extract begin[ix]/end[ix] as TDim directly from the original Const
tensors (handling i32, i64, and TDim datum types) and use those as the
primary `len` computation, bypassing the clamping imprecision.  Also
replace the b/e outlets with fresh simplified Const nodes so DynSlice
declutter sees the correct values.

Add harness/sdpa-pulse/block-left-1: block attention with left_chunks=1.
Each chunk c attends to its own tokens and the previous chunk's tokens
(k_prev via pad(before=1)+slice(end=S) which pulsifies to Delay(1,0)).
Streaming compare passes: 16 nodes.
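The pad+slice-to-Delay equivalence above can be sketched on a plain vector of chunks (this is illustrative Rust, not tract code; `pad_then_slice` and `delay_1_0` are hypothetical names):

```rust
// Model pad(before=1) + slice(end=S) along the chunk axis, and check it
// equals a one-chunk delay with a zero startup chunk — the rewrite
// Delay(1, 0) performs at pulse time.
fn pad_then_slice(chunks: &[i32]) -> Vec<i32> {
    // pad(before=1) with a zero, then slice(end=S): keep the first S entries
    let mut padded = vec![0];
    padded.extend_from_slice(chunks);
    padded[..chunks.len()].to_vec()
}

fn delay_1_0(chunks: &[i32]) -> Vec<i32> {
    // Delay(1, 0): a zero startup chunk, then each chunk one step late
    let mut out = vec![0];
    out.extend_from_slice(&chunks[..chunks.len() - 1]);
    out
}

fn main() {
    let s = [10, 20, 30, 40];
    assert_eq!(pad_then_slice(&s), delay_1_0(&s)); // both [0, 10, 20, 30]
}
```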
The slice deserializer's min()-clamping sub-graph couldn't prove
min(0, S+1) = 0 because S had no lower-bound constraint.
Adding `extension tract_assert S>=0` to the NNEF gives the TDim
prover the information it needs, making the workaround unnecessary.

Revert the get_tdim_at / begin_ix_tdim / end_ix_tdim patch from
nnef/src/ops/nnef/deser.rs and keep the fix where it belongs: in
the NNEF file that introduces the symbol.
block-l-eq-p-mask: block-diagonal attention with an external all-true
boolean mask fed through select+softmax. Tests that Iff+softmax
pulsifies correctly. Batch passes; streaming compare is blocked by
handle_stream only supporting single-input models.

block-left-1-mask: flat-token sliding-window attention (T=8, P=2,
left_chunks=1, Dh=4) with the attention mask computed entirely inside
the graph, adapted from the real Nemotron encoder mask construction:
  range -> cast -> div/floor -> cast -> unsqueeze -> sub -> le/ge -> and
The 'length' external input is replaced by shape_of(qkv)[0]. No padMask.

Batch reference passes. Streaming compare fails with "Undetermined
symbol (range)#N": in the pulsified model T=shape_of(qkv)[0] remains
the streaming symbol S, Range creates a fresh symbol for its output
length that the pulsifier can't resolve. This is the expected failure
that drives the next implementation step: FoldUniformMask needs to
bound the attention window before pulsification, so that the Range
sub-graph in the mask computation is eliminated and the K/V lookback
is bounded.
…lsifiers

FoldWindowAttention (new optimizer pass, fires in declutter before
FoldUniformMask) detects the pattern:

  Iff(chunk_window_mask, scores) → Softmax → EinSum(v)

where the condition wire carries a uniform_tdim encoding a 2-D chunk-window
predicate (0 ≤ floor(i/P) − floor(j/P) ≤ L), and rewrites it to bounded-
window chunk-layout attention equivalent to block-left-L:

  q_c [C,P,D] = reshape(q)
  k_ctx [C,(L+1)P,D] = concat of lagged copies via Pad+Slice
  scores_c = einsum("cpd,cld->cpl", q_c, k_ctx)
  attn_c   = softmax(scores_c, axes=[2])
  out_c    = einsum("cpl,cld->cpd", attn_c, v_ctx)
  output   = reshape(out_c)

This eliminates the T×T attention and the Range/mask subgraph before
pulsification, replacing them with a bounded K/V context the pulsifier can
handle via Delay ops.
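The chunk-window predicate driving this rewrite can be checked with a small sketch (illustrative Rust, not the tract implementation): `0 ≤ floor(i/P) − floor(j/P) ≤ L` is exactly the block-left-L pattern, where each chunk attends to itself and the L previous chunks.

```rust
// Chunk-window predicate: query i may attend to key j when the chunk
// distance floor(i/P) - floor(j/P) lies in [0, L].
fn in_window(i: usize, j: usize, p: usize, l: usize) -> bool {
    let d = (i / p) as i64 - (j / p) as i64;
    0 <= d && d <= l as i64
}

fn main() {
    let (p, l, t) = (2usize, 1usize, 8usize);
    // token 4 lives in chunk 2, so it sees chunks 1 and 2, i.e. keys 2..=5
    let visible: Vec<usize> = (0..t).filter(|&j| in_window(4, j, p, l)).collect();
    assert_eq!(visible, vec![2, 3, 4, 5]);
}
```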

classify_chunk_window (core/src/ops/logic.rs) recognises the TDim expression
produced by uniform_tdim propagation through floor(i/P)−floor(j/P) ≤ L.

PulsedTokenFold / PulsedTokenUnfold (pulse/src/ops/array/reshape.rs) are new
pulsifiers for AxisOp::Reshape registered via the OpPulsifier inventory:
- fold   [T,D]→[C,P,D], pulse=P  →  Add(0) at pulse time, stream axis=0 dim=C
- unfold [C,P,D]→[T,D], pulse=1  →  Rm(0)  at pulse time, stream axis=0 dim=T

change_shape_array gains a fallback branch for symbolically-equivalent-but-
structurally-different from/to specs (e.g. S vs P*(S/P)).

harness/sdpa-pulse/block-left-1-mask now passes batch and streaming (20/20).
gen-inputs.py updated to use zero-padded K/V context matching FoldWindowAttention
semantics (startup chunk attended by zeros, consistent with Delay discarding).
REVISIT.md documents five items to generalise before the real encoder lands.
One diagram per harness stage (batch + pulsed where applicable).
block-left-1-mask gets three: original batch, after FoldWindowAttention
(mask subgraph eliminated), and final pulsed form.
…ation infrastructure

Adds `ex05-block-left-1-posenc`: flat-token windowed attention (left_chunks=1, P=2)
with an ALiBi relative-position bias added to scores before masking.
FoldWindowAttention does not fire (Iff true-branch is Add(EinSum, pos_bias),
not a bare EinSum), so this exercises a new ROI-based pulsification path.

New optimizer passes and ops:
- `UniformTDim` op (core): materialise a TDim boolean expression as a tensor at
  runtime; used for the chunk-window mask.
- `FoldUniformTDim` pass (core): replace producer subgraphs of bool wires that
  carry `uniform_tdim` metadata with a single `UniformTDim` node.  Only bool
  wires are folded; f32 wires retain `uniform_tdim` as metadata for later passes.
- `PropagateRoi` extended: now walks backward through `TypedBinOp` chains,
  annotating every wire with `region_of_interest`.  This lets the binary pulsifier
  and the ROI-aware EinSum pulsifier fire for the pos-bias subgraph.

New pulsifiers (pulse crate):
- `binary.rs` — `TypedBinOp` pulsifier: when a wire has both ROI and `uniform_tdim`
  (e.g. `rel_pos = sub(row_pos, col_pos)`), evaluates the integer coordinate
  expression at window-local positions to produce a constant `[P, (L+1)*P]` tensor.
- `mask.rs` — `Iff` pulsifier: when the condition wire carries a chunk-window
  `uniform_tdim`, elides the Iff and passes the true branch through directly.
  This avoids a shape-inference problem where the fill's symbolic `[S,S]`
  MultiBroadcastTo would cause downstream ops to inherit an unresolvable shape.
- `einsum.rs` — two-case EinSum pulsifier:
  - **QK case** (existing, improved): uses `PulseWrappingOp` so Q's streaming axis
    propagates to the scores output, enabling downstream ops to stream correctly.
  - **AV case** (new): detects the AV pattern (streaming V on a contracted axis),
    applies `Delay(overlap=L*P)` to V, then uses `PulseWrappingOp` for the output.
- `uniform_tdim.rs` — `UniformTDim` pulsifier: produces an all-true `[P, (L+1)*P]`
  constant (within the ROI window the mask is always satisfied).
- `mod.rs` — registers `binary` and `mask` modules in `register_all_mod!`.

Harness reorganisation: renamed `block-*` → `ex0N-block-*` for natural ordering.

Streaming comparison for ex05 is deferred (requires FoldWindowAttention to handle
the biased-scores pattern and switch to zero-padded startup semantics).
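Why the `TypedBinOp` pulsifier can emit a *constant* `[P, (L+1)*P]` tensor for `rel_pos = sub(row_pos, col_pos)`: within a context window of `(L+1)*P` keys ending at the current chunk, the absolute chunk position cancels in the subtraction. A sketch (illustrative Rust; `rel_pos_window` is a hypothetical name):

```rust
// Evaluate the integer coordinate expression at window-local positions.
// Global row index = base + L*P + row, global col index = base + c;
// the base cancels, so the result is the same constant every pulse.
fn rel_pos_window(p: usize, l: usize) -> Vec<Vec<i64>> {
    let kw = (l + 1) * p;
    (0..p)
        .map(|row| (0..kw).map(|c| (l * p + row) as i64 - c as i64).collect())
        .collect()
}

fn main() {
    let m = rel_pos_window(2, 1); // [P=2, (L+1)*P=4]
    assert_eq!(m[0], vec![2, 1, 0, -1]);
    assert_eq!(m[1], vec![3, 2, 1, 0]);
}
```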
FoldWindowAttention detected Iff(chunk_window_mask)→Softmax→EinSum and
rewrote the T×T attention to bounded-window chunk-layout before pulsification.
This pass mixed attention-universe rewriting with pulse-universe machinery, and it proved unnecessary: ex04 and ex05 already stream correctly via ChunkWindowMask + the binary pulsifier acting on the flat attention graph directly.

Changes:
- Delete core/src/optim/fold_window_attention.rs and remove from declutter()
- compare --stream: round stream_dim up to multiple of input_pulse to avoid
  partial NaN-padded last pulse propagating through K_ctx
- compare --stream: skip structurally incompatible intermediates (yellow)
  rather than failing when pulsed shapes differ from reference (e.g. [P,kw]
  vs [S,S] for windowed attention)
- delay.rs: use zero_dt for startup buffer (NaN workaround; root cause
  documented in REVISIT item 8: pulsify_qk does not propagate K delay)
- ex04 gen-inputs.py: switch to -inf masking (natural batch graph semantics),
  removing the zero-padded startup semantics that matched FoldWindowAttention
- ex05 ci.sh/gen-inputs.py/graph.nnef: remove stale FoldWindowAttention notes
- README/REVISIT/doc d2: update to reflect current approach

All four streaming harnesses pass: ex01, ex03, ex04, ex05.
ChunkWindowMask always produced a rank-2 [P, key_window] mask, but
Iff.output_facts requires all three inputs to have identical rank.
For rank-4 attention [B, H, P, kw], the mask and fill were rank-2,
causing a rank-mismatch error at pulsification time.

Fix (pulse/src/ops/mask.rs): after wiring ChunkWindowMask, add
NonPulsingWrappingOp(AxisOp::Add(0)) for each leading dimension needed
to promote the mask from [P,kw] to [1,...,1,P,kw].  Update the fill
tensor shape from [1,1] to [1;true_rank] for the same reason.

The EinSum pulsifiers (pulsify_qk, pulsify_av) already handle arbitrary
ranks — they use the axis indices from classify_chunk_window (which
extracts the actual row/col axes from the uniform_tdim coord symbols
after remapping through unsqueeze) and the streaming fact axis fields.

New harness ex06-batch-multihead: B=1, H=2, T=8, P=2, left_chunks=1,
Dh=4; qkv [1,2,S,12] streaming on axis 2.  Batch and streaming both
pass.
…ssions

Add ex07-block-left-1-chunkpos harness: sliding-window attention with
a chunk-level relative-position bias

  v_bias[i,j] = slope * (floor(i/P) - floor(j/P))   slope = -0.5

This is the Transformer-XL v-bias concept (constant additive position
term) expressed as a direct arithmetic formula.

At pulse time the `chunk_diff` wire carries
  uniform_tdim = Div(🎯0, 2) - Div(🎯1, 2)
which the binary pulsifier evaluates at steady-state coordinates,
exercising integer-division inside a TDim coordinate expression —
new machinery not covered by ex05's linear (i-j) bias.
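The ex07 bias formula can be sketched directly (illustrative Rust; `v_bias` is a hypothetical name, the numbers mirror the harness parameters):

```rust
// Chunk-level relative-position bias:
// v_bias[i][j] = slope * (floor(i/P) - floor(j/P)), with slope = -0.5.
fn v_bias(t: usize, p: usize, slope: f32) -> Vec<Vec<f32>> {
    (0..t)
        .map(|i| (0..t).map(|j| slope * ((i / p) as f32 - (j / p) as f32)).collect())
        .collect()
}

fn main() {
    let b = v_bias(6, 2, -0.5);
    assert_eq!(b[4][0], -1.0); // chunk 2 vs chunk 0: -0.5 * 2
    assert_eq!(b[4][4], 0.0);  // same chunk: no bias
}
```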

Also add REVISIT item 9 documenting what would be needed to handle
the full Transformer-XL Q@R content-to-position term (Pad+Reshape+Slice
skew currently breaks uniform_tdim propagation).
…rces and multi-axis pad

MultiBroadcastTo pulsifier was computing per-pulse output size by substituting
S→pulse directly into op.shape.  For shapes derived from shape_of(strided conv),
e.g. 1+S/2, this yielded 1+P/2 instead of the correct P/2 — one frame too many.

Root cause: the batch formula includes a constant boundary term (frames produced
before any real input arrives due to pre-padding) that should not contribute to
the per-pulse slot.  Fix: use substitute(S→P) - substitute(S→0) on the streaming
axis, which strips the constant offset and leaves only the linear per-pulse part.

  shape_of(stride-2 conv) = 1 + S/2
    old: substitute(S→4)                    = 3   (wrong)
    new: substitute(S→4) - substitute(S→0)  = 2   (correct)
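The substitution trick above can be sketched as (illustrative Rust, not tract's TDim machinery; `shape_formula` and `per_pulse` are hypothetical names):

```rust
// Batch shape formula for the streaming axis: 1 + S/2. The constant "1"
// is a boundary term (frames produced from pre-padding alone) and must
// not be counted once per pulse.
fn shape_formula(s: i64) -> i64 {
    1 + s / 2
}

// substitute(S→P) - substitute(S→0): strips the constant offset, leaving
// only the linear per-pulse part.
fn per_pulse(pulse: i64) -> i64 {
    shape_formula(pulse) - shape_formula(0)
}

fn main() {
    assert_eq!(shape_formula(4), 3); // naive substitution: one frame too many
    assert_eq!(per_pulse(4), 2);     // correct per-pulse frame count
}
```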

Also included:
- pulse/source.rs: allow non-streaming model inputs (e.g. a "length" scalar)
  to pass through pulsification as static sources, provided at least one other
  model input does carry the streaming symbol.
- pulse/pad.rs: support Pad ops that have constant padding on non-streaming axes
  in addition to streaming-axis padding (apply PulsePad for the streaming axis,
  then a plain Pad for the remaining axes).

Regression test: harness/nnef-test-cases/conv-then-shape-of-mask reproduces the
Nemotron encoder pattern (stride-2 conv + tract_core_broadcast with batch-formula
shape), and verifies that batch and streaming modes agree.
…eam symbol

When a static table (e.g. a PE table [9999,D]) is sliced to a length derived
from the streaming symbol — e.g. Slice{axis=0, start=0, end=S} — the Slice
pulsifier previously crashed with "Unexpected streamless fact".

Fix: detect non-streaming input, substitute S→pulse in start/end, and wire a
concrete-size Slice via NonPulsingWrappingOp.  The output is static per pulse
(same PE entries every chunk), and the downstream binary op is handled by
PulseWrappingOp.

Add harness/nnef-test-cases/slice-of-static-with-streaming-size as a regression
test covering both batch and streaming compare modes.
…/O, uniform_tdim propagation

Key changes:

core/ops/logic.rs: classify_chunk_window now tolerates extra Mul factors
  (e.g. padding-validity conditions ANDed in); searches all ordered pairs.

core/ops/binary.rs: And propagates chunk-window uniform_tdim from one
  operand when the other has none (padding mask whose chain broke at
  the audio-length scalar).

data/dim/sym.rs: coord symbols (🎯k) are always non-negative; proves
  positivity without a known bound.

pulse/ops/binary.rs (Bool pulsifier): scan input outlets for
  chunk-window uniform_tdim when the stored output fact has none
  (FoldUniformTDim creates UniformTDim nodes but does not re-propagate
  uniform_tdim to downstream stored facts).  Use unwrap_or(1) for
  batch/undetermined symbols in non-window axes.

pulse/ops/mask.rs (Iff pulsifier): same fallback — scan inner_outlet's
  inputs when inner_fact.uniform_tdim is None.

pulse/src/model.rs (into_typed): relax "all inputs/outputs streaming"
  to "at least one streaming", allowing non-streaming auxiliary inputs
  (e.g. sequence-length scalar) and outputs (e.g. encoded_lengths).
  Use -1 sentinel axis for non-streaming I/O; use 0 delay for
  non-streaming outputs.

pulse/ops/array/reshape.rs, slice.rs, uniform_tdim.rs: formatting only.
…s tests

ex08: adds batch dim to ex04's flat T×T masked attention (scores [B,S,S]).
      Passes — establishes that PropagateRoi works through the batch axis.

ex09: adds head dimension H=2 (scores [B,H,S,S], mask broadcast [B,1,S,S]).
      Passes — batch and head axes are transparent to ROI propagation.

ex10: inverted Iff convention: select(~window_mask, -inf, scores).
      Scores are at inputs[2] (false-branch), not inputs[1] (true-branch).
      Batch run passes; streaming fails, exposing two gaps that must be
      fixed before the encoder pulsifies:
        1. PropagateRoi annotates only inputs[1]; must also handle inputs[2].
        2. UniformTDim pulsifier cannot handle `1 + -1*cw` (negated expr).
…vention

Some models (e.g. Nemotron encoder) use select(~window_mask, fill, scores)
where condition=True means "masked out".  Scores are in the false-branch
(inputs[2]) rather than the true-branch (inputs[1]).

The condition's uniform_tdim is `1 + -1*cw` (NOT of the window mask).
Add classify_negated_chunk_window() in logic.rs to detect this form,
then use it in PropagateRoi.run_direct to annotate inputs[2] instead
of inputs[1] when the inverted convention is detected.

Unit test added for the inverted case.
Some models (e.g. Nemotron encoder) use the inverted masking convention
  select(~window_mask, fill, scores)
where condition=True means "masked out" and scores are in the false-branch.

Four operator-local fixes make this pulsify correctly:

1. logic.rs: add peel_negated_chunk_window_expr() to extract the positive
   chunk-window TDim from a negated expression `1 + -1*cw`.
   Add classify_negated_chunk_window() built on top of it.

2. propagate_roi.rs: detect the inverted convention by checking whether the
   condition's uniform_tdim is a negated chunk-window.  Annotate inputs[2]
   (false-branch = scores) with the *positive* CW expression as the ROI,
   so the QK EinSum pulsifier correctly inserts a windowed Delay on K.

3. uniform_tdim.rs: handle negated CW expressions by emitting an all-False
   constant mask (in-window → False for the inverted convention).

4. mask.rs (Iff pulsifier): FoldUniformTDim may replace not(window_mask)
   with a UniformTDim carrying the negated expression, so peel_condition()
   may not detect the inversion.  After classify_chunk_window fails on the
   inner expression, try classify_negated_chunk_window and XOR inverted.
   Also extend peel_condition to recognise ElementWiseOp(Not) (NNEF `not`)
   in addition to ElementWiseOp(BitNot).

ex10-batch-multihead-projections now passes both batch run and compare --stream.
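The negated-expression peel in fix 1 can be sketched with a toy expression type (illustrative Rust; `Expr` and `peel_negated` are hypothetical, not tract's TDim):

```rust
// Peel the positive chunk-window expression out of the negated boolean
// form `1 + -1*cw` produced by NOT of the window mask.
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Val(i64),
    Sym(&'static str),
    Add(Vec<Expr>),
    Mul(i64, Box<Expr>),
}

// Returns Some(cw) when expr has the shape 1 + (-1)*cw, None otherwise.
fn peel_negated(expr: &Expr) -> Option<Expr> {
    if let Expr::Add(terms) = expr {
        if terms.len() == 2 && terms[0] == Expr::Val(1) {
            if let Expr::Mul(-1, inner) = &terms[1] {
                return Some((**inner).clone());
            }
        }
    }
    None
}

fn main() {
    let cw = Expr::Sym("cw");
    let negated = Expr::Add(vec![Expr::Val(1), Expr::Mul(-1, Box::new(cw.clone()))]);
    assert_eq!(peel_negated(&negated), Some(cw.clone()));
    assert_eq!(peel_negated(&cw), None); // a bare chunk-window is not negated
}
```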
Add TypedOp::input_roi() — a per-input ROI annotation method that lets
ops declare which of their inputs should receive a chunk-window
region_of_interest derived from another input's uniform_tdim.

ScaledMaskedSoftmax implements input_roi: reads the mask input's
uniform_tdim (handling both standard and negated CW expressions) and
returns Some(roi) for the scores input, None for the mask input.

ScaledMaskedSoftmax also implements axes_mapping returning
AxesMapping::natural, enabling the generic PulseWrappingOp fallback to
track the streaming row axis through the op.

PropagateRoi::run_direct gains a second sub-loop that calls
op.input_roi(...) on every non-Iff node and propagates ROI up through
any Some(roi) slot.

Fix UniformTDim pulsifier for left_chunks > 0: instead of a constant
all-True tensor (which incorrectly treats zero-padded K positions as
valid during startup), emit a ChunkWindowMask stateful op followed by
AxisOp::Reshape to restore leading singleton dims. This makes the mask
correct at each chunk for both Iff-based and SMS-based attention.

Add ex11-batch-scaled-masked-softmax harness: exercises the full SMS
pulsification path (PropagateRoi via input_roi → pulsify_qk K delay →
SMS via PulseWrappingOp → pulsify_av V delay).

REVISIT item 10: unify Iff-specific PropagateRoi loop into Iff::input_roi.
…left_chunks>0)

PropagateRoi now propagates the chunk-window ROI backward through the full
skew-trick pos_scores chain (Add→Slice→Reshape→Slice→Reshape→Pad→EinSum→R).
Each operator contributes a TypedOp::input_roi hook:
- Slice, Pad, DynSlice: pass output ROI to input[0]
- AxisOp::Reshape: pass ROI with row/col axes swapped when the reshape touches
  both the row and col axes (the skew step)
- EinSum: annotate the key/position input (has col axis, not row axis)

The Slice pulsifier extends each slice's range by L*P in the direction that
matches the semantic role of the slice:
- Fixed-start slices (pos_sliced, pos_scores): add L*P to end
- Center-anchored slices (R extraction: start=center-S): subtract L*P from start

Verified: all 13 sdpa-pulse harness examples pass; propagate_roi unit tests pass.
Adds ex13-rel-pos-skew-window harness example (left_chunks=1, W=4, P=2).
…unds

Range pulsifier: fire when all inputs are non-streaming (static start/end/step)
but the output length contains the streaming symbol.  Wraps in NonPulsingWrappingOp
so the range tensor is re-evaluated each pulse from the concrete per-pulse bounds.

nnef range_load: compute symbolic length from constant start/end/step when possible,
preserving the expression (e.g. T_tokens) instead of a fresh free symbol.  This
prevents shape contamination during pulsification.

Also adds ex12-rel-pos-skew harness example (RPE skew trick, left_chunks=0).
…e facts forward pass

Three cooperating fixes enable the Nemotron encoder to pulsify correctly:

1. **DynSlice::output_facts**: propagate `uniform_tdim` from the input when `begin=0`.
   The mask computation chain passes through a DynSlice (e.g. attMaskSlice0), and
   `without_value()` was zeroing out `uniform_tdim`.  With begin=0 the slice result
   coordinates are identical to the source coordinates, so the predicate stays valid.

2. **PropagateRoi forward pass**: before the Iff/input_roi annotation loops, walk
   the graph in topological order and re-run `output_facts` for any node whose output
   lacks `uniform_tdim` but whose input has it.  This fixes stale facts left behind
   when patch-based passes (FoldUniformTDim) shunt a subgraph to a UniformTDim node:
   downstream nodes (Cast, And, BitNot, AddAxis) keep their old facts; this pass
   refreshes them so PropagateRoi sees the correct uniform_tdim on every wire.

3. **UniformTDim pulsifier**: change the output shape from `Vec<usize>` to `TVec<TDim>`
   so that symbolic batch dimensions (e.g. BATCH) are carried through as TDim::Sym
   rather than failing with "Undetermined symbol".  The ChunkWindowMask reshape uses
   the TDim shape directly; the constant-tensor fallback evaluates symbolic dims to 1.

Also: ScaledMaskedSoftmax::input_roi now builds a properly classified chunk-window ROI
(using build_chunk_window_roi) rather than returning the raw TDim expression, so the
ROI axes are shifted correctly when scores have extra leading (heads) dimensions.

Harness: nemotron encoder pulsification test added (pulse=112 = 14 chunks × 8x stride).
- Iff::input_roi: detect inverted convention (select(~mask, fill, scores))
  and annotate inputs[2] with the positive ROI expression instead of
  always annotating inputs[1]
- ScaledMaskedSoftmax: add axes_mapping (AxesMapping::natural) so the
  generic pulsifier can track the streaming axis through the op
- Slice/DynSlice pulsifiers: guard ROI-driven range extension against
  the input's actual size — if the upstream chain hasn't been expanded,
  fall through to the non-ROI path instead of producing out-of-bounds
  slices
- Remove duplicate old-signature input_roi implementations left over
  from the rebase (binary, pad, slice, change_axes, einsum, dyn_slice)
When the output wire has ROI but no uniform_tdim (e.g. Mul(rel_pos, -0.125)
where the float scaling breaks TDim propagation), walk upstream through
scalar-constant TypedBinOp nodes to find the nearest uniform_tdim.

After evaluating the integer coordinate expression, replay the scalar ops
(Mul, Add, etc.) in forward order to recover the actual output values
including any float scaling.

Fixes ex05 (ALiBi position bias) and ex07 (chunk-level position bias)
which regressed after rebasing onto main's systematic ROI propagation.
After ROI bubbles through Pad/Reshape, coordinate symbols in the
chunk-window expression may be offset (e.g. Div(🎯k+1, P) instead of
Div(🎯k, P)) and extra constant terms may appear in the diff.

Generalize extract_div_diff_axes to:
- Accept Div(Add(Sym, Val), P) in addition to Div(Sym, P)
- Ignore Val(_) constant terms in the Add

The offsets don't change P, L, or the axis assignment — they're
positional shifts from upstream coordinate transforms.

Fixes ex13 (skew trick with left_chunks=1) which failed because
the EinSum's input_roi couldn't classify the shifted ROI expression
and didn't propagate ROI to the R position table input.
When the K/R input to a QK EinSum is non-streaming, detect if it's a
computable constant (via try_compute_const for shared upstream chains)
and pre-slice it for the symmetric RPE case (feeds_into_nonlinear).

This restores the pre-rebase behavior where pulsify_qk pre-sliced the
r_pos table to [W+P-1, Dh] centered at the zero-relative-position entry.

For non-const, non-streaming K (e.g. DynSlice of r_full depending on S),
gracefully decline and let the generic pulsifier + ROI-aware DynSlice/Slice
pulsifiers handle it.

Fixes ex14-reduced-skew and ex14-rel-pos-skew-large-table.
When r_pos_proj can't be evaluated directly (symbolic shapes from streaming
symbol), substitute the streaming symbol with the pulse value and retry.
This handles the encoder pattern where posEnc[0:T] @ W_pos has T depending
on the streaming symbol.

Also tighten the non-streaming K input detection: require that the matched
input maps to the output col_axis but NOT the row_axis (same criterion as
EinSum::input_roi). This prevents matching V, biases, or weight tensors
that happen to have the col axis.
kali added 27 commits April 9, 2026 06:10
Slice only implements eval_with_session (uses session.resolved_symbols
to evaluate start/end). The plain eval returns "stateless evaluation
not implemented". After concretize_dims produces concrete TDim::Val
bounds, eval_with_session works with an empty TurnState.

Also use translate_model_with_mappings in try_compute_const_with_substitution
to correctly map outlet IDs when concretize_dims changes node topology.
…umers)

Reproduces the encoder pattern: DynSlice on posEnc with symbolic bounds,
concretize_symbols(BATCH=1) before pulse, and shared posEnc across
multiple linearPos EinSums.

Currently passes at small scale (T=8, D=8). The encoder's failure at
full scale (T=589, D=1024) is under investigation — pulsify_qk's
non-streaming K path isn't being reached for the encoder's pos_raw
EinSums despite correct ROI annotations.
…mall

When try_compute_const_with_substitution evaluates the posEnc chain
with one pulse worth of streaming symbol, the resulting symmetric RPE
table may be too small for the key window (t_max < key_window).

Re-evaluate with a larger symbol value (key_window * pulse) to get
enough rows for the pre-slice.  This handles the encoder pattern
where posEnc[center-T:center+T-1] produces 2*T-1 rows and T at
one pulse is much smaller than the needed window.
With left_chunks=5 (W=12), the key window exceeds the skew trick's
T=P-based intermediate shapes. The Slice pulsifiers can't extend
because the input is too narrow — same failure as the encoder.

ci-failing.sh uses left_chunks=5 (fails: broadcast 12 against 2).
The graph.nnef has left_chunks=5 for the failing case; the passing
version (left_chunks=1) can be restored in a ci.sh alongside.
DynSlice RPE + skew trick + BATCH + concretize_symbols + P=4/pulse=4.
Broadcast error: W=16 against P=4 at Add(content_scores, pos_scores).

Works with P=2/pulse=2 but fails at P=4/pulse=4.
Same pattern as the encoder (P=14, left_chunks=5, W=84).
Restore AV EinSum (output must be linear in S, not quadratic).
Increase T to 16 so T/P = 4 ≥ L+1 = 4.
Enlarge r_full to [31, 4] (= 2*T-1).

Both batch and pulsed runs pass.
Rename ci-failing.sh to ci.sh since this is now a passing test.
Add stride-2 subsampling so the token dimension T = S/2 differs from
the audio streaming symbol S.  Pulse on S with pulse=4 gives T=2 tokens
per pulse, but W=(3+1)*2=8.  The skew trick Reshapes use T=2, producing
intermediates too narrow for W=8.  Broadcast error: 8 against 2.

Same failure as encoder.p1: AUDIO_SIGNAL__TIME/8 ≠ AUDIO_SIGNAL__TIME.

Also fix panic in PropagateRoi when a node doesn't implement TypedOp.
Both --set (CLI flag) and concretize_symbols (transform config) now
accept any TDim expression as a value, not just integers.

  --set T=S/2                          # symbolic substitution
  --set BATCH=1                        # integer (backward compat)
  -t 'concretize_symbols(values: {"BATCH": 1})'    # RON integer
  -t 'concretize_symbols(values: {"T": "S/2"})'    # RON string

Integer values go through the existing SymbolValues/concretize_dims
path. Symbolic TDim values are applied first via TDim::substitute on
all model facts, then the integer values are concretized.

SymbolValues gains set_tdim/get_tdim/tdim_iter for symbolic entries.
ConcretizeSymbolsConfig uses a custom deserializer (StringOrInt) to
accept both RON integers and strings in the values map.
The relative-position "skew trick" (Pad→Reshape→Slice→Reshape→Slice)
converts relative position scores [T, 2T-1] to absolute [T, T] via
intermediate reshapes whose shapes depend on the full sequence length T.
These intermediates cannot be pulsified individually because they create
whole-sequence dependencies.

The composed mapping is a clean diagonal gather:
  pos_scores[i, k] = pos_raw[i, (T-1) + k - i]

DiagGather captures this as a single TypedOp in the pulse crate.
A pre-pulsification fold pass matches the 5-op chain and replaces it;
the pulsifier then trivially produces [P, W] output per pulse using
offset = P_local - 1.
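The composed mapping can be sketched as a plain gather (illustrative Rust; `diag_gather` here is a toy model of the new op, not its tract signature):

```rust
// Skew-trick composition as a single diagonal gather:
// pos_scores[i][k] = pos_raw[i][(T-1) + k - i].
// With pos_raw[i][r] encoding relative position r - (T-1), the gathered
// score at [i][k] is exactly the relative position k - i.
fn diag_gather(pos_raw: &[Vec<i64>], t: usize) -> Vec<Vec<i64>> {
    (0..t)
        .map(|i| (0..t).map(|k| pos_raw[i][(t - 1) + k - i]).collect())
        .collect()
}

fn main() {
    let t: usize = 4;
    // pos_raw is [T, 2T-1]; column r encodes relative position r - (T-1)
    let pos_raw: Vec<Vec<i64>> = (0..t)
        .map(|_| (0..2 * t - 1).map(|r| r as i64 - (t as i64 - 1)).collect())
        .collect();
    let scores = diag_gather(&pos_raw, t);
    assert_eq!(scores[2][0], -2); // key 0 relative to query 2
    assert_eq!(scores[2][2], 0);  // zero relative position on the diagonal
}
```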

Also in this commit:
- --set accepts TDim expressions (e.g. --set S=2*s) for symbol substitution
- --set runs before --pulse so substitutions take effect pre-pulsification
- TDim::eval checks tdim_values for symbolic substitutions
- PropagateRoi: simplify ROI demands, skip trivial ROI=1
- Pre-flight check warns about superlinear wires missing ROI
- PulsedFact evaluates streaming dim expression with pulse symbol
- ex16 test case: double stride-2 subsampling + skew trick
- README: pulsification semantics (increment, hull, contract, ROI, delay)
End-to-end streaming demo for nvidia/nemotron-speech-streaming-en-0.6b:
- Mic thread sends 80-sample chunks (5ms) simulating real-time audio
- Main thread accumulates audio, runs preprocessor, feeds encoder pulses
  (112 audio frames = 14 transformer tokens), then greedy RNNT decode
- Progressive transcript output with model timing tags
- Encoder uses DiagGather for the skew trick, pulsified at 112 frames
1. Buffer initialization: CPU Delay uses zero_dt (zero-filled) but
   GpuDelay used uninitialized_dt. On the first pulse, from_buffer
   elements were copied from uninitialized GPU memory to the output.
   Fix: initialize with Tensor::zero_dt(...).into_device().

2. Overlapping buffer shift: when buffered >= input_pulse, the buffer
   shift used flat_copy with overlapping source and destination regions
   within the same GPU buffer. CUDA memcpy is undefined for overlapping
   regions (parallel threads). Fix: copy via a temporary buffer using
   assign_slice which operates on distinct tensors.

A third issue remains: GpuPulsePad crashes with CUDA_ERROR_ILLEGAL_ADDRESS
at layer 3's depthwise conv, despite identical ops succeeding in layers
0-2. The crash is from a pending GPU error detected at PulsePad's first
memory access, suggesting a kernel in the attention pipeline corrupts
memory. The exact source is not yet identified — likely a memory pool
aliasing issue or a CUDA kernel bug for specific tensor shapes.
flat_copy passed byte_len as the shape to copy_nd, but the CUDA kernel
dispatched based on datum_type operates on typed elements (u32 for f32).
For f32 tensors, this caused the kernel to access 4x the intended range,
corrupting GPU memory.

Fix: divide byte_len by element size before passing to copy_nd. Also
validate alignment of offsets and lengths.
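The unit mismatch can be sketched as follows (illustrative Rust; `elements_to_copy` is a hypothetical helper, not the actual kernel interface):

```rust
// A copy kernel dispatched on the datum type counts typed elements, not
// bytes, so the length handed to it must be byte_len / element size,
// with an alignment check.
fn elements_to_copy(byte_len: usize, elem_size: usize) -> Result<usize, String> {
    if byte_len % elem_size != 0 {
        return Err(format!(
            "byte length {} not aligned to element size {}",
            byte_len, elem_size
        ));
    }
    Ok(byte_len / elem_size)
}

fn main() {
    // 16 bytes of f32 is 4 elements; passing 16 directly would make the
    // kernel touch 4x the intended range
    assert_eq!(elements_to_copy(16, std::mem::size_of::<f32>()), Ok(4));
    assert!(elements_to_copy(10, 4).is_err());
}
```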

This was the root cause of the CUDA_ERROR_ILLEGAL_ADDRESS crash when
running the pulsified Nemotron encoder on GPU.

Also: enable CUDA runtime in the streaming ASR example.
Pulsify the preprocessor (17920 samples/pulse = 112 feature frames)
instead of re-running on accumulated audio. Buffer feature frames
between preprocessor and encoder to handle the 3-frame delay offset.

Pipeline per pulse: preproc ~10ms + encoder ~25ms = ~35ms for 1.12s
of audio (32x real-time on CUDA).

Bonus: the pulsified preprocessor produces slightly different output
at boundaries, and the transcript now matches the batch reference
exactly (including "eyes. It" with period).
- Carriage return display: transcript grows in-place on one line
- [pre] [enc] [jnt] [dec] labels show which model is running
- Trailing spaces mask previous longer labels
- Stats summary at end: per-model timings and real-time ratio
- 22.9x real-time compute on CUDA (325ms for 7.43s audio)
- Log "Loading <model> to <runtime>... done." for each model
- Log "Ready (Xs)" when all models loaded
- Reduce preprocessor pulse from 17920 to 1600 samples (~100ms)
  to avoid cumulative latency; feature buffer bridges to encoder
Split the monolithic main() into two structures:
- NemotronModels: shared read-only context (runnables, vocab, pulse
  metadata). Constructed once via NemotronModels::load().
- StreamState: mutable per-session state (model states, audio/feature
  buffers, decoder RNNT state, stats). Methods replace macros:
  push_audio(), flush(), run_preproc(), feed_features(),
  run_encoder_pulse(), decode_frame().

main() is now: load models, create state, pump audio, flush, print.
- `--live` flag captures from the system microphone at 16kHz mono
- Requires the `live` feature: `cargo run --features live -- --live`
- cpal is an optional dependency behind the `live` feature gate
- WAV file mode remains the default (no extra dependencies)
- `--no-realtime` flag for WAV mode to process as fast as possible
@kali kali force-pushed the encoder-pulsification branch from bc8f78d to 4d0044e on April 10, 2026 at 08:39
- Add encoder pulsified end-to-end run test (not just dump -q).
  Uses concretize BATCH=1, patch length, select outputs, then pulse.
  Runs without output assertion (tail mismatch from partial last pulse
  is expected; batch test covers correctness).
- Add preprocessor pulsification test with small pulse (1600 samples,
  ~100ms) in addition to existing 4800.
- Remove stale ci-failing.sh from ex16 (now passes with DiagGather).