Encoder pulsification#2097

Open
kali wants to merge 58 commits into main from encoder-pulsification

Conversation

@kali
Collaborator

@kali kali commented Apr 2, 2026

No description provided.

kali added 30 commits April 8, 2026 15:18
…timised away

ChangeAxes legitimately squeezes singleton batch/stream dims from intermediate
EinSum outputs during pulse-declutter. When that happens, slicing along the
expected output_axis no longer makes sense. Add a streaming_axis_ok guard in
handle_stream that skips such nodes instead of crashing or producing garbage.

Also add harness/sdpa-pulse/block-l-eq-p: block-diagonal bidirectional attention
(L=P=2, left_chunks=0) as a first pulsification smoke-test. Uses
--allow-random-input so both handle_stream and compare() share the same
fixed-seed input.
When a slice has symbolic begin/end (e.g. end=S on a [S+1,...] tensor),
the min()-clamping graph ops in the NNEF slice deserializer produced
min(S,S+1)-min(0,S+1) for the DynSlice `len` instead of S.  This caused
concat shape mismatches at model-build and declutter time.

Fix: extract begin[ix]/end[ix] as TDim directly from the original Const
tensors (handling i32, i64, and TDim datum types) and use those as the
primary `len` computation, bypassing the clamping imprecision.  Also
replace the b/e outlets with fresh simplified Const nodes so DynSlice
declutter sees the correct values.

Add harness/sdpa-pulse/block-left-1: block attention with left_chunks=1.
Each chunk c attends to its own tokens and the previous chunk's tokens
(k_prev via pad(before=1)+slice(end=S) which pulsifies to Delay(1,0)).
Streaming compare passes: 16 nodes.
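The pad+slice-to-Delay equivalence above can be sketched on a plain vector of chunks (this is illustrative Rust, not tract code; `pad_then_slice` and `delay_1_0` are hypothetical names):

```rust
// Model pad(before=1) + slice(end=S) along the chunk axis, and check it
// equals a one-chunk delay with a zero startup chunk — the rewrite
// Delay(1, 0) performs at pulse time.
fn pad_then_slice(chunks: &[i32]) -> Vec<i32> {
    // pad(before=1) with a zero, then slice(end=S): keep the first S entries
    let mut padded = vec![0];
    padded.extend_from_slice(chunks);
    padded[..chunks.len()].to_vec()
}

fn delay_1_0(chunks: &[i32]) -> Vec<i32> {
    // Delay(1, 0): a zero startup chunk, then each chunk one step late
    let mut out = vec![0];
    out.extend_from_slice(&chunks[..chunks.len() - 1]);
    out
}

fn main() {
    let s = [10, 20, 30, 40];
    assert_eq!(pad_then_slice(&s), delay_1_0(&s)); // both [0, 10, 20, 30]
}
```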
The slice deserializer's min()-clamping sub-graph couldn't prove
min(0, S+1) = 0 because S had no lower-bound constraint.
Adding `extension tract_assert S>=0` to the NNEF gives the TDim
prover the information it needs, making the workaround unnecessary.

Revert the get_tdim_at / begin_ix_tdim / end_ix_tdim patch from
nnef/src/ops/nnef/deser.rs and keep the fix where it belongs: in
the NNEF file that introduces the symbol.
block-l-eq-p-mask: block-diagonal attention with an external all-true
boolean mask fed through select+softmax. Tests that Iff+softmax
pulsifies correctly. Batch passes; streaming compare is blocked by
handle_stream only supporting single-input models.

block-left-1-mask: flat-token sliding-window attention (T=8, P=2,
left_chunks=1, Dh=4) with the attention mask computed entirely inside
the graph, adapted from the real Nemotron encoder mask construction:
  range -> cast -> div/floor -> cast -> unsqueeze -> sub -> le/ge -> and
The 'length' external input is replaced by shape_of(qkv)[0]. No padMask.

Batch reference passes. Streaming compare fails with "Undetermined
symbol (range)#N": in the pulsified model T=shape_of(qkv)[0] remains
the streaming symbol S, Range creates a fresh symbol for its output
length that the pulsifier can't resolve. This is the expected failure
that drives the next implementation step: FoldUniformMask needs to
bound the attention window before pulsification, so that the Range
sub-graph in the mask computation is eliminated and the K/V lookback
is bounded.
…lsifiers

FoldWindowAttention (new optimizer pass, fires in declutter before
FoldUniformMask) detects the pattern:

  Iff(chunk_window_mask, scores) → Softmax → EinSum(v)

where the condition wire carries a uniform_tdim encoding a 2-D chunk-window
predicate (0 ≤ floor(i/P) − floor(j/P) ≤ L), and rewrites it to bounded-
window chunk-layout attention equivalent to block-left-L:

  q_c [C,P,D] = reshape(q)
  k_ctx [C,(L+1)P,D] = concat of lagged copies via Pad+Slice
  scores_c = einsum("cpd,cld->cpl", q_c, k_ctx)
  attn_c   = softmax(scores_c, axes=[2])
  out_c    = einsum("cpl,cld->cpd", attn_c, v_ctx)
  output   = reshape(out_c)

This eliminates the T×T attention and the Range/mask subgraph before
pulsification, replacing them with a bounded K/V context the pulsifier can
handle via Delay ops.
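The chunk-window predicate driving this rewrite can be checked with a small sketch (illustrative Rust, not the tract implementation): `0 ≤ floor(i/P) − floor(j/P) ≤ L` is exactly the block-left-L pattern, where each chunk attends to itself and the L previous chunks.

```rust
// Chunk-window predicate: query i may attend to key j when the chunk
// distance floor(i/P) - floor(j/P) lies in [0, L].
fn in_window(i: usize, j: usize, p: usize, l: usize) -> bool {
    let d = (i / p) as i64 - (j / p) as i64;
    0 <= d && d <= l as i64
}

fn main() {
    let (p, l, t) = (2usize, 1usize, 8usize);
    // token 4 lives in chunk 2, so it sees chunks 1 and 2, i.e. keys 2..=5
    let visible: Vec<usize> = (0..t).filter(|&j| in_window(4, j, p, l)).collect();
    assert_eq!(visible, vec![2, 3, 4, 5]);
}
```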

classify_chunk_window (core/src/ops/logic.rs) recognises the TDim expression
produced by uniform_tdim propagation through floor(i/P)−floor(j/P) ≤ L.

PulsedTokenFold / PulsedTokenUnfold (pulse/src/ops/array/reshape.rs) are new
pulsifiers for AxisOp::Reshape registered via the OpPulsifier inventory:
- fold   [T,D]→[C,P,D], pulse=P  →  Add(0) at pulse time, stream axis=0 dim=C
- unfold [C,P,D]→[T,D], pulse=1  →  Rm(0)  at pulse time, stream axis=0 dim=T

change_shape_array gains a fallback branch for symbolically-equivalent-but-
structurally-different from/to specs (e.g. S vs P*(S/P)).

harness/sdpa-pulse/block-left-1-mask now passes batch and streaming (20/20).
gen-inputs.py updated to use zero-padded K/V context matching FoldWindowAttention
semantics (startup chunk attended by zeros, consistent with Delay discarding).
REVISIT.md documents five items to generalise before the real encoder lands.
One diagram per harness stage (batch + pulsed where applicable).
block-left-1-mask gets three: original batch, after FoldWindowAttention
(mask subgraph eliminated), and final pulsed form.
…ation infrastructure

Adds `ex05-block-left-1-posenc`: flat-token windowed attention (left_chunks=1, P=2)
with an ALiBi relative-position bias added to scores before masking.
FoldWindowAttention does not fire (Iff true-branch is Add(EinSum, pos_bias),
not a bare EinSum), so this exercises a new ROI-based pulsification path.

New optimizer passes and ops:
- `UniformTDim` op (core): materialise a TDim boolean expression as a tensor at
  runtime; used for the chunk-window mask.
- `FoldUniformTDim` pass (core): replace producer subgraphs of bool wires that
  carry `uniform_tdim` metadata with a single `UniformTDim` node.  Only bool
  wires are folded; f32 wires retain `uniform_tdim` as metadata for later passes.
- `PropagateRoi` extended: now walks backward through `TypedBinOp` chains,
  annotating every wire with `region_of_interest`.  This lets the binary pulsifier
  and the ROI-aware EinSum pulsifier fire for the pos-bias subgraph.

New pulsifiers (pulse crate):
- `binary.rs` — `TypedBinOp` pulsifier: when a wire has both ROI and `uniform_tdim`
  (e.g. `rel_pos = sub(row_pos, col_pos)`), evaluates the integer coordinate
  expression at window-local positions to produce a constant `[P, (L+1)*P]` tensor.
- `mask.rs` — `Iff` pulsifier: when the condition wire carries a chunk-window
  `uniform_tdim`, elides the Iff and passes the true branch through directly.
  This avoids a shape-inference problem where the fill's symbolic `[S,S]`
  MultiBroadcastTo would cause downstream ops to inherit an unresolvable shape.
- `einsum.rs` — two-case EinSum pulsifier:
  - **QK case** (existing, improved): uses `PulseWrappingOp` so Q's streaming axis
    propagates to the scores output, enabling downstream ops to stream correctly.
  - **AV case** (new): detects the AV pattern (streaming V on a contracted axis),
    applies `Delay(overlap=L*P)` to V, then uses `PulseWrappingOp` for the output.
- `uniform_tdim.rs` — `UniformTDim` pulsifier: produces an all-true `[P, (L+1)*P]`
  constant (within the ROI window the mask is always satisfied).
- `mod.rs` — registers `binary` and `mask` modules in `register_all_mod!`.

Harness reorganisation: renamed `block-*` → `ex0N-block-*` for natural ordering.

Streaming comparison for ex05 is deferred (requires FoldWindowAttention to handle
the biased-scores pattern and switch to zero-padded startup semantics).
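Why the `TypedBinOp` pulsifier can emit a *constant* `[P, (L+1)*P]` tensor for `rel_pos = sub(row_pos, col_pos)`: within a context window of `(L+1)*P` keys ending at the current chunk, the absolute chunk position cancels in the subtraction. A sketch (illustrative Rust; `rel_pos_window` is a hypothetical name):

```rust
// Evaluate the integer coordinate expression at window-local positions.
// Global row index = base + L*P + row, global col index = base + c;
// the base cancels, so the result is the same constant every pulse.
fn rel_pos_window(p: usize, l: usize) -> Vec<Vec<i64>> {
    let kw = (l + 1) * p;
    (0..p)
        .map(|row| (0..kw).map(|c| (l * p + row) as i64 - c as i64).collect())
        .collect()
}

fn main() {
    let m = rel_pos_window(2, 1); // [P=2, (L+1)*P=4]
    assert_eq!(m[0], vec![2, 1, 0, -1]);
    assert_eq!(m[1], vec![3, 2, 1, 0]);
}
```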
FoldWindowAttention detected Iff(chunk_window_mask)→Softmax→EinSum and
rewrote the T×T attention to bounded-window chunk-layout before pulsification.
This pass mixed attention-universe rewriting with pulse-universe machinery, and it proved unnecessary: ex04 and ex05 already stream correctly via ChunkWindowMask + the binary pulsifier acting on the flat attention graph directly.

Changes:
- Delete core/src/optim/fold_window_attention.rs and remove from declutter()
- compare --stream: round stream_dim up to multiple of input_pulse to avoid
  partial NaN-padded last pulse propagating through K_ctx
- compare --stream: skip structurally incompatible intermediates (yellow)
  rather than failing when pulsed shapes differ from reference (e.g. [P,kw]
  vs [S,S] for windowed attention)
- delay.rs: use zero_dt for startup buffer (NaN workaround; root cause
  documented in REVISIT item 8: pulsify_qk does not propagate K delay)
- ex04 gen-inputs.py: switch to -inf masking (natural batch graph semantics),
  removing the zero-padded startup semantics that matched FoldWindowAttention
- ex05 ci.sh/gen-inputs.py/graph.nnef: remove stale FoldWindowAttention notes
- README/REVISIT/doc d2: update to reflect current approach

All four streaming harnesses pass: ex01, ex03, ex04, ex05.
ChunkWindowMask always produced a rank-2 [P, key_window] mask, but
Iff.output_facts requires all three inputs to have identical rank.
For rank-4 attention [B, H, P, kw], the mask and fill were rank-2,
causing a rank-mismatch error at pulsification time.

Fix (pulse/src/ops/mask.rs): after wiring ChunkWindowMask, add
NonPulsingWrappingOp(AxisOp::Add(0)) for each leading dimension needed
to promote the mask from [P,kw] to [1,...,1,P,kw].  Update the fill
tensor shape from [1,1] to [1;true_rank] for the same reason.

The EinSum pulsifiers (pulsify_qk, pulsify_av) already handle arbitrary
ranks — they use the axis indices from classify_chunk_window (which
extracts the actual row/col axes from the uniform_tdim coord symbols
after remapping through unsqueeze) and the streaming fact axis fields.

New harness ex06-batch-multihead: B=1, H=2, T=8, P=2, left_chunks=1,
Dh=4; qkv [1,2,S,12] streaming on axis 2.  Batch and streaming both
pass.
…ssions

Add ex07-block-left-1-chunkpos harness: sliding-window attention with
a chunk-level relative-position bias

  v_bias[i,j] = slope * (floor(i/P) - floor(j/P))   slope = -0.5

This is the Transformer-XL v-bias concept (constant additive position
term) expressed as a direct arithmetic formula.

At pulse time the `chunk_diff` wire carries
  uniform_tdim = Div(🎯0, 2) - Div(🎯1, 2)
which the binary pulsifier evaluates at steady-state coordinates,
exercising integer-division inside a TDim coordinate expression —
new machinery not covered by ex05's linear (i-j) bias.
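The ex07 bias formula can be sketched directly (illustrative Rust; `v_bias` is a hypothetical name, the numbers mirror the harness parameters):

```rust
// Chunk-level relative-position bias:
// v_bias[i][j] = slope * (floor(i/P) - floor(j/P)), with slope = -0.5.
fn v_bias(t: usize, p: usize, slope: f32) -> Vec<Vec<f32>> {
    (0..t)
        .map(|i| (0..t).map(|j| slope * ((i / p) as f32 - (j / p) as f32)).collect())
        .collect()
}

fn main() {
    let b = v_bias(6, 2, -0.5);
    assert_eq!(b[4][0], -1.0); // chunk 2 vs chunk 0: -0.5 * 2
    assert_eq!(b[4][4], 0.0);  // same chunk: no bias
}
```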

Also add REVISIT item 9 documenting what would be needed to handle
the full Transformer-XL Q@R content-to-position term (Pad+Reshape+Slice
skew currently breaks uniform_tdim propagation).
…rces and multi-axis pad

MultiBroadcastTo pulsifier was computing per-pulse output size by substituting
S→pulse directly into op.shape.  For shapes derived from shape_of(strided conv),
e.g. 1+S/2, this yielded 1+P/2 instead of the correct P/2 — one frame too many.

Root cause: the batch formula includes a constant boundary term (frames produced
before any real input arrives due to pre-padding) that should not contribute to
the per-pulse slot.  Fix: use substitute(S→P) - substitute(S→0) on the streaming
axis, which strips the constant offset and leaves only the linear per-pulse part.

  shape_of(stride-2 conv) = 1 + S/2
    old: substitute(S→4)                    = 3   (wrong)
    new: substitute(S→4) - substitute(S→0)  = 2   (correct)
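The substitution trick above can be sketched as (illustrative Rust, not tract's TDim machinery; `shape_formula` and `per_pulse` are hypothetical names):

```rust
// Batch shape formula for the streaming axis: 1 + S/2. The constant "1"
// is a boundary term (frames produced from pre-padding alone) and must
// not be counted once per pulse.
fn shape_formula(s: i64) -> i64 {
    1 + s / 2
}

// substitute(S→P) - substitute(S→0): strips the constant offset, leaving
// only the linear per-pulse part.
fn per_pulse(pulse: i64) -> i64 {
    shape_formula(pulse) - shape_formula(0)
}

fn main() {
    assert_eq!(shape_formula(4), 3); // naive substitution: one frame too many
    assert_eq!(per_pulse(4), 2);     // correct per-pulse frame count
}
```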

Also included:
- pulse/source.rs: allow non-streaming model inputs (e.g. a "length" scalar)
  to pass through pulsification as static sources, provided at least one other
  model input does carry the streaming symbol.
- pulse/pad.rs: support Pad ops that have constant padding on non-streaming axes
  in addition to streaming-axis padding (apply PulsePad for the streaming axis,
  then a plain Pad for the remaining axes).

Regression test: harness/nnef-test-cases/conv-then-shape-of-mask reproduces the
Nemotron encoder pattern (stride-2 conv + tract_core_broadcast with batch-formula
shape), and verifies that batch and streaming modes agree.
…eam symbol

When a static table (e.g. a PE table [9999,D]) is sliced to a length derived
from the streaming symbol — e.g. Slice{axis=0, start=0, end=S} — the Slice
pulsifier previously crashed with "Unexpected streamless fact".

Fix: detect non-streaming input, substitute S→pulse in start/end, and wire a
concrete-size Slice via NonPulsingWrappingOp.  The output is static per pulse
(same PE entries every chunk), and the downstream binary op is handled by
PulseWrappingOp.

Add harness/nnef-test-cases/slice-of-static-with-streaming-size as a regression
test covering both batch and streaming compare modes.
…/O, uniform_tdim propagation

Key changes:

core/ops/logic.rs: classify_chunk_window now tolerates extra Mul factors
  (e.g. padding-validity conditions ANDed in); searches all ordered pairs.

core/ops/binary.rs: And propagates chunk-window uniform_tdim from one
  operand when the other has none (padding mask whose chain broke at
  the audio-length scalar).

data/dim/sym.rs: coord symbols (🎯k) are always non-negative; proves
  positivity without a known bound.

pulse/ops/binary.rs (Bool pulsifier): scan input outlets for
  chunk-window uniform_tdim when the stored output fact has none
  (FoldUniformTDim creates UniformTDim nodes but does not re-propagate
  uniform_tdim to downstream stored facts).  Use unwrap_or(1) for
  batch/undetermined symbols in non-window axes.

pulse/ops/mask.rs (Iff pulsifier): same fallback — scan inner_outlet's
  inputs when inner_fact.uniform_tdim is None.

pulse/src/model.rs (into_typed): relax "all inputs/outputs streaming"
  to "at least one streaming", allowing non-streaming auxiliary inputs
  (e.g. sequence-length scalar) and outputs (e.g. encoded_lengths).
  Use -1 sentinel axis for non-streaming I/O; use 0 delay for
  non-streaming outputs.

pulse/ops/array/reshape.rs, slice.rs, uniform_tdim.rs: formatting only.
…s tests

ex08: adds batch dim to ex04's flat T×T masked attention (scores [B,S,S]).
      Passes — establishes that PropagateRoi works through the batch axis.

ex09: adds head dimension H=2 (scores [B,H,S,S], mask broadcast [B,1,S,S]).
      Passes — batch and head axes are transparent to ROI propagation.

ex10: inverted Iff convention: select(~window_mask, -inf, scores).
      Scores are at inputs[2] (false-branch), not inputs[1] (true-branch).
      Batch run passes; streaming fails, exposing two gaps that must be
      fixed before the encoder pulsifies:
        1. PropagateRoi annotates only inputs[1]; must also handle inputs[2].
        2. UniformTDim pulsifier cannot handle `1 + -1*cw` (negated expr).
…vention

Some models (e.g. Nemotron encoder) use select(~window_mask, fill, scores)
where condition=True means "masked out".  Scores are in the false-branch
(inputs[2]) rather than the true-branch (inputs[1]).

The condition's uniform_tdim is `1 + -1*cw` (NOT of the window mask).
Add classify_negated_chunk_window() in logic.rs to detect this form,
then use it in PropagateRoi.run_direct to annotate inputs[2] instead
of inputs[1] when the inverted convention is detected.

Unit test added for the inverted case.
Some models (e.g. Nemotron encoder) use the inverted masking convention
  select(~window_mask, fill, scores)
where condition=True means "masked out" and scores are in the false-branch.

Four operator-local fixes make this pulsify correctly:

1. logic.rs: add peel_negated_chunk_window_expr() to extract the positive
   chunk-window TDim from a negated expression `1 + -1*cw`.
   Add classify_negated_chunk_window() built on top of it.

2. propagate_roi.rs: detect the inverted convention by checking whether the
   condition's uniform_tdim is a negated chunk-window.  Annotate inputs[2]
   (false-branch = scores) with the *positive* CW expression as the ROI,
   so the QK EinSum pulsifier correctly inserts a windowed Delay on K.

3. uniform_tdim.rs: handle negated CW expressions by emitting an all-False
   constant mask (in-window → False for the inverted convention).

4. mask.rs (Iff pulsifier): FoldUniformTDim may replace not(window_mask)
   with a UniformTDim carrying the negated expression, so peel_condition()
   may not detect the inversion.  After classify_chunk_window fails on the
   inner expression, try classify_negated_chunk_window and XOR inverted.
   Also extend peel_condition to recognise ElementWiseOp(Not) (NNEF `not`)
   in addition to ElementWiseOp(BitNot).

ex10-batch-multihead-projections now passes both batch run and compare --stream.
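The negated-expression peel in fix 1 can be sketched with a toy expression type (illustrative Rust; `Expr` and `peel_negated` are hypothetical, not tract's TDim):

```rust
// Peel the positive chunk-window expression out of the negated boolean
// form `1 + -1*cw` produced by NOT of the window mask.
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Val(i64),
    Sym(&'static str),
    Add(Vec<Expr>),
    Mul(i64, Box<Expr>),
}

// Returns Some(cw) when expr has the shape 1 + (-1)*cw, None otherwise.
fn peel_negated(expr: &Expr) -> Option<Expr> {
    if let Expr::Add(terms) = expr {
        if terms.len() == 2 && terms[0] == Expr::Val(1) {
            if let Expr::Mul(-1, inner) = &terms[1] {
                return Some((**inner).clone());
            }
        }
    }
    None
}

fn main() {
    let cw = Expr::Sym("cw");
    let negated = Expr::Add(vec![Expr::Val(1), Expr::Mul(-1, Box::new(cw.clone()))]);
    assert_eq!(peel_negated(&negated), Some(cw.clone()));
    assert_eq!(peel_negated(&cw), None); // a bare chunk-window is not negated
}
```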
Add TypedOp::input_roi() — a per-input ROI annotation method that lets
ops declare which of their inputs should receive a chunk-window
region_of_interest derived from another input's uniform_tdim.

ScaledMaskedSoftmax implements input_roi: reads the mask input's
uniform_tdim (handling both standard and negated CW expressions) and
returns Some(roi) for the scores input, None for the mask input.

ScaledMaskedSoftmax also implements axes_mapping returning
AxesMapping::natural, enabling the generic PulseWrappingOp fallback to
track the streaming row axis through the op.

PropagateRoi::run_direct gains a second sub-loop that calls
op.input_roi(...) on every non-Iff node and propagates ROI up through
any Some(roi) slot.

Fix UniformTDim pulsifier for left_chunks > 0: instead of a constant
all-True tensor (which incorrectly treats zero-padded K positions as
valid during startup), emit a ChunkWindowMask stateful op followed by
AxisOp::Reshape to restore leading singleton dims. This makes the mask
correct at each chunk for both Iff-based and SMS-based attention.

Add ex11-batch-scaled-masked-softmax harness: exercises the full SMS
pulsification path (PropagateRoi via input_roi → pulsify_qk K delay →
SMS via PulseWrappingOp → pulsify_av V delay).

REVISIT item 10: unify Iff-specific PropagateRoi loop into Iff::input_roi.
…left_chunks>0)

PropagateRoi now propagates the chunk-window ROI backward through the full
skew-trick pos_scores chain (Add→Slice→Reshape→Slice→Reshape→Pad→EinSum→R).
Each operator contributes a TypedOp::input_roi hook:
- Slice, Pad, DynSlice: pass output ROI to input[0]
- AxisOp::Reshape: pass ROI with row/col axes swapped when the reshape touches
  both the row and col axes (the skew step)
- EinSum: annotate the key/position input (has col axis, not row axis)

The Slice pulsifier extends each slice's range by L*P in the direction that
matches the semantic role of the slice:
- Fixed-start slices (pos_sliced, pos_scores): add L*P to end
- Center-anchored slices (R extraction: start=center-S): subtract L*P from start

Verified: all 13 sdpa-pulse harness examples pass; propagate_roi unit tests pass.
Adds ex13-rel-pos-skew-window harness example (left_chunks=1, W=4, P=2).
…unds

Range pulsifier: fire when all inputs are non-streaming (static start/end/step)
but the output length contains the streaming symbol.  Wraps in NonPulsingWrappingOp
so the range tensor is re-evaluated each pulse from the concrete per-pulse bounds.

nnef range_load: compute symbolic length from constant start/end/step when possible,
preserving the expression (e.g. T_tokens) instead of a fresh free symbol.  This
prevents shape contamination during pulsification.

Also adds ex12-rel-pos-skew harness example (RPE skew trick, left_chunks=0).
…e facts forward pass

Three cooperating fixes enable the Nemotron encoder to pulsify correctly:

1. **DynSlice::output_facts**: propagate `uniform_tdim` from the input when `begin=0`.
   The mask computation chain passes through a DynSlice (e.g. attMaskSlice0), and
   `without_value()` was zeroing out `uniform_tdim`.  With begin=0 the slice result
   coordinates are identical to the source coordinates, so the predicate stays valid.

2. **PropagateRoi forward pass**: before the Iff/input_roi annotation loops, walk
   the graph in topological order and re-run `output_facts` for any node whose output
   lacks `uniform_tdim` but whose input has it.  This fixes stale facts left behind
   when patch-based passes (FoldUniformTDim) shunt a subgraph to a UniformTDim node:
   downstream nodes (Cast, And, BitNot, AddAxis) keep their old facts; this pass
   refreshes them so PropagateRoi sees the correct uniform_tdim on every wire.

3. **UniformTDim pulsifier**: change the output shape from `Vec<usize>` to `TVec<TDim>`
   so that symbolic batch dimensions (e.g. BATCH) are carried through as TDim::Sym
   rather than failing with "Undetermined symbol".  The ChunkWindowMask reshape uses
   the TDim shape directly; the constant-tensor fallback evaluates symbolic dims to 1.

Also: ScaledMaskedSoftmax::input_roi now builds a properly classified chunk-window ROI
(using build_chunk_window_roi) rather than returning the raw TDim expression, so the
ROI axes are shifted correctly when scores have extra leading (heads) dimensions.

Harness: nemotron encoder pulsification test added (pulse=112 = 14 chunks × 8x stride).
- Iff::input_roi: detect inverted convention (select(~mask, fill, scores))
  and annotate inputs[2] with the positive ROI expression instead of
  always annotating inputs[1]
- ScaledMaskedSoftmax: add axes_mapping (AxesMapping::natural) so the
  generic pulsifier can track the streaming axis through the op
- Slice/DynSlice pulsifiers: guard ROI-driven range extension against
  the input's actual size — if the upstream chain hasn't been expanded,
  fall through to the non-ROI path instead of producing out-of-bounds
  slices
- Remove duplicate old-signature input_roi implementations left over
  from the rebase (binary, pad, slice, change_axes, einsum, dyn_slice)
When the output wire has ROI but no uniform_tdim (e.g. Mul(rel_pos, -0.125)
where the float scaling breaks TDim propagation), walk upstream through
scalar-constant TypedBinOp nodes to find the nearest uniform_tdim.

After evaluating the integer coordinate expression, replay the scalar ops
(Mul, Add, etc.) in forward order to recover the actual output values
including any float scaling.

Fixes ex05 (ALiBi position bias) and ex07 (chunk-level position bias)
which regressed after rebasing onto main's systematic ROI propagation.
After ROI bubbles through Pad/Reshape, coordinate symbols in the
chunk-window expression may be offset (e.g. Div(🎯k+1, P) instead of
Div(🎯k, P)) and extra constant terms may appear in the diff.

Generalize extract_div_diff_axes to:
- Accept Div(Add(Sym, Val), P) in addition to Div(Sym, P)
- Ignore Val(_) constant terms in the Add

The offsets don't change P, L, or the axis assignment — they're
positional shifts from upstream coordinate transforms.

Fixes ex13 (skew trick with left_chunks=1) which failed because
the EinSum's input_roi couldn't classify the shifted ROI expression
and didn't propagate ROI to the R position table input.
When the K/R input to a QK EinSum is non-streaming, detect if it's a
computable constant (via try_compute_const for shared upstream chains)
and pre-slice it for the symmetric RPE case (feeds_into_nonlinear).

This restores the pre-rebase behavior where pulsify_qk pre-sliced the
r_pos table to [W+P-1, Dh] centered at the zero-relative-position entry.

For non-const, non-streaming K (e.g. DynSlice of r_full depending on S),
gracefully decline and let the generic pulsifier + ROI-aware DynSlice/Slice
pulsifiers handle it.

Fixes ex14-reduced-skew and ex14-rel-pos-skew-large-table.
When r_pos_proj can't be evaluated directly (symbolic shapes from streaming
symbol), substitute the streaming symbol with the pulse value and retry.
This handles the encoder pattern where posEnc[0:T] @ W_pos has T depending
on the streaming symbol.

Also tighten the non-streaming K input detection: require that the matched
input maps to the output col_axis but NOT the row_axis (same criterion as
EinSum::input_roi). This prevents matching V, biases, or weight tensors
that happen to have the col axis.
kali added 27 commits April 9, 2026 06:10
Slice only implements eval_with_session (uses session.resolved_symbols
to evaluate start/end). The plain eval returns "stateless evaluation
not implemented". After concretize_dims produces concrete TDim::Val
bounds, eval_with_session works with an empty TurnState.

Also use translate_model_with_mappings in try_compute_const_with_substitution
to correctly map outlet IDs when concretize_dims changes node topology.
…umers)

Reproduces the encoder pattern: DynSlice on posEnc with symbolic bounds,
concretize_symbols(BATCH=1) before pulse, and shared posEnc across
multiple linearPos EinSums.

Currently passes at small scale (T=8, D=8). The encoder's failure at
full scale (T=589, D=1024) is under investigation — pulsify_qk's
non-streaming K path isn't being reached for the encoder's pos_raw
EinSums despite correct ROI annotations.
…mall

When try_compute_const_with_substitution evaluates the posEnc chain
with one pulse worth of streaming symbol, the resulting symmetric RPE
table may be too small for the key window (t_max < key_window).

Re-evaluate with a larger symbol value (key_window * pulse) to get
enough rows for the pre-slice.  This handles the encoder pattern
where posEnc[center-T:center+T-1] produces 2*T-1 rows and T at
one pulse is much smaller than the needed window.
With left_chunks=5 (W=12), the key window exceeds the skew trick's
T=P-based intermediate shapes. The Slice pulsifiers can't extend
because the input is too narrow — same failure as the encoder.

ci-failing.sh uses left_chunks=5 (fails: broadcast 12 against 2).
The graph.nnef has left_chunks=5 for the failing case; the passing
version (left_chunks=1) can be restored in a ci.sh alongside.
DynSlice RPE + skew trick + BATCH + concretize_symbols + P=4/pulse=4.
Broadcast error: W=16 against P=4 at Add(content_scores, pos_scores).

Works with P=2/pulse=2 but fails at P=4/pulse=4.
Same pattern as the encoder (P=14, left_chunks=5, W=84).
Restore AV EinSum (output must be linear in S, not quadratic).
Increase T to 16 so T/P = 4 ≥ L+1 = 4.
Enlarge r_full to [31, 4] (= 2*T-1).

Both batch and pulsed runs pass.
Rename ci-failing.sh to ci.sh since this is now a passing test.
Add stride-2 subsampling so the token dimension T = S/2 differs from
the audio streaming symbol S.  Pulse on S with pulse=4 gives T=2 tokens
per pulse, but W=(3+1)*2=8.  The skew trick Reshapes use T=2, producing
intermediates too narrow for W=8.  Broadcast error: 8 against 2.

Same failure as encoder.p1: AUDIO_SIGNAL__TIME/8 ≠ AUDIO_SIGNAL__TIME.

Also fix panic in PropagateRoi when a node doesn't implement TypedOp.
Both --set (CLI flag) and concretize_symbols (transform config) now
accept any TDim expression as a value, not just integers.

  --set T=S/2                          # symbolic substitution
  --set BATCH=1                        # integer (backward compat)
  -t 'concretize_symbols(values: {"BATCH": 1})'    # RON integer
  -t 'concretize_symbols(values: {"T": "S/2"})'    # RON string

Integer values go through the existing SymbolValues/concretize_dims
path. Symbolic TDim values are applied first via TDim::substitute on
all model facts, then the integer values are concretized.

SymbolValues gains set_tdim/get_tdim/tdim_iter for symbolic entries.
ConcretizeSymbolsConfig uses a custom deserializer (StringOrInt) to
accept both RON integers and strings in the values map.
The relative-position "skew trick" (Pad→Reshape→Slice→Reshape→Slice)
converts relative position scores [T, 2T-1] to absolute [T, T] via
intermediate reshapes whose shapes depend on the full sequence length T.
These intermediates cannot be pulsified individually because they create
whole-sequence dependencies.

The composed mapping is a clean diagonal gather:
  pos_scores[i, k] = pos_raw[i, (T-1) + k - i]

DiagGather captures this as a single TypedOp in the pulse crate.
A pre-pulsification fold pass matches the 5-op chain and replaces it;
the pulsifier then trivially produces [P, W] output per pulse using
offset = P_local - 1.
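The composed mapping can be sketched as a plain gather (illustrative Rust; `diag_gather` here is a toy model of the new op, not its tract signature):

```rust
// Skew-trick composition as a single diagonal gather:
// pos_scores[i][k] = pos_raw[i][(T-1) + k - i].
// With pos_raw[i][r] encoding relative position r - (T-1), the gathered
// score at [i][k] is exactly the relative position k - i.
fn diag_gather(pos_raw: &[Vec<i64>], t: usize) -> Vec<Vec<i64>> {
    (0..t)
        .map(|i| (0..t).map(|k| pos_raw[i][(t - 1) + k - i]).collect())
        .collect()
}

fn main() {
    let t: usize = 4;
    // pos_raw is [T, 2T-1]; column r encodes relative position r - (T-1)
    let pos_raw: Vec<Vec<i64>> = (0..t)
        .map(|_| (0..2 * t - 1).map(|r| r as i64 - (t as i64 - 1)).collect())
        .collect();
    let scores = diag_gather(&pos_raw, t);
    assert_eq!(scores[2][0], -2); // key 0 relative to query 2
    assert_eq!(scores[2][2], 0);  // zero relative position on the diagonal
}
```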

Also in this commit:
- --set accepts TDim expressions (e.g. --set S=2*s) for symbol substitution
- --set runs before --pulse so substitutions take effect pre-pulsification
- TDim::eval checks tdim_values for symbolic substitutions
- PropagateRoi: simplify ROI demands, skip trivial ROI=1
- Pre-flight check warns about superlinear wires missing ROI
- PulsedFact evaluates streaming dim expression with pulse symbol
- ex16 test case: double stride-2 subsampling + skew trick
- README: pulsification semantics (increment, hull, contract, ROI, delay)
End-to-end streaming demo for nvidia/nemotron-speech-streaming-en-0.6b:
- Mic thread sends 80-sample chunks (5ms) simulating real-time audio
- Main thread accumulates audio, runs preprocessor, feeds encoder pulses
  (112 audio frames = 14 transformer tokens), then greedy RNNT decode
- Progressive transcript output with model timing tags
- Encoder uses DiagGather for the skew trick, pulsified at 112 frames
1. Buffer initialization: CPU Delay uses zero_dt (zero-filled) but
   GpuDelay used uninitialized_dt. On the first pulse, from_buffer
   elements were copied from uninitialized GPU memory to the output.
   Fix: initialize with Tensor::zero_dt(...).into_device().

2. Overlapping buffer shift: when buffered >= input_pulse, the buffer
   shift used flat_copy with overlapping source and destination regions
   within the same GPU buffer. CUDA memcpy is undefined for overlapping
   regions (parallel threads). Fix: copy via a temporary buffer using
   assign_slice which operates on distinct tensors.

A third issue remains: GpuPulsePad crashes with CUDA_ERROR_ILLEGAL_ADDRESS
at layer 3's depthwise conv, despite identical ops succeeding in layers
0-2. The crash is from a pending GPU error detected at PulsePad's first
memory access, suggesting a kernel in the attention pipeline corrupts
memory. The exact source is not yet identified — likely a memory pool
aliasing issue or a CUDA kernel bug for specific tensor shapes.
flat_copy passed byte_len as the shape to copy_nd, but the CUDA kernel
dispatched based on datum_type operates on typed elements (u32 for f32).
For f32 tensors, this caused the kernel to access 4x the intended range,
corrupting GPU memory.

Fix: divide byte_len by element size before passing to copy_nd. Also
validate alignment of offsets and lengths.
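The unit mismatch can be sketched as follows (illustrative Rust; `elements_to_copy` is a hypothetical helper, not the actual kernel interface):

```rust
// A copy kernel dispatched on the datum type counts typed elements, not
// bytes, so the length handed to it must be byte_len / element size,
// with an alignment check.
fn elements_to_copy(byte_len: usize, elem_size: usize) -> Result<usize, String> {
    if byte_len % elem_size != 0 {
        return Err(format!(
            "byte length {} not aligned to element size {}",
            byte_len, elem_size
        ));
    }
    Ok(byte_len / elem_size)
}

fn main() {
    // 16 bytes of f32 is 4 elements; passing 16 directly would make the
    // kernel touch 4x the intended range
    assert_eq!(elements_to_copy(16, std::mem::size_of::<f32>()), Ok(4));
    assert!(elements_to_copy(10, 4).is_err());
}
```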

This was the root cause of the CUDA_ERROR_ILLEGAL_ADDRESS crash when
running the pulsified Nemotron encoder on GPU.

Also: enable CUDA runtime in the streaming ASR example.
Pulsify the preprocessor (17920 samples/pulse = 112 feature frames)
instead of re-running on accumulated audio. Buffer feature frames
between preprocessor and encoder to handle the 3-frame delay offset.

Pipeline per pulse: preproc ~10ms + encoder ~25ms = ~35ms for 1.12s
of audio (32x real-time on CUDA).

Bonus: the pulsified preprocessor produces slightly different output
at boundaries, and the transcript now matches the batch reference
exactly (including "eyes. It" with period).
- Carriage return display: transcript grows in-place on one line
- [pre] [enc] [jnt] [dec] labels show which model is running
- Trailing spaces mask previous longer labels
- Stats summary at end: per-model timings and real-time ratio
- 22.9x real-time compute on CUDA (325ms for 7.43s audio)
- Log "Loading <model> to <runtime>... done." for each model
- Log "Ready (Xs)" when all models loaded
- Reduce preprocessor pulse from 17920 to 1600 samples (~100ms)
  to avoid cumulative latency; feature buffer bridges to encoder
Split the monolithic main() into two structures:
- NemotronModels: shared read-only context (runnables, vocab, pulse
  metadata). Constructed once via NemotronModels::load().
- StreamState: mutable per-session state (model states, audio/feature
  buffers, decoder RNNT state, stats). Methods replace macros:
  push_audio(), flush(), run_preproc(), feed_features(),
  run_encoder_pulse(), decode_frame().

main() is now: load models, create state, pump audio, flush, print.
- `--live` flag captures from the system microphone at 16kHz mono
- Requires the `live` feature: `cargo run --features live -- --live`
- cpal is an optional dependency behind the `live` feature gate
- WAV file mode remains the default (no extra dependencies)
- `--no-realtime` flag for WAV mode to process as fast as possible
@kali kali force-pushed the encoder-pulsification branch from bc8f78d to 4d0044e on April 10, 2026 at 08:39
- Add encoder pulsified end-to-end run test (not just dump -q).
  Uses concretize BATCH=1, patch length, select outputs, then pulse.
  Runs without output assertion (tail mismatch from partial last pulse
  is expected; batch test covers correctness).
- Add preprocessor pulsification test with small pulse (1600 samples,
  ~100ms) in addition to existing 4800.
- Remove stale ci-failing.sh from ex16 (now passes with DiagGather).