Key optimizations:
- SIMD vectorization with VLEN=8
- 3-batch parallel processing (6 VALU slots)
- Software pipelining: prefetch the next triple during the hash
- Broadcast optimization for rounds 0 and 11
- Eliminated vselect with arithmetic rewrites
- multiply_add fusion for the idx calculation
- Keep idx/val in scratch registers across all rounds
- Explored speculation (not viable: 48 loads needed, only 24 available)
- Explored index deduplication (not viable: indices are data-dependent)
- Updated the bottleneck analysis and remaining opportunities
- Current: 4294 cycles (34.4x speedup)
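Two of the bullets above — eliminating vselect with arithmetic rewrites, and multiply_add fusion for the idx calculation — can be sketched in plain Python. All names and lane values here are illustrative, not taken from the actual kernel:

```python
# Illustrative 8-lane vectors (VLEN=8); values are arbitrary.
cond = [1, 0, 1, 1, 0, 0, 1, 0]          # 0/1 per-lane predicate
a = [i * 10 for i in range(8)]
b = [i * 100 for i in range(8)]

# vselect-style pick: out[i] = a[i] if cond[i] else b[i]
picked = [ai if c else bi for ai, bi, c in zip(a, b, cond)]

# Arithmetic rewrite: b + cond*(a - b) -- one multiply_add, no select needed
rewritten = [bi + c * (ai - bi) for ai, bi, c in zip(a, b, cond)]
assert picked == rewritten

# multiply_add fusion for an index update: the two-op sequence
#   t = idx*2 ; idx_next = t + offs
# collapses into one fused multiply_add(idx, 2, offs).
idx = [3, 7, 1, 0, 5, 2, 6, 4]
idx_next = [i * 2 + (1 + c) for i, c in zip(idx, cond)]
```

The rewrite works because a 0/1 predicate times a difference reduces exactly to the selected value, so a select port is freed for other work.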
Batching optimizations:
- Batch hash constant loads (2 per cycle)
- Batch 5 vector broadcasts into one bundle
- Merge the XOR with the next-triple address computation (6 VALU ops)
- Batch broadcast-round remainder processing
- Batch triple-0 v_node[2] loads (2 per cycle)
- Batch normal-round remainder XOR
- Batch all remainder vector index operations together
- Add analysis scripts for bundle distribution and bottleneck analysis
- VALU utilization at 58.2%; theoretical minimum ~2,200 cycles
- Next target: software pipelining to overlap rounds
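A minimal sketch of the kind of bundle-distribution analysis these scripts perform. The schedule format, op names, and the 6-VALU-slot budget are assumptions for illustration, not the project's real data structures:

```python
from collections import Counter

# Assumed format: a schedule is a list of bundles, each bundle a list of
# (slot_type, op) pairs.
schedule = [
    [("valu", "vadd"), ("valu", "vmul"), ("load", "ld")],
    [("valu", "vxor")],
    [("load", "ld"), ("load", "ld")],
    [("valu", "vadd")] * 6,
]

VALU_SLOTS = 6  # assumed issue width per bundle

def valu_utilization(bundles):
    used = sum(1 for b in bundles for slot, _ in b if slot == "valu")
    return used / (VALU_SLOTS * len(bundles))

def bundle_histogram(bundles):
    # How many bundles issue 0, 1, 2, ... VALU ops; gaps reveal scheduling slack.
    return Counter(sum(1 for slot, _ in b if slot == "valu") for b in bundles)

print(f"VALU utilization: {valu_utilization(schedule):.1%}")
print(bundle_histogram(schedule))
```

Utilization against the slot budget gives the theoretical-minimum estimate: total VALU ops divided by the issue width is a lower bound on cycles.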
- Fuse hash stages 0,2,4 using multiply_add (a*(1+2^k)+c pattern)
- Add arithmetic selection for rounds 1,12 (idx in {1,2})
- Preload tree[1], tree[2] and compute node = tree[1] + diff*(idx-1)
- Still need 2.2x more improvement to reach <1700 target
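The two tricks in this commit can be checked directly. For the hash fusion: a stage of the form "add shifted self, then add a constant" is algebraically `a*(1+2^k)+c`, so two ops become one multiply_add. For the arithmetic selection: when idx is known to be 1 or 2, the node is a linear function of idx. The lane width (32 bits) and the constants below are assumptions for illustration:

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit lanes

def stage_two_ops(x, k, c):
    # original two-op hash stage: shift-add, then add-constant
    x = (x + (x << k)) & MASK32
    return (x + c) & MASK32

def stage_fused(x, k, c):
    # fused form: a*(1+2^k) + c in a single multiply_add
    return (x * (1 + (1 << k)) + c) & MASK32

for x in (0, 1, 0xDEADBEEF, 0x12345678):
    assert stage_two_ops(x, 5, 0x9E3779B9) == stage_fused(x, 5, 0x9E3779B9)

# Arithmetic selection for rounds where idx is in {1, 2}:
# node = tree[1] + (tree[2] - tree[1]) * (idx - 1) replaces a gather or vselect.
tree = [10, 20, 30]  # hypothetical preloaded values for tree[0..2]
for idx in (1, 2):
    assert tree[1] + (tree[2] - tree[1]) * (idx - 1) == tree[idx]
```

Both identities hold exactly modulo the lane width, so the fusion changes no results, only the op count.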
- Vectorize the select-round and normal-round loops over BATCH to fill VALU slots
- Update the prefetch and writeback paths to operate on batched offsets
- Cuts the cycle count to ~3,241 in local benchmarks
- Attach the remaining gather loads to index-update bundles
- Reduce load-only bundles and improve overlap
- Reach ~2,905 cycles on the test harness
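The overlap these commits chase — issuing the loads for the next iteration alongside the compute for the current one — is the classic software-pipelining loop shape. Python executes this sequentially, so the sketch only demonstrates the restructure and that it preserves results; `load` and `hash_step` are stand-ins, not the kernel's real operations:

```python
def load(mem, i):
    return mem[3 * i : 3 * i + 3]       # fetch one "triple"

def hash_step(triple):
    h = 0
    for v in triple:
        h = (h * 31 + v) & 0xFFFFFFFF   # placeholder hash, not the real one
    return h

def naive(mem, n):
    # load and hash strictly in sequence
    return [hash_step(load(mem, i)) for i in range(n)]

def pipelined(mem, n):
    out = []
    cur = load(mem, 0)                  # prologue: first load up front
    for i in range(n):
        # prefetch the next triple...
        nxt = load(mem, i + 1) if i + 1 < n else None
        # ...while the current one is being hashed
        out.append(hash_step(cur))
        cur = nxt
    return out

mem = list(range(30))
assert naive(mem, 10) == pipelined(mem, 10)
```

On the real machine the prefetch and the hash land in the same bundle, which is what turns load-only bundles into overlapped ones.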
- Replace manual scheduling with an automatic list scheduler
- vselect for tree levels 0-3 (preload nodes 0-14)
- Tiled processing (group_size=17, round_tile=13)
- Hash fusion for stages 0,2,4
- 92.9% of the theoretical minimum achieved

New analysis tools:
- profiler.py: per-cycle utilization breakdown, histograms, phase analysis
- visualize_schedule.py: ASCII timeline, gap detection
- bottleneck_detector.py: identifies whether a region is VALU- or memory-bound and makes recommendations

Updated the implementation guide with tool documentation and workflow.
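A greedy list scheduler of the kind this commit switches to can be sketched in a few lines. The op format `(name, kind, deps)` and the per-bundle slot budget are assumptions for illustration; the project's actual scheduler is certainly more elaborate (priorities, latencies):

```python
SLOTS = {"valu": 2, "load": 1}          # assumed per-bundle issue limits

def list_schedule(ops):
    done, cycles = set(), []
    pending = list(ops)
    while pending:
        budget = dict(SLOTS)
        bundle = []
        for op in list(pending):
            name, kind, deps = op
            # issue an op once its deps completed and a slot remains
            if deps <= done and budget.get(kind, 0) > 0:
                bundle.append(name)
                budget[kind] -= 1
                pending.remove(op)
        # results become visible only at the next cycle boundary
        done |= set(bundle)
        cycles.append(bundle)
    return cycles

ops = [
    ("ld_a", "load", set()),
    ("ld_b", "load", set()),
    ("add",  "valu", {"ld_a", "ld_b"}),
    ("xor",  "valu", {"add"}),
]
sched = list_schedule(ops)              # one bundle per cycle
```

Here the single load slot serializes the two loads, and the dependence chain serializes the two VALU ops, so the toy schedule takes four cycles.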
- Add a tmp4 register to extract bit 2 upfront, before the vselects clobber tmp1
- Eliminates the redundant (idx-7) recomputation in level-3 selection
- Saves 64 VALU ops (7267 → 7203)
- New theoretical minimum: 1201 cycles (was 1212)
- Efficiency: 92.1% of theoretical minimum

Added scripts/value_reuse_profiler.py to find recomputation patterns. Updated implementation_guide.md with session 7 findings.
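The level-3 selection picks one of the eight preloaded nodes 7..14 by the three bits of (idx-7) through a mux tree of vselects; extracting bit 2 into its own register before the first selects overwrite the temporary is what avoids recomputing idx-7. A per-lane scalar sketch, with register names mirroring the commit but the code itself purely illustrative:

```python
tree = list(range(100, 115))            # preloaded nodes 0..14, values arbitrary

def select_level3(idx):
    off = idx - 7                       # tmp1 = idx - 7
    bit2 = (off >> 2) & 1               # tmp4: extract bit 2 UPFRONT, before
                                        # the selects below clobber tmp1
    bit0 = off & 1
    bit1 = (off >> 1) & 1
    # cascade of selects over the 8 preloaded nodes (a 3-level mux tree)
    lo = [tree[7 + 2 * i + bit0] for i in range(4)]   # select by bit 0
    mid = [lo[2 * i + bit1] for i in range(2)]        # select by bit 1
    return mid[bit2]                                  # select by bit 2

for idx in range(7, 15):
    assert select_level3(idx) == tree[idx]
```

Without the upfront extraction, bit 2 would have to be rebuilt from a fresh idx-7 after the first select layer, which is the 64-op recomputation the commit removes.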
- Remove the zero_vec constant (unused after the self-XOR optimization)
- Only allocate hash_vec_consts3 for non-multiply_add stages
- Use idx^idx for the wrap instead of idx+zero (already committed)

Maintains 1304 cycles (113.3x speedup over baseline).
Best I could get was 1304 cycles.