Vector opt review #20

Open
Trav55555 wants to merge 15 commits into anthropics:main from Trav55555:vector-opt-review
@Trav55555

Best I could get was 1304 cycles.

Key optimizations:
- SIMD vectorization with VLEN=8
- 3-batch parallel processing (6 VALU slots)
- Software pipelining: prefetch next triple during hash
- Broadcast optimization for rounds 0 and 11
- Eliminated vselect with arithmetic rewrites
- multiply_add fusion for idx calculation
- Keep idx/val in scratch across all rounds
- Explored speculation (not viable: 48 loads needed, 24 available)
- Explored index deduplication (not viable: indices data-dependent)
- Updated bottleneck analysis and remaining opportunities
- Current: 4294 cycles (34.4x speedup)

Batching optimizations:
- Batch hash constant loads (2 per cycle)
- Batch 5 vector broadcasts into one bundle
- Merge XOR with next-triple address compute (6 VALU ops)
- Batch broadcast round remainder processing
- Batch triple-0 v_node[2] loads (2 per cycle)
- Batch normal round remainder XOR
- Batch all remainder vector index operations together
- Add analysis scripts for bundle distribution and bottleneck analysis
- VALU utilization at 58.2%, theoretical minimum ~2,200 cycles
- Next target: software pipelining to overlap rounds
…eedup)

- Fuse hash stages 0,2,4 using multiply_add (a*(1+2^k)+c pattern)
- Add arithmetic selection for rounds 1,12 (idx in {1,2})
- Preload tree[1], tree[2] and compute node = tree[1] + diff*(idx-1)
- Still need 2.2x more improvement to reach <1700 target
- Vectorize select-round and normal-round loops over BATCH to fill VALU slots
- Update prefetch and writeback paths to operate on batched offsets
- Cuts cycle count to ~3,241 in local benchmarks
- Attach remaining gather loads to index update bundles
- Reduce load-only bundles and improve overlap
- Reach ~2,905 cycles on test harness
- Replace manual scheduling with automatic list scheduler
- vselect for tree levels 0-3 (preload nodes 0-14)
- Tiled processing (group_size=17, round_tile=13)
- Hash fusion for stages 0,2,4
- 92.9% of theoretical minimum achieved
- profiler.py: per-cycle utilization breakdown, histograms, phase analysis
- visualize_schedule.py: ASCII timeline, gap detection
- bottleneck_detector.py: identifies VALU/memory bound, recommendations
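
The hash fusion listed above (the `a*(1+2^k)+c` pattern) can be pictured with a small sketch. It assumes a hash stage of the form `(a + c) + (a << k)` on 32-bit words, which is one common shape that matches the pattern but is not confirmed by this PR. The helper `multiply_add` and the constants in the usage line are hypothetical stand-ins, not the kernel's actual API or values.

```python
MASK32 = 0xFFFFFFFF  # assumption: the hash works on 32-bit words


def stage_unfused(a: int, c: int, k: int) -> int:
    """Assumed stage shape: add a constant, shift, add (three ALU ops)."""
    t = (a + c) & MASK32
    s = (a << k) & MASK32
    return (t + s) & MASK32


def multiply_add(a: int, b: int, c: int) -> int:
    """Hypothetical stand-in for a fused a*b + c on a 32-bit lane."""
    return (a * b + c) & MASK32


def stage_fused(a: int, c: int, k: int) -> int:
    """Same value in one op: a + (a << k) + c == a * (1 + 2**k) + c (mod 2**32)."""
    multiplier = (1 + (1 << k)) & MASK32
    return multiply_add(a, multiplier, c)


def fusion_matches(values, c: int, k: int) -> bool:
    """Check both forms agree on every input in `values`."""
    return all(stage_fused(a, c, k) == stage_unfused(a, c, k) for a in values)
```

For example, `fusion_matches([0, 1, 0xDEADBEEF, 0xFFFFFFFF], c=0x12345678, k=12)` compares the two forms on a few edge cases. The identity holds because multiplication distributes over addition modulo 2^32, so the shift and both adds collapse into one multiply-add with a precomputed constant.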

Updated implementation guide with tool documentation and workflow.
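
The arithmetic selection for rounds 1 and 12 (`node = tree[1] + diff*(idx-1)`, with `idx` in {1, 2}) replaces a per-lane select with one multiply-add. Below is a minimal sketch of why that works, assuming 32-bit wraparound arithmetic and `diff = tree[2] - tree[1]`; the function names are illustrative only and do not come from the PR.

```python
MASK32 = 0xFFFFFFFF  # assumption: node values are 32-bit words
VLEN = 8             # vector length stated in the PR


def select_with_branch(tree1: int, tree2: int, idx_vec):
    """Reference behaviour: pick tree[1] or tree[2] per lane."""
    return [tree1 if idx == 1 else tree2 for idx in idx_vec]


def select_with_arithmetic(tree1: int, tree2: int, idx_vec):
    """Branch-free form: tree[1] + diff * (idx - 1), valid only for idx in {1, 2}."""
    diff = (tree2 - tree1) & MASK32  # wraps if tree2 < tree1, which is fine mod 2**32
    return [(tree1 + diff * (idx - 1)) & MASK32 for idx in idx_vec]


# One vector of VLEN lanes with a mix of both indices.
lanes = [1, 2, 2, 1, 1, 2, 1, 2]
result = select_with_arithmetic(0xCAFEBABE, 0x00000010, lanes)
```

When `idx` is 1 the product vanishes and the lane gets `tree[1]`; when `idx` is 2 the wrapped difference is added back and the lane gets `tree[2]`, even when `tree[2]` is the smaller value.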
- Add tmp4 register to extract bit 2 upfront before vselects clobber tmp1
- Eliminates redundant (idx-7) recomputation in level-3 selection
- Saves 64 VALU ops (7267 → 7203)
- New theoretical minimum: 1201 cycles (was 1212)
- Efficiency: 92.1% of theoretical minimum
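
The figures above are consistent with a lower bound of total VALU ops divided by the 6 VALU slots per cycle, rounded up. The sketch below reproduces the arithmetic under that assumption; the PR does not state how the author actually computed the bound.

```python
VALU_SLOTS = 6  # VALU slots per cycle, as stated in the PR


def valu_lower_bound(valu_ops: int, slots: int = VALU_SLOTS) -> int:
    """Fewest cycles that can issue `valu_ops` ops at `slots` ops per cycle."""
    return -(-valu_ops // slots)  # integer ceiling division


def efficiency_percent(lower_bound: int, measured_cycles: int) -> float:
    """Share of the measured cycle count explained by the VALU bound."""
    return round(100 * lower_bound / measured_cycles, 1)


before = valu_lower_bound(7267)  # bound before the tmp4 change
after = valu_lower_bound(7203)   # bound after saving 64 VALU ops
print(before, after, efficiency_percent(after, 1304))
```

Note that the efficiency figure drops from 92.9% to 92.1% even though the measured cycle count is unchanged: removing ops lowers the bound, so the same 1304 cycles sit slightly further from it.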

Added scripts/value_reuse_profiler.py to find recomputation patterns.
Updated implementation_guide.md with session 7 findings.
- Remove zero_vec constant (unused after self-XOR optimization)
- Only allocate hash_vec_consts3 for non-multiply_add stages
- Use idx^idx for wrap instead of idx+zero (already committed)

Maintains 1304 cycles (113.3x speedup from baseline)
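
As a quick sanity check on the reported speedups, the two (cycles, speedup) pairs quoted in this PR should imply the same baseline. The sketch below derives that baseline from the quoted numbers only; the baseline cycle count itself is not stated in the PR, so the result is an estimate limited by the rounding of the speedup figures.

```python
def implied_baseline(cycles: int, speedup: float) -> float:
    """Baseline cycle count implied by a measured count and its speedup."""
    return cycles * speedup


early = implied_baseline(4294, 34.4)   # from "4294 cycles (34.4x speedup)"
final = implied_baseline(1304, 113.3)  # from "1304 cycles (113.3x speedup)"

# Relative disagreement between the two estimates.
relative_gap = abs(early - final) / final
print(round(early), round(final), f"{relative_gap:.4%}")
```

Both estimates land near 147.7k cycles, so the two speedup figures agree with each other to within rounding.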