Key optimizations:
- SIMD vectorization with VLEN=8
- 3-batch parallel processing (6 VALU slots)
- Software pipelining: prefetch the next triple during the hash
- Broadcast optimization for rounds 0 and 11
- Eliminated vselect with arithmetic rewrites
- multiply_add fusion for the idx calculation
- Keep idx/val in scratch registers across all rounds
- Explored speculation (not viable: 48 loads needed, only 24 available)
- Explored index deduplication (not viable: indices are data-dependent)
- Updated the bottleneck analysis and remaining opportunities
- Current: 4294 cycles (34.4x speedup)
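Two of the bullets above — eliminating vselect with arithmetic rewrites, and multiply_add fusion for the idx calculation — can be sketched in plain Python. All names and lane values here are illustrative, not taken from the actual kernel:

```python
# Illustrative 8-lane vectors (VLEN=8); values are arbitrary.
cond = [1, 0, 1, 1, 0, 0, 1, 0]          # 0/1 per-lane predicate
a = [i * 10 for i in range(8)]
b = [i * 100 for i in range(8)]

# vselect-style pick: out[i] = a[i] if cond[i] else b[i]
picked = [ai if c else bi for ai, bi, c in zip(a, b, cond)]

# Arithmetic rewrite: b + cond*(a - b) -- one multiply_add, no select needed
rewritten = [bi + c * (ai - bi) for ai, bi, c in zip(a, b, cond)]
assert picked == rewritten

# multiply_add fusion for an index update: the two-op sequence
#   t = idx*2 ; idx_next = t + offs
# collapses into one fused multiply_add(idx, 2, offs).
idx = [3, 7, 1, 0, 5, 2, 6, 4]
idx_next = [i * 2 + (1 + c) for i, c in zip(idx, cond)]
```

The rewrite works because a 0/1 predicate times a difference reduces exactly to the selected value, so a select port is freed for other work.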
Batching optimizations:
- Batch hash constant loads (2 per cycle)
- Batch 5 vector broadcasts into one bundle
- Merge the XOR with the next-triple address computation (6 VALU ops)
- Batch broadcast-round remainder processing
- Batch triple-0 v_node[2] loads (2 per cycle)
- Batch normal-round remainder XOR
- Batch all remainder vector index operations together
- Add analysis scripts for bundle distribution and bottleneck analysis
- VALU utilization at 58.2%; theoretical minimum ~2,200 cycles
- Next target: software pipelining to overlap rounds
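A minimal sketch of the kind of bundle-distribution analysis these scripts perform. The schedule format, op names, and the 6-VALU-slot budget are assumptions for illustration, not the project's real data structures:

```python
from collections import Counter

# Assumed format: a schedule is a list of bundles, each bundle a list of
# (slot_type, op) pairs.
schedule = [
    [("valu", "vadd"), ("valu", "vmul"), ("load", "ld")],
    [("valu", "vxor")],
    [("load", "ld"), ("load", "ld")],
    [("valu", "vadd")] * 6,
]

VALU_SLOTS = 6  # assumed issue width per bundle

def valu_utilization(bundles):
    used = sum(1 for b in bundles for slot, _ in b if slot == "valu")
    return used / (VALU_SLOTS * len(bundles))

def bundle_histogram(bundles):
    # How many bundles issue 0, 1, 2, ... VALU ops; gaps reveal scheduling slack.
    return Counter(sum(1 for slot, _ in b if slot == "valu") for b in bundles)

print(f"VALU utilization: {valu_utilization(schedule):.1%}")
print(bundle_histogram(schedule))
```

Utilization against the slot budget gives the theoretical-minimum estimate: total VALU ops divided by the issue width is a lower bound on cycles.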
- Fuse hash stages 0,2,4 using multiply_add (a*(1+2^k)+c pattern)
- Add arithmetic selection for rounds 1,12 (idx in {1,2})
- Preload tree[1], tree[2] and compute node = tree[1] + diff*(idx-1)
- Still need 2.2x more improvement to reach <1700 target
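The two tricks in this commit can be checked directly. For the hash fusion: a stage of the form "add shifted self, then add a constant" is algebraically `a*(1+2^k)+c`, so two ops become one multiply_add. For the arithmetic selection: when idx is known to be 1 or 2, the node is a linear function of idx. The lane width (32 bits) and the constants below are assumptions for illustration:

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit lanes

def stage_two_ops(x, k, c):
    # original two-op hash stage: shift-add, then add-constant
    x = (x + (x << k)) & MASK32
    return (x + c) & MASK32

def stage_fused(x, k, c):
    # fused form: a*(1+2^k) + c in a single multiply_add
    return (x * (1 + (1 << k)) + c) & MASK32

for x in (0, 1, 0xDEADBEEF, 0x12345678):
    assert stage_two_ops(x, 5, 0x9E3779B9) == stage_fused(x, 5, 0x9E3779B9)

# Arithmetic selection for rounds where idx is in {1, 2}:
# node = tree[1] + (tree[2] - tree[1]) * (idx - 1) replaces a gather or vselect.
tree = [10, 20, 30]  # hypothetical preloaded values for tree[0..2]
for idx in (1, 2):
    assert tree[1] + (tree[2] - tree[1]) * (idx - 1) == tree[idx]
```

Both identities hold exactly modulo the lane width, so the fusion changes no results, only the op count.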
- Vectorize the select-round and normal-round loops over BATCH to fill VALU slots
- Update the prefetch and writeback paths to operate on batched offsets
- Cuts the cycle count to ~3,241 in local benchmarks
- Attach the remaining gather loads to index-update bundles
- Reduce load-only bundles and improve overlap
- Reach ~2,905 cycles on the test harness
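The overlap these commits chase — issuing the loads for the next iteration alongside the compute for the current one — is the classic software-pipelining loop shape. Python executes this sequentially, so the sketch only demonstrates the restructure and that it preserves results; `load` and `hash_step` are stand-ins, not the kernel's real operations:

```python
def load(mem, i):
    return mem[3 * i : 3 * i + 3]       # fetch one "triple"

def hash_step(triple):
    h = 0
    for v in triple:
        h = (h * 31 + v) & 0xFFFFFFFF   # placeholder hash, not the real one
    return h

def naive(mem, n):
    # load and hash strictly in sequence
    return [hash_step(load(mem, i)) for i in range(n)]

def pipelined(mem, n):
    out = []
    cur = load(mem, 0)                  # prologue: first load up front
    for i in range(n):
        # prefetch the next triple...
        nxt = load(mem, i + 1) if i + 1 < n else None
        # ...while the current one is being hashed
        out.append(hash_step(cur))
        cur = nxt
    return out

mem = list(range(30))
assert naive(mem, 10) == pipelined(mem, 10)
```

On the real machine the prefetch and the hash land in the same bundle, which is what turns load-only bundles into overlapped ones.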
- Replace manual scheduling with an automatic list scheduler
- vselect for tree levels 0-3 (preload nodes 0-14)
- Tiled processing (group_size=17, round_tile=13)
- Hash fusion for stages 0,2,4
- 92.9% of the theoretical minimum achieved

New analysis tools:
- profiler.py: per-cycle utilization breakdown, histograms, phase analysis
- visualize_schedule.py: ASCII timeline, gap detection
- bottleneck_detector.py: identifies whether a region is VALU- or memory-bound and makes recommendations

Updated the implementation guide with tool documentation and workflow.
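A greedy list scheduler of the kind this commit switches to can be sketched in a few lines. The op format `(name, kind, deps)` and the per-bundle slot budget are assumptions for illustration; the project's actual scheduler is certainly more elaborate (priorities, latencies):

```python
SLOTS = {"valu": 2, "load": 1}          # assumed per-bundle issue limits

def list_schedule(ops):
    done, cycles = set(), []
    pending = list(ops)
    while pending:
        budget = dict(SLOTS)
        bundle = []
        for op in list(pending):
            name, kind, deps = op
            # issue an op once its deps completed and a slot remains
            if deps <= done and budget.get(kind, 0) > 0:
                bundle.append(name)
                budget[kind] -= 1
                pending.remove(op)
        # results become visible only at the next cycle boundary
        done |= set(bundle)
        cycles.append(bundle)
    return cycles

ops = [
    ("ld_a", "load", set()),
    ("ld_b", "load", set()),
    ("add",  "valu", {"ld_a", "ld_b"}),
    ("xor",  "valu", {"add"}),
]
sched = list_schedule(ops)              # one bundle per cycle
```

Here the single load slot serializes the two loads, and the dependence chain serializes the two VALU ops, so the toy schedule takes four cycles.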
- Add a tmp4 register to extract bit 2 upfront, before the vselects clobber tmp1
- Eliminates the redundant (idx-7) recomputation in level-3 selection
- Saves 64 VALU ops (7267 → 7203)
- New theoretical minimum: 1201 cycles (was 1212)
- Efficiency: 92.1% of theoretical minimum

Added scripts/value_reuse_profiler.py to find recomputation patterns. Updated implementation_guide.md with session 7 findings.
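The level-3 selection picks one of the eight preloaded nodes 7..14 by the three bits of (idx-7) through a mux tree of vselects; extracting bit 2 into its own register before the first selects overwrite the temporary is what avoids recomputing idx-7. A per-lane scalar sketch, with register names mirroring the commit but the code itself purely illustrative:

```python
tree = list(range(100, 115))            # preloaded nodes 0..14, values arbitrary

def select_level3(idx):
    off = idx - 7                       # tmp1 = idx - 7
    bit2 = (off >> 2) & 1               # tmp4: extract bit 2 UPFRONT, before
                                        # the selects below clobber tmp1
    bit0 = off & 1
    bit1 = (off >> 1) & 1
    # cascade of selects over the 8 preloaded nodes (a 3-level mux tree)
    lo = [tree[7 + 2 * i + bit0] for i in range(4)]   # select by bit 0
    mid = [lo[2 * i + bit1] for i in range(2)]        # select by bit 1
    return mid[bit2]                                  # select by bit 2

for idx in range(7, 15):
    assert select_level3(idx) == tree[idx]
```

Without the upfront extraction, bit 2 would have to be rebuilt from a fresh idx-7 after the first select layer, which is the 64-op recomputation the commit removes.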
- Remove the zero_vec constant (unused after the self-XOR optimization)
- Only allocate hash_vec_consts3 for non-multiply_add stages
- Use idx^idx for the wrap instead of idx+zero (already committed)

Maintains 1304 cycles (113.3x speedup over baseline).
Best I could get was 1304 cycles.