Commits
23 commits
2447c09
Add WMMA GEMM kernels and wave32 kernel adaptations for RDNA4
vivienfanghuagood Mar 18, 2026
c19dab6
Add LDS-buffered FP8 GEMM kernel for large-M compute-bound shapes
vivienfanghuagood Mar 19, 2026
2e1b1ae
Fix wave_reduce: replace while loop with range_constexpr
vivienfanghuagood Mar 20, 2026
786041a
add navi gpu CI
vivienfanghuagood Mar 23, 2026
24b340e
Merge branch 'main' into fx1201-wmma-gemm
vivienfanghuagood Mar 24, 2026
ead121b
ci: rename navi runner to linux-flydsl-navi-2
vivienfanghuagood Mar 24, 2026
84e4316
ci: run only WMMA tests on navi runner, skip CDNA benchmarks
vivienfanghuagood Mar 24, 2026
4a86747
ci: architecture-aware test filtering for RDNA4/CDNA compatibility
vivienfanghuagood Mar 24, 2026
9a8d16d
ci: fix permission denied on checkout by cleaning root-owned files
vivienfanghuagood Mar 24, 2026
f4c29db
ci: revert sudo rm -rf workspace cleanup (runner now runs as root)
vivienfanghuagood Mar 24, 2026
76ac707
ci: add test_quant.py to CDNA-only skip list, update navi gpu_count to 4
vivienfanghuagood Mar 24, 2026
5bee759
dummy
vivienfanghuagood Mar 24, 2026
9c59fc4
ci: auto-select GPU with most free VRAM in test/benchmark scripts
vivienfanghuagood Mar 24, 2026
f5241af
dummy
vivienfanghuagood Mar 24, 2026
620e26c
fix: make_buffer_tensor RDNA4 compat + example whitelist for non-CDNA
vivienfanghuagood Mar 24, 2026
b32f760
refactor: rename wmma->rdna, unify arch compat config, delete LDS var…
vivienfanghuagood Mar 24, 2026
04beea0
fix: run benchmarks on all archs, fix fp8 benchmark bugs
vivienfanghuagood Mar 24, 2026
af7f7f7
add
vivienfanghuagood Mar 24, 2026
f8747ae
Merge branch 'main' into fx1201-wmma-gemm
vivienfanghuagood Mar 25, 2026
6d0243a
refactor: use T.*/fx.Index, fix gfx12 RDNA detection
vivienfanghuagood Mar 25, 2026
0e4b31c
fix: fail CI when RDNA WMMA benchmarks error
vivienfanghuagood Mar 25, 2026
ae67f90
refactor: rename kernels, merge tests into single file
vivienfanghuagood Mar 25, 2026
35d482f
Merge branch 'main' into fx1201-wmma-gemm
vivienfanghuagood Mar 25, 2026
4 changes: 4 additions & 0 deletions .github/runner-config.yml
@@ -17,3 +17,7 @@ runners:
   linux-flydsl-mi355-8:
     gpu_arch: MI355
     gpu_count: 8
+
+  linux-flydsl-navi-2:
+    gpu_arch: gfx1201
+    gpu_count: 4
5 changes: 3 additions & 2 deletions .github/workflows/flydsl.yaml
@@ -22,7 +22,7 @@ jobs:
   test:
     strategy:
       matrix:
-        runners: [ 'linux-flydsl-mi325-1', 'linux-flydsl-mi355-1' ]
+        runners: [ 'linux-flydsl-mi325-1', 'linux-flydsl-mi355-1', 'linux-flydsl-navi-2' ]
       fail-fast: false
     runs-on: ${{ matrix.runners }}
     steps:
@@ -74,7 +74,7 @@ jobs:
           path: mlir_install.tgz
           # Key includes LLVM commit and hashes of build scripts/workflow.
           # Repo is checked out under `flydsl-test/` (see actions/checkout path), so hash paths must include it.
-          key: mlir-install-${{ hashFiles('flydsl-test/thirdparty/llvm-hash.txt', 'flydsl-test/scripts/build_llvm.sh', 'flydsl-test/CMakeLists.txt', 'flydsl-test/.github/workflows/flydsl.yaml') }}
+          key: mlir-install-${{ matrix.runners }}-${{ hashFiles('flydsl-test/thirdparty/llvm-hash.txt', 'flydsl-test/scripts/build_llvm.sh', 'flydsl-test/CMakeLists.txt', 'flydsl-test/.github/workflows/flydsl.yaml') }}

       - name: Use cached MLIR install tarball (skip LLVM build)
         if: steps.mlir-cache.outputs.cache-hit == 'true'
@@ -127,6 +127,7 @@ jobs:
           retention-days: 7

       - name: Install aiter
+        if: ${{ !contains(matrix.runners, 'navi') }}
         run: |
           docker exec flydsl_test bash -c "git clone --depth 1 --recursive --shallow-submodules https://github.com/ROCm/aiter.git /tmp/aiter && cd /tmp/aiter && python3 setup.py develop"
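Commits 4a86747 and 76ac707 describe architecture-aware test filtering with a CDNA-only skip list, but that filtering code is not part of this diff. A minimal sketch of the idea; all names here (`is_rdna`, `CDNA_ONLY_TESTS`, `select_tests`) are hypothetical, not taken from the repository:

```python
# Hypothetical sketch of architecture-aware test selection; the real
# implementation is not shown in this diff and names are illustrative.
def is_rdna(gcn_arch: str) -> bool:
    # gfx10/gfx11/gfx12 targets are RDNA; gfx9xx targets (MI3xx) are CDNA.
    return gcn_arch.startswith(("gfx10", "gfx11", "gfx12"))

# test_quant.py appears in the CDNA-only skip list per commit 76ac707.
CDNA_ONLY_TESTS = {"test_quant.py"}

def select_tests(all_tests: list[str], gcn_arch: str) -> list[str]:
    # On RDNA runners (e.g. gfx1201), drop tests that need CDNA-only features.
    if is_rdna(gcn_arch):
        return [t for t in all_tests if t not in CDNA_ONLY_TESTS]
    return list(all_tests)
```

Under this sketch, `select_tests(["test_quant.py", "test_gemm.py"], "gfx1201")` keeps only `test_gemm.py`, while a CDNA arch such as `gfx942` keeps both.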
9 changes: 6 additions & 3 deletions kernels/layernorm_kernel.py
@@ -29,8 +29,11 @@

 EPS = 1e-5

+import math
+from kernels.kernels_common import get_warp_size
+
 BLOCK_THREADS = 256
-WARP_SIZE = 64
+WARP_SIZE = get_warp_size()
 VEC_WIDTH = 8
 USE_NONTEMPORAL = True
 VEC_ALIGN = 16
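The helper `get_warp_size` from `kernels.kernels_common` is imported above but its body is not shown in this diff. A plausible sketch, assuming it dispatches on the target gfx architecture (RDNA wave32 vs. CDNA wave64); the signature and default are assumptions:

```python
# Hypothetical sketch of a get_warp_size helper; the actual
# kernels.kernels_common implementation is not shown in this diff.
def get_warp_size(gcn_arch: str = "gfx942") -> int:
    # RDNA targets (gfx10/gfx11/gfx12, e.g. gfx1201) execute wave32;
    # CDNA targets (gfx9xx, e.g. MI325/MI355) execute wave64.
    if gcn_arch.startswith(("gfx10", "gfx11", "gfx12")):
        return 32
    return 64
```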
@@ -91,8 +94,8 @@ def layernorm_kernel(
     def wave_reduce_add(x):
         width_i32 = fx.Int32(WARP_SIZE)
         w = x
-        for sh in [32, 16, 8, 4, 2, 1]:
-            off = fx.Int32(sh)
+        for _sh_exp in range_constexpr(int(math.log2(WARP_SIZE))):
+            off = fx.Int32(WARP_SIZE // (2 << _sh_exp))
             peer = w.shuffle_xor(off, width_i32)
             w = w.addf(peer, fastmath=fm_fast)
         return w
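The rewritten loop derives the butterfly-shuffle offsets from `WARP_SIZE` instead of hard-coding the wave64 sequence. A standalone check that the expression `WARP_SIZE // (2 << k)` reproduces the old list and scales down to wave32 on RDNA4 (`shuffle_offsets` is a checking helper, not repository code):

```python
import math

def shuffle_offsets(warp_size: int) -> list[int]:
    # Mirrors the kernel's loop: log2(warp_size) steps, halving the XOR
    # offset each step, starting from warp_size // 2.
    return [warp_size // (2 << k) for k in range(int(math.log2(warp_size)))]

print(shuffle_offsets(64))  # [32, 16, 8, 4, 2, 1] -- the old hard-coded list
print(shuffle_offsets(32))  # [16, 8, 4, 2, 1] -- wave32 on RDNA
```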