Conversation
Remove the wave number and wave size template parameters from the entry points of the device xteam reduction functions. Replace the wave size parameter by a call to `__gpu_num_lanes()`, which is optimized out during compilation. Replace the wave number parameter by the constant `32`, which is on the safe side for its current usage situations (VLA size needs to be constant, max number of threads is 1024, min wave size is 32).
AI-Summary:
| Aspect | Baseline | Branch |
|---|---|---|
| Scan function names | `__kmpc_xteams_d_16x64`, `_8x64`, `_4x64`, `_16x32`, `_8x32`, `_32x32` per type | `__kmpc_xteams_d` per type |
| Phase-2 functions | `__kmpc_xteams_phase2_*` (6 variants per type) | Removed entirely |
| Scan types (OMPKinds.def) | i, d, f, l (plus some in header only: cd, cf, ui, ul) x 6 block configs | i, d, f, l |
| Parameters (scan) | (val, storage, r_ptr, team_vals, teams_done_ptr, rf, rf_lds, rnv, k, NumTeams) | (val, result, block_status, block_aggregates, block_prefixes, rf, rnv, k, is_inclusive) |
| rf_lds param in scan | Required (for LDS reduction in Blelloch scan) | Removed (decoupled look-back doesn't need it) |
| NumTeams param in scan | Explicit parameter | Derived from `mapping::getNumberOfBlocksInKernel()` |
| is_inclusive in scan | Only in phase-2, as a runtime parameter | In the single kernel, as Int1 (boolean) |
| Kernel arguments per red-var | 3 (team_vals, teams_done_ptr, scan_storage) | 1 (scan_storage, sub-arrays computed from it) |
| Total OMPKinds.def entries | 76 (xteamr + xteams + phase2) | 18 (xteamr + xteams) |
7. Test Changes
Codegen tests (6 files, ~12,500 lines changed): Updated CHECK patterns to match new function names (no _16x64/_32x32 suffixes), updated argument counts, and adapted to the merged single-kernel scan codegen.
Runtime tests (4 files, net -3,186 lines):
- `test_xteamr.h` and `test_xteams.h` were massively reduced by removing the per-warp-size/per-block-size test macro instantiations; the tests now invoke the architecture-agnostic function names. `test_xteamr.cpp` and `test_xteams.cpp` were simplified accordingly.
- Cross-architecture testing (testing wave32 on wave64 hardware and vice versa) was removed.
Offload tests (4 files, net -174 lines):
- No-loop scan tests updated to reflect single-pass behavior (no phase-2 kernel launch expected in output).
- Segmented scan tests updated for the new argument count (9 instead of 10).
- `xteam_scan_3.cpp` was heavily reduced, with notes about floating-point precision issues at segment boundaries due to the non-associativity of FP addition in the cross-team scan.
- `xteam_red_1.c`: minor update (team count changed from 460 to 480).
Summary of Key Design Decisions

- **Runtime vs compile-time hardware abstraction**: Warp size and block size are now queried at runtime (`__gpu_num_lanes()`, `mapping::getNumberOfThreadsInBlock()`) instead of being compile-time template parameters, eliminating the combinatorial explosion of function variants.
- **Single-pass scan**: Replacing the two-kernel Blelloch scan with a single-pass decoupled look-back removes a kernel launch, avoids the large double-buffer global memory allocation, and enables overlap of computation with cross-team synchronization. The trade-off is a spin-wait loop in thread 0 of each block during look-back.
- **Shared primitive library**: Consolidating wave/block primitives avoids code drift between reductions and scans and establishes a foundation for adding new cross-team collective operations.
- **Simplified codegen**: Fewer runtime function variants means less branching in the compiler and fewer entries to maintain in `OMPKinds.def`, reducing the surface area for bugs when adding new types or operations.
Most relevant TODOs that are still open:

Re AI-Summary:
Regarding scan: even with a single-pass algorithm it may make sense to have a second kernel that is only responsible for writing back the results. In an earlier stage of my testing, I compared the following two versions:

```cpp
template <typename T>
void scan_excl_dot_sim(const T *__restrict a, const T *__restrict b,
                       T *__restrict out, uint64_t n) {
  const uint64_t stride =
      (n + XTEAM_TOTAL_NUM_THREADS - 1) / XTEAM_TOTAL_NUM_THREADS;
#pragma omp target teams distribute parallel for num_teams(XTEAM_NUM_TEAMS) \
    num_threads(XTEAM_NUM_THREADS) \
    is_device_ptr(d_status, d_aggregates<T>, d_prefixes<T>, d_scan_out<T>)
  for (uint64_t k = 0; k < XTEAM_TOTAL_NUM_THREADS; k++) {
    const uint64_t start = k * stride;
    const uint64_t end = (start + stride < n) ? start + stride : n;
    T val0 = T(0);
    for (uint64_t idx = start; idx < end; idx++)
      val0 += a[idx] * b[idx];
    get_kmpc_xteams_func<T>()(val0, d_scan_out<T>, d_status, d_aggregates<T>,
                              d_prefixes<T>, get_rfun_sum_func<T>(), T(0), k);
    T running = d_scan_out<T>[k];
    for (uint64_t idx = start; idx < end; idx++) {
      out[idx] = running;
      running += a[idx] * b[idx];
    }
  }
}
```

```cpp
template <typename T>
void scan_excl_dot_sim_v1(const T *__restrict a, const T *__restrict b,
                          T *__restrict out, uint64_t n) {
  const uint64_t stride =
      (n + XTEAM_TOTAL_NUM_THREADS - 1) / XTEAM_TOTAL_NUM_THREADS;
#pragma omp target teams distribute parallel for num_teams(XTEAM_NUM_TEAMS) \
    num_threads(XTEAM_NUM_THREADS) \
    is_device_ptr(d_status, d_aggregates<T>, d_prefixes<T>, d_scan_out<T>)
  for (uint64_t k = 0; k < XTEAM_TOTAL_NUM_THREADS; k++) {
    const uint64_t start = k * stride;
    const uint64_t end = (start + stride < n) ? start + stride : n;
    T val0 = T(0);
    for (uint64_t idx = start; idx < end; idx++)
      val0 += a[idx] * b[idx];
    get_kmpc_xteams_func<T>()(val0, d_scan_out<T>, d_status, d_aggregates<T>,
                              d_prefixes<T>, get_rfun_sum_func<T>(), T(0), k);
  }
#pragma omp target teams distribute parallel for num_teams(XTEAM_NUM_TEAMS) \
    num_threads(XTEAM_NUM_THREADS) \
    is_device_ptr(d_scan_out<T>)
  for (uint64_t k = 0; k < XTEAM_TOTAL_NUM_THREADS; k++) {
    const uint64_t start = k * stride;
    const uint64_t end = (start + stride < n) ? start + stride : n;
    T running = d_scan_out<T>[k];
    for (uint64_t idx = start; idx < end; idx++) {
      out[idx] = running;
      running += a[idx] * b[idx];
    }
  }
}
```

`_v1` has a dedicated kernel responsible only for writing back the results.
|
[2026-03-09T11:39:39.244Z] Unexpected Fails: 4

Yeah, as I mentioned above, that's expected due to the API change.

We will need to remaster the 4 tests, unless

I prefer #2
Yeah, I didn't put up an updated PR for this because, as far as I discussed with @CatherineMoore, this PR here isn't landing on amd-staging anyway. I'll directly go upstream. I just opened this PR so that it can be discussed as needed before the next steps.
when these land in upstream, the merge that brings them in will fail npsdb, so we might as well move the 4 tests to smoke-dev now and thus avoid the revert problem |
```cpp
const uint32_t lane_num = omp_thread_num % _XTEAM_WARP_SIZE;
```

```cpp
// LDS for wave totals during block scan
static _RF_LDS T wave_totals[_XTEAM_MAX_NUM_WAVES];
```

note to myself: bring this up in the discussion
```cpp
static __XTEAM_SHARED_LDS T xwave_lds[_MaxNumWaves];
```

```cpp
// LDS array for wave results
static _RF_LDS T xwave_lds[_XTEAM_MAX_NUM_WAVES];
```

note to myself: bring this up in the discussion
```cpp
}
// Step 3: Reduce wave results in LDS
for (unsigned offset = num_waves / 2; offset > 0; offset >>= 1) {
```
-------- DO NOT MERGE, this is just for discussion before going upstream --------
This patch supersedes #1246 and includes:

- `XteamCommon.h` for wave/block reduction/scan

Performance:
(For more detailed performance data see the email I wrote.)
Tests: it's expected that smoke xteam tests fail for now since the simulation tests target the old device runtime API. An updated version of ROCm/aomp#1895 would fix that, although we need to see when it would make sense to fix/land this testing update.
Assisted-by: claude-4.6-opus-high