1 change: 1 addition & 0 deletions native/CMakeLists.txt
Expand Up @@ -114,3 +114,4 @@ bricksim_add_test(TestConstraintScheduler tests/TestConstraintScheduler.cpp)
bricksim_add_test(TestRandomRegularGraphSchedulingPolicy tests/TestRandomRegularGraphSchedulingPolicy.cpp)
bricksim_add_test(TestPack2dMaxrect tests/TestPack2dMaxrect.cpp)
bricksim_add_test(TestPhysicsGraphAssemblyClosure tests/TestPhysicsGraphAssemblyClosure.cpp)
bricksim_add_test(TestBreakage tests/TestBreakage.cpp)
150 changes: 150 additions & 0 deletions native/cuda/breakage_cuda/BENCHMARK.md
@@ -0,0 +1,150 @@
# breakage_cuda benchmark report

GPU pipeline (`cuda_qp::Pipeline::solve`) vs CPU baseline
(`ref_cpu::solve_full`, Eigen + OSQP `v1.0.0`) on a synthetic linear-chain
`HostSystem` of `N` parts.

- Hardware: NVIDIA RTX 4090, WSL2, CUDA Toolkit 12.0, gcc 11
- Build type: `Release`, double precision throughout
- Per-step wall clock, `cudaDeviceSynchronize()` before reading the GPU
timer
- For `N >= 32,768` the CPU OSQP solve hits the 2,000,000-iteration cap and
reports "did not converge"; we keep its wall time as a lower bound and
flag the row with an asterisk. The GPU ADMM converged on every row.

A live, interactive version of these charts is in
[`benchmark.html`](benchmark.html) (open with a browser; everything is
inlined, no network access required).

## Cross-over summary

| Threshold | First N where it holds |
|--------------------------------|-----------------------:|
| GPU faster than CPU (>= 1x) | between 4,096 and 8,192 |
| GPU >= 10x faster | between 8,192 and 16,384 |
| GPU >= 100x faster | between 32,768 and 65,536 |
| GPU >= 500x faster | by 65,536 |

So **the GPU "completely overrules" the CPU at roughly N ~ 40,000 parts**
(crossing 100x), and continues to widen the gap from there.

## Raw measurements

`./breakage_cuda_bench N` with `BENCH_QUICK=1`, `OSQP_MAX_ITER=2000000`,
and `BENCH_TOLERATE_NONCONVERGED=1` for the largest two rows.

| N | nx | GPU mean | CPU mean | speedup (CPU / GPU) |
|---------:|-------:|------------:|--------------:|--------------------:|
| 4 | 27 | 77.7 ms | 24.1 us | 0.00031x |
| 16 | 135 | 124.3 ms | 107.9 us | 0.00087x |
| 64 | 567 | 167.6 ms | 435.9 us | 0.00260x |
| 256 | 2,295 | 388.2 ms | 1.73 ms | 0.00446x |
| 512 | 4,599 | 339.2 ms | 3.43 ms | 0.01011x |
| 1,024 | 9,207 | 3.53 s | 16.63 ms | 0.00471x |
| 2,048 | 18,423 | 6.24 s | 62.00 ms | 0.00993x |
| 4,096 | 36,855 | 9.39 s | 480.64 ms | 0.05119x |
| 8,192 | 73,719 | 3.79 s | 5.00 s | 1.32x |
| 16,384 |147,447 | 3.45 s | 68.36 s | 19.79x |
| 32,768 |294,903 | 3.93 s | 254.70 s* | >= 64.81x* |
| 65,536 |589,815 | 4.46 s | 2,596.75 s* | >= 582.73x* |

`nx` is `9 * (N - 1)` (decision variables in the QP).
`*` = CPU OSQP hit `max_iter = 2,000,000` without converging; the wall time
is therefore a lower bound, and the speedup is also a lower bound.
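
The `nx` column and the asterisked lower-bound speedups can be sanity-checked directly from the raw numbers above (a quick shell check, illustrative only; not part of the bench harness):

```shell
# nx = 9 * (N - 1): decision variables for the largest row
echo $((9 * (65536 - 1)))                       # 589815

# lower-bound speedup for the N = 32,768 row: CPU wall time / GPU wall time
awk 'BEGIN { printf "%.2f\n", 254.70 / 3.93 }'  # 64.81
```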

## ASCII chart: log10 wall time (microseconds) vs log2(N)

```
log10(us)
10 | * (CPU >= here)
9 | *
8 | *
7 | * * *
6 | G G G G G G G G G G G G G (GPU plateau, then ramps)
5 | C C C C C C
4 | C
3 | C C C
2 | C
1 |
+--------------------------------------------------------
4 16 64 256 .. 1k 2k 4k 8k 16k 32k 64k N
```

Two regimes are visible:

- **Small-N plateau (`N <= 512`)**: GPU runtime is roughly flat at
  ~ 80-400 ms, dominated by `cusolverSp` factorization and kernel-launch
  overhead, while CPU runtime stays in the microsecond-to-millisecond
  range. CPU wins by orders of magnitude.
- **Large-N ramp (`N >= 4k`)**: GPU runtime first drops at `N = 8k`
  (once the per-iteration arithmetic starts to amortize the fixed
  startup) and then grows slowly, about 1.1x per doubling of `N`; CPU
  OSQP grows much faster than linearly and rapidly blows past the GPU.

## Cross-over analysis

| pair | GPU time growth | CPU time growth | speedup growth |
|------------------------|----------------:|----------------:|---------------:|
| 8,192 -> 16,384 | 0.91x | 13.7x | 15.0x |
| 16,384 -> 32,768 | 1.14x | >= 3.7x* | >= 3.27x* |
| 32,768 -> 65,536 | 1.13x | >= 10.2x* | >= 8.99x* |

(`*` = CPU side bounded below.)
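
The growth factors above are just ratios of adjacent rows in the raw-measurements table; for example (shell, illustrative only):

```shell
# GPU time growth 16,384 -> 32,768: 3.93 s / 3.45 s
awk 'BEGIN { printf "%.2f\n", 3.93 / 3.45 }'   # 1.14

# CPU time growth 8,192 -> 16,384: 68.36 s / 5.00 s
awk 'BEGIN { printf "%.1f\n", 68.36 / 5.00 }'  # 13.7
```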

So once the GPU is in its ramp regime, **doubling N multiplies the
GPU/CPU ratio by 3-15x in our favor**. Extrapolating:

- N ~ 40,000 -> ~ 100x faster
- N ~ 65,000 -> ~ 580x faster (measured lower bound)
- N ~ 130,000 -> well beyond 1,000x faster (estimated; CPU run skipped)

## Why so slow at small N?

The GPU baseline pays three large fixed costs that the CPU baseline does
not:

1. **Three Cholesky factorizations** on the KKT-like system per `solve`,
one per OSQP stage, via `cusolverSpDcsrlsvchol`. Each call carries a
host<->device sync, symbolic analysis, and numeric factorization.
2. **Per-iteration kernel launches** in the ADMM loop (~ 100 small
launches per stage). On the CPU, those reduce to function calls.
3. **No batching of dense per-part kernels** -- one `<<<>>>` per kernel
per part. This dominates when `N` is small.

These add up to ~ 80 ms - 4 s of overhead independent of `N`, so the GPU
only "wakes up" once the per-iteration arithmetic crosses that floor.
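
That fixed floor is visible in the smallest row: at `N = 4` the GPU's 77.7 ms is essentially pure overhead, which is exactly the table's 0.00031x speedup (shell check, illustrative only):

```shell
# N = 4: CPU 24.1 us vs GPU 77.7 ms = 77,700 us
awk 'BEGIN { printf "%.5f\n", 24.1 / 77700 }'  # 0.00031
```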

## Caveats

- The synthetic linear chain is favorable to OSQP (well-conditioned,
banded structure). A real `BreakageDebugDump` from a sim step might
shift the crossover slightly, but the asymptotic behavior should hold.
- The CPU "did not converge" rows (N = 32,768, 65,536) are lower bounds.
The actual CPU time to reach the same precision the GPU achieves is
larger -- probably substantially so.
- Bench mode used `BENCH_QUICK=1` (`5 warmup + 2 measured` for GPU,
`1 warmup + 1 measured` for CPU). For `N <= 4,096` you should rerun
with the default counts (`5 + 50 / 5 + 20`) for tighter error bars; the
large-N rows already dominate any per-iteration noise.

## Reproducing

```bash
cmake -S native/cuda/breakage_cuda -B ~/breakage_cuda_build -DCMAKE_BUILD_TYPE=Release
cmake --build ~/breakage_cuda_build -j

cd ~/breakage_cuda_build

# Single point (e.g., the cross-over):
BENCH_QUICK=1 LD_LIBRARY_PATH=/usr/lib/wsl/lib ./breakage_cuda_bench 8192

# Large-N point with CPU lower bound:
BENCH_QUICK=1 BENCH_GPU_ITER=2 BENCH_CPU_ITER=1 \
BENCH_TOLERATE_NONCONVERGED=1 OSQP_MAX_ITER=2000000 \
LD_LIBRARY_PATH=/usr/lib/wsl/lib \
./breakage_cuda_bench 65536
```

`BENCH_QUICK=1` reduces the warmup and measured iteration counts (to the
values listed under Caveats). `BENCH_GPU_ITER` / `BENCH_CPU_ITER`
override them outright.
`BENCH_TOLERATE_NONCONVERGED=1` keeps the wall time even if the solver
does not reach `eps_abs / eps_rel`; without it the bench aborts.
136 changes: 136 additions & 0 deletions native/cuda/breakage_cuda/CMakeLists.txt
@@ -0,0 +1,136 @@
cmake_minimum_required(VERSION 3.24)

project(breakage_cuda LANGUAGES CXX CUDA)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

if(NOT CMAKE_BUILD_TYPE)
set(CMAKE_BUILD_TYPE Release)
endif()

# We pin a sane default arch list; user can override with CMAKE_CUDA_ARCHITECTURES.
# CMake 3.18+ auto-initializes this to "52" when the variable was not user-set,
# so a `NOT DEFINED` guard is not enough to keep our default. We treat the CMake
# default of 52 as "no real choice was made" and replace it with our list. Users
# who want a specific arch can still pass -DCMAKE_CUDA_ARCHITECTURES=... to opt
# out of this override.
if(NOT CMAKE_CUDA_ARCHITECTURES OR CMAKE_CUDA_ARCHITECTURES STREQUAL "52")
set(CMAKE_CUDA_ARCHITECTURES "70;75;80;86;89;90")
endif()

include(FetchContent)
set(FETCHCONTENT_QUIET FALSE)

# ---------------------------------------------------------------------------
# Eigen (header-only)
# ---------------------------------------------------------------------------
FetchContent_Declare(eigen
GIT_REPOSITORY https://gitlab.com/libeigen/eigen.git
GIT_TAG 3.4.0
GIT_SHALLOW TRUE
)
set(BUILD_TESTING OFF CACHE BOOL "" FORCE)
set(EIGEN_BUILD_DOC OFF CACHE BOOL "" FORCE)
set(EIGEN_BUILD_TESTING OFF CACHE BOOL "" FORCE)
FetchContent_MakeAvailable(eigen)

# ---------------------------------------------------------------------------
# nlohmann_json
# ---------------------------------------------------------------------------
FetchContent_Declare(json
GIT_REPOSITORY https://github.com/nlohmann/json.git
GIT_TAG v3.11.3
GIT_SHALLOW TRUE
)
set(JSON_BuildTests OFF CACHE BOOL "" FORCE)
FetchContent_MakeAvailable(json)

# ---------------------------------------------------------------------------
# OSQP (CPU baseline). Pinned to v1.0.0 to match the API used by breakage.cppm.
# ---------------------------------------------------------------------------
FetchContent_Declare(osqp
GIT_REPOSITORY https://github.com/osqp/osqp.git
GIT_TAG v1.0.0
GIT_SHALLOW TRUE
)
set(OSQP_BUILD_DEMO_EXE OFF CACHE BOOL "" FORCE)
set(OSQP_BUILD_UNITTESTS OFF CACHE BOOL "" FORCE)
set(OSQP_BUILD_SHARED_LIB OFF CACHE BOOL "" FORCE)
set(OSQP_USE_LONG OFF CACHE BOOL "" FORCE)
set(OSQP_USE_FLOAT OFF CACHE BOOL "" FORCE)
set(OSQP_ALGEBRA_BACKEND "builtin" CACHE STRING "" FORCE)
FetchContent_MakeAvailable(osqp)

# ---------------------------------------------------------------------------
# CUDA toolkit components
# ---------------------------------------------------------------------------
find_package(CUDAToolkit REQUIRED)

# ---------------------------------------------------------------------------
# Core library
# ---------------------------------------------------------------------------
add_library(breakage_cuda_core STATIC
src/system.cpp
src/ref_cpu.cpp
src/kernels.cu
src/linalg.cu
src/qp_solver.cu
)
target_include_directories(breakage_cuda_core PUBLIC
${CMAKE_CURRENT_SOURCE_DIR}/include
)
target_link_libraries(breakage_cuda_core PUBLIC
Eigen3::Eigen
nlohmann_json::nlohmann_json
osqpstatic
CUDA::cudart
CUDA::cublas
CUDA::cusparse
CUDA::cusolver
)
set_target_properties(breakage_cuda_core PROPERTIES
CUDA_SEPARABLE_COMPILATION ON
CUDA_RESOLVE_DEVICE_SYMBOLS ON
CUDA_ARCHITECTURES "${CMAKE_CUDA_ARCHITECTURES}"
)

# Full warnings for host code; the nvcc --expt flags are needed to compile
# Eigen in device code.
target_compile_options(breakage_cuda_core PRIVATE
$<$<COMPILE_LANGUAGE:CXX>:-Wall -Wextra -Wpedantic>
$<$<COMPILE_LANGUAGE:CUDA>:--expt-relaxed-constexpr --expt-extended-lambda -Xcompiler=-Wall>
)

# ---------------------------------------------------------------------------
# Tests
# ---------------------------------------------------------------------------
enable_testing()

# NOTE: On WSL2 the real CUDA driver libcuda.so.1 lives at /usr/lib/wsl/lib but
# is not on the default loader search path. Since libcudart resolves the driver
# via dlopen at runtime, RPATH/RUNPATH on the binary is NOT honored for that
# dlopen call. The portable workaround is to export
# LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
# before running the test/bench binaries. See README.md.
function(breakage_cuda_add_test TEST_NAME)
add_executable(${TEST_NAME} ${ARGN})
target_link_libraries(${TEST_NAME} PRIVATE breakage_cuda_core)
add_test(NAME ${TEST_NAME} COMMAND ${TEST_NAME})
if(UNIX AND NOT APPLE)
set_tests_properties(${TEST_NAME} PROPERTIES
ENVIRONMENT "LD_LIBRARY_PATH=/usr/lib/wsl/lib:$ENV{LD_LIBRARY_PATH}")
endif()
endfunction()

breakage_cuda_add_test(test_dense_kernels tests/test_dense_kernels.cpp)
breakage_cuda_add_test(test_qp_random tests/test_qp_random.cpp)
breakage_cuda_add_test(test_full_pipeline tests/test_full_pipeline.cpp)

# Bench is just an executable, not a test.
add_executable(breakage_cuda_bench tests/bench.cpp)
target_link_libraries(breakage_cuda_bench PRIVATE breakage_cuda_core)