Unified vLLM serving configuration for an NVIDIA DGX Spark dual-node cluster (GB10 x 2).
Supports multiple models (Qwen3.5, Gemma 4) with different quantizations via `.env` presets: one repo, one Dockerfile, one compose file.
| Node | Role | GPU | Memory | Interconnect |
|---|---|---|---|---|
| spark01 | Ray Head + vLLM API | NVIDIA GB10 (Blackwell) | 119 GiB unified | 200Gbps RoCE |
| spark02 | Ray Worker | NVIDIA GB10 (Blackwell) | 119 GiB unified | 200Gbps RoCE |
Major update: vLLM main with upstream TurboQuant KV-cache compression (PR #38479) and FlashInfer v0.6.8 with SM121/GB10 optimizations (NVFP4 group GEMM, tile filtering, FP4 CUTLASS). Three upstream patches removed (cuMemcpyBatch, RoPE fix, PR #38423; all merged upstream). TurboQuant enables 2-4x KV-cache capacity via `--kv-cache-dtype turboquant_k8v4`.
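The 2-4x figure is consistent with simple bit-width arithmetic, assuming (my reading of the name, not stated upstream) that `turboquant_k8v4` stores keys at 8 bits and values at 4 bits against a 16-bit K/V baseline:

```bash
# Hypothetical back-of-envelope check: k8v4 is assumed to mean 8-bit keys
# + 4-bit values; baseline is fp16/bf16 = 16-bit keys and values.
awk 'BEGIN { printf "%.2fx\n", (16 + 16) / (8 + 4) }'
# → 2.67x
```

Quantization scales and metadata shave a little off this in practice; higher ratios would come from more aggressive modes.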
| Component | Version |
|---|---|
| Base Image | NGC PyTorch 26.03 |
| vLLM | 0.20.0.dev (main 978a4462, source build, TurboQuant included) |
| FlashInfer | v0.6.8 (SM121 tile filtering, NVFP4 group GEMM, source build) |
| PyTorch | 2.11.0a0 |
| CUDA | 13.2 (native) |
| NCCL | 2.29.7 |
| Python | 3.12 |
| Transformers | 5.5.4 |
| _C_stable_libtorch | Included (NVFP4/FP8/CUTLASS full ops) |
vLLM 0.19.1 with Gemma 4 support and async scheduling. Transformers 5.5.0. TTFT improved ~2x over v018. Superseded by v020-ngc2603, which adds TurboQuant and FlashInfer v0.6.8.
| Component | Version |
|---|---|
| Base Image | NGC PyTorch 26.03 |
| vLLM | 0.19.1 (main a7d79fa, source build) |
| FlashInfer | v0.6.7.post3 (CUTLASS 4.4.2, SM121 source build) |
| PyTorch | 2.11.0a0 |
| CUDA | 13.2 (native) |
| Transformers | 5.5.0 |
| Preset | Model | Quantization | TP | Image |
|---|---|---|---|---|
| `gemma4-26b-a4b.env` | google/gemma-4-26B-A4B-it | BF16 MoE (26B/4B active) | 1 | v020-ngc2603 |
| `qwen3.5-122b-fp8.env` | Qwen/Qwen3.5-122B-A10B-FP8 | FP8 (multimodal) | 2 | v020-ngc2603 |
| `redhatai-122b-nvfp4.env` | RedHatAI/Qwen3.5-122B-A10B-NVFP4 | NVFP4 (pre-quantized) | 1 | v020-ngc2603 |
| `intel-122b-int4.env` | Intel/Qwen3.5-122B-A10B-int4-AutoRound | INT4 AutoRound (Marlin) | 1 | v020-ngc2603 |
| `wangzhang-122b-fp8.env` | wangzhang/Qwen3.5-122B-A10B-abliterated | FP8 (text-only, abliterated) | 2 | v020-ngc2603 |
| `wangzhang-122b-nvfp4.env` | wangzhang/Qwen3.5-122B-A10B-abliterated-NVFP4 | NVFP4 (text-only, abliterated) | 1 | v020-ngc2603 |
| `qwen3.5-397b-int4.env` | Intel/Qwen3.5-397B-A17B-int4-AutoRound | INT4 AutoRound (Marlin) | 2 | v020-ngc2603 |
| `qwen3.5-122b-nvfp4.env` | Qwen3.5-122B-A10B | NVFP4 (runtime) | 1 | v020-ngc2603 |
| `qwen3.5-122b-nvfp4-tp2.env` | Qwen3.5-122B-A10B | NVFP4 (runtime) | 2 | v020-ngc2603 |
| `qwen3.6-35b-fp16.env` ⚗️ | Qwen/Qwen3.6-35B-A3B | FP16 original (KV fp8) | 1 | v020-ngc2603 |
```bash
# NGC 26.03 + vLLM 0.20.0.dev (TurboQuant + Gemma 4 + Qwen3.5)
docker pull ghcr.io/bjk110/vllm-spark:v020-ngc2603
```

```bash
# NGC 26.03 source build (vLLM main, TurboQuant included)
docker buildx build -f Dockerfile.gemma4 \
  -t vllm-spark:v020-ngc2603 --load .
```

Build arguments:
| Argument | Default | Description |
|---|---|---|
| `BUILD_JOBS` | 16 | Parallel build jobs |
| `FLASHINFER_REF` | v0.6.8 | FlashInfer git ref |
| `VLLM_COMMIT` | 978a4462 | vLLM source commit |
| `TORCH_CUDA_ARCH` | 12.1a | Target CUDA arch (Blackwell) |
```bash
cp models/qwen3.5-397b-int4.env .env
```

Edit `MODEL_PATH` in `.env` to point to your local model weights directory:

```bash
# Replace [model_path] with your actual path
sed -i 's|\[model_path\]|/home/user/models|' .env
```

```bash
# spark01 (head):
docker compose --profile head up -d

# spark02 (worker):
docker compose --profile worker up -d
```

The head node automatically waits for the worker to join the Ray cluster before launching vLLM.

```bash
cp models/qwen3.5-122b-nvfp4.env .env
docker compose --profile head up -d
```

When `TP_SIZE=1`, the entrypoint skips Ray entirely and runs `vllm serve` directly.
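The `[model_path]` substitution can be sanity-checked on a throwaway copy before touching the real `.env` (the file and paths below are illustrative):

```bash
# Create a throwaway .env with the placeholder, as shipped in the presets
printf 'MODEL_PATH=[model_path]/Qwen/Qwen3.5-397B\n' > /tmp/env-check

# Same substitution the quick start uses ('|' delimiter avoids escaping '/')
sed -i 's|\[model_path\]|/home/user/models|' /tmp/env-check

cat /tmp/env-check
# → MODEL_PATH=/home/user/models/Qwen/Qwen3.5-397B
```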
```bash
curl http://spark01:8000/health
```

```
spark01 (head)                   spark02 (worker)
┌─────────────────────┐          ┌─────────────────────┐
│ Ray Head (6379)     │          │ Ray Worker          │
│ vLLM API (:8000)    │◄────────►│                     │
│ GB10 GPU            │ 200Gbps  │ GB10 GPU            │
│ TP rank 0           │   RoCE   │ TP rank 1           │
└─────────────────────┘          └─────────────────────┘
```
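Large models can take minutes to load, so a single `curl` often races the startup. A minimal readiness loop (a sketch, not one of the repo's scripts; the function name and defaults are mine) polls `/health` until vLLM answers:

```bash
# wait_for_health URL [TRIES] [PAUSE] -> 0 once /health answers, 1 on give-up
wait_for_health() {
  url="$1"; tries="${2:-60}"; pause="${3:-5}"
  i=1
  while [ "$i" -le "$tries" ]; do
    # -s silent, -f fail on HTTP errors, bounded per-attempt timeout
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    sleep "$pause"
    i=$((i + 1))
  done
  return 1
}

# e.g. wait_for_health http://spark01:8000/health 60 5 && echo "vLLM ready"
```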
`entrypoint.sh` routes automatically based on `ROLE` and `TP_SIZE`:
| ROLE | TP_SIZE | Behavior |
|---|---|---|
| `head` | 1 | Direct `vllm serve` (no Ray) |
| `head` | 2+ | Ray head → wait for workers → `vllm serve --distributed-executor-backend ray` |
| `worker` | any | `ray start --block` (joins head) |
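The table above can be sketched as a shell `case`; this is a simplified stand-in for the real `entrypoint.sh`, shown only to make the decision logic explicit (command strings are abbreviated, not the full flags):

```bash
# Simplified routing sketch: echoes the chosen command instead of exec'ing it.
route() {
  ROLE="$1" TP_SIZE="$2"
  case "$ROLE:$TP_SIZE" in
    head:1)   echo "vllm serve ..." ;;                 # TP1: no Ray needed
    head:*)   echo "ray start --head && vllm serve --distributed-executor-backend ray ..." ;;
    worker:*) echo "ray start --block ..." ;;          # joins the head node
    *)        echo "unknown ROLE=$ROLE" >&2; return 1 ;;
  esac
}

route head 1     # → vllm serve ...
route worker 2   # → ray start --block ...
```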
```
vllm-spark/
├── docker-compose.yml            # Unified compose (head + worker profiles)
├── entrypoint.sh                 # Smart entrypoint (TP1/TP2 auto-routing)
├── .env.example                  # Full configuration template
├── Dockerfile.gemma4             # v020-ngc2603 (NGC 26.03, latest)
├── Dockerfile.ngc2603-v3         # v018-ngc2603 (NGC 26.03, archived)
├── models/                       # Validated model presets
│   ├── gemma4-26b-a4b.env        # Gemma 4 26B MoE (TP1)
│   ├── redhatai-122b-nvfp4.env   # RedHatAI NVFP4 (TP1)
│   ├── intel-122b-int4.env       # Intel INT4 AutoRound (TP1)
│   ├── wangzhang-122b-fp8.env    # abliterated FP8 (TP2)
│   ├── wangzhang-122b-nvfp4.env  # abliterated NVFP4 (TP1)
│   ├── qwen3.5-397b-int4.env     # 397B INT4 (TP2)
│   ├── qwen3.5-122b-fp8.env
│   ├── qwen3.5-122b-nvfp4.env
│   └── qwen3.5-122b-nvfp4-tp2.env
├── benchmarks/                   # llama-benchy benchmark results
│   ├── results_intel-int4-tp1.json
│   ├── results_wangzhang-fp8-tp2.json
│   └── results_wangzhang-nvfp4-tp1.json
├── patches/                      # SM121 / PyTorch 2.11 compatibility
│   ├── fix_pytorch211_compat.py  # hoist=True removal (PyTorch 2.11)
│   └── ...
└── scripts/
    ├── run-cluster-node.sh       # Manual Ray cluster bootstrap
    ├── verify_imports.py         # Build/runtime verification
    └── verify_runtime.sh         # Full GPU verification
```
All configuration is via `.env`. See `.env.example` for full documentation.
| Variable | Description | Example |
|---|---|---|
| `VLLM_IMAGE` | Docker image (local or GHCR) | `ghcr.io/bjk110/vllm-spark:v020-ngc2603` |
| `MODEL_PATH` | Host path to model weights | `/home/user/Models/Qwen/...` |
| `MODEL_CONTAINER_PATH` | Container mount point | `/models/Qwen3.5-397B-...` |
| `SERVED_MODEL_NAME` | API model name | `Qwen/Qwen3.5-397B-...` |
| `TP_SIZE` | Tensor parallel size (1 = standalone, 2+ = Ray) | `2` |
| `VLLM_EXTRA_ARGS` | Model-specific `vllm serve` flags | `--kv-cache-dtype fp8 --reasoning-parser qwen3` |
| `VLLM_MARLIN_USE_ATOMIC_ADD` | Enable for INT4 AutoRound | `1` (or empty to disable) |
The Dockerfile applies SM121 (Blackwell) compatibility patches:
| Patch | Purpose | Status |
|---|---|---|
| fix_pytorch211_compat | `hoist=True` removal for PyTorch 2.11 | Active |
| fastsafetensors_natural_sort | Multi-node weight loading order fix | Active |
| aot_cache_fix | `torch.fx.Node` pickling fix for AOT cache | Active |
| nogds_force | Force `nogds=True` (GB10 has no GDS support) | Active |
| apply_sm121_patches | is_blackwell_class, NVFP4 split, TRITON_PTXAS | Active |
| moe_config_e256/e512 | GB10-tuned MoE kernel configs | Active |
| fix_cuda13_memcpy_batch | cuMemcpyBatchAsync API fix | Removed (upstream) |
| qwen3_5_moe_rope_fix | RoPE validation fix | Removed (upstream) |
| pr38423_nvfp4_spark | NVFP4 DGX Spark fixes | Removed (upstream) |
All benchmarks measured with llama-benchy v0.3.4.
| Concurrency | 26B MoE, 4B active (t/s) | 31B Dense (t/s) |
|---|---|---|
| 1 | 25.0 (peak 26) | 4.0 (peak 5) |
| 2 | 45.9 (peak 49) | 7.9 (peak 8) |
| 4 | 67.2 (peak 77) | 14.1 (peak 17) |
| Metric | 26B MoE | 31B Dense |
|---|---|---|
| TTFT c=1 | 417 ms | 653 ms |
| KV cache | 224K tokens (51.3 GiB) | 77K tokens (35.2 GiB, FP8) |
| Concurrency | FP8 TP2, abliterated (t/s) | INT4 TP1, Intel (t/s) | NVFP4 TP1, abliterated (t/s) |
|---|---|---|---|
| 1 | 31.5 (peak 32.5) | 29.7 (peak 30) | 17.0 (peak 18) |
| 2 | 42.4 (peak 54) | 57.6 (peak 59) | 33.3 (peak 35) |
| 4 | 59.7 (peak 91) | 52.1 (peak 97) | 55.2 (peak 65) |
| Metric | FP8 TP2 | INT4 TP1 | NVFP4 TP1 |
|---|---|---|---|
| TTFT c=1 | 1,989 ms | 1,098 ms | 984 ms |
| KV cache | 839K tokens (38.5 GiB/node) | 789K tokens (36.2 GiB) | 155K tokens (14.3 GiB) |
| Test | Throughput (t/s) | TTFT (ms) |
|---|---|---|
| pp512 | 967 ± 33 | 543 ± 25 |
| pp1024 | 1,349 ± 2 | 776 ± 2 |
| pp2048 | 1,704 ± 9 | 1,224 ± 7 |
| tg128 | 27.0 ± 0.1 | — |
| Concurrency | tg128 total (t/s) | tg128 peak (t/s) |
|---|---|---|
| 1 | 27.0 | 28 |
| 2 | 45.3 | 52 |
| 4 | 60~67 | 85~88 |
| 8 | 59~91 | 152~160 |
Experimental test preset (see Experimental: Qwen3.6-35B-A3B FP16 test preset).
Original bf16/fp16 weights, fp8 KV cache, 32K context, spark01 single-node.
| Concurrency | pp2048 total t/s | tg32 total t/s | tg32 per-req t/s | peak tg t/s |
|---|---|---|---|---|
| 1 | 3,032 ± 825 | 32.4 ± 0.1 | 32.4 | 33 |
| 2 | 4,724 ± 75 | 63.9 ± 2.2 | 32.0 | 66 |
| 3 | 4,783 ± 439 | 61.1 ± 10.8 | 21.5 | 72 |
| 4 | 5,206 ± 444 | 80.1 ± 19.2 | 22.4 | 101 |
TTFT c=1: ~746 ms (pp2048).
Recommended OS-level settings for DGX Spark:
```bash
# Reduce swap pressure (unified memory)
sudo sysctl -w vm.swappiness=10
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
```

This is an experimental test preset added for quick evaluation of the original upstream Qwen3.6 weights on a single DGX Spark. It is not a base-stack change; the main image, vLLM, FlashInfer, transformers, and CUDA versions are unchanged.
- Preset file: `models/qwen3.6-35b-fp16.env`
- Scope: single DGX Spark / `TP=1` (designed to fit one GB10 node with headroom)
- Model: original Qwen3.6-35B-A3B weights (bf16/fp16, not quantized). `--kv-cache-dtype fp8` is an optional KV-cache-only optimization and does not change the model weights.
- Recommended options (already in the preset): `--kv-cache-dtype fp8` (KV cache compression only), `--reasoning-parser qwen3`, `--enable-chunked-prefill`, `--enable-prefix-caching` (added by the entrypoint by default)
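Combined, the preset amounts to a single-node invocation along these lines (a sketch only; the real command line is assembled by `entrypoint.sh`, and the model path is a placeholder):

```bash
vllm serve /models/Qwen3.6-35B-A3B \
  --kv-cache-dtype fp8 \
  --reasoning-parser qwen3 \
  --enable-chunked-prefill \
  --enable-prefix-caching
```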
```bash
ssh spark01 'cd ~/docker/vllm-spark && docker compose --profile head down'
ssh spark02 'cd ~/docker/vllm-spark && docker compose --profile worker down'

# Clear unified-memory residue between model switches (GB10)
ssh spark01 'sync && sudo sysctl -w vm.drop_caches=3'
```

The model is assumed to exist at `/mnt/data/llm-models/Qwen/Qwen_Qwen3.6-35B-A3B` on the homeserver. Transfer it to the chosen Spark node (recommended: spark01, the same node as the 397B head) before launch, then point `MODEL_PATH` at the local copy:
```bash
# From homeserver (~67 GB, ~6 min over the RoCE link)
rsync -av /mnt/data/llm-models/Qwen/Qwen_Qwen3.6-35B-A3B/ \
  spark01:/home/bjk110/Documents/Models/Qwen/Qwen_Qwen3.6-35B-A3B/
```

```bash
# On spark01: materialize the preset and substitute the local model root
ssh spark01 'cd ~/docker/vllm-spark && \
  cp models/qwen3.6-35b-fp16.env .env && \
  sed -i "s|\[model_path\]|/home/bjk110/Documents/Models/Qwen|" .env'
```

```bash
ssh spark01 'cd ~/docker/vllm-spark && \
  docker compose --env-file .env --profile head up -d'
```

Adjust these values in `qwen3.6-35b-fp16.env` in this order (each step lowers memory pressure):

1. `GPU_MEMORY_UTILIZATION=0.80`
2. `MAX_MODEL_LEN=16384`
3. `MAX_NUM_SEQS=4`
4. Only if the above still fails: consider a TP=2 variant across `spark01` + `spark02` (no preset ships for this; this experimental preset is TP=1 only).
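The ordered fallbacks can be applied mechanically with `sed` on a copy of the preset. The file below is a stand-in created on the spot with assumed default values (the 32K context from the preset notes; the rest illustrative), not the shipped preset:

```bash
# Stand-in preset with defaults the fallbacks override (values illustrative)
cat > /tmp/qwen36-tune.env <<'EOF'
GPU_MEMORY_UTILIZATION=0.90
MAX_MODEL_LEN=32768
MAX_NUM_SEQS=8
EOF

# Step 1: lower GPU memory utilization first
sed -i 's/^GPU_MEMORY_UTILIZATION=.*/GPU_MEMORY_UTILIZATION=0.80/' /tmp/qwen36-tune.env
# Step 2: halve the context window if it still does not fit
sed -i 's/^MAX_MODEL_LEN=.*/MAX_MODEL_LEN=16384/' /tmp/qwen36-tune.env
# Step 3: cap concurrent sequences last
sed -i 's/^MAX_NUM_SEQS=.*/MAX_NUM_SEQS=4/' /tmp/qwen36-tune.env

cat /tmp/qwen36-tune.env
```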
This repository is currently maintained with two primary branches:
- `main`: the current base branch. Contains the refreshed base stack, including the updated vLLM / FlashInfer / Transformers / container baseline.
- `feat/turboquant-rebase-20260417`: the active TurboQuant branch. Used for TurboQuant-specific integration, validation, and follow-up experiments on top of the current base branch.
Older experimental branches have been cleaned up after their contents were either merged into main or superseded by the current TurboQuant rebase work.
The legacy TurboQuant branch is preserved as the tag `archive/feat-turboquant`. If needed, it can be restored with:

```bash
git checkout -b feat/turboquant archive/feat-turboquant
```

Configuration files are provided as-is for reference. Models are subject to their respective licenses (Qwen License).