diff --git a/inference/llm-d/README.md b/inference/llm-d/README.md
new file mode 100644
index 00000000..2c40b9b4
--- /dev/null
+++ b/inference/llm-d/README.md
@@ -0,0 +1,216 @@
+# Disaggregated Inference with llm-d on GKE (without Helm)
+
+This document outlines the steps to deploy an llm-d inference server on GKE without using Helm.
+
+## 1. Environment Setup (One-Time)
+
+1.1. If using A3 Ultra or A4, create an RDMA cluster by following [this guide](https://docs.cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute-custom#create-with-rdma); if using A4X, follow [this guide](https://docs.cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute-custom-a4x) instead.
+
+1.2. Clone the repository:
+
+```bash
+git clone https://github.com/ai-hypercomputer/gpu-recipes.git
+cd gpu-recipes/inference/llm-d
+```
+
+1.3. Configure environment variables:
+
+```bash
+export PROJECT_ID=        # your Google Cloud project ID
+export CLUSTER_REGION=    # region of your GKE cluster
+export CLUSTER_NAME=      # name of your GKE cluster
+export NAMESPACE=         # Kubernetes namespace to deploy into
+export HF_TOKEN=          # your Hugging Face access token
+```
+
+1.4. Connect to your GKE cluster:
+
+```bash
+gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
+
+kubectl create namespace ${NAMESPACE}
+
+kubectl config set-context --current --namespace=$NAMESPACE
+```
+
+1.5. Create secrets:
+
+```bash
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN=${HF_TOKEN} \
+  -n ${NAMESPACE}
+```
+
+## 2. Set up the GKE Gateway
+
+2.1. [Enable the Gateway API in your cluster](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/deploying-gateways#enable-gateway):
+
+```bash
+gcloud container clusters update $CLUSTER_NAME \
+  --location=$CLUSTER_REGION \
+  --gateway-api=standard
+```
+
+2.2. [Verify your cluster](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/deploying-gateways#verify-internal):
+
+```bash
+gcloud container clusters describe $CLUSTER_NAME \
+  --location=$CLUSTER_REGION \
+  --format json
+```
+
+The output is similar to the following:
+
+```json
+"networkConfig": {
+  ...
+  "gatewayApiConfig": {
+    "channel": "CHANNEL_STANDARD"
+  },
+  ...
+},
+```
+
+Confirm the `GatewayClasses` are installed in your cluster:
+
+```bash
+kubectl get gatewayclass
+```
+
+The output is similar to the following:
+
+```
+NAME                               CONTROLLER                  ACCEPTED   AGE
+gke-l7-global-external-managed     networking.gke.io/gateway   True       16h
+gke-l7-regional-external-managed   networking.gke.io/gateway   True       16h
+gke-l7-gxlb                        networking.gke.io/gateway   True       16h
+gke-l7-rilb                        networking.gke.io/gateway   True       16h
+```
+
+2.3. [Configure a proxy-only subnet](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/deploying-gateways#configure_a_proxy-only_subnet):
+
+```bash
+export SUBNET_NAME=       # e.g. gateway-proxy-only-subnet
+export VPC_NETWORK_NAME=  # e.g. default
+export CIDR_RANGE=        # e.g. 10.1.1.0/24
+
+gcloud compute networks subnets create $SUBNET_NAME \
+  --purpose=REGIONAL_MANAGED_PROXY \
+  --role=ACTIVE \
+  --region=$CLUSTER_REGION \
+  --network=$VPC_NETWORK_NAME \
+  --range=$CIDR_RANGE
+```
+
+2.4. Verify your proxy-only subnet:
+
+```bash
+gcloud compute networks subnets describe $SUBNET_NAME \
+  --region=$CLUSTER_REGION
+```
+
+The output is similar to the following:
+
+```
+...
+gatewayAddress: 10.1.1.1
+ipCidrRange: 10.1.1.0/24
+kind: compute#subnetwork
+name: proxy-subnet
+network: https://www.googleapis.com/compute/v1/projects/PROJECT_NAME/global/networks/default
+privateIpGoogleAccess: false
+privateIpv6GoogleAccess: DISABLE_GOOGLE_ACCESS
+purpose: REGIONAL_MANAGED_PROXY
+region: https://www.googleapis.com/compute/v1/projects/PROJECT_NAME/regions/REGION
+role: ACTIVE
+selfLink: https://www.googleapis.com/compute/v1/projects/PROJECT_NAME/regions/REGION/subnetworks/proxy-subnet
+state: READY
+```
+
+2.5. [Install the required Custom Resource Definitions (CRDs) in your GKE cluster](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/deploy-gke-inference-gateway#prepare-environment):
+
+* For GKE versions `1.34.0-gke.1626000` or later, install only the alpha `InferenceObjective` CRD:
+
+  ```bash
+  kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml
+  ```
+
+* For GKE versions earlier than `1.34.0-gke.1626000`, install both the v1 `InferencePool` and alpha `InferenceObjective` CRDs:
+
+  ```bash
+  kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.0/manifests.yaml
+  ```
+
+2.6. [Deploy the GKE gateway](https://github.com/llm-d/llm-d/blob/main/guides/recipes/gateway/README.md)
+
+Apply the gateway manifest included in this recipe:
+
+```bash
+kubectl apply -f gateway.yaml -n ${NAMESPACE}
+```
+
+Alternatively, deploy the gateway from the llm-d repository. Clone the repository:
+
+```bash
+git clone https://github.com/llm-d/llm-d.git
+
+cd llm-d/guides/recipes/gateway
+```
+
+Then deploy a gateway suitable for GKE:
+
+```bash
+kubectl apply -k ./gke-l7-regional-external-managed -n ${NAMESPACE}
+```
+
+2.7. Deploy the InferencePool:
+
+```bash
+kubectl apply -f inference-pool.yaml -n ${NAMESPACE}
+```
+
+## 3. Deploy the model
+
+Install LeaderWorkerSet:
+
+```bash
+VERSION=v0.8.0
+kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/$VERSION/manifests.yaml
+```
+
+To wait for LeaderWorkerSet to be fully available, run:
+
+```bash
+kubectl wait deploy/lws-controller-manager -n lws-system --for=condition=available --timeout=5m
+```
+
+### H200 (A3 Ultra)
+
+```bash
+kubectl apply -f a3ultra/disaggregated-serving.yaml -n ${NAMESPACE}
+```
+
+### B200 (A4)
+
+```bash
+kubectl apply -f a4/disaggregated-serving.yaml -n ${NAMESPACE}
+```
+
+## 4. Verify the deployment
+
+```bash
+export GATEWAY_IP=$(kubectl get gateway llm-d-inference-gateway -n ${NAMESPACE} -o jsonpath='{.status.addresses[0].value}')
+
+curl http://$GATEWAY_IP/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek-ai/DeepSeek-R1-0528",
+    "messages": [
+      {"role": "user", "content": "Explain quantum computing in one sentence."}
+    ],
+    "max_tokens": 50
+  }'
+
+```
diff --git a/inference/llm-d/a3ultra/disaggregated-serving.yaml b/inference/llm-d/a3ultra/disaggregated-serving.yaml
new file mode 100644
index 00000000..e3228936
--- /dev/null
+++ b/inference/llm-d/a3ultra/disaggregated-serving.yaml
@@ -0,0 +1,478 @@
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: deepseek-r1
+---
+apiVersion: leaderworkerset.x-k8s.io/v1
+kind: LeaderWorkerSet
+metadata:
+  labels:
+    llm-d.ai/inferenceServing: "true"
+    llm-d.ai/model: DeepSeek-R1-0528
+    llm-d.ai/role: decode
+  name: wide-ep-llm-d-decode
+spec:
+  leaderWorkerTemplate:
+    size: 2
+    workerTemplate:
+      metadata:
+        annotations:
+          networking.gke.io/default-interface: eth0
+          networking.gke.io/interfaces: |
+            [
+              {"interfaceName":"eth0","network":"default"},
+              {"interfaceName":"eth2","network":"rdma-0"},
+              {"interfaceName":"eth3","network":"rdma-1"},
+              {"interfaceName":"eth4","network":"rdma-2"},
+              {"interfaceName":"eth5","network":"rdma-3"},
+              {"interfaceName":"eth6","network":"rdma-4"},
+              {"interfaceName":"eth7","network":"rdma-5"},
+              {"interfaceName":"eth8","network":"rdma-6"},
+              {"interfaceName":"eth9","network":"rdma-7"}
+            ]
+        labels:
+          llm-d.ai/inferenceServing: "true"
+          llm-d.ai/model: DeepSeek-R1-0528
+          llm-d.ai/role: decode
+      spec:
+        affinity:
+          podAffinity:
+            preferredDuringSchedulingIgnoredDuringExecution:
+            - podAffinityTerm:
+                labelSelector:
+                  matchLabels:
+                    llm-d.ai/role: decode
+                matchLabelKeys:
+                - component
+                topologyKey: cloud.google.com/gce-topology-block
+              weight: 2
+            - podAffinityTerm:
+                labelSelector:
+                  matchLabels:
+                    llm-d.ai/role: decode
+                matchLabelKeys:
+                - component
+                topologyKey: cloud.google.com/gce-topology-subblock
+              weight: 1
+        containers:
+        - args:
+          - |-
+            # Clear /dev/shm on start to prevent running out of space when crashes occur
+            # https://github.com/llm-d/llm-d/issues/352
+            find /dev/shm -type f -delete
+
+            #################
+            # RUN vLLM decode worker
+            #################
+            START_RANK=$(( ${LWS_WORKER_INDEX:-0} * DP_SIZE_LOCAL ))
+
+            # --data-parallel-hybrid-lb: Use external load balancing across nodes, and internal load balancing within a node
+            # --enable-expert-parallel: Use TPxDP in attention, EP in MoE layers
+            # --async-scheduling: Reduce white space between engine steps
+            # --enable-dbo: Dual batch overlap (DBO) overlaps compute with collective communication.
+
+            exec vllm serve \
+              deepseek-ai/DeepSeek-R1-0528 \
+              --port 8200 \
+              --trust-remote-code \
+              --disable-uvicorn-access-log \
+              --data-parallel-hybrid-lb \
+              --enable-expert-parallel \
+              --tensor-parallel-size $TP_SIZE \
+              --data-parallel-size $((LWS_GROUP_SIZE * DP_SIZE_LOCAL)) \
+              --data-parallel-size-local $DP_SIZE_LOCAL \
+              --data-parallel-address ${LWS_LEADER_ADDRESS} \
+              --data-parallel-rpc-port 5555 \
+              --data-parallel-start-rank $START_RANK \
+              --kv_transfer_config '{"kv_connector":"NixlConnector",
+                "kv_role":"kv_both",
+                "kv_load_failure_policy":"fail"}' \
+              --async-scheduling \
+              --enable-dbo \
+              --dbo-decode-token-threshold 32 \
+              --enable-eplb \
+              --eplb-config '{"window_size":"1000",
+                "step_interval":"3000",
+                "num_redundant_experts":"32",
+                "log_balancedness":"False"}' \
+              --compilation_config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
+              --kv-cache-memory-bytes=${KV_CACHE_MEMORY_BYTES-}
+          command:
+          - /bin/bash
+          - -c
+          env:
+          - name: VLLM_MOE_DP_CHUNK_SIZE
+            value: "384"
+          - name: DP_SIZE_LOCAL
+            value: "8"
+          - name: TP_SIZE
+            value: "1"
+          - name: TRITON_LIBCUDA_PATH
+            value: /usr/lib64
+          - name: VLLM_SKIP_P2P_CHECK
+            value: "1"
+          - name: VLLM_RANDOMIZE_DP_DUMMY_INPUTS
+            value: "1"
+          - name: VLLM_USE_DEEP_GEMM
+            value: "1"
+          - name: VLLM_ALL2ALL_BACKEND
+            value: deepep_low_latency
+          - name: NVIDIA_GDRCOPY
+            value: enabled
+          - name: NVSHMEM_REMOTE_TRANSPORT
+            value: ibgda
+          - name: NVSHMEM_IB_ENABLE_IBGDA
+            value: "true"
+          - name: NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME
+            value: eth0
+          - name: GLOO_SOCKET_IFNAME
+            value: eth0
+          - name: NCCL_SOCKET_IFNAME
+            value: eth0
+          - name: VLLM_LOGGING_LEVEL
+            value: INFO
+          - name: VLLM_NIXL_SIDE_CHANNEL_HOST
+            valueFrom:
+              fieldRef:
+                fieldPath: status.podIP
+          - name: CUDA_CACHE_PATH
+            value: /var/cache/vllm/cuda
+          - name: CCACHE_DIR
+            value: /var/cache/vllm/ccache
+          - name: VLLM_CACHE_ROOT
+            value: /var/cache/vllm/vllm
+          - name: FLASHINFER_WORKSPACE_BASE
+            value: /var/cache/vllm/flashinfer
+          - name: HF_HUB_CACHE
+            value: /var/cache/huggingface
+          - name: DEEP_EP_DEVICE_TO_HCA_MAPPING
+            value: 0:mlx5_0:1,1:mlx5_1:1,2:mlx5_2:1,3:mlx5_3:1,4:mlx5_4:1,5:mlx5_5:1,6:mlx5_6:1,7:mlx5_7:1
+          - name: BASH_ENV
+            value: /usr/local/gib/scripts/set_nccl_env.sh
+          - name: NVSHMEM_DISABLED_GDRCOPY
+            value: "true"
+          image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
+          imagePullPolicy: Always
+          livenessProbe:
+            failureThreshold: 3
+            httpGet:
+              path: /health
+              port: vllm
+            periodSeconds: 30
+            timeoutSeconds: 5
+          name: vllm
+          ports:
+          - containerPort: 8200
+            name: vllm
+            protocol: TCP
+          - containerPort: 5600
+            name: nixl
+            protocol: TCP
+          readinessProbe:
+            failureThreshold: 3
+            httpGet:
+              path: /v1/models
+              port: vllm
+            periodSeconds: 10
+            timeoutSeconds: 5
+          resources:
+            limits:
+              ephemeral-storage: 1Ti
+              memory: 512Gi
+              nvidia.com/gpu: "8"
+            requests:
+              cpu: 32
+              ephemeral-storage: 1Ti
+              memory: 512Gi
+              nvidia.com/gpu: "8"
+          securityContext:
+            capabilities:
+              add:
+              - IPC_LOCK
+              - SYS_RAWIO
+            privileged: true
+            runAsGroup: 0
+            runAsUser: 0
+          startupProbe:
+            failureThreshold: 2700
+            httpGet:
+              path: /health
+              port: vllm
+            initialDelaySeconds: 0
+            periodSeconds: 1
+            timeoutSeconds: 5
+          volumeMounts:
+          - mountPath: /dev/shm
+            name: dshm
+          - mountPath: /var/cache/huggingface
+            name: hf-cache
+          - mountPath: /var/cache/vllm
+            name: jit-cache
+          - mountPath: /usr/local/gib
+            name: gib
+        initContainers:
+        - args:
+          - --port=8000
+          - --vllm-port=8200
+          - --connector=nixlv2
+          - --zap-log-level=1
+          - --secure-proxy=false
+          - --enable-prefiller-sampling
+          image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.4.0
+          imagePullPolicy: Always
+          name: routing-proxy
+          ports:
+          - containerPort: 8000
+            name: sidecar
+            protocol: TCP
+          resources: {}
+          restartPolicy: Always
+          securityContext:
+            allowPrivilegeEscalation: false
+            runAsNonRoot: true
+        serviceAccountName: deepseek-r1
+        volumes:
+        - emptyDir:
+            medium: Memory
+            sizeLimit: 2Gi
+          name: dshm
+        - hostPath:
+            path: /mnt/stateful_partition/kube-ephemeral-ssd/shared_disk/vllm-hf-cache/
+            type: DirectoryOrCreate
+          name: hf-cache
+        - hostPath:
+            path: /mnt/stateful_partition/kube-ephemeral-ssd/shared_disk/vllm-jit-cache/
+            type: DirectoryOrCreate
+          name: jit-cache
+        - hostPath:
+            path: /home/kubernetes/bin/gib
+            type: ""
+          name: gib
+  replicas: 1
+---
+apiVersion: leaderworkerset.x-k8s.io/v1
+kind: LeaderWorkerSet
+metadata:
+  labels:
+    llm-d.ai/inferenceServing: "true"
+    llm-d.ai/model: DeepSeek-R1-0528
+    llm-d.ai/role: prefill
+  name: wide-ep-llm-d-prefill
+spec:
+  leaderWorkerTemplate:
+    size: 2
+    workerTemplate:
+      metadata:
+        annotations:
+          networking.gke.io/default-interface: eth0
+          networking.gke.io/interfaces: |
+            [
+              {"interfaceName":"eth0","network":"default"},
+              {"interfaceName":"eth2","network":"rdma-0"},
+              {"interfaceName":"eth3","network":"rdma-1"},
+              {"interfaceName":"eth4","network":"rdma-2"},
+              {"interfaceName":"eth5","network":"rdma-3"},
+              {"interfaceName":"eth6","network":"rdma-4"},
+              {"interfaceName":"eth7","network":"rdma-5"},
+              {"interfaceName":"eth8","network":"rdma-6"},
+              {"interfaceName":"eth9","network":"rdma-7"}
+            ]
+        labels:
+          llm-d.ai/inferenceServing: "true"
+          llm-d.ai/model: DeepSeek-R1-0528
+          llm-d.ai/role: prefill
+      spec:
+        affinity:
+          podAffinity:
+            preferredDuringSchedulingIgnoredDuringExecution:
+            - podAffinityTerm:
+                labelSelector:
+                  matchLabels:
+                    llm-d.ai/role: prefill
+                matchLabelKeys:
+                - component
+                topologyKey: cloud.google.com/gce-topology-block
+              weight: 2
+            - podAffinityTerm:
+                labelSelector:
+                  matchLabels:
+                    llm-d.ai/role: prefill
+                matchLabelKeys:
+                - component
+                topologyKey: cloud.google.com/gce-topology-subblock
+              weight: 1
+        containers:
+        - args:
+          - |-
+            # Clear /dev/shm on start to prevent running out of space when crashes occur
+            # https://github.com/llm-d/llm-d/issues/352
+            find /dev/shm -type f -delete
+
+            #################
+            # RUN vLLM prefill worker
+            #################
+            START_RANK=$(( ${LWS_WORKER_INDEX:-0} * DP_SIZE_LOCAL ))
+
+            # --data-parallel-hybrid-lb: Use external load balancing across nodes, and internal load balancing within a node
+            # --enable-expert-parallel: Use TPxDP in attention, EP in MoE layers
+            # --async-scheduling: Reduce white space between engine steps
+            # --enable-dbo: Dual batch overlap (DBO) overlaps compute with collective communication.
+            # --enable-eplb: Expert-parallel load balancing reduces EP load imbalance by replicating heavily-used experts
+            #   Performance-memory tradeoff: on DeepSeekV3 eplb uses an extra 2GB per redundant expert per GPU
+            #   Divisibility constraint: num_routed_experts (256 for DSv3) + num_redundant_experts must be divisible by the number of GPUs.
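+            #   Worked example for this manifest: size 2 nodes x DP_SIZE_LOCAL=8 = 16 GPUs, and
+            #   256 routed + 32 redundant = 288 experts, so each GPU holds 288 / 16 = 18 experts.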
+
+            exec vllm serve \
+              deepseek-ai/DeepSeek-R1-0528 \
+              --port 8000 \
+              --trust-remote-code \
+              --disable-uvicorn-access-log \
+              --data-parallel-hybrid-lb \
+              --enable-expert-parallel \
+              --tensor-parallel-size $TP_SIZE \
+              --data-parallel-size $((LWS_GROUP_SIZE * DP_SIZE_LOCAL)) \
+              --data-parallel-size-local $DP_SIZE_LOCAL \
+              --data-parallel-address ${LWS_LEADER_ADDRESS} \
+              --data-parallel-rpc-port 5555 \
+              --data-parallel-start-rank $START_RANK \
+              --kv_transfer_config '{"kv_connector":"NixlConnector",
+                "kv_role":"kv_both",
+                "kv_load_failure_policy":"fail"}' \
+              --async-scheduling \
+              --enable-dbo \
+              --dbo-prefill-token-threshold 32 \
+              --enable-eplb \
+              --eplb-config '{"window_size":"1000",
+                "step_interval":"3000",
+                "num_redundant_experts":"32",
+                "log_balancedness":"False"}' \
+              --gpu-memory-utilization 0.75
+          command:
+          - /bin/bash
+          - -c
+          env:
+          - name: DP_SIZE_LOCAL
+            value: "8"
+          - name: TP_SIZE
+            value: "1"
+          - name: TRITON_LIBCUDA_PATH
+            value: /usr/lib64
+          - name: VLLM_SKIP_P2P_CHECK
+            value: "1"
+          - name: VLLM_RANDOMIZE_DP_DUMMY_INPUTS
+            value: "1"
+          - name: VLLM_USE_DEEP_GEMM
+            value: "1"
+          - name: VLLM_ALL2ALL_BACKEND
+            value: deepep_high_throughput
+          - name: NVIDIA_GDRCOPY
+            value: enabled
+          - name: NVSHMEM_REMOTE_TRANSPORT
+            value: ibgda
+          - name: NVSHMEM_IB_ENABLE_IBGDA
+            value: "true"
+          - name: NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME
+            value: eth0
+          - name: GLOO_SOCKET_IFNAME
+            value: eth0
+          - name: NCCL_SOCKET_IFNAME
+            value: eth0
+          - name: VLLM_LOGGING_LEVEL
+            value: INFO
+          - name: VLLM_NIXL_SIDE_CHANNEL_HOST
+            valueFrom:
+              fieldRef:
+                fieldPath: status.podIP
+          - name: CUDA_CACHE_PATH
+            value: /var/cache/vllm/cuda
+          - name: CCACHE_DIR
+            value: /var/cache/vllm/ccache
+          - name: VLLM_CACHE_ROOT
+            value: /var/cache/vllm/vllm
+          - name: FLASHINFER_WORKSPACE_BASE
+            value: /var/cache/vllm/flashinfer
+          - name: HF_HUB_CACHE
+            value: /var/cache/huggingface
+          - name: DEEP_EP_DEVICE_TO_HCA_MAPPING
+            value: 0:mlx5_0:1,1:mlx5_1:1,2:mlx5_2:1,3:mlx5_3:1,4:mlx5_4:1,5:mlx5_5:1,6:mlx5_6:1,7:mlx5_7:1
+          - name: BASH_ENV
+            value: /usr/local/gib/scripts/set_nccl_env.sh
+          - name: NVSHMEM_DISABLED_GDRCOPY
+            value: "true"
+          image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
+          imagePullPolicy: Always
+          livenessProbe:
+            failureThreshold: 3
+            httpGet:
+              path: /health
+              port: vllm
+            periodSeconds: 30
+            timeoutSeconds: 5
+          name: vllm
+          ports:
+          - containerPort: 8000
+            name: vllm
+            protocol: TCP
+          - containerPort: 5600
+            name: nixl
+            protocol: TCP
+          readinessProbe:
+            failureThreshold: 3
+            httpGet:
+              path: /v1/models
+              port: vllm
+            periodSeconds: 10
+            timeoutSeconds: 5
+          resources:
+            limits:
+              ephemeral-storage: 1Ti
+              memory: 512Gi
+              nvidia.com/gpu: "8"
+            requests:
+              cpu: 32
+              ephemeral-storage: 1Ti
+              memory: 512Gi
+              nvidia.com/gpu: "8"
+          securityContext:
+            capabilities:
+              add:
+              - IPC_LOCK
+              - SYS_RAWIO
+            privileged: true
+            runAsGroup: 0
+            runAsUser: 0
+          startupProbe:
+            failureThreshold: 2700
+            httpGet:
+              path: /health
+              port: vllm
+            initialDelaySeconds: 0
+            periodSeconds: 1
+            timeoutSeconds: 5
+          volumeMounts:
+          - mountPath: /dev/shm
+            name: dshm
+          - mountPath: /var/cache/huggingface
+            name: hf-cache
+          - mountPath: /var/cache/vllm
+            name: jit-cache
+          - mountPath: /usr/local/gib
+            name: gib
+        serviceAccountName: deepseek-r1
+        volumes:
+        - emptyDir:
+            medium: Memory
+            sizeLimit: 2Gi
+          name: dshm
+        - hostPath:
+            path: /mnt/stateful_partition/kube-ephemeral-ssd/shared_disk/vllm-hf-cache/
+            type: DirectoryOrCreate
+          name: hf-cache
+        - hostPath:
+            path: /mnt/stateful_partition/kube-ephemeral-ssd/shared_disk/vllm-jit-cache/
+            type: DirectoryOrCreate
+          name: jit-cache
+        - hostPath:
+            path: /home/kubernetes/bin/gib
+            type: ""
+          name: gib
+  replicas: 1
diff --git a/inference/llm-d/a4/disaggregated-serving.yaml b/inference/llm-d/a4/disaggregated-serving.yaml
new file mode 100644
index 00000000..8d1ded48
--- /dev/null
+++ b/inference/llm-d/a4/disaggregated-serving.yaml
@@ -0,0 +1,480 @@
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: deepseek-r1
+---
+apiVersion: leaderworkerset.x-k8s.io/v1
+kind: LeaderWorkerSet
+metadata:
+  labels:
+    llm-d.ai/inferenceServing: "true"
+    llm-d.ai/model: DeepSeek-R1-0528
+    llm-d.ai/role: decode
+  name: wide-ep-llm-d-decode
+spec:
+  leaderWorkerTemplate:
+    size: 2
+    workerTemplate:
+      metadata:
+        annotations:
+          networking.gke.io/default-interface: eth0
+          networking.gke.io/interfaces: |
+            [
+              {"interfaceName":"eth0","network":"default"},
+              {"interfaceName":"eth2","network":"rdma-0"},
+              {"interfaceName":"eth3","network":"rdma-1"},
+              {"interfaceName":"eth4","network":"rdma-2"},
+              {"interfaceName":"eth5","network":"rdma-3"},
+              {"interfaceName":"eth6","network":"rdma-4"},
+              {"interfaceName":"eth7","network":"rdma-5"},
+              {"interfaceName":"eth8","network":"rdma-6"},
+              {"interfaceName":"eth9","network":"rdma-7"}
+            ]
+        labels:
+          llm-d.ai/inferenceServing: "true"
+          llm-d.ai/model: DeepSeek-R1-0528
+          llm-d.ai/role: decode
+      spec:
+        affinity:
+          podAffinity:
+            preferredDuringSchedulingIgnoredDuringExecution:
+            - podAffinityTerm:
+                labelSelector:
+                  matchLabels:
+                    llm-d.ai/role: decode
+                matchLabelKeys:
+                - component
+                topologyKey: cloud.google.com/gce-topology-block
+              weight: 2
+            - podAffinityTerm:
+                labelSelector:
+                  matchLabels:
+                    llm-d.ai/role: decode
+                matchLabelKeys:
+                - component
+                topologyKey: cloud.google.com/gce-topology-subblock
+              weight: 1
+        containers:
+        - args:
+          - |-
+            # Clear /dev/shm on start to prevent running out of space when crashes occur
+            # https://github.com/llm-d/llm-d/issues/352
+            find /dev/shm -type f -delete
+
+            #################
+            # RUN vLLM decode worker
+            #################
+            START_RANK=$(( ${LWS_WORKER_INDEX:-0} * DP_SIZE_LOCAL ))
+
+            # --data-parallel-hybrid-lb: Use external load balancing across nodes, and internal load balancing within a node
+            # --enable-expert-parallel: Use TPxDP in attention, EP in MoE layers
+            # --async-scheduling: Reduce white space between engine steps
+            # --enable-dbo: Dual batch overlap (DBO) overlaps compute with collective communication.
+
+            exec vllm serve \
+              deepseek-ai/DeepSeek-R1-0528 \
+              --port 8200 \
+              --trust-remote-code \
+              --disable-uvicorn-access-log \
+              --data-parallel-hybrid-lb \
+              --enable-expert-parallel \
+              --tensor-parallel-size $TP_SIZE \
+              --data-parallel-size $((LWS_GROUP_SIZE * DP_SIZE_LOCAL)) \
+              --data-parallel-size-local $DP_SIZE_LOCAL \
+              --data-parallel-address ${LWS_LEADER_ADDRESS} \
+              --data-parallel-rpc-port 5555 \
+              --data-parallel-start-rank $START_RANK \
+              --kv_transfer_config '{"kv_connector":"NixlConnector",
+                "kv_role":"kv_both",
+                "kv_load_failure_policy":"fail"}' \
+              --async-scheduling \
+              --enable-dbo \
+              --dbo-decode-token-threshold 32 \
+              --enable-eplb \
+              --eplb-config '{"window_size":"1000",
+                "step_interval":"3000",
+                "num_redundant_experts":"32",
+                "log_balancedness":"False"}' \
+              --compilation_config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
+              --kv-cache-memory-bytes=${KV_CACHE_MEMORY_BYTES-}
+          command:
+          - /bin/bash
+          - -c
+          env:
+          - name: VLLM_MOE_DP_CHUNK_SIZE
+            value: "384"
+          - name: DP_SIZE_LOCAL
+            value: "8"
+          - name: TP_SIZE
+            value: "1"
+          - name: TRITON_LIBCUDA_PATH
+            value: /usr/lib64
+          - name: VLLM_SKIP_P2P_CHECK
+            value: "1"
+          - name: VLLM_RANDOMIZE_DP_DUMMY_INPUTS
+            value: "1"
+          - name: VLLM_USE_DEEP_GEMM
+            value: "1"
+          - name: VLLM_ALL2ALL_BACKEND
+            value: deepep_low_latency
+          - name: NVIDIA_GDRCOPY
+            value: enabled
+          - name: NVSHMEM_REMOTE_TRANSPORT
+            value: ibgda
+          - name: NVSHMEM_IB_ENABLE_IBGDA
+            value: "true"
+          - name: NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME
+            value: eth0
+          - name: GLOO_SOCKET_IFNAME
+            value: eth0
+          - name: NCCL_SOCKET_IFNAME
+            value: eth0
+          - name: VLLM_LOGGING_LEVEL
+            value: INFO
+          - name: VLLM_NIXL_SIDE_CHANNEL_HOST
+            valueFrom:
+              fieldRef:
+                fieldPath: status.podIP
+          - name: CUDA_CACHE_PATH
+            value: /var/cache/vllm/cuda
+          - name: CCACHE_DIR
+            value: /var/cache/vllm/ccache
+          - name: VLLM_CACHE_ROOT
+            value: /var/cache/vllm/vllm
+          - name: FLASHINFER_WORKSPACE_BASE
+            value: /var/cache/vllm/flashinfer
+          - name: HF_HUB_CACHE
+            value: /var/cache/huggingface
+          - name: DEEP_EP_DEVICE_TO_HCA_MAPPING
+            value: 0:mlx5_0:1,1:mlx5_1:1,2:mlx5_2:1,3:mlx5_3:1,4:mlx5_4:1,5:mlx5_5:1,6:mlx5_6:1,7:mlx5_7:1
+          - name: BASH_ENV
+            value: /usr/local/gib/scripts/set_nccl_env.sh
+          - name: NVSHMEM_DISABLED_GDRCOPY
+            value: "true"
+          - name: KV_CACHE_MEMORY_BYTES
+            value: "63000000000"
+          image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
+          imagePullPolicy: Always
+          livenessProbe:
+            failureThreshold: 3
+            httpGet:
+              path: /health
+              port: vllm
+            periodSeconds: 30
+            timeoutSeconds: 5
+          name: vllm
+          ports:
+          - containerPort: 8200
+            name: vllm
+            protocol: TCP
+          - containerPort: 5600
+            name: nixl
+            protocol: TCP
+          readinessProbe:
+            failureThreshold: 3
+            httpGet:
+              path: /v1/models
+              port: vllm
+            periodSeconds: 10
+            timeoutSeconds: 5
+          resources:
+            limits:
+              ephemeral-storage: 1Ti
+              memory: 512Gi
+              nvidia.com/gpu: "8"
+            requests:
+              cpu: 32
+              ephemeral-storage: 1Ti
+              memory: 512Gi
+              nvidia.com/gpu: "8"
+          securityContext:
+            capabilities:
+              add:
+              - IPC_LOCK
+              - SYS_RAWIO
+            privileged: true
+            runAsGroup: 0
+            runAsUser: 0
+          startupProbe:
+            failureThreshold: 2700
+            httpGet:
+              path: /health
+              port: vllm
+            initialDelaySeconds: 0
+            periodSeconds: 1
+            timeoutSeconds: 5
+          volumeMounts:
+          - mountPath: /dev/shm
+            name: dshm
+          - mountPath: /var/cache/huggingface
+            name: hf-cache
+          - mountPath: /var/cache/vllm
+            name: jit-cache
+          - mountPath: /usr/local/gib
+            name: gib
+        initContainers:
+        - args:
+          - --port=8000
+          - --vllm-port=8200
+          - --connector=nixlv2
+          - --zap-log-level=1
+          - --secure-proxy=false
+          - --enable-prefiller-sampling
+          image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.4.0
+          imagePullPolicy: Always
+          name: routing-proxy
+          ports:
+          - containerPort: 8000
+            name: sidecar
+            protocol: TCP
+          resources: {}
+          restartPolicy: Always
+          securityContext:
+            allowPrivilegeEscalation: false
+            runAsNonRoot: true
+        serviceAccountName: deepseek-r1
+        volumes:
+        - emptyDir:
+            medium: Memory
+            sizeLimit: 2Gi
+          name: dshm
+        - hostPath:
+            path: /mnt/stateful_partition/kube-ephemeral-ssd/shared_disk/vllm-hf-cache/
+            type: DirectoryOrCreate
+          name: hf-cache
+        - hostPath:
+            path: /mnt/stateful_partition/kube-ephemeral-ssd/shared_disk/vllm-jit-cache/
+            type: DirectoryOrCreate
+          name: jit-cache
+        - hostPath:
+            path: /home/kubernetes/bin/gib
+            type: ""
+          name: gib
+  replicas: 1
+---
+apiVersion: leaderworkerset.x-k8s.io/v1
+kind: LeaderWorkerSet
+metadata:
+  labels:
+    llm-d.ai/inferenceServing: "true"
+    llm-d.ai/model: DeepSeek-R1-0528
+    llm-d.ai/role: prefill
+  name: wide-ep-llm-d-prefill
+spec:
+  leaderWorkerTemplate:
+    size: 2
+    workerTemplate:
+      metadata:
+        annotations:
+          networking.gke.io/default-interface: eth0
+          networking.gke.io/interfaces: |
+            [
+              {"interfaceName":"eth0","network":"default"},
+              {"interfaceName":"eth2","network":"rdma-0"},
+              {"interfaceName":"eth3","network":"rdma-1"},
+              {"interfaceName":"eth4","network":"rdma-2"},
+              {"interfaceName":"eth5","network":"rdma-3"},
+              {"interfaceName":"eth6","network":"rdma-4"},
+              {"interfaceName":"eth7","network":"rdma-5"},
+              {"interfaceName":"eth8","network":"rdma-6"},
+              {"interfaceName":"eth9","network":"rdma-7"}
+            ]
+        labels:
+          llm-d.ai/inferenceServing: "true"
+          llm-d.ai/model: DeepSeek-R1-0528
+          llm-d.ai/role: prefill
+      spec:
+        affinity:
+          podAffinity:
+            preferredDuringSchedulingIgnoredDuringExecution:
+            - podAffinityTerm:
+                labelSelector:
+                  matchLabels:
+                    llm-d.ai/role: prefill
+                matchLabelKeys:
+                - component
+                topologyKey: cloud.google.com/gce-topology-block
+              weight: 2
+            - podAffinityTerm:
+                labelSelector:
+                  matchLabels:
+                    llm-d.ai/role: prefill
+                matchLabelKeys:
+                - component
+                topologyKey: cloud.google.com/gce-topology-subblock
+              weight: 1
+        containers:
+        - args:
+          - |-
+            # Clear /dev/shm on start to prevent running out of space when crashes occur
+            # https://github.com/llm-d/llm-d/issues/352
+            find /dev/shm -type f -delete
+
+            #################
+            # RUN vLLM prefill worker
+            #################
+            START_RANK=$(( ${LWS_WORKER_INDEX:-0} * DP_SIZE_LOCAL ))
+
+            # --data-parallel-hybrid-lb: Use external load balancing across nodes, and internal load balancing within a node
+            # --enable-expert-parallel: Use TPxDP in attention, EP in MoE layers
+            # --async-scheduling: Reduce white space between engine steps
+            # --enable-dbo: Dual batch overlap (DBO) overlaps compute with collective communication.
+            # --enable-eplb: Expert-parallel load balancing reduces EP load imbalance by replicating heavily-used experts
+            #   Performance-memory tradeoff: on DeepSeekV3 eplb uses an extra 2GB per redundant expert per GPU
+            #   Divisibility constraint: num_routed_experts (256 for DSv3) + num_redundant_experts must be divisible by the number of GPUs.
+
+            exec vllm serve \
+              deepseek-ai/DeepSeek-R1-0528 \
+              --port 8000 \
+              --trust-remote-code \
+              --disable-uvicorn-access-log \
+              --data-parallel-hybrid-lb \
+              --enable-expert-parallel \
+              --tensor-parallel-size $TP_SIZE \
+              --data-parallel-size $((LWS_GROUP_SIZE * DP_SIZE_LOCAL)) \
+              --data-parallel-size-local $DP_SIZE_LOCAL \
+              --data-parallel-address ${LWS_LEADER_ADDRESS} \
+              --data-parallel-rpc-port 5555 \
+              --data-parallel-start-rank $START_RANK \
+              --kv_transfer_config '{"kv_connector":"NixlConnector",
+                "kv_role":"kv_both",
+                "kv_load_failure_policy":"fail"}' \
+              --async-scheduling \
+              --enable-dbo \
+              --dbo-prefill-token-threshold 32 \
+              --enable-eplb \
+              --eplb-config '{"window_size":"1000",
+                "step_interval":"3000",
+                "num_redundant_experts":"32",
+                "log_balancedness":"False"}' \
+              --gpu-memory-utilization 0.75
+          command:
+          - /bin/bash
+          - -c
+          env:
+          - name: DP_SIZE_LOCAL
+            value: "8"
+          - name: TP_SIZE
+            value: "1"
+          - name: TRITON_LIBCUDA_PATH
+            value: /usr/lib64
+          - name: VLLM_SKIP_P2P_CHECK
+            value: "1"
+          - name: VLLM_RANDOMIZE_DP_DUMMY_INPUTS
+            value: "1"
+          - name: VLLM_USE_DEEP_GEMM
+            value: "1"
+          - name: VLLM_ALL2ALL_BACKEND
+            value: deepep_high_throughput
+          - name: NVIDIA_GDRCOPY
+            value: enabled
+          - name: NVSHMEM_REMOTE_TRANSPORT
+            value: ibgda
+          - name: NVSHMEM_IB_ENABLE_IBGDA
+            value: "true"
+          - name: NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME
+            value: eth0
+          - name: GLOO_SOCKET_IFNAME
+            value: eth0
+          - name: NCCL_SOCKET_IFNAME
+            value: eth0
+          - name: VLLM_LOGGING_LEVEL
+            value: INFO
+          - name: VLLM_NIXL_SIDE_CHANNEL_HOST
+            valueFrom:
+              fieldRef:
+                fieldPath: status.podIP
+          - name: CUDA_CACHE_PATH
+            value: /var/cache/vllm/cuda
+          - name: CCACHE_DIR
+            value: /var/cache/vllm/ccache
+          - name: VLLM_CACHE_ROOT
+            value: /var/cache/vllm/vllm
+          - name: FLASHINFER_WORKSPACE_BASE
+            value: /var/cache/vllm/flashinfer
+          - name: HF_HUB_CACHE
+            value: /var/cache/huggingface
+          - name: DEEP_EP_DEVICE_TO_HCA_MAPPING
+            value: 0:mlx5_0:1,1:mlx5_1:1,2:mlx5_2:1,3:mlx5_3:1,4:mlx5_4:1,5:mlx5_5:1,6:mlx5_6:1,7:mlx5_7:1
+          - name: BASH_ENV
+            value: /usr/local/gib/scripts/set_nccl_env.sh
+          - name: NVSHMEM_DISABLED_GDRCOPY
+            value: "true"
+          image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
+          imagePullPolicy: Always
+          livenessProbe:
+            failureThreshold: 3
+            httpGet:
+              path: /health
+              port: vllm
+            periodSeconds: 30
+            timeoutSeconds: 5
+          name: vllm
+          ports:
+          - containerPort: 8000
+            name: vllm
+            protocol: TCP
+          - containerPort: 5600
+            name: nixl
+            protocol: TCP
+          readinessProbe:
+            failureThreshold: 3
+            httpGet:
+              path: /v1/models
+              port: vllm
+            periodSeconds: 10
+            timeoutSeconds: 5
+          resources:
+            limits:
+              ephemeral-storage: 1Ti
+              memory: 512Gi
+              nvidia.com/gpu: "8"
+            requests:
+              cpu: 32
+              ephemeral-storage: 1Ti
+              memory: 512Gi
+              nvidia.com/gpu: "8"
+          securityContext:
+            capabilities:
+              add:
+              - IPC_LOCK
+              - SYS_RAWIO
+            privileged: true
+            runAsGroup: 0
+            runAsUser: 0
+          startupProbe:
+            failureThreshold: 2700
+            httpGet:
+              path: /health
+              port: vllm
+            initialDelaySeconds: 0
+            periodSeconds: 1
+            timeoutSeconds: 5
+          volumeMounts:
+          - mountPath: /dev/shm
+            name: dshm
+          - mountPath: /var/cache/huggingface
+            name: hf-cache
+          - mountPath: /var/cache/vllm
+            name: jit-cache
+          - mountPath: /usr/local/gib
+            name: gib
serviceAccountName: deepseek-r1 + volumes: + - emptyDir: + medium: Memory + sizeLimit: 2Gi + name: dshm + - hostPath: + path: /mnt/stateful_partition/kube-ephemeral-ssd/shared_disk/vllm-hf-cache/ + type: DirectoryOrCreate + name: hf-cache + - hostPath: + path: /mnt/stateful_partition/kube-ephemeral-ssd/shared_disk/vllm-jit-cache/ + type: DirectoryOrCreate + name: jit-cache + - hostPath: + path: /home/kubernetes/bin/gib + type: "" + name: gib + replicas: 1 diff --git a/inference/llm-d/gateway.yaml b/inference/llm-d/gateway.yaml new file mode 100644 index 00000000..ef2ebb9d --- /dev/null +++ b/inference/llm-d/gateway.yaml @@ -0,0 +1,33 @@ +apiVersion: gateway.networking.k8s.io/v1 +kind: Gateway +metadata: + name: llm-d-inference-gateway +spec: + gatewayClassName: gke-l7-regional-external-managed + listeners: + - allowedRoutes: + namespaces: + from: All + name: default + port: 80 + protocol: HTTP +--- +apiVersion: gateway.networking.k8s.io/v1 +kind: HTTPRoute +metadata: + name: llm-d-route +spec: + parentRefs: + - group: gateway.networking.k8s.io + kind: Gateway + name: llm-d-inference-gateway + rules: + - backendRefs: + - group: inference.networking.k8s.io + kind: InferencePool + name: llm-d-infpool + port: 8000 + matches: + - path: + type: PathPrefix + value: / diff --git a/inference/llm-d/inference-pool.yaml b/inference/llm-d/inference-pool.yaml new file mode 100644 index 00000000..7cf92236 --- /dev/null +++ b/inference/llm-d/inference-pool.yaml @@ -0,0 +1,265 @@ +--- +# Source: inferencepool/templates/rbac.yaml +apiVersion: v1 +kind: ServiceAccount +metadata: + name: llm-d-infpool-epp + labels: + app.kubernetes.io/name: llm-d-infpool-epp + app.kubernetes.io/version: "v1.3.0" +--- +# Source: inferencepool/templates/epp-config.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: llm-d-infpool-epp +data: + default-plugins.yaml: | + apiVersion: inference.networking.x-k8s.io/v1alpha1 + kind: EndpointPickerConfig + plugins: + - type: queue-scorer + - type: 
kv-cache-utilization-scorer + - type: prefix-cache-scorer + schedulingProfiles: + - name: default + plugins: + - pluginRef: queue-scorer + weight: 2 + - pluginRef: kv-cache-utilization-scorer + weight: 2 + - pluginRef: prefix-cache-scorer + weight: 3 + custom-plugins.yaml: | + # ALWAYS DO PD IN THIS EXAMPLE (THRESHOLD 0) + # This example uses random routing + # since it's not yet possible to route to individual DP ranks + apiVersion: inference.networking.x-k8s.io/v1alpha1 + kind: EndpointPickerConfig + plugins: + - type: prefill-header-handler + - type: prefill-filter + - type: decode-filter + - type: random-picker + parameters: + maxNumOfEndpoints: 1 + - type: pd-profile-handler + parameters: + threshold: 0 + hashBlockSize: 5 + schedulingProfiles: + - name: prefill + plugins: + - pluginRef: prefill-filter + - pluginRef: random-picker + - name: decode + plugins: + - pluginRef: decode-filter + - pluginRef: random-picker +--- +# Source: inferencepool/templates/rbac.yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: Role +metadata: + name: llm-d-infpool-epp + labels: + app.kubernetes.io/name: llm-d-infpool-epp + app.kubernetes.io/version: "v1.3.0" +rules: +- apiGroups: ["inference.networking.x-k8s.io"] + resources: ["inferenceobjectives", "inferencemodelrewrites"] + verbs: ["get", "watch", "list"] +- apiGroups: ["inference.networking.k8s.io"] + resources: ["inferencepools"] + verbs: ["get", "watch", "list"] +- apiGroups: [""] + resources: ["pods"] + verbs: ["get", "watch", "list"] +--- +# Source: inferencepool/templates/rbac.yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: llm-d-infpool-epp +subjects: +- kind: ServiceAccount + name: llm-d-infpool-epp +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: Role + name: llm-d-infpool-epp +--- +# Source: inferencepool/templates/epp-service.yaml +apiVersion: v1 +kind: Service +metadata: + name: llm-d-infpool-epp + labels: + app.kubernetes.io/name: llm-d-infpool-epp + 
app.kubernetes.io/version: "v1.3.0"
+spec:
+  selector:
+    inferencepool: llm-d-infpool-epp
+  ports:
+    - name: grpc-ext-proc
+      protocol: TCP
+      port: 9002
+    - name: http-metrics
+      protocol: TCP
+      port: 9090
+  type: ClusterIP
+---
+# Source: inferencepool/templates/epp-deployment.yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: llm-d-infpool-epp
+  labels:
+    app.kubernetes.io/name: llm-d-infpool-epp
+    app.kubernetes.io/version: "v1.3.0"
+spec:
+  replicas: 1
+  strategy:
+    # The current recommended EPP deployment pattern is to have a single active replica. This ensures
+    # optimal performance of stateful operations such as the prefix-cache-aware scorer.
+    # With the Recreate strategy, the old replica is killed immediately, allowing the new replica(s)
+    # to take over quickly. This is particularly important in the high-availability setup with leader
+    # election, where a rolling update would prevent the old leader from being killed, because
+    # maxUnavailable would otherwise be 100%.
+    type: Recreate
+  selector:
+    matchLabels:
+      inferencepool: llm-d-infpool-epp
+  template:
+    metadata:
+      labels:
+        inferencepool: llm-d-infpool-epp
+    spec:
+      serviceAccountName: llm-d-infpool-epp
+      # Conservatively, this timeout should mirror the longest grace period of the pods within the pool.
+      terminationGracePeriodSeconds: 130
+      containers:
+      - name: epp
+        image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.4.0
+        imagePullPolicy: Always
+        args:
+        - --pool-name
+        - llm-d-infpool
+        # The pool namespace is optional because EPP can default to the NAMESPACE env var.
+        # We still keep this here so that the template works with older versions of EPP, or other
+        # distros of EPP which may not have implemented the NAMESPACE env var defaulting behavior.
+        - --pool-namespace
+        - default
+        - --pool-group
+        - "inference.networking.k8s.io"
+        - --zap-encoder
+        - "json"
+        - --config-file
+        - "/config/custom-plugins.yaml"
+        # Additional EPP flags can be appended to this args list.
+        - --kv-cache-usage-percentage-metric
+        - "vllm:kv_cache_usage_perc"
+        - --v
+        - "1"
+        - --tracing=false
+        ports:
+        - name: grpc
+          containerPort: 9002
+        - name: grpc-health
+          containerPort: 9003
+        - name: metrics
+          containerPort: 9090
+        livenessProbe:
+          grpc:
+            port: 9003
+            service: inference-extension
+          initialDelaySeconds: 5
+          periodSeconds: 10
+        readinessProbe:
+          grpc:
+            port: 9003
+            service: inference-extension
+          periodSeconds: 2
+        env:
+        - name: NAMESPACE
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.namespace
+        - name: POD_NAME
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.name
+        volumeMounts:
+        - name: plugins-config-volume
+          mountPath: "/config"
+      volumes:
+      - name: plugins-config-volume
+        configMap:
+          name: llm-d-infpool-epp
+---
+# Source: inferencepool/templates/gke.yaml
+apiVersion: networking.gke.io/v1
+kind: GCPBackendPolicy
+metadata:
+  name: llm-d-infpool
+  labels:
+    app.kubernetes.io/name: llm-d-infpool-epp
+    app.kubernetes.io/version: "v1.3.0"
+spec:
+  targetRef:
+    group: "inference.networking.k8s.io"
+    kind: InferencePool
+    name: llm-d-infpool
+  default:
+    timeoutSec: 300  # 5-minute timeout (adjust as needed)
+    logging:
+      enabled: true  # log all requests by default
+---
+# Source: inferencepool/templates/gke.yaml
+kind: HealthCheckPolicy
+apiVersion: networking.gke.io/v1
+metadata:
+  name: llm-d-infpool
+  labels:
+    app.kubernetes.io/name: llm-d-infpool-epp
+    app.kubernetes.io/version: "v1.3.0"
+spec:
+  targetRef:
+    group: "inference.networking.k8s.io"
+    kind: InferencePool
+    name: llm-d-infpool
+  default:
+    # Set a more aggressive health check than the default 5s for a faster
+    # switchover during EPP rollouts.
+ timeoutSec: 2 + checkIntervalSec: 2 + config: + type: HTTP + httpHealthCheck: + requestPath: /health + port: 8000 +--- +# Source: inferencepool/templates/inferencepool.yaml +apiVersion: "inference.networking.k8s.io/v1" +kind: InferencePool +metadata: + name: llm-d-infpool + labels: + app.kubernetes.io/name: llm-d-infpool-epp + app.kubernetes.io/version: "v1.3.0" +spec: + targetPorts: + - number: 8000 + selector: + matchLabels: + llm-d.ai/inferenceServing: "true" + llm-d.ai/model: "DeepSeek-R1-0528" + endpointPickerRef: + name: llm-d-infpool-epp + port: + number: 9002