diff --git a/inference/llm-d/README.md b/inference/llm-d/README.md
new file mode 100644
index 00000000..2c40b9b4
--- /dev/null
+++ b/inference/llm-d/README.md
@@ -0,0 +1,216 @@
+# Disaggregated Inference with llm-d on GKE (without Helm)
+
+This document outlines the steps to deploy an llm-d inference server on GKE without using Helm.
+
+## 1. Environment Setup (One-Time)
+
+1.1. If using A3 Ultra or A4, create an RDMA cluster by following [this guide](https://docs.cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute-custom#create-with-rdma); if using A4X, follow [this guide](https://docs.cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute-custom-a4x) instead.
+
+1.2. Clone the repository:
+
+```bash
+git clone https://github.com/ai-hypercomputer/gpu-recipes.git
+cd gpu-recipes/inference/llm-d
+```
+
+1.3. Configure environment variables:
+
+```bash
+export PROJECT_ID=        # your Google Cloud project ID
+export CLUSTER_REGION=    # region of your GKE cluster
+export CLUSTER_NAME=      # name of your GKE cluster
+export NAMESPACE=         # Kubernetes namespace to deploy into
+export HF_TOKEN=          # your Hugging Face access token
+```
+
+1.4. Connect to your GKE cluster:
+
+```bash
+gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
+
+kubectl create namespace ${NAMESPACE}
+
+kubectl config set-context --current --namespace=$NAMESPACE
+```
+
+1.5. Create secrets:
+
+```bash
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN=${HF_TOKEN} \
+  -n ${NAMESPACE}
+```
+
+## 2. Set up the GKE Gateway
+
+2.1. [Enable the Gateway API in your cluster](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/deploying-gateways#enable-gateway):
+
+```bash
+gcloud container clusters update $CLUSTER_NAME \
+  --location=$CLUSTER_REGION \
+  --gateway-api=standard
+```
+
+2.2. [Verify your cluster](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/deploying-gateways#verify-internal):
+
+```bash
+gcloud container clusters describe $CLUSTER_NAME \
+  --location=$CLUSTER_REGION \
+  --format json
+```
+
+The output is similar to the following:
+
+```json
+"networkConfig": {
+  ...
+  "gatewayApiConfig": {
+    "channel": "CHANNEL_STANDARD"
+  },
+  ...
+},
+```
+
+Confirm the `GatewayClasses` are installed in your cluster:
+
+```bash
+kubectl get gatewayclass
+```
+
+The output is similar to the following:
+
+```
+NAME                               CONTROLLER                  ACCEPTED   AGE
+gke-l7-global-external-managed     networking.gke.io/gateway   True       16h
+gke-l7-regional-external-managed   networking.gke.io/gateway   True       16h
+gke-l7-gxlb                        networking.gke.io/gateway   True       16h
+gke-l7-rilb                        networking.gke.io/gateway   True       16h
+```
+
+2.3. [Configure a proxy-only subnet](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/deploying-gateways#configure_a_proxy-only_subnet):
+
+```bash
+export SUBNET_NAME=       # e.g. gateway-proxy-only-subnet
+export VPC_NETWORK_NAME=  # e.g. default
+export CIDR_RANGE=        # e.g. 10.1.1.0/24
+
+gcloud compute networks subnets create $SUBNET_NAME \
+  --purpose=REGIONAL_MANAGED_PROXY \
+  --role=ACTIVE \
+  --region=$CLUSTER_REGION \
+  --network=$VPC_NETWORK_NAME \
+  --range=$CIDR_RANGE
+```
+
+2.4. Verify your proxy-only subnet:
+
+```bash
+gcloud compute networks subnets describe $SUBNET_NAME \
+  --region=$CLUSTER_REGION
+```
+
+The output is similar to the following:
+
+```
+...
+gatewayAddress: 10.1.1.1
+ipCidrRange: 10.1.1.0/24
+kind: compute#subnetwork
+name: proxy-subnet
+network: https://www.googleapis.com/compute/v1/projects/PROJECT_NAME/global/networks/default
+privateIpGoogleAccess: false
+privateIpv6GoogleAccess: DISABLE_GOOGLE_ACCESS
+purpose: REGIONAL_MANAGED_PROXY
+region: https://www.googleapis.com/compute/v1/projects/PROJECT_NAME/regions/REGION
+role: ACTIVE
+selfLink: https://www.googleapis.com/compute/v1/projects/PROJECT_NAME/regions/REGION/subnetworks/proxy-subnet
+state: READY
+```
+
+2.5. [Install the required Custom Resource Definitions (CRDs) in your GKE cluster](https://docs.cloud.google.com/kubernetes-engine/docs/how-to/deploy-gke-inference-gateway#prepare-environment):
+
+* For GKE versions `1.34.0-gke.1626000` or later, install only the alpha `InferenceObjective` CRD:
+
+  ```bash
+  kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml
+  ```
+
+* For GKE versions earlier than `1.34.0-gke.1626000`, install both the v1 `InferencePool` and alpha `InferenceObjective` CRDs:
+
+  ```bash
+  kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.0/manifests.yaml
+  ```
+
+2.6. [Deploy the GKE gateway](https://github.com/llm-d/llm-d/blob/main/guides/recipes/gateway/README.md)
+
+Apply the gateway manifest included in this recipe:
+
+```bash
+kubectl apply -f gateway.yaml -n ${NAMESPACE}
+```
+
+Alternatively, deploy the gateway from the llm-d repository. Clone the repository:
+
+```bash
+git clone https://github.com/llm-d/llm-d.git
+
+cd llm-d/guides/recipes/gateway
+```
+
+Then deploy a gateway suitable for GKE:
+
+```bash
+kubectl apply -k ./gke-l7-regional-external-managed -n ${NAMESPACE}
+```
+
+2.7. Deploy the InferencePool:
+
+```bash
+kubectl apply -f inference-pool.yaml -n ${NAMESPACE}
+```
+
+## 3. Deploy the model
+
+Install LeaderWorkerSet:
+
+```bash
+VERSION=v0.8.0
+kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/$VERSION/manifests.yaml
+```
+
+To wait for LeaderWorkerSet to be fully available, run:
+
+```bash
+kubectl wait deploy/lws-controller-manager -n lws-system --for=condition=available --timeout=5m
+```
+
+### H200 (A3 Ultra)
+
+```bash
+kubectl apply -f a3ultra/disaggregated-serving.yaml -n ${NAMESPACE}
+```
+
+### B200 (A4)
+
+```bash
+kubectl apply -f a4/disaggregated-serving.yaml -n ${NAMESPACE}
+```
+
+## 4. Verify the deployment
+
+```bash
+export GATEWAY_IP=$(kubectl get gateway llm-d-inference-gateway -n ${NAMESPACE} -o jsonpath='{.status.addresses[0].value}')
+
+curl http://$GATEWAY_IP/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek-ai/DeepSeek-R1-0528",
+    "messages": [
+      {"role": "user", "content": "Explain quantum computing in one sentence."}
+    ],
+    "max_tokens": 50
+  }'
+
+```
diff --git a/inference/llm-d/a3ultra/disaggregated-serving.yaml b/inference/llm-d/a3ultra/disaggregated-serving.yaml
new file mode 100644
index 00000000..e3228936
--- /dev/null
+++ b/inference/llm-d/a3ultra/disaggregated-serving.yaml
@@ -0,0 +1,478 @@
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: deepseek-r1
+---
+apiVersion: leaderworkerset.x-k8s.io/v1
+kind: LeaderWorkerSet
+metadata:
+  labels:
+    llm-d.ai/inferenceServing: "true"
+    llm-d.ai/model: DeepSeek-R1-0528
+    llm-d.ai/role: decode
+  name: wide-ep-llm-d-decode
+spec:
+  leaderWorkerTemplate:
+    size: 2
+    workerTemplate:
+      metadata:
+        annotations:
+          networking.gke.io/default-interface: eth0
+          networking.gke.io/interfaces: |
+            [
+              {"interfaceName":"eth0","network":"default"},
+              {"interfaceName":"eth2","network":"rdma-0"},
+              {"interfaceName":"eth3","network":"rdma-1"},
+              {"interfaceName":"eth4","network":"rdma-2"},
+              {"interfaceName":"eth5","network":"rdma-3"},
+              {"interfaceName":"eth6","network":"rdma-4"},
+              {"interfaceName":"eth7","network":"rdma-5"},
+              {"interfaceName":"eth8","network":"rdma-6"},
+              {"interfaceName":"eth9","network":"rdma-7"}
+            ]
+        labels:
+          llm-d.ai/inferenceServing: "true"
+          llm-d.ai/model: DeepSeek-R1-0528
+          llm-d.ai/role: decode
+      spec:
+        affinity:
+          podAffinity:
+            preferredDuringSchedulingIgnoredDuringExecution:
+            - podAffinityTerm:
+                labelSelector:
+                  matchLabels:
+                    llm-d.ai/role: decode
+                matchLabelKeys:
+                - component
+                topologyKey: cloud.google.com/gce-topology-block
+              weight: 2
+            - podAffinityTerm:
+                labelSelector:
+                  matchLabels:
+                    llm-d.ai/role: decode
+                matchLabelKeys:
+                - component
+                topologyKey: cloud.google.com/gce-topology-subblock
+              weight: 1
+        containers:
+        - args:
+          - |-
+            # Clear /dev/shm on start to prevent running out of space when crashes occur
+            # https://github.com/llm-d/llm-d/issues/352
+            find /dev/shm -type f -delete
+
+            #################
+            # RUN vLLM decode worker
+            #################
+            START_RANK=$(( ${LWS_WORKER_INDEX:-0} * DP_SIZE_LOCAL ))
+
+            # --data-parallel-hybrid-lb: Use external load balancing across nodes, and internal load balancing within a node
+            # --enable-expert-parallel: Use TPxDP in attention, EP in MoE layers
+            # --async-scheduling: Reduce white space between engine steps
+            # --enable-dbo: Dual batch overlap (DBO) overlaps compute with collective communication.
+
+            exec vllm serve \
+              deepseek-ai/DeepSeek-R1-0528 \
+              --port 8200 \
+              --trust-remote-code \
+              --disable-uvicorn-access-log \
+              --data-parallel-hybrid-lb \
+              --enable-expert-parallel \
+              --tensor-parallel-size $TP_SIZE \
+              --data-parallel-size $((LWS_GROUP_SIZE * DP_SIZE_LOCAL)) \
+              --data-parallel-size-local $DP_SIZE_LOCAL \
+              --data-parallel-address ${LWS_LEADER_ADDRESS} \
+              --data-parallel-rpc-port 5555 \
+              --data-parallel-start-rank $START_RANK \
+              --kv_transfer_config '{"kv_connector":"NixlConnector",
+                "kv_role":"kv_both",
+                "kv_load_failure_policy":"fail"}' \
+              --async-scheduling \
+              --enable-dbo \
+              --dbo-decode-token-threshold 32 \
+              --enable-eplb \
+              --eplb-config '{"window_size":"1000",
+                "step_interval":"3000",
+                "num_redundant_experts":"32",
+                "log_balancedness":"False"}' \
+              --compilation_config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
+              --kv-cache-memory-bytes=${KV_CACHE_MEMORY_BYTES-}
+          command:
+          - /bin/bash
+          - -c
+          env:
+          - name: VLLM_MOE_DP_CHUNK_SIZE
+            value: "384"
+          - name: DP_SIZE_LOCAL
+            value: "8"
+          - name: TP_SIZE
+            value: "1"
+          - name: TRITON_LIBCUDA_PATH
+            value: /usr/lib64
+          - name: VLLM_SKIP_P2P_CHECK
+            value: "1"
+          - name: VLLM_RANDOMIZE_DP_DUMMY_INPUTS
+            value: "1"
+          - name: VLLM_USE_DEEP_GEMM
+            value: "1"
+          - name: VLLM_ALL2ALL_BACKEND
+            value: deepep_low_latency
+          - name: NVIDIA_GDRCOPY
+            value: enabled
+          - name: NVSHMEM_REMOTE_TRANSPORT
+            value: ibgda
+          - name: NVSHMEM_IB_ENABLE_IBGDA
+            value: "true"
+          - name: NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME
+            value: eth0
+          - name: GLOO_SOCKET_IFNAME
+            value: eth0
+          - name: NCCL_SOCKET_IFNAME
+            value: eth0
+          - name: VLLM_LOGGING_LEVEL
+            value: INFO
+          - name: VLLM_NIXL_SIDE_CHANNEL_HOST
+            valueFrom:
+              fieldRef:
+                fieldPath: status.podIP
+          - name: CUDA_CACHE_PATH
+            value: /var/cache/vllm/cuda
+          - name: CCACHE_DIR
+            value: /var/cache/vllm/ccache
+          - name: VLLM_CACHE_ROOT
+            value: /var/cache/vllm/vllm
+          - name: FLASHINFER_WORKSPACE_BASE
+            value: /var/cache/vllm/flashinfer
+          - name: HF_HUB_CACHE
+            value: /var/cache/huggingface
+          - name: DEEP_EP_DEVICE_TO_HCA_MAPPING
+            value: 0:mlx5_0:1,1:mlx5_1:1,2:mlx5_2:1,3:mlx5_3:1,4:mlx5_4:1,5:mlx5_5:1,6:mlx5_6:1,7:mlx5_7:1
+          - name: BASH_ENV
+            value: /usr/local/gib/scripts/set_nccl_env.sh
+          - name: NVSHMEM_DISABLED_GDRCOPY
+            value: "true"
+          image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
+          imagePullPolicy: Always
+          livenessProbe:
+            failureThreshold: 3
+            httpGet:
+              path: /health
+              port: vllm
+            periodSeconds: 30
+            timeoutSeconds: 5
+          name: vllm
+          ports:
+          - containerPort: 8200
+            name: vllm
+            protocol: TCP
+          - containerPort: 5600
+            name: nixl
+            protocol: TCP
+          readinessProbe:
+            failureThreshold: 3
+            httpGet:
+              path: /v1/models
+              port: vllm
+            periodSeconds: 10
+            timeoutSeconds: 5
+          resources:
+            limits:
+              ephemeral-storage: 1Ti
+              memory: 512Gi
+              nvidia.com/gpu: "8"
+            requests:
+              cpu: 32
+              ephemeral-storage: 1Ti
+              memory: 512Gi
+              nvidia.com/gpu: "8"
+          securityContext:
+            capabilities:
+              add:
+              - IPC_LOCK
+              - SYS_RAWIO
+            privileged: true
+            runAsGroup: 0
+            runAsUser: 0
+          startupProbe:
+            failureThreshold: 2700
+            httpGet:
+              path: /health
+              port: vllm
+            initialDelaySeconds: 0
+            periodSeconds: 1
+            timeoutSeconds: 5
+          volumeMounts:
+          - mountPath: /dev/shm
+            name: dshm
+          - mountPath: /var/cache/huggingface
+            name: hf-cache
+          - mountPath: /var/cache/vllm
+            name: jit-cache
+          - mountPath: /usr/local/gib
+            name: gib
+        initContainers:
+        - args:
+          - --port=8000
+          - --vllm-port=8200
+          - --connector=nixlv2
+          - --zap-log-level=1
+          - --secure-proxy=false
+          - --enable-prefiller-sampling
+          image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.4.0
+          imagePullPolicy: Always
+          name: routing-proxy
+          ports:
+          - containerPort: 8000
+            name: sidecar
+            protocol: TCP
+          resources: {}
+          restartPolicy: Always
+          securityContext:
+            allowPrivilegeEscalation: false
+            runAsNonRoot: true
+        serviceAccountName: deepseek-r1
+        volumes:
+        - emptyDir:
+            medium: Memory
+            sizeLimit: 2Gi
+          name: dshm
+        - hostPath:
+            path: /mnt/stateful_partition/kube-ephemeral-ssd/shared_disk/vllm-hf-cache/
+            type: DirectoryOrCreate
+          name: hf-cache
+        - hostPath:
+            path: /mnt/stateful_partition/kube-ephemeral-ssd/shared_disk/vllm-jit-cache/
+            type: DirectoryOrCreate
+          name: jit-cache
+        - hostPath:
+            path: /home/kubernetes/bin/gib
+            type: ""
+          name: gib
+  replicas: 1
+---
+apiVersion: leaderworkerset.x-k8s.io/v1
+kind: LeaderWorkerSet
+metadata:
+  labels:
+    llm-d.ai/inferenceServing: "true"
+    llm-d.ai/model: DeepSeek-R1-0528
+    llm-d.ai/role: prefill
+  name: wide-ep-llm-d-prefill
+spec:
+  leaderWorkerTemplate:
+    size: 2
+    workerTemplate:
+      metadata:
+        annotations:
+          networking.gke.io/default-interface: eth0
+          networking.gke.io/interfaces: |
+            [
+              {"interfaceName":"eth0","network":"default"},
+              {"interfaceName":"eth2","network":"rdma-0"},
+              {"interfaceName":"eth3","network":"rdma-1"},
+              {"interfaceName":"eth4","network":"rdma-2"},
+              {"interfaceName":"eth5","network":"rdma-3"},
+              {"interfaceName":"eth6","network":"rdma-4"},
+              {"interfaceName":"eth7","network":"rdma-5"},
+              {"interfaceName":"eth8","network":"rdma-6"},
+              {"interfaceName":"eth9","network":"rdma-7"}
+            ]
+        labels:
+          llm-d.ai/inferenceServing: "true"
+          llm-d.ai/model: DeepSeek-R1-0528
+          llm-d.ai/role: prefill
+      spec:
+        affinity:
+          podAffinity:
+            preferredDuringSchedulingIgnoredDuringExecution:
+            - podAffinityTerm:
+                labelSelector:
+                  matchLabels:
+                    llm-d.ai/role: prefill
+                matchLabelKeys:
+                - component
+                topologyKey: cloud.google.com/gce-topology-block
+              weight: 2
+            - podAffinityTerm:
+                labelSelector:
+                  matchLabels:
+                    llm-d.ai/role: prefill
+                matchLabelKeys:
+                - component
+                topologyKey: cloud.google.com/gce-topology-subblock
+              weight: 1
+        containers:
+        - args:
+          - |-
+            # Clear /dev/shm on start to prevent running out of space when crashes occur
+            # https://github.com/llm-d/llm-d/issues/352
+            find /dev/shm -type f -delete
+
+            #################
+            # RUN vLLM prefill worker
+            #################
+            START_RANK=$(( ${LWS_WORKER_INDEX:-0} * DP_SIZE_LOCAL ))
+
+            # --data-parallel-hybrid-lb: Use external load balancing across nodes, and internal load balancing within a node
+            # --enable-expert-parallel: Use TPxDP in attention, EP in MoE layers
+            # --async-scheduling: Reduce white space between engine steps
+            # --enable-dbo: Dual batch overlap (DBO) overlaps compute with collective communication.
+            # --enable-eplb: Expert-parallel load balancing reduces EP load imbalance by replicating heavily-used experts
+            #   Performance-memory tradeoff: on DeepSeekV3 eplb uses an extra 2GB per redundant expert per GPU
+            #   Divisibility constraint: num_routed_experts (256 for DSv3) + num_redundant_experts must be divisible by the number of GPUs.
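+            #   Worked example for this manifest: size 2 nodes x DP_SIZE_LOCAL=8 = 16 GPUs, and
+            #   256 routed + 32 redundant = 288 experts, so each GPU holds 288 / 16 = 18 experts.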
+
+            exec vllm serve \
+              deepseek-ai/DeepSeek-R1-0528 \
+              --port 8000 \
+              --trust-remote-code \
+              --disable-uvicorn-access-log \
+              --data-parallel-hybrid-lb \
+              --enable-expert-parallel \
+              --tensor-parallel-size $TP_SIZE \
+              --data-parallel-size $((LWS_GROUP_SIZE * DP_SIZE_LOCAL)) \
+              --data-parallel-size-local $DP_SIZE_LOCAL \
+              --data-parallel-address ${LWS_LEADER_ADDRESS} \
+              --data-parallel-rpc-port 5555 \
+              --data-parallel-start-rank $START_RANK \
+              --kv_transfer_config '{"kv_connector":"NixlConnector",
+                "kv_role":"kv_both",
+                "kv_load_failure_policy":"fail"}' \
+              --async-scheduling \
+              --enable-dbo \
+              --dbo-prefill-token-threshold 32 \
+              --enable-eplb \
+              --eplb-config '{"window_size":"1000",
+                "step_interval":"3000",
+                "num_redundant_experts":"32",
+                "log_balancedness":"False"}' \
+              --gpu-memory-utilization 0.75
+          command:
+          - /bin/bash
+          - -c
+          env:
+          - name: DP_SIZE_LOCAL
+            value: "8"
+          - name: TP_SIZE
+            value: "1"
+          - name: TRITON_LIBCUDA_PATH
+            value: /usr/lib64
+          - name: VLLM_SKIP_P2P_CHECK
+            value: "1"
+          - name: VLLM_RANDOMIZE_DP_DUMMY_INPUTS
+            value: "1"
+          - name: VLLM_USE_DEEP_GEMM
+            value: "1"
+          - name: VLLM_ALL2ALL_BACKEND
+            value: deepep_high_throughput
+          - name: NVIDIA_GDRCOPY
+            value: enabled
+          - name: NVSHMEM_REMOTE_TRANSPORT
+            value: ibgda
+          - name: NVSHMEM_IB_ENABLE_IBGDA
+            value: "true"
+          - name: NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME
+            value: eth0
+          - name: GLOO_SOCKET_IFNAME
+            value: eth0
+          - name: NCCL_SOCKET_IFNAME
+            value: eth0
+          - name: VLLM_LOGGING_LEVEL
+            value: INFO
+          - name: VLLM_NIXL_SIDE_CHANNEL_HOST
+            valueFrom:
+              fieldRef:
+                fieldPath: status.podIP
+          - name: CUDA_CACHE_PATH
+            value: /var/cache/vllm/cuda
+          - name: CCACHE_DIR
+            value: /var/cache/vllm/ccache
+          - name: VLLM_CACHE_ROOT
+            value: /var/cache/vllm/vllm
+          - name: FLASHINFER_WORKSPACE_BASE
+            value: /var/cache/vllm/flashinfer
+          - name: HF_HUB_CACHE
+            value: /var/cache/huggingface
+          - name: DEEP_EP_DEVICE_TO_HCA_MAPPING
+            value: 0:mlx5_0:1,1:mlx5_1:1,2:mlx5_2:1,3:mlx5_3:1,4:mlx5_4:1,5:mlx5_5:1,6:mlx5_6:1,7:mlx5_7:1
+          - name: BASH_ENV
+            value: /usr/local/gib/scripts/set_nccl_env.sh
+          - name: NVSHMEM_DISABLED_GDRCOPY
+            value: "true"
+          image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
+          imagePullPolicy: Always
+          livenessProbe:
+            failureThreshold: 3
+            httpGet:
+              path: /health
+              port: vllm
+            periodSeconds: 30
+            timeoutSeconds: 5
+          name: vllm
+          ports:
+          - containerPort: 8000
+            name: vllm
+            protocol: TCP
+          - containerPort: 5600
+            name: nixl
+            protocol: TCP
+          readinessProbe:
+            failureThreshold: 3
+            httpGet:
+              path: /v1/models
+              port: vllm
+            periodSeconds: 10
+            timeoutSeconds: 5
+          resources:
+            limits:
+              ephemeral-storage: 1Ti
+              memory: 512Gi
+              nvidia.com/gpu: "8"
+            requests:
+              cpu: 32
+              ephemeral-storage: 1Ti
+              memory: 512Gi
+              nvidia.com/gpu: "8"
+          securityContext:
+            capabilities:
+              add:
+              - IPC_LOCK
+              - SYS_RAWIO
+            privileged: true
+            runAsGroup: 0
+            runAsUser: 0
+          startupProbe:
+            failureThreshold: 2700
+            httpGet:
+              path: /health
+              port: vllm
+            initialDelaySeconds: 0
+            periodSeconds: 1
+            timeoutSeconds: 5
+          volumeMounts:
+          - mountPath: /dev/shm
+            name: dshm
+          - mountPath: /var/cache/huggingface
+            name: hf-cache
+          - mountPath: /var/cache/vllm
+            name: jit-cache
+          - mountPath: /usr/local/gib
+            name: gib
+        serviceAccountName: deepseek-r1
+        volumes:
+        - emptyDir:
+            medium: Memory
+            sizeLimit: 2Gi
+          name: dshm
+        - hostPath:
+            path: /mnt/stateful_partition/kube-ephemeral-ssd/shared_disk/vllm-hf-cache/
+            type: DirectoryOrCreate
+          name: hf-cache
+        - hostPath:
+            path: /mnt/stateful_partition/kube-ephemeral-ssd/shared_disk/vllm-jit-cache/
+            type: DirectoryOrCreate
+          name: jit-cache
+        - hostPath:
+            path: /home/kubernetes/bin/gib
+            type: ""
+          name: gib
+  replicas: 1
diff --git a/inference/llm-d/a4/disaggregated-serving.yaml b/inference/llm-d/a4/disaggregated-serving.yaml
new file mode 100644
index 00000000..8d1ded48
--- /dev/null
+++ b/inference/llm-d/a4/disaggregated-serving.yaml
@@ -0,0 +1,480 @@
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: deepseek-r1
+---
+apiVersion: leaderworkerset.x-k8s.io/v1
+kind: LeaderWorkerSet
+metadata:
+  labels:
+    llm-d.ai/inferenceServing: "true"
+    llm-d.ai/model: DeepSeek-R1-0528
+    llm-d.ai/role: decode
+  name: wide-ep-llm-d-decode
+spec:
+  leaderWorkerTemplate:
+    size: 2
+    workerTemplate:
+      metadata:
+        annotations:
+          networking.gke.io/default-interface: eth0
+          networking.gke.io/interfaces: |
+            [
+              {"interfaceName":"eth0","network":"default"},
+              {"interfaceName":"eth2","network":"rdma-0"},
+              {"interfaceName":"eth3","network":"rdma-1"},
+              {"interfaceName":"eth4","network":"rdma-2"},
+              {"interfaceName":"eth5","network":"rdma-3"},
+              {"interfaceName":"eth6","network":"rdma-4"},
+              {"interfaceName":"eth7","network":"rdma-5"},
+              {"interfaceName":"eth8","network":"rdma-6"},
+              {"interfaceName":"eth9","network":"rdma-7"}
+            ]
+        labels:
+          llm-d.ai/inferenceServing: "true"
+          llm-d.ai/model: DeepSeek-R1-0528
+          llm-d.ai/role: decode
+      spec:
+        affinity:
+          podAffinity:
+            preferredDuringSchedulingIgnoredDuringExecution:
+            - podAffinityTerm:
+                labelSelector:
+                  matchLabels:
+                    llm-d.ai/role: decode
+                matchLabelKeys:
+                - component
+                topologyKey: cloud.google.com/gce-topology-block
+              weight: 2
+            - podAffinityTerm:
+                labelSelector:
+                  matchLabels:
+                    llm-d.ai/role: decode
+                matchLabelKeys:
+                - component
+                topologyKey: cloud.google.com/gce-topology-subblock
+              weight: 1
+        containers:
+        - args:
+          - |-
+            # Clear /dev/shm on start to prevent running out of space when crashes occur
+            # https://github.com/llm-d/llm-d/issues/352
+            find /dev/shm -type f -delete
+
+            #################
+            # RUN vLLM decode worker
+            #################
+            START_RANK=$(( ${LWS_WORKER_INDEX:-0} * DP_SIZE_LOCAL ))
+
+            # --data-parallel-hybrid-lb: Use external load balancing across nodes, and internal load balancing within a node
+            # --enable-expert-parallel: Use TPxDP in attention, EP in MoE layers
+            # --async-scheduling: Reduce white space between engine steps
+            # --enable-dbo: Dual batch overlap (DBO) overlaps compute with collective communication.
+
+            exec vllm serve \
+              deepseek-ai/DeepSeek-R1-0528 \
+              --port 8200 \
+              --trust-remote-code \
+              --disable-uvicorn-access-log \
+              --data-parallel-hybrid-lb \
+              --enable-expert-parallel \
+              --tensor-parallel-size $TP_SIZE \
+              --data-parallel-size $((LWS_GROUP_SIZE * DP_SIZE_LOCAL)) \
+              --data-parallel-size-local $DP_SIZE_LOCAL \
+              --data-parallel-address ${LWS_LEADER_ADDRESS} \
+              --data-parallel-rpc-port 5555 \
+              --data-parallel-start-rank $START_RANK \
+              --kv_transfer_config '{"kv_connector":"NixlConnector",
+                "kv_role":"kv_both",
+                "kv_load_failure_policy":"fail"}' \
+              --async-scheduling \
+              --enable-dbo \
+              --dbo-decode-token-threshold 32 \
+              --enable-eplb \
+              --eplb-config '{"window_size":"1000",
+                "step_interval":"3000",
+                "num_redundant_experts":"32",
+                "log_balancedness":"False"}' \
+              --compilation_config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
+              --kv-cache-memory-bytes=${KV_CACHE_MEMORY_BYTES-}
+          command:
+          - /bin/bash
+          - -c
+          env:
+          - name: VLLM_MOE_DP_CHUNK_SIZE
+            value: "384"
+          - name: DP_SIZE_LOCAL
+            value: "8"
+          - name: TP_SIZE
+            value: "1"
+          - name: TRITON_LIBCUDA_PATH
+            value: /usr/lib64
+          - name: VLLM_SKIP_P2P_CHECK
+            value: "1"
+          - name: VLLM_RANDOMIZE_DP_DUMMY_INPUTS
+            value: "1"
+          - name: VLLM_USE_DEEP_GEMM
+            value: "1"
+          - name: VLLM_ALL2ALL_BACKEND
+            value: deepep_low_latency
+          - name: NVIDIA_GDRCOPY
+            value: enabled
+          - name: NVSHMEM_REMOTE_TRANSPORT
+            value: ibgda
+          - name: NVSHMEM_IB_ENABLE_IBGDA
+            value: "true"
+          - name: NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME
+            value: eth0
+          - name: GLOO_SOCKET_IFNAME
+            value: eth0
+          - name: NCCL_SOCKET_IFNAME
+            value: eth0
+          - name: VLLM_LOGGING_LEVEL
+            value: INFO
+          - name: VLLM_NIXL_SIDE_CHANNEL_HOST
+            valueFrom:
+              fieldRef:
+                fieldPath: status.podIP
+          - name: CUDA_CACHE_PATH
+            value: /var/cache/vllm/cuda
+          - name: CCACHE_DIR
+            value: /var/cache/vllm/ccache
+          - name: VLLM_CACHE_ROOT
+            value: /var/cache/vllm/vllm
+          - name: FLASHINFER_WORKSPACE_BASE
+            value: /var/cache/vllm/flashinfer
+          - name: HF_HUB_CACHE
+            value: /var/cache/huggingface
+          - name: DEEP_EP_DEVICE_TO_HCA_MAPPING
+            value: 0:mlx5_0:1,1:mlx5_1:1,2:mlx5_2:1,3:mlx5_3:1,4:mlx5_4:1,5:mlx5_5:1,6:mlx5_6:1,7:mlx5_7:1
+          - name: BASH_ENV
+            value: /usr/local/gib/scripts/set_nccl_env.sh
+          - name: NVSHMEM_DISABLED_GDRCOPY
+            value: "true"
+          - name: KV_CACHE_MEMORY_BYTES
+            value: "63000000000"
+          image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
+          imagePullPolicy: Always
+          livenessProbe:
+            failureThreshold: 3
+            httpGet:
+              path: /health
+              port: vllm
+            periodSeconds: 30
+            timeoutSeconds: 5
+          name: vllm
+          ports:
+          - containerPort: 8200
+            name: vllm
+            protocol: TCP
+          - containerPort: 5600
+            name: nixl
+            protocol: TCP
+          readinessProbe:
+            failureThreshold: 3
+            httpGet:
+              path: /v1/models
+              port: vllm
+            periodSeconds: 10
+            timeoutSeconds: 5
+          resources:
+            limits:
+              ephemeral-storage: 1Ti
+              memory: 512Gi
+              nvidia.com/gpu: "8"
+            requests:
+              cpu: 32
+              ephemeral-storage: 1Ti
+              memory: 512Gi
+              nvidia.com/gpu: "8"
+          securityContext:
+            capabilities:
+              add:
+              - IPC_LOCK
+              - SYS_RAWIO
+            privileged: true
+            runAsGroup: 0
+            runAsUser: 0
+          startupProbe:
+            failureThreshold: 2700
+            httpGet:
+              path: /health
+              port: vllm
+            initialDelaySeconds: 0
+            periodSeconds: 1
+            timeoutSeconds: 5
+          volumeMounts:
+          - mountPath: /dev/shm
+            name: dshm
+          - mountPath: /var/cache/huggingface
+            name: hf-cache
+          - mountPath: /var/cache/vllm
+            name: jit-cache
+          - mountPath: /usr/local/gib
+            name: gib
+        initContainers:
+        - args:
+          - --port=8000
+          - --vllm-port=8200
+          - --connector=nixlv2
+          - --zap-log-level=1
+          - --secure-proxy=false
+          - --enable-prefiller-sampling
+          image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.4.0
+          imagePullPolicy: Always
+          name: routing-proxy
+          ports:
+          - containerPort: 8000
+            name: sidecar
+            protocol: TCP
+          resources: {}
+          restartPolicy: Always
+          securityContext:
+            allowPrivilegeEscalation: false
+            runAsNonRoot: true
+        serviceAccountName: deepseek-r1
+        volumes:
+        - emptyDir:
+            medium: Memory
+            sizeLimit: 2Gi
+          name: dshm
+        - hostPath:
+            path: /mnt/stateful_partition/kube-ephemeral-ssd/shared_disk/vllm-hf-cache/
+            type: DirectoryOrCreate
+          name: hf-cache
+        - hostPath:
+            path: /mnt/stateful_partition/kube-ephemeral-ssd/shared_disk/vllm-jit-cache/
+            type: DirectoryOrCreate
+          name: jit-cache
+        - hostPath:
+            path: /home/kubernetes/bin/gib
+            type: ""
+          name: gib
+  replicas: 1
+---
+apiVersion: leaderworkerset.x-k8s.io/v1
+kind: LeaderWorkerSet
+metadata:
+  labels:
+    llm-d.ai/inferenceServing: "true"
+    llm-d.ai/model: DeepSeek-R1-0528
+    llm-d.ai/role: prefill
+  name: wide-ep-llm-d-prefill
+spec:
+  leaderWorkerTemplate:
+    size: 2
+    workerTemplate:
+      metadata:
+        annotations:
+          networking.gke.io/default-interface: eth0
+          networking.gke.io/interfaces: |
+            [
+              {"interfaceName":"eth0","network":"default"},
+              {"interfaceName":"eth2","network":"rdma-0"},
+              {"interfaceName":"eth3","network":"rdma-1"},
+              {"interfaceName":"eth4","network":"rdma-2"},
+              {"interfaceName":"eth5","network":"rdma-3"},
+              {"interfaceName":"eth6","network":"rdma-4"},
+              {"interfaceName":"eth7","network":"rdma-5"},
+              {"interfaceName":"eth8","network":"rdma-6"},
+              {"interfaceName":"eth9","network":"rdma-7"}
+            ]
+        labels:
+          llm-d.ai/inferenceServing: "true"
+          llm-d.ai/model: DeepSeek-R1-0528
+          llm-d.ai/role: prefill
+      spec:
+        affinity:
+          podAffinity:
+            preferredDuringSchedulingIgnoredDuringExecution:
+            - podAffinityTerm:
+                labelSelector:
+                  matchLabels:
+                    llm-d.ai/role: prefill
+                matchLabelKeys:
+                - component
+                topologyKey: cloud.google.com/gce-topology-block
+              weight: 2
+            - podAffinityTerm:
+                labelSelector:
+                  matchLabels:
+                    llm-d.ai/role: prefill
+                matchLabelKeys:
+                - component
+                topologyKey: cloud.google.com/gce-topology-subblock
+              weight: 1
+        containers:
+        - args:
+          - |-
+            # Clear /dev/shm on start to prevent running out of space when crashes occur
+            # https://github.com/llm-d/llm-d/issues/352
+            find /dev/shm -type f -delete
+
+            #################
+            # RUN vLLM prefill worker
+            #################
+            START_RANK=$(( ${LWS_WORKER_INDEX:-0} * DP_SIZE_LOCAL ))
+
+            # --data-parallel-hybrid-lb: Use external load balancing across nodes, and internal load balancing within a node
+            # --enable-expert-parallel: Use TPxDP in attention, EP in MoE layers
+            # --async-scheduling: Reduce white space between engine steps
+            # --enable-dbo: Dual batch overlap (DBO) overlaps compute with collective communication.
+            # --enable-eplb: Expert-parallel load balancing reduces EP load imbalance by replicating heavily-used experts
+            #   Performance-memory tradeoff: on DeepSeekV3 eplb uses an extra 2GB per redundant expert per GPU
+            #   Divisibility constraint: num_routed_experts (256 for DSv3) + num_redundant_experts must be divisible by the number of GPUs.
+
+            exec vllm serve \
+              deepseek-ai/DeepSeek-R1-0528 \
+              --port 8000 \
+              --trust-remote-code \
+              --disable-uvicorn-access-log \
+              --data-parallel-hybrid-lb \
+              --enable-expert-parallel \
+              --tensor-parallel-size $TP_SIZE \
+              --data-parallel-size $((LWS_GROUP_SIZE * DP_SIZE_LOCAL)) \
+              --data-parallel-size-local $DP_SIZE_LOCAL \
+              --data-parallel-address ${LWS_LEADER_ADDRESS} \
+              --data-parallel-rpc-port 5555 \
+              --data-parallel-start-rank $START_RANK \
+              --kv_transfer_config '{"kv_connector":"NixlConnector",
+                "kv_role":"kv_both",
+                "kv_load_failure_policy":"fail"}' \
+              --async-scheduling \
+              --enable-dbo \
+              --dbo-prefill-token-threshold 32 \
+              --enable-eplb \
+              --eplb-config '{"window_size":"1000",
+                "step_interval":"3000",
+                "num_redundant_experts":"32",
+                "log_balancedness":"False"}' \
+              --gpu-memory-utilization 0.75
+          command:
+          - /bin/bash
+          - -c
+          env:
+          - name: DP_SIZE_LOCAL
+            value: "8"
+          - name: TP_SIZE
+            value: "1"
+          - name: TRITON_LIBCUDA_PATH
+            value: /usr/lib64
+          - name: VLLM_SKIP_P2P_CHECK
+            value: "1"
+          - name: VLLM_RANDOMIZE_DP_DUMMY_INPUTS
+            value: "1"
+          - name: VLLM_USE_DEEP_GEMM
+            value: "1"
+          - name: VLLM_ALL2ALL_BACKEND
+            value: deepep_high_throughput
+          - name: NVIDIA_GDRCOPY
+            value: enabled
+          - name: NVSHMEM_REMOTE_TRANSPORT
+            value: ibgda
+          - name: NVSHMEM_IB_ENABLE_IBGDA
+            value: "true"
+          - name: NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME
+            value: eth0
+          - name: GLOO_SOCKET_IFNAME
+            value: eth0
+          - name: NCCL_SOCKET_IFNAME
+            value: eth0
+          - name: VLLM_LOGGING_LEVEL
+            value: INFO
+          - name: VLLM_NIXL_SIDE_CHANNEL_HOST
+            valueFrom:
+              fieldRef:
+                fieldPath: status.podIP
+          - name: CUDA_CACHE_PATH
+            value: /var/cache/vllm/cuda
+          - name: CCACHE_DIR
+            value: /var/cache/vllm/ccache
+          - name: VLLM_CACHE_ROOT
+            value: /var/cache/vllm/vllm
+          - name: FLASHINFER_WORKSPACE_BASE
+            value: /var/cache/vllm/flashinfer
+          - name: HF_HUB_CACHE
+            value: /var/cache/huggingface
+          - name: DEEP_EP_DEVICE_TO_HCA_MAPPING
+            value: 0:mlx5_0:1,1:mlx5_1:1,2:mlx5_2:1,3:mlx5_3:1,4:mlx5_4:1,5:mlx5_5:1,6:mlx5_6:1,7:mlx5_7:1
+          - name: BASH_ENV
+            value: /usr/local/gib/scripts/set_nccl_env.sh
+          - name: NVSHMEM_DISABLED_GDRCOPY
+            value: "true"
+          image: ghcr.io/llm-d/llm-d-cuda:v0.4.0
+          imagePullPolicy: Always
+          livenessProbe:
+            failureThreshold: 3
+            httpGet:
+              path: /health
+              port: vllm
+            periodSeconds: 30
+            timeoutSeconds: 5
+          name: vllm
+          ports:
+          - containerPort: 8000
+            name: vllm
+            protocol: TCP
+          - containerPort: 5600
+            name: nixl
+            protocol: TCP
+          readinessProbe:
+            failureThreshold: 3
+            httpGet:
+              path: /v1/models
+              port: vllm
+            periodSeconds: 10
+            timeoutSeconds: 5
+          resources:
+            limits:
+              ephemeral-storage: 1Ti
+              memory: 512Gi
+              nvidia.com/gpu: "8"
+            requests:
+              cpu: 32
+              ephemeral-storage: 1Ti
+              memory: 512Gi
+              nvidia.com/gpu: "8"
+          securityContext:
+            capabilities:
+              add:
+              - IPC_LOCK
+              - SYS_RAWIO
+            privileged: true
+            runAsGroup: 0
+            runAsUser: 0
+          startupProbe:
+            failureThreshold: 2700
+            httpGet:
+              path: /health
+              port: vllm
+            initialDelaySeconds: 0
+            periodSeconds: 1
+            timeoutSeconds: 5
+          volumeMounts:
+          - mountPath: /dev/shm
+            name: dshm
+          - mountPath: /var/cache/huggingface
+            name: hf-cache
+          - mountPath: /var/cache/vllm
+            name: jit-cache
+          - mountPath: /usr/local/gib
+            name: gib
serviceAccountName: deepseek-r1 + volumes: + - emptyDir: + medium: Memory + sizeLimit: 2Gi + name: dshm + - hostPath: + path: /mnt/stateful_partition/kube-ephemeral-ssd/shared_disk/vllm-hf-cache/ + type: DirectoryOrCreate + name: hf-cache + - hostPath: + path: /mnt/stateful_partition/kube-ephemeral-ssd/shared_disk/vllm-jit-cache/ + type: DirectoryOrCreate + name: jit-cache + - hostPath: + path: /home/kubernetes/bin/gib + type: "" + name: gib + replicas: 1 diff --git a/inference/llm-d/gateway.yaml b/inference/llm-d/gateway.yaml new file mode 100644 index 00000000..ef2ebb9d --- /dev/null +++ b/inference/llm-d/gateway.yaml @@ -0,0 +1,33 @@ +apiVersion: gateway.networking.k8s.io/v1 +kind: Gateway +metadata: + name: llm-d-inference-gateway +spec: + gatewayClassName: gke-l7-regional-external-managed + listeners: + - allowedRoutes: + namespaces: + from: All + name: default + port: 80 + protocol: HTTP +--- +apiVersion: gateway.networking.k8s.io/v1 +kind: HTTPRoute +metadata: + name: llm-d-route +spec: + parentRefs: + - group: gateway.networking.k8s.io + kind: Gateway + name: llm-d-inference-gateway + rules: + - backendRefs: + - group: inference.networking.k8s.io + kind: InferencePool + name: llm-d-infpool + port: 8000 + matches: + - path: + type: PathPrefix + value: / diff --git a/inference/llm-d/inference-pool.yaml b/inference/llm-d/inference-pool.yaml new file mode 100644 index 00000000..7cf92236 --- /dev/null +++ b/inference/llm-d/inference-pool.yaml @@ -0,0 +1,265 @@ +--- +# Source: inferencepool/templates/rbac.yaml +apiVersion: v1 +kind: ServiceAccount +metadata: + name: llm-d-infpool-epp + labels: + app.kubernetes.io/name: llm-d-infpool-epp + app.kubernetes.io/version: "v1.3.0" +--- +# Source: inferencepool/templates/epp-config.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: llm-d-infpool-epp +data: + default-plugins.yaml: | + apiVersion: inference.networking.x-k8s.io/v1alpha1 + kind: EndpointPickerConfig + plugins: + - type: queue-scorer + - type: 
kv-cache-utilization-scorer + - type: prefix-cache-scorer + schedulingProfiles: + - name: default + plugins: + - pluginRef: queue-scorer + weight: 2 + - pluginRef: kv-cache-utilization-scorer + weight: 2 + - pluginRef: prefix-cache-scorer + weight: 3 + custom-plugins.yaml: | + # ALWAYS DO PD IN THIS EXAMPLE (THRESHOLD 0) + # This example uses random routing + # since it's not yet possible to route to individual DP ranks + apiVersion: inference.networking.x-k8s.io/v1alpha1 + kind: EndpointPickerConfig + plugins: + - type: prefill-header-handler + - type: prefill-filter + - type: decode-filter + - type: random-picker + parameters: + maxNumOfEndpoints: 1 + - type: pd-profile-handler + parameters: + threshold: 0 + hashBlockSize: 5 + schedulingProfiles: + - name: prefill + plugins: + - pluginRef: prefill-filter + - pluginRef: random-picker + - name: decode + plugins: + - pluginRef: decode-filter + - pluginRef: random-picker +--- +# Source: inferencepool/templates/rbac.yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: Role +metadata: + name: llm-d-infpool-epp + labels: + app.kubernetes.io/name: llm-d-infpool-epp + app.kubernetes.io/version: "v1.3.0" +rules: +- apiGroups: ["inference.networking.x-k8s.io"] + resources: ["inferenceobjectives", "inferencemodelrewrites"] + verbs: ["get", "watch", "list"] +- apiGroups: ["inference.networking.k8s.io"] + resources: ["inferencepools"] + verbs: ["get", "watch", "list"] +- apiGroups: [""] + resources: ["pods"] + verbs: ["get", "watch", "list"] +--- +# Source: inferencepool/templates/rbac.yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: llm-d-infpool-epp +subjects: +- kind: ServiceAccount + name: llm-d-infpool-epp +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: Role + name: llm-d-infpool-epp +--- +# Source: inferencepool/templates/epp-service.yaml +apiVersion: v1 +kind: Service +metadata: + name: llm-d-infpool-epp + labels: + app.kubernetes.io/name: llm-d-infpool-epp + 
app.kubernetes.io/version: "v1.3.0"
+spec:
+  selector:
+    inferencepool: llm-d-infpool-epp
+  ports:
+    - name: grpc-ext-proc
+      protocol: TCP
+      port: 9002
+    - name: http-metrics
+      protocol: TCP
+      port: 9090
+  type: ClusterIP
+---
+# Source: inferencepool/templates/epp-deployment.yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: llm-d-infpool-epp
+  labels:
+    app.kubernetes.io/name: llm-d-infpool-epp
+    app.kubernetes.io/version: "v1.3.0"
+spec:
+  replicas: 1
+  strategy:
+    # The current recommended EPP deployment pattern is to have a single active replica. This ensures
+    # optimal performance of stateful operations such as the prefix-cache-aware scorer.
+    # With the Recreate strategy, the old replica is killed immediately, allowing the new replica(s)
+    # to take over quickly. This is particularly important in the high-availability setup with leader
+    # election, where a rolling update would prevent the old leader from being killed, because
+    # maxUnavailable would otherwise be 100%.
+    type: Recreate
+  selector:
+    matchLabels:
+      inferencepool: llm-d-infpool-epp
+  template:
+    metadata:
+      labels:
+        inferencepool: llm-d-infpool-epp
+    spec:
+      serviceAccountName: llm-d-infpool-epp
+      # Conservatively, this timeout should mirror the longest grace period of the pods within the pool.
+      terminationGracePeriodSeconds: 130
+      containers:
+      - name: epp
+        image: ghcr.io/llm-d/llm-d-inference-scheduler:v0.4.0
+        imagePullPolicy: Always
+        args:
+        - --pool-name
+        - llm-d-infpool
+        # The pool namespace is optional because EPP can default to the NAMESPACE env var.
+        # We still keep this here so that the template works with older versions of EPP, or other
+        # distros of EPP which may not have implemented the NAMESPACE env var defaulting behavior.
+        - --pool-namespace
+        - default
+        - --pool-group
+        - "inference.networking.k8s.io"
+        - --zap-encoder
+        - "json"
+        - --config-file
+        - "/config/custom-plugins.yaml"
+        # Additional EPP flags can be appended to this args list.
+        - --kv-cache-usage-percentage-metric
+        - "vllm:kv_cache_usage_perc"
+        - --v
+        - "1"
+        - --tracing=false
+        ports:
+        - name: grpc
+          containerPort: 9002
+        - name: grpc-health
+          containerPort: 9003
+        - name: metrics
+          containerPort: 9090
+        livenessProbe:
+          grpc:
+            port: 9003
+            service: inference-extension
+          initialDelaySeconds: 5
+          periodSeconds: 10
+        readinessProbe:
+          grpc:
+            port: 9003
+            service: inference-extension
+          periodSeconds: 2
+        env:
+        - name: NAMESPACE
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.namespace
+        - name: POD_NAME
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.name
+        volumeMounts:
+        - name: plugins-config-volume
+          mountPath: "/config"
+      volumes:
+      - name: plugins-config-volume
+        configMap:
+          name: llm-d-infpool-epp
+---
+# Source: inferencepool/templates/gke.yaml
+apiVersion: networking.gke.io/v1
+kind: GCPBackendPolicy
+metadata:
+  name: llm-d-infpool
+  labels:
+    app.kubernetes.io/name: llm-d-infpool-epp
+    app.kubernetes.io/version: "v1.3.0"
+spec:
+  targetRef:
+    group: "inference.networking.k8s.io"
+    kind: InferencePool
+    name: llm-d-infpool
+  default:
+    timeoutSec: 300  # 5-minute timeout (adjust as needed)
+    logging:
+      enabled: true  # log all requests by default
+---
+# Source: inferencepool/templates/gke.yaml
+kind: HealthCheckPolicy
+apiVersion: networking.gke.io/v1
+metadata:
+  name: llm-d-infpool
+  labels:
+    app.kubernetes.io/name: llm-d-infpool-epp
+    app.kubernetes.io/version: "v1.3.0"
+spec:
+  targetRef:
+    group: "inference.networking.k8s.io"
+    kind: InferencePool
+    name: llm-d-infpool
+  default:
+    # Set a more aggressive health check than the default 5s for a faster
+    # switchover during EPP rollouts.
+ timeoutSec: 2 + checkIntervalSec: 2 + config: + type: HTTP + httpHealthCheck: + requestPath: /health + port: 8000 +--- +# Source: inferencepool/templates/inferencepool.yaml +apiVersion: "inference.networking.k8s.io/v1" +kind: InferencePool +metadata: + name: llm-d-infpool + labels: + app.kubernetes.io/name: llm-d-infpool-epp + app.kubernetes.io/version: "v1.3.0" +spec: + targetPorts: + - number: 8000 + selector: + matchLabels: + llm-d.ai/inferenceServing: "true" + llm-d.ai/model: "DeepSeek-R1-0528" + endpointPickerRef: + name: llm-d-infpool-epp + port: + number: 9002