40 changes: 21 additions & 19 deletions inference/a4x/single-host-serving/tensorrt-llm-gcs/README.md
@@ -7,21 +7,23 @@ This guide walks you through setting up the necessary cloud infrastructure, conf
<a name="table-of-contents"></a>
## Table of Contents

* [1. Test Environment](#test-environment)
* [2. High-Level Architecture](#architecture)
* [3. Environment Setup (One-Time)](#environment-setup)
* [3.1. Clone the Repository](#clone-repo)
* [3.2. Configure Environment Variables](#configure-vars)
* [3.3. Connect to your GKE Cluster](#connect-cluster)
* [3.4. Upload the Model Checkpoints](#upload-the-model-checkpoints)
* [3.5. Create Persistent Volumes and Persistent Volume Claims](#create-persistent-volumes-and-persistent-volume-claims)
* [3.6. Grant Storage Permissions to Kubernetes Service Account](#grant-storage-permission-to-kubernetes-service-account)
* [4. Run the Recipe](#run-the-recipe)
* [4.1. Inference benchmark for DeepSeek-R1 671B](#serving-deepseek-r1-671b)
* [5. Monitoring and Troubleshooting](#monitoring)
* [5.1. Check Deployment Status](#check-status)
* [5.2. View Logs](#view-logs)
* [6. Cleanup](#cleanup)
- [Single Host Model Serving with NVIDIA TensorRT-LLM (TRT-LLM) and Google Cloud Storage on A4X GKE Node Pool](#single-host-model-serving-with-nvidia-tensorrt-llm-trt-llm-and-google-cloud-storage-on-a4x-gke-node-pool)
- [Table of Contents](#table-of-contents)
Collaborator commented:
Are these needed? https://screenshot.googleplex.com/4AaDwcrTZHxo288 (Looks a bit redundant with table of contents above)

Contributor Author replied:
This table of contents is automatically generated and updated by our formatter. I have to revert it every time I modify the file, so I was thinking of just leaving it as is. However, I can revert the change again if we want to avoid those first two lines.

Contributor Author replied:
I removed the first two titles and saved without formatting. Could you please take a look again? Thank you!

- [1. Test Environment](#1-test-environment)
- [2. High-Level Flow](#2-high-level-flow)
- [3. Environment Setup (One-Time)](#3-environment-setup-one-time)
- [3.1. Clone the Repository](#31-clone-the-repository)
- [3.2. Configure Environment Variables](#32-configure-environment-variables)
- [3.3. Connect to your GKE Cluster](#33-connect-to-your-gke-cluster)
- [3.4 Upload the Model Checkpoints](#34-upload-the-model-checkpoints)
- [3.5 Create Persistent Volumes and Persistent Volume Claims](#35-create-persistent-volumes-and-persistent-volume-claims)
- [3.6 Grant Storage Permission to Kubernetes Service Account](#36-grant-storage-permission-to-kubernetes-service-account)
- [4. Run the recipe](#4-run-the-recipe)
- [4.1. Inference benchmark for DeepSeek-R1 671B Model](#41-inference-benchmark-for-deepseek-r1-671b-model)
- [5. Monitoring and Troubleshooting](#5-monitoring-and-troubleshooting)
- [5.1. Check Deployment Status](#51-check-deployment-status)
- [5.2. View Logs](#52-view-logs)
- [6. Cleanup](#6-cleanup)

<a name="test-environment"></a>
## 1. Test Environment
@@ -40,7 +42,7 @@ This recipe has been optimized for and tested with the following configuration:
* A [regional standard cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/configuration-overview) version: `1.33.4-gke.1036000` or later.
* A GPU node pool with 1 [a4x-highgpu-4g](https://cloud.google.com/compute/docs/gpus) machine.
* [Workload Identity Federation for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled.
* [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled.
* [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled. In this recipe, we will use GCSFuse CSI version `v1.22.4-gke.2`.
* [DCGM metrics](https://cloud.google.com/kubernetes-engine/docs/how-to/dcgm-metrics) enabled.
* [Kueue](https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta1/) and [JobSet](https://jobset.sigs.k8s.io/docs/overview/) APIs installed.
* Kueue configured to support [Topology Aware Scheduling](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/).
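Before proceeding, it can help to confirm that the storage prerequisites above are actually present on the cluster. The following is a hedged sketch (the CSI driver name is the standard one GKE registers, but the label selector on the driver pods is an assumption and may differ by GKE version):

```shell
# Verify the Cloud Storage FUSE CSI driver is registered on the cluster.
kubectl get csidriver gcsfuse.csi.storage.gke.io

# Inspect the driver pods and their image tags to confirm the GCSFuse CSI
# version (label selector is an assumption; adjust for your GKE version).
kubectl get pods -n kube-system -l k8s-app=gcs-fuse-csi-driver \
  -o jsonpath='{.items[*].spec.containers[*].image}'
```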
@@ -169,7 +171,7 @@ To download the model from HuggingFace, please follow the steps below:
1. [Mount the bucket](https://docs.cloud.google.com/storage/docs/cloud-storage-fuse/mount-bucket)
to your local system.
1. Access the mount point and create the model folder.
2. Under the mount point,
2. Under the model folder,
[download](https://huggingface.co/docs/hub/en/models-downloading) the model
using the `hf` command:

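The steps above can be sketched as follows. The bucket name, mount point, and model folder are placeholders, and the `gcsfuse` invocation is a minimal form; see the linked mounting guide for the recommended flags:

```shell
# Mount the GCS bucket locally (bucket name is a placeholder).
mkdir -p /mnt/model-bucket
gcsfuse my-model-bucket /mnt/model-bucket

# Create the model folder under the mount point, then download into it.
mkdir -p /mnt/model-bucket/deepseek-r1
cd /mnt/model-bucket/deepseek-r1
hf download deepseek-ai/DeepSeek-R1 --local-dir .
```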
@@ -183,8 +185,8 @@
The inference deployment accesses GCS buckets for the serving model through
[the Cloud Storage FUSE CSI driver](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver)
configured using Kubernetes Persistent Volumes (PV) and Persistent Volume
Claims (PVC). You must generate PVs and PVCs for serving modelbucket using the
[gcs-fuse helper Helm chart](../../../../src/helm-charts/storage/gcs-fuse).
Claims (PVC). You must generate PVs and PVCs for serving model bucket using the
[GCSFuse helper Helm chart](../../../../src/helm-charts/storage/gcs-fuse).
The chart configures the FUSE driver settings following the best practices
for optimizing access to model-serving buckets.

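For context, generating the PVs and PVCs with the chart typically looks like the following. This is a sketch under assumptions: the release name is arbitrary, and the `--set` key for the bucket name is hypothetical, so consult the chart's `values.yaml` for the actual value keys:

```shell
# Install the GCSFuse helper chart from the repository root
# (release name and value key are illustrative, not authoritative).
helm install serving-model-storage src/helm-charts/storage/gcs-fuse \
  --set bucketName=my-model-bucket

# Confirm the resulting PV and PVC exist and are bound.
kubectl get pv,pvc
```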
@@ -50,10 +50,10 @@ workload:
- isl: 128
osl: 128
num_requests: 1000
gcsSidecarImage: us.gcr.io/gke-release/gcr.io/gcs-fuse-csi-driver-sidecar-mounter:v1.22.4-gke.2

network:
gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-arm64:v1.0.7
ncclSettings:
- name: NCCL_DEBUG
value: "VERSION"

@@ -206,7 +206,7 @@ To download the model from HuggingFace, please follow the steps below:
1. Follow these [instructions](https://docs.cloud.google.com/managed-lustre/docs/connect-from-compute-engine) to create a Compute
Engine instance and mount your Lustre instance on it.
3. Access the mount point on the Compute Engine instance and create the model folder.
4. Under the mount point,
4. Under the model folder,
[download](https://huggingface.co/docs/hub/en/models-downloading) the model
using the `hf` command:
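The Lustre variant of these steps can be sketched as follows. The mount source address, filesystem name, and paths are placeholders; the exact `mount` command comes from the linked Managed Lustre instructions:

```shell
# Mount the Managed Lustre instance on the Compute Engine VM
# (address and filesystem name are placeholders).
sudo mkdir -p /mnt/lustre
sudo mount -t lustre 10.0.0.2@tcp:/lustrefs /mnt/lustre

# Create the model folder and download the model into it.
mkdir -p /mnt/lustre/deepseek-r1
cd /mnt/lustre/deepseek-r1
hf download deepseek-ai/DeepSeek-R1 --local-dir .
```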

6 changes: 3 additions & 3 deletions src/helm-charts/storage/gcs-fuse/templates/pv.yaml
@@ -78,9 +78,9 @@ spec:
- metadata-cache:ttl-secs:-1
- metadata-cache:stat-cache-max-size-mb:-1
- metadata-cache:type-cache-max-size-mb:-1
- read_ahead_kb=1024
- file-cache:max-size-mb:0
- file-cache:enable-parallel-downloads:false
- file-system:kernel-list-cache-ttl-secs:-1
- read:enable-buffered-read:true
- read_ahead_kb=131072
{{- if $gcs.dirPath }}
- only-dir:{{ $gcs.dirPath }}
{{- end }}
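For context on the `read_ahead_kb=131072` change above: it raises the kernel readahead window for the FUSE mount from 1 MiB to 128 MiB, which favors the large sequential reads typical of loading model checkpoints. A hedged way to verify the effective setting on a node mounting the volume (the backing-device IDs under `/sys/class/bdi/` vary per mount, so this lists all of them):

```shell
# Print the readahead setting for every backing device; the FUSE mount's
# entry should show 131072 after the change (device IDs vary per mount).
grep -H . /sys/class/bdi/*/read_ahead_kb
```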