diff --git a/inference/a4x/single-host-serving/tensorrt-llm-gcs/README.md b/inference/a4x/single-host-serving/tensorrt-llm-gcs/README.md
index d3bc7800..d1151af4 100644
--- a/inference/a4x/single-host-serving/tensorrt-llm-gcs/README.md
+++ b/inference/a4x/single-host-serving/tensorrt-llm-gcs/README.md
@@ -7,21 +7,21 @@ This guide walks you through setting up the necessary cloud infrastructure, conf
## Table of Contents
-* [1. Test Environment](#test-environment)
-* [2. High-Level Architecture](#architecture)
-* [3. Environment Setup (One-Time)](#environment-setup)
- * [3.1. Clone the Repository](#clone-repo)
- * [3.2. Configure Environment Variables](#configure-vars)
- * [3.3. Connect to your GKE Cluster](#connect-cluster)
- * [3.4. Upload the Model Checkpoints](#upload-the-model-checkpoints)
- * [3.5. Create Persistent Volumes and Persistent Volume Claims](#create-persistent-volumes-and-persistent-volume-claims)
- * [3.6. Grant Storage Permissions to Kubernetes Service Account](#grant-storage-permission-to-kubernetes-service-account)
-* [4. Run the Recipe](#run-the-recipe)
- * [4.1. Inference benchmark for DeepSeek-R1 671B](#serving-deepseek-r1-671b)
-* [5. Monitoring and Troubleshooting](#monitoring)
- * [5.1. Check Deployment Status](#check-status)
- * [5.2. View Logs](#view-logs)
-* [6. Cleanup](#cleanup)
+- [1. Test Environment](#1-test-environment)
+- [2. High-Level Flow](#2-high-level-flow)
+- [3. Environment Setup (One-Time)](#3-environment-setup-one-time)
+ - [3.1. Clone the Repository](#31-clone-the-repository)
+ - [3.2. Configure Environment Variables](#32-configure-environment-variables)
+ - [3.3. Connect to your GKE Cluster](#33-connect-to-your-gke-cluster)
+ - [3.4. Upload the Model Checkpoints](#34-upload-the-model-checkpoints)
+ - [3.5. Create Persistent Volumes and Persistent Volume Claims](#35-create-persistent-volumes-and-persistent-volume-claims)
+ - [3.6. Grant Storage Permission to Kubernetes Service Account](#36-grant-storage-permission-to-kubernetes-service-account)
+- [4. Run the recipe](#4-run-the-recipe)
+ - [4.1. Inference benchmark for DeepSeek-R1 671B Model](#41-inference-benchmark-for-deepseek-r1-671b-model)
+- [5. Monitoring and Troubleshooting](#5-monitoring-and-troubleshooting)
+ - [5.1. Check Deployment Status](#51-check-deployment-status)
+ - [5.2. View Logs](#52-view-logs)
+- [6. Cleanup](#6-cleanup)
## 1. Test Environment
@@ -40,7 +40,7 @@ This recipe has been optimized for and tested with the following configuration:
* A [regional standard cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/configuration-overview) version: `1.33.4-gke.1036000` or later.
* A GPU node pool with 1 [a4x-highgpu-4g](https://cloud.google.com/compute/docs/gpus) machine.
* [Workload Identity Federation for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled.
- * [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled.
+ * [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled. This recipe uses Cloud Storage FUSE (GCSFuse) CSI driver version `v1.22.4-gke.2`.
* [DCGM metrics](https://cloud.google.com/kubernetes-engine/docs/how-to/dcgm-metrics) enabled.
* [Kueue](https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta1/) and [JobSet](https://jobset.sigs.k8s.io/docs/overview/) APIs installed.
* Kueue configured to support [Topology Aware Scheduling](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/).
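Before running the recipe, it can help to confirm the cluster meets these prerequisites. A minimal sketch, assuming `gcloud` is authenticated and `CLUSTER_NAME` and `REGION` are placeholders for your cluster's name and region:

```shell
# Print the cluster version, whether the Cloud Storage FUSE CSI driver
# addon is enabled, and the Workload Identity pool (empty if disabled).
gcloud container clusters describe "${CLUSTER_NAME}" \
  --region "${REGION}" \
  --format="value(currentMasterVersion, addonsConfig.gcsFuseCsiDriverConfig.enabled, workloadIdentityConfig.workloadPool)"
```

If the second field is empty or `False`, enable the driver with `--update-addons=GcsFuseCsiDriver=ENABLED` before proceeding.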
@@ -169,7 +169,7 @@ To download the model from HuggingFace, please follow the steps below:
1. [Mount the bucket](https://docs.cloud.google.com/storage/docs/cloud-storage-fuse/mount-bucket)
to your local system.
1. Access into the mount point and create the model folder.
-2. Under the mount point,
+2. Under the model folder,
[download](https://huggingface.co/docs/hub/en/models-downloading) the model
using the `hf` command:
@@ -183,8 +183,8 @@ to your local system.
The inference deployment accesses GCS buckets for serving model through
[the Cloud Storage FUSE CSI driver](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver)
configured using Kubernetes Persistent Volumes (PV) and Persistent Volume
-Claims (PVC). You must generate PVs and PVCs for serving modelbucket using the
-[gcs-fuse helper Helm chart](../../../../src/helm-charts/storage/gcs-fuse).
+Claims (PVC). You must generate PVs and PVCs for the serving model bucket using the
+[GCSFuse helper Helm chart](../../../../src/helm-charts/storage/gcs-fuse).
The chart configures the FUSE driver settings following the best practices
for optimizing access to buckets for serving model.
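One way to sanity-check the chart's output before touching the cluster is to render it locally. A sketch, assuming the release name `gcs-fuse` and a values file `my-values.yaml` are placeholders; see the chart's own `values.yaml` for the real interface:

```shell
# Render the PV/PVC manifests locally, then validate them client-side
# without creating anything on the cluster.
helm template gcs-fuse src/helm-charts/storage/gcs-fuse \
  -f my-values.yaml | kubectl apply --dry-run=client -f -
```

Dropping `--dry-run=client` applies the rendered PV and PVC for real once the output looks correct.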
diff --git a/inference/a4x/single-host-serving/tensorrt-llm-gcs/values.yaml b/inference/a4x/single-host-serving/tensorrt-llm-gcs/values.yaml
index c679e06a..e0e0ca91 100644
--- a/inference/a4x/single-host-serving/tensorrt-llm-gcs/values.yaml
+++ b/inference/a4x/single-host-serving/tensorrt-llm-gcs/values.yaml
@@ -50,10 +50,10 @@ workload:
- isl: 128
osl: 128
num_requests: 1000
+ gcsSidecarImage: us.gcr.io/gke-release/gcr.io/gcs-fuse-csi-driver-sidecar-mounter:v1.22.4-gke.2
network:
gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-arm64:v1.0.7
ncclSettings:
- name: NCCL_DEBUG
value: "VERSION"
-
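Since the pinned `gcsSidecarImage` tag should match the GCSFuse CSI driver version the cluster actually runs, it is worth checking the managed driver before relying on `v1.22.4-gke.2`. A sketch, assuming `kubectl` access; the `k8s-app=gcs-fuse-csi-driver` label is an assumption about the managed driver's pods:

```shell
# Inspect the image tags of the GCSFuse CSI driver node pods in kube-system;
# the tag indicates the driver version deployed on the cluster.
kubectl get pods -n kube-system -l k8s-app=gcs-fuse-csi-driver \
  -o jsonpath='{.items[0].spec.containers[*].image}'
```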
diff --git a/inference/a4x/single-host-serving/tensorrt-llm-lustre/README.md b/inference/a4x/single-host-serving/tensorrt-llm-lustre/README.md
index eb008c39..09f9684a 100644
--- a/inference/a4x/single-host-serving/tensorrt-llm-lustre/README.md
+++ b/inference/a4x/single-host-serving/tensorrt-llm-lustre/README.md
@@ -206,7 +206,7 @@ To download the model from HuggingFace, please follow the steps below:
1. Follow these [instructions](https://docs.cloud.google.com/managed-lustre/docs/connect-from-compute-engine) to create a compute
engine and mount your Lustre instance on it.
3. Access the mount point on the compute engine and create the model folder.
-4. Under the mount point,
+4. Under the model folder,
[download](https://huggingface.co/docs/hub/en/models-downloading) the model
using the `hf` command:
diff --git a/src/helm-charts/storage/gcs-fuse/templates/pv.yaml b/src/helm-charts/storage/gcs-fuse/templates/pv.yaml
index efafd9f9..b75473c3 100644
--- a/src/helm-charts/storage/gcs-fuse/templates/pv.yaml
+++ b/src/helm-charts/storage/gcs-fuse/templates/pv.yaml
@@ -78,9 +78,9 @@ spec:
- metadata-cache:ttl-secs:-1
- metadata-cache:stat-cache-max-size-mb:-1
- metadata-cache:type-cache-max-size-mb:-1
- - read_ahead_kb=1024
- - file-cache:max-size-mb:0
- - file-cache:enable-parallel-downloads:false
+ - file-system:kernel-list-cache-ttl-secs:-1
+ - read:enable-buffered-read:true
+ - read_ahead_kb=131072
{{- if $gcs.dirPath }}
- only-dir:{{ $gcs.dirPath }}
{{- end }}
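Rendered with `dirPath` unset, the hunk above produces a PV `mountOptions` block roughly like the following; the file-cache options are dropped, kernel list caching and buffered reads are enabled, and the kernel read-ahead window grows from 1 MiB to 128 MiB:

```yaml
spec:
  mountOptions:
    - metadata-cache:ttl-secs:-1
    - metadata-cache:stat-cache-max-size-mb:-1
    - metadata-cache:type-cache-max-size-mb:-1
    - file-system:kernel-list-cache-ttl-secs:-1
    - read:enable-buffered-read:true
    - read_ahead_kb=131072
```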