40 changes: 21 additions & 19 deletions inference/a4x/single-host-serving/tensorrt-llm-gcs/README.md
@@ -7,21 +7,23 @@ This guide walks you through setting up the necessary cloud infrastructure, conf
<a name="table-of-contents"></a>
## Table of Contents

* [1. Test Environment](#test-environment)
* [2. High-Level Architecture](#architecture)
* [3. Environment Setup (One-Time)](#environment-setup)
* [3.1. Clone the Repository](#clone-repo)
* [3.2. Configure Environment Variables](#configure-vars)
* [3.3. Connect to your GKE Cluster](#connect-cluster)
* [3.4. Upload the Model Checkpoints](#upload-the-model-checkpoints)
* [3.5. Create Persistent Volumes and Persistent Volume Claims](#create-persistent-volumes-and-persistent-volume-claims)
* [3.6. Grant Storage Permissions to Kubernetes Service Account](#grant-storage-permission-to-kubernetes-service-account)
* [4. Run the Recipe](#run-the-recipe)
* [4.1. Inference benchmark for DeepSeek-R1 671B](#serving-deepseek-r1-671b)
* [5. Monitoring and Troubleshooting](#monitoring)
* [5.1. Check Deployment Status](#check-status)
* [5.2. View Logs](#view-logs)
* [6. Cleanup](#cleanup)
- [Single Host Model Serving with NVIDIA TensorRT-LLM (TRT-LLM) and Google Cloud Storage on A4X GKE Node Pool](#single-host-model-serving-with-nvidia-tensorrt-llm-trt-llm-and-google-cloud-storage-on-a4x-gke-node-pool)
- [Table of Contents](#table-of-contents)
Collaborator commented:
Are these needed? https://screenshot.googleplex.com/4AaDwcrTZHxo288 (Looks a bit redundant with table of contents above)

Contributor Author replied:
This table of contents is automatically generated and updated by our formatter. I have to revert it every time I modify the file, so I was thinking of just leaving it as is. However, I can revert the change again if we want to avoid those first two lines.

Contributor Author replied:
I removed the first two titles and saved without formatting. Could you please take a look again? Thank you!

- [1. Test Environment](#1-test-environment)
- [2. High-Level Flow](#2-high-level-flow)
- [3. Environment Setup (One-Time)](#3-environment-setup-one-time)
- [3.1. Clone the Repository](#31-clone-the-repository)
- [3.2. Configure Environment Variables](#32-configure-environment-variables)
- [3.3. Connect to your GKE Cluster](#33-connect-to-your-gke-cluster)
- [3.4 Upload the Model Checkpoints](#34-upload-the-model-checkpoints)
- [3.5 Create Persistent Volumes and Persistent Volume Claims](#35-create-persistent-volumes-and-persistent-volume-claims)
- [3.6 Grant Storage Permission to Kubernetes Service Account](#36-grant-storage-permission-to-kubernetes-service-account)
- [4. Run the recipe](#4-run-the-recipe)
- [4.1. Inference benchmark for DeepSeek-R1 671B Model](#41-inference-benchmark-for-deepseek-r1-671b-model)
- [5. Monitoring and Troubleshooting](#5-monitoring-and-troubleshooting)
- [5.1. Check Deployment Status](#51-check-deployment-status)
- [5.2. View Logs](#52-view-logs)
- [6. Cleanup](#6-cleanup)

<a name="test-environment"></a>
## 1. Test Environment
@@ -40,7 +42,7 @@ This recipe has been optimized for and tested with the following configuration:
* A [regional standard cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/configuration-overview) version: `1.33.4-gke.1036000` or later.
* A GPU node pool with 1 [a4x-highgpu-4g](https://cloud.google.com/compute/docs/gpus) machine.
* [Workload Identity Federation for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled.
* [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled.
* [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled. In this recipe, we will use GCSFuse CSI version `v1.22.4-gke.2`.
* [DCGM metrics](https://cloud.google.com/kubernetes-engine/docs/how-to/dcgm-metrics) enabled.
* [Kueue](https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta1/) and [JobSet](https://jobset.sigs.k8s.io/docs/overview/) APIs installed.
* Kueue configured to support [Topology Aware Scheduling](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/).
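Before proceeding, it can help to confirm that the storage prerequisites above are actually present on the cluster. The following is a hedged sketch (the CSI driver name is the standard one GKE registers, but the label selector on the driver pods is an assumption and may differ by GKE version):

```shell
# Verify the Cloud Storage FUSE CSI driver is registered on the cluster.
kubectl get csidriver gcsfuse.csi.storage.gke.io

# Inspect the driver pods and their image tags to confirm the GCSFuse CSI
# version (label selector is an assumption; adjust for your GKE version).
kubectl get pods -n kube-system -l k8s-app=gcs-fuse-csi-driver \
  -o jsonpath='{.items[*].spec.containers[*].image}'
```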
@@ -169,7 +171,7 @@ To download the model from HuggingFace, please follow the steps below:
1. [Mount the bucket](https://docs.cloud.google.com/storage/docs/cloud-storage-fuse/mount-bucket)
to your local system.
1. Access the mount point and create the model folder.
2. Under the mount point,
2. Under the model folder,
[download](https://huggingface.co/docs/hub/en/models-downloading) the model
using the `hf` command:

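The steps above can be sketched as follows. The bucket name, mount point, and model folder are placeholders, and the `gcsfuse` invocation is a minimal form; see the linked mounting guide for the recommended flags:

```shell
# Mount the GCS bucket locally (bucket name is a placeholder).
mkdir -p /mnt/model-bucket
gcsfuse my-model-bucket /mnt/model-bucket

# Create the model folder under the mount point, then download into it.
mkdir -p /mnt/model-bucket/deepseek-r1
cd /mnt/model-bucket/deepseek-r1
hf download deepseek-ai/DeepSeek-R1 --local-dir .
```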
@@ -183,8 +185,8 @@
The inference deployment accesses GCS buckets for the serving model through
[the Cloud Storage FUSE CSI driver](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver)
configured using Kubernetes Persistent Volumes (PV) and Persistent Volume
Claims (PVC). You must generate PVs and PVCs for serving modelbucket using the
[gcs-fuse helper Helm chart](../../../../src/helm-charts/storage/gcs-fuse).
Claims (PVC). You must generate PVs and PVCs for serving model bucket using the
[GCSFuse helper Helm chart](../../../../src/helm-charts/storage/gcs-fuse).
The chart configures the FUSE driver settings following the best practices
for optimizing access to model-serving buckets.

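For context, generating the PVs and PVCs with the chart typically looks like the following. This is a sketch under assumptions: the release name is arbitrary, and the `--set` key for the bucket name is hypothetical, so consult the chart's `values.yaml` for the actual value keys:

```shell
# Install the GCSFuse helper chart from the repository root
# (release name and value key are illustrative, not authoritative).
helm install serving-model-storage src/helm-charts/storage/gcs-fuse \
  --set bucketName=my-model-bucket

# Confirm the resulting PV and PVC exist and are bound.
kubectl get pv,pvc
```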
@@ -50,10 +50,10 @@ workload:
- isl: 128
osl: 128
num_requests: 1000
gcsSidecarImage: us.gcr.io/gke-release/gcr.io/gcs-fuse-csi-driver-sidecar-mounter:v1.22.4-gke.2

network:
gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-arm64:v1.0.7
ncclSettings:
- name: NCCL_DEBUG
value: "VERSION"

@@ -206,7 +206,7 @@ To download the model from HuggingFace, please follow the steps below:
1. Follow these [instructions](https://docs.cloud.google.com/managed-lustre/docs/connect-from-compute-engine) to create a Compute
Engine instance and mount your Lustre instance on it.
3. Access the mount point on the Compute Engine instance and create the model folder.
4. Under the mount point,
4. Under the model folder,
[download](https://huggingface.co/docs/hub/en/models-downloading) the model
using the `hf` command:
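The Lustre variant of these steps can be sketched as follows. The mount source address, filesystem name, and paths are placeholders; the exact `mount` command comes from the linked Managed Lustre instructions:

```shell
# Mount the Managed Lustre instance on the Compute Engine VM
# (address and filesystem name are placeholders).
sudo mkdir -p /mnt/lustre
sudo mount -t lustre 10.0.0.2@tcp:/lustrefs /mnt/lustre

# Create the model folder and download the model into it.
mkdir -p /mnt/lustre/deepseek-r1
cd /mnt/lustre/deepseek-r1
hf download deepseek-ai/DeepSeek-R1 --local-dir .
```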

6 changes: 3 additions & 3 deletions src/helm-charts/storage/gcs-fuse/templates/pv.yaml
@@ -78,9 +78,9 @@ spec:
- metadata-cache:ttl-secs:-1
- metadata-cache:stat-cache-max-size-mb:-1
- metadata-cache:type-cache-max-size-mb:-1
- read_ahead_kb=1024
- file-cache:max-size-mb:0
- file-cache:enable-parallel-downloads:false
- file-system:kernel-list-cache-ttl-secs:-1
- read:enable-buffered-read:true
- read_ahead_kb=131072
{{- if $gcs.dirPath }}
- only-dir:{{ $gcs.dirPath }}
{{- end }}
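For context on the `read_ahead_kb=131072` change above: it raises the kernel readahead window for the FUSE mount from 1 MiB to 128 MiB, which favors the large sequential reads typical of loading model checkpoints. A hedged way to verify the effective setting on a node mounting the volume (the backing-device IDs under `/sys/class/bdi/` vary per mount, so this lists all of them):

```shell
# Print the readahead setting for every backing device; the FUSE mount's
# entry should show 131072 after the change (device IDs vary per mount).
grep -H . /sys/class/bdi/*/read_ahead_kb
```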