Added recipe for Qwen3-235b #145
Open · ankitkumar-quad wants to merge 11 commits into AI-Hypercomputer:main from ankitkumar-quad:main (+170 −0)
Commits (11), all by ankitkumar-quad:
- e428e11 Added recipe for Qwen3-235b
- 47f7068 Update READEME.md
- b07a2df Update READEME.md
- 567597e Update READEME.md
- 261d7fc Update READEME.md
- ab16df4 Update READEME.md
- 0a96978 Update READEME.md
- 64edd9d Merge branch 'AI-Hypercomputer:main' into main
- bac106f Update READEME.md
- 7099485 Update READEME.md
- 18934ec Update READEME.md
156 additions and 0 deletions: inference/g4/qwen3_235b/single-host-serving/tensorrt-llm/READEME.md
# Single host inference benchmark of Qwen3-235B with TensorRT-LLM on G4

This recipe shows how to serve and benchmark the Qwen3-235B model using [NVIDIA TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) on a single GCP VM with G4 GPUs. For more information on G4 machine types, see the [GCP documentation](https://cloud.google.com/compute/docs/accelerator-optimized-machines#g4-machine-types).

## Before you begin

### 1. Create a GCP VM with G4 GPUs

First, create a Google Cloud Platform (GCP) virtual machine (VM) with the necessary GPU resources.

Make sure you have the following prerequisites:
* The [Google Cloud SDK](https://cloud.google.com/sdk/docs/install) is installed and initialized.
* Your project has GPU quota. See [Request a quota increase](https://cloud.google.com/docs/quota/view-request#requesting_higher_quota).
* The required APIs are enabled. See [Enable required APIs](https://console.cloud.google.com/flows/enableapi?apiid=compute.googleapis.com).
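Optionally, you can confirm that the G4 shape is offered in your target zone before creating anything. This is a small pre-check sketch; it assumes you have already exported the `ZONE` variable from the block below, and `gcloud compute machine-types list` is a standard SDK command:

```bash
# Check that g4-standard-48 exists in the chosen zone; empty output means it is not offered there.
gcloud compute machine-types list \
  --zones=${ZONE} \
  --filter="name=g4-standard-48"
```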
The following commands set up environment variables and create a GCE instance. `MACHINE_TYPE` is set to `g4-standard-48`, a single-GPU VM shape. The boot disk is set to 200 GB to accommodate the model weights and dependencies.
```bash
export VM_NAME="${USER}-g4-trtllm-qwen3-235b"
export PROJECT_ID="your-project-id"
export ZONE="your-zone"
export MACHINE_TYPE="g4-standard-48"
export IMAGE_PROJECT="ubuntu-os-accelerator-images"
export IMAGE_FAMILY="ubuntu-accelerator-2404-amd64-with-nvidia-570"

gcloud compute instances create ${VM_NAME} \
  --machine-type=${MACHINE_TYPE} \
  --project=${PROJECT_ID} \
  --zone=${ZONE} \
  --image-project=${IMAGE_PROJECT} \
  --image-family=${IMAGE_FAMILY} \
  --maintenance-policy=TERMINATE \
  --boot-disk-size=200GB
```
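To confirm the VM was created and reached a running state, you can query its status (a quick sanity check using standard `gcloud` flags):

```bash
# Prints RUNNING once the instance has finished provisioning.
gcloud compute instances describe ${VM_NAME} \
  --project=${PROJECT_ID} \
  --zone=${ZONE} \
  --format="value(status)"
```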
### 2. Connect to the VM

Use `gcloud compute ssh` to connect to the newly created instance.

```bash
gcloud compute ssh ${VM_NAME?} --project=${PROJECT_ID?} --zone=${ZONE?}
```

```bash
# Run nvidia-smi to verify the driver installation and see the available GPUs.
nvidia-smi
```
## Serve a model

### 1. Install Docker

Before you can serve the model, you need Docker installed on the VM. You can follow the official documentation to install Docker on Ubuntu:
[Install Docker Engine on Ubuntu](https://docs.docker.com/engine/install/ubuntu/)

After installing Docker, make sure the Docker daemon is running.
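As a minimal sketch of one install route, Docker's documented convenience script works on Ubuntu (use the apt repository method from the linked guide if you prefer pinned package versions):

```bash
# Install Docker Engine using Docker's convenience script.
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Confirm the daemon is running and can start containers.
sudo systemctl is-active docker
sudo docker run --rm hello-world
```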
### 2. Install NVIDIA Container Toolkit

To enable Docker containers to access the GPU, you need to install the NVIDIA Container Toolkit. You can follow the official NVIDIA documentation to install the container toolkit:
[NVIDIA Container Toolkit Install Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
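After installing the toolkit, register the NVIDIA runtime with Docker and restart the daemon; both commands come from the linked guide. The smoke test below assumes an arbitrary public CUDA base image, so substitute any tag you prefer:

```bash
# Configure Docker to use the NVIDIA runtime, then restart the daemon.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Smoke test: the container should be able to see the GPU.
sudo docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```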
### 3. Setup TensorRT-LLM

```bash
sudo apt-get update
sudo apt-get -y install git git-lfs

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout v1.2.0rc3
git submodule update --init --recursive
git lfs install
git lfs pull

# Build the Docker image
make -C docker release_build

# Run the Docker container
mkdir -p /scratch/cache
make -C docker release_run DOCKER_RUN_ARGS="-v /scratch:/scratch -v /scratch/cache:/root/.cache --ipc=host"
```
Now you are inside the container.
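As a quick sanity check, verify that the `tensorrt_llm` Python package is importable inside the container (this assumes the release image ships the package, which is what the `release_build` target produces):

```bash
# Confirm TensorRT-LLM is installed and print its version.
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```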
### 4. Download and Quantize the Model
```bash
# Inside the container

# Download the base model from Hugging Face
apt-get update && apt-get install -y huggingface-cli

huggingface-cli download Qwen/Qwen3-235B --local-dir /scratch/models/Qwen3-235B

# Quantize the model using FP8
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path /scratch/models/Qwen3-235B \
    --qformat fp8 \
    --export_path /scratch/models/exported_model_qwen3_235b_fp8 \
    --trust_remote_code
```
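If `huggingface-cli` is not available as an apt package in your container image, the standard fallback is installing it through pip, and gated or rate-limited downloads can be authenticated with a token. The token value below is a hypothetical placeholder:

```bash
# Alternative install of the Hugging Face CLI via pip.
pip install -U "huggingface_hub[cli]"

# Optional: authenticate for gated models or higher rate limits (placeholder value).
export HF_TOKEN="<your-hf-token>"
huggingface-cli download Qwen/Qwen3-235B --local-dir /scratch/models/Qwen3-235B
```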
## Run Benchmarks

Create a script to run the benchmarks across different input/output sequence length (ISL/OSL) configurations, as shown below.
```bash
# Inside the container

cat << 'EOF' > /scratch/run_benchmark.sh
#!/bin/bash

# Function to run benchmarks
run_benchmark() {
    local model_name=$1
    local isl=$2
    local osl=$3
    local num_requests=$4
    local tp_size=$5
    local pp_size=$6
    local ep_size=$7

    echo "Running benchmark for $model_name with ISL=$isl, OSL=$osl, TP=$tp_size, PP=$pp_size, EP=$ep_size"

    dataset_file="/scratch/token-norm-dist_${model_name##*/}_${isl}_${osl}.json"

    python benchmarks/cpp/prepare_dataset.py --tokenizer=$model_name --stdout token-norm-dist --num-requests=$num_requests --input-mean=$isl --output-mean=$osl --input-stdev=0 --output-stdev=0 > $dataset_file

    # Save throughput output to a file
    trtllm-bench --model $model_name --model_path ${model_name} throughput --concurrency 128 --dataset $dataset_file --tp $tp_size --pp $pp_size --ep $ep_size --backend pytorch > "/scratch/output_${model_name##*/}_isl${isl}_osl${osl}_tp${tp_size}_pp${pp_size}_ep${ep_size}_throughput.txt"

    rm -f $dataset_file
}

model_name="/scratch/models/exported_model_qwen3_235b_fp8"
TP_SIZE=1
PP_SIZE=1
EP_SIZE=1
run_benchmark "$model_name" 128 128 1024 $TP_SIZE $PP_SIZE $EP_SIZE
run_benchmark "$model_name" 128 2048 1024 $TP_SIZE $PP_SIZE $EP_SIZE
run_benchmark "$model_name" 128 4096 1024 $TP_SIZE $PP_SIZE $EP_SIZE
run_benchmark "$model_name" 500 2000 1024 $TP_SIZE $PP_SIZE $EP_SIZE
run_benchmark "$model_name" 1000 1000 1024 $TP_SIZE $PP_SIZE $EP_SIZE
run_benchmark "$model_name" 2048 128 1024 $TP_SIZE $PP_SIZE $EP_SIZE
run_benchmark "$model_name" 2048 2048 1024 $TP_SIZE $PP_SIZE $EP_SIZE
run_benchmark "$model_name" 5000 500 1024 $TP_SIZE $PP_SIZE $EP_SIZE
run_benchmark "$model_name" 20000 2000 1024 $TP_SIZE $PP_SIZE $EP_SIZE
EOF

chmod +x /scratch/run_benchmark.sh
/scratch/run_benchmark.sh
```
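When the sweep finishes, each configuration leaves a report file under `/scratch`. The exact report wording depends on the `trtllm-bench` version, so treat the grep pattern below as a starting point and adjust it to match your output:

```bash
# List the per-configuration reports and skim their throughput lines.
ls /scratch/output_*_throughput.txt
grep -iH "throughput" /scratch/output_*_throughput.txt
```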