diff --git a/inference/a4/single-host-serving/sglang/README.md b/inference/a4/single-host-serving/sglang/deepseek-r1-671b/README.md similarity index 99% rename from inference/a4/single-host-serving/sglang/README.md rename to inference/a4/single-host-serving/sglang/deepseek-r1-671b/README.md index c9b775e6..2ad691a2 100644 --- a/inference/a4/single-host-serving/sglang/README.md +++ b/inference/a4/single-host-serving/sglang/deepseek-r1-671b/README.md @@ -126,7 +126,7 @@ First, you'll configure your local environment. These steps are required once be git clone https://github.com/ai-hypercomputer/gpu-recipes.git cd gpu-recipes export REPO_ROOT=$(pwd) -export RECIPE_ROOT=$REPO_ROOT/inference/a4/single-host-serving/sglang +export RECIPE_ROOT=$REPO_ROOT/inference/a4/single-host-serving/sglang/deepseek-r1-671b ``` @@ -450,4 +450,4 @@ To avoid incurring further charges, clean up the resources you created. 3. (Optional) Delete the built Docker image from Artifact Registry if no longer needed. 4. (Optional) Delete Cloud Build logs. 5. (Optional) Clean up files in your GCS bucket if benchmarking was performed. -6. (Optional) Delete the [test environment](#test-environment) provisioned including GKE cluster. \ No newline at end of file +6. (Optional) Delete the [test environment](#test-environment) provisioned including GKE cluster. 
diff --git a/inference/a4/single-host-serving/sglang/stream_chat.sh b/inference/a4/single-host-serving/sglang/deepseek-r1-671b/stream_chat.sh similarity index 100% rename from inference/a4/single-host-serving/sglang/stream_chat.sh rename to inference/a4/single-host-serving/sglang/deepseek-r1-671b/stream_chat.sh diff --git a/inference/a4/single-host-serving/sglang/values.yaml b/inference/a4/single-host-serving/sglang/deepseek-r1-671b/values.yaml similarity index 98% rename from inference/a4/single-host-serving/sglang/values.yaml rename to inference/a4/single-host-serving/sglang/deepseek-r1-671b/values.yaml index cb057b58..204dc428 100644 --- a/inference/a4/single-host-serving/sglang/values.yaml +++ b/inference/a4/single-host-serving/sglang/deepseek-r1-671b/values.yaml @@ -59,4 +59,4 @@ network: gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.0.5 ncclSettings: - name: NCCL_DEBUG - value: "WARN" \ No newline at end of file + value: "WARN" diff --git a/inference/a4/single-host-serving/sglang/wan2.2/README.md b/inference/a4/single-host-serving/sglang/wan2.2/README.md new file mode 100644 index 00000000..90edfc5f --- /dev/null +++ b/inference/a4/single-host-serving/sglang/wan2.2/README.md @@ -0,0 +1,443 @@ +# Single host inference benchmark of Wan2.2 with Sglang on A4 GKE Node Pool + +This document outlines the steps to serve and benchmark Wan2.2 model using the [SGLang](https://github.com/sgl-project/sglang/tree/main) framework on a single [A4 GKE Node pool](https://cloud.google.com/kubernetes-engine). + +This guide walks you through setting up the necessary cloud infrastructure, configuring your environment, and benchmarking a high-performance video generation model. + + +## Table of Contents + +* [1. Test Environment](#test-environment) +* [2. High-Level Architecture](#architecture) +* [3. Environment Setup (One-Time)](#environment-setup) + * [3.1. Clone the Repository](#clone-repo) + * [3.2. 
Configure Environment Variables](#configure-vars)
+    * [3.3. Connect to your GKE Cluster](#connect-cluster)
+    * [3.4. Get Hugging Face Token](#get-hf-token)
+    * [3.5. Create Hugging Face Kubernetes Secret](#setup-hf-secret)
+* [4. Run the Recipe](#run-the-recipe)
+    * [4.1. Model Variants](#serving-wan-model)
+* [5. Monitoring and Troubleshooting](#monitoring)
+    * [5.1. Check Deployment Status](#check-status)
+    * [5.2. View Logs](#view-logs)
+    * [5.3. Common Issues](#troubleshooting)
+* [6. Cleanup](#cleanup)
+
+
+## 1. Test Environment
+
+[Back to Top](#table-of-contents)
+
+The recipe uses the following setup:
+
+* **Orchestration**: [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
+* **Deployment Configuration**: A [Helm chart](https://helm.sh/) is used to configure and deploy a [Kubernetes Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/). This deployment encapsulates the inference of the target model using the SGLang framework.
+
+This recipe has been optimized for and tested with the following configuration:
+
+* **GKE Cluster**:
+    * A [regional standard cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/configuration-overview) version: `1.31.7-gke.1265000` or later.
+    * A GPU node pool with 1 [a4-highgpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) machine.
+    * [Workload Identity Federation for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled.
+    * [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled.
+    * [DCGM metrics](https://cloud.google.com/kubernetes-engine/docs/how-to/dcgm-metrics) enabled.
+    * [Kueue](https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta1/) and [JobSet](https://jobset.sigs.k8s.io/docs/overview/) APIs installed.
+ * Kueue configured to support [Topology Aware Scheduling](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/). +* A regional Google Cloud Storage (GCS) bucket to store logs generated by the recipe runs. + +> [!IMPORTANT] +> To prepare the required environment, see the [GKE environment setup guide](../../../../docs/configuring-environment-gke-a4.md). +> Provisioning a new GKE cluster is a long-running operation and can take **20-30 minutes**. + + +## 2. High-Level Flow + +[Back to Top](#table-of-contents) + +Here is a simplified diagram of the flow that we follow in this recipe: + +```mermaid +--- +config: + layout: dagre +--- +flowchart TD + subgraph workstation["Client Workstation"] + T["Cluster Toolkit"] + B("Kubernetes API") + A["helm install"] + Y["gcloud"] + end + subgraph imagerepo["Build Image"] + H["Artifact Registry"] + G["Cloud Build"] + end + subgraph huggingface["Hugging Face Hub"] + I["Model Weights"] + end + subgraph gke["GKE Cluster (a4)"] + C["Deployment"] + D["Pod"] + E["SGLang Container"] + F["Service"] + end + subgraph storage["Cloud Storage"] + J["Bucket"] + end + + %% Logical/actual flow + T -- Create Cluster --> gke + A --> B + G -- Pushes Image --> H + B --> C & F + C --> D + D --> E + F --> C + H -- Pulls Image --> E + E -- Downloads at runtime --> I + E -- Write logs --> J + Y -- Run Build --> imagerepo + + + %% Layout control + gke ~~~ imagerepo +``` + +* **helm:** A package manager for Kubernetes to define, install, and upgrade applications. It's used here to configure and deploy the Kubernetes Deployment. +* **Deployment:** Manages the lifecycle of your model server pod, ensuring it stays running. +* **Service:** Provides a stable network endpoint (a DNS name and IP address) to access your model server. +* **Pod:** The smallest deployable unit in Kubernetes. The SGLang container runs inside this pod on a GPU-enabled node. +* **Cloud Storage:** A Cloud Storage bucket to store benchmark logs and other artifacts. 
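+The pieces above come together when a client posts a generation request to the Service, which routes it to the SGLang container. As a rough sketch (not part of the recipe itself), the JSON body used by the `/v1/videos` examples later in this guide can be assembled in plain shell; the model name, prompt, and field values below are the example settings used elsewhere in this README:

```bash
# Sketch only: build the JSON request body used by the /v1/videos examples
# in section 4. The model and prompt are example values from this recipe.
MODEL="Wan-AI/Wan2.2-T2V-A14B-Diffusers"
PROMPT="A curious raccoon in a field of sunflowers."
PAYLOAD=$(printf '{"model":"%s","prompt":"%s","num_frames":81,"fps":16}' \
  "$MODEL" "$PROMPT")
echo "$PAYLOAD"
```

+Once the deployment is ready, this is the payload that `curl` (or `stream_video.sh`) posts to the Service endpoint.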
+
+
+## 3. Environment Setup (One-Time)
+
+[Back to Top](#table-of-contents)
+
+First, you'll configure your local environment. These steps are required once before you can deploy any models.
+
+
+### 3.1. Clone the Repository
+
+```bash
+git clone https://github.com/ai-hypercomputer/gpu-recipes.git
+cd gpu-recipes
+export REPO_ROOT=$(pwd)
+export RECIPE_ROOT=$REPO_ROOT/inference/a4/single-host-serving/sglang/wan2.2
+```
+
+
+### 3.2. Configure Environment Variables
+
+This is the most critical step. These variables are used in subsequent commands to target the correct resources.
+
+```bash
+export PROJECT_ID=
+export CLUSTER_REGION=
+export CLUSTER_NAME=
+export KUEUE_NAME=
+export GCS_BUCKET=
+export SGLANG_IMAGE=lmsysorg/sglang
+export SGLANG_VERSION=latest
+
+# Set the project for gcloud commands
+gcloud config set project $PROJECT_ID
+```
+
+Replace the following values:
+
+| Variable | Description | Example |
+| --------------------- | ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
+| `PROJECT_ID` | Your Google Cloud Project ID. | `gcp-project-12345` |
+| `CLUSTER_REGION` | The GCP region where your GKE cluster is located. | `us-central1` |
+| `CLUSTER_NAME` | The name of your GKE cluster. | `a4-gke-cluster` |
+| `KUEUE_NAME` | The name of the Kueue local queue. The default queue created by the cluster toolkit is `a4`. Verify the name in your cluster. | `a4` |
+| `GCS_BUCKET` | Name of your GCS bucket (do not include `gs://`). | `my-benchmark-logs-bucket` |
+| `SGLANG_IMAGE` | The SGLang Docker image to pull. | `lmsysorg/sglang` |
+| `SGLANG_VERSION` | The tag/version of the SGLang Docker image. | `latest` |
+
+
+### 3.3. Connect to your GKE Cluster
+
+Fetch credentials for `kubectl` to communicate with your cluster.
+
+```bash
+gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
+```
+
+
+### 3.4. 
Get Hugging Face token + +To access models through Hugging Face, you'll need a Hugging Face token. + 1. Create a [Hugging Face account](https://huggingface.co/) if you don't have one. + 2. For **gated models** ensure you have requested and been granted access on Hugging Face before proceeding. + 3. Generate an Access Token: Go to **Your Profile > Settings > Access Tokens**. + 4. Select **New Token**. + 5. Specify a Name and a Role of at least `Read`. + 6. Select **Generate a token**. + 7. Copy the generated token to your clipboard. You'll use this later. + + + +### 3.5. Create Hugging Face Kubernetes Secret + +Create a Kubernetes Secret with your Hugging Face token to enable the job to download model checkpoints from Hugging Face. + +```bash +# Paste your Hugging Face token here +export HF_TOKEN= + +kubectl create secret generic hf-secret \ +--from-literal=hf_api_token=${HF_TOKEN} \ +--dry-run=client -o yaml | kubectl apply -f - +``` + +**You have now completed the environment setup!** You are ready to deploy a model. + + +## 4. Run the Recipe + +[Back to Top](#table-of-contents) + +This recipe supports the deployment of the following models: + +1. [Wan2.2](#serving-wan-model) + +Now, select a model to deploy. Each section below is self-contained for deploying a specific model. + +> [!NOTE] +> After running the recipe with `helm install`, it can take **up to 30 minutes** for the deployment to become fully available. This is because the GKE node must first pull the Docker image and then download the model weights from Hugging Face. + + +### 4.1. Model Variants + +[Back to Top](#table-of-contents) + +This recipe serves the [Wan2.2-T2V-A14B-Diffusers model](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers) & [Wan2.2-I2V-A14B-Diffusers model](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B-Diffusers) using SGLang framework on a single A4 node. + +Upon launching the SGLang server, it performs the following steps: + +1. 
Downloads the full Wan2.2 model checkpoints from Hugging Face. +2. Loads the model checkpoints and applies SGLang optimizations. +3. Server is ready to respond to requests. + + +#### 4.1.1. Deploy Wan2.2 + +1. **Install the helm chart to prepare and serve the model using SGLang framework:** + + ```bash + cd $RECIPE_ROOT + helm install -f values.yaml \ + --set-file workload_launcher=$REPO_ROOT/src/launchers/sglang-diffusion-launcher.sh \ + --set-file serving_config=$REPO_ROOT/src/frameworks/a4/sglang-configs/wan2.2.yaml \ + --set queue=${KUEUE_NAME} \ + --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \ + --set workload.model.name=Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --set workload.image=${SGLANG_IMAGE}:${SGLANG_VERSION} \ + --set workload.framework=sglang \ + $USER-serving-wan2-2-model \ + $REPO_ROOT/src/helm-charts/a4/inference-templates/deployment + ``` + + This creates a Helm release and a Deployment named `$USER-serving-wan2-2-model`, and a Service named `$USER-serving-wan2-2-model-svc`. + +2. **Check the deployment status.** + + ```bash + kubectl get deployment/$USER-serving-wan2-2-model + ``` + + Wait until the `READY` column shows `1/1`. See the [Monitoring and Troubleshooting](#monitoring) section to view the deployment logs. + + > [!NOTE] + > This deployment process can take **up to 30 minutes** as it downloads the model weights from Hugging Face and then the server loads the model weights. + + +#### 4.1.2. Interact with Wan2.2 model + +1. **Make a Video Generation API request:** + + Submit a text-to-video generation job. Note that video generation is asynchronous; the initial response will provide a Job ID. 
+
+    ```bash
+    kubectl exec -it deployment/$USER-serving-wan2-2-model -- \
+    curl http://localhost:8000/v1/videos \
+    -H "Content-Type: application/json" \
+    -d '{
+      "model": "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
+      "prompt": "A cinematic, high-detailed shot of a futuristic city with flying vehicles at sunset, 4k resolution.",
+      "num_frames": 81,
+      "fps": 16,
+      "size": "1280x720",
+      "seed": 1024
+    }'
+    ```
+2. **Make an Image-to-Video (I2V) API request:**
+
+    Submit an image-to-video generation job. As with text-to-video, the call is asynchronous and the initial response will provide a Job ID.
+
+    ```bash
+    kubectl exec -it deployment/$USER-serving-wan2-2-model -- \
+    curl http://localhost:8000/v1/videos \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "Wan-AI/Wan2.2-I2V-A14B-Diffusers",
+        "prompt": "The character in the image starts walking toward the camera, cinematic lighting.",
+        "input_reference": "assets/sampleImage.png",
+        "num_frames": 81,
+        "fps": 16
+    }'
+    ```
+3. **Generate a Video via Utility Script:**
+
+    For a more automated experience, use the provided `stream_video.sh` script. First, forward the local port in one terminal:
+
+    ```bash
+    kubectl port-forward svc/$USER-serving-wan2-2-model-svc 8000:8000
+    ```
+
+    In a separate terminal, run the `stream_video.sh` utility script:
+
+    ```bash
+    $RECIPE_ROOT/stream_video.sh "A curious raccoon in a field of sunflowers."
+    ```
+
+
+#### 4.1.3. Benchmark Wan2.2
+
+1. 
Run the [SGLang benchmarking tool](https://docs.sglang.ai/references/benchmark_and_profiling.html) directly inside the running deployment: + + *Benchmark: Text-to-Video on 1 GPU* + ```bash + kubectl exec -it deployment/$USER-serving-wan2-2-model -- /bin/sh -c \ + 'set -e + source /usr/local/gib/scripts/set_nccl_env.sh + sglang generate --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --num-gpus 1 --tp-size 1 --num-frames 81 --save-output \ + --prompt "Cyberpunk city street in the rain, neon lights reflecting on puddles."' + ``` + *Benchmark: Text-to-Video on 4 GPU* + ```bash + kubectl exec -it deployment/$USER-serving-wan2-2-model -- /bin/sh -c \ + 'set -e + source /usr/local/gib/scripts/set_nccl_env.sh + sglang generate --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --num-gpus 4 --tp-size 4 --num-frames 93 --save-output \ + --prompt "Cyberpunk city street in the rain, neon lights reflecting on puddles."' + ``` + *Benchmark: Image-to-Video on 1 GPU* + ```bash + kubectl exec -it deployment/$USER-serving-wan2-2-model -- /bin/sh -c \ + 'set -e + source /usr/local/gib/scripts/set_nccl_env.sh + sglang generate --model-path Wan-AI/Wan2.2-I2V-A14B-Diffusers \ + --num-gpus 1 --tp-size 1 --num-frames 81 --save-output \ + --image "assets/sampleImage.png" \ + --prompt "The cat in the image blinks and looks at the camera."' + ``` + *Benchmark: Image-to-Video on 4 GPU* + ```bash + kubectl exec -it deployment/$USER-serving-wan2-2-model -- /bin/sh -c \ + 'set -e + source /usr/local/gib/scripts/set_nccl_env.sh + sglang generate --model-path Wan-AI/Wan2.2-I2V-A14B-Diffusers \ + --num-gpus 4 --tp-size 4 --num-frames 93 --save-output \ + --image "assets/sampleImage.png" \ + --prompt "The cat in the image blinks and looks at the camera."' + ``` + + Benchmark results are displayed in the logs. + + + +## 5. 
Monitoring and Troubleshooting
+
+[Back to Top](#table-of-contents)
+
+After the model is deployed via Helm as described in the sections [above](#run-the-recipe), use the following steps to monitor the deployment and interact with the model. Replace `<deployment name>` and `<service name>` with the appropriate names from the model-specific deployment instructions (e.g., `$USER-serving-wan2-2-model` and `$USER-serving-wan2-2-model-svc`).
+
+
+### 5.1. Check Deployment Status
+
+Check the status of your deployment. Replace the name if you deployed a different model.
+
+```bash
+# Example for Wan
+kubectl get deployment/$USER-serving-wan2-2-model
+```
+
+Wait until the `READY` column shows `1/1`. If it shows `0/1`, the pod is still starting up.
+
+> [!NOTE]
+> In the GKE UI on Cloud Console, you might see a status of "Does not have minimum availability" during startup. This is normal and will resolve once the pod is ready.
+
+
+### 5.2. View Logs
+
+To see the logs from the SGLang server (useful for debugging), use the `-f` flag to follow the log stream:
+
+```bash
+kubectl logs -f deployment/$USER-serving-wan2-2-model
+```
+
+You should see logs indicating that the SGLang server is downloading/loading the model and then starting the API server, similar to this:
+
+```bash
+INFO: Started server process [2173]
+INFO: Waiting for application startup.
+INFO: Application startup complete.
+INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
+INFO: 127.0.0.1:60018 - "GET /get_model_info HTTP/1.1" 200 OK
+...
+INFO: The server is fired up and ready to roll!
+```
+
+
+### 5.3. Common Issues
+
+* **Error: `Connection refused` when using `port-forward`**
+
+    If you are trying to stream responses using `kubectl port-forward` and get a connection error, check the following:
+
+    1. **Is the deployment ready?** Run `kubectl get deployment` and ensure the `READY` column is `1/1`.
+    2. **Is the port-forward command running?** The command must remain active in its own terminal while you make requests.
+    3. **Check Pod Logs:** Use `kubectl logs -f ...` to check for any error messages.
+    4. **Try again:** Sometimes transient network issues can cause this. Stop the `port-forward` command (`Ctrl+C`) and run it again.
+
+* **Error: `deployments.apps "..." not found`**
+
+    This indicates a typo in the deployment name. Use `helm list` to see the correct release names or `kubectl get deployments` to see all available deployment names.
+
+
+## 6. Cleanup
+
+To avoid incurring further charges, clean up the resources you created.
+
+1. **Uninstall the Helm Release:**
+
+    First, list your releases to get the deployed models:
+
+    ```bash
+    # list deployed models
+    helm list --filter $USER-serving-
+    ```
+
+    Then, uninstall the desired release:
+
+    ```bash
+    # uninstall the deployed model
+    helm uninstall <release name>
+    ```
+    Replace `<release name>` with one of the helm release names listed.
+
+2. **Delete the Kubernetes Secret:**
+
+    ```bash
+    kubectl delete secret hf-secret --ignore-not-found=true
+    ```
+
+3. (Optional) Clean up files in your GCS bucket if benchmarking was performed.
+4. (Optional) Delete the [test environment](#test-environment) provisioned, including the GKE cluster.
diff --git a/inference/a4/single-host-serving/sglang/wan2.2/stream_video.sh b/inference/a4/single-host-serving/sglang/wan2.2/stream_video.sh
new file mode 100644
index 00000000..9e5c79df
--- /dev/null
+++ b/inference/a4/single-host-serving/sglang/wan2.2/stream_video.sh
@@ -0,0 +1,62 @@
+#!/bin/bash
+
+[ $# -eq 0 ] && { echo "Usage: $0 \"Your prompt\""; exit 1; }
+
+PROMPT="$1"
+
+POD_NAME=$(kubectl get pods --no-headers -o custom-columns=":metadata.name" | grep "${USER}-serving-wan2-2-model" | head -n 1)
+
+if [ -z "$POD_NAME" ]; then
+  echo "Error: Could not find a running Wan2.2 pod."
+  echo "Please ensure your deployment is active."
+  exit 1
+fi
+
+echo "Using Pod: $POD_NAME"
+echo "Submitting Video Job..."
+ +RESPONSE=$(kubectl exec "$POD_NAME" -- curl -s -X POST "http://localhost:8000/v1/videos" \ + -H "Content-Type: application/json" \ + -d "{ + \"model\": \"Wan-AI/Wan2.2-T2V-A14B-Diffusers\", + \"prompt\": \"$PROMPT\", + \"num_frames\": 81, + \"fps\": 16 + }") + +JOB_ID=$(echo "$RESPONSE" | jq -r '.id') + +if [ "$JOB_ID" == "null" ] || [ -z "$JOB_ID" ]; then + echo "Error: Failed to get Job ID. Response: $RESPONSE" + exit 1 +fi + +echo "Job Submitted! ID: $JOB_ID" + +# --- NEW: Polling Loop --- +echo -n "Rendering Video..." +while true; do + # Check status inside the pod + STATUS_REPLY=$(kubectl exec "$POD_NAME" -- curl -s "http://localhost:8000/v1/videos/$JOB_ID") + STATUS=$(echo "$STATUS_REPLY" | jq -r '.status') + PROGRESS=$(echo "$STATUS_REPLY" | jq -r '.progress') + + if [ "$STATUS" == "completed" ]; then + FILE_PATH=$(echo "$STATUS_REPLY" | jq -r '.file_path') + echo -e "\nSuccess! Video generated at: $FILE_PATH" + echo "To download run: kubectl cp $POD_NAME:$FILE_PATH ./output.mp4" + break + elif [ "$STATUS" == "failed" ]; then + ERROR_MSG=$(echo "$STATUS_REPLY" | jq -r '.error') + echo -e "\nError during generation: $ERROR_MSG" + exit 1 + else + # Print progress percentage if available, otherwise dots + if [ "$PROGRESS" != "null" ] && [ "$PROGRESS" != "0" ]; then + echo -ne "\rRendering Video... $PROGRESS%" + else + echo -n "." 
+ fi + sleep 10 + fi +done diff --git a/inference/a4/single-host-serving/sglang/wan2.2/values.yaml b/inference/a4/single-host-serving/sglang/wan2.2/values.yaml new file mode 100644 index 00000000..dfca3bbf --- /dev/null +++ b/inference/a4/single-host-serving/sglang/wan2.2/values.yaml @@ -0,0 +1,48 @@ +queue: + +dwsSettings: + maxRunDurationSeconds: + +huggingface: + secretName: hf-secret + secretData: + token: "hf_api_token" + +volumes: + gcsVolumes: true + ssdMountPath: "/ssd" + gcsMounts: + - bucketName: + mountPath: "/gcs" + +service: + type: ClusterIP + ports: + http: 8000 + +workload: + model: + name: + gpus: 8 + image: + framework: + configFile: serving-args.yaml + configPath: /workload/configs + envs: + - name: HF_HUB_ENABLE_HF_TRANSFER + value: "1" + - name: LAUNCHER_SCRIPT + value: "/workload/launcher/launch-workload.sh" + - name: SERVER_ARGS_FILE + value: "/workload/configs/serving-args.yaml" + - name: HF_HOME + value: "/ssd" + - name: LD_LIBRARY_PATH + value: "/usr/local/nvidia/lib64:/usr/local/lib/" + +network: + subnetworks[]: + gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.0.5 + ncclSettings: + - name: NCCL_DEBUG + value: "WARN" diff --git a/inference/a4/single-host-serving/tensorrt-llm/README.md b/inference/a4/single-host-serving/tensorrt-llm/README.md new file mode 100644 index 00000000..8c884dd7 --- /dev/null +++ b/inference/a4/single-host-serving/tensorrt-llm/README.md @@ -0,0 +1,402 @@ +# Single Host Model Serving with NVIDIA TensorRT-LLM (TRT-LLM) on A4 GKE Node Pool + +This document outlines the steps to serve and benchmark various Large Language Models (LLMs) using the [NVIDIA TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) framework on a single [A4 GKE Node pool](https://cloud.google.com/kubernetes-engine). + +This guide walks you through setting up the necessary cloud infrastructure, configuring your environment, and deploying a high-performance LLM for inference. + + +## Table of Contents + +* [1. 
Test Environment](#test-environment) +* [2. High-Level Architecture](#architecture) +* [3. Environment Setup (One-Time)](#environment-setup) + * [3.1. Clone the Repository](#clone-repo) + * [3.2. Configure Environment Variables](#configure-vars) + * [3.3. Connect to your GKE Cluster](#connect-cluster) + * [3.4. Get Hugging Face Token](#get-hf-token) + * [3.5. Create Hugging Face Kubernetes Secret](#setup-hf-secret) +* [4. Run the Recipe](#run-the-recipe) + * [4.1. Supported Models](#supported-models) + * [4.2. Deploy and Benchmark a Model](#deploy-model) +* [5. Monitoring and Troubleshooting](#monitoring) + * [5.1. Check Deployment Status](#check-status) + * [5.2. View Logs](#view-logs) +* [6. Cleanup](#cleanup) + + +## 1. Test Environment + +[Back to Top](#table-of-contents) + +The recipe uses the following setup: + +* **Orchestration**: [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) +* **Deployment Configuration**: A [Helm chart](https://helm.sh/) is used to configure and deploy a [Kubernetes Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/). This deployment encapsulates the inference of the target LLM using the TensorRT-LLM framework. + +This recipe has been optimized for and tested with the following configuration: + +* **GKE Cluster**: + * A [regional standard cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/configuration-overview) version: `1.33.4-gke.1036000` or later. + * A GPU node pool with 1 [a4-highgpu-8g](https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) machine. + * [Workload Identity Federation for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled. + * [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled. + * [DCGM metrics](https://cloud.google.com/kubernetes-engine/docs/how-to/dcgm-metrics) enabled. 
+    * [Kueue](https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta1/) and [JobSet](https://jobset.sigs.k8s.io/docs/overview/) APIs installed.
+    * Kueue configured to support [Topology Aware Scheduling](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/).
+* A regional Google Cloud Storage (GCS) bucket to store logs generated by the recipe runs.
+
+> [!IMPORTANT]
+> To prepare the required environment, see the [GKE environment setup guide](../../../../docs/configuring-environment-gke-a4.md).
+> Provisioning a new GKE cluster is a long-running operation and can take **20-30 minutes**.
+
+
+## 2. High-Level Flow
+
+[Back to Top](#table-of-contents)
+
+Here is a simplified diagram of the flow that we follow in this recipe:
+
+```mermaid
+---
+config:
+  layout: dagre
+---
+flowchart TD
+    subgraph workstation["Client Workstation"]
+        T["Cluster Toolkit"]
+        B("Kubernetes API")
+        A["helm install"]
+    end
+    subgraph huggingface["Hugging Face Hub"]
+        I["Model Weights"]
+    end
+    subgraph gke["GKE Cluster (A4)"]
+        C["Deployment"]
+        D["Pod"]
+        E["TensorRT-LLM container"]
+        F["Service"]
+    end
+    subgraph storage["Cloud Storage"]
+        J["Bucket"]
+    end
+
+    %% Logical/actual flow
+    T -- Create Cluster --> gke
+    A --> B
+    B --> C & F
+    C --> D
+    D --> E
+    F --> C
+    E -- Downloads at runtime --> I
+    E -- Write logs --> J
+```
+
+* **helm:** A package manager for Kubernetes to define, install, and upgrade applications. It's used here to configure and deploy the Kubernetes Deployment.
+* **Deployment:** Manages the lifecycle of your model server pod, ensuring it stays running.
+* **Service:** Provides a stable network endpoint (a DNS name and IP address) to access your model server.
+* **Pod:** The smallest deployable unit in Kubernetes. The Triton server container with TensorRT-LLM runs inside this pod on a GPU-enabled node.
+* **Cloud Storage:** A Cloud Storage bucket to store benchmark logs and other artifacts.
+
+
+## 3. 
Environment Setup (One-Time)
+
+[Back to Top](#table-of-contents)
+
+First, you'll configure your local environment. These steps are required once before you can deploy any models.
+
+
+### 3.1. Clone the Repository
+
+```bash
+git clone https://github.com/ai-hypercomputer/gpu-recipes.git
+cd gpu-recipes
+export REPO_ROOT=$(pwd)
+export RECIPE_ROOT=$REPO_ROOT/inference/a4/single-host-serving/tensorrt-llm
+```
+
+
+### 3.2. Configure Environment Variables
+
+This is the most critical step. These variables are used in subsequent commands to target the correct resources.
+
+```bash
+export PROJECT_ID=
+export CLUSTER_REGION=
+export CLUSTER_NAME=
+export KUEUE_NAME=
+export GCS_BUCKET=
+export TRTLLM_VERSION=1.3.0rc5
+
+# Set the project for gcloud commands
+gcloud config set project $PROJECT_ID
+```
+
+Replace the following values:
+
+| Variable | Description | Example |
+| --------------------- | ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
+| `PROJECT_ID` | Your Google Cloud Project ID. | `gcp-project-12345` |
+| `CLUSTER_REGION` | The GCP region where your GKE cluster is located. | `us-central1` |
+| `CLUSTER_NAME` | The name of your GKE cluster. | `a4-cluster` |
+| `KUEUE_NAME` | The name of the Kueue local queue. The default queue created by the cluster toolkit is `a4`. Verify the name in your cluster. | `a4` |
+| `GCS_BUCKET` | Name of your GCS bucket (do not include `gs://`). | `my-benchmark-logs-bucket` |
+| `TRTLLM_VERSION` | The tag/version for the Docker image. Other versions can be found in the [TensorRT-LLM NGC container catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release). | `1.3.0rc5` |
+
+
+### 3.3. Connect to your GKE Cluster
+
+Fetch credentials for `kubectl` to communicate with your cluster.
+
+```bash
+gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
+```
+
+
+### 3.4. Get Hugging Face token
+
+To access models through Hugging Face, you'll need a Hugging Face token.
+ 1. Create a [Hugging Face account](https://huggingface.co/) if you don't have one.
+ 2. For **gated models** like Llama 4, ensure you have requested and been granted access on Hugging Face before proceeding.
+ 3. Generate an Access Token: Go to **Your Profile > Settings > Access Tokens**.
+ 4. Select **New Token**.
+ 5. Specify a Name and a Role of at least `Read`.
+ 6. Select **Generate a token**.
+ 7. Copy the generated token to your clipboard. You'll use this later.
+
+
+### 3.5. Create Hugging Face Kubernetes Secret
+
+Create a Kubernetes Secret with your Hugging Face token to enable the pod to download model checkpoints from Hugging Face.
+
+```bash
+# Paste your Hugging Face token here
+export HF_TOKEN=
+
+kubectl create secret generic hf-secret \
+--from-literal=hf_api_token=${HF_TOKEN} \
+--dry-run=client -o yaml | kubectl apply -f -
+```
+
+
+## 4. Run the Recipe
+
+[Back to Top](#table-of-contents)
+
+> [!NOTE]
+> After running the recipe with `helm install`, it can take **up to 30 minutes** for the deployment to become fully available. This is because the GKE node must first pull the Docker image and then download the model weights from Hugging Face.
+
+
+### 4.1. Supported Models
+
+[Back to Top](#table-of-contents)
+
+This recipe supports the following models. You can easily swap between them by changing the environment variables in the next step.
+
+TRT-LLM inference benchmarking for these models has been tested and validated on A4 GKE nodes only with certain combinations of TP, PP, EP, number of GPU chips, input and output sequence length, precision, and other settings.
+
+The example model configuration YAML files included in this repo show only one such combination of parallelism hyperparameters and configs, for benchmarking purposes. 
Input and output lengths in `gpu-recipes/inference/a4/single-host-serving/tensorrt-llm/values.yaml` need to be adjusted according to the model and its configs.
+
+| Model Name | Hugging Face ID | Configuration File | Release Name Suffix |
+| :--- | :--- | :--- | :--- |
+| **DeepSeek-R1 671B** | `nvidia/DeepSeek-R1-NVFP4-v2` | `deepseek-r1-nvfp4.yaml` | `deepseek-r1` |
+| **Qwen 3 235B A22B FP4** | `nvidia/Qwen3-235B-A22B-NVFP4` | `qwen3-235b-a22b-nvfp4.yaml` | `qwen3-235b-a22b` |
+| **Qwen 3 32B** | `Qwen/Qwen3-32B` | `qwen3-32b.yaml` | `qwen3-32b` |
+
+> [!TIP]
+> **DeepSeek-R1 671B** uses NVIDIA's pre-quantized FP4 checkpoint. For more information, see the [Hugging Face model card](https://huggingface.co/nvidia/DeepSeek-R1-NVFP4-v2).
+
+> [!TIP]
+> You can use the [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq) to quantize these models to FP8 or NVFP4 for improved performance.
+
+
+### 4.2. Deploy and Benchmark a Model
+
+[Back to Top](#table-of-contents)
+
+The recipe uses [`trtllm-bench`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/performance/perf-benchmarking.md), a command-line tool from NVIDIA, to benchmark the performance of a TensorRT-LLM engine.
+
+1. **Configure model-specific variables.** Choose a model from the [table above](#supported-models) and set the variables:
+
+    ```bash
+    # Example for DeepSeek-R1 NVFP4
+    export HF_MODEL_ID="nvidia/DeepSeek-R1-NVFP4-v2"
+    export CONFIG_FILE="deepseek-r1-nvfp4.yaml"
+    export RELEASE_NAME="$USER-serving-deepseek-r1"
+    ```
+
+2. 
**Install the helm chart:** + + ```bash + cd $RECIPE_ROOT + helm install -f values.yaml \ + --set-file workload_launcher=$REPO_ROOT/src/launchers/trtllm-launcher.sh \ + --set-file serving_config=$REPO_ROOT/src/frameworks/a4x/trtllm-configs/${CONFIG_FILE} \ + --set queue=${KUEUE_NAME} \ + --set "volumes.gcsMounts[0].bucketName=${GCS_BUCKET}" \ + --set workload.model.name=${HF_MODEL_ID} \ + --set workload.image=nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_VERSION} \ + --set workload.framework=trtllm \ + ${RELEASE_NAME} \ + $REPO_ROOT/src/helm-charts/a4x/inference-templates/deployment + ``` + +3. **Check the deployment status:** + + ```bash + kubectl get deployment/${RELEASE_NAME} + ``` + + Wait until the `READY` column shows `1/1`. See the [Monitoring and Troubleshooting](#monitoring) section to view the deployment logs. + + +## 5. Monitoring and Troubleshooting + +[Back to Top](#table-of-contents) + +After the model is deployed via Helm as described in the sections [above](#run-the-recipe), use the following steps to monitor the deployment and interact with the model. Replace `<deployment-name>` and `<service-name>` with the appropriate names from the model-specific deployment instructions (e.g., `$USER-serving-deepseek-r1` and `$USER-serving-deepseek-r1-svc`). + + + +### 5.1. Check Deployment Status + +Check the status of your deployment. Replace the name if you deployed a different model. + +```bash +# Example for DeepSeek-R1 671B +kubectl get deployment/$USER-serving-deepseek-r1 +``` + +Wait until the `READY` column shows `1/1`. If it shows `0/1`, the pod is still starting up. + +> [!NOTE] +> In the GKE UI on Cloud Console, you might see a status of "Does not have minimum availability" during startup. This is normal and will resolve once the pod is ready. + + +### 5.2.
View Logs + +To see the logs from the TRTLLM server (useful for debugging), use the `-f` flag to follow the log stream: + +```bash +kubectl logs -f deployment/$USER-serving-deepseek-r1 +``` + +You should see logs showing the model being prepared and then the throughput benchmark running, similar to this: + +```bash +Running benchmark for nvidia/DeepSeek-R1-NVFP4-v2 with ISL=1024, OSL=4096, TP=4, EP=4, PP=1 + +=========================================================== += PYTORCH BACKEND +=========================================================== +Model: nvidia/DeepSeek-R1-NVFP4-v2 +Model Path: /ssd/nvidia/DeepSeek-R1-NVFP4-v2 +Revision: N/A +TensorRT LLM Version: 1.2 +Dtype: bfloat16 +KV Cache Dtype: FP8 +Quantization: NVFP4 + +=========================================================== += MACHINE DETAILS +=========================================================== +NVIDIA B200, memory 178.35 GB, 4.00 GHz + +=========================================================== += REQUEST DETAILS +=========================================================== +Number of requests: 1000 +Number of concurrent requests: 752.9244 +Average Input Length (tokens): 1024.0000 +Average Output Length (tokens): 4096.0000 +=========================================================== += WORLD + RUNTIME INFORMATION +=========================================================== +TP Size: 4 +PP Size: 1 +EP Size: 4 +Max Runtime Batch Size: 128 +Max Runtime Tokens: 2048 +Scheduling Policy: GUARANTEED_NO_EVICT +KV Memory Percentage: 85.00% +Issue Rate (req/sec): 8.6889E+13 + +=========================================================== += PERFORMANCE OVERVIEW +=========================================================== +Request Throughput (req/sec): X.XX +Total Output Throughput (tokens/sec): X.XX +Total Token Throughput (tokens/sec): X.XX +Total Latency (ms): X.XX +Average request latency (ms): X.XX +Per User Output Throughput [w/ ctx] (tps/user): X.XX +Per GPU Output Throughput (tps/gpu): X.XX + 
+-- Request Latency Breakdown (ms) ----------------------- + +[Latency] P50 : X.XX +[Latency] P90 : X.XX +[Latency] P95 : X.XX +[Latency] P99 : X.XX +[Latency] MINIMUM: X.XX +[Latency] MAXIMUM: X.XX +[Latency] AVERAGE: X.XX + +=========================================================== += DATASET DETAILS +=========================================================== +Dataset Path: /ssd/token-norm-dist_DeepSeek-R1-NVFP4-v2_1024_4096_tp4.json +Number of Sequences: 1000 + +-- Percentiles statistics --------------------------------- + + Input Output Seq. Length +----------------------------------------------------------- +MIN: 1024.0000 4096.0000 5120.0000 +MAX: 1024.0000 4096.0000 5120.0000 +AVG: 1024.0000 4096.0000 5120.0000 +P50: 1024.0000 4096.0000 5120.0000 +P90: 1024.0000 4096.0000 5120.0000 +P95: 1024.0000 4096.0000 5120.0000 +P99: 1024.0000 4096.0000 5120.0000 +=========================================================== +``` + + +## 6. Cleanup + +To avoid incurring further charges, clean up the resources you created. + +1. **Uninstall the Helm Release:** + + First, list your releases to get the deployed models: + + ```bash + # list deployed models + helm list --filter $USER-serving- + ``` + + Then, uninstall the desired release: + + ```bash + # uninstall the deployed model + helm uninstall <release-name> + ``` + Replace `<release-name>` with one of the helm release names listed. + +2. **Delete the Kubernetes Secret:** + + ```bash + kubectl delete secret hf-secret --ignore-not-found=true + ``` + +3. (Optional) Delete the built Docker image from Artifact Registry if no longer needed. +4. (Optional) Delete Cloud Build logs. +5. (Optional) Clean up files in your GCS bucket if benchmarking was performed. +6. (Optional) Delete the provisioned [test environment](#test-environment), including the GKE cluster. 
\ No newline at end of file diff --git a/inference/a4/single-host-serving/tensorrt-llm/values.yaml b/inference/a4/single-host-serving/tensorrt-llm/values.yaml new file mode 100644 index 00000000..8fca4355 --- /dev/null +++ b/inference/a4/single-host-serving/tensorrt-llm/values.yaml @@ -0,0 +1,68 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +queue: + +dwsSettings: + maxRunDurationSeconds: + +huggingface: + secretName: hf-secret + secretData: + token: "hf_api_token" + +volumes: + gcsVolumes: true + ssdMountPath: "/ssd" + gcsMounts: + - bucketName: + mountPath: "/gcs" + +service: + type: ClusterIP + ports: + http: 8000 + +workload: + model: + name: + gpus: 4 + image: + framework: + configFile: serving-args.yaml + configPath: /workload/configs + envs: + - name: HF_HUB_ENABLE_HF_TRANSFER + value: "1" + - name: LAUNCHER_SCRIPT + value: "/workload/launcher/launch-workload.sh" + - name: SERVER_ARGS_FILE + value: "/workload/configs/serving-args.yaml" + - name: HF_HOME + value: "/ssd" + - name: LD_LIBRARY_PATH + value: "/usr/local/nvidia/lib64:/usr/local/lib/" + benchmarks: + experiments: + - isl: 1024 # input sequence length + osl: 4096 # output sequence length + # psl: 7900 # prefix sequence length + num_requests: 1000 + +network: + subnetworks[]: + gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.0.5 + ncclSettings: + - name: NCCL_DEBUG + value: "VERSION" \ No newline at end of file diff --git 
a/inference/a4x/single-host-serving/sglang/README.md b/inference/a4x/single-host-serving/sglang/README.md new file mode 100644 index 00000000..64297b95 --- /dev/null +++ b/inference/a4x/single-host-serving/sglang/README.md @@ -0,0 +1,445 @@ +# Single host inference benchmark of Wan2.2 with SGLang on A4x GKE Node Pool + +This document outlines the steps to serve and benchmark the Wan2.2 model using the [SGLang](https://github.com/sgl-project/sglang/tree/main) framework on a single [A4x GKE Node pool](https://cloud.google.com/kubernetes-engine). + +This guide walks you through setting up the necessary cloud infrastructure, configuring your environment, and benchmarking a high-performance video generation model. + + +## Table of Contents + +* [1. Test Environment](#test-environment) +* [2. High-Level Architecture](#architecture) +* [3. Environment Setup (One-Time)](#environment-setup) + * [3.1. Clone the Repository](#clone-repo) + * [3.2. Configure Environment Variables](#configure-vars) + * [3.3. Connect to your GKE Cluster](#connect-cluster) + * [3.4. Get Hugging Face Token](#get-hf-token) + * [3.5. Create Hugging Face Kubernetes Secret](#setup-hf-secret) +* [4. Run the Recipe](#run-the-recipe) + * [4.1. Model Variants](#serving-wan-model) +* [5. Monitoring and Troubleshooting](#monitoring) + * [5.1. Check Deployment Status](#check-status) + * [5.2. View Logs](#view-logs) + * [5.3. Common Issues](#troubleshooting) +* [6. Cleanup](#cleanup) + + +## 1. Test Environment + +[Back to Top](#table-of-contents) + +The recipe uses the following setup: + +* **Orchestration**: [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) +* **Deployment Configuration**: A [Helm chart](https://helm.sh/) is used to configure and deploy a [Kubernetes Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/). This deployment encapsulates the inference of the target model using the SGLang framework.
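The Helm release name you choose later doubles as the handle for every Kubernetes object this chart creates. As a quick orientation, here is a minimal sketch of the naming convention; the `-svc` suffix is an assumption taken from the deployment examples later in this guide, not a general Helm rule:

```shell
USER="${USER:-demo}"                          # fall back when USER is unset
RELEASE_NAME="${USER}-serving-wan2-2-model"   # helm release name used later in this guide
DEPLOYMENT_NAME="${RELEASE_NAME}"             # the Deployment inherits the release name
SERVICE_NAME="${RELEASE_NAME}-svc"            # the Service appends -svc

echo "deployment: ${DEPLOYMENT_NAME}"
echo "service:    ${SERVICE_NAME}"
```

Keeping these names in shell variables makes the later `kubectl get deployment/...` and `kubectl port-forward svc/...` commands easy to script.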
+ +This recipe has been optimized for and tested with the following configuration: + +* **GKE Cluster**: + * A [regional standard cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/configuration-overview) version: `1.31.7-gke.1265000` or later. + * A GPU node pool with 1 [a4x-highgpu-4g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) machine. + * [Workload Identity Federation for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled. + * [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled. + * [DCGM metrics](https://cloud.google.com/kubernetes-engine/docs/how-to/dcgm-metrics) enabled. + * [Kueue](https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta1/) and [JobSet](https://jobset.sigs.k8s.io/docs/overview/) APIs installed. + * Kueue configured to support [Topology Aware Scheduling](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/). +* A regional Google Cloud Storage (GCS) bucket to store logs generated by the recipe runs. + +> [!IMPORTANT] +> To prepare the required environment, see the [GKE environment setup guide](../../../../docs/configuring-environment-gke-a4x.md). +> Provisioning a new GKE cluster is a long-running operation and can take **20-30 minutes**. + + +## 2. 
High-Level Flow + +[Back to Top](#table-of-contents) + +Here is a simplified diagram of the flow that we follow in this recipe: + +```mermaid +--- +config: + layout: dagre +--- +flowchart TD + subgraph workstation["Client Workstation"] + T["Cluster Toolkit"] + B("Kubernetes API") + A["helm install"] + Y["gcloud"] + end + subgraph imagerepo["Build Image"] + H["Artifact Registry"] + G["Cloud Build"] + end + subgraph huggingface["Hugging Face Hub"] + I["Model Weights"] + end + subgraph gke["GKE Cluster (a4x)"] + C["Deployment"] + D["Pod"] + E["SGLang Container"] + F["Service"] + end + subgraph storage["Cloud Storage"] + J["Bucket"] + end + + %% Logical/actual flow + T -- Create Cluster --> gke + A --> B + G -- Pushes Image --> H + B --> C & F + C --> D + D --> E + F --> C + H -- Pulls Image --> E + E -- Downloads at runtime --> I + E -- Write logs --> J + Y -- Run Build --> imagerepo + + + %% Layout control + gke ~~~ imagerepo +``` + +* **helm:** A package manager for Kubernetes to define, install, and upgrade applications. It's used here to configure and deploy the Kubernetes Deployment. +* **Deployment:** Manages the lifecycle of your model server pod, ensuring it stays running. +* **Service:** Provides a stable network endpoint (a DNS name and IP address) to access your model server. +* **Cloud Storage:** A Cloud Storage bucket to store benchmark logs and other artifacts. +* **Pod:** The smallest deployable unit in Kubernetes. The SGLang container runs inside this pod on a GPU-enabled node. + + + +## 3. Environment Setup (One-Time) + +[Back to Top](#table-of-contents) + +First, you'll configure your local environment. These steps are required once before you can deploy any models. + + +### 3.1. Clone the Repository + +```bash +git clone https://github.com/ai-hypercomputer/gpu-recipes.git +cd gpu-recipes +export REPO_ROOT=$(pwd) +export RECIPE_ROOT=$REPO_ROOT/inference/a4x/single-host-serving/sglang +``` + + +### 3.2. 
Configure Environment Variables + +This is the most critical step. These variables are used in subsequent commands to target the correct resources. + +```bash +export PROJECT_ID= +export CLUSTER_REGION= +export CLUSTER_NAME= +export KUEUE_NAME= +export GCS_BUCKET= +export SGLANG_IMAGE=lmsysorg/sglang +export SGLANG_VERSION=latest + +# Set the project for gcloud commands +gcloud config set project $PROJECT_ID +``` + +Replace the following values: + +| Variable | Description | Example | +| --------------------- | ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- | +| `PROJECT_ID` | Your Google Cloud Project ID. | `gcp-project-12345` | +| `CLUSTER_REGION` | The GCP region where your GKE cluster is located. | `us-central1` | +| `CLUSTER_NAME` | The name of your GKE cluster. | `a4x-gke-cluster` | +| `KUEUE_NAME` | The name of the Kueue local queue. The default queue created by the cluster toolkit is `a4x`. Verify the name in your cluster. | `a4x` | +| `GCS_BUCKET` | Name of your GCS bucket (do not include `gs://`). | `my-benchmark-logs-bucket` | +| `SGLANG_IMAGE` | The SGLang serving image to use (a pre-built image from Docker Hub). | `lmsysorg/sglang` | +| `SGLANG_VERSION` | The tag of the SGLang image to use. | `latest` | + + + +### 3.3. Connect to your GKE Cluster + +Fetch credentials for `kubectl` to communicate with your cluster. + +```bash +gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION +``` + + +### 3.4. Get Hugging Face token + +To access models through Hugging Face, you'll need a Hugging Face token. + 1. Create a [Hugging Face account](https://huggingface.co/) if you don't have one. + 2. For **gated models**, ensure you have requested and been granted access on Hugging Face before proceeding. + 3. Generate an Access Token: Go to **Your Profile > Settings > Access Tokens**. + 4. Select **New Token**. + 5.
Specify a Name and a Role of at least `Read`. + 6. Select **Generate a token**. + 7. Copy the generated token to your clipboard. You'll use this later. + + + +### 3.5. Create Hugging Face Kubernetes Secret + +Create a Kubernetes Secret with your Hugging Face token to enable the pod to download model checkpoints from Hugging Face. + +```bash +# Paste your Hugging Face token here +export HF_TOKEN= + +kubectl create secret generic hf-secret \ +--from-literal=hf_api_token=${HF_TOKEN} \ +--dry-run=client -o yaml | kubectl apply -f - +``` + + + +**You have now completed the environment setup!** You are ready to deploy a model. + + +## 4. Run the Recipe + +[Back to Top](#table-of-contents) + +This recipe supports the deployment of the following models: + +1. [Wan2.2](#serving-wan-model) + +Now, select a model to deploy. Each section below is self-contained for deploying a specific model. + +> [!NOTE] +> After running the recipe with `helm install`, it can take **up to 30 minutes** for the deployment to become fully available. This is because the GKE node must first pull the Docker image and then download the model weights from Hugging Face. + + +### 4.1. Model Variants + +[Back to Top](#table-of-contents) + +This recipe serves the [Wan2.2-T2V-A14B-Diffusers model](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers) and the [Wan2.2-I2V-A14B-Diffusers model](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B-Diffusers) using the SGLang framework on a single A4x node. + +Upon launch, the SGLang server performs the following steps: + +1. Downloads the full Wan2.2 model checkpoints from Hugging Face. +2. Loads the model checkpoints and applies SGLang optimizations. +3. The server is ready to respond to requests. + + +#### 4.1.1. Deploy Wan2.2 + +1.
**Install the helm chart to prepare and serve the model using the SGLang framework:** + + ```bash + cd $RECIPE_ROOT + helm install -f values.yaml \ + --set-file workload_launcher=$REPO_ROOT/src/launchers/sglang-diffusion-launcher.sh \ + --set-file serving_config=$REPO_ROOT/src/frameworks/a4x/sglang-configs/wan2.2.yaml \ + --set queue=${KUEUE_NAME} \ + --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \ + --set workload.model.name=Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --set workload.image=${SGLANG_IMAGE}:${SGLANG_VERSION} \ + --set workload.framework=sglang \ + $USER-serving-wan2-2-model \ + $REPO_ROOT/src/helm-charts/a4x/inference-templates/deployment + ``` + + This creates a Helm release, a Deployment named `$USER-serving-wan2-2-model`, and a Service named `$USER-serving-wan2-2-model-svc`. + +2. **Check the deployment status.** + + ```bash + kubectl get deployment/$USER-serving-wan2-2-model + ``` + + Wait until the `READY` column shows `1/1`. See the [Monitoring and Troubleshooting](#monitoring) section to view the deployment logs. + + > [!NOTE] + > This deployment process can take **up to 30 minutes** as it downloads the model weights from Hugging Face and then the server loads the model weights. + + +#### 4.1.2. Interact with Wan2.2 model + +1. **Make a Video Generation API request:** + + Submit a text-to-video generation job. Note that video generation is asynchronous; the initial response will provide a Job ID. + + ```bash + kubectl exec -it deployment/$USER-serving-wan2-2-model -- \ + curl -s http://localhost:8000/v1/videos \ + -H "Content-Type: application/json" \ + -d '{ + "model":"Wan-AI/Wan2.2-T2V-A14B-Diffusers", + "prompt": "A cinematic, high-detailed shot of a futuristic city with flying vehicles at sunset, 4k resolution.", + "num_frames": 81, + "fps": 16, + "size": "1280x720", + "seed": 1024 + }' | jq '.' + ``` +2.
**Example for I2V (Image-to-Video):** + + Submit an image-to-video generation job by referencing an input image that is available inside the pod: + + ```bash + kubectl exec -it deployment/$USER-serving-wan2-2-model -- \ + curl -s http://localhost:8000/v1/videos \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Wan-AI/Wan2.2-I2V-A14B-Diffusers", + "prompt": "The character in the image starts walking toward the camera, cinematic lighting.", + "input_reference": "assets/sampleImage.png", + "num_frames": 81, + "fps": 16 + }' | jq '.' + ``` +3. **Generate a Video via Utility Script:** + + For a more automated experience, use the provided `stream_video.sh` script. First, forward the local port in one terminal: + + ```bash + kubectl port-forward svc/$USER-serving-wan2-2-model-svc 8000:8000 + ``` + + In a separate terminal, run the `stream_video.sh` utility script: + + ```bash + $RECIPE_ROOT/stream_video.sh "A curious raccoon in a field of sunflowers." + ``` + + +#### 4.1.3. Benchmark Wan2.2 + +1.
Run the [SGLang benchmarking tool](https://docs.sglang.ai/references/benchmark_and_profiling.html) directly inside the running deployment: + + *Benchmark: Text-to-Video on 1 GPU* + ```bash + kubectl exec -it deployment/$USER-serving-wan2-2-model -- /bin/sh -c \ + 'sglang generate --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --num-gpus 1 --tp-size 1 --num-frames 81 --save-output \ + --prompt "Cyberpunk city street in the rain, neon lights reflecting on puddles."' + ``` + *Benchmark: Text-to-Video on 4 GPUs* + ```bash + kubectl exec -it deployment/$USER-serving-wan2-2-model -- /bin/sh -c \ + 'sglang generate --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \ + --num-gpus 4 --tp-size 4 --num-frames 93 --save-output \ + --prompt "Cyberpunk city street in the rain, neon lights reflecting on puddles."' + ``` + + *Benchmark: Image-to-Video on 1 GPU* + ```bash + kubectl exec -it deployment/$USER-serving-wan2-2-model -- /bin/sh -c \ + 'sglang generate --model-path Wan-AI/Wan2.2-I2V-A14B-Diffusers \ + --num-gpus 1 --tp-size 1 --num-frames 81 --save-output \ + --image "assets/sampleImage" \ + --prompt "The cat in the image blinks and looks at the camera."' + ``` + *Benchmark: Image-to-Video on 4 GPUs* + ```bash + kubectl exec -it deployment/$USER-serving-wan2-2-model -- /bin/sh -c \ + 'sglang generate --model-path Wan-AI/Wan2.2-I2V-A14B-Diffusers \ + --num-gpus 4 --tp-size 4 --num-frames 93 --save-output \ + --image "assets/sampleImage" \ + --prompt "The cat in the image blinks and looks at the camera."' + ``` + + Benchmark results are displayed in the logs. + + + +## 5. Monitoring and Troubleshooting + +[Back to Top](#table-of-contents) + +After the model is deployed via Helm as described in the sections [above](#run-the-recipe), use the following steps to monitor the deployment and interact with the model. Replace `<deployment-name>` and `<service-name>` with the appropriate names from the model-specific deployment instructions (e.g., `$USER-serving-wan2-2-model` and `$USER-serving-wan2-2-model-svc`).
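Because video generation is asynchronous (section 4.1.2), monitoring a job means polling its status endpoint and branching on the returned `status` field, which is what the `stream_video.sh` helper does. The branching logic, stripped of `kubectl` so it can be tried anywhere, is sketched below; the JSON field names are the ones the helper script reads, and the reply is mocked rather than real server output:

```shell
# Mocked reply from GET /v1/videos/<job-id>; a live reply comes from the server.
STATUS_REPLY='{"id":"job-123","status":"completed","progress":100,"file_path":"/outputs/video.mp4"}'

# Extract the job state with jq, as stream_video.sh does.
STATUS=$(echo "$STATUS_REPLY" | jq -r '.status')

if [ "$STATUS" = "completed" ]; then
  FILE_PATH=$(echo "$STATUS_REPLY" | jq -r '.file_path')
  echo "Video ready at: $FILE_PATH"
elif [ "$STATUS" = "failed" ]; then
  echo "Generation failed: $(echo "$STATUS_REPLY" | jq -r '.error')" >&2
else
  echo "Still rendering: $(echo "$STATUS_REPLY" | jq -r '.progress')%"
fi
```

In the real loop, `STATUS_REPLY` is re-fetched with `kubectl exec ... curl` every few seconds until the job leaves the pending state.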
+ + + +### 5.1. Check Deployment Status + +Check the status of your deployment. Replace the name if you deployed a different model. + +```bash +# Example for Wan +kubectl get deployment/$USER-serving-wan2-2-model +``` + +Wait until the `READY` column shows `1/1`. If it shows `0/1`, the pod is still starting up. + +> [!NOTE] +> In the GKE UI on Cloud Console, you might see a status of "Does not have minimum availability" during startup. This is normal and will resolve once the pod is ready. + + +### 5.2. View Logs + +To see the logs from the SGLang server (useful for debugging), use the `-f` flag to follow the log stream: + +```bash +kubectl logs -f deployment/$USER-serving-wan2-2-model +``` + +You should see logs indicating SGLang server downloading/loading the model, and then starting the API server, similar to this: + +```bash +INFO: Started server process [2173] +INFO: Waiting for application startup. +INFO: Application startup complete. +INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit) +INFO: 127.0.0.1:60018 - "GET /get_model_info HTTP/1.1" 200 OK +... +INFO: The server is fired up and ready to roll! +``` + + +### 5.3. Common Issues + +* **Error: `Connection refused` when using `port-forward`** + + If you are trying to stream responses using `kubectl port-forward` and get a connection error, check the following: + + 1. **Is the deployment ready?** Run `kubectl get deployment` and ensure the `READY` column is `1/1`. + 2. **Is the port-forward command running?** The command must remain active in its own terminal while you make requests. + 3. **Check Pod Logs:** Use `kubectl logs -f ...` to check for any error messages. + 4. **Try again:** Sometimes transient network issues can cause this. Stop the `port-forward` command (`Ctrl+C`) and run it again. + +* **Error: `denied: requested access to the resource is denied` during Cloud Build** + + This almost always means the `ARTIFACT_REGISTRY` environment variable is incorrect. 
It **must** be the full path: `<region>-docker.pkg.dev/<project-id>/<repository>`. + +* **Error: `deployments.apps "..." not found`** + + This indicates a typo in the deployment name. Use `helm list` to see the correct release names or `kubectl get deployments` to see all available deployment names. + + +## 6. Cleanup + +To avoid incurring further charges, clean up the resources you created. + +1. **Uninstall the Helm Release:** + + First, list your releases to get the deployed models: + + ```bash + # list deployed models + helm list --filter $USER-serving- + ``` + + Then, uninstall the desired release: + + ```bash + # uninstall the deployed model + helm uninstall <release-name> + ``` + Replace `<release-name>` with one of the helm release names listed. + +2. **Delete the Kubernetes Secret:** + + ```bash + kubectl delete secret hf-secret --ignore-not-found=true + ``` + +3. (Optional) Delete the built Docker image from Artifact Registry if no longer needed. +4. (Optional) Delete Cloud Build logs. +5. (Optional) Clean up files in your GCS bucket if benchmarking was performed. +6. (Optional) Delete the provisioned [test environment](#test-environment), including the GKE cluster. diff --git a/inference/a4x/single-host-serving/sglang/stream_video.sh b/inference/a4x/single-host-serving/sglang/stream_video.sh new file mode 100644 index 00000000..1de656d1 --- /dev/null +++ b/inference/a4x/single-host-serving/sglang/stream_video.sh @@ -0,0 +1,61 @@ +#!/bin/bash + +[ $# -eq 0 ] && { echo "Usage: $0 \"Your prompt\""; exit 1; } + +PROMPT="$1" + +POD_NAME=$(kubectl get pods --no-headers -o custom-columns=":metadata.name" | grep "${USER}-serving-wan2-2-model" | head -n 1) + +if [ -z "$POD_NAME" ]; then + echo "Error: Could not find a running Wan2.2 pod." + echo "Please ensure your deployment is active." + exit 1 +fi + +echo "Using Pod: $POD_NAME" +echo "Submitting Video Job..." 
+ +RESPONSE=$(kubectl exec "$POD_NAME" -- curl -s -X POST "http://localhost:8000/v1/videos" \ + -H "Content-Type: application/json" \ + -d "{ + \"model\": \"Wan-AI/Wan2.2-T2V-A14B-Diffusers\", + \"prompt\": \"$PROMPT\", + \"num_frames\": 81, + \"fps\": 16 + }") + +JOB_ID=$(echo "$RESPONSE" | jq -r '.id') + +if [ "$JOB_ID" == "null" ] || [ -z "$JOB_ID" ]; then + echo "Error: Failed to get Job ID. Response: $RESPONSE" + exit 1 +fi + +echo "Job Submitted! ID: $JOB_ID" + +echo -n "Rendering Video..." +while true; do + # Check status inside the pod + STATUS_REPLY=$(kubectl exec "$POD_NAME" -- curl -s "http://localhost:8000/v1/videos/$JOB_ID") + STATUS=$(echo "$STATUS_REPLY" | jq -r '.status') + PROGRESS=$(echo "$STATUS_REPLY" | jq -r '.progress') + + if [ "$STATUS" == "completed" ]; then + FILE_PATH=$(echo "$STATUS_REPLY" | jq -r '.file_path') + echo -e "\nSuccess! Video generated at: $FILE_PATH" + echo "To download run: kubectl cp $POD_NAME:$FILE_PATH ./output.mp4" + break + elif [ "$STATUS" == "failed" ]; then + ERROR_MSG=$(echo "$STATUS_REPLY" | jq -r '.error') + echo -e "\nError during generation: $ERROR_MSG" + exit 1 + else + # Print progress percentage if available, otherwise dots + if [ "$PROGRESS" != "null" ] && [ "$PROGRESS" != "0" ]; then + echo -ne "\rRendering Video... $PROGRESS%" + else + echo -n "." 
+ fi + sleep 10 + fi +done diff --git a/inference/a4x/single-host-serving/sglang/values.yaml b/inference/a4x/single-host-serving/sglang/values.yaml new file mode 100644 index 00000000..626ab664 --- /dev/null +++ b/inference/a4x/single-host-serving/sglang/values.yaml @@ -0,0 +1,48 @@ +queue: + +dwsSettings: + maxRunDurationSeconds: + +huggingface: + secretName: hf-secret + secretData: + token: "hf_api_token" + +volumes: + gcsVolumes: true + ssdMountPath: "/ssd" + gcsMounts: + - bucketName: + mountPath: "/gcs" + +service: + type: ClusterIP + ports: + http: 8000 + +workload: + model: + name: + gpus: 4 + image: + framework: + configFile: serving-args.yaml + configPath: /workload/configs + envs: + - name: HF_HUB_ENABLE_HF_TRANSFER + value: "1" + - name: LAUNCHER_SCRIPT + value: "/workload/launcher/launch-workload.sh" + - name: SERVER_ARGS_FILE + value: "/workload/configs/serving-args.yaml" + - name: HF_HOME + value: "/ssd" + - name: LD_LIBRARY_PATH + value: "/usr/local/nvidia/lib64:/usr/local/lib/" + +network: + subnetworks[]: + gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-arm64:v1.0.7 + ncclSettings: + - name: NCCL_DEBUG + value: "WARN" diff --git a/inference/a4x/single-host-serving/tensorrt-llm/README.md b/inference/a4x/single-host-serving/tensorrt-llm/README.md index 8bb4cb2d..1abec421 100644 --- a/inference/a4x/single-host-serving/tensorrt-llm/README.md +++ b/inference/a4x/single-host-serving/tensorrt-llm/README.md @@ -129,7 +129,7 @@ export CLUSTER_REGION= export CLUSTER_NAME= export KUEUE_NAME= export GCS_BUCKET= -export TRTLLM_VERSION=1.2.0rc2 +export TRTLLM_VERSION=1.3.0rc5 # Set the project for gcloud commands gcloud config set project $PROJECT_ID @@ -199,9 +199,15 @@ kubectl create secret generic hf-secret \ This recipe supports the following models. You can easily swap between them by changing the environment variables in the next step. 
+TRTLLM inference benchmarking on these models has been tested and validated on A4X GKE nodes only with specific combinations of TP, PP, EP, number of GPU chips, input and output sequence lengths, precision, and other settings. + +Example model configuration YAML files included in this repo show only one such combination of parallelism hyperparameters and configs for benchmarking purposes. The input and output lengths in `gpu-recipes/inference/a4x/single-host-serving/tensorrt-llm/values.yaml` need to be adjusted according to the model and its configs. + | Model Name | Hugging Face ID | Configuration File | Release Name Suffix | | :--- | :--- | :--- | :--- | -| **DeepSeek-R1 671B** | `nvidia/DeepSeek-R1-NVFP4-v2` | `deepseek-r1-nvfp4.yaml` | `deepseek-r1-model` | +| **DeepSeek-R1 671B** | `nvidia/DeepSeek-R1-NVFP4-v2` | `deepseek-r1-nvfp4.yaml` | `deepseek-r1` | +| **Llama 3.1 405B NVFP4** | `nvidia/Llama-3.1-405B-Instruct-NVFP4` | `llama-3.1-405b.yaml` | `llama-3-1-405b-nvfp4` | +| **Llama 3.1 405B FP8** | `meta-llama/Llama-3.1-405B-Instruct-FP8` | `llama-3.1-405b.yaml` | `llama-3-1-405b-fp8` | | **Llama 3.1 70B** | `meta-llama/Llama-3.1-70B-Instruct` | `llama-3.1-70b.yaml` | `llama-3-1-70b` | | **Llama 3.1 8B** | `meta-llama/Llama-3.1-8B-Instruct` | `llama-3.1-8b.yaml` | `llama-3-1-8b` | | **Qwen 3 32B** | `Qwen/Qwen3-32B` | `qwen3-32b.yaml` | `qwen3-32b` | @@ -223,10 +229,10 @@ The recipe uses [`trtllm-bench`](https://github.com/NVIDIA/TensorRT-LLM/blob/mai 1. **Configure model-specific variables.** Choose a model from the [table above](#supported-models) and set the variables: ```bash - # Example for Llama 3.1 70B - export HF_MODEL_ID="meta-llama/Llama-3.1-70B-Instruct" - export CONFIG_FILE="llama-3.1-70b.yaml" - export RELEASE_NAME="$USER-serving-llama-3-1-70b" + # Example for DeepSeek-R1 NVFP4 + export HF_MODEL_ID="nvidia/DeepSeek-R1-NVFP4-v2" + export CONFIG_FILE="deepseek-r1-nvfp4.yaml" + export RELEASE_NAME="$USER-serving-deepseek-r1" ``` 2.
**Install the helm chart:** @@ -258,7 +264,7 @@ The recipe uses [`trtllm-bench`](https://github.com/NVIDIA/TensorRT-LLM/blob/mai [Back to Top](#table-of-contents) -After the model is deployed via Helm as described in the sections [above](#run-the-recipe), use the following steps to monitor the deployment and interact with the model. Replace `` and `` with the appropriate names from the model-specific deployment instructions (e.g., `$USER-serving-deepseek-r1-model` and `$USER-serving-deepseek-r1-model-svc`). +After the model is deployed via Helm as described in the sections [above](#run-the-recipe), use the following steps to monitor the deployment and interact with the model. Replace `` and `` with the appropriate names from the model-specific deployment instructions (e.g., `$USER-serving-deepseek-r1` and `$USER-serving-deepseek-r1-svc`). @@ -268,7 +274,7 @@ Check the status of your deployment. Replace the name if you deployed a differen ```bash # Example for DeepSeek-R1 671B -kubectl get deployment/$USER-serving-deepseek-r1-model +kubectl get deployment/$USER-serving-deepseek-r1 ``` Wait until the `READY` column shows `1/1`. If it shows `0/1`, the pod is still starting up. @@ -282,7 +288,7 @@ Wait until the `READY` column shows `1/1`. 
If it shows `0/1`, the pod is still s To see the logs from the TRTLLM server (useful for debugging), use the `-f` flag to follow the log stream: ```bash -kubectl logs -f deployment/$USER-serving-deepseek-r1-model +kubectl logs -f deployment/$USER-serving-deepseek-r1 ``` You should see logs indicating preparing the model, and then running the throughput benchmark test, similar to this: diff --git a/inference/a4x/single-host-serving/tensorrt-llm/values.yaml b/inference/a4x/single-host-serving/tensorrt-llm/values.yaml index 2560ff83..2489b38c 100644 --- a/inference/a4x/single-host-serving/tensorrt-llm/values.yaml +++ b/inference/a4x/single-host-serving/tensorrt-llm/values.yaml @@ -51,8 +51,8 @@ workload: value: "/workload/configs/serving-args.yaml" benchmarks: experiments: - - isl: 128 - osl: 128 + - isl: 2048 + osl: 2048 num_requests: 1000 network: diff --git a/src/frameworks/a4/sglang-configs/wan2.2.yaml b/src/frameworks/a4/sglang-configs/wan2.2.yaml new file mode 100644 index 00000000..8fce5589 --- /dev/null +++ b/src/frameworks/a4/sglang-configs/wan2.2.yaml @@ -0,0 +1,8 @@ +tp-size: 4 +num-gpus: 4 +trust-remote-code: true +text-encoder-cpu-offload: false +vae-cpu-offload: false +dit-cpu-offload: false +dit-layerwise-offload: false +port: 8000 diff --git a/src/frameworks/a4/trtllm-configs/deepseek-r1-nvfp4.yaml b/src/frameworks/a4/trtllm-configs/deepseek-r1-nvfp4.yaml new file mode 100644 index 00000000..5ebdf9d4 --- /dev/null +++ b/src/frameworks/a4/trtllm-configs/deepseek-r1-nvfp4.yaml @@ -0,0 +1,35 @@ +tp_size: 4 +ep_size: 4 +pp_size: 1 +backend: pytorch +kv_cache_free_gpu_mem_fraction: 0.85 +llm_api_args: + cuda_graph_config: + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 20 + - 24 + - 32 + - 64 + - 96 + - 128 + - 160 + - 192 + - 256 + - 320 + - 384 + - 512 + enable_padding: true + enable_attention_dp: true + enable_chunked_prefill: true + kv_cache_config: + dtype: auto + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + moe_config: + backend: 
CUTLASS + print_iter_log: true \ No newline at end of file diff --git a/src/frameworks/a4/trtllm-configs/llama-3-1-405b.yaml b/src/frameworks/a4/trtllm-configs/llama-3-1-405b.yaml new file mode 100644 index 00000000..ce5b48c6 --- /dev/null +++ b/src/frameworks/a4/trtllm-configs/llama-3-1-405b.yaml @@ -0,0 +1,4 @@ +tp_size: 4 +pp_size: 1 +backend: pytorch +kv_cache_free_gpu_mem_fraction: 0.90 \ No newline at end of file diff --git a/src/frameworks/a4/trtllm-configs/qwen2-5-vl-7b-fp8.yaml b/src/frameworks/a4/trtllm-configs/qwen2-5-vl-7b-fp8.yaml new file mode 100644 index 00000000..2307321a --- /dev/null +++ b/src/frameworks/a4/trtllm-configs/qwen2-5-vl-7b-fp8.yaml @@ -0,0 +1,4 @@ +tp_size: 1 +pp_size: 1 +backend: pytorch +kv_cache_free_gpu_mem_fraction: 0.90 \ No newline at end of file diff --git a/src/frameworks/a4/trtllm-configs/qwen2-5-vl-7b-nvfp4.yaml b/src/frameworks/a4/trtllm-configs/qwen2-5-vl-7b-nvfp4.yaml new file mode 100644 index 00000000..2307321a --- /dev/null +++ b/src/frameworks/a4/trtllm-configs/qwen2-5-vl-7b-nvfp4.yaml @@ -0,0 +1,4 @@ +tp_size: 1 +pp_size: 1 +backend: pytorch +kv_cache_free_gpu_mem_fraction: 0.90 \ No newline at end of file diff --git a/src/frameworks/a4/trtllm-configs/qwen3-235b-a22b-fp8.yaml b/src/frameworks/a4/trtllm-configs/qwen3-235b-a22b-fp8.yaml new file mode 100644 index 00000000..c55e5339 --- /dev/null +++ b/src/frameworks/a4/trtllm-configs/qwen3-235b-a22b-fp8.yaml @@ -0,0 +1,35 @@ +tp_size: 4 +pp_size: 1 +ep_size: 1 +backend: pytorch +kv_cache_free_gpu_mem_fraction: 0.20 +# llm_api_args: +# cuda_graph_config: +# batch_sizes: +# - 1 +# - 2 +# - 4 +# - 8 +# - 16 +# - 20 +# - 24 +# - 32 +# - 64 +# - 96 +# - 128 +# - 160 +# - 192 +# - 256 +# - 320 +# - 384 +# - 512 +# enable_padding: true +# enable_attention_dp: true +# enable_chunked_prefill: true +# kv_cache_config: +# dtype: auto +# enable_block_reuse: false +# free_gpu_memory_fraction: 0.40 +# moe_config: +# backend: CUTLASS +# print_iter_log: true \ No newline at end of 
file diff --git a/src/frameworks/a4/trtllm-configs/qwen3-235b-a22b-nvfp4.yaml b/src/frameworks/a4/trtllm-configs/qwen3-235b-a22b-nvfp4.yaml new file mode 100644 index 00000000..ce5b48c6 --- /dev/null +++ b/src/frameworks/a4/trtllm-configs/qwen3-235b-a22b-nvfp4.yaml @@ -0,0 +1,4 @@ +tp_size: 4 +pp_size: 1 +backend: pytorch +kv_cache_free_gpu_mem_fraction: 0.90 \ No newline at end of file diff --git a/src/frameworks/a4/trtllm-configs/qwen3-32b.yaml b/src/frameworks/a4/trtllm-configs/qwen3-32b.yaml new file mode 100644 index 00000000..2307321a --- /dev/null +++ b/src/frameworks/a4/trtllm-configs/qwen3-32b.yaml @@ -0,0 +1,4 @@ +tp_size: 1 +pp_size: 1 +backend: pytorch +kv_cache_free_gpu_mem_fraction: 0.90 \ No newline at end of file diff --git a/src/frameworks/a4x/sglang-configs/wan2.2.yaml b/src/frameworks/a4x/sglang-configs/wan2.2.yaml new file mode 100644 index 00000000..8fce5589 --- /dev/null +++ b/src/frameworks/a4x/sglang-configs/wan2.2.yaml @@ -0,0 +1,8 @@ +tp-size: 4 +num-gpus: 4 +trust-remote-code: true +text-encoder-cpu-offload: false +vae-cpu-offload: false +dit-cpu-offload: false +dit-layerwise-offload: false +port: 8000 diff --git a/src/frameworks/a4x/trtllm-configs/llama-3-1-405b.yaml b/src/frameworks/a4x/trtllm-configs/llama-3-1-405b.yaml new file mode 100755 index 00000000..ce5b48c6 --- /dev/null +++ b/src/frameworks/a4x/trtllm-configs/llama-3-1-405b.yaml @@ -0,0 +1,4 @@ +tp_size: 4 +pp_size: 1 +backend: pytorch +kv_cache_free_gpu_mem_fraction: 0.90 \ No newline at end of file diff --git a/src/frameworks/a4x/trtllm-configs/qwen2-5-vl-7b-fp8.yaml b/src/frameworks/a4x/trtllm-configs/qwen2-5-vl-7b-fp8.yaml new file mode 100644 index 00000000..2307321a --- /dev/null +++ b/src/frameworks/a4x/trtllm-configs/qwen2-5-vl-7b-fp8.yaml @@ -0,0 +1,4 @@ +tp_size: 1 +pp_size: 1 +backend: pytorch +kv_cache_free_gpu_mem_fraction: 0.90 \ No newline at end of file diff --git a/src/frameworks/a4x/trtllm-configs/qwen2-5-vl-7b-nvfp4.yaml 
b/src/frameworks/a4x/trtllm-configs/qwen2-5-vl-7b-nvfp4.yaml new file mode 100644 index 00000000..2307321a --- /dev/null +++ b/src/frameworks/a4x/trtllm-configs/qwen2-5-vl-7b-nvfp4.yaml @@ -0,0 +1,4 @@ +tp_size: 1 +pp_size: 1 +backend: pytorch +kv_cache_free_gpu_mem_fraction: 0.90 \ No newline at end of file diff --git a/src/frameworks/a4x/trtllm-configs/qwen3-235b-a22b-fp8.yaml b/src/frameworks/a4x/trtllm-configs/qwen3-235b-a22b-fp8.yaml new file mode 100644 index 00000000..c55e5339 --- /dev/null +++ b/src/frameworks/a4x/trtllm-configs/qwen3-235b-a22b-fp8.yaml @@ -0,0 +1,35 @@ +tp_size: 4 +pp_size: 1 +ep_size: 1 +backend: pytorch +kv_cache_free_gpu_mem_fraction: 0.20 +# llm_api_args: +# cuda_graph_config: +# batch_sizes: +# - 1 +# - 2 +# - 4 +# - 8 +# - 16 +# - 20 +# - 24 +# - 32 +# - 64 +# - 96 +# - 128 +# - 160 +# - 192 +# - 256 +# - 320 +# - 384 +# - 512 +# enable_padding: true +# enable_attention_dp: true +# enable_chunked_prefill: true +# kv_cache_config: +# dtype: auto +# enable_block_reuse: false +# free_gpu_memory_fraction: 0.40 +# moe_config: +# backend: CUTLASS +# print_iter_log: true \ No newline at end of file diff --git a/src/frameworks/a4x/trtllm-configs/qwen3-235b-a22b-nvfp4.yaml b/src/frameworks/a4x/trtllm-configs/qwen3-235b-a22b-nvfp4.yaml new file mode 100644 index 00000000..ce5b48c6 --- /dev/null +++ b/src/frameworks/a4x/trtllm-configs/qwen3-235b-a22b-nvfp4.yaml @@ -0,0 +1,4 @@ +tp_size: 4 +pp_size: 1 +backend: pytorch +kv_cache_free_gpu_mem_fraction: 0.90 \ No newline at end of file diff --git a/src/helm-charts/a3ultra/trtllm-inference/single-node/templates/benchmark-configmap.yaml b/src/helm-charts/a3ultra/trtllm-inference/single-node/templates/benchmark-configmap.yaml index b81c3251..ce8b4ade 100644 --- a/src/helm-charts/a3ultra/trtllm-inference/single-node/templates/benchmark-configmap.yaml +++ b/src/helm-charts/a3ultra/trtllm-inference/single-node/templates/benchmark-configmap.yaml @@ -50,7 +50,7 @@ data: --kv_cache_free_gpu_mem_fraction 
0.95 > $output_file cat $output_file - gsutil cp $output_file /gcs/benchmark_logs/ + gcloud storage cp $output_file /gcs/benchmark_logs/ rm -rf $engine_dir rm -f $dataset_file diff --git a/src/helm-charts/a4/inference-templates/deployment/templates/serving-launcher.yaml b/src/helm-charts/a4/inference-templates/deployment/templates/serving-launcher.yaml index ea3d1b86..0bce149f 100644 --- a/src/helm-charts/a4/inference-templates/deployment/templates/serving-launcher.yaml +++ b/src/helm-charts/a4/inference-templates/deployment/templates/serving-launcher.yaml @@ -171,6 +171,8 @@ spec: {{- end }} - name: NCCL_PLUGIN_PATH value: /usr/local/gib/lib64 + - name: LD_LIBRARY_PATH + value: /usr/local/gib/lib64:/usr/local/nvidia/lib64 {{- if $root.Values.network.gibVersion }} - name: NCCL_INIT_SCRIPT value: "/usr/local/gib/scripts/set_nccl_env.sh" @@ -180,6 +182,8 @@ spec: value: "{{ $root.Values.workload.model.name }}" - name: MODEL_DOWNLOAD_DIR value: "/ssd/{{ $root.Values.workload.model.name }}" + - name: TRTLLM_DIR + value: "/app/tensorrt_llm" {{- if $root.Values.workload.envs }} {{- toYaml .Values.workload.envs | nindent 12 }} {{- end }} @@ -189,6 +193,7 @@ spec: args: - | #!/bin/bash + pip install pyyaml hf_transfer if [[ -n "${NCCL_INIT_SCRIPT}" ]]; then echo "Running NCCL init script: ${NCCL_INIT_SCRIPT}" @@ -203,30 +208,46 @@ spec: fi ARGS=() + EXTRA_ARGS_FILE="/tmp/extra_llm_api_args.yaml" - if [ -f "$SERVER_ARGS_FILE" ]; then - echo "Loading server arguments from ConfigMap" - while IFS=': ' read -r key value || [ -n "$key" ]; do - [[ -z "$key" || "$key" == \#* ]] && continue - key=$(echo "$key" | xargs) - value=$(echo "$value" | xargs) + # Use Python to parse the main config file, extract llm_api_args, + # and generate the command-line arguments. 
+ python -c " + import yaml + import sys - if [ -n "$key" ]; then - # Handle boolean values - if [[ "$value" == "true" ]]; then - # For true values, just add the flag without a value - ARGS+=("--$key") - elif [[ "$value" == "false" ]]; then - ARGS+=("--$key" "false") - elif [ -n "$value" ]; then - # For non-boolean values, add both the flag and its value - ARGS+=("--$key" "$value") - else - ARGS+=("--$key") - fi - fi - done < "$SERVER_ARGS_FILE" - fi + args = [] + llm_api_args = {} + config_file = sys.argv[1] + extra_args_file = sys.argv[2] + + try: + with open(config_file, 'r') as f: + config = yaml.safe_load(f) + + if 'llm_api_args' in config: + llm_api_args = config.pop('llm_api_args') + with open(extra_args_file, 'w') as f: + yaml.dump(llm_api_args, f) + + for key, value in config.items(): + if value is True: + args.append(f'--{key}') + elif value is not False: + args.append(f'--{key}') + args.append(str(value)) + + # Print the arguments for the shell script to capture + print(' '.join(args)) + + except Exception as e: + print(f'Error parsing config file: {e}', file=sys.stderr) + sys.exit(1) + " "$SERVER_ARGS_FILE" "$EXTRA_ARGS_FILE" > /tmp/launcher_args.txt + + # Read the generated arguments into the ARGS array + mapfile -t ARGS < <(tr ' ' '\n' < /tmp/launcher_args.txt) + rm /tmp/launcher_args.txt {{ if eq $root.Values.workload.framework "trtllm" }} {{- range $root.Values.workload.benchmarks.experiments }} diff --git a/src/launchers/sglang-diffusion-launcher.sh b/src/launchers/sglang-diffusion-launcher.sh new file mode 100644 index 00000000..c9feb224 --- /dev/null +++ b/src/launchers/sglang-diffusion-launcher.sh @@ -0,0 +1,40 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#!/bin/bash
+
+set -eux # Exit on error or unset variable; print each command as it runs.
+
+echo "SGLang server arguments received:"
+echo " $@"
+echo ""
+
+echo "Launching SGLang server"
+
+export HF_HOME=/ssd
+
+if [ -z "$MODEL_NAME" ]; then
+  echo "Error: MODEL_NAME environment variable is not set."
+  exit 1
+fi
+
+echo "Using MODEL_NAME: $MODEL_NAME"
+
+apt-get update
+apt-get install -y pciutils
+
+sglang serve \
+  --model-path "$MODEL_NAME" \
+  "$@"
+
+echo "Server bringup is complete. SGLang server command finished."
diff --git a/src/launchers/trtllm-launcher.sh b/src/launchers/trtllm-launcher.sh
index 5e8ee091..07d99318 100644
--- a/src/launchers/trtllm-launcher.sh
+++ b/src/launchers/trtllm-launcher.sh
@@ -85,7 +85,7 @@ parse_serving_config() {
   for ((index = 0; index < ${#SERVING_CONFIG[@]}; )); do
     current_arg="${SERVING_CONFIG[$index]}"
-    next_arg="${SERVING_CONFIG[$((index + 1))]}"
+    next_arg=${SERVING_CONFIG[$((index + 1))]:-}
 
     # Handle --key=value format
     if [[ "$current_arg" =~ ^--[^=]+=.+ ]]; then
@@ -120,6 +120,11 @@ parse_serving_config() {
   ep_size=${SERVING_CONFIG_DICT["ep_size"]:=1}
   backend=${SERVING_CONFIG_DICT["backend"]:="tensorrt"}
   kv_cache_free_gpu_mem_fraction=${SERVING_CONFIG_DICT["kv_cache_free_gpu_mem_fraction"]:=0.95}
+  modality=${SERVING_CONFIG_DICT["modality"]:=""}
+  streaming=${SERVING_CONFIG_DICT["streaming"]:="false"}
+  max_input_len=${SERVING_CONFIG_DICT["max_input_len"]:=""}
+  max_batch_size=${SERVING_CONFIG_DICT["max_batch_size"]:=""}
+  custom_dataset=${SERVING_CONFIG_DICT["dataset"]:=""}
 }
 
 print_configuration() {
@@ -158,28 +163,41 @@
run_benchmark() { local backend=$8 local kv_cache_free_gpu_mem_fraction=$9 - echo "Running benchmark for $model_name with ISL=$isl, OSL=$osl, TP=$tp_size, PP=$pp_size, EP=$ep_size, backend=$7" + echo "Running benchmark for $model_name with ISL=$isl, OSL=$osl, TP=$tp_size, PP=$pp_size, EP=$ep_size, backend=$backend" + + vl_args="" + if [ -n "$modality" ]; then vl_args="$vl_args --modality $modality"; fi + if [ "$streaming" == "true" ]; then vl_args="$vl_args --streaming"; fi + if [ -n "$max_input_len" ]; then vl_args="$vl_args --max_input_len $max_input_len"; fi + if [ -n "$max_batch_size" ]; then vl_args="$vl_args --max_batch_size $max_batch_size"; fi + + dataset_file=$custom_dataset + if [ -z "$dataset_file" ]; then + dataset_file="/ssd/token-norm-dist_${model_name##*/}_${isl}_${osl}_tp${tp_size}.json" + echo "Preparing dataset" + python3 $TRTLLM_DIR/benchmarks/cpp/prepare_dataset.py \ + --tokenizer=$model_name \ + --stdout token-norm-dist \ + --num-requests=$num_requests \ + --input-mean=$isl \ + --output-mean=$osl \ + --input-stdev=0 \ + --output-stdev=0 >$dataset_file + fi - dataset_file="/ssd/token-norm-dist_${model_name##*/}_${isl}_${osl}_tp${tp_size}.json" output_file="/ssd/output_${model_name##*/}_isl${isl}_osl${osl}_tp${tp_size}.txt" extra_args_file="/tmp/extra_llm_api_args.yaml" extra_args="" if [ -f "$extra_args_file" ]; then extra_args="--extra_llm_api_options $extra_args_file" fi - - echo "Preparing dataset" - python3 $TRTLLM_DIR/benchmarks/cpp/prepare_dataset.py \ - --tokenizer=$model_name \ - --stdout token-norm-dist \ - --num-requests=$num_requests \ - --input-mean=$isl \ - --output-mean=$osl \ - --input-stdev=0 \ - --output-stdev=0 >$dataset_file + + export TOKENIZERS_PARALLELISM=false + echo "enable_cuda_graph: false" > /tmp/extra_llm_api_args.yaml if [[ $backend == "pytorch" ]]; then echo "Running throughput benchmark" + export NCCL_P2P_LEVEL=PHB trtllm-bench \ --model $model_name \ --model_path /ssd/${model_name} throughput \ @@ -188,7 +206,8 @@ 
run_benchmark() { --pp $pp_size \ --ep $ep_size \ --backend "pytorch" \ - --kv_cache_free_gpu_mem_fraction $kv_cache_free_gpu_mem_fraction $extra_args >$output_file + --kv_cache_free_gpu_mem_fraction $kv_cache_free_gpu_mem_fraction \ + $extra_args $vl_args > $output_file else echo "Building engine" trtllm-bench \ @@ -213,7 +232,7 @@ run_benchmark() { fi cat $output_file - gsutil cp $output_file /gcs/benchmark_logs/trtllm/ + gcloud storage cp $output_file /gcs/benchmark_logs/trtllm/ rm -rf $engine_dir rm -f $dataset_file diff --git a/src/utils/data_processing/waymo_dataset/README.md b/src/utils/data_processing/waymo_dataset/README.md index 37345bc9..8834355f 100644 --- a/src/utils/data_processing/waymo_dataset/README.md +++ b/src/utils/data_processing/waymo_dataset/README.md @@ -26,7 +26,7 @@ Before running the script, ensure you have the following prerequisites installed #### Google Cloud SDK -The `gsutil` command-line tool is required to download the dataset from Google Cloud Storage. +The `gcloud storage` command-line tool is required to download the dataset from Google Cloud Storage. 1. Install the Google Cloud SDK. 2. Authenticate with Google Cloud: @@ -103,7 +103,7 @@ print(processed_dataset[0]) ### 5. Common Issues -1. **`gsutil` Command Not Found**: This error occurs if the Google Cloud SDK is not installed or not in your system's `PATH`. Please follow the installation instructions in the Prerequisites section. +1. **`gcloud storage` Command Not Found**: This error occurs if the Google Cloud SDK is not installed or not in your system's `PATH`. Please follow the installation instructions in the Prerequisites section. 2. **GCS Access Denied / 401 Errors**: This indicates an authentication or permission issue. - Ensure you have registered for the Waymo dataset. @@ -111,4 +111,3 @@ print(processed_dataset[0]) - Make sure your GCP user or service account has `Storage Object Viewer` permissions on the `gs://waymo_open_dataset_v_2_0_1/` bucket. 3. 
**Corrupted Files**: If a specific Parquet file fails to process, it might be corrupted. The script is designed to be robust and will log an error and skip the corrupted segment, continuing with the rest of the data. - diff --git a/src/utils/data_processing/waymo_dataset/waymo_perception_data_processor.py b/src/utils/data_processing/waymo_dataset/waymo_perception_data_processor.py index 6837dc5c..97a01767 100644 --- a/src/utils/data_processing/waymo_dataset/waymo_perception_data_processor.py +++ b/src/utils/data_processing/waymo_dataset/waymo_perception_data_processor.py @@ -153,7 +153,7 @@ def _download_dataset_locally(input_dir: str): # If PARQUET_ID is empty, download all parquets in the directory source_for_gsutil = os.path.join(remote_path_item, "*.parquet") - gsutil_command = ["gsutil", "-m", "cp", "-r", source_for_gsutil, local_path_dir] + gsutil_command = ["gcloud", "storage", "cp", "--recursive", source_for_gsutil, local_path_dir] logger.info( f"[DATALOADER] Downloading dataset. Command: {' '.join(gsutil_command)}" @@ -164,9 +164,9 @@ def _download_dataset_locally(input_dir: str): ) logger.info(f"[DATALOADER] Successfully downloaded to {local_path_dir}.") if result.stdout: - logger.info(f"[DATALOADER] gsutil stdout: {result.stdout}") + logger.info(f"[DATALOADER] gcloud stdout: {result.stdout}") if result.stderr: # gsutil often prints status to stderr even on success - logger.info(f"[DATALOADER] gsutil stderr: {result.stderr}") + logger.info(f"[DATALOADER] gcloud stderr: {result.stderr}") except subprocess.CalledProcessError as e: logger.error( f"[Fatal][DATALOADER] Failed to download from {source_for_gsutil} to {local_path_dir}. 
" diff --git a/training/a3ultra/llama3-1-70b/nemo-pretraining-gke/README.md b/training/a3ultra/llama3-1-70b/nemo-pretraining-gke/README.md index 3fe216c4..c5e5ec0d 100644 --- a/training/a3ultra/llama3-1-70b/nemo-pretraining-gke/README.md +++ b/training/a3ultra/llama3-1-70b/nemo-pretraining-gke/README.md @@ -11,7 +11,7 @@ For this recipe, the following setup is used: - Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine). - Pretraining job configuration and deployment - A Helm chart is used to configure and deploy the [Kubernetes Jobset](https://kubernetes.io/blog/2025/03/23/introducing-jobset) - resource which manages the execution of the + resource which manages the execution of the [NeMo pretraining workload](https://github.com/NVIDIA-NeMo/NeMo/blob/v2.4.0/examples/nlp/language_modeling/megatron_gpt_pretraining.py). ## Test environment diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/recipe_launch_command.sh b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/recipe_launch_command.sh deleted file mode 100644 index 892961cb..00000000 --- a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/recipe_launch_command.sh +++ /dev/null @@ -1 +0,0 @@ -helm install joeywan-ubench-6wsw . 
-f values.yaml --set-file workload_launcher=launcher.sh --set-file workload_config=/tmp/ubench_recipe/joeywan-ubench-6wsw/custom_setup_experiment.py --set workload.image=nvcr.io/nvidia/nemo:26.02 --set volumes.gcsMounts[0].bucketName=ubench-logs --set volumes.gcsMounts[0].mountPath=/job-logs --set workload.envs[0].value=/job-logs/joeywan-ubench-6wsw --set queue=a4 \ No newline at end of file diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/Chart.yaml b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/Chart.yaml similarity index 100% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/Chart.yaml rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/Chart.yaml diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/README.md b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/README.md similarity index 98% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/README.md rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/README.md index aa487339..c5095c27 100644 --- a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/README.md +++ b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/README.md @@ -75,7 +75,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder. 
git clone https://github.com/ai-hypercomputer/gpu-recipes.git cd gpu-recipes export REPO_ROOT=`git rev-parse --show-toplevel` -export RECIPE_ROOT=$REPO_ROOT/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe +export RECIPE_ROOT=$REPO_ROOT/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe cd $RECIPE_ROOT ``` diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/custom_setup_experiment.py b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/custom_setup_experiment.py similarity index 100% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/custom_setup_experiment.py rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/custom_setup_experiment.py diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/launcher.sh b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/launcher.sh similarity index 100% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/launcher.sh rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/launcher.sh diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/recipe_launch_command.sh b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/recipe_launch_command.sh similarity index 100% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/recipe_launch_command.sh rename to 
training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/recipe_launch_command.sh diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/templates/workload-config-configmap.yaml b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/templates/workload-config-configmap.yaml similarity index 100% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/templates/workload-config-configmap.yaml rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/templates/workload-config-configmap.yaml diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/templates/workload-job.yaml b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/templates/workload-job.yaml similarity index 100% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/templates/workload-job.yaml rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/templates/workload-job.yaml diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/templates/workload-launcher-configmap.yaml b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/templates/workload-launcher-configmap.yaml similarity index 100% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/templates/workload-launcher-configmap.yaml rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/templates/workload-launcher-configmap.yaml diff --git 
a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/templates/workload-svc.yaml b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/templates/workload-svc.yaml similarity index 100% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/templates/workload-svc.yaml rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/templates/workload-svc.yaml diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/values.yaml b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/values.yaml similarity index 100% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/values.yaml rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-SEQ4096-GBS2048/recipe/values.yaml diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/Chart.yaml b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/Chart.yaml similarity index 100% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/Chart.yaml rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/Chart.yaml diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/README.md b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/README.md similarity index 98% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/README.md rename to 
training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/README.md index fc9352fb..f3954712 100644 --- a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/README.md +++ b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/README.md @@ -75,7 +75,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder. git clone https://github.com/ai-hypercomputer/gpu-recipes.git cd gpu-recipes export REPO_ROOT=`git rev-parse --show-toplevel` -export RECIPE_ROOT=$REPO_ROOT/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe +export RECIPE_ROOT=$REPO_ROOT/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe cd $RECIPE_ROOT ``` diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/custom_setup_experiment.py b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/custom_setup_experiment.py similarity index 51% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/custom_setup_experiment.py rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/custom_setup_experiment.py index 369cfa0a..2337fdec 100644 --- a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/custom_setup_experiment.py +++ b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/custom_setup_experiment.py @@ -1,19 +1,3 @@ -#!/usr/bin/env python3 - -# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - import glob import logging import os @@ -60,109 +44,6 @@ logger = logging.getLogger(__name__) -def check_training_finished(log_file_path: str) -> bool: - """Check if training is finished.""" - with open(log_file_path, "r") as f: - log_lines = f.readlines() - log = "\n".join(log_lines) - return "StopIteration" in log or "after training is done" in log or "exiting program at iteration" in log - - -def check_slurm_timeout(log_file_path: str) -> bool: - """Check if Slurm job timed out.""" - with open(log_file_path, "r") as f: - log_lines = f.readlines() - log = "\n".join(log_lines) - return "DUE TO TIME LIMIT" in log - - -def is_flaky_failure(log_file_path: str) -> bool: - """Check if Slurm job failed due to flaky failure.""" - with open(log_file_path, "r") as f: - log_lines = f.readlines() - log = "\n".join(log_lines) - - return ( - "The server socket has failed to listen on any local network address." in log - or "Some NCCL operations have failed or timed out." in log - or "uncorrectable ECC error encountered" in log - or "illegal memory access" in log - or "illegal instruction" in log - or "torch.distributed.DistNetworkError" in log - or "Segmentation fault" in log - or "found NaN in" in log - or "For debugging consider passing CUDA_LAUNCH_BLOCKING=1" in log - or "double free or corruption" in log - or "Call to CUDA function failed." 
in log - or "Connection reset by peer" in log - or "invalid pointer" in log - or "malloc(): unaligned tcache chunk detected" in log - or "zmq.error.ZMQError: Address already in use" in log - or "We couldn't connect to 'https://huggingface.co'" in log - or "Unpack failed: incomplete input" in log - or "unspecified launch failure" in log - or "free(): corrupted unsorted chunks" in log - or "Segfault encountered" in log - or "Fatal glibc error" in log - or "EOFError: No data left in file" in log - ) - - -def build_performance_config(args) -> Optional[Dict[str, Any]]: - """Build performance configuration from command-line arguments. - - Args: - args: Parsed command-line arguments - - Returns: - Dictionary with performance configuration or None if performance is disabled - """ - config = {} - - performance_params = { - "timing_threshold": args.timing_threshold, - "skip_first_percent_time": args.skip_first_percent_time, - } - - for key, value in performance_params.items(): - if value is not None: - config[key] = value - - return config if config else None - - -def ensure_logs_where_written(log_file_paths: List[str]): - """Ensure logs were written to disk.""" - if len(log_file_paths) != 1: - raise FileNotFoundError( - f"Unexpected number of log files found: {log_file_paths}. 
Expected 1, got {len(log_file_paths)}" - ) - - -def get_job_dir_and_status_from_run(exp_name: str): - """Get job directory and status from run.""" - result_dict = run.Experiment.from_title(exp_name).status(return_dict=True) - _, job_dict = list(result_dict.items())[0] - job_dir = job_dict["local_dir"] - job_status = str(job_dict["status"]) - return job_dir, job_status - - -def maybe_increase_n_attempts_on_flaky_failure( - n_attempts: int, - max_retries: int, - is_finished_experiment: bool, - is_long_convergence_run: bool, - log_file_paths: List[str], -): - """Maybe increase number of attempts.""" - if not is_finished_experiment and not is_long_convergence_run: - if is_flaky_failure(log_file_paths[-1]): - n_attempts += 1 - else: - n_attempts = max_retries # On non-flaky failures, we don't need to restart the experiment. - - return n_attempts def main( @@ -336,151 +217,14 @@ def main( logger.info("Will launch the following command with Nemo-Run: %s", " ".join(nemorun_script.to_command())) - is_finished_experiment = False # An experiment might consist of multiple training runs, due to restarts. - is_testing_passed = False # Whether the testing passed convergence and performance validation. - error_msg = None - n_attempts = 0 - exp_name = ( - exp_name[:37] if dgxc_cluster is not None else exp_name - ) # Some k8s clusters have a limit on the length of the experiment name. 
- wandb_run_id = None - while n_attempts <= max_retries: - while is_finished_experiment is False: - if HAVE_WANDB: - wandb_run_id = ( - (wandb_run_id or wandb.util.generate_id()) if is_long_convergence_run else wandb.util.generate_id() - ) - executor.env_vars.update( - { - "WANDB_RUN_ID": wandb_run_id, - "WANDB_RESUME": "allow", - } - ) - if wandb_key is not None: - executor.env_vars["WANDB_API_KEY"] = wandb_key - - run.run( - nemorun_script, - executor=executor, - plugins=plugins, - dryrun=dryrun, - detach=detach, - name=exp_name, - ) - if dryrun: - logger.info("dryrun requested: exiting") - return - - def _copy_logs_to_gcp(job_dir_path): - import shutil - import glob - - artifact_dir = os.environ.get("ARTIFACT_DIR", "/tmp/artifacts") - dest_logs_dir = os.path.join(artifact_dir, "logs") - os.makedirs(dest_logs_dir, exist_ok=True) - - try: - log_files = glob.glob(f"{job_dir_path}/log-*.out") + glob.glob(f"{job_dir_path}/log-*.err") - for log_f in log_files: - shutil.copy(log_f, dest_logs_dir) - msg = f"Copied {log_f} to {dest_logs_dir}" - print(msg) - logger.info(msg) - except Exception as e: - print(f"Failed to copy logs to GCP: {e}") - logger.error(f"Failed to copy logs to GCP: {e}") - - - job_dir, job_status = get_job_dir_and_status_from_run(exp_name) - - if job_status not in ["SUCCEEDED", "SUBMITTED", "PENDING", "RUNNING"]: - _copy_logs_to_gcp(job_dir) - raise Exception(f"Experiment failed for {exp_name} with status: {job_status}.") - - if detach: - is_finished_experiment = True - is_testing_passed = True - break - - log_file_paths = list(Path(f"{job_dir}").glob("log-*_0.out")) - ensure_logs_where_written(log_file_paths) - - is_finished_experiment = ( - check_training_finished(log_file_paths[-1]) if is_long_convergence_run else (job_status == "SUCCEEDED") - ) - - n_attempts = maybe_increase_n_attempts_on_flaky_failure( - n_attempts=n_attempts, - max_retries=max_retries, - is_finished_experiment=is_finished_experiment, - 
is_long_convergence_run=is_long_convergence_run, - log_file_paths=log_file_paths, - ) - - if not is_finished_experiment and n_attempts <= max_retries: - logger.error(f"Starting attempt {n_attempts + 1} of {max_retries + 1} for {exp_name}") - - if not is_finished_experiment: - break - - if is_finished_experiment is True and detach is False: - log_paths = sorted( - list(glob.glob(f"{get_nemorun_home()}/experiments/{exp_name}/{exp_name}_*/{exp_name}/log-*_0.out")) - ) - - if not is_long_convergence_run: - log_paths = [log_paths[-1]] - - logger.info(f"Starting convergence check for {model_family_name}_{model_recipe_name}") - wandb_run = None - if HAVE_WANDB and wandb_key: - wandb_run = wandb.init( - project=wandb_project_name, entity=wandb_entity_name, id=wandb_run_id, resume="allow" - ) - - logger.info("Waiting 10 seconds for I/O to settle") - time.sleep(10) - - is_testing_passed, error_msg = calc_convergence_and_performance( - model_family_name=model_family_name, - model_recipe_name=model_recipe_name, - assets_dir=os.path.join(job_dir, exp_name), - log_paths=log_paths, - loss_metric="lm loss", - timing_metric="elapsed time per iteration (ms)", - alloc_metric="alloc", - max_alloc_metric="max_alloc", - golden_values_path=golden_values_path, - convergence_config=convergence_params, - performance_config=performance_params, - memory_config=memory_params, - wandb_run=wandb_run, - ) - - if wandb_run: - wandb_run.finish() - wandb.teardown(exit_code=int(not is_testing_passed)) - - if not is_long_convergence_run: - n_attempts = max_retries - is_finished_experiment = True - if not is_testing_passed: - _copy_logs_to_gcp(job_dir) - break - - if is_finished_experiment and is_testing_passed: - break - - if not is_testing_passed and error_msg is not None: - raise AssertionError(error_msg) - if is_testing_passed and error_msg is not None: - logger.warning(error_msg) - - if not is_finished_experiment: - _copy_logs_to_gcp(job_dir) - raise Exception("Megatron-Bridge CI test job failed") 
- elif is_finished_experiment and not detach: - logger.info("Megatron-Bridge CI test job completed successfully!") + run.run( + nemorun_script, + executor=executor, + plugins=plugins, + dryrun=dryrun, + detach=detach, + name=exp_name, + ) if __name__ == "__main__": diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/launcher.sh b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/launcher.sh similarity index 68% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/launcher.sh rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/launcher.sh index 3cb08b61..dd15b2d0 100644 --- a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/launcher.sh +++ b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/launcher.sh @@ -7,7 +7,7 @@ EOF } parse_args() { - while [ "$1" != "" ]; do + while [[ "$1" != "" ]]; do case $(grep -o "=" <<< "$1" | wc -l) in 1 ) config_overrides+=("$1") @@ -25,15 +25,15 @@ parse_args() { config_overrides=() parse_args "$@" -if [ -z "${config_overrides}" ]; then +if [[ -z "${config_overrides[*]}" ]]; then echo "No NeMo config overrides specified" else echo "NeMo config overrides:" echo " ${config_overrides}" fi -export LD_LIBRARY_PATH="$NCCL_PLUGIN_PATH:/usr/local/nvidia/lib64" -ldconfig $LD_LIBRARY_PATH +export LD_LIBRARY_PATH="/usr/local/cuda/compat/lib:$NCCL_PLUGIN_PATH:$LD_LIBRARY_PATH" +ldconfig "$LD_LIBRARY_PATH" echo "Added $LD_LIBRARY_PATH to ldconfig:" ldconfig -p | grep libcuda | sed 's/^/ /' echo "" @@ -47,7 +47,7 @@ echo "Logging to ${explicit_log_dir}" if [[ -n "${TOKENIZER_PATH}" ]]; then echo "Getting tokenizer files" - cp ${TOKENIZER_PATH}/* . + cp "${TOKENIZER_PATH}"/* . 
    echo ""
 fi
 
@@ -56,14 +56,22 @@ echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of
 pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger
 
 # Create the nsys directory.
-mkdir -p ${explicit_log_dir}/nsys
-
-if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
-    echo "--- DEBUG libnccl-env.so ---"
-    ls -la /usr/local/gib/lib/libnccl-env.so || echo "libnccl-env.so not found"
-    ls -lh /usr/local/gib/lib
-    echo "----------------------------"
+mkdir -p "${explicit_log_dir}/nsys"
+
+# Collect diagnostics to a single line
+kv="\"kernel_version\": \"$(uname --kernel-release)\""
+if command -v nvidia-smi &> /dev/null; then
+    cuda_v=$(nvidia-smi -q -x | grep -Po '(?<=<cuda_version>).*(?=</cuda_version>)' || true)
+    driver_v=$(nvidia-smi -q -x | grep -Po '(?<=<driver_version>).*(?=</driver_version>)' || true)
+    vbios_v=$(nvidia-smi -q -x | grep -Po '(?<=<vbios_version>).*(?=</vbios_version>)' | head -n1 || true)
+    kv="${kv}, \"cuda_version\": \"${cuda_v}\""
+    kv="${kv}, \"driver_version\": \"${driver_v}\""
+    kv="${kv}, \"vbios_version\": \"${vbios_v}\""
 fi
+echo "VERSION_DIAGNOSTICS: {${kv}}"
+
+
+export HF_TOKEN=YOUR_HF_TOKEN
 
 cd /opt
 rm -rf Megatron-Bridge
@@ -71,17 +79,13 @@ git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
 cd Megatron-Bridge
 git checkout f7a9428f301fa17ac374d5e7166a63b0aa4771af
 git submodule update --init --recursive
-sed -i -e '/return config/i \    config.dist.distributed_timeout_minutes = 30' scripts/performance/run_recipe.py
+sed -i -e '/pretrain(config=recipe/i \    recipe.dist.distributed_timeout_minutes = 60' scripts/performance/run_script.py
 ls
 
 cp $CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH scripts/performance/
 
 worker_command=$(cat <<- EOM
 	if [ "\$RANK" -eq "0" ]; then
-		echo "--- LOCATING MEGATRON LIBRARIES ---" ;
-		python -c "import megatron.core; print('megatron.core:', megatron.core.__file__)" || echo "megatron.core not found" ;
-		python -c "import megatron.bridge; print('megatron.bridge:', megatron.bridge.__file__)" || echo "megatron.bridge not found" ;
-		echo "-----------------------------------" ;
 		echo 
"Worker 0 is stalling for a few seconds.." ; sleep 3 ; echo "The detected environment within worker rank 0 is:" ; @@ -89,9 +93,8 @@ worker_command=$(cat <<- EOM fi ; cd /opt/Megatron-Bridge ; - export PYTHONPATH="/opt/Megatron-Bridge:/opt/Megatron-Bridge/3rdparty/Megatron-LM:\$PYTHONPATH" ; - exec numactl \ + numactl \ --cpunodebind=\$((LOCAL_RANK/4)) \ --membind=\$((LOCAL_RANK/4)) nsys profile \ -t nvtx,cuda \ @@ -100,7 +103,7 @@ worker_command=$(cat <<- EOM --capture-range=cudaProfilerApi \ --capture-range-end=stop \ --kill none \ - -o /${explicit_log_dir}/$JOB_IDENTIFIER/rank-\$RANK \ + -o "/${explicit_log_dir}/$JOB_IDENTIFIER/rank-\$RANK" \ --force-overwrite true \ --session-new "nsys-\$RANDOM-\$RANK" \ nice -10 \ @@ -110,16 +113,19 @@ worker_command=$(cat <<- EOM --model_recipe_name deepseek_v3 \ --gpus_per_node 8 \ --num_gpus 256 \ + --compute_dtype bf16 \ + --seq_length 4096 \ --global_batch_size 2048 \ --micro_batch_size 1 \ - --seq_length 4096 \ --tensor_model_parallel_size 1 \ --pipeline_model_parallel_size 16 \ + --expert_model_parallel_size 8 \ + --expert_tensor_parallel_size 1 \ --context_parallel_size 1 \ --virtual_pipeline_model_parallel_size None \ - --expert_model_parallel_size 8 \ - --compute_dtype bf16 \ - --max_steps 30 dist.distributed_timeout_minutes=30 + --recompute_modules mla_up_proj \ + --moe_a2a_overlap False \ + --max_steps 30 EOM ) @@ -138,10 +144,10 @@ torchrun \ if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then - mkdir -p ${ARTIFACT_DIR} - cp -r ${explicit_log_dir}/* ${ARTIFACT_DIR}/ - env > ${ARTIFACT_DIR}/environ.txt - ls ${ARTIFACT_DIR} + mkdir -p "${ARTIFACT_DIR}" + cp -r "${explicit_log_dir}"/* "${ARTIFACT_DIR}/" + env > "${ARTIFACT_DIR}/environ.txt" + ls "${ARTIFACT_DIR}" fi echo "Training completed" echo "Pod on $(hostname --fqdn) is exiting" \ No newline at end of file diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/templates/workload-config-configmap.yaml 
b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/templates/workload-config-configmap.yaml similarity index 100% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/templates/workload-config-configmap.yaml rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/templates/workload-config-configmap.yaml diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/templates/workload-job.yaml b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/templates/workload-job.yaml similarity index 98% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/templates/workload-job.yaml rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/templates/workload-job.yaml index 54efbb6b..b4ffa210 100644 --- a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/templates/workload-job.yaml +++ b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/templates/workload-job.yaml @@ -62,7 +62,7 @@ spec: gke-parallelstore/memory-limit: "0" {{- end }} {{- if and $root.Values.queue $root.Values.tasSettings.topologyRequest }} - kueue.x-k8s.io/podset-preferred-topology: {{ .Values.tasSettings.topologyRequest | default "kubernetes.io/hostname" }} + {{- toYaml .Values.tasSettings.topologyRequest | nindent 14 }} {{- end }} {{- if and $root.Values.queue $root.Values.dwsSettings.maxRunDurationSeconds }} provreq.kueue.x-k8s.io/maxRunDurationSeconds: "{{ $root.Values.dwsSettings.maxRunDurationSeconds }}" @@ -245,7 +245,7 @@ spec: value: "{{ $gpusPerNode }}" - name: NCCL_PLUGIN_PATH - value: /usr/local/gib/lib64 + value: 
/usr/local/gib/lib64:/usr/local/nvidia/lib64 {{ if $root.Values.network.gibVersion }} - name: NCCL_INIT_SCRIPT diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/templates/workload-launcher-configmap.yaml b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/templates/workload-launcher-configmap.yaml similarity index 100% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/templates/workload-launcher-configmap.yaml rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/templates/workload-launcher-configmap.yaml diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/templates/workload-svc.yaml b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/templates/workload-svc.yaml similarity index 100% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/templates/workload-svc.yaml rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/templates/workload-svc.yaml diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/values.yaml b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/values.yaml similarity index 64% rename from training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/values.yaml rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/values.yaml index 05e98e12..cb73da9b 100644 --- a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO26.02/recipe/values.yaml +++ 
b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS2048/recipe/values.yaml @@ -1,35 +1,33 @@ +queue: null dwsSettings: maxRunDurationSeconds: null -network: - gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1 - hostNetwork: true - ncclSettings: - - name: NCCL_DEBUG - value: INFO - - name: NCCL_TIMEOUT - value: '7200000' - subnetworks[]: null -queue: null tasSettings: topologyRequest: kueue.x-k8s.io/podset-preferred-topology: kubernetes.io/hostname volumes: - gcsMounts: - - bucketName: null - mountPath: null gcsVolumes: true psVolumes: false + gcsMounts: + - bucketName: null + mountPath: null workload: + gpus: 256 + image: nvcr.io/nvidia/nemo:26.02 + defaultArguments[]: null arguments[]: null configFile: custom_setup_experiment.py configPath: /workload/configs/ - defaultArguments[]: null envs: - - name: ARTIFACT_DIR - value: null - - name: GLOO_SOCKET_IFNAME - value: eth0 - - name: CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH - value: /workload/configs/custom_setup_experiment.py - gpus: 256 - image: nvcr.io/nvidia/nemo:26.02 + - name: ARTIFACT_DIR + value: null + - name: GLOO_SOCKET_IFNAME + value: eth0 + - name: CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH + value: /workload/configs/custom_setup_experiment.py +network: + hostNetwork: true + subnetworks[]: null + gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1 + ncclSettings: + - name: NCCL_DEBUG + value: WARN diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/Chart.yaml b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/Chart.yaml similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/Chart.yaml rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/Chart.yaml diff --git 
a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/README.md b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/README.md new file mode 100644 index 00000000..c6a46ae8 --- /dev/null +++ b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/README.md @@ -0,0 +1,151 @@ + +# Pretrain deepseek_v3-fp8mx-gbs4096-gpus256 workloads on a4 GKE Node pools with Megatron-Bridge + +This recipe outlines the steps for running a deepseek_v3 pretraining +workload on [a4 GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the +[NVIDIA Megatron-Bridge framework](https://github.com/NVIDIA-NeMo/Megatron-Bridge). + +## Orchestration and deployment tools + +For this recipe, the following setup is used: + +- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) +- Pretraining job configuration and deployment - A Helm chart is used to configure and deploy the Kubernetes Jobset resource which manages the execution of the [Megatron-Bridge pretraining workload](https://github.com/NVIDIA-NeMo/Megatron-Bridge). + +## Test environment + +This recipe has been optimized for and tested with the following configuration: + +- GKE cluster: Please follow Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4) to create your a4 GKE cluster. +- Node Configuration: 32 nodes (8 GPUs per node, 256 GPUs total). +- GPU Architecture: NVIDIA Blackwell (B200). 
+
+## Training dataset
+
+This recipe uses a mock pretraining dataset provided by [Megatron Bridge Framework Datasets utils](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/scripts/performance/utils/datasets.py).
+
+## Docker container image
+
+This recipe uses the following docker images:
+
+- `nvcr.io/nvidia/nemo:26.02`
+- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1`
+
+## Run the recipe
+
+From your client workstation, complete the following steps:
+
+### Configure environment settings
+
+Set the environment variables to match your environment:
+
+```bash
+export PROJECT_ID=<PROJECT_ID>
+export CLUSTER_REGION=<CLUSTER_REGION>
+export CLUSTER_NAME=<CLUSTER_NAME>
+export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs://
+export KUEUE_NAME=<KUEUE_NAME>
+```
+
+Replace the following values:
+
+- `<PROJECT_ID>`: your Google Cloud project ID.
+- `<CLUSTER_REGION>`: the region where your cluster is located.
+- `<CLUSTER_NAME>`: the name of your GKE cluster.
+- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the gs:// prefix.
+- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the cluster toolkit is a4.
+
+Set the default project:
+
+```bash
+gcloud config set project $PROJECT_ID
+```
+
+### Get cluster credentials
+
+```bash
+gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
+```
+
+### Get the recipe
+
+Clone the `gpu-recipes` repository and set a reference to the recipe folder. 
+
+```bash
+git clone https://github.com/ai-hypercomputer/gpu-recipes.git
+cd gpu-recipes
+export REPO_ROOT=`git rev-parse --show-toplevel`
+export RECIPE_ROOT=$REPO_ROOT/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe
+cd $RECIPE_ROOT
+```
+
+### Configure and submit a pretraining job
+
+#### Using 32 nodes (256 GPUs) fp8_mx precision
+
+To execute the job with the default settings, run the following command from your client:
+
+```bash
+cd $RECIPE_ROOT
+export WORKLOAD_NAME=$USER-deepseek-v3-32node-fp8mx-seq4096-gbs4096
+helm install $WORKLOAD_NAME . -f values.yaml \
+--set-file workload_launcher=launcher.sh \
+--set-file workload_config=custom_setup_experiment.py \
+--set workload.image=nvcr.io/nvidia/nemo:26.02 \
+--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+--set volumes.gcsMounts[0].mountPath=/job-logs \
+--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+--set queue=${KUEUE_NAME}
+```
+
+**Examples**
+
+- To set the number of training steps to 100, run the following command from
+  your client:
+
+  ```bash
+  cd $RECIPE_ROOT
+  export WORKLOAD_NAME=$USER-deepseek-v3-32node-fp8mx-seq4096-gbs4096
+  helm install $WORKLOAD_NAME . -f values.yaml \
+  --set-file workload_launcher=launcher.sh \
+  --set-file workload_config=custom_setup_experiment.py \
+  --set workload.image=nvcr.io/nvidia/nemo:26.02 \
+  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+  --set volumes.gcsMounts[0].mountPath=/job-logs \
+  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+  --set queue=${KUEUE_NAME} \
+  --set workload.arguments[0]="trainer.max_steps=100"
+  ```
+
+### Monitor the job
+
+To check the status of pods in your job, run the following command:
+
+```bash
+kubectl get pods | grep JOB_NAME_PREFIX
+```
+
+Replace the following:
+
+- JOB_NAME_PREFIX - your job name prefix. For example $USER-deepseek-v3-32node-fp8mx-seq4096-gbs4096. 
+ +To get the logs for one of the pods, run the following command: + +``` +kubectl logs POD_NAME +``` + +Information about the training job's progress, including crucial details such as +loss, step count, and step time, is generated by the rank 0 process. +This process runs on the pod whose name begins with +`JOB_NAME_PREFIX-workload-0-0`. +For example: `$USER-deepseek-v3-32node-fp8mx-seq4096-gbs4096-workload-0-0-s9zrv`. + +### Uninstall the Helm release + +You can delete the job and other resources created by the Helm chart. To +uninstall Helm, run the following command from your client: + +```bash +helm uninstall $USER-deepseek-v3-32node-fp8mx-seq4096-gbs4096 +``` diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/custom_setup_experiment.py b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/custom_setup_experiment.py new file mode 100644 index 00000000..2337fdec --- /dev/null +++ b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/custom_setup_experiment.py @@ -0,0 +1,327 @@ +import glob +import logging +import os +import sys +import time +from pathlib import Path +from typing import Any, Dict, List, Optional + +import nemo_run as run +from nemo_run.config import get_nemorun_home + + +try: + from argument_parser import parse_cli_args + from utils.evaluate import calc_convergence_and_performance + from utils.executors import dgxc_executor, slurm_executor + from utils.utils import get_exp_name_config, select_config_variant_interactive +except (ImportError, ModuleNotFoundError): + from .argument_parser import parse_cli_args + from .utils.evaluate import calc_convergence_and_performance + from .utils.executors import dgxc_executor, slurm_executor + from .utils.utils import get_exp_name_config, select_config_variant_interactive + +try: + import wandb + + HAVE_WANDB = True +except (ImportError, 
ModuleNotFoundError): + HAVE_WANDB = False + +try: + from perf_plugins import NsysPlugin, PerfEnvPlugin, PyTorchProfilerPlugin + from resiliency_plugins import FaultTolerancePlugin +except (ImportError, ModuleNotFoundError): + from .perf_plugins import NsysPlugin, PerfEnvPlugin, PyTorchProfilerPlugin + from .resiliency_plugins import FaultTolerancePlugin + + +SCRIPT_DIR = Path(__file__).parent.resolve() +ENTRYPOINT_PEFORMANCE = "run_script.py" +ENTRYPOINT_RECIPE = "run_recipe.py" + +logging.basicConfig(level=logging.DEBUG) +logger = logging.getLogger(__name__) + + + + +def main( + use_recipes: bool, + model_family_name: str, + model_recipe_name: str, + task: str, + compute_dtype: str, + gpu: str, + hf_token: str, + detach: bool, + dryrun: bool, + enable_vboost: bool, + enable_nsys: bool, + pytorch_profiler: bool, + moe_a2a_overlap: bool, + tp_size: Optional[int], + pp_size: Optional[int], + cp_size: Optional[int], + ep_size: Optional[int], + wandb_key: str, + wandb_project_name: str, + wandb_experiment_name: str, + wandb_entity_name: str, + profiling_start_step: int, + profiling_stop_step: int, + record_memory_history: bool, + profiling_gpu_metrics: bool, + profiling_ranks: Optional[List[int]], + nsys_trace: Optional[List[str]], + nsys_extra_args: Optional[List[str]], + nemo_home: str, + account: str, + partition: str, + log_dir: str, + gpus_per_node: int, + time_limit: str, + container_image: str, + custom_mounts: List[str], + custom_env_vars: Dict[str, str], + custom_srun_args: List[str], + custom_bash_cmds: List[List[str]], + nccl_ub: bool, + pretrained_checkpoint: Optional[str], + num_gpus: int, + is_long_convergence_run: bool, + additional_slurm_params: Optional[Dict[str, Any]], + golden_values_path: str, + convergence_params: Dict[str, Any], + performance_params: Dict[str, Any], + memory_params: Dict[str, Any], + max_retries: int, + dgxc_base_url: str, + dgxc_cluster: str, + dgxc_kube_apiserver_url: str, + dgxc_app_id: str, + dgxc_app_secret: str, + 
dgxc_project_name: str, + dgxc_pvc_claim_name: str, + dgxc_pvc_mount_path: str, + config_variant: str = "v1", +): + """Sets up the experiment and runs it.""" + if ( + model_family_name in ["qwen3"] + and model_recipe_name + in [ + "qwen3_30b_a3b", + "qwen3_235b_a22b", + ] + and task == "pretrain" + ): + assert hf_token is not None, "HF token is required for Qwen3 tokenizer. NullTokenizer to be used soon." + + if wandb_key is not None: + assert wandb_project_name is not None and wandb_experiment_name is not None, ( + "both wandb_project_name and wandb_experiment_name are required for logging with WandB" + ) + + if use_recipes: + script_name = ENTRYPOINT_RECIPE + exp_name = ( + wandb_experiment_name + if wandb_experiment_name is not None + else f"{model_recipe_name}_{task}_{num_gpus}gpu_{gpu}" + ) + + else: + script_name = ENTRYPOINT_PEFORMANCE + exp_config = get_exp_name_config( + args, model_family_name, model_recipe_name, gpu, compute_dtype, task, config_variant + ) + exp_name = ( + wandb_experiment_name + if wandb_experiment_name is not None + else f"{task}_{model_recipe_name}_{compute_dtype}_{exp_config}" + ) + + if pretrained_checkpoint is not None: + custom_mounts.append(f"{pretrained_checkpoint}:{pretrained_checkpoint}") + + import os + rank = os.environ.get('RANK', '0') + exp_name += f'_worker{rank}' + + run_script_path = SCRIPT_DIR / script_name + logger.info(f"Run script path: {run_script_path}") + if not run_script_path.is_file(): + logger.error(f"Specified run script not found: {run_script_path}") + sys.exit(1) + + custom_mounts.extend( + [ + f"{run_script_path}:{run_script_path}", + f"{SCRIPT_DIR}:{SCRIPT_DIR}", + ] + ) + + if nccl_ub: + custom_env_vars.update({"NCCL_NVLS_ENABLE": "1", "NCCL_CTA_POLICY": "1"}) + + executor = run.LocalExecutor() + + plugins = [] + + if not use_recipes: + plugins.append( + PerfEnvPlugin( + enable_vboost=enable_vboost, + moe_a2a_overlap=moe_a2a_overlap, + tp_size=tp_size, + pp_size=pp_size, + cp_size=cp_size, + 
ep_size=ep_size, + model_family_name=model_family_name, + model_recipe_name=model_recipe_name, + gpu=gpu, + compute_dtype=compute_dtype, + train_task=task, + config_variant=config_variant, + ) + ) + + if enable_nsys: + plugins.append( + NsysPlugin( + profile_step_start=profiling_start_step, + profile_step_end=profiling_stop_step, + nsys_gpu_metrics=profiling_gpu_metrics, + profile_ranks=profiling_ranks, + nsys_trace=args.nsys_trace, + nsys_extra_args=args.nsys_extra_args, + ) + ) + if pytorch_profiler: + plugins.append( + PyTorchProfilerPlugin( + profile_step_start=profiling_start_step, + profile_step_end=profiling_stop_step, + profile_ranks=profiling_ranks, + record_memory_history=record_memory_history, + ) + ) + + nemorun_script = run.Script( + path=str(run_script_path), + entrypoint="python", + env={"PYTHONPATH": f"{SCRIPT_DIR}:$PYTHONPATH"}, + args=list(sys.argv[1:]), + ) + + logger.info("Will launch the following command with Nemo-Run: %s", " ".join(nemorun_script.to_command())) + + run.run( + nemorun_script, + executor=executor, + plugins=plugins, + dryrun=dryrun, + detach=detach, + name=exp_name, + ) + + +if __name__ == "__main__": + parser = parse_cli_args() + args, unknown_args = parser.parse_known_args() + + assert not (args.enable_nsys and args.pytorch_profiler), ( + "Both NSys and PyTorch profiler cannot be enabled at the same time" + ) + + # probably better to use parser.parse_args() and make unknowns an error, + # but for now we'll just issue a warning. 
+ if unknown_args: + logger.warning(f"Ignoring unrecognized arguments: {' '.join(unknown_args)}") + + # Handle --list_config_variants: show available variants and interactively select + config_variant = args.config_variant + if args.list_config_variants: + config_variant = select_config_variant_interactive( + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + gpu=args.gpu, + compute_dtype=args.compute_dtype, + task=args.task, + ) + + main( + use_recipes=args.use_recipes, + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + task=args.task, + compute_dtype=args.compute_dtype, + gpu=args.gpu, + hf_token=args.hf_token, + detach=args.detach, + dryrun=args.dryrun, + enable_vboost=args.enable_vboost, + enable_nsys=args.enable_nsys, + pytorch_profiler=args.pytorch_profiler, + moe_a2a_overlap=args.moe_a2a_overlap, + tp_size=args.tensor_model_parallel_size, + pp_size=args.pipeline_model_parallel_size, + cp_size=args.context_parallel_size, + ep_size=args.expert_model_parallel_size, + wandb_key=args.wandb_key, + wandb_project_name=args.wandb_project_name, + wandb_experiment_name=args.wandb_experiment_name, + wandb_entity_name=args.wandb_entity_name, + profiling_start_step=args.profiling_start_step, + profiling_stop_step=args.profiling_stop_step, + record_memory_history=args.record_memory_history, + profiling_gpu_metrics=args.profiling_gpu_metrics, + profiling_ranks=args.profiling_ranks, + nsys_trace=args.nsys_trace, + nsys_extra_args=args.nsys_extra_args, + nemo_home=args.nemo_home, + account=args.account, + partition=args.partition, + log_dir=args.log_dir, + gpus_per_node=args.gpus_per_node, + time_limit=args.time_limit, + container_image=args.container_image, + custom_mounts=args.custom_mounts, + custom_env_vars=args.custom_env_vars, + custom_srun_args=args.custom_srun_args, + custom_bash_cmds=args.custom_bash_cmds, + nccl_ub=args.nccl_ub, + pretrained_checkpoint=args.pretrained_checkpoint, + 
num_gpus=args.num_gpus, + is_long_convergence_run=args.is_long_convergence_run, + additional_slurm_params=args.additional_slurm_params, + golden_values_path=args.golden_values_path, + convergence_params={ + "correlation_threshold": args.correlation_threshold, + "high_loss_tolerance": args.high_loss_tolerance, + "medium_loss_tolerance": args.medium_loss_tolerance, + "low_loss_tolerance": args.low_loss_tolerance, + "final_loss_tolerance": args.final_loss_tolerance, + "max_outlier_ratio": args.max_outlier_ratio, + "outlier_threshold": args.outlier_threshold, + "skip_first_percent_loss": args.skip_first_percent_loss, + }, + performance_params={ + "timing_threshold": args.timing_threshold, + "skip_first_percent_time": args.skip_first_percent_time, + }, + memory_params={ + "memory_threshold": args.memory_threshold, + }, + max_retries=args.max_retries, + dgxc_base_url=args.dgxc_base_url, + dgxc_cluster=args.dgxc_cluster, + dgxc_kube_apiserver_url=args.dgxc_kube_apiserver_url, + dgxc_app_id=args.dgxc_app_id, + dgxc_app_secret=args.dgxc_app_secret, + dgxc_project_name=args.dgxc_project_name, + dgxc_pvc_claim_name=args.dgxc_pvc_claim_name, + dgxc_pvc_mount_path=args.dgxc_pvc_mount_path, + config_variant=config_variant, + ) \ No newline at end of file diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/launcher.sh b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/launcher.sh new file mode 100644 index 00000000..78f9542a --- /dev/null +++ b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/launcher.sh @@ -0,0 +1,151 @@ +usage() +{ +cat << EOF +usage: bash ./launcher.sh [config-override [config-override ...]] +config-override (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000. 
+EOF +} + +parse_args() { + while [[ "$1" != "" ]]; do + case $(grep -o "=" <<< "$1" | wc -l) in + 1 ) + config_overrides+=("$1") + ;; + * ) + echo "Invalid config override: $1" + usage + exit 1 + esac + shift + done + config_overrides="${config_overrides[*]}" +} + +config_overrides=() +parse_args "$@" + +if [[ -z "${config_overrides[*]}" ]]; then + echo "No NeMo config overrides specified" +else + echo "NeMo config overrides:" + echo " ${config_overrides}" +fi + +export LD_LIBRARY_PATH="/usr/local/cuda/compat/lib:$NCCL_PLUGIN_PATH:$LD_LIBRARY_PATH" +ldconfig "$LD_LIBRARY_PATH" +echo "Added $LD_LIBRARY_PATH to ldconfig:" +ldconfig -p | grep libcuda | sed 's/^/ /' +echo "" + +if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then + explicit_log_dir=${EXPLICIT_LOG_DIR} +else + explicit_log_dir=workload_logs +fi +echo "Logging to ${explicit_log_dir}" + +if [[ -n "${TOKENIZER_PATH}" ]]; then + echo "Getting tokenizer files" + cp "${TOKENIZER_PATH}"/* . + echo "" +fi + +echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes" + +pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger + +# Create the nsys directory. 
+mkdir -p "${explicit_log_dir}/nsys" + +# Collect diagnostics to a single line +kv="\"kernel_version\": \"$(uname --kernel-release)\"" +if command -v nvidia-smi &> /dev/null; then + cuda_v=$(nvidia-smi -q -x | grep -Po '(?<=).*(?=)' || true) + driver_v=$(nvidia-smi -q -x | grep -Po '(?<=).*(?=)' || true) + vbios_v=$(nvidia-smi -q -x | grep -Po '(?<=).*(?=)' | head -n1 || true) + kv="${kv}, \"cuda_version\": \"${cuda_v}\"" + kv="${kv}, \"driver_version\": \"${driver_v}\"" + kv="${kv}, \"vbios_version\": \"${vbios_v}\"" +fi +echo "VERSION_DIAGNOSTICS: {${kv}}" + + +export HF_TOKEN=YOUR_HF_TOKEN + +cd /opt +rm -rf Megatron-Bridge +git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git +cd Megatron-Bridge +git checkout f7a9428f301fa17ac374d5e7166a63b0aa4771af +git submodule update --init --recursive +sed -i -e '/pretrain(config=recipe/i \ recipe.dist.distributed_timeout_minutes = 60' scripts/performance/run_script.py +ls + +cp $CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH scripts/performance/ + +worker_command=$(cat <<- EOM + if [ "\$RANK" -eq "0" ]; then + echo "Worker 0 is stalling for a few seconds.." 
; + sleep 3 ; + echo "The detected environment within worker rank 0 is:" ; + env | sed 's/^/ /' ; + fi ; + + cd /opt/Megatron-Bridge ; + + numactl \ + --cpunodebind=\$((LOCAL_RANK/4)) \ + --membind=\$((LOCAL_RANK/4)) nsys profile \ + -t nvtx,cuda \ + --cuda-event-trace=false \ + --sample=none \ + --capture-range=cudaProfilerApi \ + --capture-range-end=stop \ + --kill none \ + -o "/${explicit_log_dir}/$JOB_IDENTIFIER/rank-\$RANK" \ + --force-overwrite true \ + --session-new "nsys-\$RANDOM-\$RANK" \ + nice -10 \ + python scripts/performance/custom_setup_experiment.py \ + --gpu b200 \ + --model_family_name deepseek \ + --model_recipe_name deepseek_v3 \ + --gpus_per_node 8 \ + --num_gpus 256 \ + --seq_length 4096 \ + --compute_dtype fp8_mx \ + --global_batch_size 4096 \ + --micro_batch_size 1 \ + --tensor_model_parallel_size 1 \ + --pipeline_model_parallel_size 16 \ + --context_parallel_size 1 \ + --expert_model_parallel_size 8 \ + --cuda_graph_impl transformer_engine \ + --cuda_graph_scope moe_router,moe_preprocess,attn \ + --max_steps 30 + +EOM +) + +echo "$worker_command" > worker_command.sh +chmod 777 worker_command.sh + +torchrun \ +--nproc-per-node="8" \ +--nnodes="32" \ +--node_rank="${JOB_COMPLETION_INDEX}" \ +--rdzv_id="${JOB_IDENTIFIER}" \ +--master_addr="${MASTER_ADDR}" \ +--master_port="${MASTER_PORT}" \ +--no-python bash worker_command.sh + + +if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then + mkdir -p "${ARTIFACT_DIR}" + cp -r "${explicit_log_dir}"/* "${ARTIFACT_DIR}/" + env > "${ARTIFACT_DIR}/environ.txt" + ls "${ARTIFACT_DIR}" +fi +echo "Training completed" +echo "Pod on $(hostname --fqdn) is exiting" \ No newline at end of file diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/templates/workload-config-configmap.yaml b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/templates/workload-config-configmap.yaml similarity index 100% rename from 
training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/templates/workload-config-configmap.yaml rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/templates/workload-config-configmap.yaml diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/templates/workload-job.yaml b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/templates/workload-job.yaml new file mode 100644 index 00000000..b4ffa210 --- /dev/null +++ b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/templates/workload-job.yaml @@ -0,0 +1,333 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +{{$timestamp := now | date "2006-01-02-15-04-05"}} +{{$jobSuffix := randAlphaNum 4 | lower}} +{{$jobuuid := uuidv4}} +{{$nodes := div .Values.workload.gpus 8 | max 1}} +{{$gpusPerNode := min .Values.workload.gpus 8}} +{{- $root := . 
-}} + +apiVersion: jobset.x-k8s.io/v1alpha2 +kind: JobSet +metadata: + name: "{{ .Release.Name }}" + namespace: default + labels: + {{- if $root.Values.queue }} + kueue.x-k8s.io/queue-name: "{{ $root.Values.queue }}" + {{- end }} +spec: + {{- if $root.Values.queue }} + suspend: true + {{- end }} + failurePolicy: + maxRestarts: {{ default 0 $root.Values.workload.max_workload_restarts }} + replicatedJobs: + - name: workload + replicas: 1 + template: + spec: + parallelism: {{ $nodes }} + completions: {{ $nodes }} + backoffLimit: 0 + completionMode: Indexed + activeDeadlineSeconds: 14400 # 4 hours (4 * 60 * 60) + ttlSecondsAfterFinished: 43200 # 12 hours (12 * 60 * 60) + template: + metadata: + annotations: + kubectl.kubernetes.io/default-container: workload + {{- if $root.Values.volumes.gcsVolumes }} + gke-gcsfuse/volumes: "true" + gke-gcsfuse/cpu-limit: "500m" + gke-gcsfuse/memory-limit: "1Ti" + gke-gcsfuse/ephemeral-storage-limit: "2Ti" + {{- end }} + {{- if $root.Values.volumes.psVolumes }} + gke-parallelstore/volumes: "true" + gke-parallelstore/cpu-limit: "0" + gke-parallelstore/memory-limit: "0" + {{- end }} + {{- if and $root.Values.queue $root.Values.tasSettings.topologyRequest }} + {{- toYaml .Values.tasSettings.topologyRequest | nindent 14 }} + {{- end }} + {{- if and $root.Values.queue $root.Values.dwsSettings.maxRunDurationSeconds }} + provreq.kueue.x-k8s.io/maxRunDurationSeconds: "{{ $root.Values.dwsSettings.maxRunDurationSeconds }}" + {{- end }} + {{- if not $root.Values.network.hostNetwork }} + networking.gke.io/default-interface: "eth0" + networking.gke.io/interfaces: | + {{- if $root.Values.network.subnetworks }} + [ + {{- range $i, $subnetwork := $root.Values.network.subnetworks }} + {"interfaceName":"eth{{ $i }}","network":"{{ $subnetwork }}"}{{ eq $i 9 | ternary "" ","}} + {{- end }} + ] + {{- else }} + [ + {"interfaceName":"eth0","network":"default"}, + {"interfaceName":"eth1","network":"gvnic-1"}, + {{- range $i := until 8 }} + 
{"interfaceName":"eth{{ add 2 $i }}","network":"rdma-{{ $i }}"}{{ eq $i 7 | ternary "" ","}} + {{- end }} + ] + {{- end }} + {{- end }} + spec: + {{- if $root.Values.network.hostNetwork }} + hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet + {{- end }} + subdomain: "{{.Release.Name}}" + restartPolicy: Never + {{- if $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "In" + values: + {{- range $hostname := $root.Values.targetNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + {{- if $root.Values.avoidNodes }} + {{- if not $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + {{- end }} + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "NotIn" + values: + {{- range $hostname := $root.Values.avoidNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + tolerations: + - operator: "Exists" + key: nvidia.com/gpu + - operator: "Exists" + key: cloud.google.com/impending-node-termination + + volumes: + {{ if $root.Values.network.gibVersion }} + - name: gib + emptyDir: {} + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + configMap: + name: "{{.Release.Name}}-config" + items: + - key: workload-configuration + path: {{ $root.Values.workload.configFile | default "workload-configuration" }} + {{- end }} + + - name: workload-launcher + configMap: + name: "{{.Release.Name}}-launcher" + + - name: shared-memory + emptyDir: + medium: "Memory" + sizeLimit: 250Gi + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + persistentVolumeClaim: + claimName: "{{ $pvc.claimName }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + csi: + driver: gcsfuse.csi.storage.gke.io + volumeAttributes: + bucketName: "{{ 
$gcs.bucketName }}" + {{- if $gcs.mountOptions }} + mountOptions: "{{ $gcs.mountOptions }}" + {{- end }} + {{- end}} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + hostPath: + path: /mnt/stateful_partition/kube-ephemeral-ssd + {{- end }} + + initContainers: + {{ if $root.Values.network.gibVersion }} + - name: nccl-plugin-installer + image: {{ $root.Values.network.gibVersion }} + imagePullPolicy: Always + args: + - | + set -ex + /scripts/container_entry.sh install --install-nccl + cp -R /var/lib/gib/lib64/. /target/usr/local/gib/lib64 + cp -R /var/lib/gib/. /target/usr/local/gib + command: + - /bin/sh + - -c + volumeMounts: + - mountPath: /target/usr/local/gib + name: gib + {{ end}} + + containers: + {{- if $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-sidecar + image: {{ $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-metadata-prefetch + image: {{ $root.Values.workload.gcsSidecarImage }} + {{- end }} + {{- if $root.Values.workload.psSidecarImage }} + - name: gke-parallelstore-sidecar + image: {{ $root.Values.workload.psSidecarImage }} + {{- end }} + + - name: workload + image: "{{ $root.Values.workload.image }}" + imagePullPolicy: Always + {{- if $root.Values.network.hostNetwork }} + securityContext: + privileged: true + {{- end }} + env: + - name: JOB_IDENTIFIER + value: "{{ .Release.Name }}-{{ $timestamp }}" + - name: JOB_TIMESTAMP + value: "{{ $timestamp }}" + - name: JOB_UUID + value: "{{ $jobuuid }}" + - name: JOB_ORCHESTRATOR + value: "gke" + # Add RANK based on the pod's index provided by the Indexed Job + # This is crucial for torch.distributed initialization. 
+ - name: JOB_COMPLETION_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index'] + - name: RANK_0_FQDN + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: HOSTNAME_PREFIX + value: "{{.Release.Name}}-workload-" + - name: DOMAIN_NAME + value: "{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_ADDR + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_PORT + value: "6002" + - name: WORLD_SIZE + value: "{{ $root.Values.workload.gpus }}" + - name: NNODES + value: "{{ $nodes }}" + - name: GPUS_PER_NODE + value: "{{ $gpusPerNode }}" + + - name: NCCL_PLUGIN_PATH + value: /usr/local/gib/lib64:/usr/local/nvidia/lib64 + + {{ if $root.Values.network.gibVersion }} + - name: NCCL_INIT_SCRIPT + value: "/usr/local/gib/scripts/set_nccl_env.sh" + {{ end }} + + {{ if $root.Values.network.ncclSettings }} + {{- toYaml .Values.network.ncclSettings | nindent 14 }} + {{ end }} + + {{ if $root.Values.workload.envs }} + {{- toYaml .Values.workload.envs | nindent 14 }} + {{ end }} + + command: + - bash + - -c + - | + echo "Pod on $(hostname --fqdn) is running" + echo "Pod is assigned job index of $JOB_COMPLETION_INDEX" + + if [[ -n "${NCCL_INIT_SCRIPT}" ]]; then + echo "Running NCCL init script: ${NCCL_INIT_SCRIPT}" + source ${NCCL_INIT_SCRIPT} + fi + + # Overriding NCCL_SOCKET_IFNAME definition + export NCCL_SOCKET_IFNAME="eth0,eth1" + export NCCL_TUNER_CONFIG_PATH=/usr/local/gib/configs/tuner_config_a4.txtpb + + echo "Launching workload with the following arguments:" + {{- range $root.Values.workload.defaultArguments }} + echo " {{ . }}" + {{- end }} + {{- range $root.Values.workload.arguments }} + echo " {{ . }}" + {{- end }} + echo "" + + sleep 10 + + bash /workload/launcher/launch-workload.sh \ + {{- range $root.Values.workload.defaultArguments }} + {{ . }} \ + {{- end }} + {{- range $root.Values.workload.arguments }} + {{ . 
}} \ + {{- end }} + + + volumeMounts: + {{ if $root.Values.network.gibVersion }} + - name: gib + mountPath: /usr/local/gib + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + mountPath: {{ $root.Values.workload.configPath | default "/workload/configs" }} + {{- end }} + + - name: workload-launcher + mountPath: /workload/launcher + + - name: shared-memory + mountPath: /dev/shm + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + mountPath: "{{ $pvc.mountPath }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + mountPath: "{{ $gcs.mountPath }}" + {{- end }} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + mountPath: "{{ $root.Values.volumes.ssdMountPath }}" + {{- end }} + + resources: + limits: + nvidia.com/gpu: {{ $gpusPerNode }} diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/templates/workload-launcher-configmap.yaml b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/templates/workload-launcher-configmap.yaml similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/templates/workload-launcher-configmap.yaml rename to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/templates/workload-launcher-configmap.yaml diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/templates/workload-svc.yaml b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/templates/workload-svc.yaml similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/templates/workload-svc.yaml rename to 
training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/templates/workload-svc.yaml diff --git a/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/values.yaml b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/values.yaml new file mode 100644 index 00000000..cb73da9b --- /dev/null +++ b/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-SEQ4096-GBS4096/recipe/values.yaml @@ -0,0 +1,33 @@ +queue: null +dwsSettings: + maxRunDurationSeconds: null +tasSettings: + topologyRequest: + kueue.x-k8s.io/podset-preferred-topology: kubernetes.io/hostname +volumes: + gcsVolumes: true + psVolumes: false + gcsMounts: + - bucketName: null + mountPath: null +workload: + gpus: 256 + image: nvcr.io/nvidia/nemo:26.02 + defaultArguments[]: null + arguments[]: null + configFile: custom_setup_experiment.py + configPath: /workload/configs/ + envs: + - name: ARTIFACT_DIR + value: null + - name: GLOO_SOCKET_IFNAME + value: eth0 + - name: CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH + value: /workload/configs/custom_setup_experiment.py +network: + hostNetwork: true + subnetworks[]: null + gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1 + ncclSettings: + - name: NCCL_DEBUG + value: WARN diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/Chart.yaml b/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/Chart.yaml similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/Chart.yaml rename to training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/Chart.yaml diff --git a/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/README.md 
b/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/README.md new file mode 100644 index 00000000..501f727e --- /dev/null +++ b/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/README.md @@ -0,0 +1,157 @@
+
+# Pretrain gptoss-120b-bf16-gbs1280-gpus64 workloads on A4 GKE Node pools with Megatron-Bridge
+
+This recipe outlines the steps for running a gptoss-120b pretraining
+workload on [A4 GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the
+[NVIDIA Megatron-Bridge framework](https://github.com/NVIDIA-NeMo/Megatron-Bridge).
+
+## Orchestration and deployment tools
+
+For this recipe, the following setup is used:
+
+- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
+- Pretraining job configuration and deployment - A Helm chart is used to configure and deploy the Kubernetes JobSet resource, which manages the execution of the [Megatron-Bridge pretraining workload](https://github.com/NVIDIA-NeMo/Megatron-Bridge).
+
+## Test environment
+
+This recipe has been optimized for and tested with the following configuration:
+
+- GKE cluster: Please follow the Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4) to create your A4 GKE cluster.
+- Node Configuration: 8 nodes (8 GPUs per node, 64 GPUs total).
+- GPU Architecture: NVIDIA Blackwell.
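For reference, the recipe's Helm chart derives the node count from the requested GPU total with `div .Values.workload.gpus 8 | max 1` in the JobSet template. A minimal Python sketch of that arithmetic for the test environment above (`nodes_for` is an illustrative helper, not part of the recipe):

```python
def nodes_for(total_gpus: int, gpus_per_node: int = 8) -> int:
    """Mirror the chart's `div .Values.workload.gpus 8 | max 1` node math."""
    return max(total_gpus // gpus_per_node, 1)

# The test environment above: 64 GPUs on A4 nodes with 8 GPUs each.
print(nodes_for(64))  # 8 nodes
```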
+
+## Training dataset
+
+This recipe uses a mock pretraining dataset provided by [Megatron Bridge Framework Datasets utils](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/scripts/performance/utils/datasets.py).
+
+## Docker container image
+
+This recipe uses the following docker images:
+
+- `nvcr.io/nvidia/nemo:26.02`
+- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-arm64:v1.1.1`
+
+## Run the recipe
+
+From your client workstation, complete the following steps:
+
+### Configure environment settings
+
+Set the environment variables to match your environment:
+
+```bash
+export PROJECT_ID=<PROJECT_ID>
+export CLUSTER_REGION=<CLUSTER_REGION>
+export CLUSTER_NAME=<CLUSTER_NAME>
+export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs://
+export KUEUE_NAME=<KUEUE_NAME>
+export HF_TOKEN=<HF_TOKEN>
+```
+
+Replace the following values:
+
+- `<PROJECT_ID>`: your Google Cloud project ID.
+- `<CLUSTER_REGION>`: the region where your cluster is located.
+- `<CLUSTER_NAME>`: the name of your GKE cluster.
+- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the gs:// prefix.
+- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the cluster toolkit is a4x.
+- `<HF_TOKEN>`: your Hugging Face access token.
+
+Set the default project:
+
+```bash
+gcloud config set project $PROJECT_ID
+```
+
+### Get cluster credentials
+
+```bash
+gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
+```
+
+### Get the recipe
+
+Clone the `gpu-recipes` repository and set a reference to the recipe folder.
+
+```
+git clone https://github.com/ai-hypercomputer/gpu-recipes.git
+cd gpu-recipes
+export REPO_ROOT=`git rev-parse --show-toplevel`
+export RECIPE_ROOT=$REPO_ROOT/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe
+cd $RECIPE_ROOT
+```
+
+### Configure and submit a pretraining job
+
+#### Using 8 nodes (64 GPUs), BF16 precision
+
+To execute the job with the default settings, run the following command from your client:
+
+```bash
+cd $RECIPE_ROOT
+export WORKLOAD_NAME=$USER-gptoss-120b-8node-bf16-gbs1280
+helm install $WORKLOAD_NAME . -f values.yaml \
+--set-file workload_launcher=launcher.sh \
+--set-file workload_config=custom_setup_experiment.py \
+--set workload.image=nvcr.io/nvidia/nemo:26.02 \
+--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+--set volumes.gcsMounts[0].mountPath=/job-logs \
+--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+--set workload.envs[3].name=HF_TOKEN \
+--set workload.envs[3].value=${HF_TOKEN} \
+--set queue=${KUEUE_NAME}
+```
+
+**Examples**
+
+- To set the number of training steps to 100, run the following command from
+  your client:
+
+  ```bash
+  cd $RECIPE_ROOT
+  export WORKLOAD_NAME=$USER-gptoss-120b-8node-bf16-gbs1280
+  helm install $WORKLOAD_NAME . -f values.yaml \
+  --set-file workload_launcher=launcher.sh \
+  --set-file workload_config=custom_setup_experiment.py \
+  --set workload.image=nvcr.io/nvidia/nemo:26.02 \
+  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+  --set volumes.gcsMounts[0].mountPath=/job-logs \
+  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+  --set workload.envs[3].name=HF_TOKEN \
+  --set workload.envs[3].value=${HF_TOKEN} \
+  --set queue=${KUEUE_NAME} \
+  --set workload.arguments[0]="trainer.train_iters=100"
+  ```
+
+### Monitor the job
+
+To check the status of pods in your job, run the following command:
+
+```
+kubectl get pods | grep JOB_NAME_PREFIX
+```
+
+Replace the following:
+
+- JOB_NAME_PREFIX - your job name prefix. For example $USER-gptoss-120b-8node-bf16-gbs1280.
+
+To get the logs for one of the pods, run the following command:
+
+```
+kubectl logs POD_NAME
+```
+
+Information about the training job's progress, including crucial details such as
+loss, step count, and step time, is generated by the rank 0 process.
+This process runs on the pod whose name begins with
+`JOB_NAME_PREFIX-workload-0-0`.
+For example: `$USER-gptoss-120b-8node-bf16-gbs1280-workload-0-0-s9zrv`.
+
+### Uninstall the Helm release
+
+You can delete the job and other resources created by the Helm chart.
To +uninstall Helm, run the following command from your client: + +```bash +helm uninstall $USER-gptoss-120b-8node-bf16-gbs1280 +``` \ No newline at end of file diff --git a/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/custom_setup_experiment.py b/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/custom_setup_experiment.py new file mode 100644 index 00000000..2337fdec --- /dev/null +++ b/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/custom_setup_experiment.py @@ -0,0 +1,327 @@ +import glob +import logging +import os +import sys +import time +from pathlib import Path +from typing import Any, Dict, List, Optional + +import nemo_run as run +from nemo_run.config import get_nemorun_home + + +try: + from argument_parser import parse_cli_args + from utils.evaluate import calc_convergence_and_performance + from utils.executors import dgxc_executor, slurm_executor + from utils.utils import get_exp_name_config, select_config_variant_interactive +except (ImportError, ModuleNotFoundError): + from .argument_parser import parse_cli_args + from .utils.evaluate import calc_convergence_and_performance + from .utils.executors import dgxc_executor, slurm_executor + from .utils.utils import get_exp_name_config, select_config_variant_interactive + +try: + import wandb + + HAVE_WANDB = True +except (ImportError, ModuleNotFoundError): + HAVE_WANDB = False + +try: + from perf_plugins import NsysPlugin, PerfEnvPlugin, PyTorchProfilerPlugin + from resiliency_plugins import FaultTolerancePlugin +except (ImportError, ModuleNotFoundError): + from .perf_plugins import NsysPlugin, PerfEnvPlugin, PyTorchProfilerPlugin + from .resiliency_plugins import FaultTolerancePlugin + + +SCRIPT_DIR = Path(__file__).parent.resolve() +ENTRYPOINT_PEFORMANCE = "run_script.py" +ENTRYPOINT_RECIPE = "run_recipe.py" + +logging.basicConfig(level=logging.DEBUG) +logger = logging.getLogger(__name__) + + + + +def 
main( + use_recipes: bool, + model_family_name: str, + model_recipe_name: str, + task: str, + compute_dtype: str, + gpu: str, + hf_token: str, + detach: bool, + dryrun: bool, + enable_vboost: bool, + enable_nsys: bool, + pytorch_profiler: bool, + moe_a2a_overlap: bool, + tp_size: Optional[int], + pp_size: Optional[int], + cp_size: Optional[int], + ep_size: Optional[int], + wandb_key: str, + wandb_project_name: str, + wandb_experiment_name: str, + wandb_entity_name: str, + profiling_start_step: int, + profiling_stop_step: int, + record_memory_history: bool, + profiling_gpu_metrics: bool, + profiling_ranks: Optional[List[int]], + nsys_trace: Optional[List[str]], + nsys_extra_args: Optional[List[str]], + nemo_home: str, + account: str, + partition: str, + log_dir: str, + gpus_per_node: int, + time_limit: str, + container_image: str, + custom_mounts: List[str], + custom_env_vars: Dict[str, str], + custom_srun_args: List[str], + custom_bash_cmds: List[List[str]], + nccl_ub: bool, + pretrained_checkpoint: Optional[str], + num_gpus: int, + is_long_convergence_run: bool, + additional_slurm_params: Optional[Dict[str, Any]], + golden_values_path: str, + convergence_params: Dict[str, Any], + performance_params: Dict[str, Any], + memory_params: Dict[str, Any], + max_retries: int, + dgxc_base_url: str, + dgxc_cluster: str, + dgxc_kube_apiserver_url: str, + dgxc_app_id: str, + dgxc_app_secret: str, + dgxc_project_name: str, + dgxc_pvc_claim_name: str, + dgxc_pvc_mount_path: str, + config_variant: str = "v1", +): + """Sets up the experiment and runs it.""" + if ( + model_family_name in ["qwen3"] + and model_recipe_name + in [ + "qwen3_30b_a3b", + "qwen3_235b_a22b", + ] + and task == "pretrain" + ): + assert hf_token is not None, "HF token is required for Qwen3 tokenizer. NullTokenizer to be used soon." 
+ + if wandb_key is not None: + assert wandb_project_name is not None and wandb_experiment_name is not None, ( + "both wandb_project_name and wandb_experiment_name are required for logging with WandB" + ) + + if use_recipes: + script_name = ENTRYPOINT_RECIPE + exp_name = ( + wandb_experiment_name + if wandb_experiment_name is not None + else f"{model_recipe_name}_{task}_{num_gpus}gpu_{gpu}" + ) + + else: + script_name = ENTRYPOINT_PEFORMANCE + exp_config = get_exp_name_config( + args, model_family_name, model_recipe_name, gpu, compute_dtype, task, config_variant + ) + exp_name = ( + wandb_experiment_name + if wandb_experiment_name is not None + else f"{task}_{model_recipe_name}_{compute_dtype}_{exp_config}" + ) + + if pretrained_checkpoint is not None: + custom_mounts.append(f"{pretrained_checkpoint}:{pretrained_checkpoint}") + + import os + rank = os.environ.get('RANK', '0') + exp_name += f'_worker{rank}' + + run_script_path = SCRIPT_DIR / script_name + logger.info(f"Run script path: {run_script_path}") + if not run_script_path.is_file(): + logger.error(f"Specified run script not found: {run_script_path}") + sys.exit(1) + + custom_mounts.extend( + [ + f"{run_script_path}:{run_script_path}", + f"{SCRIPT_DIR}:{SCRIPT_DIR}", + ] + ) + + if nccl_ub: + custom_env_vars.update({"NCCL_NVLS_ENABLE": "1", "NCCL_CTA_POLICY": "1"}) + + executor = run.LocalExecutor() + + plugins = [] + + if not use_recipes: + plugins.append( + PerfEnvPlugin( + enable_vboost=enable_vboost, + moe_a2a_overlap=moe_a2a_overlap, + tp_size=tp_size, + pp_size=pp_size, + cp_size=cp_size, + ep_size=ep_size, + model_family_name=model_family_name, + model_recipe_name=model_recipe_name, + gpu=gpu, + compute_dtype=compute_dtype, + train_task=task, + config_variant=config_variant, + ) + ) + + if enable_nsys: + plugins.append( + NsysPlugin( + profile_step_start=profiling_start_step, + profile_step_end=profiling_stop_step, + nsys_gpu_metrics=profiling_gpu_metrics, + profile_ranks=profiling_ranks, + 
nsys_trace=nsys_trace,
+                nsys_extra_args=nsys_extra_args,
+            )
+        )
+    if pytorch_profiler:
+        plugins.append(
+            PyTorchProfilerPlugin(
+                profile_step_start=profiling_start_step,
+                profile_step_end=profiling_stop_step,
+                profile_ranks=profiling_ranks,
+                record_memory_history=record_memory_history,
+            )
+        )
+
+    nemorun_script = run.Script(
+        path=str(run_script_path),
+        entrypoint="python",
+        env={"PYTHONPATH": f"{SCRIPT_DIR}:$PYTHONPATH"},
+        args=list(sys.argv[1:]),
+    )
+
+    logger.info("Will launch the following command with Nemo-Run: %s", " ".join(nemorun_script.to_command()))
+
+    run.run(
+        nemorun_script,
+        executor=executor,
+        plugins=plugins,
+        dryrun=dryrun,
+        detach=detach,
+        name=exp_name,
+    )
+
+
+if __name__ == "__main__":
+    parser = parse_cli_args()
+    args, unknown_args = parser.parse_known_args()
+
+    assert not (args.enable_nsys and args.pytorch_profiler), (
+        "Both NSys and PyTorch profiler cannot be enabled at the same time"
+    )
+
+    # probably better to use parser.parse_args() and make unknowns an error,
+    # but for now we'll just issue a warning.
+ if unknown_args: + logger.warning(f"Ignoring unrecognized arguments: {' '.join(unknown_args)}") + + # Handle --list_config_variants: show available variants and interactively select + config_variant = args.config_variant + if args.list_config_variants: + config_variant = select_config_variant_interactive( + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + gpu=args.gpu, + compute_dtype=args.compute_dtype, + task=args.task, + ) + + main( + use_recipes=args.use_recipes, + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + task=args.task, + compute_dtype=args.compute_dtype, + gpu=args.gpu, + hf_token=args.hf_token, + detach=args.detach, + dryrun=args.dryrun, + enable_vboost=args.enable_vboost, + enable_nsys=args.enable_nsys, + pytorch_profiler=args.pytorch_profiler, + moe_a2a_overlap=args.moe_a2a_overlap, + tp_size=args.tensor_model_parallel_size, + pp_size=args.pipeline_model_parallel_size, + cp_size=args.context_parallel_size, + ep_size=args.expert_model_parallel_size, + wandb_key=args.wandb_key, + wandb_project_name=args.wandb_project_name, + wandb_experiment_name=args.wandb_experiment_name, + wandb_entity_name=args.wandb_entity_name, + profiling_start_step=args.profiling_start_step, + profiling_stop_step=args.profiling_stop_step, + record_memory_history=args.record_memory_history, + profiling_gpu_metrics=args.profiling_gpu_metrics, + profiling_ranks=args.profiling_ranks, + nsys_trace=args.nsys_trace, + nsys_extra_args=args.nsys_extra_args, + nemo_home=args.nemo_home, + account=args.account, + partition=args.partition, + log_dir=args.log_dir, + gpus_per_node=args.gpus_per_node, + time_limit=args.time_limit, + container_image=args.container_image, + custom_mounts=args.custom_mounts, + custom_env_vars=args.custom_env_vars, + custom_srun_args=args.custom_srun_args, + custom_bash_cmds=args.custom_bash_cmds, + nccl_ub=args.nccl_ub, + pretrained_checkpoint=args.pretrained_checkpoint, + 
num_gpus=args.num_gpus, + is_long_convergence_run=args.is_long_convergence_run, + additional_slurm_params=args.additional_slurm_params, + golden_values_path=args.golden_values_path, + convergence_params={ + "correlation_threshold": args.correlation_threshold, + "high_loss_tolerance": args.high_loss_tolerance, + "medium_loss_tolerance": args.medium_loss_tolerance, + "low_loss_tolerance": args.low_loss_tolerance, + "final_loss_tolerance": args.final_loss_tolerance, + "max_outlier_ratio": args.max_outlier_ratio, + "outlier_threshold": args.outlier_threshold, + "skip_first_percent_loss": args.skip_first_percent_loss, + }, + performance_params={ + "timing_threshold": args.timing_threshold, + "skip_first_percent_time": args.skip_first_percent_time, + }, + memory_params={ + "memory_threshold": args.memory_threshold, + }, + max_retries=args.max_retries, + dgxc_base_url=args.dgxc_base_url, + dgxc_cluster=args.dgxc_cluster, + dgxc_kube_apiserver_url=args.dgxc_kube_apiserver_url, + dgxc_app_id=args.dgxc_app_id, + dgxc_app_secret=args.dgxc_app_secret, + dgxc_project_name=args.dgxc_project_name, + dgxc_pvc_claim_name=args.dgxc_pvc_claim_name, + dgxc_pvc_mount_path=args.dgxc_pvc_mount_path, + config_variant=config_variant, + ) \ No newline at end of file diff --git a/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/launcher.sh b/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/launcher.sh new file mode 100644 index 00000000..8e0b24da --- /dev/null +++ b/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/launcher.sh @@ -0,0 +1,151 @@ +usage() +{ +cat << EOF +usage: bash ./launcher.sh [config-override [config-override ...]] +config-override (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000. 
+EOF +} + +parse_args() { + while [[ "$1" != "" ]]; do + case $(grep -o "=" <<< "$1" | wc -l) in + 1 ) + config_overrides+=("$1") + ;; + * ) + echo "Invalid config override: $1" + usage + exit 1 + esac + shift + done + config_overrides="${config_overrides[*]}" +} + +config_overrides=() +parse_args "$@" + +if [[ -z "${config_overrides[*]}" ]]; then + echo "No NeMo config overrides specified" +else + echo "NeMo config overrides:" + echo " ${config_overrides}" +fi + +export LD_LIBRARY_PATH="/usr/local/cuda/compat/lib:$NCCL_PLUGIN_PATH:$LD_LIBRARY_PATH" +ldconfig "$LD_LIBRARY_PATH" +echo "Added $LD_LIBRARY_PATH to ldconfig:" +ldconfig -p | grep libcuda | sed 's/^/ /' +echo "" + +if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then + explicit_log_dir=${EXPLICIT_LOG_DIR} +else + explicit_log_dir=workload_logs +fi +echo "Logging to ${explicit_log_dir}" + +if [[ -n "${TOKENIZER_PATH}" ]]; then + echo "Getting tokenizer files" + cp "${TOKENIZER_PATH}"/* . + echo "" +fi + +echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes" + +pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger + +# Create the nsys directory. 
+mkdir -p "${explicit_log_dir}/nsys" + +# Collect diagnostics to a single line +kv="\"kernel_version\": \"$(uname --kernel-release)\"" +if command -v nvidia-smi &> /dev/null; then + cuda_v=$(nvidia-smi -q -x | grep -Po '(?<=<cuda_version>).*(?=</cuda_version>)' || true) + driver_v=$(nvidia-smi -q -x | grep -Po '(?<=<driver_version>).*(?=</driver_version>)' || true) + vbios_v=$(nvidia-smi -q -x | grep -Po '(?<=<vbios_version>).*(?=</vbios_version>)' | head -n1 || true) + kv="${kv}, \"cuda_version\": \"${cuda_v}\"" + kv="${kv}, \"driver_version\": \"${driver_v}\"" + kv="${kv}, \"vbios_version\": \"${vbios_v}\"" +fi +echo "VERSION_DIAGNOSTICS: {${kv}}" + + +export HF_TOKEN=YOUR_HF_TOKEN + +cd /opt +rm -rf Megatron-Bridge +git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git +cd Megatron-Bridge +git checkout f7a9428f301fa17ac374d5e7166a63b0aa4771af +git submodule update --init --recursive +sed -i -e '/pretrain(config=recipe/i \ recipe.dist.distributed_timeout_minutes = 60' scripts/performance/run_script.py +ls + +cp $CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH scripts/performance/ + +worker_command=$(cat <<- EOM + if [ "\$RANK" -eq "0" ]; then + echo "Worker 0 is stalling for a few seconds.." 
; + sleep 3 ; + echo "The detected environment within worker rank 0 is:" ; + env | sed 's/^/ /' ; + fi ; + + cd /opt/Megatron-Bridge ; + + numactl \ + --cpunodebind=\$((LOCAL_RANK/4)) \ + --membind=\$((LOCAL_RANK/4)) nsys profile \ + -t nvtx,cuda \ + --cuda-event-trace=false \ + --sample=none \ + --capture-range=cudaProfilerApi \ + --capture-range-end=stop \ + --kill none \ + -o "/${explicit_log_dir}/$JOB_IDENTIFIER/rank-\$RANK" \ + --force-overwrite true \ + --session-new "nsys-\$RANDOM-\$RANK" \ + nice -10 \ + python scripts/performance/custom_setup_experiment.py \ + --gpu b200 \ + --model_family_name gpt_oss \ + --model_recipe_name gpt_oss_120b \ + --gpus_per_node 8 \ + --num_gpus 64 \ + --seq_length 4096 \ + --compute_dtype bf16 \ + --global_batch_size 1280 \ + --micro_batch_size 4 \ + --tensor_model_parallel_size 1 \ + --pipeline_model_parallel_size 1 \ + --context_parallel_size 1 \ + --expert_model_parallel_size 64 \ + --cuda_graph_impl transformer_engine \ + --cuda_graph_scope moe_router,moe_preprocess,attn \ + --max_steps 30 + +EOM +) + +echo "$worker_command" > worker_command.sh +chmod 777 worker_command.sh + +torchrun \ +--nproc-per-node="8" \ +--nnodes="8" \ +--node_rank="${JOB_COMPLETION_INDEX}" \ +--rdzv_id="${JOB_IDENTIFIER}" \ +--master_addr="${MASTER_ADDR}" \ +--master_port="${MASTER_PORT}" \ +--no-python bash worker_command.sh + + +if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then + mkdir -p "${ARTIFACT_DIR}" + cp -r "${explicit_log_dir}"/* "${ARTIFACT_DIR}/" + env > "${ARTIFACT_DIR}/environ.txt" + ls "${ARTIFACT_DIR}" +fi +echo "Training completed" +echo "Pod on $(hostname --fqdn) is exiting" \ No newline at end of file diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-config-configmap.yaml b/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/templates/workload-config-configmap.yaml similarity index 100% rename from 
training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-config-configmap.yaml rename to training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/templates/workload-config-configmap.yaml diff --git a/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/templates/workload-job.yaml b/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/templates/workload-job.yaml new file mode 100644 index 00000000..b4ffa210 --- /dev/null +++ b/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/templates/workload-job.yaml @@ -0,0 +1,333 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +{{$timestamp := now | date "2006-01-02-15-04-05"}} +{{$jobSuffix := randAlphaNum 4 | lower}} +{{$jobuuid := uuidv4}} +{{$nodes := div .Values.workload.gpus 8 | max 1}} +{{$gpusPerNode := min .Values.workload.gpus 8}} +{{- $root := . 
-}} + +apiVersion: jobset.x-k8s.io/v1alpha2 +kind: JobSet +metadata: + name: "{{ .Release.Name }}" + namespace: default + labels: + {{- if $root.Values.queue }} + kueue.x-k8s.io/queue-name: "{{ $root.Values.queue }}" + {{- end }} +spec: + {{- if $root.Values.queue }} + suspend: true + {{- end }} + failurePolicy: + maxRestarts: {{ default 0 $root.Values.workload.max_workload_restarts }} + replicatedJobs: + - name: workload + replicas: 1 + template: + spec: + parallelism: {{ $nodes }} + completions: {{ $nodes }} + backoffLimit: 0 + completionMode: Indexed + activeDeadlineSeconds: 14400 # 4 hours (4 * 60 * 60) + ttlSecondsAfterFinished: 43200 # 12 hours (12 * 60 * 60) + template: + metadata: + annotations: + kubectl.kubernetes.io/default-container: workload + {{- if $root.Values.volumes.gcsVolumes }} + gke-gcsfuse/volumes: "true" + gke-gcsfuse/cpu-limit: "500m" + gke-gcsfuse/memory-limit: "1Ti" + gke-gcsfuse/ephemeral-storage-limit: "2Ti" + {{- end }} + {{- if $root.Values.volumes.psVolumes }} + gke-parallelstore/volumes: "true" + gke-parallelstore/cpu-limit: "0" + gke-parallelstore/memory-limit: "0" + {{- end }} + {{- if and $root.Values.queue $root.Values.tasSettings.topologyRequest }} + {{- toYaml .Values.tasSettings.topologyRequest | nindent 14 }} + {{- end }} + {{- if and $root.Values.queue $root.Values.dwsSettings.maxRunDurationSeconds }} + provreq.kueue.x-k8s.io/maxRunDurationSeconds: "{{ $root.Values.dwsSettings.maxRunDurationSeconds }}" + {{- end }} + {{- if not $root.Values.network.hostNetwork }} + networking.gke.io/default-interface: "eth0" + networking.gke.io/interfaces: | + {{- if $root.Values.network.subnetworks }} + [ + {{- range $i, $subnetwork := $root.Values.network.subnetworks }} + {"interfaceName":"eth{{ $i }}","network":"{{ $subnetwork }}"}{{ eq $i 9 | ternary "" ","}} + {{- end }} + ] + {{- else }} + [ + {"interfaceName":"eth0","network":"default"}, + {"interfaceName":"eth1","network":"gvnic-1"}, + {{- range $i := until 8 }} + 
{"interfaceName":"eth{{ add 2 $i }}","network":"rdma-{{ $i }}"}{{ eq $i 7 | ternary "" ","}} + {{- end }} + ] + {{- end }} + {{- end }} + spec: + {{- if $root.Values.network.hostNetwork }} + hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet + {{- end }} + subdomain: "{{.Release.Name}}" + restartPolicy: Never + {{- if $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "In" + values: + {{- range $hostname := $root.Values.targetNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + {{- if $root.Values.avoidNodes }} + {{- if not $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + {{- end }} + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "NotIn" + values: + {{- range $hostname := $root.Values.avoidNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + tolerations: + - operator: "Exists" + key: nvidia.com/gpu + - operator: "Exists" + key: cloud.google.com/impending-node-termination + + volumes: + {{ if $root.Values.network.gibVersion }} + - name: gib + emptyDir: {} + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + configMap: + name: "{{.Release.Name}}-config" + items: + - key: workload-configuration + path: {{ $root.Values.workload.configFile | default "workload-configuration" }} + {{- end }} + + - name: workload-launcher + configMap: + name: "{{.Release.Name}}-launcher" + + - name: shared-memory + emptyDir: + medium: "Memory" + sizeLimit: 250Gi + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + persistentVolumeClaim: + claimName: "{{ $pvc.claimName }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + csi: + driver: gcsfuse.csi.storage.gke.io + volumeAttributes: + bucketName: "{{ 
$gcs.bucketName }}" + {{- if $gcs.mountOptions }} + mountOptions: "{{ $gcs.mountOptions }}" + {{- end }} + {{- end}} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + hostPath: + path: /mnt/stateful_partition/kube-ephemeral-ssd + {{- end }} + + initContainers: + {{ if $root.Values.network.gibVersion }} + - name: nccl-plugin-installer + image: {{ $root.Values.network.gibVersion }} + imagePullPolicy: Always + args: + - | + set -ex + /scripts/container_entry.sh install --install-nccl + cp -R /var/lib/gib/lib64/. /target/usr/local/gib/lib64 + cp -R /var/lib/gib/. /target/usr/local/gib + command: + - /bin/sh + - -c + volumeMounts: + - mountPath: /target/usr/local/gib + name: gib + {{ end}} + + containers: + {{- if $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-sidecar + image: {{ $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-metadata-prefetch + image: {{ $root.Values.workload.gcsSidecarImage }} + {{- end }} + {{- if $root.Values.workload.psSidecarImage }} + - name: gke-parallelstore-sidecar + image: {{ $root.Values.workload.psSidecarImage }} + {{- end }} + + - name: workload + image: "{{ $root.Values.workload.image }}" + imagePullPolicy: Always + {{- if $root.Values.network.hostNetwork }} + securityContext: + privileged: true + {{- end }} + env: + - name: JOB_IDENTIFIER + value: "{{ .Release.Name }}-{{ $timestamp }}" + - name: JOB_TIMESTAMP + value: "{{ $timestamp }}" + - name: JOB_UUID + value: "{{ $jobuuid }}" + - name: JOB_ORCHESTRATOR + value: "gke" + # Add RANK based on the pod's index provided by the Indexed Job + # This is crucial for torch.distributed initialization. 
+ - name: JOB_COMPLETION_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index'] + - name: RANK_0_FQDN + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: HOSTNAME_PREFIX + value: "{{.Release.Name}}-workload-" + - name: DOMAIN_NAME + value: "{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_ADDR + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_PORT + value: "6002" + - name: WORLD_SIZE + value: "{{ $root.Values.workload.gpus }}" + - name: NNODES + value: "{{ $nodes }}" + - name: GPUS_PER_NODE + value: "{{ $gpusPerNode }}" + + - name: NCCL_PLUGIN_PATH + value: /usr/local/gib/lib64:/usr/local/nvidia/lib64 + + {{ if $root.Values.network.gibVersion }} + - name: NCCL_INIT_SCRIPT + value: "/usr/local/gib/scripts/set_nccl_env.sh" + {{ end }} + + {{ if $root.Values.network.ncclSettings }} + {{- toYaml .Values.network.ncclSettings | nindent 14 }} + {{ end }} + + {{ if $root.Values.workload.envs }} + {{- toYaml .Values.workload.envs | nindent 14 }} + {{ end }} + + command: + - bash + - -c + - | + echo "Pod on $(hostname --fqdn) is running" + echo "Pod is assigned job index of $JOB_COMPLETION_INDEX" + + if [[ -n "${NCCL_INIT_SCRIPT}" ]]; then + echo "Running NCCL init script: ${NCCL_INIT_SCRIPT}" + source ${NCCL_INIT_SCRIPT} + fi + + # Overriding NCCL_SOCKET_IFNAME definition + export NCCL_SOCKET_IFNAME="eth0,eth1" + export NCCL_TUNER_CONFIG_PATH=/usr/local/gib/configs/tuner_config_a4.txtpb + + echo "Launching workload with the following arguments:" + {{- range $root.Values.workload.defaultArguments }} + echo " {{ . }}" + {{- end }} + {{- range $root.Values.workload.arguments }} + echo " {{ . }}" + {{- end }} + echo "" + + sleep 10 + + bash /workload/launcher/launch-workload.sh \ + {{- range $root.Values.workload.defaultArguments }} + {{ . }} \ + {{- end }} + {{- range $root.Values.workload.arguments }} + {{ . 
}} \ + {{- end }} + + + volumeMounts: + {{ if $root.Values.network.gibVersion }} + - name: gib + mountPath: /usr/local/gib + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + mountPath: {{ $root.Values.workload.configPath | default "/workload/configs" }} + {{- end }} + + - name: workload-launcher + mountPath: /workload/launcher + + - name: shared-memory + mountPath: /dev/shm + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + mountPath: "{{ $pvc.mountPath }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + mountPath: "{{ $gcs.mountPath }}" + {{- end }} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + mountPath: "{{ $root.Values.volumes.ssdMountPath }}" + {{- end }} + + resources: + limits: + nvidia.com/gpu: {{ $gpusPerNode }} diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-launcher-configmap.yaml b/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/templates/workload-launcher-configmap.yaml similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-launcher-configmap.yaml rename to training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/templates/workload-launcher-configmap.yaml diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-svc.yaml b/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/templates/workload-svc.yaml similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-svc.yaml rename to training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/templates/workload-svc.yaml diff --git 
a/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/values.yaml b/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/values.yaml new file mode 100644 index 00000000..ebfad950 --- /dev/null +++ b/training/a4/gptoss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBS1280/recipe/values.yaml @@ -0,0 +1,33 @@ +queue: null +dwsSettings: + maxRunDurationSeconds: null +tasSettings: + topologyRequest: + kueue.x-k8s.io/podset-preferred-topology: kubernetes.io/hostname +volumes: + gcsVolumes: true + psVolumes: false + gcsMounts: + - bucketName: null + mountPath: null +workload: + gpus: 64 + image: nvcr.io/nvidia/nemo:26.02 + defaultArguments[]: null + arguments[]: null + configFile: custom_setup_experiment.py + configPath: /workload/configs/ + envs: + - name: ARTIFACT_DIR + value: null + - name: GLOO_SOCKET_IFNAME + value: eth0 + - name: CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH + value: /workload/configs/custom_setup_experiment.py +network: + hostNetwork: true + subnetworks[]: null + gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1 + ncclSettings: + - name: NCCL_DEBUG + value: WARN diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/Chart.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/Chart.yaml similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/Chart.yaml rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/Chart.yaml diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/README.md b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/README.md similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/README.md rename to 
training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/README.md diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/launcher.sh b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/launcher.sh similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/launcher.sh rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/launcher.sh diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/llama3-1-405b-seq8192-gbs128-mbs1-gpus128.py b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/llama3-1-405b-seq8192-gbs128-mbs1-gpus128.py similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/llama3-1-405b-seq8192-gbs128-mbs1-gpus128.py rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/llama3-1-405b-seq8192-gbs128-mbs1-gpus128.py diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/recipe_launch_command.sh b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/recipe_launch_command.sh similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/recipe_launch_command.sh rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/recipe_launch_command.sh diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-config-configmap.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/templates/workload-config-configmap.yaml similarity index 100% rename from 
training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-config-configmap.yaml rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/templates/workload-config-configmap.yaml diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/templates/workload-job.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/templates/workload-job.yaml similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/templates/workload-job.yaml rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/templates/workload-job.yaml diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-launcher-configmap.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/templates/workload-launcher-configmap.yaml similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-launcher-configmap.yaml rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/templates/workload-launcher-configmap.yaml diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-svc.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/templates/workload-svc.yaml similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-svc.yaml rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/templates/workload-svc.yaml diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/values.yaml 
b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/values.yaml similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/16node-fp8cs-seq8192-gbs128/recipe/values.yaml rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/16node-fp8cs-seq8192-gbs128/recipe/values.yaml diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/Chart.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/Chart.yaml similarity index 100% rename from training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/Chart.yaml rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/Chart.yaml diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/README.md b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/README.md similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/README.md rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/README.md diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/launcher.sh b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/launcher.sh similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/launcher.sh rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/launcher.sh diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/llama3-1-405b-seq8192-gbs256-mbs1-gpus256.py b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/llama3-1-405b-seq8192-gbs256-mbs1-gpus256.py similarity index 
100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/llama3-1-405b-seq8192-gbs256-mbs1-gpus256.py rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/llama3-1-405b-seq8192-gbs256-mbs1-gpus256.py diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/recipe_launch_command.sh b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/recipe_launch_command.sh similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/recipe_launch_command.sh rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/recipe_launch_command.sh diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/templates/workload-config-configmap.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-config-configmap.yaml similarity index 100% rename from training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/templates/workload-config-configmap.yaml rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-config-configmap.yaml diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-job.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-job.yaml similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-job.yaml rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-job.yaml diff --git 
a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/templates/workload-launcher-configmap.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-launcher-configmap.yaml similarity index 100% rename from training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/templates/workload-launcher-configmap.yaml rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-launcher-configmap.yaml diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/templates/workload-svc.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-svc.yaml similarity index 100% rename from training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/templates/workload-svc.yaml rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-svc.yaml diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/values.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/values.yaml similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/32node-fp8cs-seq8192-gbs256/recipe/values.yaml rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/32node-fp8cs-seq8192-gbs256/recipe/values.yaml diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/Chart.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/Chart.yaml new file mode 100644 index 00000000..af46c11a --- /dev/null +++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/Chart.yaml @@ -0,0 +1,20 @@ +# Copyright 2025 Google LLC +# +# 
Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v2 +name: a4_jobset_workload +description: a4_jobset_workload +type: application +version: 0.1.0 +appVersion: "1.16.0" diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/README.md b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/README.md similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/README.md rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/README.md diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/launcher.sh b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/launcher.sh similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/launcher.sh rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/launcher.sh diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/llama3-1-405b-seq8192-gbs2048-mbs1-gpus64.py b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/llama3-1-405b-seq8192-gbs2048-mbs1-gpus64.py similarity index 100% rename from 
training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/llama3-1-405b-seq8192-gbs2048-mbs1-gpus64.py rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/llama3-1-405b-seq8192-gbs2048-mbs1-gpus64.py diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/recipe_launch_command.sh b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/recipe_launch_command.sh similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/recipe_launch_command.sh rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/recipe_launch_command.sh diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-config-configmap.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-config-configmap.yaml new file mode 100644 index 00000000..a1d54cee --- /dev/null +++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-config-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +{{- if .Values.workload.configFile }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-config" +data: + workload-configuration: |- +{{- if .Values.workload_config }} +{{ .Values.workload_config | nindent 4 }} +{{- else }} +{{ "config: null" | nindent 4 }} +{{- end }} +{{- end }} diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-job.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-job.yaml similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-job.yaml rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-job.yaml diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-launcher-configmap.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-launcher-configmap.yaml new file mode 100644 index 00000000..7026e0f1 --- /dev/null +++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-launcher-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-launcher" +data: + launch-workload.sh: |- +{{- if .Values.workload_launcher }} +{{ .Values.workload_launcher | nindent 4 }} +{{- else }} + #!/bin/bash + echo "No workload launcher specified" + exit 1 +{{- end }} diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-svc.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-svc.yaml new file mode 100644 index 00000000..7cfe220b --- /dev/null +++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/templates/workload-svc.yaml @@ -0,0 +1,22 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: Service +metadata: + name: "{{ .Release.Name }}" +spec: + clusterIP: None + selector: + jobset.sigs.k8s.io/jobset-name: "{{ .Release.Name }}" diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/values.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/values.yaml similarity index 100% rename from training/a4/llama3-1-405b/nemo-pretraining-gke/8node-fp8cs-seq8192-gbs256/recipe/values.yaml rename to training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2511/8node-fp8cs-seq8192-gbs256/recipe/values.yaml diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/Chart.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/Chart.yaml new file mode 100644 index 00000000..af46c11a --- /dev/null +++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/Chart.yaml @@ -0,0 +1,20 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+
+apiVersion: v2
+name: a4_jobset_workload
+description: a4_jobset_workload
+type: application
+version: 0.1.0
+appVersion: "1.16.0"
diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/README.md b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/README.md
new file mode 100644
index 00000000..fc9e8e33
--- /dev/null
+++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/README.md
@@ -0,0 +1,153 @@
+
+# Pretrain Llama 3.1 405B workloads on A4 GKE node pools with the NVIDIA Megatron-Bridge framework
+
+This recipe outlines the steps for running a Llama 3.1 405B pretraining
+workload on [A4 GKE node pools](https://cloud.google.com/kubernetes-engine) by using the
+[Megatron-Bridge pretraining workload](https://github.com/NVIDIA-NeMo/Megatron-Bridge).
+
+## Orchestration and deployment tools
+
+For this recipe, the following setup is used:
+
+- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
+- Pretraining job configuration and deployment - A Helm chart is used to
+  configure and deploy the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset)
+  resource, which manages the execution of the
+  [Megatron-Bridge pretraining workload](https://github.com/NVIDIA-NeMo/Megatron-Bridge).
+
+## Test environment
+
+This recipe has been optimized for and tested with the following configuration:
+
+- GKE cluster: follow the Cluster Toolkit
+  [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4)
+  to create your A4 GKE cluster.
+
+## Training dataset
+
+This recipe uses a mock pretraining dataset provided by the Megatron-Bridge framework.
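As a reference point for the sizes used in this recipe: the Helm chart's `templates/workload-job.yaml` derives the node count and per-node GPU count from the requested GPU total (`div .Values.workload.gpus 8 | max 1` and `min .Values.workload.gpus 8`). The same arithmetic can be sketched in plain bash, using the recipe's 16-node configuration as an example value:

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the node/GPU math in templates/workload-job.yaml.
# total_gpus=128 is an example matching this recipe's 16-node configuration.
total_gpus=128

# nodes = max(total_gpus / 8, 1)
nodes=$(( total_gpus / 8 ))
if (( nodes < 1 )); then nodes=1; fi

# gpus_per_node = min(total_gpus, 8)
gpus_per_node=$(( total_gpus < 8 ? total_gpus : 8 ))

echo "${nodes} nodes x ${gpus_per_node} GPUs per node"
```

For `total_gpus=128` this prints `16 nodes x 8 GPUs per node`, matching the parallelism (TP=8 x PP=8 x DP=2) configured in `launcher.sh`.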
+
+## Docker container image
+
+This recipe uses the following Docker images:
+
+- `nvcr.io/nvidia/nemo:26.02`
+- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1`
+
+## Run the recipe
+
+From your client workstation, complete the following steps:
+
+### Configure environment settings
+
+Set the environment variables to match your environment:
+
+  ```bash
+  export PROJECT_ID=<PROJECT_ID>
+  export CLUSTER_REGION=<CLUSTER_REGION>
+  export CLUSTER_NAME=<CLUSTER_NAME>
+  export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs://
+  export KUEUE_NAME=<KUEUE_NAME>
+  export HF_TOKEN=<HF_TOKEN>
+  ```
+
+Replace the following values:
+
+  - `<PROJECT_ID>`: your Google Cloud project ID.
+  - `<CLUSTER_REGION>`: the region where your cluster is located.
+  - `<CLUSTER_NAME>`: the name of your GKE cluster.
+  - `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
+  - `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the cluster toolkit is `a4`. Make sure to verify the name of the local queue in your cluster.
+  - `<HF_TOKEN>`: your Hugging Face token.
+
+Set the default project:
+
+  ```bash
+  gcloud config set project $PROJECT_ID
+  ```
+
+### Get the recipe
+
+Clone the `gpu-recipes` repository and set a reference to the recipe folder.
+
+```
+git clone https://github.com/ai-hypercomputer/gpu-recipes.git
+cd gpu-recipes
+export REPO_ROOT=`git rev-parse --show-toplevel`
+export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128
+cd $RECIPE_ROOT
+```
+
+### Get cluster credentials
+
+```
+gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
+```
+
+### Configure and submit a pretraining job
+
+#### Using 16 nodes (128 GPUs) with fp8cs precision
+
+To execute the job with the default settings, run the following command from
+your client:
+
+```bash
+cd $RECIPE_ROOT
+export WORKLOAD_NAME=$USER-a4-llama31-405b-16node
+helm install $WORKLOAD_NAME . \
-f values.yaml \
+--set-file workload_launcher=launcher.sh \
+--set workload.image=nvcr.io/nvidia/nemo:26.02 \
+--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+--set volumes.gcsMounts[0].mountPath=/job-logs \
+--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+--set queue=${KUEUE_NAME}
+```
+
+**Examples**
+
+- To set the number of training steps to 100, run the following command from
+  your client:
+
+  ```bash
+  cd $RECIPE_ROOT
+  export WORKLOAD_NAME=$USER-a4-llama31-405b-16node
+  helm install $WORKLOAD_NAME . -f values.yaml \
+  --set-file workload_launcher=launcher.sh \
+  --set workload.image=nvcr.io/nvidia/nemo:26.02 \
+  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+  --set volumes.gcsMounts[0].mountPath=/job-logs \
+  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+  --set queue=${KUEUE_NAME} \
+  --set workload.arguments[0]="trainer.max_steps=100"
+  ```
+
+### Monitor the job
+
+To check the status of pods in your job, run the following command:
+
+```
+kubectl get pods | grep JOB_NAME_PREFIX
+```
+
+Replace the following:
+
+- JOB_NAME_PREFIX - your job name prefix. For example, $USER-a4-llama31-405b-16node.
+
+To get the logs for one of the pods, run the following command:
+
+```
+kubectl logs POD_NAME
+```
+
+Information about the training job's progress, including crucial details such as
+loss, step count, and step time, is generated by the rank 0 process.
+This process runs on the pod whose name begins with
+`JOB_NAME_PREFIX-workload-0-0`.
+For example: `$USER-a4-llama31-405b-16node-workload-0-0-s9zrv`.
+
+### Uninstall the Helm release
+
+You can delete the job and other resources created by the Helm chart.
To +uninstall Helm, run the following command from your client: + +```bash +helm uninstall $USER-a4-llama31-405b-16node +``` diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/custom_setup_experiment.py b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/custom_setup_experiment.py new file mode 100644 index 00000000..2337fdec --- /dev/null +++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/custom_setup_experiment.py @@ -0,0 +1,327 @@ +import glob +import logging +import os +import sys +import time +from pathlib import Path +from typing import Any, Dict, List, Optional + +import nemo_run as run +from nemo_run.config import get_nemorun_home + + +try: + from argument_parser import parse_cli_args + from utils.evaluate import calc_convergence_and_performance + from utils.executors import dgxc_executor, slurm_executor + from utils.utils import get_exp_name_config, select_config_variant_interactive +except (ImportError, ModuleNotFoundError): + from .argument_parser import parse_cli_args + from .utils.evaluate import calc_convergence_and_performance + from .utils.executors import dgxc_executor, slurm_executor + from .utils.utils import get_exp_name_config, select_config_variant_interactive + +try: + import wandb + + HAVE_WANDB = True +except (ImportError, ModuleNotFoundError): + HAVE_WANDB = False + +try: + from perf_plugins import NsysPlugin, PerfEnvPlugin, PyTorchProfilerPlugin + from resiliency_plugins import FaultTolerancePlugin +except (ImportError, ModuleNotFoundError): + from .perf_plugins import NsysPlugin, PerfEnvPlugin, PyTorchProfilerPlugin + from .resiliency_plugins import FaultTolerancePlugin + + +SCRIPT_DIR = Path(__file__).parent.resolve() +ENTRYPOINT_PEFORMANCE = "run_script.py" +ENTRYPOINT_RECIPE = "run_recipe.py" + +logging.basicConfig(level=logging.DEBUG) +logger = logging.getLogger(__name__) + + + + +def main( + use_recipes: bool, + 
model_family_name: str, + model_recipe_name: str, + task: str, + compute_dtype: str, + gpu: str, + hf_token: str, + detach: bool, + dryrun: bool, + enable_vboost: bool, + enable_nsys: bool, + pytorch_profiler: bool, + moe_a2a_overlap: bool, + tp_size: Optional[int], + pp_size: Optional[int], + cp_size: Optional[int], + ep_size: Optional[int], + wandb_key: str, + wandb_project_name: str, + wandb_experiment_name: str, + wandb_entity_name: str, + profiling_start_step: int, + profiling_stop_step: int, + record_memory_history: bool, + profiling_gpu_metrics: bool, + profiling_ranks: Optional[List[int]], + nsys_trace: Optional[List[str]], + nsys_extra_args: Optional[List[str]], + nemo_home: str, + account: str, + partition: str, + log_dir: str, + gpus_per_node: int, + time_limit: str, + container_image: str, + custom_mounts: List[str], + custom_env_vars: Dict[str, str], + custom_srun_args: List[str], + custom_bash_cmds: List[List[str]], + nccl_ub: bool, + pretrained_checkpoint: Optional[str], + num_gpus: int, + is_long_convergence_run: bool, + additional_slurm_params: Optional[Dict[str, Any]], + golden_values_path: str, + convergence_params: Dict[str, Any], + performance_params: Dict[str, Any], + memory_params: Dict[str, Any], + max_retries: int, + dgxc_base_url: str, + dgxc_cluster: str, + dgxc_kube_apiserver_url: str, + dgxc_app_id: str, + dgxc_app_secret: str, + dgxc_project_name: str, + dgxc_pvc_claim_name: str, + dgxc_pvc_mount_path: str, + config_variant: str = "v1", +): + """Sets up the experiment and runs it.""" + if ( + model_family_name in ["qwen3"] + and model_recipe_name + in [ + "qwen3_30b_a3b", + "qwen3_235b_a22b", + ] + and task == "pretrain" + ): + assert hf_token is not None, "HF token is required for Qwen3 tokenizer. NullTokenizer to be used soon." 
+ + if wandb_key is not None: + assert wandb_project_name is not None and wandb_experiment_name is not None, ( + "both wandb_project_name and wandb_experiment_name are required for logging with WandB" + ) + + if use_recipes: + script_name = ENTRYPOINT_RECIPE + exp_name = ( + wandb_experiment_name + if wandb_experiment_name is not None + else f"{model_recipe_name}_{task}_{num_gpus}gpu_{gpu}" + ) + + else: + script_name = ENTRYPOINT_PEFORMANCE + exp_config = get_exp_name_config( + args, model_family_name, model_recipe_name, gpu, compute_dtype, task, config_variant + ) + exp_name = ( + wandb_experiment_name + if wandb_experiment_name is not None + else f"{task}_{model_recipe_name}_{compute_dtype}_{exp_config}" + ) + + if pretrained_checkpoint is not None: + custom_mounts.append(f"{pretrained_checkpoint}:{pretrained_checkpoint}") + + import os + rank = os.environ.get('RANK', '0') + exp_name += f'_worker{rank}' + + run_script_path = SCRIPT_DIR / script_name + logger.info(f"Run script path: {run_script_path}") + if not run_script_path.is_file(): + logger.error(f"Specified run script not found: {run_script_path}") + sys.exit(1) + + custom_mounts.extend( + [ + f"{run_script_path}:{run_script_path}", + f"{SCRIPT_DIR}:{SCRIPT_DIR}", + ] + ) + + if nccl_ub: + custom_env_vars.update({"NCCL_NVLS_ENABLE": "1", "NCCL_CTA_POLICY": "1"}) + + executor = run.LocalExecutor() + + plugins = [] + + if not use_recipes: + plugins.append( + PerfEnvPlugin( + enable_vboost=enable_vboost, + moe_a2a_overlap=moe_a2a_overlap, + tp_size=tp_size, + pp_size=pp_size, + cp_size=cp_size, + ep_size=ep_size, + model_family_name=model_family_name, + model_recipe_name=model_recipe_name, + gpu=gpu, + compute_dtype=compute_dtype, + train_task=task, + config_variant=config_variant, + ) + ) + + if enable_nsys: + plugins.append( + NsysPlugin( + profile_step_start=profiling_start_step, + profile_step_end=profiling_stop_step, + nsys_gpu_metrics=profiling_gpu_metrics, + profile_ranks=profiling_ranks, + 
nsys_trace=nsys_trace,
+                nsys_extra_args=nsys_extra_args,
+            )
+        )
+    if pytorch_profiler:
+        plugins.append(
+            PyTorchProfilerPlugin(
+                profile_step_start=profiling_start_step,
+                profile_step_end=profiling_stop_step,
+                profile_ranks=profiling_ranks,
+                record_memory_history=record_memory_history,
+            )
+        )
+
+    nemorun_script = run.Script(
+        path=str(run_script_path),
+        entrypoint="python",
+        env={"PYTHONPATH": f"{SCRIPT_DIR}:$PYTHONPATH"},
+        args=list(sys.argv[1:]),
+    )
+
+    logger.info("Will launch the following command with Nemo-Run: %s", " ".join(nemorun_script.to_command()))
+
+    run.run(
+        nemorun_script,
+        executor=executor,
+        plugins=plugins,
+        dryrun=dryrun,
+        detach=detach,
+        name=exp_name,
+    )
+
+
+if __name__ == "__main__":
+    parser = parse_cli_args()
+    args, unknown_args = parser.parse_known_args()
+
+    assert not (args.enable_nsys and args.pytorch_profiler), (
+        "Both NSys and PyTorch profiler cannot be enabled at the same time"
+    )
+
+    # probably better to use parser.parse_args() and make unknowns an error,
+    # but for now we'll just issue a warning.
+ if unknown_args: + logger.warning(f"Ignoring unrecognized arguments: {' '.join(unknown_args)}") + + # Handle --list_config_variants: show available variants and interactively select + config_variant = args.config_variant + if args.list_config_variants: + config_variant = select_config_variant_interactive( + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + gpu=args.gpu, + compute_dtype=args.compute_dtype, + task=args.task, + ) + + main( + use_recipes=args.use_recipes, + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + task=args.task, + compute_dtype=args.compute_dtype, + gpu=args.gpu, + hf_token=args.hf_token, + detach=args.detach, + dryrun=args.dryrun, + enable_vboost=args.enable_vboost, + enable_nsys=args.enable_nsys, + pytorch_profiler=args.pytorch_profiler, + moe_a2a_overlap=args.moe_a2a_overlap, + tp_size=args.tensor_model_parallel_size, + pp_size=args.pipeline_model_parallel_size, + cp_size=args.context_parallel_size, + ep_size=args.expert_model_parallel_size, + wandb_key=args.wandb_key, + wandb_project_name=args.wandb_project_name, + wandb_experiment_name=args.wandb_experiment_name, + wandb_entity_name=args.wandb_entity_name, + profiling_start_step=args.profiling_start_step, + profiling_stop_step=args.profiling_stop_step, + record_memory_history=args.record_memory_history, + profiling_gpu_metrics=args.profiling_gpu_metrics, + profiling_ranks=args.profiling_ranks, + nsys_trace=args.nsys_trace, + nsys_extra_args=args.nsys_extra_args, + nemo_home=args.nemo_home, + account=args.account, + partition=args.partition, + log_dir=args.log_dir, + gpus_per_node=args.gpus_per_node, + time_limit=args.time_limit, + container_image=args.container_image, + custom_mounts=args.custom_mounts, + custom_env_vars=args.custom_env_vars, + custom_srun_args=args.custom_srun_args, + custom_bash_cmds=args.custom_bash_cmds, + nccl_ub=args.nccl_ub, + pretrained_checkpoint=args.pretrained_checkpoint, + 
num_gpus=args.num_gpus, + is_long_convergence_run=args.is_long_convergence_run, + additional_slurm_params=args.additional_slurm_params, + golden_values_path=args.golden_values_path, + convergence_params={ + "correlation_threshold": args.correlation_threshold, + "high_loss_tolerance": args.high_loss_tolerance, + "medium_loss_tolerance": args.medium_loss_tolerance, + "low_loss_tolerance": args.low_loss_tolerance, + "final_loss_tolerance": args.final_loss_tolerance, + "max_outlier_ratio": args.max_outlier_ratio, + "outlier_threshold": args.outlier_threshold, + "skip_first_percent_loss": args.skip_first_percent_loss, + }, + performance_params={ + "timing_threshold": args.timing_threshold, + "skip_first_percent_time": args.skip_first_percent_time, + }, + memory_params={ + "memory_threshold": args.memory_threshold, + }, + max_retries=args.max_retries, + dgxc_base_url=args.dgxc_base_url, + dgxc_cluster=args.dgxc_cluster, + dgxc_kube_apiserver_url=args.dgxc_kube_apiserver_url, + dgxc_app_id=args.dgxc_app_id, + dgxc_app_secret=args.dgxc_app_secret, + dgxc_project_name=args.dgxc_project_name, + dgxc_pvc_claim_name=args.dgxc_pvc_claim_name, + dgxc_pvc_mount_path=args.dgxc_pvc_mount_path, + config_variant=config_variant, + ) \ No newline at end of file diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/launcher.sh b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/launcher.sh new file mode 100644 index 00000000..6af3d60f --- /dev/null +++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/launcher.sh @@ -0,0 +1,151 @@ +usage() +{ +cat << EOF +usage: bash ./launcher.sh [config-override [config-override ...]] +config-override (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000. 
+EOF +} + +parse_args() { + while [[ "$1" != "" ]]; do + case $(grep -o "=" <<< "$1" | wc -l) in + 1 ) + config_overrides+=("$1") + ;; + * ) + echo "Invalid config override: $1" + usage + exit 1 + esac + shift + done + config_overrides="${config_overrides[*]}" +} + +config_overrides=() +parse_args "$@" + +if [[ -z "${config_overrides[*]}" ]]; then + echo "No NeMo config overrides specified" +else + echo "NeMo config overrides:" + echo " ${config_overrides}" +fi + +export LD_LIBRARY_PATH="/usr/local/cuda/compat/lib:$NCCL_PLUGIN_PATH:$LD_LIBRARY_PATH" +ldconfig "$LD_LIBRARY_PATH" +echo "Added $LD_LIBRARY_PATH to ldconfig:" +ldconfig -p | grep libcuda | sed 's/^/ /' +echo "" + +if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then + explicit_log_dir=${EXPLICIT_LOG_DIR} +else + explicit_log_dir=workload_logs +fi +echo "Logging to ${explicit_log_dir}" + +if [[ -n "${TOKENIZER_PATH}" ]]; then + echo "Getting tokenizer files" + cp "${TOKENIZER_PATH}"/* . + echo "" +fi + +echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes" + +pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger + +# Create the nsys directory. 
+mkdir -p "${explicit_log_dir}/nsys"
+
+# Collect diagnostics to a single line
+kv="\"kernel_version\": \"$(uname --kernel-release)\""
+if command -v nvidia-smi &> /dev/null; then
+  cuda_v=$(nvidia-smi -q -x | grep -Po '(?<=<cuda_version>).*(?=</cuda_version>)' || true)
+  driver_v=$(nvidia-smi -q -x | grep -Po '(?<=<driver_version>).*(?=</driver_version>)' || true)
+  vbios_v=$(nvidia-smi -q -x | grep -Po '(?<=<vbios_version>).*(?=</vbios_version>)' | head -n1 || true)
+  kv="${kv}, \"cuda_version\": \"${cuda_v}\""
+  kv="${kv}, \"driver_version\": \"${driver_v}\""
+  kv="${kv}, \"vbios_version\": \"${vbios_v}\""
+fi
+echo "VERSION_DIAGNOSTICS: {${kv}}"
+
+
+export HF_TOKEN=${HF_TOKEN:-""}
+
+cd /opt
+rm -rf Megatron-Bridge
+git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
+cd Megatron-Bridge
+git checkout f7a9428f301fa17ac374d5e7166a63b0aa4771af
+git submodule update --init --recursive
+sed -i -e '/pretrain(config=recipe/i \ recipe.dist.distributed_timeout_minutes = 60' scripts/performance/run_script.py
+ls
+
+cp $CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH scripts/performance/
+
+worker_command=$(cat <<- EOM
+    if [ "\$RANK" -eq "0" ]; then
+      echo "Worker 0 is stalling for a few seconds.."
; + sleep 3 ; + echo "The detected environment within worker rank 0 is:" ; + env | sed 's/^/ /' ; + fi ; + + cd /opt/Megatron-Bridge ; + + numactl \ + --cpunodebind=\$((LOCAL_RANK/4)) \ + --membind=\$((LOCAL_RANK/4)) nsys profile \ + -t nvtx,cuda \ + --cuda-event-trace=false \ + --sample=none \ + --capture-range=cudaProfilerApi \ + --capture-range-end=stop \ + --kill none \ + -o "/${explicit_log_dir}/$JOB_IDENTIFIER/rank-\$RANK" \ + --force-overwrite true \ + --session-new "nsys-\$RANDOM-\$RANK" \ + nice -10 \ + python scripts/performance/custom_setup_experiment.py \ + --gpu b200 \ + --model_family_name llama \ + --model_recipe_name llama31_405b \ + --gpus_per_node 8 \ + --num_gpus 128 \ + --seq_length 8192 \ + --tensor_model_parallel_size 8 \ + --pipeline_model_parallel_size 8 \ + --context_parallel_size 1 \ + --virtual_pipeline_model_parallel_size 2 \ + --expert_tensor_parallel_size 1 \ + --global_batch_size 128 \ + --micro_batch_size 1 \ + --compute_dtype fp8_cs \ + --cuda_graph_impl none \ + --max_steps 30 + +EOM +) + +echo "$worker_command" > worker_command.sh +chmod 777 worker_command.sh + +torchrun \ +--nproc-per-node="8" \ +--nnodes="16" \ +--node_rank="${JOB_COMPLETION_INDEX}" \ +--rdzv_id="${JOB_IDENTIFIER}" \ +--master_addr="${MASTER_ADDR}" \ +--master_port="${MASTER_PORT}" \ +--no-python bash worker_command.sh + + +if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then + mkdir -p "${ARTIFACT_DIR}" + cp -r "${explicit_log_dir}"/* "${ARTIFACT_DIR}/" + env > "${ARTIFACT_DIR}/environ.txt" + ls "${ARTIFACT_DIR}" +fi +echo "Training completed" +echo "Pod on $(hostname --fqdn) is exiting" \ No newline at end of file diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/templates/workload-config-configmap.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/templates/workload-config-configmap.yaml new file mode 100644 index 00000000..a1d54cee --- /dev/null +++ 
b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/templates/workload-config-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +{{- if .Values.workload.configFile }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-config" +data: + workload-configuration: |- +{{- if .Values.workload_config }} +{{ .Values.workload_config | nindent 4 }} +{{- else }} +{{ "config: null" | nindent 4 }} +{{- end }} +{{- end }} diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/templates/workload-job.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/templates/workload-job.yaml new file mode 100644 index 00000000..b4ffa210 --- /dev/null +++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/templates/workload-job.yaml @@ -0,0 +1,333 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +{{$timestamp := now | date "2006-01-02-15-04-05"}} +{{$jobSuffix := randAlphaNum 4 | lower}} +{{$jobuuid := uuidv4}} +{{$nodes := div .Values.workload.gpus 8 | max 1}} +{{$gpusPerNode := min .Values.workload.gpus 8}} +{{- $root := . -}} + +apiVersion: jobset.x-k8s.io/v1alpha2 +kind: JobSet +metadata: + name: "{{ .Release.Name }}" + namespace: default + labels: + {{- if $root.Values.queue }} + kueue.x-k8s.io/queue-name: "{{ $root.Values.queue }}" + {{- end }} +spec: + {{- if $root.Values.queue }} + suspend: true + {{- end }} + failurePolicy: + maxRestarts: {{ default 0 $root.Values.workload.max_workload_restarts }} + replicatedJobs: + - name: workload + replicas: 1 + template: + spec: + parallelism: {{ $nodes }} + completions: {{ $nodes }} + backoffLimit: 0 + completionMode: Indexed + activeDeadlineSeconds: 14400 # 4 hours (4 * 60 * 60) + ttlSecondsAfterFinished: 43200 # 12 hours (12 * 60 * 60) + template: + metadata: + annotations: + kubectl.kubernetes.io/default-container: workload + {{- if $root.Values.volumes.gcsVolumes }} + gke-gcsfuse/volumes: "true" + gke-gcsfuse/cpu-limit: "500m" + gke-gcsfuse/memory-limit: "1Ti" + gke-gcsfuse/ephemeral-storage-limit: "2Ti" + {{- end }} + {{- if $root.Values.volumes.psVolumes }} + gke-parallelstore/volumes: "true" + gke-parallelstore/cpu-limit: "0" + gke-parallelstore/memory-limit: "0" + {{- end }} + {{- if and $root.Values.queue $root.Values.tasSettings.topologyRequest }} + {{- toYaml .Values.tasSettings.topologyRequest | nindent 14 }} + {{- end }} + {{- if and $root.Values.queue $root.Values.dwsSettings.maxRunDurationSeconds }} + provreq.kueue.x-k8s.io/maxRunDurationSeconds: "{{ $root.Values.dwsSettings.maxRunDurationSeconds }}" + {{- end }} + {{- if not $root.Values.network.hostNetwork }} + networking.gke.io/default-interface: "eth0" + networking.gke.io/interfaces: | + {{- if $root.Values.network.subnetworks }} + [ 
+ {{- range $i, $subnetwork := $root.Values.network.subnetworks }} + {"interfaceName":"eth{{ $i }}","network":"{{ $subnetwork }}"}{{ eq $i 9 | ternary "" ","}} + {{- end }} + ] + {{- else }} + [ + {"interfaceName":"eth0","network":"default"}, + {"interfaceName":"eth1","network":"gvnic-1"}, + {{- range $i := until 8 }} + {"interfaceName":"eth{{ add 2 $i }}","network":"rdma-{{ $i }}"}{{ eq $i 7 | ternary "" ","}} + {{- end }} + ] + {{- end }} + {{- end }} + spec: + {{- if $root.Values.network.hostNetwork }} + hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet + {{- end }} + subdomain: "{{.Release.Name}}" + restartPolicy: Never + {{- if $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "In" + values: + {{- range $hostname := $root.Values.targetNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + {{- if $root.Values.avoidNodes }} + {{- if not $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + {{- end }} + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "NotIn" + values: + {{- range $hostname := $root.Values.avoidNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + tolerations: + - operator: "Exists" + key: nvidia.com/gpu + - operator: "Exists" + key: cloud.google.com/impending-node-termination + + volumes: + {{ if $root.Values.network.gibVersion }} + - name: gib + emptyDir: {} + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + configMap: + name: "{{.Release.Name}}-config" + items: + - key: workload-configuration + path: {{ $root.Values.workload.configFile | default "workload-configuration" }} + {{- end }} + + - name: workload-launcher + configMap: + name: "{{.Release.Name}}-launcher" + + - name: shared-memory + emptyDir: + medium: "Memory" + sizeLimit: 250Gi + + {{- range $pvc := 
$root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + persistentVolumeClaim: + claimName: "{{ $pvc.claimName }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + csi: + driver: gcsfuse.csi.storage.gke.io + volumeAttributes: + bucketName: "{{ $gcs.bucketName }}" + {{- if $gcs.mountOptions }} + mountOptions: "{{ $gcs.mountOptions }}" + {{- end }} + {{- end}} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + hostPath: + path: /mnt/stateful_partition/kube-ephemeral-ssd + {{- end }} + + initContainers: + {{ if $root.Values.network.gibVersion }} + - name: nccl-plugin-installer + image: {{ $root.Values.network.gibVersion }} + imagePullPolicy: Always + args: + - | + set -ex + /scripts/container_entry.sh install --install-nccl + cp -R /var/lib/gib/lib64/. /target/usr/local/gib/lib64 + cp -R /var/lib/gib/. /target/usr/local/gib + command: + - /bin/sh + - -c + volumeMounts: + - mountPath: /target/usr/local/gib + name: gib + {{ end}} + + containers: + {{- if $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-sidecar + image: {{ $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-metadata-prefetch + image: {{ $root.Values.workload.gcsSidecarImage }} + {{- end }} + {{- if $root.Values.workload.psSidecarImage }} + - name: gke-parallelstore-sidecar + image: {{ $root.Values.workload.psSidecarImage }} + {{- end }} + + - name: workload + image: "{{ $root.Values.workload.image }}" + imagePullPolicy: Always + {{- if $root.Values.network.hostNetwork }} + securityContext: + privileged: true + {{- end }} + env: + - name: JOB_IDENTIFIER + value: "{{ .Release.Name }}-{{ $timestamp }}" + - name: JOB_TIMESTAMP + value: "{{ $timestamp }}" + - name: JOB_UUID + value: "{{ $jobuuid }}" + - name: JOB_ORCHESTRATOR + value: "gke" + # Add RANK based on the pod's index provided by the Indexed Job + # This is crucial for torch.distributed initialization. 
+ - name: JOB_COMPLETION_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index'] + - name: RANK_0_FQDN + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: HOSTNAME_PREFIX + value: "{{.Release.Name}}-workload-" + - name: DOMAIN_NAME + value: "{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_ADDR + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_PORT + value: "6002" + - name: WORLD_SIZE + value: "{{ $root.Values.workload.gpus }}" + - name: NNODES + value: "{{ $nodes }}" + - name: GPUS_PER_NODE + value: "{{ $gpusPerNode }}" + + - name: NCCL_PLUGIN_PATH + value: /usr/local/gib/lib64:/usr/local/nvidia/lib64 + + {{ if $root.Values.network.gibVersion }} + - name: NCCL_INIT_SCRIPT + value: "/usr/local/gib/scripts/set_nccl_env.sh" + {{ end }} + + {{ if $root.Values.network.ncclSettings }} + {{- toYaml .Values.network.ncclSettings | nindent 14 }} + {{ end }} + + {{ if $root.Values.workload.envs }} + {{- toYaml .Values.workload.envs | nindent 14 }} + {{ end }} + + command: + - bash + - -c + - | + echo "Pod on $(hostname --fqdn) is running" + echo "Pod is assigned job index of $JOB_COMPLETION_INDEX" + + if [[ -n "${NCCL_INIT_SCRIPT}" ]]; then + echo "Running NCCL init script: ${NCCL_INIT_SCRIPT}" + source ${NCCL_INIT_SCRIPT} + fi + + # Overriding NCCL_SOCKET_IFNAME definition + export NCCL_SOCKET_IFNAME="eth0,eth1" + export NCCL_TUNER_CONFIG_PATH=/usr/local/gib/configs/tuner_config_a4.txtpb + + echo "Launching workload with the following arguments:" + {{- range $root.Values.workload.defaultArguments }} + echo " {{ . }}" + {{- end }} + {{- range $root.Values.workload.arguments }} + echo " {{ . }}" + {{- end }} + echo "" + + sleep 10 + + bash /workload/launcher/launch-workload.sh \ + {{- range $root.Values.workload.defaultArguments }} + {{ . }} \ + {{- end }} + {{- range $root.Values.workload.arguments }} + {{ . 
}} \ + {{- end }} + + + volumeMounts: + {{ if $root.Values.network.gibVersion }} + - name: gib + mountPath: /usr/local/gib + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + mountPath: {{ $root.Values.workload.configPath | default "/workload/configs" }} + {{- end }} + + - name: workload-launcher + mountPath: /workload/launcher + + - name: shared-memory + mountPath: /dev/shm + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + mountPath: "{{ $pvc.mountPath }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + mountPath: "{{ $gcs.mountPath }}" + {{- end }} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + mountPath: "{{ $root.Values.volumes.ssdMountPath }}" + {{- end }} + + resources: + limits: + nvidia.com/gpu: {{ $gpusPerNode }} diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/templates/workload-launcher-configmap.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/templates/workload-launcher-configmap.yaml new file mode 100644 index 00000000..7026e0f1 --- /dev/null +++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/templates/workload-launcher-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-launcher" +data: + launch-workload.sh: |- +{{- if .Values.workload_launcher }} +{{ .Values.workload_launcher | nindent 4 }} +{{- else }} + #!/bin/bash + echo "No workload launcher specified" + exit 1 +{{- end }} diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/templates/workload-svc.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/templates/workload-svc.yaml new file mode 100644 index 00000000..7cfe220b --- /dev/null +++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/templates/workload-svc.yaml @@ -0,0 +1,22 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: Service +metadata: + name: "{{ .Release.Name }}" +spec: + clusterIP: None + selector: + jobset.sigs.k8s.io/jobset-name: "{{ .Release.Name }}" diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/values.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/values.yaml new file mode 100644 index 00000000..80e6860b --- /dev/null +++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/16node-fp8cs-seq8192-gbs128/values.yaml @@ -0,0 +1,33 @@ +queue: null +dwsSettings: + maxRunDurationSeconds: null +tasSettings: + topologyRequest: + kueue.x-k8s.io/podset-preferred-topology: kubernetes.io/hostname +volumes: + gcsVolumes: true + psVolumes: false + gcsMounts: + - bucketName: null + mountPath: null +workload: + gpus: 128 + image: nvcr.io/nvidia/nemo:26.02 + defaultArguments[]: null + arguments[]: null + configFile: custom_setup_experiment.py + configPath: /workload/configs/ + envs: + - name: ARTIFACT_DIR + value: null + - name: GLOO_SOCKET_IFNAME + value: eth0 + - name: CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH + value: /workload/configs/custom_setup_experiment.py +network: + hostNetwork: true + subnetworks[]: null + gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1 + ncclSettings: + - name: NCCL_DEBUG + value: WARN diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/Chart.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/Chart.yaml new file mode 100644 index 00000000..af46c11a --- /dev/null +++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/Chart.yaml @@ -0,0 +1,20 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+apiVersion: v2
+name: a4_jobset_workload
+description: a4_jobset_workload
+type: application
+version: 0.1.0
+appVersion: "1.16.0"
diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/README.md b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/README.md
new file mode 100644
index 00000000..46a97a24
--- /dev/null
+++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/README.md
@@ -0,0 +1,153 @@
+
+# Pretrain llama31-405b workloads on A4 GKE node pools with the NVIDIA Megatron-Bridge framework
+
+This recipe outlines the steps for running a llama31-405b pretraining
+workload on [A4 GKE node pools](https://cloud.google.com/kubernetes-engine) by using the
+[Megatron-Bridge pretraining workload](https://github.com/NVIDIA-NeMo/Megatron-Bridge).
+
+## Orchestration and deployment tools
+
+For this recipe, the following setup is used:
+
+- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
+- Pretraining job configuration and deployment - A Helm chart is used to
+  configure and deploy the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset) resource, which manages the execution of the
+  [Megatron-Bridge pretraining workload](https://github.com/NVIDIA-NeMo/Megatron-Bridge).
+
+## Test environment
+
+This recipe has been optimized for and tested with the following configuration:
+
+- GKE cluster - follow the Cluster Toolkit
+  [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4)
+  to create your A4 GKE cluster.
+
+## Training dataset
+
+This recipe uses a mock pretraining dataset provided by the Megatron-Bridge framework.
+
+## Docker container images
+
+This recipe uses the following Docker images:
+
+- `nvcr.io/nvidia/nemo:26.02`
+- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1`
+
+## Run the recipe
+
+From your client workstation, complete the following steps:
+
+### Configure environment settings
+
+Set the environment variables to match your environment:
+
+  ```bash
+  export PROJECT_ID=<PROJECT_ID>
+  export CLUSTER_REGION=<CLUSTER_REGION>
+  export CLUSTER_NAME=<CLUSTER_NAME>
+  export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs://
+  export KUEUE_NAME=<KUEUE_NAME>
+  export HF_TOKEN=<HF_TOKEN>
+  ```
+
+Replace the following values:
+
+  - `<PROJECT_ID>`: your Google Cloud project ID.
+  - `<CLUSTER_REGION>`: the region where your cluster is located.
+  - `<CLUSTER_NAME>`: the name of your GKE cluster.
+  - `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
+  - `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4`. Make sure to verify the name of the local queue in your cluster.
+  - `<HF_TOKEN>`: your Hugging Face token.
+
+Set the default project:
+
+  ```bash
+  gcloud config set project $PROJECT_ID
+  ```
+
+### Get the recipe
+
+Clone the `gpu-recipes` repository and set a reference to the recipe folder.
+
+```bash
+git clone https://github.com/ai-hypercomputer/gpu-recipes.git
+cd gpu-recipes
+export REPO_ROOT=`git rev-parse --show-toplevel`
+export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe
+cd $RECIPE_ROOT
+```
+
+### Get cluster credentials
+
+```bash
+gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
+```
+
+### Configure and submit a pretraining job
+
+#### Using 32 nodes (256 GPUs) with FP8CS precision
+
+To execute the job with the default settings, run the following command from
+your client:
+
+```bash
+cd $RECIPE_ROOT
+export WORKLOAD_NAME=$USER-a4-llama31-405b-32node
+helm install $WORKLOAD_NAME . -f values.yaml \
+--set-file workload_launcher=launcher.sh \
+--set workload.image=nvcr.io/nvidia/nemo:26.02 \
+--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+--set volumes.gcsMounts[0].mountPath=/job-logs \
+--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+--set queue=${KUEUE_NAME}
+```
+
+**Examples**
+
+- To set the number of training steps to 100, run the following command from
+  your client:
+
+  ```bash
+  cd $RECIPE_ROOT
+  export WORKLOAD_NAME=$USER-a4-llama31-405b-32node
+  helm install $WORKLOAD_NAME . -f values.yaml \
+  --set-file workload_launcher=launcher.sh \
+  --set workload.image=nvcr.io/nvidia/nemo:26.02 \
+  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+  --set volumes.gcsMounts[0].mountPath=/job-logs \
+  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+  --set queue=${KUEUE_NAME} \
+  --set workload.arguments[0]="trainer.max_steps=100"
+  ```
+
+### Monitor the job
+
+To check the status of pods in your job, run the following command:
+
+```bash
+kubectl get pods | grep JOB_NAME_PREFIX
+```
+
+Replace the following:
+
+- JOB_NAME_PREFIX - your job name prefix. For example, $USER-a4-llama31-405b-32node.
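+As a convenience, pod names follow the `<release>-workload-0-<node index>-<random suffix>`
+pattern produced by the JobSet in this chart, so you can build the name prefix for any
+node rank with a small helper (illustrative only; the `pod_prefix` function is not part
+of the recipe):
+
+```shell
+# Illustrative helper: build the name prefix of the pod that runs a given
+# node rank, following the <release>-workload-0-<rank> naming used by the
+# JobSet in this chart. Not part of the recipe itself.
+pod_prefix() {
+  local release="$1" rank="$2"
+  echo "${release}-workload-0-${rank}"
+}
+
+# The rank 0 pod for the 32-node job:
+pod_prefix "$USER-a4-llama31-405b-32node" 0
+```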
+
+To get the logs for one of the pods, run the following command:
+
+```bash
+kubectl logs POD_NAME
+```
+
+Information about the training job's progress, including crucial details such as
+loss, step count, and step time, is generated by the rank 0 process.
+This process runs on the pod whose name begins with
+`JOB_NAME_PREFIX-workload-0-0`.
+For example: `$USER-a4-llama31-405b-32node-workload-0-0-s9zrv`.
+
+### Uninstall the Helm release
+
+You can delete the job and other resources created by the Helm chart. To
+uninstall the Helm release, run the following command from your client:
+
+```bash
+helm uninstall $USER-a4-llama31-405b-32node
+```
diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/custom_setup_experiment.py b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/custom_setup_experiment.py
new file mode 100644
index 00000000..2337fdec
--- /dev/null
+++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/custom_setup_experiment.py
@@ -0,0 +1,327 @@
+import glob
+import logging
+import os
+import sys
+import time
+from pathlib import Path
+from typing import Any, Dict, List, Optional
+
+import nemo_run as run
+from nemo_run.config import get_nemorun_home
+
+
+try:
+    from argument_parser import parse_cli_args
+    from utils.evaluate import calc_convergence_and_performance
+    from utils.executors import dgxc_executor, slurm_executor
+    from utils.utils import get_exp_name_config, select_config_variant_interactive
+except (ImportError, ModuleNotFoundError):
+    from .argument_parser import parse_cli_args
+    from .utils.evaluate import calc_convergence_and_performance
+    from .utils.executors import dgxc_executor, slurm_executor
+    from .utils.utils import get_exp_name_config, select_config_variant_interactive
+
+try:
+    import wandb
+
+    HAVE_WANDB = True
+except (ImportError, ModuleNotFoundError):
+    HAVE_WANDB = False
+
+try:
+    from perf_plugins import 
NsysPlugin, PerfEnvPlugin, PyTorchProfilerPlugin + from resiliency_plugins import FaultTolerancePlugin +except (ImportError, ModuleNotFoundError): + from .perf_plugins import NsysPlugin, PerfEnvPlugin, PyTorchProfilerPlugin + from .resiliency_plugins import FaultTolerancePlugin + + +SCRIPT_DIR = Path(__file__).parent.resolve() +ENTRYPOINT_PEFORMANCE = "run_script.py" +ENTRYPOINT_RECIPE = "run_recipe.py" + +logging.basicConfig(level=logging.DEBUG) +logger = logging.getLogger(__name__) + + + + +def main( + use_recipes: bool, + model_family_name: str, + model_recipe_name: str, + task: str, + compute_dtype: str, + gpu: str, + hf_token: str, + detach: bool, + dryrun: bool, + enable_vboost: bool, + enable_nsys: bool, + pytorch_profiler: bool, + moe_a2a_overlap: bool, + tp_size: Optional[int], + pp_size: Optional[int], + cp_size: Optional[int], + ep_size: Optional[int], + wandb_key: str, + wandb_project_name: str, + wandb_experiment_name: str, + wandb_entity_name: str, + profiling_start_step: int, + profiling_stop_step: int, + record_memory_history: bool, + profiling_gpu_metrics: bool, + profiling_ranks: Optional[List[int]], + nsys_trace: Optional[List[str]], + nsys_extra_args: Optional[List[str]], + nemo_home: str, + account: str, + partition: str, + log_dir: str, + gpus_per_node: int, + time_limit: str, + container_image: str, + custom_mounts: List[str], + custom_env_vars: Dict[str, str], + custom_srun_args: List[str], + custom_bash_cmds: List[List[str]], + nccl_ub: bool, + pretrained_checkpoint: Optional[str], + num_gpus: int, + is_long_convergence_run: bool, + additional_slurm_params: Optional[Dict[str, Any]], + golden_values_path: str, + convergence_params: Dict[str, Any], + performance_params: Dict[str, Any], + memory_params: Dict[str, Any], + max_retries: int, + dgxc_base_url: str, + dgxc_cluster: str, + dgxc_kube_apiserver_url: str, + dgxc_app_id: str, + dgxc_app_secret: str, + dgxc_project_name: str, + dgxc_pvc_claim_name: str, + dgxc_pvc_mount_path: str, + 
config_variant: str = "v1", +): + """Sets up the experiment and runs it.""" + if ( + model_family_name in ["qwen3"] + and model_recipe_name + in [ + "qwen3_30b_a3b", + "qwen3_235b_a22b", + ] + and task == "pretrain" + ): + assert hf_token is not None, "HF token is required for Qwen3 tokenizer. NullTokenizer to be used soon." + + if wandb_key is not None: + assert wandb_project_name is not None and wandb_experiment_name is not None, ( + "both wandb_project_name and wandb_experiment_name are required for logging with WandB" + ) + + if use_recipes: + script_name = ENTRYPOINT_RECIPE + exp_name = ( + wandb_experiment_name + if wandb_experiment_name is not None + else f"{model_recipe_name}_{task}_{num_gpus}gpu_{gpu}" + ) + + else: + script_name = ENTRYPOINT_PEFORMANCE + exp_config = get_exp_name_config( + args, model_family_name, model_recipe_name, gpu, compute_dtype, task, config_variant + ) + exp_name = ( + wandb_experiment_name + if wandb_experiment_name is not None + else f"{task}_{model_recipe_name}_{compute_dtype}_{exp_config}" + ) + + if pretrained_checkpoint is not None: + custom_mounts.append(f"{pretrained_checkpoint}:{pretrained_checkpoint}") + + import os + rank = os.environ.get('RANK', '0') + exp_name += f'_worker{rank}' + + run_script_path = SCRIPT_DIR / script_name + logger.info(f"Run script path: {run_script_path}") + if not run_script_path.is_file(): + logger.error(f"Specified run script not found: {run_script_path}") + sys.exit(1) + + custom_mounts.extend( + [ + f"{run_script_path}:{run_script_path}", + f"{SCRIPT_DIR}:{SCRIPT_DIR}", + ] + ) + + if nccl_ub: + custom_env_vars.update({"NCCL_NVLS_ENABLE": "1", "NCCL_CTA_POLICY": "1"}) + + executor = run.LocalExecutor() + + plugins = [] + + if not use_recipes: + plugins.append( + PerfEnvPlugin( + enable_vboost=enable_vboost, + moe_a2a_overlap=moe_a2a_overlap, + tp_size=tp_size, + pp_size=pp_size, + cp_size=cp_size, + ep_size=ep_size, + model_family_name=model_family_name, + 
                model_recipe_name=model_recipe_name,
+                gpu=gpu,
+                compute_dtype=compute_dtype,
+                train_task=task,
+                config_variant=config_variant,
+            )
+        )
+
+    if enable_nsys:
+        plugins.append(
+            NsysPlugin(
+                profile_step_start=profiling_start_step,
+                profile_step_end=profiling_stop_step,
+                nsys_gpu_metrics=profiling_gpu_metrics,
+                profile_ranks=profiling_ranks,
+                nsys_trace=nsys_trace,
+                nsys_extra_args=nsys_extra_args,
+            )
+        )
+    if pytorch_profiler:
+        plugins.append(
+            PyTorchProfilerPlugin(
+                profile_step_start=profiling_start_step,
+                profile_step_end=profiling_stop_step,
+                profile_ranks=profiling_ranks,
+                record_memory_history=record_memory_history,
+            )
+        )
+
+    nemorun_script = run.Script(
+        path=str(run_script_path),
+        entrypoint="python",
+        env={"PYTHONPATH": f"{SCRIPT_DIR}:$PYTHONPATH"},
+        args=list(sys.argv[1:]),
+    )
+
+    logger.info("Will launch the following command with Nemo-Run: %s", " ".join(nemorun_script.to_command()))
+
+    run.run(
+        nemorun_script,
+        executor=executor,
+        plugins=plugins,
+        dryrun=dryrun,
+        detach=detach,
+        name=exp_name,
+    )
+
+
+if __name__ == "__main__":
+    parser = parse_cli_args()
+    args, unknown_args = parser.parse_known_args()
+
+    assert not (args.enable_nsys and args.pytorch_profiler), (
+        "Both NSys and PyTorch profiler cannot be enabled at the same time"
+    )
+
+    # Probably better to use parser.parse_args() and make unknowns an error,
+    # but for now we just issue a warning.
+ if unknown_args: + logger.warning(f"Ignoring unrecognized arguments: {' '.join(unknown_args)}") + + # Handle --list_config_variants: show available variants and interactively select + config_variant = args.config_variant + if args.list_config_variants: + config_variant = select_config_variant_interactive( + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + gpu=args.gpu, + compute_dtype=args.compute_dtype, + task=args.task, + ) + + main( + use_recipes=args.use_recipes, + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + task=args.task, + compute_dtype=args.compute_dtype, + gpu=args.gpu, + hf_token=args.hf_token, + detach=args.detach, + dryrun=args.dryrun, + enable_vboost=args.enable_vboost, + enable_nsys=args.enable_nsys, + pytorch_profiler=args.pytorch_profiler, + moe_a2a_overlap=args.moe_a2a_overlap, + tp_size=args.tensor_model_parallel_size, + pp_size=args.pipeline_model_parallel_size, + cp_size=args.context_parallel_size, + ep_size=args.expert_model_parallel_size, + wandb_key=args.wandb_key, + wandb_project_name=args.wandb_project_name, + wandb_experiment_name=args.wandb_experiment_name, + wandb_entity_name=args.wandb_entity_name, + profiling_start_step=args.profiling_start_step, + profiling_stop_step=args.profiling_stop_step, + record_memory_history=args.record_memory_history, + profiling_gpu_metrics=args.profiling_gpu_metrics, + profiling_ranks=args.profiling_ranks, + nsys_trace=args.nsys_trace, + nsys_extra_args=args.nsys_extra_args, + nemo_home=args.nemo_home, + account=args.account, + partition=args.partition, + log_dir=args.log_dir, + gpus_per_node=args.gpus_per_node, + time_limit=args.time_limit, + container_image=args.container_image, + custom_mounts=args.custom_mounts, + custom_env_vars=args.custom_env_vars, + custom_srun_args=args.custom_srun_args, + custom_bash_cmds=args.custom_bash_cmds, + nccl_ub=args.nccl_ub, + pretrained_checkpoint=args.pretrained_checkpoint, + 
num_gpus=args.num_gpus, + is_long_convergence_run=args.is_long_convergence_run, + additional_slurm_params=args.additional_slurm_params, + golden_values_path=args.golden_values_path, + convergence_params={ + "correlation_threshold": args.correlation_threshold, + "high_loss_tolerance": args.high_loss_tolerance, + "medium_loss_tolerance": args.medium_loss_tolerance, + "low_loss_tolerance": args.low_loss_tolerance, + "final_loss_tolerance": args.final_loss_tolerance, + "max_outlier_ratio": args.max_outlier_ratio, + "outlier_threshold": args.outlier_threshold, + "skip_first_percent_loss": args.skip_first_percent_loss, + }, + performance_params={ + "timing_threshold": args.timing_threshold, + "skip_first_percent_time": args.skip_first_percent_time, + }, + memory_params={ + "memory_threshold": args.memory_threshold, + }, + max_retries=args.max_retries, + dgxc_base_url=args.dgxc_base_url, + dgxc_cluster=args.dgxc_cluster, + dgxc_kube_apiserver_url=args.dgxc_kube_apiserver_url, + dgxc_app_id=args.dgxc_app_id, + dgxc_app_secret=args.dgxc_app_secret, + dgxc_project_name=args.dgxc_project_name, + dgxc_pvc_claim_name=args.dgxc_pvc_claim_name, + dgxc_pvc_mount_path=args.dgxc_pvc_mount_path, + config_variant=config_variant, + ) \ No newline at end of file diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/launcher.sh b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/launcher.sh new file mode 100644 index 00000000..608640e9 --- /dev/null +++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/launcher.sh @@ -0,0 +1,154 @@ +usage() +{ +cat << EOF +usage: bash ./launcher.sh [config-override [config-override ...]] +config-override (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000. 
+EOF
+}
+
+parse_args() {
+  while [[ "$1" != "" ]]; do
+    case $(grep -o "=" <<< "$1" | wc -l) in
+      1 )
+        config_overrides+=("$1")
+        ;;
+      * )
+        echo "Invalid config override: $1"
+        usage
+        exit 1
+    esac
+    shift
+  done
+  config_overrides="${config_overrides[*]}"
+}
+
+config_overrides=()
+parse_args "$@"
+
+if [[ -z "${config_overrides[*]}" ]]; then
+  echo "No NeMo config overrides specified"
+else
+  echo "NeMo config overrides:"
+  echo "  ${config_overrides}"
+fi
+
+export LD_LIBRARY_PATH="/usr/local/cuda/compat/lib:$NCCL_PLUGIN_PATH:$LD_LIBRARY_PATH"
+ldconfig "$LD_LIBRARY_PATH"
+echo "Added $LD_LIBRARY_PATH to ldconfig:"
+ldconfig -p | grep libcuda | sed 's/^/  /'
+echo ""
+
+if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then
+  explicit_log_dir=${EXPLICIT_LOG_DIR}
+else
+  explicit_log_dir=workload_logs
+fi
+echo "Logging to ${explicit_log_dir}"
+
+if [[ -n "${TOKENIZER_PATH}" ]]; then
+  echo "Getting tokenizer files"
+  cp "${TOKENIZER_PATH}"/* .
+  echo ""
+fi
+
+echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes"
+
+pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger
+
+# Create the nsys directory.
+mkdir -p "${explicit_log_dir}/nsys"
+
+# Collect diagnostics to a single line
+kv="\"kernel_version\": \"$(uname --kernel-release)\""
+if command -v nvidia-smi &> /dev/null; then
+  cuda_v=$(nvidia-smi -q -x | grep -Po '(?<=<cuda_version>).*(?=</cuda_version>)' || true)
+  driver_v=$(nvidia-smi -q -x | grep -Po '(?<=<driver_version>).*(?=</driver_version>)' || true)
+  vbios_v=$(nvidia-smi -q -x | grep -Po '(?<=<vbios_version>).*(?=</vbios_version>)' | head -n1 || true)
+  kv="${kv}, \"cuda_version\": \"${cuda_v}\""
+  kv="${kv}, \"driver_version\": \"${driver_v}\""
+  kv="${kv}, \"vbios_version\": \"${vbios_v}\""
+fi
+echo "VERSION_DIAGNOSTICS: {${kv}}"
+
+
+if [[ -z "${HF_TOKEN}" ]]; then
+  echo "HF_TOKEN is not set. 
Please set it using export HF_TOKEN=<YOUR_HF_TOKEN>"
+  exit 1
+fi
+
+cd /opt
+rm -rf Megatron-Bridge
+git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
+cd Megatron-Bridge
+git checkout f7a9428f301fa17ac374d5e7166a63b0aa4771af
+git submodule update --init --recursive
+sed -i -e '/pretrain(config=recipe/i \    recipe.dist.distributed_timeout_minutes = 60' scripts/performance/run_script.py
+ls
+
+cp "$CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH" scripts/performance/
+
+worker_command=$(cat <<- EOM
+  if [ "\$RANK" -eq "0" ]; then
+    echo "Worker 0 is stalling for a few seconds.." ;
+    sleep 3 ;
+    echo "The detected environment within worker rank 0 is:" ;
+    env | sed 's/^/  /' ;
+  fi ;
+
+  cd /opt/Megatron-Bridge ;
+
+  numactl \
+  --cpunodebind=\$((LOCAL_RANK/4)) \
+  --membind=\$((LOCAL_RANK/4)) nsys profile \
+  -t nvtx,cuda \
+  --cuda-event-trace=false \
+  --sample=none \
+  --capture-range=cudaProfilerApi \
+  --capture-range-end=stop \
+  --kill none \
+  -o "/${explicit_log_dir}/$JOB_IDENTIFIER/rank-\$RANK" \
+  --force-overwrite true \
+  --session-new "nsys-\$RANDOM-\$RANK" \
+  nice -10 \
+  python scripts/performance/custom_setup_experiment.py \
+  --gpu b200 \
+  --model_family_name llama \
+  --model_recipe_name llama31_405b \
+  --gpus_per_node 8 \
+  --num_gpus 256 \
+  --seq_length 8192 \
+  --tensor_model_parallel_size 8 \
+  --pipeline_model_parallel_size 32 \
+  --context_parallel_size 1 \
+  --virtual_pipeline_model_parallel_size 2 \
+  --expert_tensor_parallel_size 1 \
+  --global_batch_size 256 \
+  --micro_batch_size 1 \
+  --compute_dtype fp8_cs \
+  --cuda_graph_impl none \
+  --max_steps 30
+
+EOM
+)
+
+echo "$worker_command" > worker_command.sh
+chmod 777 worker_command.sh
+
+torchrun \
+--nproc-per-node="8" \
+--nnodes="32" \
+--node_rank="${JOB_COMPLETION_INDEX}" \
+--rdzv_id="${JOB_IDENTIFIER}" \
+--master_addr="${MASTER_ADDR}" \
+--master_port="${MASTER_PORT}" \
+--no-python bash worker_command.sh
+
+
+if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
+  mkdir -p "${ARTIFACT_DIR}"
+  cp -r 
"${explicit_log_dir}"/* "${ARTIFACT_DIR}/" + env > "${ARTIFACT_DIR}/environ.txt" + ls "${ARTIFACT_DIR}" +fi +echo "Training completed" +echo "Pod on $(hostname --fqdn) is exiting" \ No newline at end of file diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-config-configmap.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-config-configmap.yaml new file mode 100644 index 00000000..a1d54cee --- /dev/null +++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-config-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +{{- if .Values.workload.configFile }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-config" +data: + workload-configuration: |- +{{- if .Values.workload_config }} +{{ .Values.workload_config | nindent 4 }} +{{- else }} +{{ "config: null" | nindent 4 }} +{{- end }} +{{- end }} diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-job.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-job.yaml new file mode 100644 index 00000000..b4ffa210 --- /dev/null +++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-job.yaml @@ -0,0 +1,333 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +{{$timestamp := now | date "2006-01-02-15-04-05"}} +{{$jobSuffix := randAlphaNum 4 | lower}} +{{$jobuuid := uuidv4}} +{{$nodes := div .Values.workload.gpus 8 | max 1}} +{{$gpusPerNode := min .Values.workload.gpus 8}} +{{- $root := . 
-}} + +apiVersion: jobset.x-k8s.io/v1alpha2 +kind: JobSet +metadata: + name: "{{ .Release.Name }}" + namespace: default + labels: + {{- if $root.Values.queue }} + kueue.x-k8s.io/queue-name: "{{ $root.Values.queue }}" + {{- end }} +spec: + {{- if $root.Values.queue }} + suspend: true + {{- end }} + failurePolicy: + maxRestarts: {{ default 0 $root.Values.workload.max_workload_restarts }} + replicatedJobs: + - name: workload + replicas: 1 + template: + spec: + parallelism: {{ $nodes }} + completions: {{ $nodes }} + backoffLimit: 0 + completionMode: Indexed + activeDeadlineSeconds: 14400 # 4 hours (4 * 60 * 60) + ttlSecondsAfterFinished: 43200 # 12 hours (12 * 60 * 60) + template: + metadata: + annotations: + kubectl.kubernetes.io/default-container: workload + {{- if $root.Values.volumes.gcsVolumes }} + gke-gcsfuse/volumes: "true" + gke-gcsfuse/cpu-limit: "500m" + gke-gcsfuse/memory-limit: "1Ti" + gke-gcsfuse/ephemeral-storage-limit: "2Ti" + {{- end }} + {{- if $root.Values.volumes.psVolumes }} + gke-parallelstore/volumes: "true" + gke-parallelstore/cpu-limit: "0" + gke-parallelstore/memory-limit: "0" + {{- end }} + {{- if and $root.Values.queue $root.Values.tasSettings.topologyRequest }} + {{- toYaml .Values.tasSettings.topologyRequest | nindent 14 }} + {{- end }} + {{- if and $root.Values.queue $root.Values.dwsSettings.maxRunDurationSeconds }} + provreq.kueue.x-k8s.io/maxRunDurationSeconds: "{{ $root.Values.dwsSettings.maxRunDurationSeconds }}" + {{- end }} + {{- if not $root.Values.network.hostNetwork }} + networking.gke.io/default-interface: "eth0" + networking.gke.io/interfaces: | + {{- if $root.Values.network.subnetworks }} + [ + {{- range $i, $subnetwork := $root.Values.network.subnetworks }} + {"interfaceName":"eth{{ $i }}","network":"{{ $subnetwork }}"}{{ eq $i 9 | ternary "" ","}} + {{- end }} + ] + {{- else }} + [ + {"interfaceName":"eth0","network":"default"}, + {"interfaceName":"eth1","network":"gvnic-1"}, + {{- range $i := until 8 }} + 
{"interfaceName":"eth{{ add 2 $i }}","network":"rdma-{{ $i }}"}{{ eq $i 7 | ternary "" ","}} + {{- end }} + ] + {{- end }} + {{- end }} + spec: + {{- if $root.Values.network.hostNetwork }} + hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet + {{- end }} + subdomain: "{{.Release.Name}}" + restartPolicy: Never + {{- if $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "In" + values: + {{- range $hostname := $root.Values.targetNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + {{- if $root.Values.avoidNodes }} + {{- if not $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + {{- end }} + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "NotIn" + values: + {{- range $hostname := $root.Values.avoidNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + tolerations: + - operator: "Exists" + key: nvidia.com/gpu + - operator: "Exists" + key: cloud.google.com/impending-node-termination + + volumes: + {{ if $root.Values.network.gibVersion }} + - name: gib + emptyDir: {} + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + configMap: + name: "{{.Release.Name}}-config" + items: + - key: workload-configuration + path: {{ $root.Values.workload.configFile | default "workload-configuration" }} + {{- end }} + + - name: workload-launcher + configMap: + name: "{{.Release.Name}}-launcher" + + - name: shared-memory + emptyDir: + medium: "Memory" + sizeLimit: 250Gi + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + persistentVolumeClaim: + claimName: "{{ $pvc.claimName }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + csi: + driver: gcsfuse.csi.storage.gke.io + volumeAttributes: + bucketName: "{{ 
$gcs.bucketName }}" + {{- if $gcs.mountOptions }} + mountOptions: "{{ $gcs.mountOptions }}" + {{- end }} + {{- end}} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + hostPath: + path: /mnt/stateful_partition/kube-ephemeral-ssd + {{- end }} + + initContainers: + {{ if $root.Values.network.gibVersion }} + - name: nccl-plugin-installer + image: {{ $root.Values.network.gibVersion }} + imagePullPolicy: Always + args: + - | + set -ex + /scripts/container_entry.sh install --install-nccl + cp -R /var/lib/gib/lib64/. /target/usr/local/gib/lib64 + cp -R /var/lib/gib/. /target/usr/local/gib + command: + - /bin/sh + - -c + volumeMounts: + - mountPath: /target/usr/local/gib + name: gib + {{ end}} + + containers: + {{- if $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-sidecar + image: {{ $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-metadata-prefetch + image: {{ $root.Values.workload.gcsSidecarImage }} + {{- end }} + {{- if $root.Values.workload.psSidecarImage }} + - name: gke-parallelstore-sidecar + image: {{ $root.Values.workload.psSidecarImage }} + {{- end }} + + - name: workload + image: "{{ $root.Values.workload.image }}" + imagePullPolicy: Always + {{- if $root.Values.network.hostNetwork }} + securityContext: + privileged: true + {{- end }} + env: + - name: JOB_IDENTIFIER + value: "{{ .Release.Name }}-{{ $timestamp }}" + - name: JOB_TIMESTAMP + value: "{{ $timestamp }}" + - name: JOB_UUID + value: "{{ $jobuuid }}" + - name: JOB_ORCHESTRATOR + value: "gke" + # Add RANK based on the pod's index provided by the Indexed Job + # This is crucial for torch.distributed initialization. 
+ - name: JOB_COMPLETION_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index'] + - name: RANK_0_FQDN + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: HOSTNAME_PREFIX + value: "{{.Release.Name}}-workload-" + - name: DOMAIN_NAME + value: "{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_ADDR + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_PORT + value: "6002" + - name: WORLD_SIZE + value: "{{ $root.Values.workload.gpus }}" + - name: NNODES + value: "{{ $nodes }}" + - name: GPUS_PER_NODE + value: "{{ $gpusPerNode }}" + + - name: NCCL_PLUGIN_PATH + value: /usr/local/gib/lib64:/usr/local/nvidia/lib64 + + {{ if $root.Values.network.gibVersion }} + - name: NCCL_INIT_SCRIPT + value: "/usr/local/gib/scripts/set_nccl_env.sh" + {{ end }} + + {{ if $root.Values.network.ncclSettings }} + {{- toYaml .Values.network.ncclSettings | nindent 14 }} + {{ end }} + + {{ if $root.Values.workload.envs }} + {{- toYaml .Values.workload.envs | nindent 14 }} + {{ end }} + + command: + - bash + - -c + - | + echo "Pod on $(hostname --fqdn) is running" + echo "Pod is assigned job index of $JOB_COMPLETION_INDEX" + + if [[ -n "${NCCL_INIT_SCRIPT}" ]]; then + echo "Running NCCL init script: ${NCCL_INIT_SCRIPT}" + source ${NCCL_INIT_SCRIPT} + fi + + # Overriding NCCL_SOCKET_IFNAME definition + export NCCL_SOCKET_IFNAME="eth0,eth1" + export NCCL_TUNER_CONFIG_PATH=/usr/local/gib/configs/tuner_config_a4.txtpb + + echo "Launching workload with the following arguments:" + {{- range $root.Values.workload.defaultArguments }} + echo " {{ . }}" + {{- end }} + {{- range $root.Values.workload.arguments }} + echo " {{ . }}" + {{- end }} + echo "" + + sleep 10 + + bash /workload/launcher/launch-workload.sh \ + {{- range $root.Values.workload.defaultArguments }} + {{ . }} \ + {{- end }} + {{- range $root.Values.workload.arguments }} + {{ . 
}} \ + {{- end }} + + + volumeMounts: + {{ if $root.Values.network.gibVersion }} + - name: gib + mountPath: /usr/local/gib + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + mountPath: {{ $root.Values.workload.configPath | default "/workload/configs" }} + {{- end }} + + - name: workload-launcher + mountPath: /workload/launcher + + - name: shared-memory + mountPath: /dev/shm + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + mountPath: "{{ $pvc.mountPath }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + mountPath: "{{ $gcs.mountPath }}" + {{- end }} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + mountPath: "{{ $root.Values.volumes.ssdMountPath }}" + {{- end }} + + resources: + limits: + nvidia.com/gpu: {{ $gpusPerNode }} diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-launcher-configmap.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-launcher-configmap.yaml new file mode 100644 index 00000000..7026e0f1 --- /dev/null +++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-launcher-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: "{{ .Release.Name }}-launcher"
+data:
+  launch-workload.sh: |-
+{{- if .Values.workload_launcher }}
+{{ .Values.workload_launcher | nindent 4 }}
+{{- else }}
+    #!/bin/bash
+    echo "No workload launcher specified"
+    exit 1
+{{- end }}
diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-svc.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-svc.yaml
new file mode 100644
index 00000000..7cfe220b
--- /dev/null
+++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/templates/workload-svc.yaml
@@ -0,0 +1,22 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+apiVersion: v1
+kind: Service
+metadata:
+  name: "{{ .Release.Name }}"
+spec:
+  clusterIP: None
+  selector:
+    jobset.sigs.k8s.io/jobset-name: "{{ .Release.Name }}"
diff --git a/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/values.yaml b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/values.yaml
new file mode 100644
index 00000000..cb73da9b
--- /dev/null
+++ b/training/a4/llama3-1-405b/nemo-pretraining-gke/nemo2602/32node-fp8cs-seq8192-gbs256/recipe/values.yaml
@@ -0,0 +1,33 @@
+queue: null
+dwsSettings:
+  maxRunDurationSeconds: null
+tasSettings:
+  topologyRequest:
+    kueue.x-k8s.io/podset-preferred-topology: kubernetes.io/hostname
+volumes:
+  gcsVolumes: true
+  psVolumes: false
+  gcsMounts:
+    - bucketName: null
+      mountPath: null
+workload:
+  gpus: 256
+  image: nvcr.io/nvidia/nemo:26.02
+  defaultArguments[]: null
+  arguments[]: null
+  configFile: custom_setup_experiment.py
+  configPath: /workload/configs/
+  envs:
+    - name: ARTIFACT_DIR
+      value: null
+    - name: GLOO_SOCKET_IFNAME
+      value: eth0
+    - name: CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH
+      value: /workload/configs/custom_setup_experiment.py
+network:
+  hostNetwork: true
+  subnetworks[]: null
+  gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1
+  ncclSettings:
+    - name: NCCL_DEBUG
+      value: WARN
diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/Chart.yaml b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/Chart.yaml
new file mode 100644
index 00000000..af46c11a
--- /dev/null
+++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/Chart.yaml
@@ -0,0 +1,20 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
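The values file above requests 256 GPUs, and the accompanying workload template derives the node layout from that figure (`div .Values.workload.gpus 8 | max 1` nodes, `min .Values.workload.gpus 8` GPUs per node). A minimal sketch of that arithmetic, assuming the A4 machine shape of 8 GPUs per node (the function name is illustrative, not part of the recipe):

```python
def node_layout(total_gpus: int, gpus_per_node: int = 8) -> tuple[int, int]:
    """Mirror the Helm template's math: nodes = max(gpus // 8, 1)."""
    nodes = max(total_gpus // gpus_per_node, 1)
    return nodes, min(total_gpus, gpus_per_node)

# The 256-GPU recipes in this change schedule 32 nodes of 8 GPUs each.
print(node_layout(256))  # (32, 8)
# A sub-node request still yields one node, matching the template's `max 1`.
print(node_layout(4))  # (1, 4)
```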
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+apiVersion: v2
+name: a4_jobset_workload
+description: a4_jobset_workload
+type: application
+version: 0.1.0
+appVersion: "1.16.0"
diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/README.md b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/README.md
new file mode 100644
index 00000000..180f09a8
--- /dev/null
+++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/README.md
@@ -0,0 +1,155 @@
+
+# Pretrain llama3-1-70b workloads on A4 GKE node pools with the NVIDIA NeMo framework
+
+This recipe outlines the steps for running a llama3-1-70b pretraining
+workload on [A4 GKE node pools](https://cloud.google.com/kubernetes-engine) by using the
+[NVIDIA NeMo framework](https://github.com/NVIDIA/nemo).
+
+## Orchestration and deployment tools
+
+For this recipe, the following setup is used:
+
+- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
+- Pretraining job configuration and deployment - A Helm chart is used to
+  configure and deploy the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset) resource, which manages the execution of the
+  [NeMo pretraining workload](https://github.com/NVIDIA/nemo).
+
+## Test environment
+
+This recipe has been optimized for and tested with the following configuration:
+
+- GKE cluster: follow the Cluster Toolkit
+  [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4)
+  to create your A4 GKE cluster.
+
+## Training dataset
+
+This recipe uses a mock pretraining dataset provided by the NeMo framework.
+
+## Docker container image
+
+This recipe uses the following Docker images:
+
+- `nvcr.io/nvidia/nemo:25.11`
+- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1`
+
+## Run the recipe
+
+From your client workstation, complete the following steps:
+
+### Configure environment settings
+
+Set the environment variables to match your environment:
+
+  ```bash
+  export PROJECT_ID=<PROJECT_ID>
+  export CLUSTER_REGION=<CLUSTER_REGION>
+  export CLUSTER_NAME=<CLUSTER_NAME>
+  export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs://
+  export KUEUE_NAME=<KUEUE_NAME>
+  export HF_TOKEN=<HF_TOKEN>
+  ```
+
+Replace the following values:
+
+  - `<PROJECT_ID>`: your Google Cloud project ID.
+  - `<CLUSTER_REGION>`: the region where your cluster is located.
+  - `<CLUSTER_NAME>`: the name of your GKE cluster.
+  - `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
+  - `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4`. Make sure to verify the name of the local queue in your cluster.
+  - `<HF_TOKEN>`: your Hugging Face token.
+
+Set the default project:
+
+  ```bash
+  gcloud config set project $PROJECT_ID
+  ```
+
+### Get the recipe
+
+Clone the `gpu-recipes` repository and set a reference to the recipe folder.
+
+```
+git clone https://github.com/ai-hypercomputer/gpu-recipes.git
+cd gpu-recipes
+export REPO_ROOT=`git rev-parse --show-toplevel`
+export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256
+cd $RECIPE_ROOT
+```
+
+### Get cluster credentials
+
+```
+gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
+```
+
+### Configure and submit a pretraining job
+
+#### Using 32 nodes (256 GPUs) with bf16 precision
+To execute the job with the default settings, run the following command from
+your client:
+
+```bash
+cd $RECIPE_ROOT
+export WORKLOAD_NAME=$USER-a4-llama3-1-70b
+helm install $WORKLOAD_NAME . -f values.yaml \
+--set-file workload_launcher=launcher.sh \
+--set-file workload_config=llama3-1-70b-bf16-seq8192-gbs256-gpus256.py \
+--set workload.image=nvcr.io/nvidia/nemo:25.11 \
+--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+--set volumes.gcsMounts[0].mountPath=/job-logs \
+--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+--set queue=${KUEUE_NAME}
+```
+
+**Examples**
+
+- To set the number of training steps to 100, run the following command from
+  your client:
+
+  ```bash
+  cd $RECIPE_ROOT
+  export WORKLOAD_NAME=$USER-a4-llama3-1-70b
+  helm install $WORKLOAD_NAME . -f values.yaml \
+  --set-file workload_launcher=launcher.sh \
+  --set-file workload_config=llama3-1-70b-bf16-seq8192-gbs256-gpus256.py \
+  --set workload.image=nvcr.io/nvidia/nemo:25.11 \
+  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+  --set volumes.gcsMounts[0].mountPath=/job-logs \
+  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+  --set queue=${KUEUE_NAME} \
+  --set workload.arguments[0]="trainer.max_steps=100"
+  ```
+
+### Monitor the job
+
+To check the status of pods in your job, run the following command:
+
+```
+kubectl get pods | grep JOB_NAME_PREFIX
+```
+
+Replace the following:
+
+- JOB_NAME_PREFIX - your job name prefix. For example `$USER-a4-llama3-1-70b`.
+
+To get the logs for one of the pods, run the following command:
+
+```
+kubectl logs POD_NAME
+```
+
+Information about the training job's progress, including crucial details such as
+loss, step count, and step time, is generated by the rank 0 process.
+This process runs on the pod whose name begins with
+`JOB_NAME_PREFIX-workload-0-0`.
+For example: `$USER-a4-llama3-1-70b-workload-0-0-s9zrv`.
+
+### Uninstall the Helm release
+
+You can delete the job and other resources created by the Helm chart. To
+uninstall Helm, run the following command from your client:
+
+```bash
+helm uninstall $USER-a4-llama3-1-70b
+```
diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/launcher.sh b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/launcher.sh
new file mode 100644
index 00000000..4f8e4916
--- /dev/null
+++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/launcher.sh
@@ -0,0 +1,106 @@
+usage()
+{
+cat << EOF
+usage: bash ./launcher.sh [config-override [config-override ...]]
+config-override  (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000.
+EOF
+}
+
+parse_args() {
+  while [ "$1" != "" ]; do
+    case $(grep -o "=" <<< "$1" | wc -l) in
+      1 )
+        config_overrides+=("$1")
+        ;;
+      * )
+        echo "Invalid config override: $1"
+        usage
+        exit 1
+    esac
+    shift
+  done
+  config_overrides="${config_overrides[*]}"
+}
+
+config_overrides=()
+parse_args "$@"
+
+if [ -z "${config_overrides}" ]; then
+  echo "No NeMo config overrides specified"
+else
+  echo "NeMo config overrides:"
+  echo "  ${config_overrides}"
+fi
+
+if [[ -n "${NCCL_PLUGIN_PATH}" ]]; then
+  export LD_LIBRARY_PATH="$NCCL_PLUGIN_PATH"
+  ldconfig $LD_LIBRARY_PATH
+  echo "Added $LD_LIBRARY_PATH to ldconfig:"
+  ldconfig -p | grep libcuda | sed 's/^/  /'
+  echo ""
+fi
+
+if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then
+  explicit_log_dir=${EXPLICIT_LOG_DIR}
+else
+  explicit_log_dir=workload_logs
+fi
+echo "Logging to ${explicit_log_dir}"
+
+if [[ -n "${TOKENIZER_PATH}" ]]; then
+  echo "Getting tokenizer files"
+  cp ${TOKENIZER_PATH}/* .
+  echo ""
+fi
+
+echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes"
+
+
+pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger
+
+export HF_TOKEN=
+
+# Export the nemo2 config to yaml.
+python ${NEMO_LAUNCH_SCRIPT} --factory "recipe()" \
+trainer.num_nodes="$NNODES" \
+log.explicit_log_dir="${explicit_log_dir}" \
+trainer.max_steps=30 \
+trainer.num_nodes=32 \
+trainer.devices=8 \
+${config_overrides} \
+--to-yaml exported_nemo_config.yaml
+
+# Create the nsys directory.
+mkdir -p ${explicit_log_dir}/nsys
+
+OMP_NUM_THREADS=12 NSYS_CONFIG_DIRECTIVES="AgentLaunchTimeoutSec=240;AppLaunchTimeoutSec=240" TORCH_NCCL_ENABLE_MONITORING=0 \
+/usr/local/bin/nsys profile -s none -t nvtx,cuda --capture-range=cudaProfilerApi --capture-range-end=stop \
+-o ${explicit_log_dir}/nsys/noderank-${JOB_COMPLETION_INDEX} \
+--session-new "nemo-rank${JOB_COMPLETION_INDEX}"-$RANDOM \
+--wait all \
+torchrun \
+--nproc-per-node="8" \
+--nnodes="${NNODES}" \
+--node_rank="${JOB_COMPLETION_INDEX}" \
+--rdzv_id="${JOB_IDENTIFIER}" \
+--master_addr="${MASTER_ADDR}" \
+--master_port="${MASTER_PORT}" \
+${NEMO_LAUNCH_SCRIPT} --factory "recipe()" \
+trainer.num_nodes="$NNODES" \
+log.explicit_log_dir="${explicit_log_dir}" \
+trainer.max_steps=30 \
+trainer.num_nodes=32 \
+trainer.devices=8 \
+${config_overrides}
+
+if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
+  mkdir -p ${ARTIFACT_DIR}
+  cp -r ${explicit_log_dir}/* ${ARTIFACT_DIR}/
+  cp ${NEMO_LAUNCH_SCRIPT} ${ARTIFACT_DIR}/run-cli.py
+  cp dllogger.json ${ARTIFACT_DIR}/dllogger.json
+  cp exported_nemo_config.yaml ${ARTIFACT_DIR}/nemo-configuration.yaml
+  env > ${ARTIFACT_DIR}/environ.txt
+  ls ${ARTIFACT_DIR}
+fi
+echo "Training completed"
+echo "Pod on $(hostname --fqdn) is exiting"
diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/llama3-1-70b-bf16-seq8192-gbs256-gpus256.py b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/llama3-1-70b-bf16-seq8192-gbs256-gpus256.py
new file mode 100644
index 00000000..529e41d3
--- /dev/null
+++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/llama3-1-70b-bf16-seq8192-gbs256-gpus256.py
@@ -0,0 +1,142 @@
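The launcher above accepts NeMo config overrides only as single `key=value` tokens: its `parse_args` counts the `=` characters in each argument and rejects anything else with `usage()` and a non-zero exit. A Python sketch of that acceptance rule (the function name is illustrative, not part of the recipe):

```python
def is_valid_override(token: str) -> bool:
    # launcher.sh accepts a token only when it contains exactly one '=',
    # e.g. trainer.max_steps=100; anything else prints usage and exits 1.
    return token.count("=") == 1

print(is_valid_override("trainer.max_steps=100"))  # True
print(is_valid_override("trainer.max_steps"))      # False
print(is_valid_override("a=b=c"))                  # False
```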
+"""Nemo2 pretraining recipe for Llama 3.1 70B model."""
+
+from nemo.collections import llm
+from nemo.collections.llm.recipes import llama31_70b
+from nemo.lightning.pytorch.callbacks import NsysCallback
+from nemo.lightning.pytorch.callbacks.flops_callback import FLOPsMeasurementCallback
+from nemo.utils.loggers.dllogger import DLLogger
+import nemo_run as run
+from scripts.performance.helpers import (
+    set_primary_perf_configs,
+)
+from scripts.performance.utils import get_comm_overlap_callback_idx
+
+
+def recipe(
+    profile_enabled: bool = False,
+    profile_start_step: int = 0,
+    profile_end_step: int = 0,
+    profile_ranks: str = "0",
+) -> run.Partial:
+    """Returns a Nemo2 training recipe for Llama 3.1 70B model.
+
+    Args:
+      profile_enabled: Whether to enable Nsys profiling.
+      profile_start_step: The step to start profiling.
+      profile_end_step: The step to end profiling.
+      profile_ranks: The ranks to profile, comma separated.
+
+    Returns:
+      A Nemo2 training recipe.
+    """
+    # Start from the Nemo standard recipe.
+    pretrain = llama31_70b.pretrain_recipe(performance_mode=True)
+
+    num_nodes = 32
+    num_gpus_per_node = 8
+    mbs = 2
+    gbs = 256
+    max_steps = 30
+    tp_size = 2
+    pp_size = 4
+    cp_size = 2
+    vp_size = 5  # Virtual Pipeline Parallelism
+    ep_size = 1  # Expert Parallelism
+    enable_cuda_graphs = False
+    compute_dtype = "bf16"
+    fp8_recipe = None  # Not needed for bf16
+    nccl_communicator_config_path = None
+    use_mcore_fsdp = False
+    use_fsdp_double_buffer = False
+    use_user_buffer_registration = False
+    use_sharp = False
+    keep_fsdp_fp8_transpose_cache = False
+
+    pretrain = set_primary_perf_configs(
+        pretrain,
+        "pre_train",
+        num_nodes=num_nodes,
+        num_gpus_per_node=num_gpus_per_node,
+        mbs=mbs,
+        gbs=gbs,
+        max_steps=max_steps,
+        tp_size=tp_size,
+        pp_size=pp_size,
+        cp_size=cp_size,
+        vp_size=vp_size,
+        ep_size=ep_size,
+        enable_cuda_graphs=enable_cuda_graphs,
+        compute_dtype=compute_dtype,
+        fp8_recipe=fp8_recipe,
+        nccl_communicator_config_path=nccl_communicator_config_path,
+        use_mcore_fsdp=use_mcore_fsdp,
+        use_fsdp_double_buffer=use_fsdp_double_buffer,
+        use_user_buffer_registration=use_user_buffer_registration,
+        use_sharp=use_sharp,
+        keep_fsdp_fp8_transpose_cache=keep_fsdp_fp8_transpose_cache,
+    )
+
+    # Sequence Length (model and data)
+    pretrain.model.config.seq_length = 8192
+    pretrain.data.seq_length = 8192
+
+    # Set the number of steps to 50 for a quicker benchmark.
+    pretrain.trainer.max_steps = 50
+
+    # Disable validation batches.
+    pretrain.trainer.limit_val_batches = 0.0
+    pretrain.trainer.val_check_interval = 100
+
+    # Add the Nsys profiling callback if enabled.
+    if profile_enabled:
+        pretrain.trainer.callbacks.append(
+            run.Config(
+                NsysCallback,
+                start_step=profile_start_step,
+                end_step=profile_end_step,
+                ranks=[int(x) for x in profile_ranks.split(",")],
+                gen_shape=False,
+            )
+        )
+
+    # Add the FLOPs measurement callback.
+    pretrain.trainer.callbacks.append(
+        run.Config(
+            FLOPsMeasurementCallback,
+            model_name="llama31-70b",
+            model_config=pretrain.model.config,
+            data_config=pretrain.data,
+        )
+    )
+
+    # When `performance_mode` is enabled, the Megatron communication overlap
+    # callback is already added to the recipe.
+    # https://github.com/NVIDIA-NeMo/NeMo/blob/90a396a567ebb4e8c1c41e454dc00cb71f911317/nemo/collections/llm/recipes/llama31_70b.py#L231
+    comm_overlap_callback_idx = get_comm_overlap_callback_idx(
+        pretrain.trainer.callbacks
+    )
+    pretrain.trainer.callbacks[
+        comm_overlap_callback_idx
+    ].tp_comm_bootstrap_backend = "nccl"
+
+    # Disable checkpointing.
+    pretrain.log.ckpt = None
+    pretrain.trainer.enable_checkpointing = False
+
+    # Log every step.
+    pretrain.trainer.log_every_n_steps = 1
+
+    # Enable DLLogger
+    dllogger_config = run.Config(
+        DLLogger,
+        verbose=True,
+        stdout=True,
+        json_file="dllogger.json",
+    )
+    pretrain.log.extra_loggers = [dllogger_config]
+
+    return pretrain
+
+
+if __name__ == "__main__":
+    run.cli.main(llm.pretrain, default_factory=recipe)
diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/recipe_launch_command.sh b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/recipe_launch_command.sh
new file mode 100644
index 00000000..2a6a9e05
--- /dev/null
+++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/recipe_launch_command.sh
@@ -0,0 +1 @@
+helm install vishwasreddy-ubench-37r7 . -f values.yaml --set-file workload_launcher=launcher.sh --set-file workload_config=llama3-1-70b-bf16-seq8192-gbs256-gpus256.py --set workload.image=nvcr.io/nvidia/nemo:25.11 --set volumes.gcsMounts[0].bucketName=ubench-logs --set volumes.gcsMounts[0].mountPath=/job-logs --set workload.envs[0].value=/job-logs/vishwasreddy-ubench-37r7
\ No newline at end of file
diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/templates/workload-config-configmap.yaml b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/templates/workload-config-configmap.yaml
new file mode 100644
index 00000000..a1d54cee
--- /dev/null
+++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/templates/workload-config-configmap.yaml
@@ -0,0 +1,28 @@
+# yamllint disable
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
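The recipe earlier in this change pins tp=2, pp=4, cp=2 with mbs=2 and gbs=256 on 256 GPUs. Under the usual Megatron accounting (a model replica spans tp × pp × cp GPUs, and gbs = dp × mbs × accumulation), this implies a data-parallel size of 16 and 8 gradient-accumulation microbatches per step. A sketch of that arithmetic, as a sanity check rather than anything the recipe executes:

```python
def derive_layout(world_size: int, tp: int, pp: int, cp: int,
                  mbs: int, gbs: int) -> tuple[int, int]:
    """Data-parallel size and gradient-accumulation steps implied by the recipe."""
    model_parallel = tp * pp * cp           # GPUs per model replica
    assert world_size % model_parallel == 0
    dp = world_size // model_parallel       # number of data-parallel replicas
    assert gbs % (dp * mbs) == 0
    accumulation = gbs // (dp * mbs)        # microbatches per step per replica
    return dp, accumulation

print(derive_layout(256, tp=2, pp=4, cp=2, mbs=2, gbs=256))  # (16, 8)
```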
+ +{{- if .Values.workload.configFile }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-config" +data: + workload-configuration: |- +{{- if .Values.workload_config }} +{{ .Values.workload_config | nindent 4 }} +{{- else }} +{{ "config: null" | nindent 4 }} +{{- end }} +{{- end }} diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/templates/workload-job.yaml b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/templates/workload-job.yaml new file mode 100644 index 00000000..b4ffa210 --- /dev/null +++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/templates/workload-job.yaml @@ -0,0 +1,333 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +{{$timestamp := now | date "2006-01-02-15-04-05"}} +{{$jobSuffix := randAlphaNum 4 | lower}} +{{$jobuuid := uuidv4}} +{{$nodes := div .Values.workload.gpus 8 | max 1}} +{{$gpusPerNode := min .Values.workload.gpus 8}} +{{- $root := . 
-}} + +apiVersion: jobset.x-k8s.io/v1alpha2 +kind: JobSet +metadata: + name: "{{ .Release.Name }}" + namespace: default + labels: + {{- if $root.Values.queue }} + kueue.x-k8s.io/queue-name: "{{ $root.Values.queue }}" + {{- end }} +spec: + {{- if $root.Values.queue }} + suspend: true + {{- end }} + failurePolicy: + maxRestarts: {{ default 0 $root.Values.workload.max_workload_restarts }} + replicatedJobs: + - name: workload + replicas: 1 + template: + spec: + parallelism: {{ $nodes }} + completions: {{ $nodes }} + backoffLimit: 0 + completionMode: Indexed + activeDeadlineSeconds: 14400 # 4 hours (4 * 60 * 60) + ttlSecondsAfterFinished: 43200 # 12 hours (12 * 60 * 60) + template: + metadata: + annotations: + kubectl.kubernetes.io/default-container: workload + {{- if $root.Values.volumes.gcsVolumes }} + gke-gcsfuse/volumes: "true" + gke-gcsfuse/cpu-limit: "500m" + gke-gcsfuse/memory-limit: "1Ti" + gke-gcsfuse/ephemeral-storage-limit: "2Ti" + {{- end }} + {{- if $root.Values.volumes.psVolumes }} + gke-parallelstore/volumes: "true" + gke-parallelstore/cpu-limit: "0" + gke-parallelstore/memory-limit: "0" + {{- end }} + {{- if and $root.Values.queue $root.Values.tasSettings.topologyRequest }} + {{- toYaml .Values.tasSettings.topologyRequest | nindent 14 }} + {{- end }} + {{- if and $root.Values.queue $root.Values.dwsSettings.maxRunDurationSeconds }} + provreq.kueue.x-k8s.io/maxRunDurationSeconds: "{{ $root.Values.dwsSettings.maxRunDurationSeconds }}" + {{- end }} + {{- if not $root.Values.network.hostNetwork }} + networking.gke.io/default-interface: "eth0" + networking.gke.io/interfaces: | + {{- if $root.Values.network.subnetworks }} + [ + {{- range $i, $subnetwork := $root.Values.network.subnetworks }} + {"interfaceName":"eth{{ $i }}","network":"{{ $subnetwork }}"}{{ eq $i 9 | ternary "" ","}} + {{- end }} + ] + {{- else }} + [ + {"interfaceName":"eth0","network":"default"}, + {"interfaceName":"eth1","network":"gvnic-1"}, + {{- range $i := until 8 }} + 
{"interfaceName":"eth{{ add 2 $i }}","network":"rdma-{{ $i }}"}{{ eq $i 7 | ternary "" ","}} + {{- end }} + ] + {{- end }} + {{- end }} + spec: + {{- if $root.Values.network.hostNetwork }} + hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet + {{- end }} + subdomain: "{{.Release.Name}}" + restartPolicy: Never + {{- if $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "In" + values: + {{- range $hostname := $root.Values.targetNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + {{- if $root.Values.avoidNodes }} + {{- if not $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + {{- end }} + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "NotIn" + values: + {{- range $hostname := $root.Values.avoidNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + tolerations: + - operator: "Exists" + key: nvidia.com/gpu + - operator: "Exists" + key: cloud.google.com/impending-node-termination + + volumes: + {{ if $root.Values.network.gibVersion }} + - name: gib + emptyDir: {} + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + configMap: + name: "{{.Release.Name}}-config" + items: + - key: workload-configuration + path: {{ $root.Values.workload.configFile | default "workload-configuration" }} + {{- end }} + + - name: workload-launcher + configMap: + name: "{{.Release.Name}}-launcher" + + - name: shared-memory + emptyDir: + medium: "Memory" + sizeLimit: 250Gi + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + persistentVolumeClaim: + claimName: "{{ $pvc.claimName }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + csi: + driver: gcsfuse.csi.storage.gke.io + volumeAttributes: + bucketName: "{{ 
$gcs.bucketName }}" + {{- if $gcs.mountOptions }} + mountOptions: "{{ $gcs.mountOptions }}" + {{- end }} + {{- end}} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + hostPath: + path: /mnt/stateful_partition/kube-ephemeral-ssd + {{- end }} + + initContainers: + {{ if $root.Values.network.gibVersion }} + - name: nccl-plugin-installer + image: {{ $root.Values.network.gibVersion }} + imagePullPolicy: Always + args: + - | + set -ex + /scripts/container_entry.sh install --install-nccl + cp -R /var/lib/gib/lib64/. /target/usr/local/gib/lib64 + cp -R /var/lib/gib/. /target/usr/local/gib + command: + - /bin/sh + - -c + volumeMounts: + - mountPath: /target/usr/local/gib + name: gib + {{ end}} + + containers: + {{- if $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-sidecar + image: {{ $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-metadata-prefetch + image: {{ $root.Values.workload.gcsSidecarImage }} + {{- end }} + {{- if $root.Values.workload.psSidecarImage }} + - name: gke-parallelstore-sidecar + image: {{ $root.Values.workload.psSidecarImage }} + {{- end }} + + - name: workload + image: "{{ $root.Values.workload.image }}" + imagePullPolicy: Always + {{- if $root.Values.network.hostNetwork }} + securityContext: + privileged: true + {{- end }} + env: + - name: JOB_IDENTIFIER + value: "{{ .Release.Name }}-{{ $timestamp }}" + - name: JOB_TIMESTAMP + value: "{{ $timestamp }}" + - name: JOB_UUID + value: "{{ $jobuuid }}" + - name: JOB_ORCHESTRATOR + value: "gke" + # Add RANK based on the pod's index provided by the Indexed Job + # This is crucial for torch.distributed initialization. 
+ - name: JOB_COMPLETION_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index'] + - name: RANK_0_FQDN + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: HOSTNAME_PREFIX + value: "{{.Release.Name}}-workload-" + - name: DOMAIN_NAME + value: "{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_ADDR + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_PORT + value: "6002" + - name: WORLD_SIZE + value: "{{ $root.Values.workload.gpus }}" + - name: NNODES + value: "{{ $nodes }}" + - name: GPUS_PER_NODE + value: "{{ $gpusPerNode }}" + + - name: NCCL_PLUGIN_PATH + value: /usr/local/gib/lib64:/usr/local/nvidia/lib64 + + {{ if $root.Values.network.gibVersion }} + - name: NCCL_INIT_SCRIPT + value: "/usr/local/gib/scripts/set_nccl_env.sh" + {{ end }} + + {{ if $root.Values.network.ncclSettings }} + {{- toYaml .Values.network.ncclSettings | nindent 14 }} + {{ end }} + + {{ if $root.Values.workload.envs }} + {{- toYaml .Values.workload.envs | nindent 14 }} + {{ end }} + + command: + - bash + - -c + - | + echo "Pod on $(hostname --fqdn) is running" + echo "Pod is assigned job index of $JOB_COMPLETION_INDEX" + + if [[ -n "${NCCL_INIT_SCRIPT}" ]]; then + echo "Running NCCL init script: ${NCCL_INIT_SCRIPT}" + source ${NCCL_INIT_SCRIPT} + fi + + # Overriding NCCL_SOCKET_IFNAME definition + export NCCL_SOCKET_IFNAME="eth0,eth1" + export NCCL_TUNER_CONFIG_PATH=/usr/local/gib/configs/tuner_config_a4.txtpb + + echo "Launching workload with the following arguments:" + {{- range $root.Values.workload.defaultArguments }} + echo " {{ . }}" + {{- end }} + {{- range $root.Values.workload.arguments }} + echo " {{ . }}" + {{- end }} + echo "" + + sleep 10 + + bash /workload/launcher/launch-workload.sh \ + {{- range $root.Values.workload.defaultArguments }} + {{ . }} \ + {{- end }} + {{- range $root.Values.workload.arguments }} + {{ . 
}} \ + {{- end }} + + + volumeMounts: + {{ if $root.Values.network.gibVersion }} + - name: gib + mountPath: /usr/local/gib + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + mountPath: {{ $root.Values.workload.configPath | default "/workload/configs" }} + {{- end }} + + - name: workload-launcher + mountPath: /workload/launcher + + - name: shared-memory + mountPath: /dev/shm + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + mountPath: "{{ $pvc.mountPath }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + mountPath: "{{ $gcs.mountPath }}" + {{- end }} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + mountPath: "{{ $root.Values.volumes.ssdMountPath }}" + {{- end }} + + resources: + limits: + nvidia.com/gpu: {{ $gpusPerNode }} diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/templates/workload-launcher-configmap.yaml b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/templates/workload-launcher-configmap.yaml new file mode 100644 index 00000000..7026e0f1 --- /dev/null +++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/templates/workload-launcher-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-launcher" +data: + launch-workload.sh: |- +{{- if .Values.workload_launcher }} +{{ .Values.workload_launcher | nindent 4 }} +{{- else }} + #!/bin/bash + echo "No workload launcher specified" + exit 1 +{{- end }} diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/templates/workload-svc.yaml b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/templates/workload-svc.yaml new file mode 100644 index 00000000..7cfe220b --- /dev/null +++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/templates/workload-svc.yaml @@ -0,0 +1,22 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: Service +metadata: + name: "{{ .Release.Name }}" +spec: + clusterIP: None + selector: + jobset.sigs.k8s.io/jobset-name: "{{ .Release.Name }}" diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/values.yaml b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/values.yaml new file mode 100644 index 00000000..6f6ed7b8 --- /dev/null +++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs256/values.yaml @@ -0,0 +1,33 @@ +queue: null +dwsSettings: + maxRunDurationSeconds: null +tasSettings: + topologyRequest: + kueue.x-k8s.io/podset-preferred-topology: kubernetes.io/hostname +volumes: + gcsVolumes: true + psVolumes: false + gcsMounts: + - bucketName: null + mountPath: null +workload: + gpus: 256 + image: nvcr.io/nvidia/nemo:25.11 + defaultArguments[]: null + arguments[]: null + configFile: llama3-1-70b-bf16-seq8192-gbs256-gpus256.py + configPath: /workload/configs/ + envs: + - name: ARTIFACT_DIR + value: null + - name: GLOO_SOCKET_IFNAME + value: eth0 + - name: NEMO_LAUNCH_SCRIPT + value: /workload/configs/llama3-1-70b-bf16-seq8192-gbs256-gpus256.py +network: + hostNetwork: true + subnetworks[]: null + gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1 + ncclSettings: + - name: NCCL_DEBUG + value: WARN diff --git a/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/Chart.yaml b/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/Chart.yaml new file mode 100644 index 00000000..af46c11a --- /dev/null +++ b/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/Chart.yaml @@ -0,0 +1,20 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v2 +name: a4_jobset_workload +description: a4_jobset_workload +type: application +version: 0.1.0 +appVersion: "1.16.0" diff --git a/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/README.md b/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/README.md new file mode 100644 index 00000000..9af60ca9 --- /dev/null +++ b/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/README.md @@ -0,0 +1,153 @@ + +# Pretrain llama3-70b workloads on a4 GKE Node pools with the NVIDIA Megatron-Bridge framework + +This recipe outlines the steps for running a llama3-70b pretraining +workload on [a4 GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the +[Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) framework. + +## Orchestration and deployment tools + +For this recipe, the following setup is used: + +- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) +- Pretraining job configuration and deployment - A Helm chart is used to + configure and deploy the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset) resource, which manages the execution of the + [Megatron-Bridge pretraining workload](https://github.com/NVIDIA-NeMo/Megatron-Bridge). + +## Test environment + +This recipe has been optimized for and tested with the following configuration: + +- GKE cluster: follow the Cluster Toolkit +[instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4) +to create your a4 GKE cluster.
+ +## Training dataset + +This recipe uses a mock pretraining dataset provided by the Megatron-Bridge framework. + +## Docker container image + +This recipe uses the following Docker images: + +- `nvcr.io/nvidia/nemo:26.02` +- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1` + +## Run the recipe + +From your client workstation, complete the following steps: + +### Configure environment settings + +Set the environment variables to match your environment: + + ```bash + export PROJECT_ID=<PROJECT_ID> + export CLUSTER_REGION=<CLUSTER_REGION> + export CLUSTER_NAME=<CLUSTER_NAME> + export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs:// + export KUEUE_NAME=<KUEUE_NAME> + export HF_TOKEN=<HF_TOKEN> + ``` + +Replace the following values: + + - `<PROJECT_ID>`: your Google Cloud project ID. + - `<CLUSTER_REGION>`: the region where your cluster is located. + - `<CLUSTER_NAME>`: the name of your GKE cluster. + - `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix. + - `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4`. Make sure to verify the name of the local queue in your cluster. + - `<HF_TOKEN>`: your Hugging Face token. + +Set the default project: + + ```bash + gcloud config set project $PROJECT_ID + ``` + +### Get the recipe + +Clone the `gpu-recipes` repository and set a reference to the recipe folder. + +``` +git clone https://github.com/ai-hypercomputer/gpu-recipes.git +cd gpu-recipes +export REPO_ROOT=`git rev-parse --show-toplevel` +export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe +cd $RECIPE_ROOT +``` + +### Get cluster credentials + +``` +gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION +``` + +### Configure and submit a pretraining job + +#### Using 8 nodes (64 GPUs) with fp8 precision +To execute the job with the default settings, run the following command from +your client: + +```bash +cd $RECIPE_ROOT +export WORKLOAD_NAME=$USER-a4-llama3-70b-8node +helm install $WORKLOAD_NAME . 
-f values.yaml \ +--set-file workload_launcher=launcher.sh \ +--set workload.image=nvcr.io/nvidia/nemo:26.02 \ +--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \ +--set volumes.gcsMounts[0].mountPath=/job-logs \ +--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \ +--set queue=${KUEUE_NAME} +``` + +**Examples** + +- To set the number of training steps to 100, run the following command from + your client: + + ```bash + cd $RECIPE_ROOT + export WORKLOAD_NAME=$USER-a4-llama3-70b-8node + helm install $WORKLOAD_NAME . -f values.yaml \ + --set-file workload_launcher=launcher.sh \ + --set workload.image=nvcr.io/nvidia/nemo:26.02 \ + --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \ + --set volumes.gcsMounts[0].mountPath=/job-logs \ + --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \ + --set queue=${KUEUE_NAME} \ + --set workload.arguments[0]="trainer.max_steps=100" + ``` + +### Monitor the job + +To check the status of pods in your job, run the following command: + +``` +kubectl get pods | grep JOB_NAME_PREFIX +``` + +Replace the following: + +- JOB_NAME_PREFIX - your job name prefix. For example, $USER-a4-llama3-70b-8node. + +To get the logs for one of the pods, run the following command: + +``` +kubectl logs POD_NAME +``` + +Information about the training job's progress, including crucial details such as +loss, step count, and step time, is generated by the rank 0 process. +This process runs on the pod whose name begins with +`JOB_NAME_PREFIX-workload-0-0`. +For example: `$USER-a4-llama3-70b-8node-workload-0-0-s9zrv`. + +### Uninstall the Helm release + +You can delete the job and other resources created by the Helm chart. 
To +uninstall Helm, run the following command from your client: + +```bash +helm uninstall $USER-a4-llama3-70b-8node +``` diff --git a/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/custom_setup_experiment.py b/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/custom_setup_experiment.py new file mode 100644 index 00000000..50a38067 --- /dev/null +++ b/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/custom_setup_experiment.py @@ -0,0 +1,327 @@ +import glob +import logging +import os +import sys +import time +from pathlib import Path +from typing import Any, Dict, List, Optional + +import nemo_run as run +from nemo_run.config import get_nemorun_home + + +try: + from argument_parser import parse_cli_args + from utils.evaluate import calc_convergence_and_performance + from utils.executors import dgxc_executor, slurm_executor + from utils.utils import get_exp_name_config, select_config_variant_interactive +except (ImportError, ModuleNotFoundError): + from .argument_parser import parse_cli_args + from .utils.evaluate import calc_convergence_and_performance + from .utils.executors import dgxc_executor, slurm_executor + from .utils.utils import get_exp_name_config, select_config_variant_interactive + +try: + import wandb + + HAVE_WANDB = True +except (ImportError, ModuleNotFoundError): + HAVE_WANDB = False + +try: + from perf_plugins import NsysPlugin, PerfEnvPlugin, PyTorchProfilerPlugin + from resiliency_plugins import FaultTolerancePlugin +except (ImportError, ModuleNotFoundError): + from .perf_plugins import NsysPlugin, PerfEnvPlugin, PyTorchProfilerPlugin + from .resiliency_plugins import FaultTolerancePlugin + + +SCRIPT_DIR = Path(__file__).parent.resolve() +ENTRYPOINT_PEFORMANCE = "run_script.py" +ENTRYPOINT_RECIPE = "run_recipe.py" + +logging.basicConfig(level=logging.DEBUG) +logger = logging.getLogger(__name__) + + + + +def main( + use_recipes: bool, + model_family_name: str, + model_recipe_name: str, + task: str, + 
compute_dtype: str, + gpu: str, + hf_token: str, + detach: bool, + dryrun: bool, + enable_vboost: bool, + enable_nsys: bool, + pytorch_profiler: bool, + moe_a2a_overlap: bool, + tp_size: Optional[int], + pp_size: Optional[int], + cp_size: Optional[int], + ep_size: Optional[int], + wandb_key: str, + wandb_project_name: str, + wandb_experiment_name: str, + wandb_entity_name: str, + profiling_start_step: int, + profiling_stop_step: int, + record_memory_history: bool, + profiling_gpu_metrics: bool, + profiling_ranks: Optional[List[int]], + nsys_trace: Optional[List[str]], + nsys_extra_args: Optional[List[str]], + nemo_home: str, + account: str, + partition: str, + log_dir: str, + gpus_per_node: int, + time_limit: str, + container_image: str, + custom_mounts: List[str], + custom_env_vars: Dict[str, str], + custom_srun_args: List[str], + custom_bash_cmds: List[List[str]], + nccl_ub: bool, + pretrained_checkpoint: Optional[str], + num_gpus: int, + is_long_convergence_run: bool, + additional_slurm_params: Optional[Dict[str, Any]], + golden_values_path: str, + convergence_params: Dict[str, Any], + performance_params: Dict[str, Any], + memory_params: Dict[str, Any], + max_retries: int, + dgxc_base_url: str, + dgxc_cluster: str, + dgxc_kube_apiserver_url: str, + dgxc_app_id: str, + dgxc_app_secret: str, + dgxc_project_name: str, + dgxc_pvc_claim_name: str, + dgxc_pvc_mount_path: str, + config_variant: str = "v1", +): + """Sets up the experiment and runs it.""" + if ( + model_family_name in ["qwen3"] + and model_recipe_name + in [ + "qwen3_30b_a3b", + "qwen3_235b_a22b", + ] + and task == "pretrain" + ): + assert hf_token is not None, "HF token is required for Qwen3 tokenizer. NullTokenizer to be used soon." 
+ + if wandb_key is not None: + assert wandb_project_name is not None and wandb_experiment_name is not None, ( + "both wandb_project_name and wandb_experiment_name are required for logging with WandB" + ) + + if use_recipes: + script_name = ENTRYPOINT_RECIPE + exp_name = ( + wandb_experiment_name + if wandb_experiment_name is not None + else f"{model_recipe_name}_{task}_{num_gpus}gpu_{gpu}" + ) + + else: + script_name = ENTRYPOINT_PEFORMANCE + exp_config = get_exp_name_config( + args, model_family_name, model_recipe_name, gpu, compute_dtype, task, config_variant + ) + exp_name = ( + wandb_experiment_name + if wandb_experiment_name is not None + else f"{task}_{model_recipe_name}_{compute_dtype}_{exp_config}" + ) + + if pretrained_checkpoint is not None: + custom_mounts.append(f"{pretrained_checkpoint}:{pretrained_checkpoint}") + + import os + rank = os.environ.get('RANK', '0') + exp_name += f'_worker{rank}' + + run_script_path = SCRIPT_DIR / script_name + logger.info(f"Run script path: {run_script_path}") + if not run_script_path.is_file(): + logger.error(f"Specified run script not found: {run_script_path}") + sys.exit(1) + + custom_mounts.extend( + [ + f"{run_script_path}:{run_script_path}", + f"{SCRIPT_DIR}:{SCRIPT_DIR}", + ] + ) + + if nccl_ub: + custom_env_vars.update({"NCCL_NVLS_ENABLE": "1", "NCCL_CTA_POLICY": "1"}) + + executor = run.LocalExecutor() + + plugins = [] + + if not use_recipes: + plugins.append( + PerfEnvPlugin( + enable_vboost=enable_vboost, + moe_a2a_overlap=moe_a2a_overlap, + tp_size=tp_size, + pp_size=pp_size, + cp_size=cp_size, + ep_size=ep_size, + model_family_name=model_family_name, + model_recipe_name=model_recipe_name, + gpu=gpu, + compute_dtype=compute_dtype, + train_task=task, + config_variant=config_variant, + ) + ) + + if enable_nsys: + plugins.append( + NsysPlugin( + profile_step_start=profiling_start_step, + profile_step_end=profiling_stop_step, + nsys_gpu_metrics=profiling_gpu_metrics, + profile_ranks=profiling_ranks, + 
nsys_trace=nsys_trace, + nsys_extra_args=nsys_extra_args, + ) + ) + if pytorch_profiler: + plugins.append( + PyTorchProfilerPlugin( + profile_step_start=profiling_start_step, + profile_step_end=profiling_stop_step, + profile_ranks=profiling_ranks, + record_memory_history=record_memory_history, + ) + ) + + nemorun_script = run.Script( + path=str(run_script_path), + entrypoint="python", + env={"PYTHONPATH": f"{SCRIPT_DIR}:$PYTHONPATH"}, + args=list(sys.argv[1:]), + ) + + logger.info("Will launch the following command with Nemo-Run: %s", " ".join(nemorun_script.to_command())) + + run.run( + nemorun_script, + executor=executor, + plugins=plugins, + dryrun=dryrun, + detach=detach, + name=exp_name, + ) + + +if __name__ == "__main__": + parser = parse_cli_args() + args, unknown_args = parser.parse_known_args() + + assert not (args.enable_nsys and args.pytorch_profiler), ( + "Both NSys and PyTorch profiler cannot be enabled at the same time" + ) + + # probably better to use parser.parse_args() and make unknowns an error, + # but for now we'll just issue a warning. 
+ if unknown_args: + logger.warning(f"Ignoring unrecognized arguments: {' '.join(unknown_args)}") + + # Handle --list_config_variants: show available variants and interactively select + config_variant = args.config_variant + if args.list_config_variants: + config_variant = select_config_variant_interactive( + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + gpu=args.gpu, + compute_dtype=args.compute_dtype, + task=args.task, + ) + + main( + use_recipes=args.use_recipes, + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + task=args.task, + compute_dtype=args.compute_dtype, + gpu=args.gpu, + hf_token=args.hf_token, + detach=args.detach, + dryrun=args.dryrun, + enable_vboost=args.enable_vboost, + enable_nsys=args.enable_nsys, + pytorch_profiler=args.pytorch_profiler, + moe_a2a_overlap=args.moe_a2a_overlap, + tp_size=args.tensor_model_parallel_size, + pp_size=args.pipeline_model_parallel_size, + cp_size=args.context_parallel_size, + ep_size=args.expert_model_parallel_size, + wandb_key=args.wandb_key, + wandb_project_name=args.wandb_project_name, + wandb_experiment_name=args.wandb_experiment_name, + wandb_entity_name=args.wandb_entity_name, + profiling_start_step=args.profiling_start_step, + profiling_stop_step=args.profiling_stop_step, + record_memory_history=args.record_memory_history, + profiling_gpu_metrics=args.profiling_gpu_metrics, + profiling_ranks=args.profiling_ranks, + nsys_trace=args.nsys_trace, + nsys_extra_args=args.nsys_extra_args, + nemo_home=args.nemo_home, + account=args.account, + partition=args.partition, + log_dir=args.log_dir, + gpus_per_node=args.gpus_per_node, + time_limit=args.time_limit, + container_image=args.container_image, + custom_mounts=args.custom_mounts, + custom_env_vars=args.custom_env_vars, + custom_srun_args=args.custom_srun_args, + custom_bash_cmds=args.custom_bash_cmds, + nccl_ub=args.nccl_ub, + pretrained_checkpoint=args.pretrained_checkpoint, + 
num_gpus=args.num_gpus, + is_long_convergence_run=args.is_long_convergence_run, + additional_slurm_params=args.additional_slurm_params, + golden_values_path=args.golden_values_path, + convergence_params={ + "correlation_threshold": args.correlation_threshold, + "high_loss_tolerance": args.high_loss_tolerance, + "medium_loss_tolerance": args.medium_loss_tolerance, + "low_loss_tolerance": args.low_loss_tolerance, + "final_loss_tolerance": args.final_loss_tolerance, + "max_outlier_ratio": args.max_outlier_ratio, + "outlier_threshold": args.outlier_threshold, + "skip_first_percent_loss": args.skip_first_percent_loss, + }, + performance_params={ + "timing_threshold": args.timing_threshold, + "skip_first_percent_time": args.skip_first_percent_time, + }, + memory_params={ + "memory_threshold": args.memory_threshold, + }, + max_retries=args.max_retries, + dgxc_base_url=args.dgxc_base_url, + dgxc_cluster=args.dgxc_cluster, + dgxc_kube_apiserver_url=args.dgxc_kube_apiserver_url, + dgxc_app_id=args.dgxc_app_id, + dgxc_app_secret=args.dgxc_app_secret, + dgxc_project_name=args.dgxc_project_name, + dgxc_pvc_claim_name=args.dgxc_pvc_claim_name, + dgxc_pvc_mount_path=args.dgxc_pvc_mount_path, + config_variant=config_variant, + ) \ No newline at end of file diff --git a/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/launcher.sh b/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/launcher.sh new file mode 100644 index 00000000..2c2ecb7e --- /dev/null +++ b/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/launcher.sh @@ -0,0 +1,152 @@ +usage() +{ +cat << EOF +usage: bash ./launcher.sh [config-override [config-override ...]] +config-override (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000. 
+EOF +} + +parse_args() { + while [[ "$1" != "" ]]; do + case $(grep -o "=" <<< "$1" | wc -l) in + 1 ) + config_overrides+=("$1") + ;; + * ) + echo "Invalid config override: $1" + usage + exit 1 + esac + shift + done + config_overrides="${config_overrides[*]}" +} + +config_overrides=() +parse_args "$@" + +if [[ -z "${config_overrides[*]}" ]]; then + echo "No NeMo config overrides specified" +else + echo "NeMo config overrides:" + echo " ${config_overrides}" +fi + +export LD_LIBRARY_PATH="/usr/local/cuda/compat/lib:$NCCL_PLUGIN_PATH:$LD_LIBRARY_PATH" +ldconfig "$LD_LIBRARY_PATH" +echo "Added $LD_LIBRARY_PATH to ldconfig:" +ldconfig -p | grep libcuda | sed 's/^/ /' +echo "" + +if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then + explicit_log_dir=${EXPLICIT_LOG_DIR} +else + explicit_log_dir=workload_logs +fi +echo "Logging to ${explicit_log_dir}" + +if [[ -n "${TOKENIZER_PATH}" ]]; then + echo "Getting tokenizer files" + cp "${TOKENIZER_PATH}"/* . + echo "" +fi + +echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes" + +pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger + +# Create the nsys directory. 
+mkdir -p "${explicit_log_dir}/nsys" + +# Collect diagnostics to a single line +kv="\"kernel_version\": \"$(uname --kernel-release)\"" +if command -v nvidia-smi &> /dev/null; then + cuda_v=$(nvidia-smi -q -x | grep -Po '(?<=<cuda_version>).*(?=</cuda_version>)' || true) + driver_v=$(nvidia-smi -q -x | grep -Po '(?<=<driver_version>).*(?=</driver_version>)' || true) + vbios_v=$(nvidia-smi -q -x | grep -Po '(?<=<vbios_version>).*(?=</vbios_version>)' | head -n1 || true) + kv="${kv}, \"cuda_version\": \"${cuda_v}\"" + kv="${kv}, \"driver_version\": \"${driver_v}\"" + kv="${kv}, \"vbios_version\": \"${vbios_v}\"" +fi +echo "VERSION_DIAGNOSTICS: {${kv}}" + + +export HF_TOKEN=${HF_TOKEN:-""} + +cd /opt +rm -rf Megatron-Bridge +git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git +cd Megatron-Bridge +git checkout f7a9428f301fa17ac374d5e7166a63b0aa4771af +git submodule update --init --recursive +sed -i -e '/pretrain(config=recipe/i \ recipe.dist.distributed_timeout_minutes = 60' scripts/performance/run_script.py +ls + +cp $CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH scripts/performance/ + +worker_command=$(cat <<- EOM + if [ "\$RANK" -eq "0" ]; then + echo "Worker 0 is stalling for a few seconds.." 
; + sleep 3 ; + echo "The detected environment within worker rank 0 is:" ; + env | sed 's/^/ /' ; + fi ; + + cd /opt/Megatron-Bridge ; + + numactl \ + --cpunodebind=\$((LOCAL_RANK/4)) \ + --membind=\$((LOCAL_RANK/4)) nsys profile \ + -t nvtx,cuda \ + --cuda-event-trace=false \ + --sample=none \ + --capture-range=cudaProfilerApi \ + --capture-range-end=stop \ + --kill none \ + -o "/${explicit_log_dir}/$JOB_IDENTIFIER/rank-\$RANK" \ + --force-overwrite true \ + --session-new "nsys-\$RANDOM-\$RANK" \ + nice -10 \ + python scripts/performance/custom_setup_experiment.py \ + --gpu b200 \ + --model_family_name llama \ + --model_recipe_name llama3_70b \ + --gpus_per_node 8 \ + --num_gpus 64 \ + --seq_length 8192 \ + --tensor_model_parallel_size 4 \ + --pipeline_model_parallel_size 8 \ + --virtual_pipeline_model_parallel_size 5 \ + --context_parallel_size 1 \ + --use_megatron_fsdp false \ + --global_batch_size 256 \ + --micro_batch_size 1 \ + --compute_dtype fp8_mx \ + --cuda_graph_impl transformer_engine \ + --cuda_graph_scope mlp,attn \ + --max_steps 30 + +EOM +) + +echo "$worker_command" > worker_command.sh +chmod 777 worker_command.sh + +torchrun \ +--nproc-per-node="8" \ +--nnodes="8" \ +--node_rank="${JOB_COMPLETION_INDEX}" \ +--rdzv_id="${JOB_IDENTIFIER}" \ +--master_addr="${MASTER_ADDR}" \ +--master_port="${MASTER_PORT}" \ +--no-python bash worker_command.sh + + +if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then + mkdir -p "${ARTIFACT_DIR}" + cp -r "${explicit_log_dir}"/* "${ARTIFACT_DIR}/" + env > "${ARTIFACT_DIR}/environ.txt" + ls "${ARTIFACT_DIR}" +fi +echo "Training completed" +echo "Pod on $(hostname --fqdn) is exiting" \ No newline at end of file diff --git a/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-config-configmap.yaml b/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-config-configmap.yaml new file mode 100644 index 00000000..a1d54cee --- /dev/null +++ 
b/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-config-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +{{- if .Values.workload.configFile }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-config" +data: + workload-configuration: |- +{{- if .Values.workload_config }} +{{ .Values.workload_config | nindent 4 }} +{{- else }} +{{ "config: null" | nindent 4 }} +{{- end }} +{{- end }} diff --git a/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-job.yaml b/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-job.yaml new file mode 100644 index 00000000..b4ffa210 --- /dev/null +++ b/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-job.yaml @@ -0,0 +1,333 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +{{$timestamp := now | date "2006-01-02-15-04-05"}} +{{$jobSuffix := randAlphaNum 4 | lower}} +{{$jobuuid := uuidv4}} +{{$nodes := div .Values.workload.gpus 8 | max 1}} +{{$gpusPerNode := min .Values.workload.gpus 8}} +{{- $root := . -}} + +apiVersion: jobset.x-k8s.io/v1alpha2 +kind: JobSet +metadata: + name: "{{ .Release.Name }}" + namespace: default + labels: + {{- if $root.Values.queue }} + kueue.x-k8s.io/queue-name: "{{ $root.Values.queue }}" + {{- end }} +spec: + {{- if $root.Values.queue }} + suspend: true + {{- end }} + failurePolicy: + maxRestarts: {{ default 0 $root.Values.workload.max_workload_restarts }} + replicatedJobs: + - name: workload + replicas: 1 + template: + spec: + parallelism: {{ $nodes }} + completions: {{ $nodes }} + backoffLimit: 0 + completionMode: Indexed + activeDeadlineSeconds: 14400 # 4 hours (4 * 60 * 60) + ttlSecondsAfterFinished: 43200 # 12 hours (12 * 60 * 60) + template: + metadata: + annotations: + kubectl.kubernetes.io/default-container: workload + {{- if $root.Values.volumes.gcsVolumes }} + gke-gcsfuse/volumes: "true" + gke-gcsfuse/cpu-limit: "500m" + gke-gcsfuse/memory-limit: "1Ti" + gke-gcsfuse/ephemeral-storage-limit: "2Ti" + {{- end }} + {{- if $root.Values.volumes.psVolumes }} + gke-parallelstore/volumes: "true" + gke-parallelstore/cpu-limit: "0" + gke-parallelstore/memory-limit: "0" + {{- end }} + {{- if and $root.Values.queue $root.Values.tasSettings.topologyRequest }} + {{- toYaml .Values.tasSettings.topologyRequest | nindent 14 }} + {{- end }} + {{- if and $root.Values.queue $root.Values.dwsSettings.maxRunDurationSeconds }} + provreq.kueue.x-k8s.io/maxRunDurationSeconds: "{{ $root.Values.dwsSettings.maxRunDurationSeconds }}" + {{- end }} + {{- if not $root.Values.network.hostNetwork }} + networking.gke.io/default-interface: "eth0" + networking.gke.io/interfaces: | + {{- if $root.Values.network.subnetworks }} + [ + {{- range $i, $subnetwork := $root.Values.network.subnetworks }} + {"interfaceName":"eth{{ $i 
}}","network":"{{ $subnetwork }}"}{{ eq $i 9 | ternary "" ","}} + {{- end }} + ] + {{- else }} + [ + {"interfaceName":"eth0","network":"default"}, + {"interfaceName":"eth1","network":"gvnic-1"}, + {{- range $i := until 8 }} + {"interfaceName":"eth{{ add 2 $i }}","network":"rdma-{{ $i }}"}{{ eq $i 7 | ternary "" ","}} + {{- end }} + ] + {{- end }} + {{- end }} + spec: + {{- if $root.Values.network.hostNetwork }} + hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet + {{- end }} + subdomain: "{{.Release.Name}}" + restartPolicy: Never + {{- if $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "In" + values: + {{- range $hostname := $root.Values.targetNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + {{- if $root.Values.avoidNodes }} + {{- if not $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + {{- end }} + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "NotIn" + values: + {{- range $hostname := $root.Values.avoidNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + tolerations: + - operator: "Exists" + key: nvidia.com/gpu + - operator: "Exists" + key: cloud.google.com/impending-node-termination + + volumes: + {{ if $root.Values.network.gibVersion }} + - name: gib + emptyDir: {} + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + configMap: + name: "{{.Release.Name}}-config" + items: + - key: workload-configuration + path: {{ $root.Values.workload.configFile | default "workload-configuration" }} + {{- end }} + + - name: workload-launcher + configMap: + name: "{{.Release.Name}}-launcher" + + - name: shared-memory + emptyDir: + medium: "Memory" + sizeLimit: 250Gi + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + persistentVolumeClaim: + 
claimName: "{{ $pvc.claimName }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + csi: + driver: gcsfuse.csi.storage.gke.io + volumeAttributes: + bucketName: "{{ $gcs.bucketName }}" + {{- if $gcs.mountOptions }} + mountOptions: "{{ $gcs.mountOptions }}" + {{- end }} + {{- end}} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + hostPath: + path: /mnt/stateful_partition/kube-ephemeral-ssd + {{- end }} + + initContainers: + {{ if $root.Values.network.gibVersion }} + - name: nccl-plugin-installer + image: {{ $root.Values.network.gibVersion }} + imagePullPolicy: Always + args: + - | + set -ex + /scripts/container_entry.sh install --install-nccl + cp -R /var/lib/gib/lib64/. /target/usr/local/gib/lib64 + cp -R /var/lib/gib/. /target/usr/local/gib + command: + - /bin/sh + - -c + volumeMounts: + - mountPath: /target/usr/local/gib + name: gib + {{ end}} + + containers: + {{- if $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-sidecar + image: {{ $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-metadata-prefetch + image: {{ $root.Values.workload.gcsSidecarImage }} + {{- end }} + {{- if $root.Values.workload.psSidecarImage }} + - name: gke-parallelstore-sidecar + image: {{ $root.Values.workload.psSidecarImage }} + {{- end }} + + - name: workload + image: "{{ $root.Values.workload.image }}" + imagePullPolicy: Always + {{- if $root.Values.network.hostNetwork }} + securityContext: + privileged: true + {{- end }} + env: + - name: JOB_IDENTIFIER + value: "{{ .Release.Name }}-{{ $timestamp }}" + - name: JOB_TIMESTAMP + value: "{{ $timestamp }}" + - name: JOB_UUID + value: "{{ $jobuuid }}" + - name: JOB_ORCHESTRATOR + value: "gke" + # Add RANK based on the pod's index provided by the Indexed Job + # This is crucial for torch.distributed initialization. 
+ - name: JOB_COMPLETION_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index'] + - name: RANK_0_FQDN + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: HOSTNAME_PREFIX + value: "{{.Release.Name}}-workload-" + - name: DOMAIN_NAME + value: "{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_ADDR + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_PORT + value: "6002" + - name: WORLD_SIZE + value: "{{ $root.Values.workload.gpus }}" + - name: NNODES + value: "{{ $nodes }}" + - name: GPUS_PER_NODE + value: "{{ $gpusPerNode }}" + + - name: NCCL_PLUGIN_PATH + value: /usr/local/gib/lib64:/usr/local/nvidia/lib64 + + {{ if $root.Values.network.gibVersion }} + - name: NCCL_INIT_SCRIPT + value: "/usr/local/gib/scripts/set_nccl_env.sh" + {{ end }} + + {{ if $root.Values.network.ncclSettings }} + {{- toYaml .Values.network.ncclSettings | nindent 14 }} + {{ end }} + + {{ if $root.Values.workload.envs }} + {{- toYaml .Values.workload.envs | nindent 14 }} + {{ end }} + + command: + - bash + - -c + - | + echo "Pod on $(hostname --fqdn) is running" + echo "Pod is assigned job index of $JOB_COMPLETION_INDEX" + + if [[ -n "${NCCL_INIT_SCRIPT}" ]]; then + echo "Running NCCL init script: ${NCCL_INIT_SCRIPT}" + source ${NCCL_INIT_SCRIPT} + fi + + # Overriding NCCL_SOCKET_IFNAME definition + export NCCL_SOCKET_IFNAME="eth0,eth1" + export NCCL_TUNER_CONFIG_PATH=/usr/local/gib/configs/tuner_config_a4.txtpb + + echo "Launching workload with the following arguments:" + {{- range $root.Values.workload.defaultArguments }} + echo " {{ . }}" + {{- end }} + {{- range $root.Values.workload.arguments }} + echo " {{ . }}" + {{- end }} + echo "" + + sleep 10 + + bash /workload/launcher/launch-workload.sh \ + {{- range $root.Values.workload.defaultArguments }} + {{ . }} \ + {{- end }} + {{- range $root.Values.workload.arguments }} + {{ . 
}} \ + {{- end }} + + + volumeMounts: + {{ if $root.Values.network.gibVersion }} + - name: gib + mountPath: /usr/local/gib + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + mountPath: {{ $root.Values.workload.configPath | default "/workload/configs" }} + {{- end }} + + - name: workload-launcher + mountPath: /workload/launcher + + - name: shared-memory + mountPath: /dev/shm + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + mountPath: "{{ $pvc.mountPath }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + mountPath: "{{ $gcs.mountPath }}" + {{- end }} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + mountPath: "{{ $root.Values.volumes.ssdMountPath }}" + {{- end }} + + resources: + limits: + nvidia.com/gpu: {{ $gpusPerNode }} diff --git a/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-launcher-configmap.yaml b/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-launcher-configmap.yaml new file mode 100644 index 00000000..7026e0f1 --- /dev/null +++ b/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-launcher-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-launcher" +data: + launch-workload.sh: |- +{{- if .Values.workload_launcher }} +{{ .Values.workload_launcher | nindent 4 }} +{{- else }} + #!/bin/bash + echo "No workload launcher specified" + exit 1 +{{- end }} diff --git a/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-svc.yaml b/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-svc.yaml new file mode 100644 index 00000000..7cfe220b --- /dev/null +++ b/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-svc.yaml @@ -0,0 +1,22 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: Service +metadata: + name: "{{ .Release.Name }}" +spec: + clusterIP: None + selector: + jobset.sigs.k8s.io/jobset-name: "{{ .Release.Name }}" diff --git a/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/values.yaml b/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/values.yaml new file mode 100644 index 00000000..ebfad950 --- /dev/null +++ b/training/a4/llama3-70b/nemo-pretraining-gke/nemo2602/recipe/values.yaml @@ -0,0 +1,33 @@ +queue: null +dwsSettings: + maxRunDurationSeconds: null +tasSettings: + topologyRequest: + kueue.x-k8s.io/podset-preferred-topology: kubernetes.io/hostname +volumes: + gcsVolumes: true + psVolumes: false + gcsMounts: + - bucketName: null + mountPath: null +workload: + gpus: 64 + image: nvcr.io/nvidia/nemo:26.02 + defaultArguments[]: null + arguments[]: null + configFile: custom_setup_experiment.py + configPath: /workload/configs/ + envs: + - name: ARTIFACT_DIR + value: null + - name: GLOO_SOCKET_IFNAME + value: eth0 + - name: CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH + value: /workload/configs/custom_setup_experiment.py +network: + hostNetwork: true + subnetworks[]: null + gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1 + ncclSettings: + - name: NCCL_DEBUG + value: WARN diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/Chart.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/Chart.yaml new file mode 100644 index 00000000..af46c11a --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/Chart.yaml @@ -0,0 +1,20 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v2 +name: a4_jobset_workload +description: a4_jobset_workload +type: application +version: 0.1.0 +appVersion: "1.16.0" diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/README.md b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/README.md similarity index 98% rename from training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/README.md rename to training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/README.md index c90287f7..2da1b528 100644 --- a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/README.md +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/README.md @@ -77,7 +77,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder. 
git clone https://github.com/ai-hypercomputer/gpu-recipes.git cd gpu-recipes export REPO_ROOT=`git rev-parse --show-toplevel` -export RECIPE_ROOT=$REPO_ROOT/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe +export RECIPE_ROOT=$REPO_ROOT/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe cd $RECIPE_ROOT ``` diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/custom_setup_experiment.py b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/custom_setup_experiment.py similarity index 100% rename from training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/custom_setup_experiment.py rename to training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/custom_setup_experiment.py diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/launcher.sh b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/launcher.sh similarity index 100% rename from training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/launcher.sh rename to training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/launcher.sh diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/recipe_launch_command.sh b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/recipe_launch_command.sh similarity index 100% rename from training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/recipe_launch_command.sh rename to training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/recipe_launch_command.sh diff --git 
a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/templates/workload-config-configmap.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/templates/workload-config-configmap.yaml new file mode 100644 index 00000000..a1d54cee --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/templates/workload-config-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +{{- if .Values.workload.configFile }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-config" +data: + workload-configuration: |- +{{- if .Values.workload_config }} +{{ .Values.workload_config | nindent 4 }} +{{- else }} +{{ "config: null" | nindent 4 }} +{{- end }} +{{- end }} diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/templates/workload-job.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/templates/workload-job.yaml similarity index 100% rename from training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/templates/workload-job.yaml rename to training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/templates/workload-job.yaml diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/templates/workload-launcher-configmap.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/templates/workload-launcher-configmap.yaml new file mode 100644 index 00000000..7026e0f1 --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/templates/workload-launcher-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-launcher" +data: + launch-workload.sh: |- +{{- if .Values.workload_launcher }} +{{ .Values.workload_launcher | nindent 4 }} +{{- else }} + #!/bin/bash + echo "No workload launcher specified" + exit 1 +{{- end }} diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/templates/workload-svc.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/templates/workload-svc.yaml new file mode 100644 index 00000000..7cfe220b --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/templates/workload-svc.yaml @@ -0,0 +1,22 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: Service +metadata: + name: "{{ .Release.Name }}" +spec: + clusterIP: None + selector: + jobset.sigs.k8s.io/jobset-name: "{{ .Release.Name }}" diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/values.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/values.yaml similarity index 100% rename from training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/32node-BF16-GBS4096/recipe/values.yaml rename to training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2511/32node-BF16-GBS4096/recipe/values.yaml diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/Chart.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/Chart.yaml new file mode 100644 index 00000000..af46c11a --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/Chart.yaml @@ -0,0 +1,20 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v2 +name: a4_jobset_workload +description: a4_jobset_workload +type: application +version: 0.1.0 +appVersion: "1.16.0" diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/README.md b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/README.md new file mode 100644 index 00000000..4a308ea8 --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/README.md @@ -0,0 +1,157 @@ + +# Pretrain qwen3-235b-a22b-bf16-gbs4096-gpus128 workloads on a4 GKE Node pools with Megatron-Bridge + +This recipe outlines the steps for running a qwen3-235b-a22b pretraining +workload on [a4 GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the +[NVIDIA Megatron-Bridge framework](https://github.com/NVIDIA-NeMo/Megatron-Bridge). + +## Orchestration and deployment tools + +For this recipe, the following setup is used: + +- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) +- Pretraining job configuration and deployment - A Helm chart is used to configure and deploy the Kubernetes Jobset resource which manages the execution of the [Megatron-Bridge pretraining workload](https://github.com/NVIDIA-NeMo/Megatron-Bridge). + +## Test environment + +This recipe has been optimized for and tested with the following configuration: + +- GKE cluster: Please follow Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4x) to create your a4x GKE cluster. +- Node Configuration: 16 nodes (8 GPUs per node, 128 GPUs total). +- GPU Architecture: NVIDIA Blackwell. 
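
The recipe name encodes this scale: the 4096-sample global batch ("GBS4096") divides evenly across the 16 × 8 = 128 GPUs above. A quick sanity check of that arithmetic (illustrative only; `micro_batch_size = 1` and the pure data-parallel layout are assumptions for the sketch, not the recipe's actual Megatron-Bridge parallelism settings):

```python
# Sanity-check the scale encoded in the recipe name (illustrative assumptions).
nodes = 16
gpus_per_node = 8
world_size = nodes * gpus_per_node          # 128 GPUs total
global_batch_size = 4096                    # the "GBS4096" in the recipe name

# Assume pure data parallelism and micro_batch_size = 1 for illustration;
# the real TP/PP/EP layout is set by the Megatron-Bridge config.
micro_batch_size = 1
data_parallel_size = world_size

# Gradient accumulation steps needed so each optimizer step sees GBS samples.
grad_accum_steps = global_batch_size // (data_parallel_size * micro_batch_size)

print(world_size)        # 128
print(grad_accum_steps)  # 32
```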
+
+## Training dataset
+
+This recipe uses a mock pretraining dataset provided by [Megatron Bridge Framework Datasets utils](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/scripts/performance/utils/datasets.py).
+
+## Docker container image
+
+This recipe uses the following Docker images:
+
+- `nvcr.io/nvidia/nemo:26.02`
+- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-arm64:v1.1.1`
+
+## Run the recipe
+
+From your client workstation, complete the following steps:
+
+### Configure environment settings
+
+Set the environment variables to match your environment:
+
+```bash
+export PROJECT_ID=<PROJECT_ID>
+export CLUSTER_REGION=<CLUSTER_REGION>
+export CLUSTER_NAME=<CLUSTER_NAME>
+export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs://
+export KUEUE_NAME=<KUEUE_NAME>
+export HF_TOKEN=<HF_TOKEN>
+```
+
+Replace the following values:
+
+- `<PROJECT_ID>`: your Google Cloud project ID.
+- `<CLUSTER_REGION>`: the region where your cluster is located.
+- `<CLUSTER_NAME>`: the name of your GKE cluster.
+- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the gs:// prefix.
+- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the cluster toolkit is a4x.
+- `<HF_TOKEN>`: your Hugging Face access token.
+
+Set the default project:
+
+```bash
+gcloud config set project $PROJECT_ID
+```
+
+### Get cluster credentials
+
+```bash
+gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
+```
+
+### Get the recipe
+
+Clone the `gpu-recipes` repository and set a reference to the recipe folder.
+ +``` +git clone https://github.com/ai-hypercomputer/gpu-recipes.git +cd gpu-recipes +export REPO_ROOT=`git rev-parse --show-toplevel` +export RECIPE_ROOT=$REPO_ROOT/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe +cd $RECIPE_ROOT +``` + +### Configure and submit a pretraining job + +#### Using 16 nodes (128 gpus) bf16 precision + +To execute the job with the default settings, run the following command from your client: + +```bash +cd $RECIPE_ROOT +export WORKLOAD_NAME=$USER-qwen3-235b-16node-bf16-gbs4096 +helm install $WORKLOAD_NAME . -f values.yaml \ +--set-file workload_launcher=launcher.sh \ +--set-file workload_config=custom_setup_experiment.py \ +--set workload.image=nvcr.io/nvidia/nemo:26.02 \ +--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \ +--set volumes.gcsMounts[0].mountPath=/job-logs \ +--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \ +--set workload.envs[3].name=HF_TOKEN \ +--set workload.envs[3].value=${HF_TOKEN} \ +--set queue=${KUEUE_NAME} +``` + +**Examples** + +- To set the number of training steps to 100, run the following command from + your client: + + ```bash + cd $RECIPE_ROOT + export WORKLOAD_NAME=$USER-qwen3-235b-16node-bf16-gbs4096 + helm install $WORKLOAD_NAME . 
-f values.yaml \
+  --set-file workload_launcher=launcher.sh \
+  --set-file workload_config=custom_setup_experiment.py \
+  --set workload.image=nvcr.io/nvidia/nemo:26.02 \
+  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+  --set volumes.gcsMounts[0].mountPath=/job-logs \
+  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+  --set workload.envs[3].name=HF_TOKEN \
+  --set workload.envs[3].value=${HF_TOKEN} \
+  --set queue=${KUEUE_NAME} \
+  --set workload.arguments[0]="trainer.train_iters=100"
+  ```
+
+### Monitor the job
+
+To check the status of pods in your job, run the following command:
+
+```
+kubectl get pods | grep JOB_NAME_PREFIX
+```
+
+Replace the following:
+
+- JOB_NAME_PREFIX - your job name prefix. For example $USER-qwen3-235b-16node-bf16-gbs4096.
+
+To get the logs for one of the pods, run the following command:
+
+```
+kubectl logs POD_NAME
+```
+
+Information about the training job's progress, including crucial details such as
+loss, step count, and step time, is generated by the rank 0 process.
+This process runs on the pod whose name begins with
+`JOB_NAME_PREFIX-workload-0-0`.
+For example: `$USER-qwen3-235b-16node-bf16-gbs4096-workload-0-0-s9zrv`.
+
+### Uninstall the Helm release
+
+You can delete the job and other resources created by the Helm chart.
To +uninstall Helm, run the following command from your client: + +```bash +helm uninstall $USER-qwen3-235b-16node-bf16-gbs4096 +``` diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/custom_setup_experiment.py b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/custom_setup_experiment.py new file mode 100644 index 00000000..2337fdec --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/custom_setup_experiment.py @@ -0,0 +1,327 @@ +import glob +import logging +import os +import sys +import time +from pathlib import Path +from typing import Any, Dict, List, Optional + +import nemo_run as run +from nemo_run.config import get_nemorun_home + + +try: + from argument_parser import parse_cli_args + from utils.evaluate import calc_convergence_and_performance + from utils.executors import dgxc_executor, slurm_executor + from utils.utils import get_exp_name_config, select_config_variant_interactive +except (ImportError, ModuleNotFoundError): + from .argument_parser import parse_cli_args + from .utils.evaluate import calc_convergence_and_performance + from .utils.executors import dgxc_executor, slurm_executor + from .utils.utils import get_exp_name_config, select_config_variant_interactive + +try: + import wandb + + HAVE_WANDB = True +except (ImportError, ModuleNotFoundError): + HAVE_WANDB = False + +try: + from perf_plugins import NsysPlugin, PerfEnvPlugin, PyTorchProfilerPlugin + from resiliency_plugins import FaultTolerancePlugin +except (ImportError, ModuleNotFoundError): + from .perf_plugins import NsysPlugin, PerfEnvPlugin, PyTorchProfilerPlugin + from .resiliency_plugins import FaultTolerancePlugin + + +SCRIPT_DIR = Path(__file__).parent.resolve() +ENTRYPOINT_PEFORMANCE = "run_script.py" +ENTRYPOINT_RECIPE = "run_recipe.py" + +logging.basicConfig(level=logging.DEBUG) +logger = logging.getLogger(__name__) + 
+ + + +def main( + use_recipes: bool, + model_family_name: str, + model_recipe_name: str, + task: str, + compute_dtype: str, + gpu: str, + hf_token: str, + detach: bool, + dryrun: bool, + enable_vboost: bool, + enable_nsys: bool, + pytorch_profiler: bool, + moe_a2a_overlap: bool, + tp_size: Optional[int], + pp_size: Optional[int], + cp_size: Optional[int], + ep_size: Optional[int], + wandb_key: str, + wandb_project_name: str, + wandb_experiment_name: str, + wandb_entity_name: str, + profiling_start_step: int, + profiling_stop_step: int, + record_memory_history: bool, + profiling_gpu_metrics: bool, + profiling_ranks: Optional[List[int]], + nsys_trace: Optional[List[str]], + nsys_extra_args: Optional[List[str]], + nemo_home: str, + account: str, + partition: str, + log_dir: str, + gpus_per_node: int, + time_limit: str, + container_image: str, + custom_mounts: List[str], + custom_env_vars: Dict[str, str], + custom_srun_args: List[str], + custom_bash_cmds: List[List[str]], + nccl_ub: bool, + pretrained_checkpoint: Optional[str], + num_gpus: int, + is_long_convergence_run: bool, + additional_slurm_params: Optional[Dict[str, Any]], + golden_values_path: str, + convergence_params: Dict[str, Any], + performance_params: Dict[str, Any], + memory_params: Dict[str, Any], + max_retries: int, + dgxc_base_url: str, + dgxc_cluster: str, + dgxc_kube_apiserver_url: str, + dgxc_app_id: str, + dgxc_app_secret: str, + dgxc_project_name: str, + dgxc_pvc_claim_name: str, + dgxc_pvc_mount_path: str, + config_variant: str = "v1", +): + """Sets up the experiment and runs it.""" + if ( + model_family_name in ["qwen3"] + and model_recipe_name + in [ + "qwen3_30b_a3b", + "qwen3_235b_a22b", + ] + and task == "pretrain" + ): + assert hf_token is not None, "HF token is required for Qwen3 tokenizer. NullTokenizer to be used soon." 
+ + if wandb_key is not None: + assert wandb_project_name is not None and wandb_experiment_name is not None, ( + "both wandb_project_name and wandb_experiment_name are required for logging with WandB" + ) + + if use_recipes: + script_name = ENTRYPOINT_RECIPE + exp_name = ( + wandb_experiment_name + if wandb_experiment_name is not None + else f"{model_recipe_name}_{task}_{num_gpus}gpu_{gpu}" + ) + + else: + script_name = ENTRYPOINT_PEFORMANCE + exp_config = get_exp_name_config( + args, model_family_name, model_recipe_name, gpu, compute_dtype, task, config_variant + ) + exp_name = ( + wandb_experiment_name + if wandb_experiment_name is not None + else f"{task}_{model_recipe_name}_{compute_dtype}_{exp_config}" + ) + + if pretrained_checkpoint is not None: + custom_mounts.append(f"{pretrained_checkpoint}:{pretrained_checkpoint}") + + import os + rank = os.environ.get('RANK', '0') + exp_name += f'_worker{rank}' + + run_script_path = SCRIPT_DIR / script_name + logger.info(f"Run script path: {run_script_path}") + if not run_script_path.is_file(): + logger.error(f"Specified run script not found: {run_script_path}") + sys.exit(1) + + custom_mounts.extend( + [ + f"{run_script_path}:{run_script_path}", + f"{SCRIPT_DIR}:{SCRIPT_DIR}", + ] + ) + + if nccl_ub: + custom_env_vars.update({"NCCL_NVLS_ENABLE": "1", "NCCL_CTA_POLICY": "1"}) + + executor = run.LocalExecutor() + + plugins = [] + + if not use_recipes: + plugins.append( + PerfEnvPlugin( + enable_vboost=enable_vboost, + moe_a2a_overlap=moe_a2a_overlap, + tp_size=tp_size, + pp_size=pp_size, + cp_size=cp_size, + ep_size=ep_size, + model_family_name=model_family_name, + model_recipe_name=model_recipe_name, + gpu=gpu, + compute_dtype=compute_dtype, + train_task=task, + config_variant=config_variant, + ) + ) + + if enable_nsys: + plugins.append( + NsysPlugin( + profile_step_start=profiling_start_step, + profile_step_end=profiling_stop_step, + nsys_gpu_metrics=profiling_gpu_metrics, + profile_ranks=profiling_ranks, + 
nsys_trace=args.nsys_trace, + nsys_extra_args=args.nsys_extra_args, + ) + ) + if pytorch_profiler: + plugins.append( + PyTorchProfilerPlugin( + profile_step_start=profiling_start_step, + profile_step_end=profiling_stop_step, + profile_ranks=profiling_ranks, + record_memory_history=record_memory_history, + ) + ) + + nemorun_script = run.Script( + path=str(run_script_path), + entrypoint="python", + env={"PYTHONPATH": f"{SCRIPT_DIR}:$PYTHONPATH"}, + args=list(sys.argv[1:]), + ) + + logger.info("Will launch the following command with Nemo-Run: %s", " ".join(nemorun_script.to_command())) + + run.run( + nemorun_script, + executor=executor, + plugins=plugins, + dryrun=dryrun, + detach=detach, + name=exp_name, + ) + + +if __name__ == "__main__": + parser = parse_cli_args() + args, unknown_args = parser.parse_known_args() + + assert not (args.enable_nsys and args.pytorch_profiler), ( + "Both NSys and PyTorch profiler cannot be enabled at the same time" + ) + + # probably better to use parser.parse_args() and make unknowns an error, + # but for now we'll just issue a warning. 
+ if unknown_args: + logger.warning(f"Ignoring unrecognized arguments: {' '.join(unknown_args)}") + + # Handle --list_config_variants: show available variants and interactively select + config_variant = args.config_variant + if args.list_config_variants: + config_variant = select_config_variant_interactive( + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + gpu=args.gpu, + compute_dtype=args.compute_dtype, + task=args.task, + ) + + main( + use_recipes=args.use_recipes, + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + task=args.task, + compute_dtype=args.compute_dtype, + gpu=args.gpu, + hf_token=args.hf_token, + detach=args.detach, + dryrun=args.dryrun, + enable_vboost=args.enable_vboost, + enable_nsys=args.enable_nsys, + pytorch_profiler=args.pytorch_profiler, + moe_a2a_overlap=args.moe_a2a_overlap, + tp_size=args.tensor_model_parallel_size, + pp_size=args.pipeline_model_parallel_size, + cp_size=args.context_parallel_size, + ep_size=args.expert_model_parallel_size, + wandb_key=args.wandb_key, + wandb_project_name=args.wandb_project_name, + wandb_experiment_name=args.wandb_experiment_name, + wandb_entity_name=args.wandb_entity_name, + profiling_start_step=args.profiling_start_step, + profiling_stop_step=args.profiling_stop_step, + record_memory_history=args.record_memory_history, + profiling_gpu_metrics=args.profiling_gpu_metrics, + profiling_ranks=args.profiling_ranks, + nsys_trace=args.nsys_trace, + nsys_extra_args=args.nsys_extra_args, + nemo_home=args.nemo_home, + account=args.account, + partition=args.partition, + log_dir=args.log_dir, + gpus_per_node=args.gpus_per_node, + time_limit=args.time_limit, + container_image=args.container_image, + custom_mounts=args.custom_mounts, + custom_env_vars=args.custom_env_vars, + custom_srun_args=args.custom_srun_args, + custom_bash_cmds=args.custom_bash_cmds, + nccl_ub=args.nccl_ub, + pretrained_checkpoint=args.pretrained_checkpoint, + 
num_gpus=args.num_gpus, + is_long_convergence_run=args.is_long_convergence_run, + additional_slurm_params=args.additional_slurm_params, + golden_values_path=args.golden_values_path, + convergence_params={ + "correlation_threshold": args.correlation_threshold, + "high_loss_tolerance": args.high_loss_tolerance, + "medium_loss_tolerance": args.medium_loss_tolerance, + "low_loss_tolerance": args.low_loss_tolerance, + "final_loss_tolerance": args.final_loss_tolerance, + "max_outlier_ratio": args.max_outlier_ratio, + "outlier_threshold": args.outlier_threshold, + "skip_first_percent_loss": args.skip_first_percent_loss, + }, + performance_params={ + "timing_threshold": args.timing_threshold, + "skip_first_percent_time": args.skip_first_percent_time, + }, + memory_params={ + "memory_threshold": args.memory_threshold, + }, + max_retries=args.max_retries, + dgxc_base_url=args.dgxc_base_url, + dgxc_cluster=args.dgxc_cluster, + dgxc_kube_apiserver_url=args.dgxc_kube_apiserver_url, + dgxc_app_id=args.dgxc_app_id, + dgxc_app_secret=args.dgxc_app_secret, + dgxc_project_name=args.dgxc_project_name, + dgxc_pvc_claim_name=args.dgxc_pvc_claim_name, + dgxc_pvc_mount_path=args.dgxc_pvc_mount_path, + config_variant=config_variant, + ) \ No newline at end of file diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/launcher.sh b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/launcher.sh new file mode 100644 index 00000000..4f7aa991 --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/launcher.sh @@ -0,0 +1,150 @@ +usage() +{ +cat << EOF +usage: bash ./launcher.sh [config-override [config-override ...]] +config-override (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000. 
+EOF +} + +parse_args() { + while [[ "$1" != "" ]]; do + case $(grep -o "=" <<< "$1" | wc -l) in + 1 ) + config_overrides+=("$1") + ;; + * ) + echo "Invalid config override: $1" + usage + exit 1 + esac + shift + done + config_overrides="${config_overrides[*]}" +} + +config_overrides=() +parse_args "$@" + +if [[ -z "${config_overrides[*]}" ]]; then + echo "No NeMo config overrides specified" +else + echo "NeMo config overrides:" + echo " ${config_overrides}" +fi + +export LD_LIBRARY_PATH="/usr/local/cuda/compat/lib:$NCCL_PLUGIN_PATH:$LD_LIBRARY_PATH" +ldconfig "$LD_LIBRARY_PATH" +echo "Added $LD_LIBRARY_PATH to ldconfig:" +ldconfig -p | grep libcuda | sed 's/^/ /' +echo "" + +if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then + explicit_log_dir=${EXPLICIT_LOG_DIR} +else + explicit_log_dir=workload_logs +fi +echo "Logging to ${explicit_log_dir}" + +if [[ -n "${TOKENIZER_PATH}" ]]; then + echo "Getting tokenizer files" + cp "${TOKENIZER_PATH}"/* . + echo "" +fi + +echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes" + +pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger + +# Create the nsys directory. 
+mkdir -p "${explicit_log_dir}/nsys" + +# Collect diagnostics to a single line +kv="\"kernel_version\": \"$(uname --kernel-release)\"" +if command -v nvidia-smi &> /dev/null; then + cuda_v=$(nvidia-smi -q -x | grep -Po '(?<=<cuda_version>).*(?=</cuda_version>)' || true) + driver_v=$(nvidia-smi -q -x | grep -Po '(?<=<driver_version>).*(?=</driver_version>)' || true) + vbios_v=$(nvidia-smi -q -x | grep -Po '(?<=<vbios_version>).*(?=</vbios_version>)' | head -n1 || true) + kv="${kv}, \"cuda_version\": \"${cuda_v}\"" + kv="${kv}, \"driver_version\": \"${driver_v}\"" + kv="${kv}, \"vbios_version\": \"${vbios_v}\"" +fi +echo "VERSION_DIAGNOSTICS: {${kv}}" + + +# HF_TOKEN is injected by the Helm chart; fail fast if it is not set. +export HF_TOKEN="${HF_TOKEN:?HF_TOKEN must be set in the environment}" + +cd /opt +rm -rf Megatron-Bridge +git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git +cd Megatron-Bridge +git checkout f7a9428f301fa17ac374d5e7166a63b0aa4771af +git submodule update --init --recursive +sed -i -e '/pretrain(config=recipe/i \ recipe.dist.distributed_timeout_minutes = 60' scripts/performance/run_script.py +ls + +cp $CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH scripts/performance/ + +worker_command=$(cat <<- EOM + if [ "\$RANK" -eq "0" ]; then + echo "Worker 0 is waiting a few seconds..."
; + sleep 3 ; + echo "The detected environment within worker rank 0 is:" ; + env | sed 's/^/ /' ; + fi ; + + cd /opt/Megatron-Bridge ; + + numactl \ + --cpunodebind=\$((LOCAL_RANK/4)) \ + --membind=\$((LOCAL_RANK/4)) nsys profile \ + -t nvtx,cuda \ + --cuda-event-trace=false \ + --sample=none \ + --capture-range=cudaProfilerApi \ + --capture-range-end=stop \ + --kill none \ + -o "/${explicit_log_dir}/$JOB_IDENTIFIER/rank-\$RANK" \ + --force-overwrite true \ + --session-new "nsys-\$RANDOM-\$RANK" \ + nice -10 \ + python scripts/performance/custom_setup_experiment.py \ + --gpu b200 \ + --model_family_name qwen \ + --model_recipe_name qwen3_235b_a22b \ + --gpus_per_node 8 \ + --num_gpus 128 \ + --seq_length 4096 \ + --compute_dtype bf16 \ + --global_batch_size 4096 \ + --tensor_model_parallel_size 1 \ + --pipeline_model_parallel_size 8 \ + --virtual_pipeline_model_parallel_size 4 \ + --expert_model_parallel_size 8 \ + --expert_tensor_parallel_size 1 \ + --moe_a2a_overlap True \ + --max_steps 30 + +EOM +) + +echo "$worker_command" > worker_command.sh +chmod 777 worker_command.sh + +torchrun \ +--nproc-per-node="8" \ +--nnodes="16" \ +--node_rank="${JOB_COMPLETION_INDEX}" \ +--rdzv_id="${JOB_IDENTIFIER}" \ +--master_addr="${MASTER_ADDR}" \ +--master_port="${MASTER_PORT}" \ +--no-python bash worker_command.sh + + +if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then + mkdir -p "${ARTIFACT_DIR}" + cp -r "${explicit_log_dir}"/* "${ARTIFACT_DIR}/" + env > "${ARTIFACT_DIR}/environ.txt" + ls "${ARTIFACT_DIR}" +fi +echo "Training completed" +echo "Pod on $(hostname --fqdn) is exiting" \ No newline at end of file diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/templates/workload-config-configmap.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/templates/workload-config-configmap.yaml new file mode 100644 index 00000000..a1d54cee --- /dev/null +++ 
b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/templates/workload-config-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +{{- if .Values.workload.configFile }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-config" +data: + workload-configuration: |- +{{- if .Values.workload_config }} +{{ .Values.workload_config | nindent 4 }} +{{- else }} +{{ "config: null" | nindent 4 }} +{{- end }} +{{- end }} diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/templates/workload-job.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/templates/workload-job.yaml new file mode 100644 index 00000000..b4ffa210 --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/templates/workload-job.yaml @@ -0,0 +1,333 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +{{$timestamp := now | date "2006-01-02-15-04-05"}} +{{$jobSuffix := randAlphaNum 4 | lower}} +{{$jobuuid := uuidv4}} +{{$nodes := div .Values.workload.gpus 8 | max 1}} +{{$gpusPerNode := min .Values.workload.gpus 8}} +{{- $root := . -}} + +apiVersion: jobset.x-k8s.io/v1alpha2 +kind: JobSet +metadata: + name: "{{ .Release.Name }}" + namespace: default + labels: + {{- if $root.Values.queue }} + kueue.x-k8s.io/queue-name: "{{ $root.Values.queue }}" + {{- end }} +spec: + {{- if $root.Values.queue }} + suspend: true + {{- end }} + failurePolicy: + maxRestarts: {{ default 0 $root.Values.workload.max_workload_restarts }} + replicatedJobs: + - name: workload + replicas: 1 + template: + spec: + parallelism: {{ $nodes }} + completions: {{ $nodes }} + backoffLimit: 0 + completionMode: Indexed + activeDeadlineSeconds: 14400 # 4 hours (4 * 60 * 60) + ttlSecondsAfterFinished: 43200 # 12 hours (12 * 60 * 60) + template: + metadata: + annotations: + kubectl.kubernetes.io/default-container: workload + {{- if $root.Values.volumes.gcsVolumes }} + gke-gcsfuse/volumes: "true" + gke-gcsfuse/cpu-limit: "500m" + gke-gcsfuse/memory-limit: "1Ti" + gke-gcsfuse/ephemeral-storage-limit: "2Ti" + {{- end }} + {{- if $root.Values.volumes.psVolumes }} + gke-parallelstore/volumes: "true" + gke-parallelstore/cpu-limit: "0" + gke-parallelstore/memory-limit: "0" + {{- end }} + {{- if and $root.Values.queue $root.Values.tasSettings.topologyRequest }} + {{- toYaml .Values.tasSettings.topologyRequest | nindent 14 }} + {{- end }} + {{- if and $root.Values.queue 
$root.Values.dwsSettings.maxRunDurationSeconds }} + provreq.kueue.x-k8s.io/maxRunDurationSeconds: "{{ $root.Values.dwsSettings.maxRunDurationSeconds }}" + {{- end }} + {{- if not $root.Values.network.hostNetwork }} + networking.gke.io/default-interface: "eth0" + networking.gke.io/interfaces: | + {{- if $root.Values.network.subnetworks }} + [ + {{- range $i, $subnetwork := $root.Values.network.subnetworks }} + {"interfaceName":"eth{{ $i }}","network":"{{ $subnetwork }}"}{{ eq (add $i 1) (len $root.Values.network.subnetworks) | ternary "" "," }} + {{- end }} + ] + {{- else }} + [ + {"interfaceName":"eth0","network":"default"}, + {"interfaceName":"eth1","network":"gvnic-1"}, + {{- range $i := until 8 }} + {"interfaceName":"eth{{ add 2 $i }}","network":"rdma-{{ $i }}"}{{ eq $i 7 | ternary "" ","}} + {{- end }} + ] + {{- end }} + {{- end }} + spec: + {{- if $root.Values.network.hostNetwork }} + hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet + {{- end }} + subdomain: "{{.Release.Name}}" + restartPolicy: Never + {{- if $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "In" + values: + {{- range $hostname := $root.Values.targetNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + {{- if $root.Values.avoidNodes }} + {{- if not $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + {{- end }} + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "NotIn" + values: + {{- range $hostname := $root.Values.avoidNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + tolerations: + - operator: "Exists" + key: nvidia.com/gpu + - operator: "Exists" + key: cloud.google.com/impending-node-termination + + volumes: + {{ if $root.Values.network.gibVersion }} + - name: gib + emptyDir: {} + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + configMap: + name:
"{{.Release.Name}}-config" + items: + - key: workload-configuration + path: {{ $root.Values.workload.configFile | default "workload-configuration" }} + {{- end }} + + - name: workload-launcher + configMap: + name: "{{.Release.Name}}-launcher" + + - name: shared-memory + emptyDir: + medium: "Memory" + sizeLimit: 250Gi + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + persistentVolumeClaim: + claimName: "{{ $pvc.claimName }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + csi: + driver: gcsfuse.csi.storage.gke.io + volumeAttributes: + bucketName: "{{ $gcs.bucketName }}" + {{- if $gcs.mountOptions }} + mountOptions: "{{ $gcs.mountOptions }}" + {{- end }} + {{- end}} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + hostPath: + path: /mnt/stateful_partition/kube-ephemeral-ssd + {{- end }} + + initContainers: + {{ if $root.Values.network.gibVersion }} + - name: nccl-plugin-installer + image: {{ $root.Values.network.gibVersion }} + imagePullPolicy: Always + args: + - | + set -ex + /scripts/container_entry.sh install --install-nccl + cp -R /var/lib/gib/lib64/. /target/usr/local/gib/lib64 + cp -R /var/lib/gib/. 
/target/usr/local/gib + command: + - /bin/sh + - -c + volumeMounts: + - mountPath: /target/usr/local/gib + name: gib + {{ end}} + + containers: + {{- if $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-sidecar + image: {{ $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-metadata-prefetch + image: {{ $root.Values.workload.gcsSidecarImage }} + {{- end }} + {{- if $root.Values.workload.psSidecarImage }} + - name: gke-parallelstore-sidecar + image: {{ $root.Values.workload.psSidecarImage }} + {{- end }} + + - name: workload + image: "{{ $root.Values.workload.image }}" + imagePullPolicy: Always + {{- if $root.Values.network.hostNetwork }} + securityContext: + privileged: true + {{- end }} + env: + - name: JOB_IDENTIFIER + value: "{{ .Release.Name }}-{{ $timestamp }}" + - name: JOB_TIMESTAMP + value: "{{ $timestamp }}" + - name: JOB_UUID + value: "{{ $jobuuid }}" + - name: JOB_ORCHESTRATOR + value: "gke" + # Add RANK based on the pod's index provided by the Indexed Job + # This is crucial for torch.distributed initialization. 
+ - name: JOB_COMPLETION_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index'] + - name: RANK_0_FQDN + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: HOSTNAME_PREFIX + value: "{{.Release.Name}}-workload-" + - name: DOMAIN_NAME + value: "{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_ADDR + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_PORT + value: "6002" + - name: WORLD_SIZE + value: "{{ $root.Values.workload.gpus }}" + - name: NNODES + value: "{{ $nodes }}" + - name: GPUS_PER_NODE + value: "{{ $gpusPerNode }}" + + - name: NCCL_PLUGIN_PATH + value: /usr/local/gib/lib64:/usr/local/nvidia/lib64 + + {{ if $root.Values.network.gibVersion }} + - name: NCCL_INIT_SCRIPT + value: "/usr/local/gib/scripts/set_nccl_env.sh" + {{ end }} + + {{ if $root.Values.network.ncclSettings }} + {{- toYaml .Values.network.ncclSettings | nindent 14 }} + {{ end }} + + {{ if $root.Values.workload.envs }} + {{- toYaml .Values.workload.envs | nindent 14 }} + {{ end }} + + command: + - bash + - -c + - | + echo "Pod on $(hostname --fqdn) is running" + echo "Pod is assigned job index of $JOB_COMPLETION_INDEX" + + if [[ -n "${NCCL_INIT_SCRIPT}" ]]; then + echo "Running NCCL init script: ${NCCL_INIT_SCRIPT}" + source ${NCCL_INIT_SCRIPT} + fi + + # Overriding NCCL_SOCKET_IFNAME definition + export NCCL_SOCKET_IFNAME="eth0,eth1" + export NCCL_TUNER_CONFIG_PATH=/usr/local/gib/configs/tuner_config_a4.txtpb + + echo "Launching workload with the following arguments:" + {{- range $root.Values.workload.defaultArguments }} + echo " {{ . }}" + {{- end }} + {{- range $root.Values.workload.arguments }} + echo " {{ . }}" + {{- end }} + echo "" + + sleep 10 + + bash /workload/launcher/launch-workload.sh \ + {{- range $root.Values.workload.defaultArguments }} + {{ . }} \ + {{- end }} + {{- range $root.Values.workload.arguments }} + {{ . 
}} \ + {{- end }} + + + volumeMounts: + {{ if $root.Values.network.gibVersion }} + - name: gib + mountPath: /usr/local/gib + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + mountPath: {{ $root.Values.workload.configPath | default "/workload/configs" }} + {{- end }} + + - name: workload-launcher + mountPath: /workload/launcher + + - name: shared-memory + mountPath: /dev/shm + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + mountPath: "{{ $pvc.mountPath }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + mountPath: "{{ $gcs.mountPath }}" + {{- end }} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + mountPath: "{{ $root.Values.volumes.ssdMountPath }}" + {{- end }} + + resources: + limits: + nvidia.com/gpu: {{ $gpusPerNode }} diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/templates/workload-launcher-configmap.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/templates/workload-launcher-configmap.yaml new file mode 100644 index 00000000..7026e0f1 --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/templates/workload-launcher-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-launcher" +data: + launch-workload.sh: |- +{{- if .Values.workload_launcher }} +{{ .Values.workload_launcher | nindent 4 }} +{{- else }} + #!/bin/bash + echo "No workload launcher specified" + exit 1 +{{- end }} diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/templates/workload-svc.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/templates/workload-svc.yaml new file mode 100644 index 00000000..7cfe220b --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/templates/workload-svc.yaml @@ -0,0 +1,22 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: Service +metadata: + name: "{{ .Release.Name }}" +spec: + clusterIP: None + selector: + jobset.sigs.k8s.io/jobset-name: "{{ .Release.Name }}" diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/values.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/values.yaml new file mode 100644 index 00000000..80e6860b --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/16node-BF16-GBS4096/recipe/values.yaml @@ -0,0 +1,33 @@ +queue: null +dwsSettings: + maxRunDurationSeconds: null +tasSettings: + topologyRequest: + kueue.x-k8s.io/podset-preferred-topology: kubernetes.io/hostname +volumes: + gcsVolumes: true + psVolumes: false + gcsMounts: + - bucketName: null + mountPath: null +workload: + gpus: 128 + image: nvcr.io/nvidia/nemo:26.02 + defaultArguments[]: null + arguments[]: null + configFile: custom_setup_experiment.py + configPath: /workload/configs/ + envs: + - name: ARTIFACT_DIR + value: null + - name: GLOO_SOCKET_IFNAME + value: eth0 + - name: CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH + value: /workload/configs/custom_setup_experiment.py +network: + hostNetwork: true + subnetworks[]: null + gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1 + ncclSettings: + - name: NCCL_DEBUG + value: WARN diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/Chart.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/Chart.yaml new file mode 100644 index 00000000..af46c11a --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/Chart.yaml @@ -0,0 +1,20 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the 
License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v2 +name: a4_jobset_workload +description: a4_jobset_workload +type: application +version: 0.1.0 +appVersion: "1.16.0" diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/README.md b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/README.md new file mode 100644 index 00000000..b008e5f3 --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/README.md @@ -0,0 +1,157 @@ + +# Pretrain qwen3-235b-a22b-bf16-gbs4096-gpus256 workloads on a4 GKE Node pools with Megatron-Bridge + +This recipe outlines the steps for running a qwen3-235b-a22b pretraining +workload on [a4 GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the +[NVIDIA Megatron-Bridge framework](https://github.com/NVIDIA-NeMo/Megatron-Bridge). + +## Orchestration and deployment tools + +For this recipe, the following setup is used: + +- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) +- Pretraining job configuration and deployment - A Helm chart is used to configure and deploy the Kubernetes Jobset resource which manages the execution of the [Megatron-Bridge pretraining workload](https://github.com/NVIDIA-NeMo/Megatron-Bridge). 
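The chart runs the workload as an Indexed Job inside a JobSet: each pod receives a completion index, exposed to the container as `JOB_COMPLETION_INDEX`, and `torchrun` combines that node rank with each process's local rank to form the global rank used by `torch.distributed`. A minimal sketch of that mapping (hypothetical helper, not part of the recipe):

```python
def global_rank(node_rank: int, local_rank: int, gpus_per_node: int = 8) -> int:
    """Global rank as torchrun derives it: node index * processes per node + local rank."""
    return node_rank * gpus_per_node + local_rank

# The pod with completion index 3 hosts local ranks 0-7, which become
# global ranks 24-31 on an 8-GPU A4 node.
ranks = [global_rank(3, lr) for lr in range(8)]
print(ranks)  # [24, 25, 26, 27, 28, 29, 30, 31]
```

Every worker rendezvouses at the `MASTER_ADDR`/`MASTER_PORT` set by the chart, which point at the pod with completion index 0 (the pod hosting global rank 0).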
+ +## Test environment + +This recipe has been optimized for and tested with the following configuration: + +- GKE cluster: Please follow the Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4) to create your A4 GKE cluster. +- Node configuration: 32 nodes (8 GPUs per node, 256 GPUs total). +- GPU architecture: NVIDIA Blackwell. + +## Training dataset + +This recipe uses a mock pretraining dataset provided by the [Megatron Bridge Framework Datasets utils](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/scripts/performance/utils/datasets.py). + +## Docker container images + +This recipe uses the following Docker images: + +- `nvcr.io/nvidia/nemo:26.02` +- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1` + +## Run the recipe + +From your client workstation, complete the following steps: + +### Configure environment settings + +Set the environment variables to match your environment: + +```bash +export PROJECT_ID=<PROJECT_ID> +export CLUSTER_REGION=<CLUSTER_REGION> +export CLUSTER_NAME=<CLUSTER_NAME> +export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs:// +export KUEUE_NAME=<KUEUE_NAME> +export HF_TOKEN=<HF_TOKEN> +``` + +Replace the following values: + +- `<PROJECT_ID>`: your Google Cloud project ID. +- `<CLUSTER_REGION>`: the region where your cluster is located. +- `<CLUSTER_NAME>`: the name of your GKE cluster. +- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the gs:// prefix. +- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4`. +- `<HF_TOKEN>`: your Hugging Face access token. + +Set the default project: + +```bash +gcloud config set project $PROJECT_ID +``` + +### Get cluster credentials + +```bash +gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION +``` + +### Get the recipe + +Clone the `gpu-recipes` repository and set a reference to the recipe folder.
+ +``` +git clone https://github.com/ai-hypercomputer/gpu-recipes.git +cd gpu-recipes +export REPO_ROOT=`git rev-parse --show-toplevel` +export RECIPE_ROOT=$REPO_ROOT/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe +cd $RECIPE_ROOT +``` + +### Configure and submit a pretraining job + +#### Using 32 nodes (256 gpus) bf16 precision + +To execute the job with the default settings, run the following command from your client: + +```bash +cd $RECIPE_ROOT +export WORKLOAD_NAME=$USER-qwen3-235b-32node-bf16-seq4096-gbs4096 +helm install $WORKLOAD_NAME . -f values.yaml \ +--set-file workload_launcher=launcher.sh \ +--set-file workload_config=custom_setup_experiment.py \ +--set workload.image=nvcr.io/nvidia/nemo:26.02 \ +--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \ +--set volumes.gcsMounts[0].mountPath=/job-logs \ +--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \ +--set workload.envs[3].name=HF_TOKEN \ +--set workload.envs[3].value=${HF_TOKEN} \ +--set queue=${KUEUE_NAME} +``` + +**Examples** + +- To set the number of training steps to 100, run the following command from + your client: + + ```bash + cd $RECIPE_ROOT + export WORKLOAD_NAME=$USER-qwen3-235b-32node-bf16-seq4096-gbs4096 + helm install $WORKLOAD_NAME . 
-f values.yaml \ + --set-file workload_launcher=launcher.sh \ + --set-file workload_config=custom_setup_experiment.py \ + --set workload.image=nvcr.io/nvidia/nemo:26.02 \ + --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \ + --set volumes.gcsMounts[0].mountPath=/job-logs \ + --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \ + --set workload.envs[3].name=HF_TOKEN \ + --set workload.envs[3].value=${HF_TOKEN} \ + --set queue=${KUEUE_NAME} \ + --set workload.arguments[0]="trainer.train_iters=100" + ``` + +### Monitor the job + +To check the status of pods in your job, run the following command: + +``` +kubectl get pods | grep JOB_NAME_PREFIX +``` + +Replace the following: + +- JOB_NAME_PREFIX - your job name prefix. For example $USER-qwen3-235b-32node-bf16-seq4096-gbs4096. + +To get the logs for one of the pods, run the following command: + +``` +kubectl logs POD_NAME +``` + +Information about the training job's progress, including crucial details such as +loss, step count, and step time, is generated by the rank 0 process. +This process runs on the pod whose name begins with +`JOB_NAME_PREFIX-workload-0-0`. +For example: `$USER-qwen3-235b-32node-bf16-seq4096-gbs4096-workload-0-0-s9zrv`. + +### Uninstall the Helm release + +You can delete the job and other resources created by the Helm chart.
To +uninstall Helm, run the following command from your client: + +```bash +helm uninstall $USER-qwen3-235b-32node-bf16-seq4096-gbs4096 +``` \ No newline at end of file diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/custom_setup_experiment.py b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/custom_setup_experiment.py new file mode 100644 index 00000000..2337fdec --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/custom_setup_experiment.py @@ -0,0 +1,327 @@ +import glob +import logging +import os +import sys +import time +from pathlib import Path +from typing import Any, Dict, List, Optional + +import nemo_run as run +from nemo_run.config import get_nemorun_home + + +try: + from argument_parser import parse_cli_args + from utils.evaluate import calc_convergence_and_performance + from utils.executors import dgxc_executor, slurm_executor + from utils.utils import get_exp_name_config, select_config_variant_interactive +except (ImportError, ModuleNotFoundError): + from .argument_parser import parse_cli_args + from .utils.evaluate import calc_convergence_and_performance + from .utils.executors import dgxc_executor, slurm_executor + from .utils.utils import get_exp_name_config, select_config_variant_interactive + +try: + import wandb + + HAVE_WANDB = True +except (ImportError, ModuleNotFoundError): + HAVE_WANDB = False + +try: + from perf_plugins import NsysPlugin, PerfEnvPlugin, PyTorchProfilerPlugin + from resiliency_plugins import FaultTolerancePlugin +except (ImportError, ModuleNotFoundError): + from .perf_plugins import NsysPlugin, PerfEnvPlugin, PyTorchProfilerPlugin + from .resiliency_plugins import FaultTolerancePlugin + + +SCRIPT_DIR = Path(__file__).parent.resolve() +ENTRYPOINT_PEFORMANCE = "run_script.py" +ENTRYPOINT_RECIPE = "run_recipe.py" + 
+logging.basicConfig(level=logging.DEBUG) +logger = logging.getLogger(__name__) + + + + +def main( + use_recipes: bool, + model_family_name: str, + model_recipe_name: str, + task: str, + compute_dtype: str, + gpu: str, + hf_token: str, + detach: bool, + dryrun: bool, + enable_vboost: bool, + enable_nsys: bool, + pytorch_profiler: bool, + moe_a2a_overlap: bool, + tp_size: Optional[int], + pp_size: Optional[int], + cp_size: Optional[int], + ep_size: Optional[int], + wandb_key: str, + wandb_project_name: str, + wandb_experiment_name: str, + wandb_entity_name: str, + profiling_start_step: int, + profiling_stop_step: int, + record_memory_history: bool, + profiling_gpu_metrics: bool, + profiling_ranks: Optional[List[int]], + nsys_trace: Optional[List[str]], + nsys_extra_args: Optional[List[str]], + nemo_home: str, + account: str, + partition: str, + log_dir: str, + gpus_per_node: int, + time_limit: str, + container_image: str, + custom_mounts: List[str], + custom_env_vars: Dict[str, str], + custom_srun_args: List[str], + custom_bash_cmds: List[List[str]], + nccl_ub: bool, + pretrained_checkpoint: Optional[str], + num_gpus: int, + is_long_convergence_run: bool, + additional_slurm_params: Optional[Dict[str, Any]], + golden_values_path: str, + convergence_params: Dict[str, Any], + performance_params: Dict[str, Any], + memory_params: Dict[str, Any], + max_retries: int, + dgxc_base_url: str, + dgxc_cluster: str, + dgxc_kube_apiserver_url: str, + dgxc_app_id: str, + dgxc_app_secret: str, + dgxc_project_name: str, + dgxc_pvc_claim_name: str, + dgxc_pvc_mount_path: str, + config_variant: str = "v1", +): + """Sets up the experiment and runs it.""" + if ( + model_family_name in ["qwen3"] + and model_recipe_name + in [ + "qwen3_30b_a3b", + "qwen3_235b_a22b", + ] + and task == "pretrain" + ): + assert hf_token is not None, "HF token is required for Qwen3 tokenizer. NullTokenizer to be used soon." 
+ + if wandb_key is not None: + assert wandb_project_name is not None and wandb_experiment_name is not None, ( + "both wandb_project_name and wandb_experiment_name are required for logging with WandB" + ) + + if use_recipes: + script_name = ENTRYPOINT_RECIPE + exp_name = ( + wandb_experiment_name + if wandb_experiment_name is not None + else f"{model_recipe_name}_{task}_{num_gpus}gpu_{gpu}" + ) + + else: + script_name = ENTRYPOINT_PEFORMANCE + exp_config = get_exp_name_config( + args, model_family_name, model_recipe_name, gpu, compute_dtype, task, config_variant + ) + exp_name = ( + wandb_experiment_name + if wandb_experiment_name is not None + else f"{task}_{model_recipe_name}_{compute_dtype}_{exp_config}" + ) + + if pretrained_checkpoint is not None: + custom_mounts.append(f"{pretrained_checkpoint}:{pretrained_checkpoint}") + + import os + rank = os.environ.get('RANK', '0') + exp_name += f'_worker{rank}' + + run_script_path = SCRIPT_DIR / script_name + logger.info(f"Run script path: {run_script_path}") + if not run_script_path.is_file(): + logger.error(f"Specified run script not found: {run_script_path}") + sys.exit(1) + + custom_mounts.extend( + [ + f"{run_script_path}:{run_script_path}", + f"{SCRIPT_DIR}:{SCRIPT_DIR}", + ] + ) + + if nccl_ub: + custom_env_vars.update({"NCCL_NVLS_ENABLE": "1", "NCCL_CTA_POLICY": "1"}) + + executor = run.LocalExecutor() + + plugins = [] + + if not use_recipes: + plugins.append( + PerfEnvPlugin( + enable_vboost=enable_vboost, + moe_a2a_overlap=moe_a2a_overlap, + tp_size=tp_size, + pp_size=pp_size, + cp_size=cp_size, + ep_size=ep_size, + model_family_name=model_family_name, + model_recipe_name=model_recipe_name, + gpu=gpu, + compute_dtype=compute_dtype, + train_task=task, + config_variant=config_variant, + ) + ) + + if enable_nsys: + plugins.append( + NsysPlugin( + profile_step_start=profiling_start_step, + profile_step_end=profiling_stop_step, + nsys_gpu_metrics=profiling_gpu_metrics, + profile_ranks=profiling_ranks, + 
nsys_trace=nsys_trace, + nsys_extra_args=nsys_extra_args, + ) + ) + if pytorch_profiler: + plugins.append( + PyTorchProfilerPlugin( + profile_step_start=profiling_start_step, + profile_step_end=profiling_stop_step, + profile_ranks=profiling_ranks, + record_memory_history=record_memory_history, + ) + ) + + nemorun_script = run.Script( + path=str(run_script_path), + entrypoint="python", + env={"PYTHONPATH": f"{SCRIPT_DIR}:$PYTHONPATH"}, + args=list(sys.argv[1:]), + ) + + logger.info("Will launch the following command with Nemo-Run: %s", " ".join(nemorun_script.to_command())) + + run.run( + nemorun_script, + executor=executor, + plugins=plugins, + dryrun=dryrun, + detach=detach, + name=exp_name, + ) + + +if __name__ == "__main__": + parser = parse_cli_args() + args, unknown_args = parser.parse_known_args() + + assert not (args.enable_nsys and args.pytorch_profiler), ( + "Both NSys and PyTorch profiler cannot be enabled at the same time" + ) + + # parse_known_args() lets unrecognized arguments through with only a warning; + # switch to parser.parse_args() to make them a hard error.
+ if unknown_args: + logger.warning(f"Ignoring unrecognized arguments: {' '.join(unknown_args)}") + + # Handle --list_config_variants: show available variants and interactively select + config_variant = args.config_variant + if args.list_config_variants: + config_variant = select_config_variant_interactive( + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + gpu=args.gpu, + compute_dtype=args.compute_dtype, + task=args.task, + ) + + main( + use_recipes=args.use_recipes, + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + task=args.task, + compute_dtype=args.compute_dtype, + gpu=args.gpu, + hf_token=args.hf_token, + detach=args.detach, + dryrun=args.dryrun, + enable_vboost=args.enable_vboost, + enable_nsys=args.enable_nsys, + pytorch_profiler=args.pytorch_profiler, + moe_a2a_overlap=args.moe_a2a_overlap, + tp_size=args.tensor_model_parallel_size, + pp_size=args.pipeline_model_parallel_size, + cp_size=args.context_parallel_size, + ep_size=args.expert_model_parallel_size, + wandb_key=args.wandb_key, + wandb_project_name=args.wandb_project_name, + wandb_experiment_name=args.wandb_experiment_name, + wandb_entity_name=args.wandb_entity_name, + profiling_start_step=args.profiling_start_step, + profiling_stop_step=args.profiling_stop_step, + record_memory_history=args.record_memory_history, + profiling_gpu_metrics=args.profiling_gpu_metrics, + profiling_ranks=args.profiling_ranks, + nsys_trace=args.nsys_trace, + nsys_extra_args=args.nsys_extra_args, + nemo_home=args.nemo_home, + account=args.account, + partition=args.partition, + log_dir=args.log_dir, + gpus_per_node=args.gpus_per_node, + time_limit=args.time_limit, + container_image=args.container_image, + custom_mounts=args.custom_mounts, + custom_env_vars=args.custom_env_vars, + custom_srun_args=args.custom_srun_args, + custom_bash_cmds=args.custom_bash_cmds, + nccl_ub=args.nccl_ub, + pretrained_checkpoint=args.pretrained_checkpoint, + 
num_gpus=args.num_gpus, + is_long_convergence_run=args.is_long_convergence_run, + additional_slurm_params=args.additional_slurm_params, + golden_values_path=args.golden_values_path, + convergence_params={ + "correlation_threshold": args.correlation_threshold, + "high_loss_tolerance": args.high_loss_tolerance, + "medium_loss_tolerance": args.medium_loss_tolerance, + "low_loss_tolerance": args.low_loss_tolerance, + "final_loss_tolerance": args.final_loss_tolerance, + "max_outlier_ratio": args.max_outlier_ratio, + "outlier_threshold": args.outlier_threshold, + "skip_first_percent_loss": args.skip_first_percent_loss, + }, + performance_params={ + "timing_threshold": args.timing_threshold, + "skip_first_percent_time": args.skip_first_percent_time, + }, + memory_params={ + "memory_threshold": args.memory_threshold, + }, + max_retries=args.max_retries, + dgxc_base_url=args.dgxc_base_url, + dgxc_cluster=args.dgxc_cluster, + dgxc_kube_apiserver_url=args.dgxc_kube_apiserver_url, + dgxc_app_id=args.dgxc_app_id, + dgxc_app_secret=args.dgxc_app_secret, + dgxc_project_name=args.dgxc_project_name, + dgxc_pvc_claim_name=args.dgxc_pvc_claim_name, + dgxc_pvc_mount_path=args.dgxc_pvc_mount_path, + config_variant=config_variant, + ) \ No newline at end of file diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/launcher.sh b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/launcher.sh new file mode 100644 index 00000000..7e46e085 --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/launcher.sh @@ -0,0 +1,150 @@ +usage() +{ +cat << EOF +usage: bash ./launcher.sh [config-override [config-override ...]] +config-override (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000. 
+EOF +} + +parse_args() { + while [[ "$1" != "" ]]; do + case $(grep -o "=" <<< "$1" | wc -l) in + 1 ) + config_overrides+=("$1") + ;; + * ) + echo "Invalid config override: $1" + usage + exit 1 + esac + shift + done + config_overrides="${config_overrides[*]}" +} + +config_overrides=() +parse_args "$@" + +if [[ -z "${config_overrides[*]}" ]]; then + echo "No NeMo config overrides specified" +else + echo "NeMo config overrides:" + echo " ${config_overrides}" +fi + +export LD_LIBRARY_PATH="/usr/local/cuda/compat/lib:$NCCL_PLUGIN_PATH:$LD_LIBRARY_PATH" +ldconfig "$LD_LIBRARY_PATH" +echo "Added $LD_LIBRARY_PATH to ldconfig:" +ldconfig -p | grep libcuda | sed 's/^/ /' +echo "" + +if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then + explicit_log_dir=${EXPLICIT_LOG_DIR} +else + explicit_log_dir=workload_logs +fi +echo "Logging to ${explicit_log_dir}" + +if [[ -n "${TOKENIZER_PATH}" ]]; then + echo "Getting tokenizer files" + cp "${TOKENIZER_PATH}"/* . + echo "" +fi + +echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes" + +pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger + +# Create the nsys directory. 
+mkdir -p "${explicit_log_dir}/nsys"
+
+# Collect diagnostics to a single line
+kv="\"kernel_version\": \"$(uname --kernel-release)\""
+if command -v nvidia-smi &> /dev/null; then
+  cuda_v=$(nvidia-smi -q -x | grep -Po '(?<=<cuda_version>).*(?=</cuda_version>)' || true)
+  driver_v=$(nvidia-smi -q -x | grep -Po '(?<=<driver_version>).*(?=</driver_version>)' || true)
+  vbios_v=$(nvidia-smi -q -x | grep -Po '(?<=<vbios_version>).*(?=</vbios_version>)' | head -n1 || true)
+  kv="${kv}, \"cuda_version\": \"${cuda_v}\""
+  kv="${kv}, \"driver_version\": \"${driver_v}\""
+  kv="${kv}, \"vbios_version\": \"${vbios_v}\""
+fi
+echo "VERSION_DIAGNOSTICS: {${kv}}"
+
+
+export HF_TOKEN=YOUR_HF_TOKEN
+
+cd /opt
+rm -rf Megatron-Bridge
+git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
+cd Megatron-Bridge
+git checkout f7a9428f301fa17ac374d5e7166a63b0aa4771af
+git submodule update --init --recursive
+sed -i -e '/pretrain(config=recipe/i \  recipe.dist.distributed_timeout_minutes = 60' scripts/performance/run_script.py
+ls
+
+cp $CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH scripts/performance/
+
+worker_command=$(cat <<- EOM
+  if [ "\$RANK" -eq "0" ]; then
+    echo "Worker 0 is stalling for a few seconds.."
; + sleep 3 ; + echo "The detected environment within worker rank 0 is:" ; + env | sed 's/^/ /' ; + fi ; + + cd /opt/Megatron-Bridge ; + + numactl \ + --cpunodebind=\$((LOCAL_RANK/4)) \ + --membind=\$((LOCAL_RANK/4)) nsys profile \ + -t nvtx,cuda \ + --cuda-event-trace=false \ + --sample=none \ + --capture-range=cudaProfilerApi \ + --capture-range-end=stop \ + --kill none \ + -o "/${explicit_log_dir}/$JOB_IDENTIFIER/rank-\$RANK" \ + --force-overwrite true \ + --session-new "nsys-\$RANDOM-\$RANK" \ + nice -10 \ + python scripts/performance/custom_setup_experiment.py \ + --gpu b200 \ + --model_family_name qwen \ + --model_recipe_name qwen3_235b_a22b \ + --gpus_per_node 8 \ + --num_gpus 256 \ + --seq_length 4096 \ + --compute_dtype bf16 \ + --global_batch_size 4096 \ + --tensor_model_parallel_size 1 \ + --pipeline_model_parallel_size 8 \ + --virtual_pipeline_model_parallel_size 4 \ + --expert_model_parallel_size 8 \ + --expert_tensor_parallel_size 1 \ + --moe_a2a_overlap True \ + --max_steps 30 + +EOM +) + +echo "$worker_command" > worker_command.sh +chmod 777 worker_command.sh + +torchrun \ +--nproc-per-node="8" \ +--nnodes="32" \ +--node_rank="${JOB_COMPLETION_INDEX}" \ +--rdzv_id="${JOB_IDENTIFIER}" \ +--master_addr="${MASTER_ADDR}" \ +--master_port="${MASTER_PORT}" \ +--no-python bash worker_command.sh + + +if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then + mkdir -p "${ARTIFACT_DIR}" + cp -r "${explicit_log_dir}"/* "${ARTIFACT_DIR}/" + env > "${ARTIFACT_DIR}/environ.txt" + ls "${ARTIFACT_DIR}" +fi +echo "Training completed" +echo "Pod on $(hostname --fqdn) is exiting" \ No newline at end of file diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/templates/workload-config-configmap.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/templates/workload-config-configmap.yaml new file mode 100644 index 00000000..a1d54cee --- /dev/null +++ 
b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/templates/workload-config-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +{{- if .Values.workload.configFile }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-config" +data: + workload-configuration: |- +{{- if .Values.workload_config }} +{{ .Values.workload_config | nindent 4 }} +{{- else }} +{{ "config: null" | nindent 4 }} +{{- end }} +{{- end }} diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/templates/workload-job.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/templates/workload-job.yaml new file mode 100644 index 00000000..b4ffa210 --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/templates/workload-job.yaml @@ -0,0 +1,333 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +{{$timestamp := now | date "2006-01-02-15-04-05"}} +{{$jobSuffix := randAlphaNum 4 | lower}} +{{$jobuuid := uuidv4}} +{{$nodes := div .Values.workload.gpus 8 | max 1}} +{{$gpusPerNode := min .Values.workload.gpus 8}} +{{- $root := . -}} + +apiVersion: jobset.x-k8s.io/v1alpha2 +kind: JobSet +metadata: + name: "{{ .Release.Name }}" + namespace: default + labels: + {{- if $root.Values.queue }} + kueue.x-k8s.io/queue-name: "{{ $root.Values.queue }}" + {{- end }} +spec: + {{- if $root.Values.queue }} + suspend: true + {{- end }} + failurePolicy: + maxRestarts: {{ default 0 $root.Values.workload.max_workload_restarts }} + replicatedJobs: + - name: workload + replicas: 1 + template: + spec: + parallelism: {{ $nodes }} + completions: {{ $nodes }} + backoffLimit: 0 + completionMode: Indexed + activeDeadlineSeconds: 14400 # 4 hours (4 * 60 * 60) + ttlSecondsAfterFinished: 43200 # 12 hours (12 * 60 * 60) + template: + metadata: + annotations: + kubectl.kubernetes.io/default-container: workload + {{- if $root.Values.volumes.gcsVolumes }} + gke-gcsfuse/volumes: "true" + gke-gcsfuse/cpu-limit: "500m" + gke-gcsfuse/memory-limit: "1Ti" + gke-gcsfuse/ephemeral-storage-limit: "2Ti" + {{- end }} + {{- if $root.Values.volumes.psVolumes }} + gke-parallelstore/volumes: "true" + gke-parallelstore/cpu-limit: "0" + gke-parallelstore/memory-limit: "0" + {{- end }} + {{- if and $root.Values.queue $root.Values.tasSettings.topologyRequest }} + {{- toYaml .Values.tasSettings.topologyRequest | nindent 14 }} + {{- end }} + {{- if and $root.Values.queue 
$root.Values.dwsSettings.maxRunDurationSeconds }} + provreq.kueue.x-k8s.io/maxRunDurationSeconds: "{{ $root.Values.dwsSettings.maxRunDurationSeconds }}" + {{- end }} + {{- if not $root.Values.network.hostNetwork }} + networking.gke.io/default-interface: "eth0" + networking.gke.io/interfaces: | + {{- if $root.Values.network.subnetworks }} + [ + {{- range $i, $subnetwork := $root.Values.network.subnetworks }} + {"interfaceName":"eth{{ $i }}","network":"{{ $subnetwork }}"}{{ eq $i 9 | ternary "" ","}} + {{- end }} + ] + {{- else }} + [ + {"interfaceName":"eth0","network":"default"}, + {"interfaceName":"eth1","network":"gvnic-1"}, + {{- range $i := until 8 }} + {"interfaceName":"eth{{ add 2 $i }}","network":"rdma-{{ $i }}"}{{ eq $i 7 | ternary "" ","}} + {{- end }} + ] + {{- end }} + {{- end }} + spec: + {{- if $root.Values.network.hostNetwork }} + hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet + {{- end }} + subdomain: "{{.Release.Name}}" + restartPolicy: Never + {{- if $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "In" + values: + {{- range $hostname := $root.Values.targetNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + {{- if $root.Values.avoidNodes }} + {{- if not $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + {{- end }} + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "NotIn" + values: + {{- range $hostname := $root.Values.avoidNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + tolerations: + - operator: "Exists" + key: nvidia.com/gpu + - operator: "Exists" + key: cloud.google.com/impending-node-termination + + volumes: + {{ if $root.Values.network.gibVersion }} + - name: gib + emptyDir: {} + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + configMap: + name: 
"{{.Release.Name}}-config" + items: + - key: workload-configuration + path: {{ $root.Values.workload.configFile | default "workload-configuration" }} + {{- end }} + + - name: workload-launcher + configMap: + name: "{{.Release.Name}}-launcher" + + - name: shared-memory + emptyDir: + medium: "Memory" + sizeLimit: 250Gi + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + persistentVolumeClaim: + claimName: "{{ $pvc.claimName }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + csi: + driver: gcsfuse.csi.storage.gke.io + volumeAttributes: + bucketName: "{{ $gcs.bucketName }}" + {{- if $gcs.mountOptions }} + mountOptions: "{{ $gcs.mountOptions }}" + {{- end }} + {{- end}} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + hostPath: + path: /mnt/stateful_partition/kube-ephemeral-ssd + {{- end }} + + initContainers: + {{ if $root.Values.network.gibVersion }} + - name: nccl-plugin-installer + image: {{ $root.Values.network.gibVersion }} + imagePullPolicy: Always + args: + - | + set -ex + /scripts/container_entry.sh install --install-nccl + cp -R /var/lib/gib/lib64/. /target/usr/local/gib/lib64 + cp -R /var/lib/gib/. 
/target/usr/local/gib + command: + - /bin/sh + - -c + volumeMounts: + - mountPath: /target/usr/local/gib + name: gib + {{ end}} + + containers: + {{- if $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-sidecar + image: {{ $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-metadata-prefetch + image: {{ $root.Values.workload.gcsSidecarImage }} + {{- end }} + {{- if $root.Values.workload.psSidecarImage }} + - name: gke-parallelstore-sidecar + image: {{ $root.Values.workload.psSidecarImage }} + {{- end }} + + - name: workload + image: "{{ $root.Values.workload.image }}" + imagePullPolicy: Always + {{- if $root.Values.network.hostNetwork }} + securityContext: + privileged: true + {{- end }} + env: + - name: JOB_IDENTIFIER + value: "{{ .Release.Name }}-{{ $timestamp }}" + - name: JOB_TIMESTAMP + value: "{{ $timestamp }}" + - name: JOB_UUID + value: "{{ $jobuuid }}" + - name: JOB_ORCHESTRATOR + value: "gke" + # Add RANK based on the pod's index provided by the Indexed Job + # This is crucial for torch.distributed initialization. 
+ - name: JOB_COMPLETION_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index'] + - name: RANK_0_FQDN + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: HOSTNAME_PREFIX + value: "{{.Release.Name}}-workload-" + - name: DOMAIN_NAME + value: "{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_ADDR + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_PORT + value: "6002" + - name: WORLD_SIZE + value: "{{ $root.Values.workload.gpus }}" + - name: NNODES + value: "{{ $nodes }}" + - name: GPUS_PER_NODE + value: "{{ $gpusPerNode }}" + + - name: NCCL_PLUGIN_PATH + value: /usr/local/gib/lib64:/usr/local/nvidia/lib64 + + {{ if $root.Values.network.gibVersion }} + - name: NCCL_INIT_SCRIPT + value: "/usr/local/gib/scripts/set_nccl_env.sh" + {{ end }} + + {{ if $root.Values.network.ncclSettings }} + {{- toYaml .Values.network.ncclSettings | nindent 14 }} + {{ end }} + + {{ if $root.Values.workload.envs }} + {{- toYaml .Values.workload.envs | nindent 14 }} + {{ end }} + + command: + - bash + - -c + - | + echo "Pod on $(hostname --fqdn) is running" + echo "Pod is assigned job index of $JOB_COMPLETION_INDEX" + + if [[ -n "${NCCL_INIT_SCRIPT}" ]]; then + echo "Running NCCL init script: ${NCCL_INIT_SCRIPT}" + source ${NCCL_INIT_SCRIPT} + fi + + # Overriding NCCL_SOCKET_IFNAME definition + export NCCL_SOCKET_IFNAME="eth0,eth1" + export NCCL_TUNER_CONFIG_PATH=/usr/local/gib/configs/tuner_config_a4.txtpb + + echo "Launching workload with the following arguments:" + {{- range $root.Values.workload.defaultArguments }} + echo " {{ . }}" + {{- end }} + {{- range $root.Values.workload.arguments }} + echo " {{ . }}" + {{- end }} + echo "" + + sleep 10 + + bash /workload/launcher/launch-workload.sh \ + {{- range $root.Values.workload.defaultArguments }} + {{ . }} \ + {{- end }} + {{- range $root.Values.workload.arguments }} + {{ . 
}} \ + {{- end }} + + + volumeMounts: + {{ if $root.Values.network.gibVersion }} + - name: gib + mountPath: /usr/local/gib + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + mountPath: {{ $root.Values.workload.configPath | default "/workload/configs" }} + {{- end }} + + - name: workload-launcher + mountPath: /workload/launcher + + - name: shared-memory + mountPath: /dev/shm + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + mountPath: "{{ $pvc.mountPath }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + mountPath: "{{ $gcs.mountPath }}" + {{- end }} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + mountPath: "{{ $root.Values.volumes.ssdMountPath }}" + {{- end }} + + resources: + limits: + nvidia.com/gpu: {{ $gpusPerNode }} diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/templates/workload-launcher-configmap.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/templates/workload-launcher-configmap.yaml new file mode 100644 index 00000000..7026e0f1 --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/templates/workload-launcher-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-launcher" +data: + launch-workload.sh: |- +{{- if .Values.workload_launcher }} +{{ .Values.workload_launcher | nindent 4 }} +{{- else }} + #!/bin/bash + echo "No workload launcher specified" + exit 1 +{{- end }} diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/templates/workload-svc.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/templates/workload-svc.yaml new file mode 100644 index 00000000..7cfe220b --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/templates/workload-svc.yaml @@ -0,0 +1,22 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: Service +metadata: + name: "{{ .Release.Name }}" +spec: + clusterIP: None + selector: + jobset.sigs.k8s.io/jobset-name: "{{ .Release.Name }}" diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/values.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/values.yaml new file mode 100644 index 00000000..cb73da9b --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-BF16-SEQ4096-GBS4096/recipe/values.yaml @@ -0,0 +1,33 @@ +queue: null +dwsSettings: + maxRunDurationSeconds: null +tasSettings: + topologyRequest: + kueue.x-k8s.io/podset-preferred-topology: kubernetes.io/hostname +volumes: + gcsVolumes: true + psVolumes: false + gcsMounts: + - bucketName: null + mountPath: null +workload: + gpus: 256 + image: nvcr.io/nvidia/nemo:26.02 + defaultArguments[]: null + arguments[]: null + configFile: custom_setup_experiment.py + configPath: /workload/configs/ + envs: + - name: ARTIFACT_DIR + value: null + - name: GLOO_SOCKET_IFNAME + value: eth0 + - name: CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH + value: /workload/configs/custom_setup_experiment.py +network: + hostNetwork: true + subnetworks[]: null + gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1 + ncclSettings: + - name: NCCL_DEBUG + value: WARN diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/Chart.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/Chart.yaml new file mode 100644 index 00000000..af46c11a --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/Chart.yaml @@ -0,0 +1,20 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with 
the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v2 +name: a4_jobset_workload +description: a4_jobset_workload +type: application +version: 0.1.0 +appVersion: "1.16.0" diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/README.md b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/README.md new file mode 100644 index 00000000..4d22166c --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/README.md @@ -0,0 +1,157 @@ + +# Pretrain qwen3-235b-a22b-fp8mx-gbs8192-gpus256 workloads on a4 GKE Node pools with Megatron-Bridge + +This recipe outlines the steps for running a qwen3-235b-a22b pretraining +workload on [a4 GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the +[NVIDIA Megatron-Bridge framework](https://github.com/NVIDIA-NeMo/Megatron-Bridge). + +## Orchestration and deployment tools + +For this recipe, the following setup is used: + +- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) +- Pretraining job configuration and deployment - A Helm chart is used to configure and deploy the Kubernetes Jobset resource which manages the execution of the [Megatron-Bridge pretraining workload](https://github.com/NVIDIA-NeMo/Megatron-Bridge). 
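The JobSet topology wired up by these recipes can be sketched quickly. As a minimal illustration (not part of the recipe files), the Helm templates compute the node count as `div .Values.workload.gpus 8`, run one indexed Job pod per node, and the launcher passes each pod's completion index to `torchrun` as the node rank; with this recipe's `workload.gpus: 256` the arithmetic works out as:

```shell
# Illustrative sketch only: the topology arithmetic used by the Helm chart and launcher.
GPUS=256                             # .Values.workload.gpus in values.yaml
GPUS_PER_NODE=8                      # each A4 node exposes 8 GPUs
NNODES=$(( GPUS / GPUS_PER_NODE ))   # matches the template: div .Values.workload.gpus 8
echo "nodes=${NNODES} world_size=${GPUS}"
```

Each pod then runs `torchrun --nnodes=${NNODES} --nproc-per-node=${GPUS_PER_NODE} --node_rank=$JOB_COMPLETION_INDEX`, so the global world size equals the GPU count requested in `values.yaml`.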
+
+## Test environment
+
+This recipe has been optimized for and tested with the following configuration:
+
+- GKE cluster: Please follow Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4x) to create your a4x GKE cluster.
+- Node Configuration: 32 nodes (8 GPUs per node, 256 GPUs total).
+- GPU Architecture: NVIDIA Blackwell.
+
+## Training dataset
+
+This recipe uses a mock pretraining dataset provided by the [Megatron Bridge Framework Datasets utils](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/scripts/performance/utils/datasets.py).
+
+## Docker container image
+
+This recipe uses the following Docker images:
+
+- `nvcr.io/nvidia/nemo:26.02`
+- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-arm64:v1.1.1`
+
+## Run the recipe
+
+From your client workstation, complete the following steps:
+
+### Configure environment settings
+
+Set the environment variables to match your environment:
+
+```bash
+export PROJECT_ID=<PROJECT_ID>
+export CLUSTER_REGION=<CLUSTER_REGION>
+export CLUSTER_NAME=<CLUSTER_NAME>
+export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs://
+export KUEUE_NAME=<KUEUE_NAME>
+export HF_TOKEN=<HF_TOKEN>
+```
+
+Replace the following values:
+
+- `<PROJECT_ID>`: your Google Cloud project ID.
+- `<CLUSTER_REGION>`: the region where your cluster is located.
+- `<CLUSTER_NAME>`: the name of your GKE cluster.
+- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the gs:// prefix.
+- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the cluster toolkit is a4x.
+- `<HF_TOKEN>`: your Hugging Face access token.
+
+Set the default project:
+
+```bash
+gcloud config set project $PROJECT_ID
+```
+
+### Get cluster credentials
+
+```bash
+gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
+```
+
+### Get the recipe
+
+Clone the `gpu-recipes` repository and set a reference to the recipe folder.
+
+```bash
+git clone https://github.com/ai-hypercomputer/gpu-recipes.git
+cd gpu-recipes
+export REPO_ROOT=`git rev-parse --show-toplevel`
+export RECIPE_ROOT=$REPO_ROOT/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe
+cd $RECIPE_ROOT
+```
+
+### Configure and submit a pretraining job
+
+#### Using 32 nodes (256 GPUs) FP8MX precision
+
+To execute the job with the default settings, run the following command from your client:
+
+```bash
+cd $RECIPE_ROOT
+export WORKLOAD_NAME=$USER-qwen3-235b-32node-fp8mx-gbs8192
+helm install $WORKLOAD_NAME . -f values.yaml \
+--set-file workload_launcher=launcher.sh \
+--set-file workload_config=custom_setup_experiment.py \
+--set workload.image=nvcr.io/nvidia/nemo:26.02 \
+--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+--set volumes.gcsMounts[0].mountPath=/job-logs \
+--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+--set workload.envs[3].name=HF_TOKEN \
+--set workload.envs[3].value=${HF_TOKEN} \
+--set queue=${KUEUE_NAME}
+```
+
+**Examples**
+
+- To set the number of training steps to 100, run the following command from
+  your client:
+
+  ```bash
+  cd $RECIPE_ROOT
+  export WORKLOAD_NAME=$USER-qwen3-235b-32node-fp8mx-gbs8192
+  helm install $WORKLOAD_NAME .
-f values.yaml \
+  --set-file workload_launcher=launcher.sh \
+  --set-file workload_config=custom_setup_experiment.py \
+  --set workload.image=nvcr.io/nvidia/nemo:26.02 \
+  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+  --set volumes.gcsMounts[0].mountPath=/job-logs \
+  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+  --set workload.envs[3].name=HF_TOKEN \
+  --set workload.envs[3].value=${HF_TOKEN} \
+  --set queue=${KUEUE_NAME} \
+  --set workload.arguments[0]="trainer.train_iters=100"
+  ```
+
+### Monitor the job
+
+To check the status of pods in your job, run the following command:
+
+```bash
+kubectl get pods | grep JOB_NAME_PREFIX
+```
+
+Replace the following:
+
+- JOB_NAME_PREFIX - your job name prefix. For example $USER-qwen3-235b-32node-fp8mx-gbs8192.
+
+To get the logs for one of the pods, run the following command:
+
+```bash
+kubectl logs POD_NAME
+```
+
+Replace POD_NAME with the name of one of the pods listed in the previous step.
+
+Information about the training job's progress, including crucial details such as
+loss, step count, and step time, is generated by the rank 0 process.
+This process runs on the pod whose name begins with
+`JOB_NAME_PREFIX-workload-0-0`.
+For example: `$USER-qwen3-235b-32node-fp8mx-gbs8192-workload-0-0-s9zrv`.
+
+### Uninstall the Helm release
+
+You can delete the job and other resources created by the Helm chart.
To +uninstall Helm, run the following command from your client: + +```bash +helm uninstall $USER-qwen3-235b-32node-fp8mx-gbs8192 +``` \ No newline at end of file diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/custom_setup_experiment.py b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/custom_setup_experiment.py new file mode 100644 index 00000000..2337fdec --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/custom_setup_experiment.py @@ -0,0 +1,327 @@ +import glob +import logging +import os +import sys +import time +from pathlib import Path +from typing import Any, Dict, List, Optional + +import nemo_run as run +from nemo_run.config import get_nemorun_home + + +try: + from argument_parser import parse_cli_args + from utils.evaluate import calc_convergence_and_performance + from utils.executors import dgxc_executor, slurm_executor + from utils.utils import get_exp_name_config, select_config_variant_interactive +except (ImportError, ModuleNotFoundError): + from .argument_parser import parse_cli_args + from .utils.evaluate import calc_convergence_and_performance + from .utils.executors import dgxc_executor, slurm_executor + from .utils.utils import get_exp_name_config, select_config_variant_interactive + +try: + import wandb + + HAVE_WANDB = True +except (ImportError, ModuleNotFoundError): + HAVE_WANDB = False + +try: + from perf_plugins import NsysPlugin, PerfEnvPlugin, PyTorchProfilerPlugin + from resiliency_plugins import FaultTolerancePlugin +except (ImportError, ModuleNotFoundError): + from .perf_plugins import NsysPlugin, PerfEnvPlugin, PyTorchProfilerPlugin + from .resiliency_plugins import FaultTolerancePlugin + + +SCRIPT_DIR = Path(__file__).parent.resolve() +ENTRYPOINT_PEFORMANCE = "run_script.py" +ENTRYPOINT_RECIPE = "run_recipe.py" + +logging.basicConfig(level=logging.DEBUG) +logger 
= logging.getLogger(__name__) + + + + +def main( + use_recipes: bool, + model_family_name: str, + model_recipe_name: str, + task: str, + compute_dtype: str, + gpu: str, + hf_token: str, + detach: bool, + dryrun: bool, + enable_vboost: bool, + enable_nsys: bool, + pytorch_profiler: bool, + moe_a2a_overlap: bool, + tp_size: Optional[int], + pp_size: Optional[int], + cp_size: Optional[int], + ep_size: Optional[int], + wandb_key: str, + wandb_project_name: str, + wandb_experiment_name: str, + wandb_entity_name: str, + profiling_start_step: int, + profiling_stop_step: int, + record_memory_history: bool, + profiling_gpu_metrics: bool, + profiling_ranks: Optional[List[int]], + nsys_trace: Optional[List[str]], + nsys_extra_args: Optional[List[str]], + nemo_home: str, + account: str, + partition: str, + log_dir: str, + gpus_per_node: int, + time_limit: str, + container_image: str, + custom_mounts: List[str], + custom_env_vars: Dict[str, str], + custom_srun_args: List[str], + custom_bash_cmds: List[List[str]], + nccl_ub: bool, + pretrained_checkpoint: Optional[str], + num_gpus: int, + is_long_convergence_run: bool, + additional_slurm_params: Optional[Dict[str, Any]], + golden_values_path: str, + convergence_params: Dict[str, Any], + performance_params: Dict[str, Any], + memory_params: Dict[str, Any], + max_retries: int, + dgxc_base_url: str, + dgxc_cluster: str, + dgxc_kube_apiserver_url: str, + dgxc_app_id: str, + dgxc_app_secret: str, + dgxc_project_name: str, + dgxc_pvc_claim_name: str, + dgxc_pvc_mount_path: str, + config_variant: str = "v1", +): + """Sets up the experiment and runs it.""" + if ( + model_family_name in ["qwen3"] + and model_recipe_name + in [ + "qwen3_30b_a3b", + "qwen3_235b_a22b", + ] + and task == "pretrain" + ): + assert hf_token is not None, "HF token is required for Qwen3 tokenizer. NullTokenizer to be used soon." 
+ + if wandb_key is not None: + assert wandb_project_name is not None and wandb_experiment_name is not None, ( + "both wandb_project_name and wandb_experiment_name are required for logging with WandB" + ) + + if use_recipes: + script_name = ENTRYPOINT_RECIPE + exp_name = ( + wandb_experiment_name + if wandb_experiment_name is not None + else f"{model_recipe_name}_{task}_{num_gpus}gpu_{gpu}" + ) + + else: + script_name = ENTRYPOINT_PEFORMANCE + exp_config = get_exp_name_config( + args, model_family_name, model_recipe_name, gpu, compute_dtype, task, config_variant + ) + exp_name = ( + wandb_experiment_name + if wandb_experiment_name is not None + else f"{task}_{model_recipe_name}_{compute_dtype}_{exp_config}" + ) + + if pretrained_checkpoint is not None: + custom_mounts.append(f"{pretrained_checkpoint}:{pretrained_checkpoint}") + + import os + rank = os.environ.get('RANK', '0') + exp_name += f'_worker{rank}' + + run_script_path = SCRIPT_DIR / script_name + logger.info(f"Run script path: {run_script_path}") + if not run_script_path.is_file(): + logger.error(f"Specified run script not found: {run_script_path}") + sys.exit(1) + + custom_mounts.extend( + [ + f"{run_script_path}:{run_script_path}", + f"{SCRIPT_DIR}:{SCRIPT_DIR}", + ] + ) + + if nccl_ub: + custom_env_vars.update({"NCCL_NVLS_ENABLE": "1", "NCCL_CTA_POLICY": "1"}) + + executor = run.LocalExecutor() + + plugins = [] + + if not use_recipes: + plugins.append( + PerfEnvPlugin( + enable_vboost=enable_vboost, + moe_a2a_overlap=moe_a2a_overlap, + tp_size=tp_size, + pp_size=pp_size, + cp_size=cp_size, + ep_size=ep_size, + model_family_name=model_family_name, + model_recipe_name=model_recipe_name, + gpu=gpu, + compute_dtype=compute_dtype, + train_task=task, + config_variant=config_variant, + ) + ) + + if enable_nsys: + plugins.append( + NsysPlugin( + profile_step_start=profiling_start_step, + profile_step_end=profiling_stop_step, + nsys_gpu_metrics=profiling_gpu_metrics, + profile_ranks=profiling_ranks, + 
nsys_trace=args.nsys_trace, + nsys_extra_args=args.nsys_extra_args, + ) + ) + if pytorch_profiler: + plugins.append( + PyTorchProfilerPlugin( + profile_step_start=profiling_start_step, + profile_step_end=profiling_stop_step, + profile_ranks=profiling_ranks, + record_memory_history=record_memory_history, + ) + ) + + nemorun_script = run.Script( + path=str(run_script_path), + entrypoint="python", + env={"PYTHONPATH": f"{SCRIPT_DIR}:$PYTHONPATH"}, + args=list(sys.argv[1:]), + ) + + logger.info("Will launch the following command with Nemo-Run: %s", " ".join(nemorun_script.to_command())) + + run.run( + nemorun_script, + executor=executor, + plugins=plugins, + dryrun=dryrun, + detach=detach, + name=exp_name, + ) + + +if __name__ == "__main__": + parser = parse_cli_args() + args, unknown_args = parser.parse_known_args() + + assert not (args.enable_nsys and args.pytorch_profiler), ( + "Both NSys and PyTorch profiler cannot be enabled at the same time" + ) + + # probably better to use parser.parse_args() and make unknowns an error, + # but for now we'll just issue a warning. 
+ if unknown_args: + logger.warning(f"Ignoring unrecognized arguments: {' '.join(unknown_args)}") + + # Handle --list_config_variants: show available variants and interactively select + config_variant = args.config_variant + if args.list_config_variants: + config_variant = select_config_variant_interactive( + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + gpu=args.gpu, + compute_dtype=args.compute_dtype, + task=args.task, + ) + + main( + use_recipes=args.use_recipes, + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + task=args.task, + compute_dtype=args.compute_dtype, + gpu=args.gpu, + hf_token=args.hf_token, + detach=args.detach, + dryrun=args.dryrun, + enable_vboost=args.enable_vboost, + enable_nsys=args.enable_nsys, + pytorch_profiler=args.pytorch_profiler, + moe_a2a_overlap=args.moe_a2a_overlap, + tp_size=args.tensor_model_parallel_size, + pp_size=args.pipeline_model_parallel_size, + cp_size=args.context_parallel_size, + ep_size=args.expert_model_parallel_size, + wandb_key=args.wandb_key, + wandb_project_name=args.wandb_project_name, + wandb_experiment_name=args.wandb_experiment_name, + wandb_entity_name=args.wandb_entity_name, + profiling_start_step=args.profiling_start_step, + profiling_stop_step=args.profiling_stop_step, + record_memory_history=args.record_memory_history, + profiling_gpu_metrics=args.profiling_gpu_metrics, + profiling_ranks=args.profiling_ranks, + nsys_trace=args.nsys_trace, + nsys_extra_args=args.nsys_extra_args, + nemo_home=args.nemo_home, + account=args.account, + partition=args.partition, + log_dir=args.log_dir, + gpus_per_node=args.gpus_per_node, + time_limit=args.time_limit, + container_image=args.container_image, + custom_mounts=args.custom_mounts, + custom_env_vars=args.custom_env_vars, + custom_srun_args=args.custom_srun_args, + custom_bash_cmds=args.custom_bash_cmds, + nccl_ub=args.nccl_ub, + pretrained_checkpoint=args.pretrained_checkpoint, + 
num_gpus=args.num_gpus, + is_long_convergence_run=args.is_long_convergence_run, + additional_slurm_params=args.additional_slurm_params, + golden_values_path=args.golden_values_path, + convergence_params={ + "correlation_threshold": args.correlation_threshold, + "high_loss_tolerance": args.high_loss_tolerance, + "medium_loss_tolerance": args.medium_loss_tolerance, + "low_loss_tolerance": args.low_loss_tolerance, + "final_loss_tolerance": args.final_loss_tolerance, + "max_outlier_ratio": args.max_outlier_ratio, + "outlier_threshold": args.outlier_threshold, + "skip_first_percent_loss": args.skip_first_percent_loss, + }, + performance_params={ + "timing_threshold": args.timing_threshold, + "skip_first_percent_time": args.skip_first_percent_time, + }, + memory_params={ + "memory_threshold": args.memory_threshold, + }, + max_retries=args.max_retries, + dgxc_base_url=args.dgxc_base_url, + dgxc_cluster=args.dgxc_cluster, + dgxc_kube_apiserver_url=args.dgxc_kube_apiserver_url, + dgxc_app_id=args.dgxc_app_id, + dgxc_app_secret=args.dgxc_app_secret, + dgxc_project_name=args.dgxc_project_name, + dgxc_pvc_claim_name=args.dgxc_pvc_claim_name, + dgxc_pvc_mount_path=args.dgxc_pvc_mount_path, + config_variant=config_variant, + ) \ No newline at end of file diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/launcher.sh b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/launcher.sh new file mode 100644 index 00000000..a8ffe7d2 --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/launcher.sh @@ -0,0 +1,151 @@ +usage() +{ +cat << EOF +usage: bash ./launcher.sh [config-override [config-override ...]] +config-override (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000. 
+EOF +} + +parse_args() { + while [[ "$1" != "" ]]; do + case $(grep -o "=" <<< "$1" | wc -l) in + 1 ) + config_overrides+=("$1") + ;; + * ) + echo "Invalid config override: $1" + usage + exit 1 + esac + shift + done + config_overrides="${config_overrides[*]}" +} + +config_overrides=() +parse_args "$@" + +if [[ -z "${config_overrides[*]}" ]]; then + echo "No NeMo config overrides specified" +else + echo "NeMo config overrides:" + echo " ${config_overrides}" +fi + +export LD_LIBRARY_PATH="/usr/local/cuda/compat/lib:$NCCL_PLUGIN_PATH:$LD_LIBRARY_PATH" +ldconfig "$LD_LIBRARY_PATH" +echo "Added $LD_LIBRARY_PATH to ldconfig:" +ldconfig -p | grep libcuda | sed 's/^/ /' +echo "" + +if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then + explicit_log_dir=${EXPLICIT_LOG_DIR} +else + explicit_log_dir=workload_logs +fi +echo "Logging to ${explicit_log_dir}" + +if [[ -n "${TOKENIZER_PATH}" ]]; then + echo "Getting tokenizer files" + cp "${TOKENIZER_PATH}"/* . + echo "" +fi + +echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes" + +pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger + +# Create the nsys directory. 
+mkdir -p "${explicit_log_dir}/nsys" + +# Collect diagnostics to a single line +kv="\"kernel_version\": \"$(uname --kernel-release)\"" +if command -v nvidia-smi &> /dev/null; then + cuda_v=$(nvidia-smi -q -x | grep -Po '(?<=<cuda_version>).*(?=</cuda_version>)' || true) + driver_v=$(nvidia-smi -q -x | grep -Po '(?<=<driver_version>).*(?=</driver_version>)' || true) + vbios_v=$(nvidia-smi -q -x | grep -Po '(?<=<vbios_version>).*(?=</vbios_version>)' | head -n1 || true) + kv="${kv}, \"cuda_version\": \"${cuda_v}\"" + kv="${kv}, \"driver_version\": \"${driver_v}\"" + kv="${kv}, \"vbios_version\": \"${vbios_v}\"" +fi +echo "VERSION_DIAGNOSTICS: {${kv}}" + + +export HF_TOKEN=YOUR_HF_TOKEN + +cd /opt +rm -rf Megatron-Bridge +git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git +cd Megatron-Bridge +git checkout f7a9428f301fa17ac374d5e7166a63b0aa4771af +git submodule update --init --recursive +sed -i -e '/pretrain(config=recipe/i \ recipe.dist.distributed_timeout_minutes = 60' scripts/performance/run_script.py +ls + +cp $CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH scripts/performance/ + +worker_command=$(cat <<- EOM + if [ "\$RANK" -eq "0" ]; then + echo "Worker 0 is stalling for a few seconds.."
; + sleep 3 ; + echo "The detected environment within worker rank 0 is:" ; + env | sed 's/^/ /' ; + fi ; + + cd /opt/Megatron-Bridge ; + + numactl \ + --cpunodebind=\$((LOCAL_RANK/4)) \ + --membind=\$((LOCAL_RANK/4)) nsys profile \ + -t nvtx,cuda \ + --cuda-event-trace=false \ + --sample=none \ + --capture-range=cudaProfilerApi \ + --capture-range-end=stop \ + --kill none \ + -o "/${explicit_log_dir}/$JOB_IDENTIFIER/rank-\$RANK" \ + --force-overwrite true \ + --session-new "nsys-\$RANDOM-\$RANK" \ + nice -10 \ + python scripts/performance/custom_setup_experiment.py \ + --gpu b200 \ + --model_family_name qwen \ + --model_recipe_name qwen3_235b_a22b \ + --gpus_per_node 8 \ + --num_gpus 256 \ + --seq_length 4096 \ + --compute_dtype fp8_mx \ + --global_batch_size 8192 \ + --micro_batch_size 2 \ + --tensor_model_parallel_size 1 \ + --pipeline_model_parallel_size 8 \ + --context_parallel_size 1 \ + --expert_model_parallel_size 8 \ + --cuda_graph_impl transformer_engine \ + --cuda_graph_scope moe_router,moe_preprocess,attn \ + --max_steps 30 + +EOM +) + +echo "$worker_command" > worker_command.sh +chmod 777 worker_command.sh + +torchrun \ +--nproc-per-node="8" \ +--nnodes="32" \ +--node_rank="${JOB_COMPLETION_INDEX}" \ +--rdzv_id="${JOB_IDENTIFIER}" \ +--master_addr="${MASTER_ADDR}" \ +--master_port="${MASTER_PORT}" \ +--no-python bash worker_command.sh + + +if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then + mkdir -p "${ARTIFACT_DIR}" + cp -r "${explicit_log_dir}"/* "${ARTIFACT_DIR}/" + env > "${ARTIFACT_DIR}/environ.txt" + ls "${ARTIFACT_DIR}" +fi +echo "Training completed" +echo "Pod on $(hostname --fqdn) is exiting" \ No newline at end of file diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/templates/workload-config-configmap.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/templates/workload-config-configmap.yaml new file mode 100644 index 
00000000..a1d54cee --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/templates/workload-config-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +{{- if .Values.workload.configFile }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-config" +data: + workload-configuration: |- +{{- if .Values.workload_config }} +{{ .Values.workload_config | nindent 4 }} +{{- else }} +{{ "config: null" | nindent 4 }} +{{- end }} +{{- end }} diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/templates/workload-job.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/templates/workload-job.yaml new file mode 100644 index 00000000..b4ffa210 --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/templates/workload-job.yaml @@ -0,0 +1,333 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +{{$timestamp := now | date "2006-01-02-15-04-05"}} +{{$jobSuffix := randAlphaNum 4 | lower}} +{{$jobuuid := uuidv4}} +{{$nodes := div .Values.workload.gpus 8 | max 1}} +{{$gpusPerNode := min .Values.workload.gpus 8}} +{{- $root := . -}} + +apiVersion: jobset.x-k8s.io/v1alpha2 +kind: JobSet +metadata: + name: "{{ .Release.Name }}" + namespace: default + labels: + {{- if $root.Values.queue }} + kueue.x-k8s.io/queue-name: "{{ $root.Values.queue }}" + {{- end }} +spec: + {{- if $root.Values.queue }} + suspend: true + {{- end }} + failurePolicy: + maxRestarts: {{ default 0 $root.Values.workload.max_workload_restarts }} + replicatedJobs: + - name: workload + replicas: 1 + template: + spec: + parallelism: {{ $nodes }} + completions: {{ $nodes }} + backoffLimit: 0 + completionMode: Indexed + activeDeadlineSeconds: 14400 # 4 hours (4 * 60 * 60) + ttlSecondsAfterFinished: 43200 # 12 hours (12 * 60 * 60) + template: + metadata: + annotations: + kubectl.kubernetes.io/default-container: workload + {{- if $root.Values.volumes.gcsVolumes }} + gke-gcsfuse/volumes: "true" + gke-gcsfuse/cpu-limit: "500m" + gke-gcsfuse/memory-limit: "1Ti" + gke-gcsfuse/ephemeral-storage-limit: "2Ti" + {{- end }} + {{- if $root.Values.volumes.psVolumes }} + gke-parallelstore/volumes: "true" + gke-parallelstore/cpu-limit: "0" + gke-parallelstore/memory-limit: "0" + {{- end }} + {{- if and $root.Values.queue $root.Values.tasSettings.topologyRequest }} + {{- toYaml .Values.tasSettings.topologyRequest | nindent 14 }} + {{- end }} + {{- if and $root.Values.queue 
$root.Values.dwsSettings.maxRunDurationSeconds }} + provreq.kueue.x-k8s.io/maxRunDurationSeconds: "{{ $root.Values.dwsSettings.maxRunDurationSeconds }}" + {{- end }} + {{- if not $root.Values.network.hostNetwork }} + networking.gke.io/default-interface: "eth0" + networking.gke.io/interfaces: | + {{- if $root.Values.network.subnetworks }} + [ + {{- range $i, $subnetwork := $root.Values.network.subnetworks }} + {"interfaceName":"eth{{ $i }}","network":"{{ $subnetwork }}"}{{ eq $i 9 | ternary "" ","}} + {{- end }} + ] + {{- else }} + [ + {"interfaceName":"eth0","network":"default"}, + {"interfaceName":"eth1","network":"gvnic-1"}, + {{- range $i := until 8 }} + {"interfaceName":"eth{{ add 2 $i }}","network":"rdma-{{ $i }}"}{{ eq $i 7 | ternary "" ","}} + {{- end }} + ] + {{- end }} + {{- end }} + spec: + {{- if $root.Values.network.hostNetwork }} + hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet + {{- end }} + subdomain: "{{.Release.Name}}" + restartPolicy: Never + {{- if $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "In" + values: + {{- range $hostname := $root.Values.targetNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + {{- if $root.Values.avoidNodes }} + {{- if not $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + {{- end }} + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "NotIn" + values: + {{- range $hostname := $root.Values.avoidNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + tolerations: + - operator: "Exists" + key: nvidia.com/gpu + - operator: "Exists" + key: cloud.google.com/impending-node-termination + + volumes: + {{ if $root.Values.network.gibVersion }} + - name: gib + emptyDir: {} + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + configMap: + name: 
"{{.Release.Name}}-config" + items: + - key: workload-configuration + path: {{ $root.Values.workload.configFile | default "workload-configuration" }} + {{- end }} + + - name: workload-launcher + configMap: + name: "{{.Release.Name}}-launcher" + + - name: shared-memory + emptyDir: + medium: "Memory" + sizeLimit: 250Gi + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + persistentVolumeClaim: + claimName: "{{ $pvc.claimName }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + csi: + driver: gcsfuse.csi.storage.gke.io + volumeAttributes: + bucketName: "{{ $gcs.bucketName }}" + {{- if $gcs.mountOptions }} + mountOptions: "{{ $gcs.mountOptions }}" + {{- end }} + {{- end}} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + hostPath: + path: /mnt/stateful_partition/kube-ephemeral-ssd + {{- end }} + + initContainers: + {{ if $root.Values.network.gibVersion }} + - name: nccl-plugin-installer + image: {{ $root.Values.network.gibVersion }} + imagePullPolicy: Always + args: + - | + set -ex + /scripts/container_entry.sh install --install-nccl + cp -R /var/lib/gib/lib64/. /target/usr/local/gib/lib64 + cp -R /var/lib/gib/. 
/target/usr/local/gib + command: + - /bin/sh + - -c + volumeMounts: + - mountPath: /target/usr/local/gib + name: gib + {{ end}} + + containers: + {{- if $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-sidecar + image: {{ $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-metadata-prefetch + image: {{ $root.Values.workload.gcsSidecarImage }} + {{- end }} + {{- if $root.Values.workload.psSidecarImage }} + - name: gke-parallelstore-sidecar + image: {{ $root.Values.workload.psSidecarImage }} + {{- end }} + + - name: workload + image: "{{ $root.Values.workload.image }}" + imagePullPolicy: Always + {{- if $root.Values.network.hostNetwork }} + securityContext: + privileged: true + {{- end }} + env: + - name: JOB_IDENTIFIER + value: "{{ .Release.Name }}-{{ $timestamp }}" + - name: JOB_TIMESTAMP + value: "{{ $timestamp }}" + - name: JOB_UUID + value: "{{ $jobuuid }}" + - name: JOB_ORCHESTRATOR + value: "gke" + # Add RANK based on the pod's index provided by the Indexed Job + # This is crucial for torch.distributed initialization. 
+ - name: JOB_COMPLETION_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index'] + - name: RANK_0_FQDN + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: HOSTNAME_PREFIX + value: "{{.Release.Name}}-workload-" + - name: DOMAIN_NAME + value: "{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_ADDR + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_PORT + value: "6002" + - name: WORLD_SIZE + value: "{{ $root.Values.workload.gpus }}" + - name: NNODES + value: "{{ $nodes }}" + - name: GPUS_PER_NODE + value: "{{ $gpusPerNode }}" + + - name: NCCL_PLUGIN_PATH + value: /usr/local/gib/lib64:/usr/local/nvidia/lib64 + + {{ if $root.Values.network.gibVersion }} + - name: NCCL_INIT_SCRIPT + value: "/usr/local/gib/scripts/set_nccl_env.sh" + {{ end }} + + {{ if $root.Values.network.ncclSettings }} + {{- toYaml .Values.network.ncclSettings | nindent 14 }} + {{ end }} + + {{ if $root.Values.workload.envs }} + {{- toYaml .Values.workload.envs | nindent 14 }} + {{ end }} + + command: + - bash + - -c + - | + echo "Pod on $(hostname --fqdn) is running" + echo "Pod is assigned job index of $JOB_COMPLETION_INDEX" + + if [[ -n "${NCCL_INIT_SCRIPT}" ]]; then + echo "Running NCCL init script: ${NCCL_INIT_SCRIPT}" + source ${NCCL_INIT_SCRIPT} + fi + + # Overriding NCCL_SOCKET_IFNAME definition + export NCCL_SOCKET_IFNAME="eth0,eth1" + export NCCL_TUNER_CONFIG_PATH=/usr/local/gib/configs/tuner_config_a4.txtpb + + echo "Launching workload with the following arguments:" + {{- range $root.Values.workload.defaultArguments }} + echo " {{ . }}" + {{- end }} + {{- range $root.Values.workload.arguments }} + echo " {{ . }}" + {{- end }} + echo "" + + sleep 10 + + bash /workload/launcher/launch-workload.sh \ + {{- range $root.Values.workload.defaultArguments }} + {{ . }} \ + {{- end }} + {{- range $root.Values.workload.arguments }} + {{ . 
}} \ + {{- end }} + + + volumeMounts: + {{ if $root.Values.network.gibVersion }} + - name: gib + mountPath: /usr/local/gib + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + mountPath: {{ $root.Values.workload.configPath | default "/workload/configs" }} + {{- end }} + + - name: workload-launcher + mountPath: /workload/launcher + + - name: shared-memory + mountPath: /dev/shm + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + mountPath: "{{ $pvc.mountPath }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + mountPath: "{{ $gcs.mountPath }}" + {{- end }} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + mountPath: "{{ $root.Values.volumes.ssdMountPath }}" + {{- end }} + + resources: + limits: + nvidia.com/gpu: {{ $gpusPerNode }} diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/templates/workload-launcher-configmap.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/templates/workload-launcher-configmap.yaml new file mode 100644 index 00000000..7026e0f1 --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/templates/workload-launcher-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-launcher" +data: + launch-workload.sh: |- +{{- if .Values.workload_launcher }} +{{ .Values.workload_launcher | nindent 4 }} +{{- else }} + #!/bin/bash + echo "No workload launcher specified" + exit 1 +{{- end }} diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/templates/workload-svc.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/templates/workload-svc.yaml new file mode 100644 index 00000000..7cfe220b --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/templates/workload-svc.yaml @@ -0,0 +1,22 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: Service +metadata: + name: "{{ .Release.Name }}" +spec: + clusterIP: None + selector: + jobset.sigs.k8s.io/jobset-name: "{{ .Release.Name }}" diff --git a/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/values.yaml b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/values.yaml new file mode 100644 index 00000000..cb73da9b --- /dev/null +++ b/training/a4/qwen3-235b-a22b/megatron-bridge-pretraining-gke/nemo2602/32node-FP8MX-GBS8192/recipe/values.yaml @@ -0,0 +1,33 @@ +queue: null +dwsSettings: + maxRunDurationSeconds: null +tasSettings: + topologyRequest: + kueue.x-k8s.io/podset-preferred-topology: kubernetes.io/hostname +volumes: + gcsVolumes: true + psVolumes: false + gcsMounts: + - bucketName: null + mountPath: null +workload: + gpus: 256 + image: nvcr.io/nvidia/nemo:26.02 + defaultArguments[]: null + arguments[]: null + configFile: custom_setup_experiment.py + configPath: /workload/configs/ + envs: + - name: ARTIFACT_DIR + value: null + - name: GLOO_SOCKET_IFNAME + value: eth0 + - name: CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH + value: /workload/configs/custom_setup_experiment.py +network: + hostNetwork: true + subnetworks[]: null + gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1 + ncclSettings: + - name: NCCL_DEBUG + value: WARN diff --git a/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/Chart.yaml b/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/Chart.yaml new file mode 100644 index 00000000..af46c11a --- /dev/null +++ b/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/Chart.yaml @@ -0,0 +1,20 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v2 +name: a4_jobset_workload +description: a4_jobset_workload +type: application +version: 0.1.0 +appVersion: "1.16.0" diff --git a/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/README.md b/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/README.md new file mode 100644 index 00000000..63b6a6f4 --- /dev/null +++ b/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/README.md @@ -0,0 +1,153 @@ + +# Pretrain qwen3-30b-a3b workloads on a4 GKE Node pools with the NVIDIA Megatron-Bridge framework + +This recipe outlines the steps for running a qwen3-30b-a3b pretraining +workload on [a4 GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the +[Megatron-Bridge pretraining workload](https://github.com/NVIDIA-NeMo/Megatron-Bridge). + +## Orchestration and deployment tools + +For this recipe, the following setup is used: + +- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) +- Pretraining job configuration and deployment - A Helm chart is used to + configure and deploy the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset) resource, which manages the execution of the + [Megatron-Bridge pretraining workload](https://github.com/NVIDIA-NeMo/Megatron-Bridge).
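The chart deploys the JobSet as an indexed job, so each pod derives its `torchrun` node rank from its completion index. A minimal sketch of that mapping, assuming the environment variable names set by this recipe's workload templates (`JOB_COMPLETION_INDEX`, `GPUS_PER_NODE`, `NNODES`); the helper itself is illustrative only, not part of the chart:

```python
# Illustrative only: how a JobSet indexed pod maps to torch.distributed
# coordinates. Env var names follow this recipe's workload templates.

def distributed_coords(local_rank: int, env: dict) -> dict:
    """Derive rank info for one GPU process from the pod environment."""
    node_rank = int(env["JOB_COMPLETION_INDEX"])   # pod index within the job
    gpus_per_node = int(env["GPUS_PER_NODE"])
    nnodes = int(env["NNODES"])
    return {
        "node_rank": node_rank,
        "global_rank": node_rank * gpus_per_node + local_rank,
        "world_size": nnodes * gpus_per_node,
    }
```

For this single-node recipe (`NNODES=1`, 8 GPUs per node), local rank 3 maps to global rank 3 in a world of size 8.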
+ +## Test environment + +This recipe has been optimized for and tested with the following configuration: + +- GKE cluster: follow the Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4) to create your a4 GKE cluster. + +## Training dataset + +This recipe uses a mock pretraining dataset provided by the Megatron-Bridge framework. + +## Docker container image + +This recipe uses the following docker images: + +- `nvcr.io/nvidia/nemo:26.02` +- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1` + +## Run the recipe + +From your client workstation, complete the following steps: + +### Configure environment settings + +Set the environment variables to match your environment: + + ```bash + export PROJECT_ID=<PROJECT_ID> + export CLUSTER_REGION=<CLUSTER_REGION> + export CLUSTER_NAME=<CLUSTER_NAME> + export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs:// + export KUEUE_NAME=<KUEUE_NAME> + export HF_TOKEN=<HF_TOKEN> + ``` + +Replace the following values: + + - `<PROJECT_ID>`: your Google Cloud project ID. + - `<CLUSTER_REGION>`: the region where your cluster is located. + - `<CLUSTER_NAME>`: the name of your GKE cluster. + - `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix. + - `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the cluster toolkit is `a4`. Make sure to verify the name of the local queue in your cluster. + - `<HF_TOKEN>`: your Hugging Face token. + +Set the default project: + + ```bash + gcloud config set project $PROJECT_ID + ``` + +### Get the recipe + +Clone the `gpu-recipes` repository and set a reference to the recipe folder.
+ +```bash +git clone https://github.com/ai-hypercomputer/gpu-recipes.git +cd gpu-recipes +export REPO_ROOT=`git rev-parse --show-toplevel` +export RECIPE_ROOT=$REPO_ROOT/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe +cd $RECIPE_ROOT +``` + +### Get cluster credentials + +```bash +gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION +``` + +### Configure and submit a pretraining job + +#### Using 1 node (8 gpus) fp8mx precision + +To execute the job with the default settings, run the following command from +your client: + +```bash +cd $RECIPE_ROOT +export WORKLOAD_NAME=$USER-a4-qwen3-30b-a3b-1node +helm install $WORKLOAD_NAME . -f values.yaml \ +--set-file workload_launcher=launcher.sh \ +--set workload.image=nvcr.io/nvidia/nemo:26.02 \ +--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \ +--set volumes.gcsMounts[0].mountPath=/job-logs \ +--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \ +--set queue=${KUEUE_NAME} +``` + +**Examples** + +- To set the number of training steps to 100, run the following command from + your client: + + ```bash + cd $RECIPE_ROOT + export WORKLOAD_NAME=$USER-a4-qwen3-30b-a3b-1node + helm install $WORKLOAD_NAME . -f values.yaml \ + --set-file workload_launcher=launcher.sh \ + --set workload.image=nvcr.io/nvidia/nemo:26.02 \ + --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \ + --set volumes.gcsMounts[0].mountPath=/job-logs \ + --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \ + --set queue=${KUEUE_NAME} \ + --set workload.arguments[0]="trainer.max_steps=100" + ``` + +### Monitor the job + +To check the status of pods in your job, run the following command: + +```bash +kubectl get pods | grep $USER-a4-qwen3-30b-a3b-1node +``` + +If you used a different workload name, replace `$USER-a4-qwen3-30b-a3b-1node` with your job name prefix.
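The pod names matched above follow the JobSet naming convention `<release>-workload-<job index>-<pod index>-<suffix>`, where `workload` is the chart's single replicated job. A small helper for building a pod's name prefix, illustrative only and assuming that convention:

```python
# Illustrative only: build the name prefix of a pod in this chart's JobSet.
# Convention assumed from the templates: <release>-workload-0-<pod index>-<hash>.

def pod_name_prefix(release: str, pod_index: int = 0) -> str:
    """Prefix of the pod with the given completion index; index 0 hosts rank 0."""
    return f"{release}-workload-0-{pod_index}"
```

For example, `pod_name_prefix("$USER-a4-qwen3-30b-a3b-1node")` yields the prefix of the pod that runs the rank 0 process.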
+ +To get the logs for one of the pods, run the following command: + +``` +kubectl logs POD_NAME +``` + +Information about the training job's progress, including crucial details such as +loss, step count, and step time, is generated by the rank 0 process. +This process runs on the pod whose name begins with +`JOB_NAME_PREFIX-workload-0-0`. +For example: `$USER-a4-qwen3-30b-a3b-1node-workload-0-0-s9zrv`. + +### Uninstall the Helm release + +You can delete the job and other resources created by the Helm chart. To +uninstall Helm, run the following command from your client: + +```bash +helm uninstall $USER-a4-qwen3-30b-a3b-1node +``` diff --git a/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/custom_setup_experiment.py b/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/custom_setup_experiment.py new file mode 100644 index 00000000..50a38067 --- /dev/null +++ b/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/custom_setup_experiment.py @@ -0,0 +1,327 @@ +import glob +import logging +import os +import sys +import time +from pathlib import Path +from typing import Any, Dict, List, Optional + +import nemo_run as run +from nemo_run.config import get_nemorun_home + + +try: + from argument_parser import parse_cli_args + from utils.evaluate import calc_convergence_and_performance + from utils.executors import dgxc_executor, slurm_executor + from utils.utils import get_exp_name_config, select_config_variant_interactive +except (ImportError, ModuleNotFoundError): + from .argument_parser import parse_cli_args + from .utils.evaluate import calc_convergence_and_performance + from .utils.executors import dgxc_executor, slurm_executor + from .utils.utils import get_exp_name_config, select_config_variant_interactive + +try: + import wandb + + HAVE_WANDB = True +except (ImportError, ModuleNotFoundError): + HAVE_WANDB = False + +try: + from perf_plugins import NsysPlugin, PerfEnvPlugin, PyTorchProfilerPlugin + from resiliency_plugins import 
FaultTolerancePlugin +except (ImportError, ModuleNotFoundError): + from .perf_plugins import NsysPlugin, PerfEnvPlugin, PyTorchProfilerPlugin + from .resiliency_plugins import FaultTolerancePlugin + + +SCRIPT_DIR = Path(__file__).parent.resolve() +ENTRYPOINT_PEFORMANCE = "run_script.py" +ENTRYPOINT_RECIPE = "run_recipe.py" + +logging.basicConfig(level=logging.DEBUG) +logger = logging.getLogger(__name__) + + + + +def main( + use_recipes: bool, + model_family_name: str, + model_recipe_name: str, + task: str, + compute_dtype: str, + gpu: str, + hf_token: str, + detach: bool, + dryrun: bool, + enable_vboost: bool, + enable_nsys: bool, + pytorch_profiler: bool, + moe_a2a_overlap: bool, + tp_size: Optional[int], + pp_size: Optional[int], + cp_size: Optional[int], + ep_size: Optional[int], + wandb_key: str, + wandb_project_name: str, + wandb_experiment_name: str, + wandb_entity_name: str, + profiling_start_step: int, + profiling_stop_step: int, + record_memory_history: bool, + profiling_gpu_metrics: bool, + profiling_ranks: Optional[List[int]], + nsys_trace: Optional[List[str]], + nsys_extra_args: Optional[List[str]], + nemo_home: str, + account: str, + partition: str, + log_dir: str, + gpus_per_node: int, + time_limit: str, + container_image: str, + custom_mounts: List[str], + custom_env_vars: Dict[str, str], + custom_srun_args: List[str], + custom_bash_cmds: List[List[str]], + nccl_ub: bool, + pretrained_checkpoint: Optional[str], + num_gpus: int, + is_long_convergence_run: bool, + additional_slurm_params: Optional[Dict[str, Any]], + golden_values_path: str, + convergence_params: Dict[str, Any], + performance_params: Dict[str, Any], + memory_params: Dict[str, Any], + max_retries: int, + dgxc_base_url: str, + dgxc_cluster: str, + dgxc_kube_apiserver_url: str, + dgxc_app_id: str, + dgxc_app_secret: str, + dgxc_project_name: str, + dgxc_pvc_claim_name: str, + dgxc_pvc_mount_path: str, + config_variant: str = "v1", +): + """Sets up the experiment and runs it.""" + if ( + 
model_family_name in ["qwen3"] + and model_recipe_name + in [ + "qwen3_30b_a3b", + "qwen3_235b_a22b", + ] + and task == "pretrain" + ): + assert hf_token is not None, "HF token is required for Qwen3 tokenizer. NullTokenizer to be used soon." + + if wandb_key is not None: + assert wandb_project_name is not None and wandb_experiment_name is not None, ( + "both wandb_project_name and wandb_experiment_name are required for logging with WandB" + ) + + if use_recipes: + script_name = ENTRYPOINT_RECIPE + exp_name = ( + wandb_experiment_name + if wandb_experiment_name is not None + else f"{model_recipe_name}_{task}_{num_gpus}gpu_{gpu}" + ) + + else: + script_name = ENTRYPOINT_PEFORMANCE + exp_config = get_exp_name_config( + args, model_family_name, model_recipe_name, gpu, compute_dtype, task, config_variant + ) + exp_name = ( + wandb_experiment_name + if wandb_experiment_name is not None + else f"{task}_{model_recipe_name}_{compute_dtype}_{exp_config}" + ) + + if pretrained_checkpoint is not None: + custom_mounts.append(f"{pretrained_checkpoint}:{pretrained_checkpoint}") + + import os + rank = os.environ.get('RANK', '0') + exp_name += f'_worker{rank}' + + run_script_path = SCRIPT_DIR / script_name + logger.info(f"Run script path: {run_script_path}") + if not run_script_path.is_file(): + logger.error(f"Specified run script not found: {run_script_path}") + sys.exit(1) + + custom_mounts.extend( + [ + f"{run_script_path}:{run_script_path}", + f"{SCRIPT_DIR}:{SCRIPT_DIR}", + ] + ) + + if nccl_ub: + custom_env_vars.update({"NCCL_NVLS_ENABLE": "1", "NCCL_CTA_POLICY": "1"}) + + executor = run.LocalExecutor() + + plugins = [] + + if not use_recipes: + plugins.append( + PerfEnvPlugin( + enable_vboost=enable_vboost, + moe_a2a_overlap=moe_a2a_overlap, + tp_size=tp_size, + pp_size=pp_size, + cp_size=cp_size, + ep_size=ep_size, + model_family_name=model_family_name, + model_recipe_name=model_recipe_name, + gpu=gpu, + compute_dtype=compute_dtype, + train_task=task, + 
config_variant=config_variant, + ) + ) + + if enable_nsys: + plugins.append( + NsysPlugin( + profile_step_start=profiling_start_step, + profile_step_end=profiling_stop_step, + nsys_gpu_metrics=profiling_gpu_metrics, + profile_ranks=profiling_ranks, + nsys_trace=nsys_trace, + nsys_extra_args=nsys_extra_args, + ) + ) + if pytorch_profiler: + plugins.append( + PyTorchProfilerPlugin( + profile_step_start=profiling_start_step, + profile_step_end=profiling_stop_step, + profile_ranks=profiling_ranks, + record_memory_history=record_memory_history, + ) + ) + + nemorun_script = run.Script( + path=str(run_script_path), + entrypoint="python", + env={"PYTHONPATH": f"{SCRIPT_DIR}:$PYTHONPATH"}, + args=list(sys.argv[1:]), + ) + + logger.info("Will launch the following command with Nemo-Run: %s", " ".join(nemorun_script.to_command())) + + run.run( + nemorun_script, + executor=executor, + plugins=plugins, + dryrun=dryrun, + detach=detach, + name=exp_name, + ) + + +if __name__ == "__main__": + parser = parse_cli_args() + args, unknown_args = parser.parse_known_args() + + assert not (args.enable_nsys and args.pytorch_profiler), ( + "Both NSys and PyTorch profiler cannot be enabled at the same time" + ) + + # probably better to use parser.parse_args() and make unknowns an error, + # but for now we'll just issue a warning. 
+ if unknown_args: + logger.warning(f"Ignoring unrecognized arguments: {' '.join(unknown_args)}") + + # Handle --list_config_variants: show available variants and interactively select + config_variant = args.config_variant + if args.list_config_variants: + config_variant = select_config_variant_interactive( + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + gpu=args.gpu, + compute_dtype=args.compute_dtype, + task=args.task, + ) + + main( + use_recipes=args.use_recipes, + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + task=args.task, + compute_dtype=args.compute_dtype, + gpu=args.gpu, + hf_token=args.hf_token, + detach=args.detach, + dryrun=args.dryrun, + enable_vboost=args.enable_vboost, + enable_nsys=args.enable_nsys, + pytorch_profiler=args.pytorch_profiler, + moe_a2a_overlap=args.moe_a2a_overlap, + tp_size=args.tensor_model_parallel_size, + pp_size=args.pipeline_model_parallel_size, + cp_size=args.context_parallel_size, + ep_size=args.expert_model_parallel_size, + wandb_key=args.wandb_key, + wandb_project_name=args.wandb_project_name, + wandb_experiment_name=args.wandb_experiment_name, + wandb_entity_name=args.wandb_entity_name, + profiling_start_step=args.profiling_start_step, + profiling_stop_step=args.profiling_stop_step, + record_memory_history=args.record_memory_history, + profiling_gpu_metrics=args.profiling_gpu_metrics, + profiling_ranks=args.profiling_ranks, + nsys_trace=args.nsys_trace, + nsys_extra_args=args.nsys_extra_args, + nemo_home=args.nemo_home, + account=args.account, + partition=args.partition, + log_dir=args.log_dir, + gpus_per_node=args.gpus_per_node, + time_limit=args.time_limit, + container_image=args.container_image, + custom_mounts=args.custom_mounts, + custom_env_vars=args.custom_env_vars, + custom_srun_args=args.custom_srun_args, + custom_bash_cmds=args.custom_bash_cmds, + nccl_ub=args.nccl_ub, + pretrained_checkpoint=args.pretrained_checkpoint, + 
num_gpus=args.num_gpus, + is_long_convergence_run=args.is_long_convergence_run, + additional_slurm_params=args.additional_slurm_params, + golden_values_path=args.golden_values_path, + convergence_params={ + "correlation_threshold": args.correlation_threshold, + "high_loss_tolerance": args.high_loss_tolerance, + "medium_loss_tolerance": args.medium_loss_tolerance, + "low_loss_tolerance": args.low_loss_tolerance, + "final_loss_tolerance": args.final_loss_tolerance, + "max_outlier_ratio": args.max_outlier_ratio, + "outlier_threshold": args.outlier_threshold, + "skip_first_percent_loss": args.skip_first_percent_loss, + }, + performance_params={ + "timing_threshold": args.timing_threshold, + "skip_first_percent_time": args.skip_first_percent_time, + }, + memory_params={ + "memory_threshold": args.memory_threshold, + }, + max_retries=args.max_retries, + dgxc_base_url=args.dgxc_base_url, + dgxc_cluster=args.dgxc_cluster, + dgxc_kube_apiserver_url=args.dgxc_kube_apiserver_url, + dgxc_app_id=args.dgxc_app_id, + dgxc_app_secret=args.dgxc_app_secret, + dgxc_project_name=args.dgxc_project_name, + dgxc_pvc_claim_name=args.dgxc_pvc_claim_name, + dgxc_pvc_mount_path=args.dgxc_pvc_mount_path, + config_variant=config_variant, + ) \ No newline at end of file diff --git a/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/launcher.sh b/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/launcher.sh new file mode 100644 index 00000000..2b856807 --- /dev/null +++ b/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/launcher.sh @@ -0,0 +1,153 @@ +usage() +{ +cat << EOF +usage: bash ./launcher.sh [config-override [config-override ...]] +config-override (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000. 
+EOF +} + +parse_args() { + while [[ "$1" != "" ]]; do + case $(grep -o "=" <<< "$1" | wc -l) in + 1 ) + config_overrides+=("$1") + ;; + * ) + echo "Invalid config override: $1" + usage + exit 1 + esac + shift + done + config_overrides="${config_overrides[*]}" +} + +config_overrides=() +parse_args "$@" + +if [[ -z "${config_overrides[*]}" ]]; then + echo "No NeMo config overrides specified" +else + echo "NeMo config overrides:" + echo " ${config_overrides}" +fi + +export LD_LIBRARY_PATH="/usr/local/cuda/compat/lib:$NCCL_PLUGIN_PATH:$LD_LIBRARY_PATH" +ldconfig "$LD_LIBRARY_PATH" +echo "Added $LD_LIBRARY_PATH to ldconfig:" +ldconfig -p | grep libcuda | sed 's/^/ /' +echo "" + +if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then + explicit_log_dir=${EXPLICIT_LOG_DIR} +else + explicit_log_dir=workload_logs +fi +echo "Logging to ${explicit_log_dir}" + +if [[ -n "${TOKENIZER_PATH}" ]]; then + echo "Getting tokenizer files" + cp "${TOKENIZER_PATH}"/* . + echo "" +fi + +echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes" + +pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger + +# Create the nsys directory. 
+mkdir -p "${explicit_log_dir}/nsys"
+
+# Collect diagnostics to a single line
+kv="\"kernel_version\": \"$(uname --kernel-release)\""
+if command -v nvidia-smi &> /dev/null; then
+  cuda_v=$(nvidia-smi -q -x | grep -Po '(?<=<cuda_version>).*(?=</cuda_version>)' || true)
+  driver_v=$(nvidia-smi -q -x | grep -Po '(?<=<driver_version>).*(?=</driver_version>)' || true)
+  vbios_v=$(nvidia-smi -q -x | grep -Po '(?<=<vbios_version>).*(?=</vbios_version>)' | head -n1 || true)
+  kv="${kv}, \"cuda_version\": \"${cuda_v}\""
+  kv="${kv}, \"driver_version\": \"${driver_v}\""
+  kv="${kv}, \"vbios_version\": \"${vbios_v}\""
+fi
+echo "VERSION_DIAGNOSTICS: {${kv}}"
+
+
+export HF_TOKEN=
+
+cd /opt
+rm -rf Megatron-Bridge
+git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
+cd Megatron-Bridge
+git checkout f7a9428f301fa17ac374d5e7166a63b0aa4771af
+git submodule update --init --recursive
+sed -i -e '/pretrain(config=recipe/i \ recipe.dist.distributed_timeout_minutes = 60' scripts/performance/run_script.py
+ls
+
+cp $CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH scripts/performance/
+
+worker_command=$(cat <<- EOM
+  if [ "\$RANK" -eq "0" ]; then
+    echo "Worker 0 is stalling for a few seconds.."
; + sleep 3 ; + echo "The detected environment within worker rank 0 is:" ; + env | sed 's/^/ /' ; + fi ; + + cd /opt/Megatron-Bridge ; + + numactl \ + --cpunodebind=\$((LOCAL_RANK/4)) \ + --membind=\$((LOCAL_RANK/4)) nsys profile \ + -t nvtx,cuda \ + --cuda-event-trace=false \ + --sample=none \ + --capture-range=cudaProfilerApi \ + --capture-range-end=stop \ + --kill none \ + -o "/${explicit_log_dir}/$JOB_IDENTIFIER/rank-\$RANK" \ + --force-overwrite true \ + --session-new "nsys-\$RANDOM-\$RANK" \ + nice -10 \ + python scripts/performance/custom_setup_experiment.py \ + --gpu b200 \ + --model_family_name qwen \ + --model_recipe_name qwen3_30b_a3b \ + --gpus_per_node 8 \ + --num_gpus 8 \ + --seq_length 4096 \ + --tensor_model_parallel_size 4 \ + --pipeline_model_parallel_size 1 \ + --virtual_pipeline_model_parallel_size None \ + --context_parallel_size 1 \ + --expert_tensor_parallel_size 1 \ + --use_megatron_fsdp false \ + --global_batch_size 512 \ + --micro_batch_size 8 \ + --compute_dtype fp8_mx \ + --cuda_graph_impl transformer_engine \ + --cuda_graph_scope moe_router,moe_preprocess,attn \ + --max_steps 30 + +EOM +) + +echo "$worker_command" > worker_command.sh +chmod 777 worker_command.sh + +torchrun \ +--nproc-per-node="8" \ +--nnodes="1" \ +--node_rank="${JOB_COMPLETION_INDEX}" \ +--rdzv_id="${JOB_IDENTIFIER}" \ +--master_addr="${MASTER_ADDR}" \ +--master_port="${MASTER_PORT}" \ +--no-python bash worker_command.sh + + +if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then + mkdir -p "${ARTIFACT_DIR}" + cp -r "${explicit_log_dir}"/* "${ARTIFACT_DIR}/" + env > "${ARTIFACT_DIR}/environ.txt" + ls "${ARTIFACT_DIR}" +fi +echo "Training completed" +echo "Pod on $(hostname --fqdn) is exiting" \ No newline at end of file diff --git a/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-config-configmap.yaml b/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-config-configmap.yaml new file mode 100644 index 
00000000..a1d54cee --- /dev/null +++ b/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-config-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +{{- if .Values.workload.configFile }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-config" +data: + workload-configuration: |- +{{- if .Values.workload_config }} +{{ .Values.workload_config | nindent 4 }} +{{- else }} +{{ "config: null" | nindent 4 }} +{{- end }} +{{- end }} diff --git a/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-job.yaml b/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-job.yaml new file mode 100644 index 00000000..b4ffa210 --- /dev/null +++ b/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-job.yaml @@ -0,0 +1,333 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +{{$timestamp := now | date "2006-01-02-15-04-05"}} +{{$jobSuffix := randAlphaNum 4 | lower}} +{{$jobuuid := uuidv4}} +{{$nodes := div .Values.workload.gpus 8 | max 1}} +{{$gpusPerNode := min .Values.workload.gpus 8}} +{{- $root := . -}} + +apiVersion: jobset.x-k8s.io/v1alpha2 +kind: JobSet +metadata: + name: "{{ .Release.Name }}" + namespace: default + labels: + {{- if $root.Values.queue }} + kueue.x-k8s.io/queue-name: "{{ $root.Values.queue }}" + {{- end }} +spec: + {{- if $root.Values.queue }} + suspend: true + {{- end }} + failurePolicy: + maxRestarts: {{ default 0 $root.Values.workload.max_workload_restarts }} + replicatedJobs: + - name: workload + replicas: 1 + template: + spec: + parallelism: {{ $nodes }} + completions: {{ $nodes }} + backoffLimit: 0 + completionMode: Indexed + activeDeadlineSeconds: 14400 # 4 hours (4 * 60 * 60) + ttlSecondsAfterFinished: 43200 # 12 hours (12 * 60 * 60) + template: + metadata: + annotations: + kubectl.kubernetes.io/default-container: workload + {{- if $root.Values.volumes.gcsVolumes }} + gke-gcsfuse/volumes: "true" + gke-gcsfuse/cpu-limit: "500m" + gke-gcsfuse/memory-limit: "1Ti" + gke-gcsfuse/ephemeral-storage-limit: "2Ti" + {{- end }} + {{- if $root.Values.volumes.psVolumes }} + gke-parallelstore/volumes: "true" + gke-parallelstore/cpu-limit: "0" + gke-parallelstore/memory-limit: "0" + {{- end }} + {{- if and $root.Values.queue $root.Values.tasSettings.topologyRequest }} + {{- toYaml .Values.tasSettings.topologyRequest | nindent 14 }} + {{- end }} + {{- if and $root.Values.queue $root.Values.dwsSettings.maxRunDurationSeconds }} + provreq.kueue.x-k8s.io/maxRunDurationSeconds: "{{ $root.Values.dwsSettings.maxRunDurationSeconds }}" + {{- end }} + {{- if not $root.Values.network.hostNetwork }} + networking.gke.io/default-interface: "eth0" + networking.gke.io/interfaces: | + {{- if $root.Values.network.subnetworks }} + [ 
+ {{- range $i, $subnetwork := $root.Values.network.subnetworks }} + {"interfaceName":"eth{{ $i }}","network":"{{ $subnetwork }}"}{{ eq $i 9 | ternary "" ","}} + {{- end }} + ] + {{- else }} + [ + {"interfaceName":"eth0","network":"default"}, + {"interfaceName":"eth1","network":"gvnic-1"}, + {{- range $i := until 8 }} + {"interfaceName":"eth{{ add 2 $i }}","network":"rdma-{{ $i }}"}{{ eq $i 7 | ternary "" ","}} + {{- end }} + ] + {{- end }} + {{- end }} + spec: + {{- if $root.Values.network.hostNetwork }} + hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet + {{- end }} + subdomain: "{{.Release.Name}}" + restartPolicy: Never + {{- if $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "In" + values: + {{- range $hostname := $root.Values.targetNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + {{- if $root.Values.avoidNodes }} + {{- if not $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + {{- end }} + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "NotIn" + values: + {{- range $hostname := $root.Values.avoidNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + tolerations: + - operator: "Exists" + key: nvidia.com/gpu + - operator: "Exists" + key: cloud.google.com/impending-node-termination + + volumes: + {{ if $root.Values.network.gibVersion }} + - name: gib + emptyDir: {} + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + configMap: + name: "{{.Release.Name}}-config" + items: + - key: workload-configuration + path: {{ $root.Values.workload.configFile | default "workload-configuration" }} + {{- end }} + + - name: workload-launcher + configMap: + name: "{{.Release.Name}}-launcher" + + - name: shared-memory + emptyDir: + medium: "Memory" + sizeLimit: 250Gi + + {{- range $pvc := 
$root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + persistentVolumeClaim: + claimName: "{{ $pvc.claimName }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + csi: + driver: gcsfuse.csi.storage.gke.io + volumeAttributes: + bucketName: "{{ $gcs.bucketName }}" + {{- if $gcs.mountOptions }} + mountOptions: "{{ $gcs.mountOptions }}" + {{- end }} + {{- end}} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + hostPath: + path: /mnt/stateful_partition/kube-ephemeral-ssd + {{- end }} + + initContainers: + {{ if $root.Values.network.gibVersion }} + - name: nccl-plugin-installer + image: {{ $root.Values.network.gibVersion }} + imagePullPolicy: Always + args: + - | + set -ex + /scripts/container_entry.sh install --install-nccl + cp -R /var/lib/gib/lib64/. /target/usr/local/gib/lib64 + cp -R /var/lib/gib/. /target/usr/local/gib + command: + - /bin/sh + - -c + volumeMounts: + - mountPath: /target/usr/local/gib + name: gib + {{ end}} + + containers: + {{- if $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-sidecar + image: {{ $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-metadata-prefetch + image: {{ $root.Values.workload.gcsSidecarImage }} + {{- end }} + {{- if $root.Values.workload.psSidecarImage }} + - name: gke-parallelstore-sidecar + image: {{ $root.Values.workload.psSidecarImage }} + {{- end }} + + - name: workload + image: "{{ $root.Values.workload.image }}" + imagePullPolicy: Always + {{- if $root.Values.network.hostNetwork }} + securityContext: + privileged: true + {{- end }} + env: + - name: JOB_IDENTIFIER + value: "{{ .Release.Name }}-{{ $timestamp }}" + - name: JOB_TIMESTAMP + value: "{{ $timestamp }}" + - name: JOB_UUID + value: "{{ $jobuuid }}" + - name: JOB_ORCHESTRATOR + value: "gke" + # Add RANK based on the pod's index provided by the Indexed Job + # This is crucial for torch.distributed initialization. 
+ - name: JOB_COMPLETION_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index'] + - name: RANK_0_FQDN + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: HOSTNAME_PREFIX + value: "{{.Release.Name}}-workload-" + - name: DOMAIN_NAME + value: "{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_ADDR + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_PORT + value: "6002" + - name: WORLD_SIZE + value: "{{ $root.Values.workload.gpus }}" + - name: NNODES + value: "{{ $nodes }}" + - name: GPUS_PER_NODE + value: "{{ $gpusPerNode }}" + + - name: NCCL_PLUGIN_PATH + value: /usr/local/gib/lib64:/usr/local/nvidia/lib64 + + {{ if $root.Values.network.gibVersion }} + - name: NCCL_INIT_SCRIPT + value: "/usr/local/gib/scripts/set_nccl_env.sh" + {{ end }} + + {{ if $root.Values.network.ncclSettings }} + {{- toYaml .Values.network.ncclSettings | nindent 14 }} + {{ end }} + + {{ if $root.Values.workload.envs }} + {{- toYaml .Values.workload.envs | nindent 14 }} + {{ end }} + + command: + - bash + - -c + - | + echo "Pod on $(hostname --fqdn) is running" + echo "Pod is assigned job index of $JOB_COMPLETION_INDEX" + + if [[ -n "${NCCL_INIT_SCRIPT}" ]]; then + echo "Running NCCL init script: ${NCCL_INIT_SCRIPT}" + source ${NCCL_INIT_SCRIPT} + fi + + # Overriding NCCL_SOCKET_IFNAME definition + export NCCL_SOCKET_IFNAME="eth0,eth1" + export NCCL_TUNER_CONFIG_PATH=/usr/local/gib/configs/tuner_config_a4.txtpb + + echo "Launching workload with the following arguments:" + {{- range $root.Values.workload.defaultArguments }} + echo " {{ . }}" + {{- end }} + {{- range $root.Values.workload.arguments }} + echo " {{ . }}" + {{- end }} + echo "" + + sleep 10 + + bash /workload/launcher/launch-workload.sh \ + {{- range $root.Values.workload.defaultArguments }} + {{ . }} \ + {{- end }} + {{- range $root.Values.workload.arguments }} + {{ . 
}} \ + {{- end }} + + + volumeMounts: + {{ if $root.Values.network.gibVersion }} + - name: gib + mountPath: /usr/local/gib + {{ end }} + + {{- if $root.Values.workload.configFile }} + - name: workload-configuration + mountPath: {{ $root.Values.workload.configPath | default "/workload/configs" }} + {{- end }} + + - name: workload-launcher + mountPath: /workload/launcher + + - name: shared-memory + mountPath: /dev/shm + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + mountPath: "{{ $pvc.mountPath }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + mountPath: "{{ $gcs.mountPath }}" + {{- end }} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + mountPath: "{{ $root.Values.volumes.ssdMountPath }}" + {{- end }} + + resources: + limits: + nvidia.com/gpu: {{ $gpusPerNode }} diff --git a/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-launcher-configmap.yaml b/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-launcher-configmap.yaml new file mode 100644 index 00000000..7026e0f1 --- /dev/null +++ b/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-launcher-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-launcher" +data: + launch-workload.sh: |- +{{- if .Values.workload_launcher }} +{{ .Values.workload_launcher | nindent 4 }} +{{- else }} + #!/bin/bash + echo "No workload launcher specified" + exit 1 +{{- end }} diff --git a/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-svc.yaml b/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-svc.yaml new file mode 100644 index 00000000..7cfe220b --- /dev/null +++ b/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/templates/workload-svc.yaml @@ -0,0 +1,22 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: Service +metadata: + name: "{{ .Release.Name }}" +spec: + clusterIP: None + selector: + jobset.sigs.k8s.io/jobset-name: "{{ .Release.Name }}" diff --git a/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/values.yaml b/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/values.yaml new file mode 100644 index 00000000..0bd19bd1 --- /dev/null +++ b/training/a4/qwen3_30b_a3b/nemo-pretraining-gke/nemo2602/recipe/values.yaml @@ -0,0 +1,33 @@ +queue: null +dwsSettings: + maxRunDurationSeconds: null +tasSettings: + topologyRequest: + kueue.x-k8s.io/podset-preferred-topology: kubernetes.io/hostname +volumes: + gcsVolumes: true + psVolumes: false + gcsMounts: + - bucketName: null + mountPath: null +workload: + gpus: 8 + image: nvcr.io/nvidia/nemo:26.02 + defaultArguments[]: null + arguments[]: null + configFile: custom_setup_experiment.py + configPath: /workload/configs/ + envs: + - name: ARTIFACT_DIR + value: null + - name: GLOO_SOCKET_IFNAME + value: eth0 + - name: CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH + value: /workload/configs/custom_setup_experiment.py +network: + hostNetwork: true + subnetworks[]: null + gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1 + ncclSettings: + - name: NCCL_DEBUG + value: WARN diff --git a/training/a4x/llama3-1-405b/megatron-bridge-pretraining-slurm/32node-FP8CS-GBS1024/README.md b/training/a4x/llama3-1-405b/megatron-bridge-pretraining-slurm/32node-FP8CS-GBS1024/README.md new file mode 100644 index 00000000..b604a892 --- /dev/null +++ b/training/a4x/llama3-1-405b/megatron-bridge-pretraining-slurm/32node-FP8CS-GBS1024/README.md @@ -0,0 +1,107 @@ + +# Pretrain Llama 3.1 405B workloads on A4X Slurm Cluster with Nvidia Megatron-Bridge + +This recipe outlines the steps for running a Llama 3.1 405B pretraining workload on [Google Cloud A4X Slurm clusters](https://docs.cloud.google.com/ai-hypercomputer/docs/create/create-slurm-cluster) by using [NVIDIA 
Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge). + +## Orchestration and deployment tools + +For this recipe, the following setup is used: + +- Orchestration - [Slurm Workload Manager](https://slurm.schedmd.com/) +- Deployment - [Cluster Toolkit](https://cloud.google.com/cluster-toolkit/docs/overview) + +## Test environment + +This recipe has been optimized for and tested with the following configuration: + +- A4X Slurm Cluster (32 nodes, 128 GPUs) +- Machine Type: `a4x-highgpu-4g` +- Lustre Filesystem + +Please follow the instructions in the [Cluster Toolkit A4X Example README](https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/machine-learning/a4x-highgpu-4g/README.md) to provision an A4X High Slurm cluster. + +## Docker container image + +This recipe uses the following container images: + +- `nvcr.io/nvidia/nemo:25.09` + +## Run the recipe + +From your cluster login node, complete the following steps: + +### Configure environment settings + +#### Setup Enroot for Megatron + +We recommend setting this up on Lustre. + +```bash + # Here, /home is a lustre filesystem +export BASE_DIR=/home/${USER} +export LOCAL_SSD_DIR=/mnt/localssd +cd ${BASE_DIR} + +# Configure Enroot +export ENROOT_CONFIG_PATH=${HOME}/.config/enroot +mkdir -p ${ENROOT_CONFIG_PATH} + +# Authenticate with Google Cloud Docker registry +gcloud auth configure-docker us-docker.pkg.dev + +# Import the NVIDIA NeMo container +mkdir -p ${BASE_DIR}/sqsh +enroot import --output ${BASE_DIR}/sqsh/nvidia+nemo+25.09.sqsh -- docker://nvcr.io#nvidia/nemo:25.09 +``` + +### Get the recipe + +Clone the Megatron-Bridge repository: + +```bash +git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git +cd Megatron-Bridge +git checkout minor_cfg_updates_2509 +cd ${BASE_DIR} +``` + +### Configure and submit a pretraining job + +#### Using 32 nodes (128 GPUs) FP8 precision + +The `submit.slurm` script is provided to run the training job. 
Ensure you are in `${BASE_DIR}` and copy or create the `submit.slurm` script there.
+
+Create a logs directory to store the job output:
+
+```bash
+mkdir -p ${BASE_DIR}/logs
+```
+
+To execute the job with the default settings, run the following command:
+
+```bash
+sbatch ${BASE_DIR}/submit.slurm
+```
+
+
+### Monitor the job
+
+To check the status of jobs in your queue, run the following command:
+
+```bash
+squeue
+```
+
+To view the output logs, use `tail` on the output file generated by Slurm. Per the `--output=logs/%x_%u_%j.out` directive in `submit.slurm`, the file is named `<job_name>_<user>_<job_id>.out`:
+
+```bash
+tail -f logs/<job_name>_<user>_<job_id>.out
+```
+
+### Cancel the job
+
+To cancel a running job, replace `<job_id>` with your actual job ID:
+
+```bash
+scancel <job_id>
+```
 diff --git a/training/a4x/llama3-1-405b/megatron-bridge-pretraining-slurm/32node-FP8CS-GBS1024/submit.slurm b/training/a4x/llama3-1-405b/megatron-bridge-pretraining-slurm/32node-FP8CS-GBS1024/submit.slurm
new file mode 100644
index 00000000..2c7e6346
--- /dev/null
+++ b/training/a4x/llama3-1-405b/megatron-bridge-pretraining-slurm/32node-FP8CS-GBS1024/submit.slurm
@@ -0,0 +1,101 @@
+#!/bin/bash
+#SBATCH --exclusive
+#SBATCH --job-name=llama31-405b-pretrain
+#SBATCH --nodes=32
+#SBATCH --ntasks-per-node=4
+#SBATCH --mem=0
+#SBATCH --segment=16
+#SBATCH --output=logs/%x_%u_%j.out
+#SBATCH --time=24:00:00
+#SBATCH --open-mode=append
+
+set -e
+
+# SET UP MASTER ADDRESS
+nodes=( $( scontrol show hostnames ${SLURM_JOB_NODELIST} ) )
+head_node=${nodes[0]}
+export MASTER_ADDR=${head_node}
+export MASTER_PORT=6002
+export NNODES=${SLURM_JOB_NUM_NODES}
+echo "Master Node: ${MASTER_ADDR}"
+
+# EXPORT ENV VARS
+export PYTHONUNBUFFERED=1
+export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
+
+# MPI AND NETWORK
+export PMIX_MCA_gds="^ds12"
+export GLOO_SOCKET_IFNAME=enp0s3
+
+# PATHS - Modify these if your setup differs
+if [[ ! -d "${BASE_DIR}" ]]; then
+  echo "Error: BASE_DIR ${BASE_DIR} not found. Please ensure it exists."
+  exit 1
+fi
+CONTAINER="${BASE_DIR}/sqsh/nvidia+nemo+25.09.sqsh"
+REPO_DIR="${BASE_DIR}/Megatron-Bridge"
+RUN_SCRIPT="${REPO_DIR}/scripts/performance/run_script.py"
+CONFIG_FILE="${REPO_DIR}/scripts/performance/configs/llama31/llama31_405b_llm_pretrain.yaml"
+
+TOTAL_GPUS=$(( ${SLURM_JOB_NUM_NODES} * 4 ))
+
+# CACHE SETUP
+CACHE_ROOT="${BASE_DIR}/cache/triton_cache"
+INDUCTOR_ROOT="${BASE_DIR}/cache/torchinductor_cache"
+JOB_CACHE_DIR="${CACHE_ROOT}/${SLURM_JOBID}"
+JOB_INDUCTOR_DIR="${INDUCTOR_ROOT}/${SLURM_JOBID}"
+
+# EXECUTION
+echo "Submitting training job on ${SLURM_JOB_NUM_NODES} nodes (${TOTAL_GPUS} GPUs)..."
+
+# Note: --container-mounts assumes paths exist on the host
+srun \
+  --container-image "${CONTAINER}" \
+  --container-mounts "${BASE_DIR},/usr/local/gib:/usr/local/gib,${LOCAL_SSD_DIR}" \
+  --container-workdir "${BASE_DIR}" \
+  --no-container-mount-home \
+  --container-writable \
+  --gres=gpu:4 \
+  -l \
+  --segment 16 \
+  --mpi=pmix \
+  bash -c "
+  # Ensure clean slate inside container
+  unset NCCL_SOCKET_IFNAME
+  unset NCCL_IB_DISABLE
+  unset NCCL_GPUDIRECT_TCPX_FORCE_ACK
+
+  # Distributed Ranks
+  export WORLD_SIZE=\${SLURM_NTASKS}
+  export RANK=\${SLURM_PROCID}
+  export LOCAL_RANK=\${SLURM_LOCALID}
+  export NODE_RANK=\${SLURM_NODEID}
+
+  # Per-node compile caches keyed by job ID
+  export TRITON_CACHE_DIR=${JOB_CACHE_DIR}/node_\${SLURM_NODEID}
+  export TORCHINDUCTOR_CACHE_DIR=${JOB_INDUCTOR_DIR}/node_\${SLURM_NODEID}
+  mkdir -p \${TRITON_CACHE_DIR} \${TORCHINDUCTOR_CACHE_DIR}
+
+  export LD_LIBRARY_PATH=/usr/local/gib/lib64:\${LD_LIBRARY_PATH}
+
+  # Source environment scripts
+  if [ -f /usr/local/gib/scripts/set_nccl_env.sh ]; then
+    source /usr/local/gib/scripts/set_nccl_env.sh
+  fi
+
+  echo \"\$(hostname) executing rank \${SLURM_PROCID} (\${SLURM_JOBID}/\${SLURM_TASK_PID})\"
+
+  # PYTHON EXECUTION
+  numactl --cpunodebind=\$((\${SLURM_LOCALID}/2)) --membind=\$((\${SLURM_LOCALID}/2)) \
+  python ${RUN_SCRIPT} \
+  --config_file ${CONFIG_FILE} \
+  --model_name llama31 \
--model_size 405b \ + --compute_dtype fp8 \ + --fp8_recipe cs \ + --gpu gb200 \ + -a dummy -p dummy \ + -ng ${TOTAL_GPUS} \ + train.manual_gc=true \ + train.manual_gc_interval=100 + " diff --git a/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS1024/recipe/README.md b/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS1024/recipe/README.md new file mode 100644 index 00000000..d8c0972e --- /dev/null +++ b/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS1024/recipe/README.md @@ -0,0 +1,95 @@ + +# Pretrain Qwen 3 235B A22B workloads on A4X Slurm Cluster with Nvidia Megatron-Bridge + +This recipe outlines the steps for running a Qwen 3 235B A22B pretraining workload on [Google Cloud A4X Slurm clusters](https://docs.cloud.google.com/ai-hypercomputer/docs/create/create-slurm-cluster) by using [NVIDIA Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge). + +## Orchestration and deployment tools + +For this recipe, the following setup is used: + +- Orchestration - [Slurm Workload Manager](https://slurm.schedmd.com/) +- Deployment - [Cluster Toolkit](https://cloud.google.com/cluster-toolkit/docs/overview) + +## Test environment + +This recipe has been optimized for and tested with the following configuration: + +- A4X Slurm Cluster (16 nodes, 64 GPUs) +- Machine Type: A4X (GB200) +- Lustre Filesystem + +Please follow the instructions in the [Cluster Toolkit A4X Example README](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/machine-learning) to provision an A4X Slurm cluster. 
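The parallelism layout used later in this recipe's `launch_script.sh` (TP=1, PP=8, GBS=1024, micro-batch 1 on 64 GPUs) has to divide evenly. As a quick sanity check, and assuming the standard Megatron relation `GBS = micro_batch x data_parallel x grad_accum` (the recipe does not state this explicitly), the implied data-parallel size and gradient-accumulation depth can be derived with shell arithmetic:

```shell
# Derived sizes for the 16-node (64-GPU) BF16 recipe. The TP/PP/GBS/MBS
# values mirror launch_script.sh; the grad-accum relation is a standard
# Megatron assumption, not something this recipe states explicitly.
NUM_GPUS=64; TP=1; PP=8; GBS=1024; MBS=1
DP=$(( NUM_GPUS / (TP * PP) ))   # model replicas: 64 / 8 = 8
ACC=$(( GBS / (DP * MBS) ))      # micro-batches accumulated per step: 1024 / 8 = 128
echo "DP=${DP} grad_accum=${ACC}"
```

If these numbers stop dividing evenly after you change a parallelism flag, the run will fail at startup, so the check is worth a few seconds.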
+
+## Docker container image
+
+This recipe uses the following container image:
+
+- `nvcr.io/nvidia/nemo:25.11`
+
+## Run the recipe
+
+### Configure environment settings
+
+Set the environment variables to match your environment:
+
+  ```bash
+  export PROJECT_ID=<PROJECT_ID>
+  export CLUSTER_REGION=<CLUSTER_REGION>
+  export CLUSTER_NAME=<CLUSTER_NAME>
+  gcloud compute ssh $CLUSTER_NAME --project $PROJECT_ID --zone $CLUSTER_REGION -- -o Hostname=nic0.$CLUSTER_NAME.$CLUSTER_REGION.c.$PROJECT_ID.internal.gcpnode.com
+  ```
+
+Replace the following values:
+
+  - `<PROJECT_ID>`: your Google Cloud project ID.
+  - `<CLUSTER_REGION>`: the region where your cluster is located.
+  - `<CLUSTER_NAME>`: the name of your Slurm cluster.
+
+Set the default project:
+
+  ```bash
+  gcloud config set project $PROJECT_ID
+  ```
+
+From your cluster login node, complete the following steps:
+
+### Get the recipe
+
+Clone the `gpu-recipes` repository and set a reference to the recipe folder.
+
+```
+git clone https://github.com/ai-hypercomputer/gpu-recipes.git
+cd gpu-recipes
+export REPO_ROOT=`git rev-parse --show-toplevel`
+export RECIPE_ROOT=$REPO_ROOT/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS1024/recipe
+cd $RECIPE_ROOT
+```
+
+### Submit a pretraining job
+
+**Note:** Before running the recipe, replace `<HF_TOKEN>` with your actual Hugging Face token in `launch_script.sh`.
+
+```
+cd ..
+sbatch ./recipe/sbatch_script.sh
+```
+
+### Monitor the job
+
+To check the status of your jobs in the queue, run the following command:
+
+```
+squeue --me
+```
+
+To get the logs for the job, run the following command (sbatch writes them to the default `slurm-<job_id>.out` file):
+
+```
+tail -f slurm-<job_id>.out
+```
+
+### Cancel the job
+
+```bash
+scancel -u $USER
+```
diff --git a/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS1024/recipe/custom_setup_experiment.py b/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS1024/recipe/custom_setup_experiment.py
new file mode 100644
index 00000000..32173cbc
--- /dev/null
+++ b/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS1024/recipe/custom_setup_experiment.py
@@ -0,0 +1,233 @@
+import glob
+import logging
+import os
+from pathlib import Path
+import sys
+import time
+from typing import Any, Dict, List, Optional
+
+import nemo_run as run
+from nemo_run.config import get_nemorun_home
+
+
+try:
+    from argument_parser import parse_cli_args
+    from utils.evaluate import calc_convergence_and_performance
+    from utils.executors import dgxc_executor, slurm_executor
+except (ImportError, ModuleNotFoundError):
+    from .argument_parser import parse_cli_args
+    from .utils.evaluate import calc_convergence_and_performance
+    from .utils.executors import dgxc_executor, slurm_executor
+
+try:
+    import wandb
+
+    HAVE_WANDB = True
+except (ImportError, ModuleNotFoundError):
+    HAVE_WANDB = False
+
+try:
+    from perf_plugins import NsysPlugin, PerfEnvPlugin
+    from resiliency_plugins import FaultTolerancePlugin
+except (ImportError, ModuleNotFoundError):
+    from .perf_plugins import NsysPlugin, PerfEnvPlugin
+    from .resiliency_plugins import FaultTolerancePlugin
+
+
+SCRIPT_DIR = Path(__file__).parent.resolve()
+ENTRYPOINT_PEFORMANCE = "run_script.py"
+ENTRYPOINT_RECIPE = "run_recipe.py"
+
+logging.basicConfig(level=logging.DEBUG)
+logger = logging.getLogger(__name__)
+
+
+def main(
use_recipes: bool, + model_family_name: str, + model_recipe_name: str, + task: str, + compute_dtype: str, + gpu: str, + hf_token: str, + detach: bool, + dryrun: bool, + enable_vboost: bool, + enable_nsys: bool, + moe_a2a_overlap: bool, + tp_size: Optional[int], + pp_size: Optional[int], + cp_size: Optional[int], + wandb_key: str, + wandb_project_name: str, + wandb_experiment_name: str, + wandb_entity_name: str, + profiling_start_step: int, + profiling_stop_step: int, + profiling_gpu_metrics: bool, + profiling_ranks: Optional[List[int]], + nemo_home: str, + account: str, + partition: str, + log_dir: str, + gpus_per_node: int, + time_limit: str, + container_image: str, + custom_mounts: List[str], + custom_env_vars: List[str], + custom_srun_args: List[str], + pretrained_checkpoint: Optional[str], + num_gpus: int, + is_long_convergence_run: bool, + additional_slurm_params: Optional[Dict[str, Any]], + golden_values_path: str, + convergence_params: Dict[str, Any], + performance_params: Dict[str, Any], + max_retries: int, + dgxc_base_url: str, + dgxc_cluster: str, + dgxc_kube_apiserver_url: str, + dgxc_app_id: str, + dgxc_app_secret: str, + dgxc_project_name: str, + dgxc_pvc_claim_name: str, + dgxc_pvc_mount_path: str, +): + logger.info("Hello World") + + rank = os.environ['RANK'] + + exp_name = f"{model_recipe_name}_{model_family_name}" + exp_name += f'_worker{rank}' + if use_recipes: + script_name = ENTRYPOINT_RECIPE + + else: + script_name = ENTRYPOINT_PEFORMANCE + + run_script_path = SCRIPT_DIR / script_name + logger.info(f"Run script path: {run_script_path}") + if not run_script_path.is_file(): + logger.error(f"Specified run script not found: {run_script_path}") + sys.exit(1) + + nemorun_script = run.Script( + path=str(run_script_path), + entrypoint="python", + env={"PYTHONPATH": f"{SCRIPT_DIR}:$PYTHONPATH"}, + args=list(sys.argv[1:]), + ) + + plugins = [] + + if not use_recipes: + plugins.append( + PerfEnvPlugin( + enable_vboost=enable_vboost, + 
moe_a2a_overlap=moe_a2a_overlap, + tp_size=tp_size, + pp_size=pp_size, + cp_size=cp_size, + model_family_name=model_family_name, + model_recipe_name=model_recipe_name, + gpu=gpu, + compute_dtype=compute_dtype, + train_task=task, + ) + ) + + if enable_nsys: + plugins.append( + NsysPlugin( + profile_step_start=profiling_start_step, + profile_step_end=profiling_stop_step, + nsys_gpu_metrics=profiling_gpu_metrics, + profile_ranks=profiling_ranks, + ) + ) + + executor = run.LocalExecutor() + run.run( + nemorun_script, + executor=executor, + plugins=plugins, + dryrun=False, + detach=False, + name=exp_name, + ) + + +if __name__ == "__main__": + parser = parse_cli_args() + args, unknown_args = parser.parse_known_args() + + # probably better to use parser.parse_args() and make unknowns an error, + # but for now we'll just issue a warning. + if unknown_args: + logger.warning(f"Ignoring unrecognized arguments: {' '.join(unknown_args)}") + + main( + use_recipes=args.use_recipes, + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + task=args.task, + compute_dtype=args.compute_dtype, + gpu=args.gpu, + hf_token=args.hf_token, + detach=args.detach, + dryrun=args.dryrun, + enable_vboost=args.enable_vboost, + enable_nsys=args.enable_nsys, + moe_a2a_overlap=args.moe_a2a_overlap, + tp_size=args.tensor_model_parallel_size, + pp_size=args.pipeline_model_parallel_size, + cp_size=args.context_parallel_size, + wandb_key=args.wandb_key, + wandb_project_name=args.wandb_project_name, + wandb_experiment_name=args.wandb_experiment_name, + wandb_entity_name=args.wandb_entity_name, + profiling_start_step=args.profiling_start_step, + profiling_stop_step=args.profiling_stop_step, + profiling_gpu_metrics=args.profiling_gpu_metrics, + profiling_ranks=args.profiling_ranks, + nemo_home=args.nemo_home, + account=args.account, + partition=args.partition, + log_dir=args.log_dir, + gpus_per_node=args.gpus_per_node, + time_limit=args.time_limit, + 
container_image=args.container_image, + custom_mounts=args.custom_mounts, + custom_env_vars=args.custom_env_vars, + custom_srun_args=args.custom_srun_args, + pretrained_checkpoint=args.pretrained_checkpoint, + num_gpus=args.num_gpus, + is_long_convergence_run=args.is_long_convergence_run, + additional_slurm_params=args.additional_slurm_params, + golden_values_path=args.golden_values_path, + convergence_params={ + "correlation_threshold": args.correlation_threshold, + "high_loss_tolerance": args.high_loss_tolerance, + "medium_loss_tolerance": args.medium_loss_tolerance, + "low_loss_tolerance": args.low_loss_tolerance, + "final_loss_tolerance": args.final_loss_tolerance, + "max_outlier_ratio": args.max_outlier_ratio, + "outlier_threshold": args.outlier_threshold, + "skip_first_percent_loss": args.skip_first_percent_loss, + }, + performance_params={ + "timing_threshold": args.timing_threshold, + "skip_first_percent_time": args.skip_first_percent_time, + }, + max_retries=args.max_retries, + dgxc_base_url=args.dgxc_base_url, + dgxc_cluster=args.dgxc_cluster, + dgxc_kube_apiserver_url=args.dgxc_kube_apiserver_url, + dgxc_app_id=args.dgxc_app_id, + dgxc_app_secret=args.dgxc_app_secret, + dgxc_project_name=args.dgxc_project_name, + dgxc_pvc_claim_name=args.dgxc_pvc_claim_name, + dgxc_pvc_mount_path=args.dgxc_pvc_mount_path, + ) diff --git a/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS1024/recipe/launch_script.sh b/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS1024/recipe/launch_script.sh new file mode 100644 index 00000000..dfc0f282 --- /dev/null +++ b/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS1024/recipe/launch_script.sh @@ -0,0 +1,151 @@ +usage() +{ +cat << EOF +usage: bash ./launcher.sh [config-override [config-override ...]] +config-override (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000. 
+EOF +} + +parse_args() { + while [[ "$1" != "" ]]; do + case $(grep -o "=" <<< "$1" | wc -l) in + 1 ) + config_overrides+=("$1") + ;; + * ) + echo "Invalid config override: $1" + usage + exit 1 + esac + shift + done + config_overrides="${config_overrides[*]}" +} + +config_overrides=() +parse_args "$@" + +if [[ -z "${config_overrides[*]}" ]]; then + echo "No NeMo config overrides specified" +else + echo "NeMo config overrides:" + echo " ${config_overrides}" +fi + +export LD_LIBRARY_PATH="/usr/local/cuda/compat/lib:$NCCL_PLUGIN_PATH:$LD_LIBRARY_PATH" +ldconfig "$LD_LIBRARY_PATH" +echo "Added $LD_LIBRARY_PATH to ldconfig:" +ldconfig -p | grep libcuda | sed 's/^/ /' +echo "" + +if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then + explicit_log_dir=${EXPLICIT_LOG_DIR} +else + explicit_log_dir=workload_logs +fi +echo "Logging to ${explicit_log_dir}" + +if [[ -n "${TOKENIZER_PATH}" ]]; then + echo "Getting tokenizer files" + cp "${TOKENIZER_PATH}"/* . + echo "" +fi + +echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes" + +pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger + +# Create the nsys directory. 
+mkdir -p "${explicit_log_dir}/nsys"
+
+# Collect diagnostics to a single line
+kv="\"kernel_version\": \"$(uname --kernel-release)\""
+if command -v nvidia-smi &> /dev/null; then
+  cuda_v=$(nvidia-smi -q -x | grep -Po '(?<=<cuda_version>).*(?=</cuda_version>)' || true)
+  driver_v=$(nvidia-smi -q -x | grep -Po '(?<=<driver_version>).*(?=</driver_version>)' || true)
+  vbios_v=$(nvidia-smi -q -x | grep -Po '(?<=<vbios_version>).*(?=</vbios_version>)' | head -n1 || true)
+  kv="${kv}, \"cuda_version\": \"${cuda_v}\""
+  kv="${kv}, \"driver_version\": \"${driver_v}\""
+  kv="${kv}, \"vbios_version\": \"${vbios_v}\""
+fi
+echo "VERSION_DIAGNOSTICS: {${kv}}"
+
+
+export HF_TOKEN=<HF_TOKEN>
+
+cd /opt
+rm -rf Megatron-Bridge
+git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
+cd Megatron-Bridge
+git checkout 7695d4acbfac19353d20e456509117efe4733d6b
+sed -i -e '/pretrain(config=recipe/i \ recipe.dist.distributed_timeout_minutes = 10' scripts/performance/run_script.py
+ls
+
+cp $CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH scripts/performance/
+
+worker_command=$(cat <<- EOM
+  if [ "\$RANK" -eq "0" ]; then
+    echo "Worker 0 is stalling for a few seconds.."
; + sleep 3 ; + echo "The detected environment within worker rank 0 is:" ; + env | sed 's/^/ /' ; + fi ; + + cd /opt/Megatron-Bridge ; + + numactl \ + --cpunodebind=\$((LOCAL_RANK/2)) \ + --membind=\$((LOCAL_RANK/2)) nsys profile \ + -t nvtx,cuda \ + --cuda-event-trace=false \ + --sample=none \ + --capture-range=cudaProfilerApi \ + --capture-range-end=stop \ + --kill none \ + -o "/${explicit_log_dir}/$JOB_IDENTIFIER/rank-\$RANK" \ + --force-overwrite true \ + --session-new "nsys-\$RANDOM-\$RANK" \ + nice -10 \ + python scripts/performance/custom_setup_experiment.py \ + --gpu gb200 \ + --model_family_name qwen \ + --model_recipe_name qwen3_235b_a22b \ + --gpus_per_node 4 \ + --num_gpus 64 \ + --seq_length 4096 \ + --compute_dtype bf16 \ + --global_batch_size 1024 \ + --tensor_model_parallel_size 1 \ + --pipeline_model_parallel_size 8 \ + --context_parallel_size 1 \ + --expert_model_parallel_size 8 \ + --virtual_pipeline_model_parallel_size 3 \ + --micro_batch_size 1 \ + --cuda_graph_impl transformer_engine \ + --cuda_graph_scope moe_router,moe_preprocess,attn \ + --max_steps 30 + +EOM +) + +echo "$worker_command" > worker_command.sh +chmod 777 worker_command.sh + +torchrun \ +--nproc-per-node="4" \ +--nnodes="16" \ +--node_rank="${JOB_COMPLETION_INDEX}" \ +--rdzv_id="${JOB_IDENTIFIER}" \ +--master_addr="${MASTER_ADDR}" \ +--master_port="${MASTER_PORT}" \ +--no-python bash worker_command.sh + + +if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then + mkdir -p "${ARTIFACT_DIR}" + cp -r "${explicit_log_dir}"/* "${ARTIFACT_DIR}/" + env > "${ARTIFACT_DIR}/environ.txt" + ls "${ARTIFACT_DIR}" +fi +echo "Training completed" +echo "Pod on $(hostname --fqdn) is exiting" \ No newline at end of file diff --git a/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS1024/recipe/sbatch_script.sh b/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS1024/recipe/sbatch_script.sh new file mode 100644 index 00000000..0905337c --- /dev/null 
+++ b/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/16node-BF16-GBS1024/recipe/sbatch_script.sh
@@ -0,0 +1,54 @@
+#!/bin/bash
+#SBATCH --job-name=megatron-pretrain-qwen3-235b-a22b
+#SBATCH --nodes=16
+#SBATCH --ntasks-per-node=1
+#SBATCH --gres=gpu:4
+#SBATCH --mem=0
+
+# Exit early on failures
+set -e
+
+# Validate that the recipe location is set up correctly.
+# The recipe is expected to be in the "recipe" folder inside the current working directory.
+RECIPE_DIR="$(pwd)/recipe"
+LAUNCH_SCRIPT="${RECIPE_DIR}/launch_script.sh"
+if [[ ! -f "${LAUNCH_SCRIPT}" ]]; then
+  echo "Error: Recipe is not located correctly. The recipe is expected to be in the 'recipe' folder inside the current working directory. We could not find the launch script there." >&2
+  exit 1
+fi
+chmod +x "${LAUNCH_SCRIPT}"
+
+# Enroot the image if it is not already enrooted.
+export ENROOT_CONFIG_PATH=${HOME}/.config/enroot
+ORIG_IMAGE=nvcr.io#nvidia/nemo:25.11
+SQSH_IMAGE_PATH=${RECIPE_DIR}/sqsh/nvcr.io_nvidia_nemo:25.11
+if [[ !
-f "${SQSH_IMAGE_PATH}" ]]; then + mkdir -p "$(dirname "${SQSH_IMAGE_PATH}")" + echo "enrooting $ORIG_IMAGE to ${SQSH_IMAGE_PATH}" + enroot import --output "${SQSH_IMAGE_PATH}" -- "docker://${ORIG_IMAGE}" +fi + +# get the master node +master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1) +master_port=29500 + +ARTIFACT_DIR_HOME="/home/$USER/job_artifacts/${SLURM_JOB_ID}" +mkdir -p "$ARTIFACT_DIR_HOME" + +export NNODES=$SLURM_NNODES +export MASTER_ADDR=$master_addr +export MASTER_PORT=$master_port +export ARTIFACT_DIR=/artifacts +export JOB_NAME=megatron-pretrain-qwen3-235b-a22b +export JOB_IDENTIFIER=megatron-pretrain-qwen3-235b-a22b +export CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH=/recipe/custom_setup_experiment.py + + +export PMIX_MCA_gds="^ds12" +export GLOO_SOCKET_IFNAME=enp0s3 + +srun --container-image="$SQSH_IMAGE_PATH" \ + --container-mounts="${RECIPE_DIR}:/recipe:mkdir,${ARTIFACT_DIR_HOME}:${ARTIFACT_DIR}:mkdir,/usr/local/gib:/usr/local/gib" \ + --container-workdir=/recipe \ + --container-writable \ + bash -c 'export JOB_COMPLETION_INDEX=$SLURM_NODEID; ./launch_script.sh' diff --git a/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/32node-BF16-GBS2048/recipe/README.md b/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/32node-BF16-GBS2048/recipe/README.md new file mode 100644 index 00000000..e5f17662 --- /dev/null +++ b/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/32node-BF16-GBS2048/recipe/README.md @@ -0,0 +1,95 @@ + +# Pretrain Qwen 3 235B A22B workloads on A4X Slurm Cluster with Nvidia Megatron-Bridge + +This recipe outlines the steps for running a Qwen 3 235B A22B pretraining workload on [Google Cloud A4X Slurm clusters](https://docs.cloud.google.com/ai-hypercomputer/docs/create/create-slurm-cluster) by using [NVIDIA Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge). 
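Both variants of this recipe log a `VERSION_DIAGNOSTICS` line by grepping version fields out of `nvidia-smi -q -x` XML in `launch_script.sh`. A self-contained sketch of that extraction, run here against an inline sample instead of a live GPU (the version numbers below are made up for illustration):

```shell
# Extract version fields from nvidia-smi-style XML with a PCRE
# lookbehind/lookahead, mirroring the version-diagnostics step in
# launch_script.sh. The $xml sample is hypothetical; on a real node
# pipe `nvidia-smi -q -x` into grep instead.
xml='<driver_version>570.86.10</driver_version><cuda_version>12.8</cuda_version>'
cuda_v=$(grep -Po '(?<=<cuda_version>).*(?=</cuda_version>)' <<< "$xml")
driver_v=$(grep -Po '(?<=<driver_version>).*(?=</driver_version>)' <<< "$xml")
echo "{\"cuda_version\": \"${cuda_v}\", \"driver_version\": \"${driver_v}\"}"
```

Note that `grep -P` (PCRE) is a GNU grep extension; the NeMo container images used here ship GNU grep, so the pattern works as-is inside the job.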
+
+## Orchestration and deployment tools
+
+For this recipe, the following setup is used:
+
+- Orchestration - [Slurm Workload Manager](https://slurm.schedmd.com/)
+- Deployment - [Cluster Toolkit](https://cloud.google.com/cluster-toolkit/docs/overview)
+
+## Test environment
+
+This recipe has been optimized for and tested with the following configuration:
+
+- A4X Slurm Cluster (32 nodes, 128 GPUs)
+- Machine Type: A4X (GB200)
+- Lustre Filesystem
+
+Please follow the instructions in the [Cluster Toolkit A4X Example README](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/machine-learning) to provision an A4X Slurm cluster.
+
+## Docker container image
+
+This recipe uses the following container image:
+
+- `nvcr.io/nvidia/nemo:25.11`
+
+## Run the recipe
+
+### Configure environment settings
+
+Set the environment variables to match your environment:
+
+  ```bash
+  export PROJECT_ID=<PROJECT_ID>
+  export CLUSTER_REGION=<CLUSTER_REGION>
+  export CLUSTER_NAME=<CLUSTER_NAME>
+  gcloud compute ssh $CLUSTER_NAME --project $PROJECT_ID --zone $CLUSTER_REGION -- -o Hostname=nic0.$CLUSTER_NAME.$CLUSTER_REGION.c.$PROJECT_ID.internal.gcpnode.com
+  ```
+
+Replace the following values:
+
+  - `<PROJECT_ID>`: your Google Cloud project ID.
+  - `<CLUSTER_REGION>`: the region where your cluster is located.
+  - `<CLUSTER_NAME>`: the name of your Slurm cluster.
+
+Set the default project:
+
+  ```bash
+  gcloud config set project $PROJECT_ID
+  ```
+
+From your cluster login node, complete the following steps:
+
+### Get the recipe
+
+Clone the `gpu-recipes` repository and set a reference to the recipe folder.
+
+```
+git clone https://github.com/ai-hypercomputer/gpu-recipes.git
+cd gpu-recipes
+export REPO_ROOT=`git rev-parse --show-toplevel`
+export RECIPE_ROOT=$REPO_ROOT/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/32node-BF16-GBS2048/recipe
+cd $RECIPE_ROOT
+```
+
+### Submit a pretraining job
+
+**Note:** Before running the recipe, replace `<HF_TOKEN>` with your actual Hugging Face token in `launch_script.sh`.
+
+```
+cd ..
+sbatch ./recipe/sbatch_script.sh
+```
+
+### Monitor the job
+
+To check the status of your jobs in the queue, run the following command:
+
+```
+squeue --me
+```
+
+To get the logs for the job, run the following command (sbatch writes them to the default `slurm-<job_id>.out` file):
+
+```
+tail -f slurm-<job_id>.out
+```
+
+### Cancel the job
+
+```bash
+scancel -u $USER
+```
diff --git a/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/32node-BF16-GBS2048/recipe/custom_setup_experiment.py b/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/32node-BF16-GBS2048/recipe/custom_setup_experiment.py
new file mode 100644
index 00000000..32173cbc
--- /dev/null
+++ b/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/32node-BF16-GBS2048/recipe/custom_setup_experiment.py
@@ -0,0 +1,233 @@
+import glob
+import logging
+import os
+from pathlib import Path
+import sys
+import time
+from typing import Any, Dict, List, Optional
+
+import nemo_run as run
+from nemo_run.config import get_nemorun_home
+
+
+try:
+    from argument_parser import parse_cli_args
+    from utils.evaluate import calc_convergence_and_performance
+    from utils.executors import dgxc_executor, slurm_executor
+except (ImportError, ModuleNotFoundError):
+    from .argument_parser import parse_cli_args
+    from .utils.evaluate import calc_convergence_and_performance
+    from .utils.executors import dgxc_executor, slurm_executor
+
+try:
+    import wandb
+
+    HAVE_WANDB = True
+except (ImportError, ModuleNotFoundError):
+    HAVE_WANDB = False
+
+try:
+    from perf_plugins import NsysPlugin, PerfEnvPlugin
+    from resiliency_plugins import FaultTolerancePlugin
+except (ImportError, ModuleNotFoundError):
+    from .perf_plugins import NsysPlugin, PerfEnvPlugin
+    from .resiliency_plugins import FaultTolerancePlugin
+
+
+SCRIPT_DIR = Path(__file__).parent.resolve()
+ENTRYPOINT_PEFORMANCE = "run_script.py"
+ENTRYPOINT_RECIPE = "run_recipe.py"
+
+logging.basicConfig(level=logging.DEBUG)
+logger = logging.getLogger(__name__)
+
+
+def main(
use_recipes: bool, + model_family_name: str, + model_recipe_name: str, + task: str, + compute_dtype: str, + gpu: str, + hf_token: str, + detach: bool, + dryrun: bool, + enable_vboost: bool, + enable_nsys: bool, + moe_a2a_overlap: bool, + tp_size: Optional[int], + pp_size: Optional[int], + cp_size: Optional[int], + wandb_key: str, + wandb_project_name: str, + wandb_experiment_name: str, + wandb_entity_name: str, + profiling_start_step: int, + profiling_stop_step: int, + profiling_gpu_metrics: bool, + profiling_ranks: Optional[List[int]], + nemo_home: str, + account: str, + partition: str, + log_dir: str, + gpus_per_node: int, + time_limit: str, + container_image: str, + custom_mounts: List[str], + custom_env_vars: List[str], + custom_srun_args: List[str], + pretrained_checkpoint: Optional[str], + num_gpus: int, + is_long_convergence_run: bool, + additional_slurm_params: Optional[Dict[str, Any]], + golden_values_path: str, + convergence_params: Dict[str, Any], + performance_params: Dict[str, Any], + max_retries: int, + dgxc_base_url: str, + dgxc_cluster: str, + dgxc_kube_apiserver_url: str, + dgxc_app_id: str, + dgxc_app_secret: str, + dgxc_project_name: str, + dgxc_pvc_claim_name: str, + dgxc_pvc_mount_path: str, +): + logger.info("Hello World") + + rank = os.environ['RANK'] + + exp_name = f"{model_recipe_name}_{model_family_name}" + exp_name += f'_worker{rank}' + if use_recipes: + script_name = ENTRYPOINT_RECIPE + + else: + script_name = ENTRYPOINT_PEFORMANCE + + run_script_path = SCRIPT_DIR / script_name + logger.info(f"Run script path: {run_script_path}") + if not run_script_path.is_file(): + logger.error(f"Specified run script not found: {run_script_path}") + sys.exit(1) + + nemorun_script = run.Script( + path=str(run_script_path), + entrypoint="python", + env={"PYTHONPATH": f"{SCRIPT_DIR}:$PYTHONPATH"}, + args=list(sys.argv[1:]), + ) + + plugins = [] + + if not use_recipes: + plugins.append( + PerfEnvPlugin( + enable_vboost=enable_vboost, + 
moe_a2a_overlap=moe_a2a_overlap, + tp_size=tp_size, + pp_size=pp_size, + cp_size=cp_size, + model_family_name=model_family_name, + model_recipe_name=model_recipe_name, + gpu=gpu, + compute_dtype=compute_dtype, + train_task=task, + ) + ) + + if enable_nsys: + plugins.append( + NsysPlugin( + profile_step_start=profiling_start_step, + profile_step_end=profiling_stop_step, + nsys_gpu_metrics=profiling_gpu_metrics, + profile_ranks=profiling_ranks, + ) + ) + + executor = run.LocalExecutor() + run.run( + nemorun_script, + executor=executor, + plugins=plugins, + dryrun=False, + detach=False, + name=exp_name, + ) + + +if __name__ == "__main__": + parser = parse_cli_args() + args, unknown_args = parser.parse_known_args() + + # probably better to use parser.parse_args() and make unknowns an error, + # but for now we'll just issue a warning. + if unknown_args: + logger.warning(f"Ignoring unrecognized arguments: {' '.join(unknown_args)}") + + main( + use_recipes=args.use_recipes, + model_family_name=args.model_family_name, + model_recipe_name=args.model_recipe_name, + task=args.task, + compute_dtype=args.compute_dtype, + gpu=args.gpu, + hf_token=args.hf_token, + detach=args.detach, + dryrun=args.dryrun, + enable_vboost=args.enable_vboost, + enable_nsys=args.enable_nsys, + moe_a2a_overlap=args.moe_a2a_overlap, + tp_size=args.tensor_model_parallel_size, + pp_size=args.pipeline_model_parallel_size, + cp_size=args.context_parallel_size, + wandb_key=args.wandb_key, + wandb_project_name=args.wandb_project_name, + wandb_experiment_name=args.wandb_experiment_name, + wandb_entity_name=args.wandb_entity_name, + profiling_start_step=args.profiling_start_step, + profiling_stop_step=args.profiling_stop_step, + profiling_gpu_metrics=args.profiling_gpu_metrics, + profiling_ranks=args.profiling_ranks, + nemo_home=args.nemo_home, + account=args.account, + partition=args.partition, + log_dir=args.log_dir, + gpus_per_node=args.gpus_per_node, + time_limit=args.time_limit, + 
container_image=args.container_image, + custom_mounts=args.custom_mounts, + custom_env_vars=args.custom_env_vars, + custom_srun_args=args.custom_srun_args, + pretrained_checkpoint=args.pretrained_checkpoint, + num_gpus=args.num_gpus, + is_long_convergence_run=args.is_long_convergence_run, + additional_slurm_params=args.additional_slurm_params, + golden_values_path=args.golden_values_path, + convergence_params={ + "correlation_threshold": args.correlation_threshold, + "high_loss_tolerance": args.high_loss_tolerance, + "medium_loss_tolerance": args.medium_loss_tolerance, + "low_loss_tolerance": args.low_loss_tolerance, + "final_loss_tolerance": args.final_loss_tolerance, + "max_outlier_ratio": args.max_outlier_ratio, + "outlier_threshold": args.outlier_threshold, + "skip_first_percent_loss": args.skip_first_percent_loss, + }, + performance_params={ + "timing_threshold": args.timing_threshold, + "skip_first_percent_time": args.skip_first_percent_time, + }, + max_retries=args.max_retries, + dgxc_base_url=args.dgxc_base_url, + dgxc_cluster=args.dgxc_cluster, + dgxc_kube_apiserver_url=args.dgxc_kube_apiserver_url, + dgxc_app_id=args.dgxc_app_id, + dgxc_app_secret=args.dgxc_app_secret, + dgxc_project_name=args.dgxc_project_name, + dgxc_pvc_claim_name=args.dgxc_pvc_claim_name, + dgxc_pvc_mount_path=args.dgxc_pvc_mount_path, + ) diff --git a/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/32node-BF16-GBS2048/recipe/launch_script.sh b/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/32node-BF16-GBS2048/recipe/launch_script.sh new file mode 100644 index 00000000..c0c9f7aa --- /dev/null +++ b/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/32node-BF16-GBS2048/recipe/launch_script.sh @@ -0,0 +1,151 @@ +usage() +{ +cat << EOF +usage: bash ./launcher.sh [config-override [config-override ...]] +config-override (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000. 
+EOF +} + +parse_args() { + while [[ "$1" != "" ]]; do + case $(grep -o "=" <<< "$1" | wc -l) in + 1 ) + config_overrides+=("$1") + ;; + * ) + echo "Invalid config override: $1" + usage + exit 1 + esac + shift + done + config_overrides="${config_overrides[*]}" +} + +config_overrides=() +parse_args "$@" + +if [[ -z "${config_overrides[*]}" ]]; then + echo "No NeMo config overrides specified" +else + echo "NeMo config overrides:" + echo " ${config_overrides}" +fi + +export LD_LIBRARY_PATH="/usr/local/cuda/compat/lib:$NCCL_PLUGIN_PATH:$LD_LIBRARY_PATH" +ldconfig "$LD_LIBRARY_PATH" +echo "Added $LD_LIBRARY_PATH to ldconfig:" +ldconfig -p | grep libcuda | sed 's/^/ /' +echo "" + +if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then + explicit_log_dir=${EXPLICIT_LOG_DIR} +else + explicit_log_dir=workload_logs +fi +echo "Logging to ${explicit_log_dir}" + +if [[ -n "${TOKENIZER_PATH}" ]]; then + echo "Getting tokenizer files" + cp "${TOKENIZER_PATH}"/* . + echo "" +fi + +echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes" + +pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger + +# Create the nsys directory. 
+mkdir -p "${explicit_log_dir}/nsys"
+
+# Collect diagnostics to a single line
+kv="\"kernel_version\": \"$(uname --kernel-release)\""
+if command -v nvidia-smi &> /dev/null; then
+  cuda_v=$(nvidia-smi -q -x | grep -Po '(?<=<cuda_version>).*(?=</cuda_version>)' || true)
+  driver_v=$(nvidia-smi -q -x | grep -Po '(?<=<driver_version>).*(?=</driver_version>)' || true)
+  vbios_v=$(nvidia-smi -q -x | grep -Po '(?<=<vbios_version>).*(?=</vbios_version>)' | head -n1 || true)
+  kv="${kv}, \"cuda_version\": \"${cuda_v}\""
+  kv="${kv}, \"driver_version\": \"${driver_v}\""
+  kv="${kv}, \"vbios_version\": \"${vbios_v}\""
+fi
+echo "VERSION_DIAGNOSTICS: {${kv}}"
+
+
+export HF_TOKEN="<HF_TOKEN>"
+
+cd /opt
+rm -rf Megatron-Bridge
+git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
+cd Megatron-Bridge
+git checkout 7695d4acbfac19353d20e456509117efe4733d6b
+sed -i -e '/pretrain(config=recipe/i \ recipe.dist.distributed_timeout_minutes = 10' scripts/performance/run_script.py
+ls
+
+cp $CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH scripts/performance/
+
+worker_command=$(cat <<- EOM
+  if [ "\$RANK" -eq "0" ]; then
+    echo "Worker 0 is stalling for a few seconds.."
 ;
+    sleep 3 ;
+    echo "The detected environment within worker rank 0 is:" ;
+    env | sed 's/^/ /' ;
+  fi ;
+
+  cd /opt/Megatron-Bridge ;
+
+  numactl \
+    --cpunodebind=\$((LOCAL_RANK/2)) \
+    --membind=\$((LOCAL_RANK/2)) nsys profile \
+    -t nvtx,cuda \
+    --cuda-event-trace=false \
+    --sample=none \
+    --capture-range=cudaProfilerApi \
+    --capture-range-end=stop \
+    --kill none \
+    -o "/${explicit_log_dir}/$JOB_IDENTIFIER/rank-\$RANK" \
+    --force-overwrite true \
+    --session-new "nsys-\$RANDOM-\$RANK" \
+    nice -10 \
+    python scripts/performance/custom_setup_experiment.py \
+      --gpu gb200 \
+      --model_family_name qwen \
+      --model_recipe_name qwen3_235b_a22b \
+      --gpus_per_node 4 \
+      --num_gpus 128 \
+      --seq_length 4096 \
+      --compute_dtype bf16 \
+      --global_batch_size 2048 \
+      --tensor_model_parallel_size 1 \
+      --pipeline_model_parallel_size 8 \
+      --context_parallel_size 1 \
+      --expert_model_parallel_size 16 \
+      --virtual_pipeline_model_parallel_size 3 \
+      --micro_batch_size 1 \
+      --cuda_graph_impl transformer_engine \
+      --cuda_graph_scope moe_router,moe_preprocess,attn \
+      --max_steps 30
+
EOM
)
+
+echo "$worker_command" > worker_command.sh
+chmod 777 worker_command.sh
+
+torchrun \
+--nproc-per-node="4" \
+--nnodes="32" \
+--node_rank="${JOB_COMPLETION_INDEX}" \
+--rdzv_id="${JOB_IDENTIFIER}" \
+--master_addr="${MASTER_ADDR}" \
+--master_port="${MASTER_PORT}" \
+--no-python bash worker_command.sh
+
+
+if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
+  mkdir -p "${ARTIFACT_DIR}"
+  cp -r "${explicit_log_dir}"/* "${ARTIFACT_DIR}/"
+  env > "${ARTIFACT_DIR}/environ.txt"
+  ls "${ARTIFACT_DIR}"
+fi
+echo "Training completed"
+echo "Pod on $(hostname --fqdn) is exiting"
\ No newline at end of file
diff --git a/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/32node-BF16-GBS2048/recipe/sbatch_script.sh b/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/32node-BF16-GBS2048/recipe/sbatch_script.sh
new file mode 100644
index 00000000..021e9744
--- /dev/null
+++ b/training/a4x/qwen3-235b-a22b/megatron-bridge-pretraining-slurm/32node-BF16-GBS2048/recipe/sbatch_script.sh
@@ -0,0 +1,57 @@
+#!/bin/bash
+#SBATCH --job-name=agent-qwen3-235-ubench-gsen
+#SBATCH --nodes=32
+#SBATCH --ntasks-per-node=1
+#SBATCH --gres=gpu:4
+#SBATCH --mem=0
+#SBATCH --segment=16
+
+# Exit early on failures
+set -e
+
+# Validate that the recipe location is set up correctly.
+# The recipe is expected to be in a "recipe" folder inside the current working directory.
+RECIPE_DIR="$(pwd)/recipe"
+LAUNCH_SCRIPT="${RECIPE_DIR}/launch_script.sh"
+if [[ ! -f "${LAUNCH_SCRIPT}" ]]; then
+  echo "Error: Recipe is not located correctly. The recipe is expected to be in a 'recipe' folder inside the current working directory. We could not find the launch script there." >&2
+  exit 1
+fi
+chmod +x "${LAUNCH_SCRIPT}"
+
+# Enroot the image if it is not already enrooted.
+export ENROOT_CONFIG_PATH=${HOME}/.config/enroot
+ORIG_IMAGE=nvcr.io#nvidia/nemo:25.11
+SQSH_IMAGE_PATH=${RECIPE_DIR}/sqsh/nvcr.io_nvidia_nemo:25.11
+if [[ !
-f "${SQSH_IMAGE_PATH}" ]]; then
+  mkdir -p "$(dirname "${SQSH_IMAGE_PATH}")"
+  echo "enrooting $ORIG_IMAGE to ${SQSH_IMAGE_PATH}"
+  enroot import --output "${SQSH_IMAGE_PATH}" -- "docker://${ORIG_IMAGE}"
+fi
+
+# Get the master node.
+master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
+master_port=29500
+
+ARTIFACT_DIR_HOME="/home/$USER/job_artifacts/${SLURM_JOB_ID}"
+mkdir -p "$ARTIFACT_DIR_HOME"
+
+export NNODES=$SLURM_NNODES
+export MASTER_ADDR=$master_addr
+export MASTER_PORT=$master_port
+export ARTIFACT_DIR=/artifacts
+export JOB_NAME=agent-qwen3-235-ubench-gsen
+export JOB_IDENTIFIER=agent-qwen3-235-ubench-gsen
+export CUSTOM_SETUP_EXPERIMENT_SCRIPT_PATH=/recipe/custom_setup_experiment.py
+
+
+export PMIX_MCA_gds="^ds12"
+export GLOO_SOCKET_IFNAME=enp0s3
+
+srun --container-image="$SQSH_IMAGE_PATH" \
+    --container-mounts="${RECIPE_DIR}:/recipe:mkdir,${ARTIFACT_DIR_HOME}:${ARTIFACT_DIR}:mkdir,/usr/local/gib:/usr/local/gib" \
+    --container-workdir=/recipe \
+    --container-writable \
+    --error="error_log_%N.err" \
+    --label \
+    bash -c 'export JOB_COMPLETION_INDEX=$SLURM_NODEID; ./launch_script.sh'
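
The launch script above pins each worker to a NUMA node via `numactl --cpunodebind=$((LOCAL_RANK/2)) --membind=$((LOCAL_RANK/2))`. A minimal sketch of that mapping, assuming (as the recipe's `--gpus_per_node 4` flag implies) four local ranks per node spread across two NUMA domains:

```shell
# Illustrative only: reproduce the rank-to-NUMA-node mapping used by the
# launch script. Integer division by 2 sends ranks 0-1 to NUMA node 0 and
# ranks 2-3 to NUMA node 1.
for LOCAL_RANK in 0 1 2 3; do
  echo "local rank ${LOCAL_RANK} -> NUMA node $((LOCAL_RANK/2))"
done
```

This keeps each pair of GPU processes on the CPU socket and memory closest to its GPUs, which matters for host-side data staging during profiled training steps.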