diff --git a/README.md b/README.md index a45289d..05f4a09 100644 --- a/README.md +++ b/README.md @@ -42,6 +42,13 @@ Documentation is available at https://kiro.dev/docs/powers/ --- +### sagemaker-ai +**Amazon SageMaker AI** - Deploy and train ML models on Amazon SageMaker AI — inference endpoints, LLM fine-tuning, HyperPod clusters, Model Monitor, AutoML with AutoGluon, and SageMaker Python SDK v3 patterns. + +**MCP Servers:** awslabs.sagemaker-ai-mcp-server + +--- + ### cloud-architect **Build infrastructure on AWS** - Build AWS infrastructure with CDK in Python following AWS Well-Architected framework best practices. diff --git a/sagemaker-ai/POWER.md b/sagemaker-ai/POWER.md new file mode 100644 index 0000000..4594f1b --- /dev/null +++ b/sagemaker-ai/POWER.md @@ -0,0 +1,170 @@ +--- +name: "sagemaker-ai" +displayName: "Amazon SageMaker AI" +description: "Deploy and train ML models on Amazon SageMaker AI — inference endpoints, LLM fine-tuning, HyperPod clusters, Model Monitor, AutoML with AutoGluon, and SageMaker Python SDK v3 patterns." +keywords: ["sagemaker", "inference", "training", "hyperpod", "model-monitor", "autogluon", "sdk-v3"] +author: "AWS" +--- + +# When to use this power + +When you want to deploy ML models to SageMaker inference endpoints, fine-tune or train models (serverless, Training Jobs, or HyperPod), set up model monitoring, run AutoML with AutoGluon, or write correct SageMaker Python SDK v3 code. 
+ +# When to Load Steering Files + +When a task matches any of the following scenarios, load and read the corresponding steering file before responding: + +- Deploying models to SageMaker real-time endpoints (container selection, DJL LMI/vLLM, DLC configuration, multimodal models) -> use `./steering/inference-endpoints.md` +- Fine-tuning or training models (serverless customization, QLoRA/LoRA, GPU vs Trainium, instance sizing) -> use `./steering/training-jobs.md` +- Setting up or managing HyperPod clusters for training (EKS orchestration, Training Operator, Task Governance, resiliency) -> use `./steering/hyperpod.md`; also use MCP tools `manage_hyperpod_stacks` and `manage_hyperpod_cluster_nodes` for cluster operations +- Deploying inference on HyperPod clusters (JumpStart models, custom models from S3/FSx, kubectl CRDs, autoscaling) -> use `./steering/hyperpod-inference.md` +- Setting up model monitoring (Data Quality, Model Quality, Bias, Explainability, baselines, schedules) -> use `./steering/model-monitor.md` +- Using AutoGluon for AutoML (tabular, time series, multimodal, SageMaker Pipelines) -> use `./steering/automl-autogluon.md` +- Writing SageMaker SDK v3 code (correct imports, image_uris, deployment patterns, invocation) -> use `./steering/sdk-v3-reference.md` + +# Available MCP Tools + +This power integrates with the [Amazon SageMaker AI MCP Server](https://github.com/awslabs/mcp/tree/main/src/sagemaker-ai-mcp-server) (Apache-2.0 license) for HyperPod cluster management. + +## manage_hyperpod_stacks + +Orchestrates HyperPod cluster infrastructure via CloudFormation. 
+ +- `operation="describe"` — Describe existing HyperPod CloudFormation stacks (read-only) +- `operation="deploy"` — Deploy a new HyperPod cluster via managed CloudFormation templates (requires `--allow-write`) +- `operation="delete"` — Delete a HyperPod CloudFormation stack and its resources (requires `--allow-write`) + +Parameters: `operation`, `stack_name`, `region_name`, `profile_name`, `params_file` (for deploy) + +**Safety**: Only modifies/deletes stacks originally created by this tool. + +## manage_hyperpod_cluster_nodes + +Manages cluster nodes within HyperPod deployments. + +- `operation="list_clusters"` — List HyperPod clusters with filtering by name, creation time, training plan ARN +- `operation="list_nodes"` — List nodes in a cluster with filtering by instance group +- `operation="describe_node"` — Get detailed info about a specific node +- `operation="update_software"` — Update AMIs for all nodes or specific instance groups (requires `--allow-write`) +- `operation="batch_delete"` — Delete multiple nodes in a single operation (requires `--allow-write`) + +Parameters: `operation`, `cluster_name`, `node_id` (for describe_node), `node_ids` (for batch_delete) + +# Onboarding + +1. **Ensure the user has valid AWS Credentials** These are used to interact with SageMaker and related AWS services. +2. **Verify AWS CLI access** Using `aws sts get-caller-identity` +3. **Check Python and SDK version** SageMaker Python SDK v3 requires Python <= 3.13. Install with `pip install 'sagemaker>=3'` +4. **Execution role** In SageMaker Studio, always use `get_execution_role()`. Outside Studio, ensure an IAM role with SageMaker permissions is available. +5. **MCP server dependencies** The SageMaker AI MCP server requires [`uv`](https://docs.astral.sh/uv/getting-started/installation/) (for `uvx` command) + +# SDK v3 Guardrails (Always-On) + +These rules apply to EVERY piece of SageMaker code. Check each one before generating output. 
+ +| Rule | CORRECT | WRONG | +|------|---------|-------| +| Model class import | `from sagemaker.core.resources import Model` | `from sagemaker.model import Model` | +| Session import | `from sagemaker.core.helper.session_helper import Session, get_execution_role` | `from sagemaker import get_execution_role` | +| Deployment (LLMs) | Core API: `Model.create` + `EndpointConfig.create` + `Endpoint.create` | `ModelBuilder` with DJL/vLLM containers | +| Deployment (simple) | `ModelBuilder` + `SchemaBuilder` | V2 `Model.deploy()` | +| JumpStart deploy | `ModelBuilder(model="")` | `JumpStartModel` (removed in V3) | +| Processing imports | `from sagemaker.core.processing import ...` | `from sagemaker.processing import ...` | +| Transformer import | `from sagemaker.core.transformer import Transformer` | `from sagemaker.transformer import Transformer` | +| Pipeline imports | `sagemaker.mlops.workflow.*` + `sagemaker.core.workflow.*` | `sagemaker.workflow.*` (V2, removed) | +| Python version | <= 3.13 | 3.14+ (SDK v3 incompatible) | +| DLC images source | `https://aws.github.io/deep-learning-containers/reference/available_images/` | HuggingFace docs (outdated) | + +# Quick Reference + +## SDK v3 Core API (Recommended for DJL/vLLM) + +```python +from sagemaker.core.resources import Model, EndpointConfig, Endpoint +from sagemaker.core.shapes.shapes import ContainerDefinition, ProductionVariant +``` + +## SDK v3 ModelBuilder (For JumpStart / simple HF models) + +```python +from sagemaker.serve import ModelBuilder +from sagemaker.serve.builder.schema_builder import SchemaBuilder +``` + +## Container Decision Tree + +1. Standard text LLM (Llama, Mistral, Qwen) -> DJL LMI with vLLM backend +2. Multimodal/Vision model (Idefics3, LLaVA, Qwen-VL) -> DJL LMI with vLLM backend +3. Simple HF pipeline model (classification, NER) -> HuggingFace Inference DLC +4. 
Custom model with custom handler -> HuggingFace Inference DLC + custom inference.py in model.tar.gz + +## CUDA Compatibility + +| Instance Family | GPU | Max CUDA | Recommended DLC | +|---|---|---|---| +| ml.g5.* | A10G (24GB) | cu128 | `djl-inference:0.36.0-lmi20.0.0-cu128` | +| ml.g6.* | L4 (24GB) | cu129 | `djl-inference:0.36.0-lmi22.0.0-cu129` | +| ml.p5.* | H100 (80GB) | cu129 | `djl-inference:0.36.0-lmi22.0.0-cu129` | + +Note: cu129 fails on g5 due to driver version mismatch, not CUDA compute capability. Always check the DLC images page for the latest versions — the versions above are examples. + +## Reference Repositories + +- Inference hosting examples (Llama, Mistral, Mixtral, Falcon, CodeLlama): `https://github.com/aws-samples/sagemaker-genai-hosting-examples` +- Model customization recipes (QLoRA, Spectrum, Full FT, DPO, GRPO): `https://github.com/aws-samples/amazon-sagemaker-generativeai/tree/main/0_model_customization_recipes` + +## Latest DLC Images + +Always check the latest images at: +`https://aws.github.io/deep-learning-containers/reference/available_images/` + +Do NOT rely on the HuggingFace docs page — it is outdated. 
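The container decision tree and the CUDA compatibility table in this Quick Reference can be read as two lookups: model kind selects the container family, and instance family selects an example DJL tag. The sketch below is one possible encoding, not part of the SDK. The function names and the model-kind labels (`text-llm`, `multimodal`, `hf-pipeline`, `custom-handler`) are hypothetical, and the tags are the example values from the table, so they should be re-checked against the DLC images page before use.

```python
# Example tags copied from the CUDA Compatibility table. They go stale,
# so verify them against the DLC images page before relying on them.
RECOMMENDED_DJL_TAG = {
    "ml.g5": "djl-inference:0.36.0-lmi20.0.0-cu128",
    "ml.g6": "djl-inference:0.36.0-lmi22.0.0-cu129",
    "ml.p5": "djl-inference:0.36.0-lmi22.0.0-cu129",
}

# Hypothetical labels for the four branches of the container decision tree.
CONTAINER_BY_MODEL_KIND = {
    "text-llm": "DJL LMI (vLLM backend)",
    "multimodal": "DJL LMI (vLLM backend)",
    "hf-pipeline": "HuggingFace Inference DLC",
    "custom-handler": "HuggingFace Inference DLC + custom inference.py",
}

# Only these branches use a DJL image, so only they get a DJL tag.
DJL_KINDS = {"text-llm", "multimodal"}


def instance_family(instance_type: str) -> str:
    """Return the family prefix, e.g. 'ml.g5' for 'ml.g5.2xlarge'."""
    parts = instance_type.split(".")
    if len(parts) != 3 or parts[0] != "ml":
        raise ValueError(f"Unexpected instance type: {instance_type!r}")
    return f"{parts[0]}.{parts[1]}"


def choose_container(model_kind: str, instance_type: str) -> dict:
    """Map a model kind and instance type to a container and example tag."""
    if model_kind not in CONTAINER_BY_MODEL_KIND:
        raise ValueError(f"Unknown model kind: {model_kind!r}")
    tag = None
    if model_kind in DJL_KINDS:
        # None means the family is not in the table; look it up manually.
        tag = RECOMMENDED_DJL_TAG.get(instance_family(instance_type))
    return {"container": CONTAINER_BY_MODEL_KIND[model_kind], "djl_tag": tag}


choice = choose_container("text-llm", "ml.g5.2xlarge")
print(choice)
```

Returning `None` for an instance family outside the table, rather than guessing a tag, keeps the helper from silently picking a CUDA build that the driver on that instance may not support.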
+ +# Best Practices + +- Always use `get_execution_role()` in SageMaker Studio — don't hardcode role ARNs +- Set `ContainerStartupHealthCheckTimeoutInSeconds=900` for models that download from HuggingFace Hub at startup +- Clean up endpoints after testing — they cost money even when idle +- Use `dependencies={"auto": False}` with ModelBuilder to avoid capturing local packages that break the DLC +- For first invocation after deploy, add retry logic or a warm-up wait — models load lazily on first request + +# Troubleshooting + +## SDK v3 Import Errors + +**Error**: `ImportError: cannot import name 'Model' from 'sagemaker.model'` +**Cause**: Using SDK v2 import paths +**Solution**: Use `from sagemaker.core.resources import Model` — see SDK v3 Guardrails table above + +## Endpoint Creation Timeout + +**Error**: Endpoint stays in `Creating` status and eventually fails +**Cause**: Model too large for instance, or container startup timeout too low +**Solution**: +1. Check model memory requirements against instance GPU memory +2. Set `ContainerStartupHealthCheckTimeoutInSeconds=900` (or higher for very large models) +3. Verify CUDA compatibility — see CUDA Compatibility table + +## ModelBuilder Issues + +**Error**: `ModelBuilder.build()` fails with dependency errors +**Cause**: Local packages captured by auto-dependency detection +**Solution**: Use `dependencies={"auto": False}` and manage dependencies explicitly + +# License + +``` +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +``` diff --git a/sagemaker-ai/mcp.json b/sagemaker-ai/mcp.json new file mode 100644 index 0000000..8ee2fda --- /dev/null +++ b/sagemaker-ai/mcp.json @@ -0,0 +1,17 @@ +{ + "mcpServers": { + "awslabs.sagemaker-ai-mcp-server": { + "command": "uvx", + "args": [ + "awslabs.sagemaker-ai-mcp-server@latest", + "--allow-write", + "--allow-sensitive-data-access" + ], + "env": { + "FASTMCP_LOG_LEVEL": "ERROR" + }, + "autoApprove": [], + "disabled": false + } + } +} diff --git a/sagemaker-ai/steering/automl-autogluon.md b/sagemaker-ai/steering/automl-autogluon.md new file mode 100644 index 0000000..fb2dce5 --- /dev/null +++ b/sagemaker-ai/steering/automl-autogluon.md @@ -0,0 +1,343 @@ +# AutoML with AutoGluon on SageMaker AI + +## Overview + +AutoGluon is an open-source AutoML framework that automates model selection, hyperparameter tuning, and ensembling. On SageMaker, it runs via the AutoGluon DLC (Deep Learning Container) with SDK v3 `ModelTrainer`. 
+ +Supports three task types: +- Tabular classification/regression +- Time series forecasting +- Multimodal (text + tabular fusion) + +## DLC Image Retrieval + +```python +from sagemaker.core import image_uris + +image_uri = image_uris.retrieve( + "autogluon", + region=region, + version="1.5", # Check latest at DLC images page + py_version="py312", + image_scope="training", # or "inference" + instance_type="ml.m5.2xlarge", +) +``` + +## Training Pattern + +```python +from sagemaker.train import ModelTrainer +from sagemaker.core.training.configs import ( + Compute, SourceCode, OutputDataConfig, StoppingCondition, +) + +trainer = ModelTrainer( + training_image=image_uri, + role=role_arn, + source_code=SourceCode( + source_dir=".", + entry_script="train.py", + ), + compute=Compute( + instance_type="ml.m5.2xlarge", + instance_count=1, + volume_size_in_gb=100, + ), + output_data_config=OutputDataConfig( + s3_output_path=f"s3://{bucket}/output", + ), + base_job_name="autogluon-tabular", + hyperparameters={ + "config": "config.yaml", + "train-dir": "/opt/ml/input/data/train", + }, + stopping_condition=StoppingCondition( + max_runtime_in_seconds=7200, + ), +) + +trainer.train( + input_data_config=[{ + "channel_name": "train", + "data_source": { + "s3_data_source": { + "s3_uri": f"s3://{bucket}/data/train/", + "s3_data_type": "S3Prefix", + } + }, + }], + wait=True, + logs=True, +) +``` + + +## Training Scripts + +### Tabular Classification + +```python +"""train.py — AutoGluon tabular training on SageMaker""" +import os +import yaml +from autogluon.tabular import TabularPredictor +import pandas as pd + +train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train") +model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model") + +# Load config +with open("config.yaml") as f: + config = yaml.safe_load(f) + +# Load data +df = pd.read_csv(os.path.join(train_dir, "train.csv")) + +# Train +predictor = TabularPredictor( + label=config["label_column"], + path=model_dir, 
+ eval_metric=config.get("eval_metric", "roc_auc"), +).fit( + train_data=df, + time_limit=config.get("time_limit", 3600), + presets=config.get("presets", "best_quality"), +) + +# Save leaderboard +leaderboard = predictor.leaderboard(silent=True) +leaderboard.to_csv(os.path.join(model_dir, "leaderboard.csv"), index=False) +print(f"Best model: {predictor.model_best}") +``` + +### Time Series Forecasting + +```python +"""train.py — AutoGluon time series training on SageMaker""" +import os +from autogluon.timeseries import TimeSeriesPredictor, TimeSeriesDataFrame +import pandas as pd + +train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train") +model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model") + +df = pd.read_csv(os.path.join(train_dir, "train.csv"), parse_dates=["timestamp"]) +ts_df = TimeSeriesDataFrame.from_data_frame( + df, id_column="item_id", timestamp_column="timestamp", +) + +predictor = TimeSeriesPredictor( + target="target", + prediction_length=24, + path=model_dir, + eval_metric="MASE", +).fit( + train_data=ts_df, + time_limit=3600, + presets="best_quality", +) +``` + +### Multimodal (Text + Tabular) + +Requires GPU instance (`ml.g4dn.xlarge` or larger). + +```python +"""train.py — AutoGluon multimodal training on SageMaker""" +import os +from autogluon.multimodal import MultiModalPredictor +import pandas as pd + +train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train") +model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model") + +df = pd.read_csv(os.path.join(train_dir, "train.csv")) + +predictor = MultiModalPredictor( + label="label", + path=model_dir, + eval_metric="roc_auc", +).fit( + train_data=df, + time_limit=3600, + presets="best_quality", +) +``` + +## Inference (Real-Time Endpoint) + +AutoGluon DLC uses TorchServe and requires `code/inference.py` inside `model.tar.gz`. + +### Repackaging Model Artifacts + +Pipeline training outputs only contain model artifacts. 
You must repackage with the inference script: + +```python +import tarfile, os, shutil + +# Download original model.tar.gz from S3 +# Extract, add code/inference.py, repackage +os.makedirs("repackaged/code", exist_ok=True) +with tarfile.open("model.tar.gz", "r:gz") as tar: + tar.extractall("repackaged") +shutil.copy("serve.py", "repackaged/code/inference.py") +with tarfile.open("model-repackaged.tar.gz", "w:gz") as tar: + tar.add("repackaged", arcname=".") +# Upload repackaged model to S3 +``` + +### Inference Handler (serve.py → code/inference.py) + +```python +"""Inference handler for AutoGluon on SageMaker""" +import os, json, io +import pandas as pd + +def model_fn(model_dir): + from autogluon.tabular import TabularPredictor + return TabularPredictor.load(model_dir) + +def input_fn(request_body, content_type): + if content_type == "text/csv": + return pd.read_csv(io.StringIO(request_body)) + elif content_type == "application/json": + return pd.DataFrame(json.loads(request_body)) + raise ValueError(f"Unsupported content type: {content_type}") + +def predict_fn(data, model): + predictions = model.predict(data) + probabilities = model.predict_proba(data) + return {"predictions": predictions.tolist(), "probabilities": probabilities.values.tolist()} + +def output_fn(prediction, accept): + return json.dumps(prediction), "application/json" +``` + +### Deploy with SDK v3 Core API + +```python +from sagemaker.core.resources import Model, EndpointConfig, Endpoint +from sagemaker.core.shapes.shapes import ContainerDefinition, ProductionVariant +from sagemaker.core import image_uris + +inference_image = image_uris.retrieve( + "autogluon", region=region, version="1.5", + py_version="py312", image_scope="inference", + instance_type="ml.m5.xlarge", +) + +Model.create( + model_name="autogluon-model", + primary_container=ContainerDefinition( + image=inference_image, + model_data_url=f"s3://{bucket}/output/model-repackaged.tar.gz", + ), + execution_role_arn=role_arn, +) + 
+EndpointConfig.create( + endpoint_config_name="autogluon-endpoint", + production_variants=[ProductionVariant( + variant_name="AllTraffic", + model_name="autogluon-model", + initial_instance_count=1, + instance_type="ml.m5.xlarge", + initial_variant_weight=1.0, + )], +) + +Endpoint.create( + endpoint_name="autogluon-endpoint", + endpoint_config_name="autogluon-endpoint", +) +``` + +## SageMaker Pipelines with AutoGluon + +### Correct Imports (SDK v3) + +```python +from sagemaker.core.processing import ScriptProcessor, ProcessingInput, ProcessingOutput +from sagemaker.core.workflow.parameters import ParameterString +from sagemaker.core.workflow.pipeline_context import PipelineSession +from sagemaker.mlops.workflow.pipeline import Pipeline +from sagemaker.mlops.workflow.steps import ProcessingStep, TrainingStep +from sagemaker.train import ModelTrainer +``` + +### Pipeline Pattern + +```python +# Processing step +processor = ScriptProcessor( + image_uri=image_uri, role=role_arn, + instance_type="ml.m5.xlarge", instance_count=1, +) +processing_step = ProcessingStep( + name="Preprocess", + step_args=processor.run( + code="preprocess.py", + inputs=[ProcessingInput(source=raw_s3, destination="/opt/ml/processing/input")], + outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination=processed_s3)], + ), +) + +# Training step +training_step = TrainingStep( + name="Train", + step_args=trainer.train( + input_data_config=[{ + "channel_name": "train", + "data_source": {"s3_data_source": { + "s3_uri": processing_step.properties.ProcessingOutputConfig + .Outputs["output"].S3Output.S3Uri, + "s3_data_type": "S3Prefix", + }}, + }], + ), +) + +# Create pipeline +pipeline = Pipeline( + name="AutoGluonPipeline", + steps=[processing_step, training_step], + sagemaker_session=PipelineSession(), +) +pipeline.upsert(role_arn=role_arn) +``` + +## Troubleshooting + +### OOM during multimodal training +- Multimodal uses a text transformer — requires GPU instances +- Reduce 
`time_limit` or use `presets="medium_quality"` to limit ensemble size +- Use `ml.g4dn.xlarge` minimum, `ml.g5.2xlarge` for larger datasets + +### TorchServe startup failure on inference endpoint +- AutoGluon DLC uses TorchServe — requires `code/inference.py` inside `model.tar.gz` +- If you trained via Pipeline, the output model.tar.gz won't have the inference script — repackage it + +### model.tar.gz packaging errors +- The archive must have `code/inference.py` at the root level (not nested under extra directories) +- Use `tar.add("repackaged", arcname=".")` to avoid extra directory nesting + +### Time series prediction returns empty or errors +- Requires 48+ historical timestamps per item for reliable predictions +- Ensure `item_id` and `timestamp` columns match training format exactly + +## Key Considerations + +- Use `ParameterString` for all hyperparameters (SageMaker passes them as strings) +- Multimodal tasks require GPU instances (`ml.g4dn.xlarge`+) +- Time series inference requires 48+ historical timestamps per item +- Name evaluation scripts `run_evaluation.py` (not `evaluate.py`) to avoid conflicts with the HuggingFace `evaluate` package +- AutoGluon DLC version 1.5 uses Python 3.12 + +## Instance Recommendations + +| Task | Training | Inference | +|------|----------|-----------| +| Tabular | ml.m5.2xlarge | ml.m5.xlarge | +| Time Series | ml.m5.2xlarge | ml.m5.xlarge | +| Multimodal | ml.g4dn.xlarge | ml.g4dn.xlarge | diff --git a/sagemaker-ai/steering/hyperpod-inference.md b/sagemaker-ai/steering/hyperpod-inference.md new file mode 100644 index 0000000..784711e --- /dev/null +++ b/sagemaker-ai/steering/hyperpod-inference.md @@ -0,0 +1,688 @@ +# Inference on SageMaker HyperPod + +## Reference Repository + +For working deployment examples of popular open-source models on SageMaker (Llama, Mistral, Mixtral, Falcon, CodeLlama, etc.), see: +`https://github.com/aws-samples/sagemaker-genai-hosting-examples` + +This repo covers DJL LMI, TGI, Inferentia2 
integration, and performance tuning — the same container images and patterns apply to HyperPod inference deployments. + +## Overview + +HyperPod extends beyond training to provide a unified inference platform on the same persistent clusters. Deploy, scale, and optimize ML models with enterprise-grade reliability using Kubernetes-native workflows. Supports JumpStart foundation models, custom/fine-tuned models from S3 or FSx, autoscaling, KV caching, intelligent routing, and MIG GPU partitioning. + +Use HyperPod inference when: +- Running production LLM inference on persistent GPU clusters +- Need autoscaling, KV caching, or intelligent routing for LLM workloads +- Deploying fine-tuned models directly from training artifacts on the same cluster +- Multiple teams sharing inference infrastructure with Task Governance +- Need unified training + inference on the same compute + +## Architecture + +``` +HyperPod Cluster (Inference) +├── EKS Cluster (control plane) +│ ├── Inference Operator (deployment lifecycle) +│ ├── Task Governance (multi-team quotas via Kueue) +│ ├── KEDA (autoscaling) +│ └── Health Monitoring Agent +├── Worker Instance Groups +│ ├── GPU workers (P5, P4d, G5, G6) +│ └── Trainium workers (Trn1, Trn2) +├── Inference Features +│ ├── KV Cache (L1 CPU + L2 Redis/Tiered Storage) +│ ├── Intelligent Routing (prefix-aware, kv-aware, session, round-robin) +│ ├── MIG GPU Partitioning +│ └── SageMaker Endpoint Registration +└── Storage + ├── S3 (model artifacts) + ├── FSx for Lustre (large model weights, shared) + └── EBS (local scratch) +``` + +## Prerequisites + +Before deploying inference on HyperPod: + +- [ ] HyperPod cluster created with EKS orchestration +- [ ] HyperPod inference operator installed (via helm or cluster creation) +- [ ] `kubectl` and `jq` installed and configured +- [ ] `sagemaker-hyperpod` CLI/SDK installed (`pip install sagemaker-hyperpod`) +- [ ] Cluster context set: `hyp set-cluster-context --cluster-name ` +- [ ] Namespace with 
`hyperpod-inference` service account created by admin +- [ ] IAM permissions for S3 access (model artifacts) and SageMaker endpoint creation + +## Deployment Interfaces + +| Interface | Best For | JumpStart | Custom Models | +|-----------|----------|-----------|---------------| +| SageMaker Studio UI | Visual one-click deployment | ✅ | ❌ | +| `kubectl` | Kubernetes-native, full YAML control | ✅ | ✅ | +| HyperPod CLI (`hyp`) | Quick CLI-based deployment | ✅ | ✅ | +| HyperPod Python SDK | Programmatic integration | ✅ | ✅ | + +## Deploy JumpStart Models + +TLS certificates are auto-generated for secure ALB endpoint communication and stored in the specified S3 bucket. The `--tls-certificate-output-s3-uri` parameter is optional — omit it if you only need the SageMaker endpoint (no ALB). + +### Via HyperPod CLI + +```bash +pip install sagemaker-hyperpod + +# Set cluster context +hyp set-cluster-context --cluster-name my-cluster + +# Deploy a JumpStart model +hyp create hyp-jumpstart-endpoint \ + --model-id deepseek-llm-r1-distill-qwen-1-5b \ + --instance-type ml.g5.8xlarge \ + --endpoint-name my-jumpstart-endpoint \ + --tls-certificate-output-s3-uri s3://my-tls-bucket/ \ + --namespace default +``` + +### Via HyperPod Python SDK + +```python +from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import ( + Model, Server, SageMakerEndpoint, TlsConfig, +) +from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint + +model = Model(model_id="deepseek-llm-r1-distill-qwen-1-5b") +server = Server(instance_type="ml.g5.8xlarge") +endpoint_name = SageMakerEndpoint(name="my-jumpstart-endpoint") +tls_config = TlsConfig(tls_certificate_output_s3_uri="s3://my-tls-bucket/") + +js_endpoint = HPJumpStartEndpoint( + model=model, + server=server, + sage_maker_endpoint=endpoint_name, + tls_config=tls_config, + namespace="default", +) +js_endpoint.create() +``` + +### Via kubectl (JumpStartModel CRD) + +```bash +# Setup +export REGION= +export 
HYPERPOD_CLUSTER_NAME="my-cluster" +export MODEL_ID="deepseek-llm-r1-distill-qwen-1-5b" +export MODEL_VERSION="2.0.4" +export INSTANCE_TYPE="ml.g5.8xlarge" +export CLUSTER_NAMESPACE="default" +export SAGEMAKER_ENDPOINT_NAME="my-jumpstart-endpoint" + +# Get EKS cluster name and configure kubectl +export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster \ + --cluster-name $HYPERPOD_CLUSTER_NAME \ + --query 'Orchestrator.Eks.ClusterArn' --output text | cut -d'/' -f2) +aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION +``` + +```yaml +# jumpstart_model.yaml +apiVersion: inference.sagemaker.aws.amazon.com/v1 +kind: JumpStartModel +metadata: + name: my-jumpstart-endpoint + namespace: default +spec: + sageMakerEndpoint: + name: my-jumpstart-endpoint + model: + modelHubName: SageMakerPublicHub + modelId: deepseek-llm-r1-distill-qwen-1-5b + modelVersion: "2.0.4" + server: + instanceType: ml.g5.8xlarge + # acceleratorPartitionType: "1g.10gb" # Optional: for MIG-enabled instances + metrics: + enabled: true + maxDeployTimeInSeconds: 1800 + autoScalingSpec: + minReplicaCount: 1 + maxReplicaCount: 5 + cloudWatchTrigger: + name: "SageMaker-Invocations" + namespace: "AWS/SageMaker" + metricName: "Invocations" + targetValue: 10 + minValue: 0.0 + metricCollectionPeriod: 30 + metricStat: "Sum" + metricType: "Average" + dimensions: + - name: "EndpointName" + value: "my-jumpstart-endpoint" + - name: "VariantName" + value: "AllTraffic" +``` + +```bash +kubectl apply -f jumpstart_model.yaml +``` + +### Discovering JumpStart Models + +```bash +# List available models in SageMaker Public Hub +aws sagemaker list-hub-contents \ + --hub-name SageMakerPublicHub \ + --hub-content-type Model \ + --query '{Models: HubContentSummaries[].{ModelId:HubContentName,Version:HubContentVersion}}' \ + --output json + +# Check supported instance types for a model +aws sagemaker describe-hub-content \ + --hub-name SageMakerPublicHub \ + --hub-content-type Model \ + 
--hub-content-name "deepseek-llm-r1-distill-qwen-1-5b" \ + --output json | jq -r '.HubContentDocument | fromjson | {Default: .DefaultInferenceInstanceType, Supported: .SupportedInferenceInstanceTypes}' +``` + +## Deploy Custom / Fine-Tuned Models + +### Via HyperPod CLI (Model from S3) + +```bash +# Upload model artifacts to S3 first +aws s3 cp ./my-model s3://my-model-bucket/models/my-model/ --recursive + +# Deploy with DJL LMI container +hyp create hyp-custom-endpoint \ + --endpoint-name my-custom-endpoint \ + --model-name my-model \ + --model-source-type s3 \ + --model-location models/my-model/ \ + --s3-bucket-name my-model-bucket \ + --s3-region us-west-2 \ + --instance-type ml.g5.8xlarge \ + --image-uri 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.36.0-lmi20.0.0-cu128 \ + --container-port 8080 \ + --model-volume-mount-name modelmount \ + --tls-certificate-output-s3-uri s3://my-tls-bucket/ \ + --namespace default +``` + +### Via HyperPod Python SDK (Model from S3) + +```python +from sagemaker.hyperpod.inference.config.hp_custom_endpoint_config import ( + Model, Server, SageMakerEndpoint, TlsConfig, EnvironmentVariables, +) +from sagemaker.hyperpod.inference.hp_custom_endpoint import HPCustomEndpoint + +model = Model( + model_source_type="s3", + model_location="models/my-model/", + s3_bucket_name="my-model-bucket", + s3_region="us-west-2", + prefetch_enabled=True, +) + +server = Server( + instance_type="ml.g5.8xlarge", + image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.36.0-lmi20.0.0-cu128", + container_port=8080, + model_volume_mount_name="model-weights", +) + +resources = { + "requests": {"cpu": "30000m", "nvidia.com/gpu": 1, "memory": "100Gi"}, + "limits": {"nvidia.com/gpu": 1}, +} + +env = EnvironmentVariables( + HF_MODEL_ID="/opt/ml/model", +) + +endpoint_name = SageMakerEndpoint(name="my-custom-endpoint") +tls_config = TlsConfig(tls_certificate_output_s3_uri="s3://my-tls-bucket/") + +custom_endpoint = 
HPCustomEndpoint( + model=model, + server=server, + resources=resources, + environment=env, + sage_maker_endpoint=endpoint_name, + tls_config=tls_config, +) +custom_endpoint.create() +``` + +### Via kubectl (InferenceEndpointConfig CRD) + +#### Model from S3 + +```yaml +# custom_model_s3.yaml +apiVersion: inference.sagemaker.aws.amazon.com/v1 +kind: InferenceEndpointConfig +metadata: + name: my-custom-endpoint + namespace: default +spec: + sageMakerEndpoint: + name: my-custom-endpoint + modelSourceConfig: + modelSourceType: s3 + modelLocation: "models/my-model/" + s3Storage: + bucketName: "my-model-bucket" + region: "us-west-2" + worker: + image: "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.36.0-lmi20.0.0-cu128" + modelVolumeMountName: "model-weights" + modelInvocationPort: + containerPort: 8080 + resources: + requests: + cpu: "30000m" + nvidia.com/gpu: 1 + memory: "100Gi" + limits: + nvidia.com/gpu: 1 + environmentVariables: + - name: HF_MODEL_ID + value: "/opt/ml/model" + - name: OPTION_ROLLING_BATCH + value: "vllm" + - name: OPTION_DTYPE + value: "fp16" + - name: OPTION_MAX_MODEL_LEN + value: "4096" + server: + instanceType: ml.g5.8xlarge + metrics: + enabled: true + maxDeployTimeInSeconds: 1800 +``` + +#### Model from FSx for Lustre + +```yaml +# custom_model_fsx.yaml +apiVersion: inference.sagemaker.aws.amazon.com/v1 +kind: InferenceEndpointConfig +metadata: + name: my-fsx-endpoint + namespace: default +spec: + sageMakerEndpoint: + name: my-fsx-endpoint + modelSourceConfig: + modelSourceType: fsx + modelLocation: "/fsx/models/my-model" + fsxStorage: + claimName: "fsx-claim" + worker: + image: "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.36.0-lmi20.0.0-cu128" + modelVolumeMountName: "model-weights" + modelInvocationPort: + containerPort: 8080 + resources: + requests: + cpu: "30000m" + nvidia.com/gpu: 1 + memory: "100Gi" + limits: + nvidia.com/gpu: 1 + environmentVariables: + - name: HF_MODEL_ID + value: "/opt/ml/model" + - 
name: OPTION_ROLLING_BATCH + value: "vllm" + - name: OPTION_DTYPE + value: "fp16" + server: + instanceType: ml.g5.8xlarge + metrics: + enabled: true +``` + +### Deploying HuggingFace Models (Custom from Hub) + +To deploy a HuggingFace model not available in JumpStart, download artifacts to S3 first: + +```bash +# Download model from HuggingFace Hub +pip install huggingface-hub +huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir ./llama-3.1-8b + +# Upload to S3 +aws s3 cp ./llama-3.1-8b s3://my-model-bucket/models/llama-3.1-8b/ --recursive + +# Deploy via CLI +hyp create hyp-custom-endpoint \ + --endpoint-name llama-31-8b-endpoint \ + --model-name llama-31-8b \ + --model-source-type s3 \ + --model-location models/llama-3.1-8b/ \ + --s3-bucket-name my-model-bucket \ + --s3-region us-west-2 \ + --instance-type ml.g5.8xlarge \ + --image-uri 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.36.0-lmi20.0.0-cu128 \ + --container-port 8080 \ + --model-volume-mount-name modelmount \ + --env '{"HF_MODEL_ID":"/opt/ml/model","OPTION_ROLLING_BATCH":"vllm","OPTION_DTYPE":"fp16","OPTION_MAX_MODEL_LEN":"4096"}' \ + --namespace default +``` + +## Invocation + +### Via HyperPod CLI + +```bash +# JumpStart endpoint +hyp invoke hyp-jumpstart-endpoint \ + --endpoint-name my-jumpstart-endpoint \ + --body '{"inputs": "What is machine learning?"}' + +# Custom endpoint +hyp invoke hyp-custom-endpoint \ + --endpoint-name my-custom-endpoint \ + --body '{"inputs": "What is machine learning?", "parameters": {"max_new_tokens": 256}}' +``` + +### Via HyperPod Python SDK + +```python +# JumpStart +data = '{"inputs": "What is machine learning?"}' +response = js_endpoint.invoke(body=data).body.read() +print(response) + +# Custom +data = '{"inputs": "What is machine learning?", "parameters": {"max_new_tokens": 256}}' +response = custom_endpoint.invoke(body=data).body.read() +print(response) +``` + +### Via AWS CLI (SageMaker Runtime) + +```bash +aws sagemaker-runtime 
invoke-endpoint \ + --endpoint-name my-endpoint \ + --content-type "application/json" \ + --body '{"inputs": "What is machine learning?"}' \ + --region us-west-2 \ + --cli-binary-format raw-in-base64-out \ + /dev/stdout +``` + +### Via boto3 + +```python +import boto3, json +from botocore.config import Config + +runtime = boto3.client("sagemaker-runtime", config=Config(read_timeout=600)) +response = runtime.invoke_endpoint( + EndpointName="my-endpoint", + ContentType="application/json", + Body=json.dumps({ + "inputs": "What is machine learning?", + "parameters": {"max_new_tokens": 256, "temperature": 0.7}, + }), +) +result = json.loads(response["Body"].read()) +``` + +## KV Caching and Intelligent Routing + +### KV Cache Configuration + +Two-tier caching architecture for LLM inference performance: +- L1 Cache: CPU memory for low-latency local reuse (enabled by default) +- L2 Cache: Redis or managed tiered storage for node-level cache sharing + +```yaml +# Add to your deployment YAML spec +kvCacheSpec: + enableL1Cache: true + enableL2Cache: true + l2CacheSpec: + l2CacheBackend: tieredstorage # or "redis" + # l2CacheLocalUrl: # Required only for redis backend +``` + +Security notes: +- KV cache data is stored unencrypted at rest for performance +- When using tiered storage, multiple deployments share cache with no isolation +- For strict encryption or multi-tenant isolation, use dedicated Redis instances or separate clusters + +### Intelligent Routing + +Routes requests to the inference instance most likely to have relevant cached KV pairs: + +| Strategy | Behavior | +|----------|----------| +| `prefixaware` | Same prompt prefix → same instance (default) | +| `kvaware` | Routes to instance with highest KV cache hit rate | +| `session` | Same user session → same instance | +| `roundrobin` | Even distribution, no cache consideration | + +```yaml +intelligentRoutingSpec: + enabled: true + routingStrategy: prefixaware # or kvaware, session, roundrobin +``` + +## 
Autoscaling + +### Built-in autoScalingSpec (in deployment YAML) + +```yaml +autoScalingSpec: + minReplicaCount: 1 # 0 enables scale-to-zero + maxReplicaCount: 5 + pollingInterval: 15 + cooldownPeriod: 120 + scaleDownStabilizationTime: 60 + scaleUpStabilizationTime: 0 + cloudWatchTrigger: + name: "SageMaker-Invocations" + namespace: "AWS/SageMaker" + metricName: "Invocations" + targetValue: 10 + minValue: 0.0 + metricCollectionPeriod: 30 + metricStat: "Sum" + metricType: "Average" + dimensions: + - name: "EndpointName" + value: "my-endpoint" + - name: "VariantName" + value: "AllTraffic" +``` + +### Standalone KEDA ScaledObject (via kubectl) + +For advanced scenarios — CloudWatch, SQS, Prometheus, CPU/memory triggers: + +```yaml +apiVersion: keda.sh/v1alpha1 +kind: ScaledObject +metadata: + name: my-scaledobject + namespace: default +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: my-endpoint + minReplicaCount: 1 + maxReplicaCount: 4 + pollingInterval: 10 + triggers: + - type: aws-cloudwatch + metadata: + namespace: AWS/SageMaker + metricName: Invocations + targetMetricValue: "1" + minMetricValue: "1" + awsRegion: "us-west-2" + dimensionName: EndpointName;VariantName + dimensionValue: my-endpoint;AllTraffic + metricStatPeriod: "30" + metricStat: "Sum" + identityOwner: operator +``` + +### Scale-to-Zero + +Set `minReplicaCount: 0` with an activation threshold. Ensure the activation metric does not depend on the pods themselves (they won't be running to generate it). 
+ +## GPU Partitioning (MIG) + +Multi-Instance GPU (MIG) allows running multiple inference workloads on a single GPU for better utilization: + +```yaml +# In JumpStartModel or InferenceEndpointConfig spec +server: + instanceType: ml.p5.48xlarge + acceleratorPartitionType: "1g.10gb" # MIG profile +``` + +## Multi-Instance Type Deployment + +Specify a prioritized list of instance types for automatic failover when preferred capacity is unavailable: + +```yaml +# Uses preferredDuringSchedulingIgnoredDuringExecution node affinity +# System evaluates instance types in priority order +server: + instanceType: ml.g5.8xlarge # Primary preference + # Additional instance types configured via nodeAffinity +``` + +## Container Selection for Custom Models + +| Model Type | Recommended Container | Notes | +|------------|----------------------|-------| +| Standard LLM (Llama, Mistral, Qwen, DeepSeek) | DJL LMI with vLLM | `djl-inference:0.36.0-lmi20.0.0-cu128` | +| Multimodal/Vision (Qwen-VL, LLaVA) | DJL LMI with vLLM | Same container, vLLM supports vision models | +| Custom inference handler | HuggingFace Inference DLC | Requires `code/inference.py` in model.tar.gz | +| PyTorch custom model | PyTorch Inference DLC | For non-LLM custom models | + +Always check latest images at: `https://aws.github.io/deep-learning-containers/reference/available_images/` + +### CUDA Compatibility (Same as SageMaker Endpoints) + +| CUDA Version | Works On | Fails On | +|---|---|---| +| cu124 | g5, g6, p5 | — | +| cu128 | g5, g6, p5 | — | +| cu129 | g6, p5 | g5 (driver version mismatch → CannotStartContainerError) | + +Note: cu129 failure on g5 is due to the DLC requiring a newer GPU driver than what's installed on g5 instances. Inferentia2 (inf2) instances do not use CUDA — use Neuron SDK containers instead. 
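
The compatibility table above can be turned into a quick pre-deployment check. The following is a minimal sketch, not part of any SageMaker SDK: the names `CUDA_SUPPORT`, `instance_family`, and `image_runs_on` are hypothetical, the data is copied from the table, and it assumes the image tag ends in `-cuXXX` as the DJL LMI tags in this document do.

```python
# Compatibility data copied from the table above (hypothetical helper, not an SDK API).
CUDA_SUPPORT = {
    "cu124": {"g5", "g6", "p5"},
    "cu128": {"g5", "g6", "p5"},
    "cu129": {"g6", "p5"},  # g5 fails with CannotStartContainerError
}


def instance_family(instance_type: str) -> str:
    """Return the family of an instance type, e.g. 'ml.g5.8xlarge' -> 'g5'."""
    parts = instance_type.split(".")
    # Accept both 'ml.g5.8xlarge' and 'g5.8xlarge'.
    return parts[1] if parts[0] == "ml" else parts[0]


def image_runs_on(image_uri: str, instance_type: str) -> bool:
    """True only if the table lists the image's CUDA tag as working on the family.

    Assumes the tag ends in '-cuXXX'. Families missing from the table
    (for example inf2) return False, which means "not listed", not "known to fail".
    """
    cuda_tag = image_uri.rsplit("-", 1)[-1]
    families = CUDA_SUPPORT.get(cuda_tag)
    if families is None:
        raise ValueError(f"unknown CUDA tag: {cuda_tag!r}")
    return instance_family(instance_type) in families
```

Running a check like this before `kubectl apply` would catch the cu129-on-g5 mismatch before the pod reaches `CrashLoopBackOff`.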
+ +## Monitoring and Observability + +### Check Deployment Status + +```bash +# JumpStart +kubectl describe JumpStartModel my-endpoint -n default +hyp list hyp-jumpstart-endpoint +hyp describe hyp-jumpstart-endpoint --name my-endpoint + +# Custom +kubectl describe InferenceEndpointConfig my-endpoint -n default +hyp list hyp-custom-endpoint +hyp describe hyp-custom-endpoint --name my-endpoint + +# SageMaker endpoint status +aws sagemaker describe-endpoint --endpoint-name my-endpoint --output table + +# All resources in namespace +kubectl get pods,svc,deployment,JumpStartModel,InferenceEndpointConfig,sagemakerendpointregistration -n default +``` + +### View Logs + +```bash +# Via CLI +hyp list-pods hyp-jumpstart-endpoint --namespace default +hyp get-logs hyp-jumpstart-endpoint --namespace default --pod-name <pod-name> + +# Via kubectl +kubectl logs <pod-name> -n default + +# Operator logs +hyp get-operator-logs hyp-jumpstart-endpoint --since-hours 0.5 + +# Grafana / Prometheus (if observability stack deployed) +hyp get-monitoring --grafana +hyp get-monitoring --prometheus +``` + +### Key Metrics + +HyperPod inference tracks: time-to-first-token (TTFT), latency, GPU utilization, invocations, and error rates. 
Enable metrics in deployment YAML: + +```yaml +metrics: + enabled: true + modelMetrics: + port: 8080 +``` + +## Cleanup + +```bash +# Delete JumpStart deployment +hyp delete hyp-jumpstart-endpoint --name my-endpoint +# or +kubectl delete JumpStartModel my-endpoint -n default + +# Delete custom deployment +hyp delete hyp-custom-endpoint --name my-endpoint +# or +kubectl delete InferenceEndpointConfig my-endpoint -n default + +# Verify cleanup +kubectl get pods,svc,deployment -n default +aws sagemaker describe-endpoint --endpoint-name my-endpoint --region us-west-2 +``` + +## Troubleshooting + +### Deployment stuck in DeploymentInProgress +- Check pod status: `kubectl get pods -n <namespace>` +- Check operator logs: `hyp get-operator-logs hyp-custom-endpoint --since-hours 1` +- Verify instance type is available in cluster: `aws sagemaker describe-cluster --cluster-name <cluster-name> --query "InstanceGroups"` +- Verify namespace has `hyperpod-inference` service account + +### Pod CrashLoopBackOff +- Check container logs: `kubectl logs <pod-name> -n <namespace>` +- CUDA version mismatch: use cu128 for g5, cu129 only for g6/p5 +- OOM: model too large for instance — use larger instance or reduce `OPTION_MAX_MODEL_LEN` +- Image pull failure: verify ECR image URI and region + +### SageMaker endpoint not created +- Check SageMakerEndpointRegistration: `kubectl describe sagemakerendpointregistration -n <namespace>` +- Verify IAM role has `sagemaker:CreateEndpoint` permissions +- Ensure endpoint name is unique across the account + +### Model loading timeout +- Increase `maxDeployTimeInSeconds` in YAML (default 1800) +- For large models from S3, ensure S3 VPC endpoint exists for faster downloads +- For FSx, verify PVC is mounted and model path is correct + +### Autoscaling not working +- Verify KEDA operator is installed: `kubectl get pods -n keda` +- Check ScaledObject status: `kubectl describe scaledobject -n <namespace>` +- Verify CloudWatch metrics are being emitted (endpoint must receive traffic first) +- Ensure KEDA operator IAM role has 
CloudWatch read permissions + +### Cannot invoke endpoint +- Wait for deployment status to be `DeploymentComplete` / endpoint `InService` +- First invocation may be slow (model loading) — use `Config(read_timeout=600)` in boto3 +- Check security groups allow inbound traffic on the container port diff --git a/sagemaker-ai/steering/hyperpod.md b/sagemaker-ai/steering/hyperpod.md new file mode 100644 index 0000000..844fb18 --- /dev/null +++ b/sagemaker-ai/steering/hyperpod.md @@ -0,0 +1,273 @@ +# SageMaker HyperPod (Cluster Setup & Training) + +## Overview + +Purpose-built persistent compute clusters for foundation model training and inference. Reduces training time by up to 40% through automatic fault tolerance. Supports EKS and Slurm orchestration. + +Use HyperPod when: +- Pre-training or full fine-tuning models 70B+ +- Running multi-week/month training campaigns +- Multiple teams sharing GPU/Trainium infrastructure +- Need automatic fault recovery and node replacement +- Running production inference on persistent GPU clusters + +For inference deployment on HyperPod (JumpStart models, custom models, autoscaling, KV caching), load the dedicated steering file: + +``` +Call action "readSteering" with powerName="sagemaker-ai", steeringFile="hyperpod-inference.md" +``` + +## Architecture + +``` +HyperPod Cluster +├── EKS Cluster (control plane) +│ ├── Training Operator (job lifecycle) +│ ├── Task Governance (multi-team quotas via Kueue) +│ └── Health Monitoring Agent +├── Worker Instance Groups +│ ├── GPU workers (P5, P4d, G5, G6) +│ └── Trainium workers (Trn1, Trn2) +└── Storage + ├── FSx for Lustre (training data, checkpoints) + ├── S3 via Mountpoint CSI + └── EBS (local scratch) +``` + +## Quick Setup + +### Option 1: MCP Server (Recommended when using Kiro) + +Use the `manage_hyperpod_stacks` MCP tool for automated cluster deployment via managed CloudFormation templates (same templates as the HyperPod console UI): + +``` +manage_hyperpod_stacks(operation="deploy", 
stack_name="my-cluster-stack", region_name="us-east-1") +``` + +To check status: +``` +manage_hyperpod_stacks(operation="describe", stack_name="my-cluster-stack") +``` + +To list and inspect cluster nodes: +``` +manage_hyperpod_cluster_nodes(operation="list_clusters") +manage_hyperpod_cluster_nodes(operation="list_nodes", cluster_name="my-cluster") +manage_hyperpod_cluster_nodes(operation="describe_node", cluster_name="my-cluster", node_id="node-id") +``` + +**Note**: Cluster creation typically takes ~30 minutes. + +### Option 2: hyp CLI + +```bash +pip install sagemaker-hyperpod + +# Initialize cluster config +hyp init cluster-stack +# Edit config.yaml with your settings +hyp configure --stack-name my-stack + +# Validate and create +hyp validate +hyp create +``` + +### Key config.yaml Settings + +```yaml +template: cluster-stack +resource_name_prefix: my-training +hyperpod_cluster_name: my-cluster +kubernetes_version: "1.31" # Check EKS supported versions for your region +node_recovery: Automatic +node_provisioning_mode: Continuous + +instance_group_settings: + - InstanceCount: 4 + InstanceGroupName: gpu-workers + InstanceType: ml.p5.48xlarge + InstanceStorageConfigs: + - EbsVolumeConfig: + VolumeSizeInGB: 500 + +helm_operators: >- + trainingOperators.enabled=true, + mlflow.enabled=true, + health-monitoring-agent.enabled=true, + deep-health-check.enabled=true, + job-auto-restart.enabled=true +``` + +## Instance Types + +| Instance | Accelerator | Memory | Best For | $/hr (approx) | +|----------|------------|--------|----------|---------------| +| ml.p5.48xlarge | 8x H100 | 640 GB HBM3 | Large FM training | ~$55 | +| ml.p5e.48xlarge | 8x H200 | 1.1 TB HBM3e | Memory-bound models | Contact AWS | +| ml.p4d.24xlarge | 8x A100 | 320 GB HBM2 | Budget-conscious | ~$38 | +| ml.trn1.32xlarge | 16x Trainium v1 | 512 GB HBM | Cost-optimized | ~$25 | +| ml.trn2.48xlarge | 16x Trainium v2 | 1.5 TB HBM | Best price-perf | Contact AWS | +| ml.g5.48xlarge | 8x A10G | 192 GB | 
Fine-tuning, small models | ~$16 | + +### Sizing Guidance + +| Task | Recommended | +|------|-------------| +| Pre-training 70B | 16-32x ml.p5.48xlarge | +| Full fine-tuning 70B | 4-8x ml.p5.48xlarge | +| LoRA fine-tuning 70B | 1-2x ml.p5.48xlarge | +| Fine-tuning 7-13B | 1-4x ml.g5.48xlarge | + +## Training Operator + +Enables data scientists to submit jobs via `kubectl`. Install as EKS add-on. + +### Job Submission (HyperPodPyTorchJob) + +```yaml +apiVersion: sagemaker.amazonaws.com/v1 +kind: HyperPodPyTorchJob +metadata: + name: my-training-job + namespace: team-a + labels: + kueue.x-k8s.io/queue-name: team-a-localqueue + kueue.x-k8s.io/priority-class: high +spec: + nprocPerNode: "8" + runPolicy: + jobMaxRetryCount: 50 + restartPolicy: + numRestartBeforeFullJobRestart: 3 + evalPeriodSeconds: 21600 + maxFullJobRestarts: 1 + cleanPodPolicy: "All" + replicaSpecs: + - name: workers + replicas: 2 + template: + spec: + nodeSelector: + node.kubernetes.io/instance-type: ml.p5.48xlarge + containers: + - name: trainer + image: 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.6.0-gpu-py312-cu126-ubuntu22.04-sagemaker + command: ["hyperpodrun"] + args: ["--nproc-per-node=8", "--nnodes=2", "train.py"] + resources: + limits: + nvidia.com/gpu: 8 + hugepages-2Mi: 5120Mi + requests: + nvidia.com/gpu: 8 + hugepages-2Mi: 5120Mi + memory: 32000Mi +``` + +### Log Monitoring Rules + +Add to job spec for automatic hang/fault detection: + +```yaml +logMonitoringConfiguration: + rules: + - name: "JobStart" + logPattern: ".*Training started.*" + expectedStartCutOffInSeconds: 120 + - name: "HangDetection" + logPattern: ".*step (\\d+).*" + expectedRecurringFrequencyInSeconds: 300 + expectedStartCutOffInSeconds: 600 + - name: "OOMDetection" + logPattern: ".*OutOfMemoryError.*" + faultOnMatch: true +``` + +## Task Governance (Multi-Team) + +Requires Kubernetes >= 1.30. Provides namespace isolation, compute quotas, priority scheduling, and idle resource borrowing. 
+ +### Team Quota Example + +| Team | Namespace | GPUs | Priority | +|------|-----------|------|----------| +| Pre-training | team-pretrain | 48 (6 nodes × 8) | critical | +| Fine-tuning | team-finetune | 16 (2 nodes × 8) | high | +| Exploration | team-explore | 8 (1 node × 8) | low | + +Idle GPUs from one team are automatically lent to others and preempted when reclaimed. + +## Resiliency Features + +| Feature | Details | +|---------|---------| +| Health monitoring | Continuous on all nodes | +| Deep health checks | GPU/Trainium diagnostics, adds NoSchedule taint during check | +| Auto node replacement | Faulty nodes replaced without manual intervention | +| Job auto-resume | Training Operator restarts from last checkpoint | +| Checkpointless training | Recovery in minutes without saving checkpoints (Dec 2025+) | +| Elastic training | Dynamic cluster scaling during training | + +## Training Plans (Capacity Reservations) + +For guaranteed GPU availability: + +| Parameter | Options | +|-----------|---------| +| Advance booking | Up to 56 days ahead, or instant-start | +| Duration | 1-182 days | +| Instance quantities | 1, 2, 4, 8, 16, 32, or 64 | +| Cancellation | Cannot be cancelled once purchased | + +Use Training Plans when: +- GPU capacity is scarce in your region +- Running predictable multi-day training +- Need guaranteed availability for deadlines + +## HyperPod Recipes + +Pre-configured distributed training stacks for popular models on HyperPod clusters. These are different from the Model Customization Recipes (which target Training Jobs with Accelerate + DeepSpeed). HyperPod Recipes auto-configure distributed training, checkpointing, and optimizers for the HyperPod Training Operator. 
+ +```bash +git clone https://github.com/aws/sagemaker-hyperpod-recipes +# Pick model + instance config from recipe table +# Launch — auto-configures distributed training, checkpointing, optimizers +``` + +## Prerequisites Checklist + +Before creating a cluster: + +- [ ] Service quotas requested (GPU instances, EBS volumes, network interfaces) +- [ ] IAM roles created (HyperPod service role, EKS node role, Training Operator role) +- [ ] VPC with private subnets (public subnets NOT supported) +- [ ] Security groups allowing all intra-group traffic (required for EFA) +- [ ] S3 VPC endpoint created +- [ ] EKS Pod Identity Agent add-on installed +- [ ] cert-manager installed on EKS cluster + +## Troubleshooting + +### Cluster stuck in Creating +- Use `manage_hyperpod_stacks(operation="describe", stack_name="...")` to check CloudFormation stack status and events +- Check CloudFormation console for detailed errors +- Verify service quotas are approved +- Ensure subnets are private (not public) + +### Training job not starting +- Verify Training Operator is running: `kubectl get pods -n aws-hyperpod` +- Check Kueue queue status: `kubectl get clusterqueue` +- Verify namespace and queue labels in job YAML + +### Node replacement taking too long +- Use `manage_hyperpod_cluster_nodes(operation="describe_node", cluster_name="...", node_id="...")` to check node status +- Deep health checks add NoSchedule taint — check: `kubectl get nodes -o wide` +- Verify instance availability in your AZ +- Check HyperPod cluster events in SageMaker console +- To update node software: `manage_hyperpod_cluster_nodes(operation="update_software", cluster_name="...")` + +### Neuron compilation slow (Trainium) +- First run compiles model graph (10-20 min) — subsequent runs use cache +- Set `NEURON_CC_FLAGS="--model-type transformer"` for faster compilation +- Use compilation cache: mount persistent storage for `/var/tmp/neuron-compile-cache` diff --git a/sagemaker-ai/steering/inference-endpoints.md 
b/sagemaker-ai/steering/inference-endpoints.md new file mode 100644 index 0000000..b682fb0 --- /dev/null +++ b/sagemaker-ai/steering/inference-endpoints.md @@ -0,0 +1,279 @@ +# Inference Endpoints on SageMaker AI + +## Reference Repository + +For working deployment examples of popular open-source models on SageMaker (Llama, Mistral, Mixtral, Falcon, CodeLlama, etc.), see: +`https://github.com/aws-samples/sagemaker-genai-hosting-examples` + +This repo covers DJL LMI, TGI, Inferentia2 integration, and performance tuning for real-time and async inference. + +## Container Selection + +### Where to Find Latest DLC Images + +The canonical source for AWS Deep Learning Container images is: +`https://aws.github.io/deep-learning-containers/reference/available_images/` + +The HuggingFace docs page (`huggingface.co/docs/sagemaker/en/reference`) is outdated and only lists images up to transformers 4.26. Do not use it. + +### DJL LMI with vLLM Backend (Recommended for LLMs) + +Best for: LLMs, multimodal models, any model supported by vLLM. 
+ +Image pattern: `763104351884.dkr.ecr.<region>.amazonaws.com/djl-inference:<version>-lmi<lmi-version>-cu<cuda-version>` + +Key env vars: +- `HF_MODEL_ID` — HuggingFace model ID (e.g., `meta-llama/Llama-3.1-8B-Instruct`) +- `OPTION_ROLLING_BATCH=vllm` — Use vLLM as the inference backend +- `OPTION_DTYPE=fp16` or `bf16` — Model precision (critical for fitting on GPU) +- `OPTION_MAX_MODEL_LEN=4096` — Maximum sequence length +- `OPTION_TENSOR_PARALLEL_DEGREE=1` — Number of GPUs for tensor parallelism + +Example: +```python +ContainerDefinition( + image="763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.36.0-lmi20.0.0-cu128", + environment={ + "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct", + "OPTION_ROLLING_BATCH": "vllm", + "OPTION_DTYPE": "fp16", + "OPTION_MAX_MODEL_LEN": "4096", + "OPTION_TENSOR_PARALLEL_DEGREE": "1", + }, +) +``` + +### HuggingFace Inference DLC + +Best for: Simple HF pipeline models (classification, NER, summarization) or models with custom inference.py handlers. + +Image pattern: `763104351884.dkr.ecr.<region>.amazonaws.com/huggingface-pytorch-inference:<pytorch-version>-transformers<transformers-version>-gpu-py<python-version>-cu<cuda-version>-ubuntu<ubuntu-version>` + +Key env vars: +- `HF_MODEL_ID` — HuggingFace model ID +- `HF_MODEL_REVISION` — Git revision (branch, tag, commit) +- `HF_TASK` — Pipeline task (e.g., `text-classification`, `image-to-text`) + +Limitations: +- Default handler loads models in FP32 — no env var to override dtype +- For 8B+ models on 24GB GPUs, this will OOM +- Custom `inference.py` must be packaged in `model.tar.gz` at `code/inference.py` +- The `SourceCode` parameter in ModelBuilder does NOT reliably place the handler where the DLC expects it + +### HuggingFace vLLM DLC + +Image pattern: `763104351884.dkr.ecr.<region>.amazonaws.com/huggingface-vllm:<version>-transformers<transformers-version>-gpu-py<python-version>-cu<cuda-version>-ubuntu<ubuntu-version>` + +Note: As of the latest release, this DLC only ships with cu129 which does not work on ml.g5 instances. Use the DJL LMI container with vLLM backend instead for g5 instances. Always check the DLC images page for current versions. 
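
Because the note above points g5 users back to the DJL LMI container, it can help to assemble that image URI from its parts instead of hand-editing a string. This is only a sketch: `djl_lmi_image_uri` is a hypothetical helper, and it assumes the account ID used in this document's examples applies to your region, which is worth confirming on the DLC images page.

```python
# Account ID as used in this document's examples; confirm it for your region.
DLC_ACCOUNT = "763104351884"


def djl_lmi_image_uri(region: str, djl_version: str, lmi_version: str, cuda: str) -> str:
    """Build a DJL LMI image URI following the tag layout of the examples here.

    `cuda` is the full tag suffix, e.g. 'cu128'.
    """
    return (
        f"{DLC_ACCOUNT}.dkr.ecr.{region}.amazonaws.com/"
        f"djl-inference:{djl_version}-lmi{lmi_version}-{cuda}"
    )


# Reproduces the image used in the ContainerDefinition example.
print(djl_lmi_image_uri("us-east-1", "0.36.0", "20.0.0", "cu128"))
```

Keeping the region and CUDA suffix as separate arguments makes it harder to deploy a us-west-2 image string into another region by accident.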
+ +### CUDA Version Compatibility + +This is critical and easy to get wrong: + +| CUDA Version | Works On | Fails On | +|---|---|---| +| cu124 | g5, g6, p5 | — | +| cu128 | g5, g6, p5 | — | +| cu129 | g6, p5 | g5 (driver version mismatch → CannotStartContainerError) | + +Note: cu129 failure on g5 is due to the DLC requiring a newer GPU driver than what's installed on g5 instances, not a fundamental CUDA compute capability issue. Inferentia2 (inf2) instances do not use CUDA — use Neuron SDK containers instead. + +## SageMaker Python SDK v3 + +### When to Use ModelBuilder + +Use `ModelBuilder` for: +- JumpStart models (pass model ID string like `meta-textgeneration-llama-3-1-8b`) +- Simple HuggingFace models that work with the default pipeline handler + +Do NOT use `ModelBuilder` for: +- DJL LMI containers — it overrides the container behavior +- Models that need custom dtype or loading logic +- When you need full control over the container environment + +### When to Use Core API Directly + +Use `Model.create` + `EndpointConfig.create` + `Endpoint.create` for: +- DJL LMI deployments +- Any deployment where you need full control over the container image and env vars +- When ModelBuilder keeps overriding your settings + +```python +from sagemaker.core.resources import Model, EndpointConfig, Endpoint +from sagemaker.core.shapes.shapes import ContainerDefinition, ProductionVariant + +# Create model +Model.create( + model_name="my-model", + primary_container=ContainerDefinition( + image=f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.36.0-lmi20.0.0-cu128", + environment={ + "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct", + "OPTION_ROLLING_BATCH": "vllm", + "OPTION_DTYPE": "fp16", + "OPTION_MAX_MODEL_LEN": "4096", + "OPTION_TENSOR_PARALLEL_DEGREE": "1", + }, + ), + execution_role_arn=role_arn, +) + +# Create endpoint config +EndpointConfig.create( + endpoint_config_name="my-endpoint", + production_variants=[ + ProductionVariant( + 
variant_name="AllTraffic", + model_name="my-model", + initial_instance_count=1, + instance_type="ml.g5.2xlarge", + initial_variant_weight=1.0, + container_startup_health_check_timeout_in_seconds=900, + model_data_download_timeout_in_seconds=900, + ) + ], +) + +# Create endpoint +Endpoint.create( + endpoint_name="my-endpoint", + endpoint_config_name="my-endpoint", +) + +# Wait for InService +import boto3 +sm = boto3.client("sagemaker", region_name=region) +waiter = sm.get_waiter("endpoint_in_service") +waiter.wait(EndpointName="my-endpoint", WaiterConfig={"Delay": 30, "MaxAttempts": 60}) +``` + +### ModelBuilder Pitfalls + +1. `dependencies={"auto": True}` scans your local Python environment and installs those exact versions in the container. If your local machine has `torch==2.10` but the DLC has `torch==2.6`, it will overwrite the DLC's torch and break CUDA/NCCL. + + Fix: Always use `dependencies={"auto": False}` or `dependencies={"auto": False, "custom": ["only-what-you-need"]}`. + +2. `model=` string parameter makes ModelBuilder use the HF default handler regardless of `image_uri`. It ignores your DJL/vLLM container. + + Fix: Use the core API directly for DJL/vLLM containers. + +3. `InferenceSpec` serializes your Python class via cloudpickle. The container needs `cloudpickle` and `sagemaker` installed to deserialize it. + + Fix: Avoid InferenceSpec for production. Use the core API with container-native env vars. + +4. `deploy()` returns an `Endpoint` object (SDK v3), not a `Predictor` (SDK v2). Use `Endpoint.invoke()` or `boto3 sagemaker-runtime invoke_endpoint`. + +5. `SchemaBuilder` must be a `SchemaBuilder(sample_input, sample_output)` object, not a plain dict. 
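
Since `deploy()` no longer returns a `Predictor` (pitfall 4), client code ends up calling the runtime API itself, and the first call can fail while the model is still loading. A minimal sketch of a retry wrapper follows; `call_with_retry` is a hypothetical helper rather than an SDK function, and the attempt count and delay are assumptions to tune.

```python
import time


def call_with_retry(fn, attempts=3, delay_seconds=30.0, retry_on=(Exception,)):
    """Call fn() until it succeeds or `attempts` calls have failed.

    Sleeps `delay_seconds` between failed attempts and re-raises the last error.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error


# Usage with boto3 (commented out so the sketch runs without AWS credentials):
# runtime = boto3.client("sagemaker-runtime", config=Config(read_timeout=600))
# response = call_with_retry(
#     lambda: runtime.invoke_endpoint(
#         EndpointName="my-endpoint",
#         ContentType="application/json",
#         Body=json.dumps({"inputs": "ping"}),
#     )
# )
```

Passing the call as a zero-argument callable keeps the wrapper independent of whether you invoke through boto3 or `Endpoint.invoke()`. In production you would likely narrow `retry_on` to timeout and model-error exceptions.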
+ +## Endpoint Configuration + +### Timeouts + +For models that download from HuggingFace Hub at container startup: + +```python +ProductionVariant( + variant_name="AllTraffic", + model_name="my-model", + initial_instance_count=1, + instance_type="ml.g5.2xlarge", + initial_variant_weight=1.0, + container_startup_health_check_timeout_in_seconds=900, # 15 min + model_data_download_timeout_in_seconds=900, # 15 min +) +``` + +Default is 300s, which is too short for large models. + +### First Invocation + +The first invocation after deploy may time out because: +- The model loads lazily on first request (DJL LMI behavior) +- Model weights download from HuggingFace Hub + +Solutions: +- Add a warm-up wait (120-180s) after endpoint goes InService +- Use `boto3` with `Config(read_timeout=600)` for the first call +- Implement retry logic in your client + +### Invocation Format + +DJL LMI with vLLM backend accepts: + +```python +# Simple text generation +payload = { + "inputs": "Your prompt here", + "parameters": {"max_new_tokens": 256, "temperature": 0.4}, +} + +# Multimodal (image + text) — Idefics3 format +payload = { + "inputs": "User: Describe this image.\nAssistant:", + "parameters": {"max_new_tokens": 256}, + "images": ["data:image/png;base64,<base64-encoded-image>"], +} +``` + +## Troubleshooting + +### CannotStartContainerError + +**Cause:** CUDA version incompatibility between container and instance. +**Fix:** Use cu128 for g5 instances, cu129 only for g6/p5. + +### Worker died / Load model failed + +**Cause:** Out of memory — model too large for GPU. +**Fix:** +- Use `OPTION_DTYPE=fp16` or `bf16` (DJL LMI) +- Use a larger instance type +- Reduce `OPTION_MAX_MODEL_LEN` +- For HF DLC: the default handler loads in FP32 with no override — switch to DJL LMI + +### ncclCommShrink undefined symbol + +**Cause:** `ModelBuilder` `auto:True` dependencies overwrote the DLC's PyTorch with an incompatible version. +**Fix:** Use `dependencies={"auto": False}`. 
+ +### bitsandbytes not found + +**Cause:** Using a quantized model revision (e.g., `quantized8bit`) but the DLC doesn't include bitsandbytes. +**Fix:** Use the `main` revision with `OPTION_DTYPE=fp16` instead. + +### No inference script implementation found + +**Cause:** Custom `inference.py` not placed where the HF DLC expects it (`/opt/ml/model/code/inference.py`). +**Fix:** Package as `model.tar.gz` with `code/inference.py` inside, upload to S3, and set `model_data_url`. Or switch to DJL LMI which doesn't need custom handlers. + +### Invocation timeout (server error 0) + +**Cause:** Model still loading on first request. +**Fix:** Wait 120-180s after InService before first invocation. Use `Config(read_timeout=600)` in boto3. + +### SSO session expired during long deployments + +**Cause:** AWS SSO tokens expire (typically 1-8 hours). +**Fix:** Run `aws sso login` before long operations. For automated scripts, use IAM roles or long-lived credentials. + +## Model Memory Sizing + +Quick reference for GPU memory requirements: + +| Model Size | FP32 | FP16/BF16 | INT8 | Fits On | +|---|---|---|---|---| +| 7-8B | ~32GB | ~16GB | ~8GB | g5.xlarge (FP16), g6.xlarge (FP16) | +| 13B | ~52GB | ~26GB | ~13GB | g5.12xlarge (4x A10G, TP=2+), g6.12xlarge, p5 (any) | +| 70B | ~280GB | ~140GB | ~70GB | p5.48xlarge (8x H100), g5.48xlarge (8x A10G, TP=8) | + +Note: 13B FP16 (~26GB) exceeds single A10G capacity (24GB). Use multi-GPU instances with tensor parallelism. + +Always use FP16 or BF16 for inference — FP32 is wasteful and often won't fit. 
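
The figures in the sizing table follow from parameter count times bytes per parameter. The sketch below reproduces that arithmetic and derives a minimum tensor-parallel degree. It covers weights only, so KV cache and activations need extra headroom on top. The function names are hypothetical, and rounding the GPU count up to a power of two is an assumption that matches common tensor-parallel setups but is not a universal rule.

```python
import math

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(params_billion: float, dtype: str) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[dtype]


def min_tensor_parallel(params_billion: float, dtype: str, gpu_memory_gb: float) -> int:
    """Smallest power-of-two GPU count whose combined memory holds the weights."""
    gpus = max(1, math.ceil(weight_memory_gb(params_billion, dtype) / gpu_memory_gb))
    # Round up to the next power of two (assumed TP constraint, see lead-in).
    return 1 << (gpus - 1).bit_length()


# 13B in FP16 needs about 26 GB, more than one 24 GB A10G, so TP must be at least 2.
print(weight_memory_gb(13, "fp16"), min_tensor_parallel(13, "fp16", 24))
```

For a 70B model in FP16 on 24 GB A10G GPUs the same function returns 8, which lines up with the `g5.48xlarge (8x A10G, TP=8)` entry in the table.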
+ +## HyperPod Inference + +For deploying inference on HyperPod clusters (JumpStart models, custom models from S3/FSx, autoscaling, KV caching, intelligent routing), load the dedicated steering file: + +``` +Call action "readSteering" with powerName="sagemaker-ai", steeringFile="hyperpod-inference.md" +``` diff --git a/sagemaker-ai/steering/model-monitor.md b/sagemaker-ai/steering/model-monitor.md new file mode 100644 index 0000000..91748a0 --- /dev/null +++ b/sagemaker-ai/steering/model-monitor.md @@ -0,0 +1,541 @@ +# SageMaker Model Monitor (SDK v3) + +## Overview + +SageMaker Model Monitor continuously monitors the quality of ML models in production by detecting data drift, model quality degradation, bias drift, and feature attribution changes. All 4 monitor types use the same workflow: baseline → schedule → detect violations. + +| Monitor Type | What It Detects | Requires Ground Truth | +|---|---|---| +| Data Quality | Feature drift, missing values, distribution shifts, schema violations | No | +| Model Quality | Degradation in accuracy, precision, recall, F1, AUC | Yes | +| Model Bias | Demographic parity difference (DPL), disparate impact (DI) | Yes | +| Model Explainability | SHAP feature attribution drift | No | + +## Architecture + +``` +┌─────────────┐ ┌──────────────┐ ┌──────────────────┐ +│ Training │───>│ Model │───>│ Endpoint │ +│ (XGBoost) │ │ (S3 .tar.gz)│ │ (ml.m5.xlarge) │ +└─────────────┘ └──────────────┘ └────────┬──────────┘ + │ + DataCapture (100%) + │ + ▼ + ┌──────────────┐ + │ S3 Capture │ + │ (JSONL) │ + └──────┬───────┘ + │ + ┌──────────────┬────────────┴──────────────┐ + │ │ │ + ┌─────▼──────┐ ┌────▼───────────┐ ┌────────────▼──────┐ + │ DQ Monitor │ │ MQ Monitor │ │ Clarify Monitors │ + │ (hourly) │ │ (hourly + │ │ (Bias + SHAP, │ + │ │ │ ground truth) │ │ hourly) │ + └────────────┘ └────────────────┘ └───────────────────┘ +``` + +## SDK v3 Imports + +All monitoring classes live under `sagemaker.core` (verified with `sagemaker==3.5.0`): 
+ +```python +from sagemaker.core.helper.session_helper import Session, get_execution_role +from sagemaker.core.model_monitor import ( + DefaultModelMonitor, + ModelQualityMonitor, + DataCaptureConfig, + CronExpressionGenerator, + DatasetFormat, + EndpointInput, +) +from sagemaker.core.model_monitor.clarify_model_monitoring import ( + ModelBiasMonitor, + ModelExplainabilityMonitor, +) +from sagemaker.core.model_monitor.model_monitoring import Constraints +from sagemaker.core.clarify import ( + BiasConfig, DataConfig, ModelConfig, + ModelPredictedLabelConfig, SHAPConfig, +) +``` + +## Prerequisites + +- `sagemaker>=3.5.0` (SDK v3 — v2 will NOT work) +- An endpoint deployed with DataCapture enabled +- IAM role with `AmazonSageMakerFullAccess` and S3 read/write + +## Step 1: Deploy Endpoint with Data Capture + +Data capture must be enabled on the endpoint before any monitor can work. + +```python +from sagemaker.serve.model_builder import ModelBuilder +from sagemaker.serve.builder.schema_builder import SchemaBuilder +from sagemaker.core.model_monitor import DataCaptureConfig + +data_capture_config = DataCaptureConfig( + enable_capture=True, + sampling_percentage=100, + destination_s3_uri=f"s3://{bucket}/datacapture/{endpoint_name}", + capture_options=["Input", "Output"], + csv_content_types=["text/csv"], +) + +builder = ModelBuilder( + image_uri=image_uri, + s3_model_data_url=repacked_model_uri, + role_arn=role, + sagemaker_session=sm_session, + schema_builder=SchemaBuilder( + sample_input="39,77516,13,2174,0,40,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States", + sample_output="0,0.123456", + ), + env_vars={ + "SAGEMAKER_PROGRAM": "inference.py", + "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code", # Local path inside container, NOT S3 + }, +) +builder.build() + +# Known SDK v3 bug: deploy() may raise ValidationError on kms_key_id +# The endpoint IS deployed successfully — catch and continue +try: + endpoint = builder.deploy( + 
endpoint_name=endpoint_name, + instance_type="ml.m5.xlarge", + initial_instance_count=1, + data_capture_config=data_capture_config, + ) +except Exception as e: + if "kms_key_id" in str(e).lower() or "validation error" in str(e).lower(): + print(f"Known SDK v3 bug (endpoint IS deployed): {type(e).__name__}") + else: + raise +``` + +Data capture latency: 2-5 minutes before capture files appear in S3. + +## Step 2: Data Quality Monitor + +Detects feature drift, missing values, and distribution shifts. No ground truth needed. + +### Baseline + +```python +data_quality_monitor = DefaultModelMonitor( + role=role, + instance_count=1, + instance_type="ml.m5.xlarge", + volume_size_in_gb=20, + max_runtime_in_seconds=1800, # Must be < schedule cadence (3600s for hourly) + sagemaker_session=sm_session, +) + +data_quality_monitor.suggest_baseline( + baseline_dataset=baseline_s3_uri, # CSV with header, features only (no target) + dataset_format=DatasetFormat.csv(header=True), + output_s3_uri=f"s3://{bucket}/data-quality/baseline-results", + wait=True, + logs=False, +) + +# Get baseline artifacts +statistics = data_quality_monitor.latest_baselining_job.baseline_statistics() +constraints = data_quality_monitor.latest_baselining_job.suggested_constraints() +``` + +### Schedule + +```python +data_quality_monitor.create_monitoring_schedule( + monitor_schedule_name="dq-monitor-schedule", + endpoint_input=endpoint_name, + output_s3_uri=f"s3://{bucket}/data-quality/reports", + statistics=statistics, + constraints=constraints, + schedule_cron_expression=CronExpressionGenerator.hourly(), + enable_cloudwatch_metrics=True, +) +``` + +### Trigger Violations (Testing) + +Send drifted data to the endpoint to trigger violations: + +```python +import numpy as np + +def generate_drifted_data(df, numerical_features, categorical_features): + """Shift numericals by 3σ, flip 30% of categoricals, inject 15% missing.""" + df = df.copy() + # Numerical shift + n_shift = int(len(df) * 0.5) + idx = 
np.random.choice(df.index, size=n_shift, replace=False) + for col in numerical_features: + df.loc[idx, col] = df.loc[idx, col] + 3 * df[col].std() + # Categorical flip + n_flip = int(len(df) * 0.3) + idx = np.random.choice(df.index, size=n_flip, replace=False) + for col in categorical_features: + unique_vals = df[col].dropna().unique() + if len(unique_vals) > 1: + df.loc[idx, col] = np.random.choice(unique_vals, size=n_flip) + # Missing values + n_missing = int(len(df) * 0.15) + for col in df.columns: + idx = np.random.choice(df.index, size=n_missing, replace=False) + df.loc[idx, col] = np.nan + return df +``` + +## Step 3: Model Quality Monitor + +Detects degradation in classification/regression metrics. Requires ground truth labels. + +### Baseline + +```python +model_quality_monitor = ModelQualityMonitor( + role=role, + instance_count=1, + instance_type="ml.m5.xlarge", + volume_size_in_gb=20, + max_runtime_in_seconds=1800, + sagemaker_session=sm_session, +) + +# Baseline CSV must have: prediction, probability, label columns +model_quality_monitor.suggest_baseline( + baseline_dataset=mq_baseline_s3_uri, + dataset_format=DatasetFormat.csv(header=True), + output_s3_uri=f"s3://{bucket}/model-quality/baseline-results", + problem_type="BinaryClassification", + inference_attribute="prediction", + probability_attribute="probability", + ground_truth_attribute="label", + wait=True, + logs=False, +) + +mq_constraints = model_quality_monitor.latest_baselining_job.suggested_constraints() +``` + +### Ground Truth Format + +Ground truth must be uploaded as JSONL with event IDs matching data capture records: + +```json +{"groundTruthData": {"data": "0", "encoding": "CSV"}, "eventMetadata": {"eventId": ""}} +{"groundTruthData": {"data": "1", "encoding": "CSV"}, "eventMetadata": {"eventId": ""}} +``` + +Upload to S3 with time-partitioned path: `s3://bucket/ground-truth/YYYY/MM/DD/HH/` + +### Schedule + +```python +endpoint_input = EndpointInput( + endpoint_name=endpoint_name, + 
destination="/opt/ml/processing/input/endpoint", + inference_attribute="0", # Column index for predicted label + probability_attribute="1", # Column index for probability +) + +model_quality_monitor.create_monitoring_schedule( + monitor_schedule_name="mq-monitor-schedule", + endpoint_input=endpoint_input, + output_s3_uri=f"s3://{bucket}/model-quality/reports", + constraints=mq_constraints, + ground_truth_input=f"s3://{bucket}/ground-truth", + problem_type="BinaryClassification", + schedule_cron_expression=CronExpressionGenerator.hourly(), + enable_cloudwatch_metrics=True, +) +``` + +## Step 4: Model Bias Monitor (Clarify) + +Detects bias drift on protected attributes. Uses SageMaker Clarify under the hood. + +### Baseline + +```python +from sagemaker.core.clarify import BiasConfig, DataConfig, ModelConfig, ModelPredictedLabelConfig + +bias_data_config = DataConfig( + s3_data_input_path=analysis_s3_uri, # CSV with features + label + s3_output_path=f"s3://{bucket}/clarify/bias-results", + label="income", + headers=ALL_FEATURES + ["income"], + dataset_type="text/csv", +) + +bias_config = BiasConfig( + label_values_or_threshold=[1], # Favorable outcome + facet_name="sex", # Protected attribute + facet_values_or_threshold=["Male"], # Reference group +) + +model_config = ModelConfig( + model_name=model_name, + instance_count=1, + instance_type="ml.m5.xlarge", + content_type="text/csv", + accept_type="text/csv", +) + +predicted_label_config = ModelPredictedLabelConfig( + label=0, # Column index for predicted label + probability=1, # Column index for probability + probability_threshold=0.5, +) + +model_bias_monitor = ModelBiasMonitor( + role=role, + instance_count=1, + instance_type="ml.m5.xlarge", + volume_size_in_gb=20, + max_runtime_in_seconds=1800, + sagemaker_session=sm_session, +) + +# Known SDK v3 bug: suggest_baseline may raise AttributeError on sagemaker_session +# The baseline job DOES complete — catch and load constraints from S3 +try: + 
model_bias_monitor.suggest_baseline( + data_config=bias_data_config, + bias_config=bias_config, + model_config=model_config, + model_predicted_label_config=predicted_label_config, + wait=True, + logs=False, + ) + bias_constraints = model_bias_monitor.latest_baselining_job.suggested_constraints() +except AttributeError as e: + if "sagemaker_session" in str(e): + bias_constraints = Constraints.from_s3_uri( + constraints_file_s3_uri=f"s3://{bucket}/clarify/bias-results/analysis.json", + sagemaker_session=sm_session, + ) + else: + raise +``` + +### Schedule + +```python +bias_endpoint_input = EndpointInput( + endpoint_name=endpoint_name, + destination="/opt/ml/processing/input/endpoint", + inference_attribute="0", + probability_attribute="1", +) + +model_bias_monitor.create_monitoring_schedule( + monitor_schedule_name="bias-monitor-schedule", + endpoint_input=bias_endpoint_input, + output_s3_uri=f"s3://{bucket}/bias/reports", + constraints=bias_constraints, + ground_truth_input=f"s3://{bucket}/ground-truth", + schedule_cron_expression=CronExpressionGenerator.hourly(), + enable_cloudwatch_metrics=True, +) +``` + +## Step 5: Model Explainability Monitor (SHAP) + +Detects drift in feature attributions using SHAP values. + +### SHAP Baseline for Mixed Data + +When building the SHAP baseline row for datasets with both numerical and categorical features, use median for numeric and mode for categorical: + +```python +baseline_row = [ + float(df[c].median()) if c in NUMERICAL_FEATURES else df[c].mode().iloc[0] + for c in ALL_FEATURES +] + +shap_config = SHAPConfig( + baseline=[baseline_row], + num_samples=100, + agg_method="mean_abs", +) +``` + +Do NOT use `.median()` on all columns — it raises `TypeError` on categorical columns. 
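
The median-for-numeric, mode-for-categorical rule above can be sketched without pandas. This is a minimal illustration using only the standard library; the toy rows, the column names, and the helper `build_baseline_row` are hypothetical and not part of the SageMaker SDK.

```python
from statistics import median, mode

# Hypothetical toy dataset; column names are illustrative only.
ROWS = [
    {"age": 39, "hours_per_week": 40, "workclass": "State-gov"},
    {"age": 50, "hours_per_week": 13, "workclass": "Private"},
    {"age": 38, "hours_per_week": 40, "workclass": "Private"},
]
FEATURES = ["age", "hours_per_week", "workclass"]
NUMERICAL = {"age", "hours_per_week"}


def build_baseline_row(rows, features, numerical):
    """Build one baseline row: median for numeric columns, mode otherwise."""
    baseline = []
    for name in features:
        values = [row[name] for row in rows]
        if name in numerical:
            # Numeric column: the median is robust to outliers.
            baseline.append(float(median(values)))
        else:
            # Categorical column: use the most frequent value instead.
            baseline.append(mode(values))
    return baseline


baseline_row = build_baseline_row(ROWS, FEATURES, NUMERICAL)
```

The resulting list has one entry per feature, in feature order, which is the shape the `baseline=[baseline_row]` argument shown above expects.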
+ +### Baseline + +```python +explainability_data_config = DataConfig( + s3_data_input_path=analysis_s3_uri, + s3_output_path=f"s3://{bucket}/clarify/explainability-results", + label="income", + headers=ALL_FEATURES + ["income"], + dataset_type="text/csv", +) + +model_explainability_monitor = ModelExplainabilityMonitor( + role=role, + instance_count=1, + instance_type="ml.m5.xlarge", + volume_size_in_gb=20, + max_runtime_in_seconds=1800, + sagemaker_session=sm_session, +) + +# Same SDK v3 bug as bias monitor — catch and load from S3 +try: + model_explainability_monitor.suggest_baseline( + data_config=explainability_data_config, + model_config=model_config, + explainability_config=shap_config, + wait=True, + logs=False, + ) + explain_constraints = model_explainability_monitor.latest_baselining_job.suggested_constraints() +except AttributeError as e: + if "sagemaker_session" in str(e): + explain_constraints = Constraints.from_s3_uri( + constraints_file_s3_uri=f"s3://{bucket}/clarify/explainability-results/analysis.json", + sagemaker_session=sm_session, + ) + else: + raise +``` + +### Schedule + +```python +explain_endpoint_input = EndpointInput( + endpoint_name=endpoint_name, + destination="/opt/ml/processing/input/endpoint", +) + +model_explainability_monitor.create_monitoring_schedule( + monitor_schedule_name="explain-monitor-schedule", + endpoint_input=explain_endpoint_input, + output_s3_uri=f"s3://{bucket}/explainability/reports", + constraints=explain_constraints, + schedule_cron_expression=CronExpressionGenerator.hourly(), + enable_cloudwatch_metrics=True, +) +``` + +## Inference Handler for Model Monitor + +The endpoint's inference handler must output CSV with `predicted_label,probability` per row. 
This format is required for all 4 monitor types: + +```python +# Imports needed by the handler functions below +import io +import json +import os + +import joblib +import numpy as np +import pandas as pd + +def model_fn(model_dir): + model = joblib.load(os.path.join(model_dir, "model.joblib")) + encoder = joblib.load(os.path.join(model_dir, "encoder.joblib")) + with open(os.path.join(model_dir, "feature_names.json")) as f: + feature_names = json.load(f) + return {"model": model, "encoder": encoder, "feature_names": feature_names} + +def input_fn(request_body, request_content_type): + if request_content_type == "text/csv": + return pd.read_csv(io.StringIO(request_body), header=None) + raise ValueError(f"Unsupported: {request_content_type}") + +def predict_fn(input_data, model_dict): + model, encoder = model_dict["model"], model_dict["encoder"] + feature_names = model_dict["feature_names"] + if list(input_data.columns) == list(range(len(input_data.columns))): + input_data.columns = feature_names[:len(input_data.columns)] + X_enc = encoder.transform(input_data[feature_names]) + predictions = model.predict(X_enc) + probabilities = model.predict_proba(X_enc)[:, 1] + return np.column_stack([predictions, probabilities]) + +def output_fn(prediction, accept): + out = io.StringIO() + for row in prediction: + out.write(f"{int(row[0])},{row[1]:.6f}\n") + return out.getvalue(), "text/csv" +``` + +Package as `code/inference.py` inside `model.tar.gz`. + +## Invocation + +Always use boto3 `sagemaker-runtime` for invocation, because the SDK v3 `Endpoint.get().invoke()` hits the same Pydantic `kms_key_id` bug: + +```python +import boto3 + +smr_client = boto3.client("sagemaker-runtime") +response = smr_client.invoke_endpoint( + EndpointName=endpoint_name, + Body=payload, + ContentType="text/csv", + Accept="text/csv", +) +result = response["Body"].read().decode("utf-8") +``` + +## Cleanup + +Delete in this order: schedules → endpoint → endpoint config → model. 
+ +```python +sm_client = boto3.client("sagemaker") + +# Delete all monitoring schedules +for schedule_name in [dq_schedule, mq_schedule, bias_schedule, explain_schedule]: + try: + sm_client.delete_monitoring_schedule(MonitoringScheduleName=schedule_name) + except Exception: + pass + +# Delete endpoint, config, model +sm_client.delete_endpoint(EndpointName=endpoint_name) +sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name) +sm_client.delete_model(ModelName=model_name) +``` + +## Known SDK v3 Bugs (sagemaker==3.5.0) + +| Bug | Impact | Workaround | +|-----|--------|------------| +| `ModelBuilder.deploy()` raises `ValidationError` on `kms_key_id` | Endpoint IS deployed; error in return value deserialization | Catch exception and continue | +| `ClarifyBaseliningJob` raises `AttributeError: 'ProcessingJob' has no attribute 'sagemaker_session'` | Clarify job completes; error in wrapping result | Catch error, load constraints from S3 via `Constraints.from_s3_uri()` | +| `max_runtime_in_seconds >= schedule cadence` raises `ValidationException` | Schedule creation fails | Use 1800s for hourly schedules (cadence = 3600s) | +| `Endpoint.get().invoke()` hits Pydantic `kms_key_id` bug | Cannot use SDK invoke | Use `boto3.client('sagemaker-runtime').invoke_endpoint()` | + +## Operational Notes + +- Data capture latency: 2-5 minutes before files appear in S3 +- Monitoring schedule: runs hourly; first execution may take up to 1 hour after creation +- Clarify containers: 10-20 minutes each for bias and explainability baselining +- Ground truth format: JSONL with `groundTruthData.data` (CSV-encoded label) and `eventMetadata.eventId` (must match capture event IDs exactly) +- Cleanup order: delete monitoring schedules first, then endpoint, then config, then model +- `max_runtime_in_seconds` must be strictly less than the schedule cadence (use 1800 for hourly) + +## Troubleshooting + +### No violations detected +- Ensure enough drifted traffic was sent during the 
monitoring window +- Data capture has 2-5 min latency — wait before checking +- Check that the monitoring execution completed: `monitor.list_executions()` + +### Schedule creation fails with ValidationException +- `max_runtime_in_seconds` must be < schedule cadence (3600s for hourly) +- Use 1800s as the max runtime + +### Clarify baseline fails with AttributeError +- Known SDK v3 bug — the job DID complete +- Load constraints directly: `Constraints.from_s3_uri(f'{output_uri}/analysis.json', sagemaker_session=sm_session)` + +### Ground truth not matched +- Event IDs in ground truth JSONL must exactly match event IDs from data capture files +- Parse capture files to extract event IDs before generating ground truth +- Upload ground truth to time-partitioned S3 path: `YYYY/MM/DD/HH/` + +### CloudWatch metrics not appearing +- Metrics are published after monitoring execution completes +- Namespace: `/aws/sagemaker/Endpoints/data-metrics` +- Check dimensions: `Endpoint` and `MonitoringSchedule` diff --git a/sagemaker-ai/steering/sdk-v3-reference.md b/sagemaker-ai/steering/sdk-v3-reference.md new file mode 100644 index 0000000..6cfd933 --- /dev/null +++ b/sagemaker-ai/steering/sdk-v3-reference.md @@ -0,0 +1,281 @@ +# SageMaker Python SDK v3 — API Reference + +## Correct Imports + +```python +# Core session and role +from sagemaker.core.helper.session_helper import Session, get_execution_role + +# Image URI lookup +from sagemaker.core import image_uris + +# Training +from sagemaker.train import ModelTrainer +from sagemaker.core.training.configs import ( + Compute, SourceCode, OutputDataConfig, + CheckpointConfig, StoppingCondition, +) + +# Deployment — ModelBuilder (simple HF / JumpStart models) +from sagemaker.serve import ModelBuilder, InferenceSpec, ModelServer +from sagemaker.serve.builder.schema_builder import SchemaBuilder + +# Deployment — Core API (DJL LMI / vLLM / full control) +from sagemaker.core.resources import Model, EndpointConfig, Endpoint +from 
sagemaker.core.shapes.shapes import ContainerDefinition, ProductionVariant + +# Hyperparameter tuning +from sagemaker.train.tuner import ( + HyperparameterTuner, ContinuousParameter, + IntegerParameter, CategoricalParameter, +) + +# Processing +from sagemaker.core.processing import ScriptProcessor, ProcessingInput, ProcessingOutput + +# Batch transform +from sagemaker.core.transformer import Transformer + +# JumpStart +from sagemaker.core.jumpstart.notebook_utils import list_jumpstart_models +``` + +NEVER import from: `sagemaker.model`, `sagemaker.processing`, `sagemaker.transformer`, +`sagemaker.workflow`, `sagemaker.xgboost`, `sagemaker.sklearn`, `sagemaker.pytorch`. +These are all SDK v2 and removed in v3. + + +## image_uris.retrieve() — Per Framework + +```python +# XGBoost (no py_version or instance_type needed) +image_uri = image_uris.retrieve(framework="xgboost", region=region, version="1.7-1") + +# SKLearn (no py_version or instance_type needed) +image_uri = image_uris.retrieve(framework="sklearn", region=region, version="1.2-1") + +# PyTorch (REQUIRES py_version and instance_type) +image_uri = image_uris.retrieve( + framework="pytorch", region=region, version="2.1", + py_version="py310", instance_type="ml.m5.xlarge", +) + +# AutoGluon (REQUIRES py_version, image_scope, instance_type) +image_uri = image_uris.retrieve( + "autogluon", region=region, version="1.5", + py_version="py312", image_scope="training", + instance_type="ml.m5.2xlarge", +) +``` + +## Training Pattern — Classical ML (XGBoost/SKLearn) + +Complete pattern for training XGBoost or SKLearn models with SDK v3: + +```python +from sagemaker.train import ModelTrainer +from sagemaker.core.helper.session_helper import Session, get_execution_role +from sagemaker.core import image_uris +from sagemaker.core.training.configs import ( + Compute, SourceCode, OutputDataConfig, +) + +session = Session() +region = session.boto_region_name +role = get_execution_role() +bucket = session.default_bucket() + +# 
Use SKLearn container with XGBoost in requirements.txt +sklearn_image = image_uris.retrieve( + framework="sklearn", region=region, version="1.2-1", +) + +trainer = ModelTrainer( + training_image=sklearn_image, + role=role, + source_code=SourceCode( + source_dir="./training", + entry_script="train.py", + requirements="requirements.txt", + ), + compute=Compute( + instance_type="ml.m5.large", + instance_count=1, + volume_size_in_gb=30, + ), + hyperparameters={ + "n-estimators": "100", + "max-depth": "6", + "learning-rate": "0.1", + "target-column": "target", + }, + output_data_config=OutputDataConfig( + s3_output_path=f"s3://{bucket}/output", + ), + base_job_name="xgboost-training", + sagemaker_session=session, +) + +trainer.train( + input_data_config=[{ + "channel_name": "train", + "data_source": { + "s3_data_source": { + "s3_uri": f"s3://{bucket}/data/train/", + "s3_data_type": "S3Prefix", + } + }, + }], + wait=True, + logs=True, +) +``` + +Key rules for classical ML training scripts: +- Save model to `/opt/ml/model/` (use `SM_MODEL_DIR` env var) +- Read data from `/opt/ml/input/data//` (use `SM_CHANNEL_TRAIN` env var) +- All argparse numeric args must have `type=int` or `type=float` +- Include `model_fn()` in train.py for inference compatibility + +## Deployment Patterns + +### Pattern 1: Core API — DJL LMI / vLLM (LLMs, multimodal) + +Use when you need full control over the container and env vars. 
+ +```python +from sagemaker.core.resources import Model, EndpointConfig, Endpoint +from sagemaker.core.shapes.shapes import ContainerDefinition, ProductionVariant +from sagemaker.core.helper.session_helper import get_execution_role + +role_arn = get_execution_role() + +Model.create( + model_name="my-model", + primary_container=ContainerDefinition( + image=f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.36.0-lmi20.0.0-cu128", + environment={ + "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct", + "OPTION_ROLLING_BATCH": "vllm", + "OPTION_DTYPE": "fp16", + "OPTION_MAX_MODEL_LEN": "4096", + "OPTION_TENSOR_PARALLEL_DEGREE": "1", + }, + ), + execution_role_arn=role_arn, +) + +EndpointConfig.create( + endpoint_config_name="my-endpoint", + production_variants=[ProductionVariant( + variant_name="AllTraffic", + model_name="my-model", + initial_instance_count=1, + instance_type="ml.g5.2xlarge", + initial_variant_weight=1.0, + container_startup_health_check_timeout_in_seconds=900, + model_data_download_timeout_in_seconds=900, + )], +) + +Endpoint.create(endpoint_name="my-endpoint", endpoint_config_name="my-endpoint") + +# Wait +import boto3 +sm = boto3.client("sagemaker") +sm.get_waiter("endpoint_in_service").wait( + EndpointName="my-endpoint", WaiterConfig={"Delay": 30, "MaxAttempts": 60} +) +``` + +### Pattern 2: ModelBuilder — JumpStart / Simple HF Models + +```python +from sagemaker.serve import ModelBuilder +from sagemaker.serve.builder.schema_builder import SchemaBuilder + +builder = ModelBuilder( + model="meta-textgeneration-llama-3-1-8b", # JumpStart model ID + schema_builder=SchemaBuilder( + {"inputs": "Hello", "parameters": {"max_new_tokens": 32}}, + [{"generated_text": "sample"}], + ), + instance_type="ml.g5.2xlarge", +) +builder.build() +endpoint = builder.deploy( + initial_instance_count=1, + instance_type="ml.g5.2xlarge", + endpoint_name="my-endpoint", +) +``` + +### Pattern 3: Classical ML Inference Hooks + +The four hooks called in sequence for 
every prediction: + +``` +Request → input_fn → predict_fn → output_fn → Response + ↑ + model_fn (called once at startup) +``` + +Package as `code/inference.py` inside `model.tar.gz`: + +```python +import os, json, joblib, numpy as np + +def model_fn(model_dir): + return joblib.load(os.path.join(model_dir, "model.joblib")) + +def input_fn(body, content_type): + if content_type == "application/json": + return np.array(json.loads(body)["instances"]) + raise ValueError(f"Unsupported: {content_type}") + +def predict_fn(data, model): + return model.predict(data) + +def output_fn(prediction, accept): + return json.dumps({"predictions": prediction.tolist()}), "application/json" +``` + +## Invocation + +### Via boto3 (recommended for DJL LMI endpoints) + +```python +import boto3, json +from botocore.config import Config + +runtime = boto3.client("sagemaker-runtime", config=Config(read_timeout=600)) +response = runtime.invoke_endpoint( + EndpointName="my-endpoint", + ContentType="application/json", + Body=json.dumps({"inputs": "Hello", "parameters": {"max_new_tokens": 128}}), +) +result = json.loads(response["Body"].read()) +``` + +### Via SDK v3 Endpoint.invoke + +```python +from sagemaker.core.resources import Endpoint +endpoint = Endpoint.get("my-endpoint") +response = endpoint.invoke( + body=json.dumps(payload), + content_type="application/json", + accept="application/json", +) +``` + +## Cleanup + +```python +# Delete endpoint + config + model +import boto3 +sm = boto3.client("sagemaker") +sm.delete_endpoint(EndpointName="my-endpoint") +sm.delete_endpoint_config(EndpointConfigName="my-endpoint") +sm.delete_model(ModelName="my-model") +``` diff --git a/sagemaker-ai/steering/training-jobs.md b/sagemaker-ai/steering/training-jobs.md new file mode 100644 index 0000000..172cb7a --- /dev/null +++ b/sagemaker-ai/steering/training-jobs.md @@ -0,0 +1,389 @@ +# Training on SageMaker AI + +## Reference Repository + +For production-ready fine-tuning recipes (QLoRA, Spectrum, Full 
FT, DPO, GRPO/RLVR) with pre-configured YAML configs for Llama, Qwen, DeepSeek, Phi, Gemma, and GPT-OSS models, see: +`https://github.com/aws-samples/amazon-sagemaker-generativeai/tree/main/0_model_customization_recipes` + +Includes one-command training pipeline, multi-node/multi-GPU support, Flash Attention, Liger Kernel, 4-bit quantization, and built-in evaluation with vLLM. + +## Three Approaches to Model Customization + +| Approach | Best For | Complexity | Cost Model | +|----------|----------|------------|------------| +| **Serverless Model Customization** | Quick fine-tuning, no infra management | Low | Pay-per-token | +| **SageMaker Training Jobs** | Flexible training with cost control | Medium | Pay-per-hour | +| **SageMaker HyperPod** | Large-scale, long-running FM training | High | Persistent cluster | + +### Decision Guide + +| Scenario | Recommended | +|----------|-------------| +| Fine-tune < 7B, quick iteration | Serverless Model Customization | +| Fine-tune 7B-70B, cost-sensitive | Training Jobs (with Spot/Trainium) | +| Fine-tune 70B+ | Training Jobs or HyperPod | +| Pre-train foundation model | HyperPod | +| Multi-week continuous training | HyperPod | +| Teams without MLOps expertise | Serverless Model Customization | + +## Serverless Model Customization + +Fully managed capability (launched December 2025) — fine-tune foundation models without infrastructure management. The service automatically provisions and optimizes GPU instances. 
+ +### Supported Models +- Amazon Nova: Micro, Lite, Pro, Lite 2.0 +- Meta Llama: 3.1 8B, 3.1 70B, 2 7B/13B +- Qwen, DeepSeek, GPT-OSS (region-dependent) + +### Supported Techniques +- SFT (Supervised Fine-Tuning) +- DPO (Direct Preference Optimization) +- RLVR (RL with Verifiable Rewards) +- RLAIF (RL from AI Feedback) +- LoRA/QLoRA +- Full-Rank fine-tuning +- CPT (Continued Pre-Training) + +### Key Characteristics +- Data format: JSONL, CSV, Parquet +- Dataset size: 1,000-10,000 rows (model-dependent) +- Deployment: SageMaker Inference or Amazon Bedrock +- Pricing: Pay-per-token +- Regions: us-east-1, us-west-2, eu-west-1, ap-northeast-1 + +### Limitations +- No SSH access to training instances +- TensorBoard only (no MLFlow/WandB) +- No model merging or multi-LoRA adapter support +- Limited model selection (supported models only) + +--- + +## SageMaker Training Jobs + +### When to Use Training Jobs vs HyperPod + +| Criteria | SageMaker Training Jobs | SageMaker HyperPod | +|----------|------------------------|-------------------| +| Best for | Single experiments, LoRA/QLoRA, models ≤70B | Large-scale, full fine-tuning, multi-node, 70B+ | +| Infrastructure | Ephemeral (spins up/down per job) | Persistent cluster (Slurm/EKS) | +| Cost model | Pay per job duration | Pay for cluster uptime | +| Setup complexity | Low (SDK + script) | Higher (cluster config, EKS, operators) | +| Iteration speed | Fast (no cluster to manage) | Slower setup, faster for many sequential jobs | + +## Instance Selection for Training + +### GPU Memory per Technique + +| Technique | Memory per Billion Params | Notes | +|-----------|--------------------------|-------| +| Full Fine-Tuning (BF16) | ~16-18 GB/B | Model + optimizer + gradients | +| LoRA (FP16) | ~4-6 GB/B | Frozen base + adapters | +| QLoRA (4-bit) | ~1-1.5 GB/B | Quantized base + adapters | + +### Instance Matrix for QLoRA (Most Common) + +| Model Size | Recommended Instance | GPUs | Total VRAM | 
+|------------|---------------------|------|------------| +| 7-8B | ml.g5.2xlarge | 1x A10G | 24 GB | +| 13B | ml.g5.4xlarge | 1x A10G | 24 GB | +| 34B | ml.g5.12xlarge | 4x A10G | 96 GB | +| 70B | ml.g5.48xlarge or ml.p4d.24xlarge | 8x A10G / 8x A100 | 192 / 320 GB | + +Note: g5.2xlarge → g5.4xlarge adds CPU/RAM only, same 1x A10G 24GB. For 13B QLoRA with seq_len > 1024, consider ml.g5.12xlarge (4x A10G, 96GB). + +### Trainium Instances + +| Instance | Chips | HBM | Best For | +|----------|-------|-----|----------| +| ml.trn1.2xlarge | 1 chip (2 cores) | 32 GB | Testing, small models | +| ml.trn1.32xlarge | 16 chips (32 cores) | 512 GB | Production training | + +Trainium has limited architecture support. Verified: `llama`, `qwen3`, `granite`. Always check `model_type` in config.json and the Neuron SDK compatibility matrix for the latest supported list. Variants like `qwen3_vl`, `mllama` are NOT supported. + +## SDK v3 Training Job Launcher Pattern + +```python +from sagemaker.train import ModelTrainer +from sagemaker.core.helper.session_helper import Session, get_execution_role +from sagemaker.core import image_uris +from sagemaker.core.training.configs import ( + Compute, SourceCode, OutputDataConfig, + CheckpointConfig, StoppingCondition, +) + +sess = Session() +region = sess.boto_region_name +role = get_execution_role() +bucket = sess.default_bucket() + +# Get training container image +# ALWAYS check https://aws.github.io/deep-learning-containers/reference/available_images/ +# for the latest version. The SDK may auto-select an older image. 
+training_image = image_uris.retrieve( + framework="pytorch", + region=region, + version="2.5.1", + py_version="py311", + image_scope="training", + instance_type="ml.g5.2xlarge", +) + +model_trainer = ModelTrainer( + training_image=training_image, + source_code=SourceCode( + source_dir="./scripts", + command="pip install -r requirements.txt && python train_lora.py", + ), + base_job_name="my-training-job", + compute=Compute( + instance_type="ml.g5.2xlarge", + instance_count=1, + volume_size_in_gb=300, + keep_alive_period_in_seconds=1800, # Warm pool for faster restarts + ), + role=role, + environment={ + "MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct", + "NUM_EPOCHS": "3", + "BATCH_SIZE": "2", + "GRADIENT_ACCUMULATION": "4", + "LEARNING_RATE": "2e-4", + "MAX_SEQ_LENGTH": "2048", + "LORA_R": "16", + "LORA_ALPHA": "32", + "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train", + }, + input_data_config=[{ + "channel_name": "train", + "data_source": { + "s3_data_source": { + "s3_uri": f"s3://{bucket}/datasets/train/", + "s3_data_type": "S3Prefix", + } + }, + }], + output_data_config=OutputDataConfig( + s3_output_path=f"s3://{bucket}/training-outputs", + ), + checkpoint_config=CheckpointConfig( + s3_uri=f"s3://{bucket}/checkpoints", + local_path="/opt/ml/checkpoints", + ), + stopping_condition=StoppingCondition( + max_runtime_in_seconds=86400, # 24 hours + ), +) + +model_trainer.train(wait=True, logs=True) +``` + +## Training Script Patterns + +### QLoRA Training Script (GPU) + +Key components for `scripts/train_lora.py`: + +```python +import os, torch +from datasets import load_dataset +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig +from peft import LoraConfig, prepare_model_for_kbit_training +from trl import SFTTrainer, SFTConfig + +# Config from env vars (set in ModelTrainer.environment) +model_id = os.environ.get("MODEL_ID") +train_data = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train") +output_dir = 
os.environ.get("SM_MODEL_DIR", "/opt/ml/model") + +# 4-bit quantization +bnb_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", + bnb_4bit_compute_dtype=torch.bfloat16, + bnb_4bit_use_double_quant=True, +) + +model = AutoModelForCausalLM.from_pretrained( + model_id, quantization_config=bnb_config, + device_map="auto", torch_dtype=torch.bfloat16, +) +model = prepare_model_for_kbit_training(model) + +# LoRA config — SFTTrainer applies this automatically +peft_config = LoraConfig( + r=16, lora_alpha=32, lora_dropout=0.05, + target_modules=["q_proj", "k_proj", "v_proj", "o_proj", + "gate_proj", "up_proj", "down_proj"], + bias="none", task_type="CAUSAL_LM", +) + +# Dataset: expects {"messages": [...]} format +dataset = load_dataset("json", data_files={"train": f"{train_data}/train.jsonl"}) + +tokenizer = AutoTokenizer.from_pretrained(model_id) +if tokenizer.pad_token is None: + tokenizer.pad_token = tokenizer.eos_token + +def formatting_func(example): + return tokenizer.apply_chat_template( + example["messages"], tokenize=False, add_generation_prompt=False + ) + +sft_config = SFTConfig( + output_dir=output_dir, + num_train_epochs=3, + per_device_train_batch_size=2, + gradient_accumulation_steps=4, + learning_rate=2e-4, + lr_scheduler_type="cosine", + warmup_ratio=0.1, + bf16=True, + gradient_checkpointing=True, + gradient_checkpointing_kwargs={"use_reentrant": False}, + logging_steps=10, + save_strategy="epoch", + save_total_limit=2, + optim="paged_adamw_8bit", +) + +trainer = SFTTrainer( + model=model, args=sft_config, + train_dataset=dataset["train"], + processing_class=tokenizer, # renamed from tokenizer in trl 0.12+ + formatting_func=formatting_func, + peft_config=peft_config, +) +trainer.train() +trainer.save_model() +tokenizer.save_pretrained(output_dir) +``` + +### Trainium LoRA Training Script + +Key differences from GPU: +- Uses `optimum-neuron` instead of `bitsandbytes` +- BF16 training (no 4-bit quantization on Trainium) +- 
`attn_implementation="eager"` (no flash attention) +- No `device_map="auto"` — Neuron handles placement + +```python +from optimum.neuron import NeuronSFTConfig, NeuronSFTTrainer +from optimum.neuron.models.training.config import TrainingNeuronConfig +from peft import LoraConfig + +trn_config = TrainingNeuronConfig( + tensor_parallel_size=tp_degree, + sequence_parallel_enabled=False, +) + +# NeuronSFTTrainer handles model loading and Neuron compilation +trainer = NeuronSFTTrainer( + model=model_id, + args=NeuronSFTConfig( + output_dir=output_dir, + bf16=True, + tensor_parallel_size=tp_degree, + num_train_epochs=3, + per_device_train_batch_size=1, + gradient_accumulation_steps=16, + learning_rate=2e-4, + lr_scheduler_type="cosine", + warmup_ratio=0.1, + logging_steps=10, + save_strategy="epoch", + save_total_limit=2, + gradient_checkpointing=True, + ), + train_dataset=dataset["train"], + processing_class=tokenizer, + formatting_func=formatting_func, + peft_config=peft_config, +) +``` + +Container for Trainium: +``` +763104351884.dkr.ecr..amazonaws.com/pytorch-training-neuronx:2.8.0-neuronx-py311-sdk2.26.1-ubuntu22.04 +``` + +Launch command for Trainium: +```python +SourceCode( + source_dir="./scripts", + command="pip install -r requirements_trainium.txt && torchrun --nproc_per_node=2 train_lora_trainium.py", +) +``` + +## AWS Recipe-Based Training + +For supported models, use the official AWS SFT recipes from: +`https://github.com/aws-samples/amazon-sagemaker-generativeai/tree/main/0_model_customization_recipes` + +These provide pre-configured YAML recipes for QLoRA, Spectrum, and full fine-tuning with Accelerate + DeepSpeed. Supported models include Llama 3.x, Qwen 2.5/3, DeepSeek R1, Phi-3/4, Gemma 3, and GPT-OSS 20B/120B. 
+ +### Quick Instance Reference (from recipes repo) + +| Model Size | QLoRA | Spectrum | Full FT | +|------------|-------|----------|---------| +| 1-4B | ml.g5.2xlarge | ml.g6e.2xlarge | ml.g6e.2xlarge | +| 7-14B | ml.g6e.2xlarge | ml.g6e.2xlarge / g6e.4xlarge | ml.g6e.2xlarge / g6e.12xlarge | +| 17-32B | ml.p4de.24xlarge | ml.p4de.24xlarge | ml.p4de.24xlarge / p5e.48xlarge | +| 70-120B | ml.p5e.48xlarge | ml.p5e.48xlarge | ml.p5e.48xlarge | +| 600B+ | ml.p5en.48xlarge | ml.p5en.48xlarge | ml.p5en.48xlarge | + +Launch pattern: +```python +SourceCode( + source_dir="./sagemaker_code", + command="./sm_accelerate_train.sh --config recipes/model_recipe.yaml", +) +``` + +## Data Format + +SageMaker training scripts expect data in the SageMaker input channel: + +``` +/opt/ml/input/data/train/train.jsonl +/opt/ml/input/data/train/val.jsonl (optional) +``` + +Standard chat format (recommended): +```json +{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]} +``` + +## HyperPod + +For HyperPod cluster setup, Training Operator, Task Governance, and resiliency features, load the dedicated steering file: + +``` +Call action "readSteering" with powerName="sagemaker-ai", steeringFile="hyperpod.md" +``` + +For deploying trained models as inference endpoints on HyperPod (JumpStart, custom models from S3/FSx, autoscaling): + +``` +Call action "readSteering" with powerName="sagemaker-ai", steeringFile="hyperpod-inference.md" +``` + +## Troubleshooting + +### CUDA OOM during training +- Reduce `BATCH_SIZE` to 1 +- Increase `GRADIENT_ACCUMULATION` to maintain effective batch size +- Reduce `MAX_SEQ_LENGTH` +- Use QLoRA instead of LoRA/full fine-tuning +- Use a larger instance type + +### Neuron compilation timeout +- First run on Trainium compiles the model graph (10-20 min) +- Subsequent runs use cached compilation +- Set `NEURON_CC_FLAGS="--model-type transformer"` for faster compilation + +### Training job 
fails immediately +- Check the CloudWatch log group `/aws/sagemaker/TrainingJobs` for the job's log streams +- Verify the container image exists in the job's region +- Verify the IAM execution role has S3 and ECR access +- Check the instance quota: `aws service-quotas list-service-quotas --service-code sagemaker`
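
As a pre-flight check for the chat format described under Data Format, a small validator along these lines can catch malformed records before a job is launched. This is a minimal sketch using only the Python standard library. The names `validate_chat_record`, `validate_jsonl_lines`, and `ALLOWED_ROLES` are illustrative assumptions, not part of the SageMaker SDK or the recipes repo, and the sketch assumes text-only `content` strings.

```python
import json

# Roles taken from the chat-format example; an assumption, extend as needed.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_record(record):
    """Return a list of problems found in one parsed JSONL record (empty if none)."""
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} has non-string content")
    return problems


def validate_jsonl_lines(lines):
    """Validate an iterable of JSONL lines; return {line_number: [problems]} for bad lines."""
    errors = {}
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped rather than reported
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors[lineno] = [f"invalid JSON: {exc.msg}"]
            continue
        problems = validate_chat_record(record)
        if problems:
            errors[lineno] = problems
    return errors
```

To use it, open the local copy of `train.jsonl` and pass the file object to `validate_jsonl_lines`. An empty dict means every record parsed and matched the expected shape.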