Docker Model Runner

Docker Model Runner (DMR) makes it easy to manage, run, and deploy AI models using Docker. Designed for developers, Docker Model Runner streamlines the process of pulling, running, and serving large language models (LLMs) and other AI models directly from Docker Hub or any OCI-compliant registry.

Overview

This package supports the Docker Model Runner in Docker Desktop and Docker Engine.

Installation

Docker Desktop (macOS and Windows)

For macOS and Windows, install Docker Desktop:

https://docs.docker.com/desktop/

Docker Model Runner is included in Docker Desktop.

Docker Engine (Linux)

For Linux, install Docker Engine from the official Docker repository:

curl -fsSL https://get.docker.com | sudo bash
sudo usermod -aG docker "$USER" # Add your user to the docker group; log out and back in for this to take effect

Docker Model Runner is included in Docker Engine when installed from Docker's official repositories.

Verifying Your Installation

To verify that Docker Model Runner is available:

# Check if the Docker CLI plugin is available
docker model --help

# Check Docker version
docker version

# Check Docker Model Runner version
docker model version

# Run a model to test the full setup
docker model run ai/gemma3 "Hello"

If docker model is not available, see the troubleshooting section below.
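The availability check above can also be scripted, for example in a setup script. The following is a minimal Python sketch, assuming that `docker model version` exits with a non-zero status when the plugin is missing. The function name `docker_model_available` and its injectable `run` parameter are hypothetical and exist only for illustration.

```python
import subprocess


def docker_model_available(run=subprocess.run):
    """Return True if `docker model version` runs and exits with status 0.

    `run` is injectable so the check can be exercised without Docker installed.
    Assumption: a missing plugin makes the command exit with a non-zero status.
    """
    try:
        result = run(
            ["docker", "model", "version"],
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # The docker binary itself is not on PATH.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if docker_model_available():
        print("Docker Model Runner is available")
    else:
        print("docker model is not available; see the troubleshooting section")
```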

Troubleshooting: Docker Installation Source

If you encounter errors like Package 'docker-model-plugin' has no installation candidate or docker model command is not found:

Check your Docker installation source:
```
# Check Docker version
docker version

# Check Docker Model Runner version
docker model version
```
Look for the source in the output. If it shows a package from your distro, you'll need to reinstall from Docker's official repositories.

Remove the distro version and install from Docker's official repository:

# Remove distro version (Ubuntu/Debian example)
sudo apt-get purge docker docker.io containerd runc

# Install from Docker's official repository
curl -fsSL https://get.docker.com | sudo bash

# Verify Docker Model Runner is available
docker model --help

For NVIDIA DGX systems: If Docker came pre-installed, verify it's from Docker's official repositories. If not, follow the reinstallation steps above.

For more details refer to:

https://docs.docker.com/ai/model-runner/get-started/

Prerequisites

Before building from source, ensure you have the following installed:

Go 1.25+ - Required for building both model-runner and model-cli
Git - For cloning repositories
Make - For using the provided Makefiles
Docker (optional) - For building and running containerized versions
CGO dependencies - Required for model-runner's GPU support:
- On macOS: Xcode Command Line Tools (xcode-select --install)
- On Linux: gcc/g++ and development headers
- On Windows: MinGW-w64 or Visual Studio Build Tools

Building the Complete Stack

After cloning, a single make builds the server, the CLI plugin, and a dmr convenience wrapper:

make

dmr starts the server on a free port, waits for it to be ready, runs your CLI command, then shuts the server down:

./dmr run ai/smollm2 "Hello, how are you?"
./dmr ls
./dmr run qwen3:0.6B-Q4_0 "tell me today's news"

These components can also be built, run, and tested separately using the Makefile.

Testing the Complete Stack End-to-End

Note: We use port 13434 in these examples to avoid conflicts with Docker Desktop's built-in Model Runner, which typically runs on port 12434.

Option 1: Manual two-terminal setup

Start model-runner in one terminal:

MODEL_RUNNER_PORT=13434 ./model-runner

Use model-cli in another terminal:

# List available models
MODEL_RUNNER_HOST=http://localhost:13434 ./cmd/cli/model-cli list

# Pull and run a model
MODEL_RUNNER_HOST=http://localhost:13434 ./cmd/cli/model-cli run ai/smollm2 "Hello, how are you?"
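A script that talks to the same server could honor the same MODEL_RUNNER_HOST variable. The sketch below is an illustration only, not how model-cli resolves its host: the helper names `runner_base_url` and `endpoint` are hypothetical, and falling back to port 12434 when the variable is unset is an assumption based on the note above about Docker Desktop's built-in Model Runner.

```python
import os

# Assumption: fall back to Docker Desktop's typical port when the variable is unset.
DEFAULT_HOST = "http://localhost:12434"


def runner_base_url(env=None):
    """Return the Model Runner base URL, honoring MODEL_RUNNER_HOST if it is set."""
    env = os.environ if env is None else env
    host = env.get("MODEL_RUNNER_HOST", "").strip() or DEFAULT_HOST
    # Drop any trailing slash so paths can be appended predictably.
    return host.rstrip("/")


def endpoint(path, env=None):
    """Join an API path such as "/models" onto the base URL."""
    return runner_base_url(env) + "/" + path.lstrip("/")


if __name__ == "__main__":
    print(endpoint("/models"))
```

Running it as `MODEL_RUNNER_HOST=http://localhost:13434 python script.py` would target the manually started server rather than the built-in one.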

Option 2: Using Docker

Build and run model-runner in Docker:

cd model-runner
make docker-build
make docker-run PORT=13434 MODELS_PATH=/path/to/models

Connect with model-cli:

cd cmd/cli
MODEL_RUNNER_HOST=http://localhost:13434 ./model-cli list

Additional Resources

Using the Makefile

This project includes a Makefile to simplify common development tasks. Docker targets require Docker Desktop >= 4.41.0. Run make help for a full list, but the key targets are:

build - Build the Go application
build-cli - Build the CLI (docker-model plugin)
install-cli - Build and install the CLI as a Docker plugin
docs - Generate CLI documentation
run - Run the application locally
clean - Clean build artifacts
test - Run tests
validate-all - Run all CI validations locally (lint, test, shellcheck, go mod tidy)
lint - Run Go linting with golangci-lint
validate - Run shellcheck validation on shell scripts
integration-tests - Run integration tests (requires Docker)
docker-build - Build the Docker image for current platform
docker-run - Run the application in a Docker container with TCP port access and mounted model storage
help - Show all available targets and configuration options

Running in Docker

The application can be run in Docker with the following features enabled by default:

TCP port access (default port 8080)
Persistent model storage in a local models directory

# Run with default settings
make docker-run

# Customize port and model storage location
make docker-run PORT=3000 MODELS_PATH=/path/to/your/models

This will:

Create a models directory in your current working directory (or use the specified path)
Mount this directory into the container
Start the service on port 8080 (or the specified port)
Store all downloaded models in the host's models directory, so they persist between container runs

llama.cpp integration

The Docker image includes the llama.cpp server binary from the docker/docker-model-backend-llamacpp image. You can specify the version of the image to use by setting the LLAMA_SERVER_VERSION variable. Additionally, you can configure the target OS, architecture, and acceleration type:

# Build with a specific llama.cpp server version
make docker-build LLAMA_SERVER_VERSION=v0.0.4

# Specify all parameters
make docker-build LLAMA_SERVER_VERSION=v0.0.4 LLAMA_SERVER_VARIANT=cpu

Default values:

LLAMA_SERVER_VERSION: latest
LLAMA_SERVER_VARIANT: cpu

Available variants:

cpu: CPU-optimized version
cuda: CUDA-accelerated version for NVIDIA GPUs
rocm: ROCm-accelerated version for AMD GPUs
musa: MUSA-accelerated version for MTHREADS GPUs
cann: CANN-accelerated version for Ascend NPUs

The binary path in the image follows this pattern: /com.docker.llama-server.native.linux.${LLAMA_SERVER_VARIANT}.${TARGETARCH}
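To make the path pattern concrete, here is a small Python sketch that expands it for a given variant and architecture. The function name `llama_server_path` is hypothetical. The variant list is taken from the section above, and the architecture values `amd64` and `arm64` are assumed to be what Docker sets for `TARGETARCH`.

```python
# Variants listed in this README.
VARIANTS = ("cpu", "cuda", "rocm", "musa", "cann")


def llama_server_path(variant="cpu", target_arch="amd64"):
    """Expand the documented binary path pattern for one variant and architecture."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown LLAMA_SERVER_VARIANT: {variant!r}")
    return f"/com.docker.llama-server.native.linux.{variant}.{target_arch}"


if __name__ == "__main__":
    for name in VARIANTS:
        print(llama_server_path(name, "amd64"))
```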

vLLM integration

The Docker image also supports vLLM as an alternative inference backend.

Building the vLLM variant

To build a Docker image with vLLM support:

# Build with default settings (vLLM 0.12.0)
make docker-build DOCKER_TARGET=final-vllm BASE_IMAGE=nvidia/cuda:13.0.2-runtime-ubuntu24.04 LLAMA_SERVER_VARIANT=cuda

# Build for specific architecture
docker buildx build \
  --platform linux/amd64 \
  --target final-vllm \
  --build-arg BASE_IMAGE=nvidia/cuda:13.0.2-runtime-ubuntu24.04 \
  --build-arg LLAMA_SERVER_VARIANT=cuda \
  --build-arg VLLM_VERSION=0.12.0 \
  -t docker/model-runner:vllm .

Build Arguments

The vLLM variant supports the following build arguments:

VLLM_VERSION: The vLLM version to install (default: 0.12.0)
VLLM_CUDA_VERSION: The CUDA version suffix for the wheel (default: cu130)
VLLM_PYTHON_TAG: The Python compatibility tag (default: cp38-abi3, compatible with Python 3.8+)

Multi-Architecture Support

The vLLM variant supports both x86_64 (amd64) and aarch64 (arm64) architectures. The build process automatically selects the appropriate prebuilt wheel:

linux/amd64: Uses manylinux1_x86_64 wheels
linux/arm64: Uses manylinux2014_aarch64 wheels
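The selection can be pictured as a lookup from the build platform to a wheel platform tag. The Python sketch below is illustrative only and is not taken from the Dockerfile. In particular, the file name layout in `wheel_filename` is an assumption based on standard wheel naming, with the CUDA suffix treated as a local version label. Check the actual release assets before relying on it.

```python
# Platform-to-tag mapping as stated in this README.
PLATFORM_TAGS = {
    "linux/amd64": "manylinux1_x86_64",
    "linux/arm64": "manylinux2014_aarch64",
}


def wheel_platform_tag(platform):
    """Return the wheel platform tag for a Docker build platform."""
    try:
        return PLATFORM_TAGS[platform]
    except KeyError:
        raise ValueError(f"unsupported platform: {platform!r}") from None


def wheel_filename(platform, version="0.12.0", cuda="cu130", python_tag="cp38-abi3"):
    """Compose a wheel file name from the build arguments.

    Assumption: the layout is vllm-<version>+<cuda>-<python tag>-<platform tag>.whl.
    """
    return f"vllm-{version}+{cuda}-{python_tag}-{wheel_platform_tag(platform)}.whl"


if __name__ == "__main__":
    for plat in PLATFORM_TAGS:
        print(plat, "->", wheel_filename(plat))
```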

To build for multiple architectures:

docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --target final-vllm \
  --build-arg BASE_IMAGE=nvidia/cuda:12.9.0-runtime-ubuntu24.04 \
  --build-arg LLAMA_SERVER_VARIANT=cuda \
  -t docker/model-runner:vllm .

Updating to a New vLLM Version

To update to a new vLLM version:

docker buildx build \
  --target final-vllm \
  --build-arg VLLM_VERSION=0.11.1 \
  -t docker/model-runner:vllm-0.11.1 .

The vLLM wheels are sourced from the official vLLM GitHub Releases at https://github.com/vllm-project/vllm/releases, which provides prebuilt wheels for each release version.

API Examples

The Model Runner exposes a REST API over a TCP port. You can interact with it using curl or any other HTTP client.

Using the API

When running with make docker-run, you can use regular HTTP requests:

# List all available models
curl http://localhost:8080/models

# Create a new model
curl http://localhost:8080/models/create -X POST -d '{"from": "ai/smollm2"}'

# Get information about a specific model
curl http://localhost:8080/models/ai/smollm2

# Chat with a model
curl http://localhost:8080/engines/llama.cpp/v1/chat/completions -X POST -d '{
  "model": "ai/smollm2",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"}
  ]
}'

# Delete a model
curl http://localhost:8080/models/ai/smollm2 -X DELETE

# Get metrics
curl http://localhost:8080/metrics

The chat completions request returns a JSON response containing the model's reply:

{
  "id": "chat-12345",
  "object": "chat.completion",
  "created": 1682456789,
  "model": "ai/smollm2",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm doing well, thank you for asking! How can I assist you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 16,
    "total_tokens": 40
  }
}
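The chat request and response shown above can be handled from Python with only the standard library. This is a minimal sketch under stated assumptions: the function names are hypothetical, the `Content-Type: application/json` header is an addition that the curl examples do not send, and the base URL assumes the default port 8080 from make docker-run.

```python
import json
import urllib.request

CHAT_PATH = "/engines/llama.cpp/v1/chat/completions"


def build_chat_request(base_url, model, user_text,
                       system_text="You are a helpful assistant."):
    """Build a POST request carrying the same JSON body as the curl example."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + CHAT_PATH,
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: declaring the body as JSON is accepted by the server.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reply_text(response_body):
    """Extract the assistant's message from a chat completion response."""
    doc = json.loads(response_body)
    return doc["choices"][0]["message"]["content"]


def chat(base_url, model, user_text):
    """Send one prompt and return the reply (requires a running server)."""
    request = build_chat_request(base_url, model, user_text)
    with urllib.request.urlopen(request) as response:
        return reply_text(response.read())


if __name__ == "__main__":
    print(chat("http://localhost:8080", "ai/smollm2", "Hello, how are you?"))
```

Keeping request construction and response parsing separate from the network call makes both easy to check without a server.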

Features

The features below apply when running NVIDIA NIM containers with docker model run:

Automatic GPU Detection: Automatically configures NVIDIA GPU support if available
Persistent Caching: Models are cached in ~/.cache/nim (or $LOCAL_NIM_CACHE if set)
Interactive Chat: Supports both single prompt and interactive chat modes
Container Reuse: Existing NIM containers are reused across runs

Example Usage

Single prompt:

docker model run nvcr.io/nim/google/gemma-3-1b-it:latest "Explain quantum computing"

Interactive chat:

docker model run nvcr.io/nim/google/gemma-3-1b-it:latest
> Tell me a joke
...
> /bye

Configuration

NGC_API_KEY: Set this environment variable to authenticate with NVIDIA's services
LOCAL_NIM_CACHE: Override the default cache location (default: ~/.cache/nim)
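The cache location rule can be expressed in a few lines. The Python sketch below mirrors the behavior described above and is not the project's implementation. The function name `nim_cache_dir` is hypothetical, and treating an empty LOCAL_NIM_CACHE as unset is an assumption.

```python
import os


def nim_cache_dir(env=None):
    """Return the NIM cache directory: LOCAL_NIM_CACHE if set, else ~/.cache/nim."""
    env = os.environ if env is None else env
    override = env.get("LOCAL_NIM_CACHE")
    if override:  # Assumption: an empty value is treated as unset.
        return override
    return os.path.join(os.path.expanduser("~"), ".cache", "nim")


if __name__ == "__main__":
    print(nim_cache_dir())
```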

Technical Details

NIM containers:

Run on port 8000 (localhost only)
Use 16GB shared memory by default
Mount ~/.cache/nim for model caching
Support NVIDIA GPU acceleration when available

Metrics

The Model Runner exposes the llama.cpp server's metrics at its own /metrics endpoint. This allows you to monitor model performance, request statistics, and resource usage.

Accessing Metrics

# Get metrics in Prometheus format
curl http://localhost:8080/metrics

Configuration

Enable metrics: On by default; no configuration is needed
Disable metrics: Set DISABLE_METRICS=1 environment variable
Monitoring integration: Add the endpoint to your Prometheus configuration

Check METRICS.md for more details.
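For a quick look at the endpoint without a full Prometheus setup, the text output can be parsed directly. The following is a simplified Python sketch of a parser for the Prometheus text format. It skips comment lines and ignores optional timestamps. The metric names in the sample are made up for illustration and are not the names llama.cpp emits.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into {name_with_labels: value}."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # Skip blanks and # HELP / # TYPE comments.
        close = line.rfind("}")
        if close != -1:
            # Labels are present: keep everything up to the closing brace as the key.
            name, rest = line[:close + 1], line[close + 1:].split()
        else:
            parts = line.split()
            name, rest = parts[0], parts[1:]
        if not rest:
            continue
        try:
            samples[name] = float(rest[0])  # Any trailing timestamp is ignored.
        except ValueError:
            continue
    return samples


if __name__ == "__main__":
    # Hypothetical sample; real metric names will differ.
    sample = """\
# HELP example_requests_total Total requests.
# TYPE example_requests_total counter
example_requests_total 42
example_latency_seconds{quantile="0.5"} 0.25 1700000000
"""
    for key, value in parse_metrics(sample).items():
        print(key, value)
```

To use it against a live server, pass it the body returned by a GET request to the /metrics endpoint.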

Kubernetes

Experimental support for running in Kubernetes is available in the form of a Helm chart and static YAML.

If you are interested in a specific Kubernetes use-case, please start a discussion on the issue tracker.

dmrlet: Container Orchestrator for AI Inference

dmrlet is a purpose-built container orchestrator for AI inference workloads. Unlike Kubernetes, it focuses exclusively on running stateless inference containers with zero configuration overhead. Multi-GPU mapping "just works" without YAML, device plugins, or node selectors.

Key Features

| Feature | Kubernetes | dmrlet |
|---|---|---|
| Multi-GPU setup | Device plugins + node selectors + resource limits YAML | `dmrlet serve llama3 --gpus all` |
| Config overhead | 50+ lines of YAML minimum | Zero YAML, CLI-only |
| Time to first inference | Minutes (pod scheduling, image pull) | Seconds (model already local) |
| Model management | External (mount PVCs, manage yourself) | Integrated with Docker Model Runner store |

Building dmrlet

# Build the dmrlet binary
go build -o dmrlet ./cmd/dmrlet

# Verify it works
./dmrlet --help

Usage

Start the daemon:

# Start in foreground
dmrlet daemon

# With custom socket path
dmrlet daemon --socket /tmp/dmrlet.sock

Serve a model:

# Auto-detect backend and GPUs
dmrlet serve llama3.2

# Specify backend
dmrlet serve llama3.2 --backend vllm

# Specify GPU allocation
dmrlet serve llama3.2 --gpus 0,1
dmrlet serve llama3.2 --gpus all

# Multiple replicas
dmrlet serve llama3.2 --replicas 2

# Backend-specific options
dmrlet serve llama3.2 --ctx-size 4096      # llama.cpp context size
dmrlet serve llama3.2 --gpu-memory 0.8     # vLLM GPU memory utilization

List running models:

dmrlet ps
# MODEL          BACKEND    REPLICAS   GPUS      ENDPOINTS              STATUS
# llama3.2       llama.cpp  1          [0,1,2,3] localhost:30000        healthy

View logs:

dmrlet logs llama3.2        # Last 100 lines
dmrlet logs llama3.2 -f     # Follow logs

Scale replicas:

dmrlet scale llama3.2 4     # Scale to 4 replicas

Stop a model:

dmrlet stop llama3.2
dmrlet stop --all           # Stop all models

Check status:

dmrlet status
# DAEMON: running
# SOCKET: /var/run/dmrlet.sock
#
# GPUs:
#   GPU 0:  NVIDIA A100 80GB  81920MB  (in use: llama3.2)
#   GPU 1:  NVIDIA A100 80GB  81920MB  (available)
#
# MODELS: 1 running

Supported Backends

llama.cpp - Default backend for GGUF models
vLLM - High-throughput serving for safetensors models
SGLang - Fast serving with RadixAttention

Architecture

dmrlet daemon
  ├── GPU Manager       - Auto-detect and allocate GPUs
  ├── Container Manager - Docker-based container lifecycle
  ├── Service Registry  - Endpoint discovery with load balancing
  ├── Health Monitor    - Auto-restart unhealthy containers
  ├── Auto-scaler       - Scale based on QPS/latency/GPU utilization
  └── Log Aggregator    - Centralized log collection

Community

For general questions and discussion, please use Docker Model Runner's Slack channel.

For bug reports and feature discussions, use GitHub Issues and pull requests.
