diff --git a/.claude/commands/blog-meta-image/assets/logos/alibaba.svg b/.claude/commands/blog-meta-image/assets/logos/alibaba.svg
new file mode 100644
index 000000000000..5a9d83e065f8
--- /dev/null
+++ b/.claude/commands/blog-meta-image/assets/logos/alibaba.svg
@@ -0,0 +1,5 @@
+
diff --git a/.claude/commands/blog-meta-image/assets/logos/tailscale.svg b/.claude/commands/blog-meta-image/assets/logos/tailscale.svg
new file mode 100644
index 000000000000..757c85e64c6e
--- /dev/null
+++ b/.claude/commands/blog-meta-image/assets/logos/tailscale.svg
@@ -0,0 +1 @@
+
diff --git a/content/blog/self-host-qwen-llama-cpp-k8s-tailscale-pulumi/conduit.png b/content/blog/self-host-qwen-llama-cpp-k8s-tailscale-pulumi/conduit.png
new file mode 100644
index 000000000000..b0afb3db2a68
Binary files /dev/null and b/content/blog/self-host-qwen-llama-cpp-k8s-tailscale-pulumi/conduit.png differ
diff --git a/content/blog/self-host-qwen-llama-cpp-k8s-tailscale-pulumi/index.md b/content/blog/self-host-qwen-llama-cpp-k8s-tailscale-pulumi/index.md
new file mode 100644
index 000000000000..3b4f0a6325a8
--- /dev/null
+++ b/content/blog/self-host-qwen-llama-cpp-k8s-tailscale-pulumi/index.md
@@ -0,0 +1,390 @@
+---
+title: "Use Your GPU For Your Agents: Self-Host Qwen 3.5 with Pulumi and Tailscale"
+allow_long_title: true
+date: 2026-03-20
+draft: false
+meta_desc: |
+ Self-host Qwen 3.5 on your GPU with Pulumi, llama.cpp, and Tailscale. One pulumi up gives you a private OpenAI-compatible API on your tailnet.
+
+meta_image: meta.png
+
+authors:
+- pablo-seibelt
+
+tags:
+- ai
+- kubernetes
+- tailscale
+- python
+
+social:
+ twitter: |
+ Self-host Qwen 3.5 on your own GPU with Pulumi, llama.cpp, and Tailscale. One pulumi up gives you a private OpenAI-compatible API on your tailnet. No cloud costs, no data leaving your network.
+ linkedin: |
+ Self-host Qwen 3.5 on your own GPU with Pulumi, llama.cpp, and Tailscale. One pulumi up gives you a private OpenAI-compatible API on your tailnet, accessible from any device. No cloud costs, no data leaving your network. The post walks through the full stack: llama.cpp for inference, Open WebUI for a chat interface, and Tailscale for secure access from anywhere.
+---
+
+If your AI tools and agents call a hosted API, you have probably accepted three tradeoffs: your data leaves your network on every request, you cannot work if your connection drops, and your bill scales with usage no matter how much hardware you already own.
+
+Many open-weight models now run well on consumer GPUs. Once the model is on your machine, your data stays local, inference works offline, and tokens cost nothing. If you already own capable hardware, you can run a model yourself.
+
+This post walks through a Kubernetes deployment on a Linux home server. It was tested on a Ryzen 9 5950X with 32 GB of DDR4 and an RTX 3080 10 GB, which is high-end 2020 consumer hardware comparable to a mid-range build today. If your rig is in the same ballpark, this setup will likely work for you. If you are on a Mac with an M-series chip, you can run the same model locally with [mlx-lm](https://github.com/ml-explore/mlx-lm) instead.
+
+[Qwen 3.5](https://qwen.ai/blog?id=qwen3.5) is an Apache 2.0-licensed model family from Alibaba. The 35B-A3B variant uses a Mixture-of-Experts (MoE) architecture that activates only 3 billion parameters per token. Thanks to quantization and distribution in the [GGUF](https://huggingface.co/docs/hub/en/gguf) format, models that would normally require datacenter hardware fit on consumer GPUs with acceptable quality loss. GGUF is the file format; quantization (e.g., Q4_K_M) is the compression that shrinks the model by reducing numerical precision.
+
+The full 35B-parameter model fits in around 22 GB at Q4_K_M quantization, and llama.cpp can split layers between GPU VRAM and system RAM so you do not need all of that in VRAM.
+
+In this post we will set up a complete self-hosted inference stack with a single `pulumi up`: [llama.cpp](https://github.com/ggerganov/llama.cpp) serving an OpenAI-compatible API, [Open WebUI](https://github.com/open-webui/open-webui) for a browser chat interface, and [Tailscale](https://tailscale.com/) for secure access from any device on your tailnet, all orchestrated on a local [k3s](https://k3s.io/) Kubernetes cluster.
+
+## Architecture overview
+
+```mermaid
+graph LR
+ subgraph K8s["Kubernetes cluster (k3s)"]
+        LLM["llama-server pod<br/>OpenAI-compatible API :30080"]
+        WebUI["Open WebUI :30000"]
+        TS["Tailscale pod"]
+        PVC["PVC<br/>model weights"]
+ end
+ PVC -->|"mounted in"| LLM
+ WebUI -->|"connects to"| LLM
+ TS -->|"exposes on tailnet"| WebUI
+ Agents["Local agents"] -->|"localhost"| LLM
+ Phone["Phone / laptop"] -->|"tailnet"| TS
+```
+
+## GPU and model sizing
+
+The table below shows sizes for the [unsloth/Qwen3.5-35B-A3B-GGUF](https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF) quantizations. The total memory column is the combined VRAM + system RAM needed to run the model. llama.cpp splits model layers between GPU VRAM and system RAM automatically, so any GPU that supports CUDA or ROCm will help accelerate inference.
+
+| Quantization | File size | Total memory needed |
+|---|---|---|
+| Q3_K_S | 15.3 GB | ~17 GB |
+| Q4_K_M | 22 GB | ~22 GB |
+| Q6_K | 28.9 GB | ~30 GB |
+
+This walkthrough defaults to **Q4_K_M** because it delivers strong quality while fitting on widely available consumer hardware. Both NVIDIA and AMD GPUs work; adjust the `gpuVendor` config value for your hardware.
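+
+The file sizes scale almost linearly with bits per weight, so you can sanity-check whether a quantization fits before downloading anything. A rough sketch (the effective bits-per-weight values are approximations inferred from the file sizes above):
+
+```python
+# Back-of-the-envelope size estimate: parameter count x effective bits per weight.
+# Bits-per-weight values are rough figures derived from the table above.
+PARAMS = 35e9
+BITS_PER_WEIGHT = {"Q3_K_S": 3.5, "Q4_K_M": 5.0, "Q6_K": 6.6}
+
+for quant, bits in BITS_PER_WEIGHT.items():
+    size_gb = PARAMS * bits / 8 / 1e9
+    # Leave an extra 1-2 GB of headroom for the KV cache and runtime overhead.
+    print(f"{quant}: ~{size_gb:.1f} GB")
+```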
+
+With [community-recommended llama.cpp parameters](https://www.reddit.com/r/LocalLLaMA/comments/1rg4zqv/followup_qwen3535ba3b_7_communityrequested/) (`--fit-target`, `-fa on`, `--no-mmap`, `-ctk q8_0`, `-ctv q8_0`), the reference hardware (RTX 3080 10 GB) achieves around 600 tok/s prompt processing and 45 tok/s generation. These flags are already configured in the Pulumi program.
+
+If your machine has less RAM or a smaller GPU, you can try a smaller quantization of the same model (for example, Q3_K_S at 15.3 GB), or switch to a smaller model like the 7B or 14B variants. You can swap the `model` and `modelFile` config values to try a different variant without changing any code.
+
+[llmfit](https://github.com/AlexsJones/llmfit) detects your CPU, RAM, and GPU, then tells you exactly which models and quantizations will run on your machine before you download anything:
+
+```bash
+curl -fsSL https://llmfit.axjns.dev/install.sh | sh
+llmfit
+```
+
+## Prerequisites
+
+{{< notes type="info" >}}
+If this is your first time setting up GPU drivers and k3s, budget around 15 minutes for the prerequisites below. The Pulumi program itself deploys in under 5 minutes.
+{{< /notes >}}
+
+Before you start, make sure you have:
+
+- An NVIDIA or AMD GPU with drivers installed
+ - **NVIDIA**: `nvidia-smi` should work
+ - **AMD**: `amd-smi` should work, plus ROCm drivers with `/dev/kfd` and `/dev/dri` present
+- A local Kubernetes cluster. We will use [k3s](https://k3s.io/):
+
+ ```bash
+ curl -sfL https://get.k3s.io | sh -
+
+ # k3s writes its kubeconfig to a root-owned path; copy it so
+ # kubectl and pulumi can access the cluster without sudo
+ mkdir -p ~/.kube
+ sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
+ sudo chown $USER ~/.kube/config
+ ```
+
+- GPU support in k3s. For **NVIDIA**, install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), then configure the runtime for k3s. Create `/etc/rancher/k3s/config.yaml`:
+
+ ```yaml
+ nvidia-container-runtime-path: /usr/bin/nvidia-container-runtime
+ default-runtime: nvidia
+ ```
+
+ Then configure containerd, restart k3s, and install the device plugin:
+
+ ```bash
+ # Configure the NVIDIA runtime for k3s's embedded containerd
+ sudo nvidia-ctk runtime configure --runtime=containerd \
+ --config=/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
+
+ # Restart k3s to pick up the new runtime
+  sudo systemctl restart k3s
+
+ # Install the NVIDIA device plugin
+ kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
+ ```
+
+ For **AMD**, install [ROCm drivers](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/) first. Verify with `rocminfo` and confirm `/dev/kfd` and `/dev/dri` are present. Then apply the [device plugin](https://github.com/ROCm/k8s-device-plugin):
+
+ ```bash
+ kubectl apply -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
+ ```
+
+ Verify your GPU is visible to Kubernetes:
+
+ ```bash
+ # NVIDIA
+ kubectl get nodes -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}'
+
+ # AMD
+ kubectl get nodes -o jsonpath='{.items[0].status.capacity.amd\.com/gpu}'
+ ```
+
+ Either command should output `1` (or the number of GPUs you have). If it is empty, check that your GPU device plugin pod is running.
+
+- [Pulumi CLI](/docs/iac/download-install/) and Python 3.9+:
+
+ ```bash
+ curl -fsSL https://get.pulumi.com | sh
+ ```
+
+- A [Tailscale account](https://login.tailscale.com/start) (free tier works)
+
+## The Pulumi program
+
+{{< notes type="tip" >}}
+You could deploy these manifests with `kubectl apply`, but Pulumi buys you a few things: the Tailscale ACL, Kubernetes resources, and config all live in one stack so `pulumi destroy` cleans up everything. The `ComponentResource` lets you swap models or GPU vendors by changing config instead of editing YAML. And the Tailscale auth key is encrypted in state, not sitting in a plaintext file.
+{{< /notes >}}
+
+Create a new project:
+
+```bash
+mkdir self-host-llm && cd self-host-llm
+pulumi new python --name self-host-qwen-llm
+```
+
+Copy the [example program](https://github.com/pulumi/docs/tree/master/static/programs/self-host-qwen-llm-python) into the project directory. Git does not natively support cloning a single folder, so the commands below use a sparse checkout to avoid downloading the entire repository:
+
+```bash
+git clone --depth 1 --filter=blob:none --sparse \
+ https://github.com/pulumi/docs.git /tmp/pulumi-docs
+git -C /tmp/pulumi-docs sparse-checkout set static/programs/self-host-qwen-llm-python
+cp /tmp/pulumi-docs/static/programs/self-host-qwen-llm-python/* .
+rm -rf /tmp/pulumi-docs
+```
+
+{{< github-card repo="pulumi/docs" branch="master" path="static/programs/self-host-qwen-llm-python" >}}
+
+### How it works
+
+The program is split into two files: `__main__.py` orchestrates the full stack, and `llm_server.py` defines a reusable [ComponentResource](/docs/iac/concepts/resources/components/) that encapsulates the LLM inference server.
+
+#### The LlmServer component
+
+`LlmServer` bundles a PVC, an init container that downloads model weights, the llama-server deployment, and a service into a single reusable component. The GPU vendor maps to the matching resource key and container image, so switching between NVIDIA and AMD is a one-line config change:
+
+```python
+GPU_RESOURCE_KEYS = {
+ "nvidia": "nvidia.com/gpu",
+ "amd": "amd.com/gpu",
+}
+
+LLAMA_SERVER_IMAGES = {
+ "nvidia": "ghcr.io/ggml-org/llama.cpp:server-cuda",
+ "amd": "ghcr.io/ggml-org/llama.cpp:server-rocm",
+}
+```
+
+The init container uses `uvx` to run the `hf download` command from `huggingface_hub` without baking the CLI into a custom image. The download is idempotent, so it skips files already on the PVC:
+
+```python
+init_containers = [
+ k8s.core.v1.ContainerArgs(
+ name="download-model",
+ image="ghcr.io/astral-sh/uv:python3.12-bookworm-slim",
+ command=["sh", "-c",
+ f"uvx --from huggingface_hub hf download {model} {download_files} "
+ + f"--local-dir {model_dir}",
+ ],
+ volume_mounts=models_mount,
+ ),
+]
+```
+
+All llama.cpp flags are assembled from config values passed to the constructor, so you can override context size, thread count, or sampling parameters per stack without editing the component:
+
+```python
+config = pulumi.Config()
+model = config.get("model") or "unsloth/Qwen3.5-35B-A3B-GGUF"
+model_file = config.get("modelFile") or "Qwen3.5-35B-A3B-Q4_K_M.gguf"
+context_size = config.get_int("contextSize") or 65536
+
+llm = LlmServer(
+ "llm",
+ model=model,
+ model_file=model_file,
+ port=llm_port,
+ gpu_vendor=gpu_vendor,
+ context_size=context_size,
+ # ...
+)
+```
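+
+For example, `pulumi config set contextSize 32768` halves the default context window, which roughly halves the KV-cache footprint if memory is tight.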
+
+#### Adopting the Tailscale ACL
+
+The Tailscale ACL is a global singleton per tailnet. It cannot be created or deleted, only updated. The program uses [`import_`](/docs/iac/concepts/options/import/) to adopt the existing ACL into state on first `pulumi up`, and [`retain_on_delete`](/docs/iac/concepts/options/retainondelete/) to prevent `pulumi destroy` from trying to delete it:
+
+```python
+ts_acl = tailscale.Acl(
+ "tailnet-acl",
+ acl=pulumi.Output.json_dumps({
+ "tagOwners": {
+ "tag:llm-server": ["autogroup:admin"],
+ },
+ "acls": [
+ {
+ "action": "accept",
+ "src": ["autogroup:member"],
+ "dst": ["*:*"],
+ },
+ ],
+ }),
+ opts=pulumi.ResourceOptions(
+ import_="acl",
+ retain_on_delete=True,
+ ),
+)
+```
+
+Without these options, destroy+up cycles would fail with a "precondition failed" 412 error.
+
+{{< notes type="info" >}}
+This ACL grants all tailnet members (`autogroup:member`) access to all devices on all ports (`*:*`). This is fine if you are the only user on your tailnet. If you share your tailnet with other people, scope the `dst` field to specific tags and ports (e.g., `tag:llm-server:30000`). Also note that `import_` will **replace your existing tailnet ACL** on first deploy, so export your current rules first if you have custom ones.
+{{< /notes >}}
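+
+As a sketch, a tighter policy might allow members to reach only the WebUI port on the tagged device (hypothetical values; adapt the tag and port to your setup):
+
+```python
+# Hypothetical scoped ACL: tailnet members may reach only port 30000 on
+# devices tagged llm-server, instead of every device on every port.
+scoped_acl = {
+    "tagOwners": {"tag:llm-server": ["autogroup:admin"]},
+    "acls": [
+        {
+            "action": "accept",
+            "src": ["autogroup:member"],
+            "dst": ["tag:llm-server:30000"],
+        },
+    ],
+}
+```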
+
+#### Open WebUI and Tailscale networking
+
+Open WebUI connects to the LLM server via its cluster-internal URL and disables authentication, since the UI is never exposed to the public internet (see the warning below about LAN access).
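+
+In `__main__.py`, that wiring is three environment variables on the Open WebUI container (condensed here), with `llm.url` coming from the `LlmServer` component:
+
+```python
+env=[
+    k8s.core.v1.EnvVarArgs(name="OPENAI_API_BASE_URLS", value=llm.url),
+    k8s.core.v1.EnvVarArgs(name="OPENAI_API_KEYS", value="not-needed"),
+    k8s.core.v1.EnvVarArgs(name="WEBUI_AUTH", value="false"),
+]
+```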
+
+The Tailscale deployment runs as a separate pod that joins your tailnet and forwards traffic to Open WebUI's ClusterIP. An init container enables IP forwarding, and the main container authenticates using a Pulumi-managed auth key. `TS_DEST_IP` is wired directly to the Open WebUI service's cluster IP using a Pulumi output, so the value is always correct even if Kubernetes reassigns it:
+
+```python
+k8s.core.v1.ContainerArgs(
+ name="tailscale",
+ image="ghcr.io/tailscale/tailscale:latest",
+ env=[
+ k8s.core.v1.EnvVarArgs(
+ name="TS_AUTHKEY",
+ value_from=k8s.core.v1.EnvVarSourceArgs(
+ secret_key_ref=k8s.core.v1.SecretKeySelectorArgs(
+ name="tailscale-auth",
+ key="TS_AUTHKEY",
+ ),
+ ),
+ ),
+ k8s.core.v1.EnvVarArgs(name="TS_HOSTNAME", value=hostname),
+ k8s.core.v1.EnvVarArgs(
+ name="TS_DEST_IP",
+ value=webui_service.spec.cluster_ip,
+ ),
+ # ...
+ ],
+)
+```
+
+Any device on your tailnet can reach the chat interface at `http://llm-server:30000` (the default `hostname` config value) without exposing anything to the public internet.
+
+{{< notes type="warning" >}}
+By default, k3s NodePort services bind to `0.0.0.0`, which means devices on your LAN can also reach port 30000. To restrict access to Tailscale only, add the following to `/etc/rancher/k3s/config.yaml` and restart k3s with `sudo systemctl restart k3s`:
+
+```yaml
+nodeport-addresses: "100.64.0.0/10"
+```
+
+{{< /notes >}}
+
+Configure the Tailscale provider:
+- Generate an API key from [Settings > Keys](https://login.tailscale.com/admin/settings/keys)
+- Find your tailnet name under [Settings > General](https://login.tailscale.com/admin/settings/general)
+
+```bash
+pulumi config set tailscale:apiKey tskey-api-XXXXX --secret
+pulumi config set tailscale:tailnet your-tailnet-name
+```
+
+## Deploy
+
+If you are using an AMD GPU, set the vendor before deploying:
+
+```bash
+pulumi config set gpuVendor amd
+```
+
+The program defaults to `Qwen3.5-35B-A3B-Q4_K_M.gguf`. To use a different quantization, override the `modelFile` config value:
+
+```bash
+pulumi config set modelFile Qwen3.5-35B-A3B-Q6_K.gguf
+```
+
+Run the deployment:
+
+```bash
+pulumi up
+```
+
+Pulumi shows a preview of all resources it will create. Confirm with `yes`. The first run takes several minutes as the init container downloads the model weights into the PVC.
+
+Once the stack is up, verify llama-server is running:
+
+```bash
+curl http://localhost:30080/v1/models
+```
+
+You should see your model listed. Try a completion:
+
+```bash
+curl http://localhost:30080/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -d '{
+ "model": "unsloth/Qwen3.5-35B-A3B-GGUF",
+ "messages": [{"role": "user", "content": "What is infrastructure as code?"}],
+ "max_tokens": 200
+ }'
+```
+
+Then open `http://localhost:30000` in your browser to access Open WebUI. Select the Qwen model from the model dropdown and start chatting.
+
+## Connect your agents
+
+Any tool that supports the OpenAI API format works out of the box. Point it at your llama-server endpoint:
+
+```bash
+export OPENAI_BASE_URL=http://localhost:30080/v1
+export OPENAI_API_KEY=not-needed
+```
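+
+The same endpoint works from any OpenAI-compatible SDK. A minimal sketch with the official `openai` Python package (the model name is mostly informational: llama-server serves whichever GGUF file it was started with):
+
+```python
+from openai import OpenAI
+
+# The API key is ignored by llama-server but required by the SDK.
+client = OpenAI(base_url="http://localhost:30080/v1", api_key="not-needed")
+
+resp = client.chat.completions.create(
+    model="unsloth/Qwen3.5-35B-A3B-GGUF",
+    messages=[{"role": "user", "content": "What is infrastructure as code?"}],
+    max_tokens=200,
+)
+print(resp.choices[0].message.content)
+```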
+
+Some examples:
+
+- **[OpenClaw](https://github.com/openclaw/openclaw)**: connect your WhatsApp, Telegram, or Discord to your self-hosted model
+- **[OpenCode](https://opencode.ai/)**: terminal-based coding agent with local LLM support
+
+## Access from your phone
+
+Install the [Tailscale app](https://tailscale.com/download) on your phone. Once connected to your tailnet, open the URL from `pulumi stack output tailscale_webui_url` in your mobile browser. Open WebUI works well on mobile and gives you a ChatGPT-like interface backed by your own hardware. For a native app experience, try [Conduit](https://apps.apple.com/us/app/conduit-openwebui-client/id6749840287) ([Android](https://play.google.com/store/apps/details?id=app.cogwheel.conduit)), an Open WebUI client for iOS and Android.
+
+{{< figure src="conduit.png" alt="Conduit app connected to self-hosted Qwen via Tailscale" width="300" >}}
+
+## Conclusion
+
+After following this guide you have:
+
+- An OpenAI-compatible API running on your own GPU via llama.cpp
+- A browser-based chat UI accessible from any device on your tailnet
+- Tailscale ACLs scoping access to your tailnet members
+- Persistent model storage that survives pod restarts
+- Everything running on a local Kubernetes cluster you control
+
+To swap in a new model or quantization, change the `model` and `modelFile` config values and run `pulumi up`. The pod restarts and pulls the new GGUF file.
+
+If you outgrow your local GPU, the same Pulumi program can be adapted to target a cloud Kubernetes cluster. Point your kubeconfig at a managed K8s service with GPU nodes and run `pulumi up` again.
+
+{{< related-posts >}}
diff --git a/content/blog/self-host-qwen-llama-cpp-k8s-tailscale-pulumi/meta.png b/content/blog/self-host-qwen-llama-cpp-k8s-tailscale-pulumi/meta.png
new file mode 100644
index 000000000000..2b14409196e0
Binary files /dev/null and b/content/blog/self-host-qwen-llama-cpp-k8s-tailscale-pulumi/meta.png differ
diff --git a/content/blog/self-host-qwen-llama-cpp-k8s-tailscale-pulumi/opencode.png b/content/blog/self-host-qwen-llama-cpp-k8s-tailscale-pulumi/opencode.png
new file mode 100644
index 000000000000..cf995b4eb7d0
Binary files /dev/null and b/content/blog/self-host-qwen-llama-cpp-k8s-tailscale-pulumi/opencode.png differ
diff --git a/data/related.yaml b/data/related.yaml
index dbebfe512118..ab975317aedd 100644
--- a/data/related.yaml
+++ b/data/related.yaml
@@ -368,10 +368,10 @@ tags:
- 2022-03-10-hierarchical-config
ai:
- - codegen-learnings
- - pulumi-copilot-rest
- - copilot-lessons
- - future-cloud-infrastructure-10-trends-shaping-2024-and-beyond
+ - run-deepseek-on-aws-ec2-using-pulumi
+ - policy-next-gen
+ - low-code-llm-apps-with-local-ai-flowise-and-pulumi
+ - easy-ai-apps-with-langserve-and-pulumi
ml:
- devops-ai-developer-future--pulumi-user-group-tech-talks
@@ -1266,6 +1266,11 @@ fargate-vs-ec2:
- easy-ai-apps-with-langserve-and-pulumi
- advanced-aws-networking-part-2
+self-host-qwen-llama-cpp-k8s-tailscale-pulumi:
+ - deploy-openclaw-aws-hetzner
+ - low-code-llm-apps-with-local-ai-flowise-and-pulumi
+ - mlops-huggingface-llm-aws-sagemaker-python
+
when-to-use-azure-cosmos-db:
- azure-deployment-environments
- sam-cogan-testing-best-practices
diff --git a/scripts/programs/ignore.txt b/scripts/programs/ignore.txt
index ea9d87908795..b7afba348122 100644
--- a/scripts/programs/ignore.txt
+++ b/scripts/programs/ignore.txt
@@ -33,6 +33,7 @@ awsx-apigateway-custom-domain-.*
kubernetes-.*
k8s-.*
helm-.*
+self-host-qwen-llm-python
# Skip broken programs to get back to green
# https://github.com/pulumi/docs/issues/14505
diff --git a/static/programs/self-host-qwen-llm-python/Pulumi.yaml b/static/programs/self-host-qwen-llm-python/Pulumi.yaml
new file mode 100644
index 000000000000..d414f47d5d5b
--- /dev/null
+++ b/static/programs/self-host-qwen-llm-python/Pulumi.yaml
@@ -0,0 +1,62 @@
+name: self-host-qwen-llm-python
+description: Self-hosted llama-server (llama.cpp) with Open WebUI and Tailscale
+runtime:
+ name: python
+ options:
+ toolchain: pip
+config:
+ pulumi:tags:
+ value:
+ pulumi:template: python
+ model:
+ type: string
+ default: unsloth/Qwen3.5-35B-A3B-GGUF
+ description: HuggingFace model repository
+ modelFile:
+ type: string
+ default: Qwen3.5-35B-A3B-Q4_K_M.gguf
+ description: GGUF filename to download from the model repo
+ gpuVendor:
+ type: string
+ default: nvidia
+ description: GPU vendor ("nvidia" or "amd")
+ gpuCount:
+ type: integer
+ default: 1
+ description: Number of GPUs to allocate
+ contextSize:
+ type: integer
+ default: 65536
+ description: Context window size in tokens
+ fitTarget:
+ type: integer
+ default: 2048
+ description: VRAM fit target in MB for llama.cpp layer placement
+ threads:
+ type: integer
+ default: 5
+ description: Number of CPU threads for inference
+ jinja:
+ type: boolean
+ default: true
+ description: Enable Jinja template processing for chat templates
+ parallel:
+ type: integer
+ default: 1
+ description: Number of parallel request slots
+ llmPort:
+ type: integer
+ default: 8080
+ description: LLM service port
+ llmNodePort:
+ type: integer
+ default: 30080
+ description: NodePort for external LLM access
+ webuiPort:
+ type: integer
+ default: 30000
+ description: Open WebUI NodePort
+ hostname:
+ type: string
+ default: llm-server
+ description: Tailscale hostname
diff --git a/static/programs/self-host-qwen-llm-python/__main__.py b/static/programs/self-host-qwen-llm-python/__main__.py
new file mode 100644
index 000000000000..7660d1a34688
--- /dev/null
+++ b/static/programs/self-host-qwen-llm-python/__main__.py
@@ -0,0 +1,298 @@
+import pulumi
+import pulumi_kubernetes as k8s
+import pulumi_tailscale as tailscale
+
+from llm_server import LlmServer
+
+config = pulumi.Config()
+gpu_vendor = config.get("gpuVendor") or "nvidia"
+webui_port = config.get_int("webuiPort") or 30000
+llm_port = config.get_int("llmPort") or 8080
+llm_node_port = config.get_int("llmNodePort") or 30080
+hostname = config.get("hostname") or "llm-server"
+model = config.get("model") or "unsloth/Qwen3.5-35B-A3B-GGUF"
+model_file = config.get("modelFile") or "Qwen3.5-35B-A3B-Q4_K_M.gguf"
+context_size = config.get_int("contextSize") or 65536
+fit_target = config.get_int("fitTarget") or 2048
+parallel = config.get_int("parallel") or 1
+threads = config.get_int("threads") or 5
+jinja = config.get_bool("jinja")
+if jinja is None:
+ jinja = True
+
+NAMESPACE = "llm"
+
+ns = k8s.core.v1.Namespace(
+ NAMESPACE,
+ metadata=k8s.meta.v1.ObjectMetaArgs(name=NAMESPACE),
+)
+ns_opts = pulumi.ResourceOptions(depends_on=[ns])
+
+llm = LlmServer(
+ "llm",
+ model=model,
+ model_file=model_file,
+ port=llm_port,
+ gpu_vendor=gpu_vendor,
+ gpu_count=config.get_int("gpuCount") or 1,
+ node_port=llm_node_port,
+ namespace=NAMESPACE,
+ context_size=context_size,
+ fit_target=fit_target,
+ parallel=parallel,
+ threads=threads,
+ jinja=jinja,
+ mmproj=config.get("mmproj"),
+ opts=ns_opts,
+)
+
+# --- Tailscale RBAC (must be created before the Tailscale deployment that
+# references service_account_name="tailscale") ---
+
+ts_sa = k8s.core.v1.ServiceAccount(
+ "tailscale",
+ metadata=k8s.meta.v1.ObjectMetaArgs(name="tailscale", namespace=NAMESPACE),
+ opts=ns_opts,
+)
+
+ts_role = k8s.rbac.v1.Role(
+ "tailscale",
+ metadata=k8s.meta.v1.ObjectMetaArgs(name="tailscale", namespace=NAMESPACE),
+ rules=[
+ k8s.rbac.v1.PolicyRuleArgs(
+ api_groups=[""],
+ resources=["secrets"],
+ verbs=["create", "get", "update", "patch"],
+ ),
+ ],
+ opts=ns_opts,
+)
+
+ts_role_binding = k8s.rbac.v1.RoleBinding(
+ "tailscale",
+ metadata=k8s.meta.v1.ObjectMetaArgs(name="tailscale", namespace=NAMESPACE),
+ subjects=[
+ k8s.rbac.v1.SubjectArgs(
+ kind="ServiceAccount",
+ name="tailscale",
+ namespace=NAMESPACE,
+ ),
+ ],
+ role_ref=k8s.rbac.v1.RoleRefArgs(
+ api_group="rbac.authorization.k8s.io",
+ kind="Role",
+ name="tailscale",
+ ),
+ opts=ns_opts,
+)
+
+# --- Open WebUI ---
+
+webui_labels = {"app": "open-webui"}
+
+webui_pvc = k8s.core.v1.PersistentVolumeClaim(
+ "open-webui-data",
+ metadata=k8s.meta.v1.ObjectMetaArgs(name="open-webui-data", namespace=NAMESPACE),
+ spec=k8s.core.v1.PersistentVolumeClaimSpecArgs(
+ access_modes=["ReadWriteOnce"],
+ resources=k8s.core.v1.VolumeResourceRequirementsArgs(
+ requests={"storage": "5Gi"},
+ ),
+ ),
+ opts=ns_opts,
+)
+
+webui_deployment = k8s.apps.v1.Deployment(
+ "open-webui",
+ metadata=k8s.meta.v1.ObjectMetaArgs(name="open-webui", namespace=NAMESPACE, labels=webui_labels),
+ spec=k8s.apps.v1.DeploymentSpecArgs(
+ replicas=1,
+ selector=k8s.meta.v1.LabelSelectorArgs(match_labels=webui_labels),
+ template=k8s.core.v1.PodTemplateSpecArgs(
+ metadata=k8s.meta.v1.ObjectMetaArgs(labels=webui_labels),
+ spec=k8s.core.v1.PodSpecArgs(
+ containers=[
+ k8s.core.v1.ContainerArgs(
+ name="open-webui",
+ image="ghcr.io/open-webui/open-webui:main",
+ ports=[k8s.core.v1.ContainerPortArgs(container_port=8080)],
+ env=[
+ k8s.core.v1.EnvVarArgs(
+ name="OPENAI_API_BASE_URLS",
+ value=llm.url,
+ ),
+ k8s.core.v1.EnvVarArgs(name="OPENAI_API_KEYS", value="not-needed"),
+ k8s.core.v1.EnvVarArgs(name="WEBUI_AUTH", value="false"),
+ ],
+ volume_mounts=[
+ k8s.core.v1.VolumeMountArgs(
+ name="data",
+ mount_path="/app/backend/data",
+ ),
+ ],
+ ),
+ ],
+ volumes=[
+ k8s.core.v1.VolumeArgs(
+ name="data",
+ persistent_volume_claim=k8s.core.v1.PersistentVolumeClaimVolumeSourceArgs(
+ claim_name=webui_pvc.metadata.name,
+ ),
+ ),
+ ],
+ ),
+ ),
+ ),
+ opts=pulumi.ResourceOptions(depends_on=[ns, webui_pvc]),
+)
+
+webui_service = k8s.core.v1.Service(
+ "open-webui",
+ metadata=k8s.meta.v1.ObjectMetaArgs(name="open-webui", namespace=NAMESPACE),
+ spec=k8s.core.v1.ServiceSpecArgs(
+ selector=webui_labels,
+ type="NodePort",
+ ports=[
+ k8s.core.v1.ServicePortArgs(
+ port=webui_port,
+ target_port=8080,
+ node_port=webui_port,
+ ),
+ ],
+ ),
+ opts=ns_opts,
+)
+
+# --- Tailscale ---
+
+ts_acl = tailscale.Acl(
+ "tailnet-acl",
+ acl=pulumi.Output.json_dumps({
+ "tagOwners": {
+ "tag:llm-server": ["autogroup:admin"],
+ },
+ "acls": [
+ {
+ "action": "accept",
+ "src": ["autogroup:member"],
+ "dst": ["*:*"],
+ },
+ ],
+ }),
+ # The Tailscale ACL is a global singleton per tailnet — it can't be truly
+ # created or deleted, only updated. import_ adopts the existing ACL into
+ # state on first `pulumi up`, and retain_on_delete prevents `pulumi destroy`
+ # from trying to delete it (which would fail or leave the tailnet broken).
+ # Without these, destroy+up cycles fail with a "precondition failed" 412 error.
+ opts=pulumi.ResourceOptions(
+ import_="acl",
+ retain_on_delete=True,
+ ),
+)
+
+ts_key = tailscale.TailnetKey(
+ "llm-server-key",
+ reusable=True,
+ ephemeral=True,
+ preauthorized=True,
+ tags=["tag:llm-server"],
+ description="Pulumi-managed key for LLM server",
+ opts=pulumi.ResourceOptions(depends_on=[ts_acl]),
+)
+
+ts_secret = k8s.core.v1.Secret(
+ "tailscale-auth",
+ metadata=k8s.meta.v1.ObjectMetaArgs(name="tailscale-auth", namespace=NAMESPACE),
+ string_data={
+ "TS_AUTHKEY": ts_key.key,
+ },
+ opts=ns_opts,
+)
+
+ts_labels = {"app": "tailscale"}
+
+ts_deployment = k8s.apps.v1.Deployment(
+ "tailscale",
+ metadata=k8s.meta.v1.ObjectMetaArgs(name="tailscale", namespace=NAMESPACE, labels=ts_labels),
+ spec=k8s.apps.v1.DeploymentSpecArgs(
+ replicas=1,
+ selector=k8s.meta.v1.LabelSelectorArgs(match_labels=ts_labels),
+ template=k8s.core.v1.PodTemplateSpecArgs(
+ metadata=k8s.meta.v1.ObjectMetaArgs(labels=ts_labels),
+ spec=k8s.core.v1.PodSpecArgs(
+ service_account_name="tailscale",
+ init_containers=[
+ k8s.core.v1.ContainerArgs(
+ name="sysctler",
+ image="busybox",
+ command=["/bin/sh", "-c"],
+ args=["sysctl -w net.ipv4.ip_forward=1 net.ipv6.conf.all.forwarding=1"],
+ security_context=k8s.core.v1.SecurityContextArgs(
+ privileged=True,
+ ),
+ ),
+ ],
+ containers=[
+ k8s.core.v1.ContainerArgs(
+ name="tailscale",
+ image="ghcr.io/tailscale/tailscale:latest",
+ env=[
+ k8s.core.v1.EnvVarArgs(
+ name="TS_AUTHKEY",
+ value_from=k8s.core.v1.EnvVarSourceArgs(
+ secret_key_ref=k8s.core.v1.SecretKeySelectorArgs(
+ name="tailscale-auth",
+ key="TS_AUTHKEY",
+ ),
+ ),
+ ),
+ k8s.core.v1.EnvVarArgs(name="TS_HOSTNAME", value=hostname),
+ k8s.core.v1.EnvVarArgs(name="TS_STATE_DIR", value="/var/lib/tailscale"),
+ k8s.core.v1.EnvVarArgs(name="TS_USERSPACE", value="false"),
+ k8s.core.v1.EnvVarArgs(
+ name="TS_DEST_IP",
+ value=webui_service.spec.cluster_ip,
+ ),
+ k8s.core.v1.EnvVarArgs(name="TS_KUBE_SECRET", value="tailscale-state"),
+ ],
+ volume_mounts=[
+ k8s.core.v1.VolumeMountArgs(
+ name="tailscale-state",
+ mount_path="/var/lib/tailscale",
+ ),
+ k8s.core.v1.VolumeMountArgs(
+ name="dev-tun",
+ mount_path="/dev/net/tun",
+ ),
+ ],
+ security_context=k8s.core.v1.SecurityContextArgs(
+ privileged=True,
+ ),
+ ),
+ ],
+ volumes=[
+ k8s.core.v1.VolumeArgs(
+ name="tailscale-state",
+ empty_dir=k8s.core.v1.EmptyDirVolumeSourceArgs(),
+ ),
+ k8s.core.v1.VolumeArgs(
+ name="dev-tun",
+ host_path=k8s.core.v1.HostPathVolumeSourceArgs(
+ path="/dev/net/tun",
+ type="CharDevice",
+ ),
+ ),
+ ],
+ ),
+ ),
+ ),
+ # Needs the secret (for TS_AUTHKEY), SA and RBAC (for kube secret access)
+ opts=pulumi.ResourceOptions(depends_on=[ns, ts_secret, ts_sa, ts_role_binding]),
+)
+
+# --- Outputs ---
+
+pulumi.export("local_webui_url", f"http://localhost:{webui_port}")
+pulumi.export("local_api_url", f"http://localhost:{llm_node_port}/v1")
+pulumi.export("tailscale_webui_url", f"http://{hostname}:{webui_port}")
+pulumi.export("model", model)
diff --git a/static/programs/self-host-qwen-llm-python/llm_server.py b/static/programs/self-host-qwen-llm-python/llm_server.py
new file mode 100644
index 000000000000..4a913b7a90ba
--- /dev/null
+++ b/static/programs/self-host-qwen-llm-python/llm_server.py
@@ -0,0 +1,165 @@
+import pulumi
+import pulumi_kubernetes as k8s
+
+GPU_RESOURCE_KEYS = {
+ "nvidia": "nvidia.com/gpu",
+ "amd": "amd.com/gpu",
+}
+
+LLAMA_SERVER_IMAGES = {
+ "nvidia": "ghcr.io/ggml-org/llama.cpp:server-cuda",
+ "amd": "ghcr.io/ggml-org/llama.cpp:server-rocm",
+}
+
+_INTERNAL_PORT = 8080
+
+
+class LlmServer(pulumi.ComponentResource):
+ url: pulumi.Output[str]
+ service: k8s.core.v1.Service
+
+ def __init__(self, name, model, model_file, port, gpu_vendor="nvidia",
+ gpu_count=1, node_port=None, namespace="default",
+ context_size=65536, fit_target=2048, parallel=1,
+ mmproj=None, threads=5, jinja=True, server_args=None,
+ opts=None):
+ super().__init__("selfhost:llm:LlmServer", name, None, opts)
+
+ if gpu_vendor not in GPU_RESOURCE_KEYS:
+ raise ValueError(f"Unsupported gpu_vendor '{gpu_vendor}', must be one of: {', '.join(GPU_RESOURCE_KEYS)}")
+
+ labels = {"app": name}
+ model_dir = "/models"
+ model_path = f"{model_dir}/{model_file}"
+ gpu_resource = GPU_RESOURCE_KEYS[gpu_vendor]
+ image = LLAMA_SERVER_IMAGES[gpu_vendor]
+
+ args = [
+ "-m", model_path,
+ "-c", str(context_size),
+ "--fit-target", str(fit_target),
+ "-fa", "on",
+ "--no-mmap",
+ *(["--jinja"] if jinja else []),
+ "-ctk", "q8_0",
+ "-ctv", "q8_0",
+ "-t", str(threads),
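+            # Community-tested sampling settings for Qwen 3.5; override via server_args if needed.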
+ "--temp", "1.0",
+ "--top-p", "0.95",
+ "--top-k", "20",
+ "--min-p", "0.00",
+ "--presence-penalty", "1.5",
+ "--repeat-penalty", "1.0",
+ "--port", str(_INTERNAL_PORT),
+ "--host", "0.0.0.0",
+ "--parallel", str(parallel),
+ ]
+ if mmproj:
+ args += ["--mmproj", f"{model_dir}/{mmproj}"]
+ # Escape hatch: pass arbitrary llama.cpp flags without adding constructor params
+ for k, v in (server_args or {}).items():
+ args += [f"--{k}", str(v)]
+
+ download_files = f"{model_file} {mmproj}" if mmproj else model_file
+ models_mount = [k8s.core.v1.VolumeMountArgs(name="models", mount_path=model_dir)]
+
+ init_containers = [
+ k8s.core.v1.ContainerArgs(
+ name="download-model",
+ # Uses uvx to run hf download without baking huggingface-cli into a custom image.
+ # hf download is idempotent — skips files already on the PVC.
+ image="ghcr.io/astral-sh/uv:python3.12-bookworm-slim",
+ command=["sh", "-c",
+ f"uvx --from huggingface_hub hf download {model} {download_files} "
+ + f"--local-dir {model_dir}",
+ ],
+ volume_mounts=models_mount,
+ ),
+ ]
+
+ self.models_pvc = k8s.core.v1.PersistentVolumeClaim(
+ f"{name}-models",
+ metadata=k8s.meta.v1.ObjectMetaArgs(
+ name=f"{name}-models",
+ namespace=namespace,
+ ),
+ spec=k8s.core.v1.PersistentVolumeClaimSpecArgs(
+ access_modes=["ReadWriteOnce"],
+ resources=k8s.core.v1.VolumeResourceRequirementsArgs(
+ requests={"storage": "50Gi"},
+ ),
+ ),
+ opts=pulumi.ResourceOptions(parent=self),
+ )
+
+ self.deployment = k8s.apps.v1.Deployment(
+ name,
+ metadata=k8s.meta.v1.ObjectMetaArgs(
+ name=name,
+ namespace=namespace,
+ labels=labels,
+ ),
+ spec=k8s.apps.v1.DeploymentSpecArgs(
+ replicas=1,
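+                # Generous rollout deadline: the first deploy waits for the init container's model download.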
+ progress_deadline_seconds=1800,
+ selector=k8s.meta.v1.LabelSelectorArgs(match_labels=labels),
+ # Recreate strategy: only one pod can hold the GPU at a time
+ strategy=k8s.apps.v1.DeploymentStrategyArgs(type="Recreate"),
+ template=k8s.core.v1.PodTemplateSpecArgs(
+ metadata=k8s.meta.v1.ObjectMetaArgs(labels=labels),
+ spec=k8s.core.v1.PodSpecArgs(
+ init_containers=init_containers,
+ containers=[
+ k8s.core.v1.ContainerArgs(
+ name=name,
+ image=image,
+ args=args,
+ ports=[k8s.core.v1.ContainerPortArgs(container_port=_INTERNAL_PORT)],
+ resources=k8s.core.v1.ResourceRequirementsArgs(
+ limits={gpu_resource: str(gpu_count)},
+ ),
+ volume_mounts=models_mount,
+ ),
+ ],
+ volumes=[
+ k8s.core.v1.VolumeArgs(
+ name="models",
+ persistent_volume_claim=k8s.core.v1.PersistentVolumeClaimVolumeSourceArgs(
+ claim_name=self.models_pvc.metadata.name,
+ ),
+ ),
+ ],
+ ),
+ ),
+ ),
+ opts=pulumi.ResourceOptions(
+ parent=self,
+ depends_on=[self.models_pvc],
+ ),
+ )
+
+ service_spec_args = dict(
+ selector=labels,
+ ports=[
+ k8s.core.v1.ServicePortArgs(
+ port=port,
+ target_port=_INTERNAL_PORT,
+ **({"node_port": node_port} if node_port else {}),
+ ),
+ ],
+ )
+ if node_port:
+ service_spec_args["type"] = "NodePort"
+
+ self.service = k8s.core.v1.Service(
+ name,
+ metadata=k8s.meta.v1.ObjectMetaArgs(
+ name=name,
+ namespace=namespace,
+ ),
+ spec=k8s.core.v1.ServiceSpecArgs(**service_spec_args),
+ opts=pulumi.ResourceOptions(parent=self),
+ )
+
+ self.url = pulumi.Output.concat("http://", name, ":", str(port), "/v1")
+ self.register_outputs({"url": self.url})
diff --git a/static/programs/self-host-qwen-llm-python/requirements.txt b/static/programs/self-host-qwen-llm-python/requirements.txt
new file mode 100644
index 000000000000..01a39d1fc295
--- /dev/null
+++ b/static/programs/self-host-qwen-llm-python/requirements.txt
@@ -0,0 +1,3 @@
+pulumi>=3.0.0,<4.0.0
+pulumi-kubernetes>=4.0.0,<5.0.0
+pulumi-tailscale>=0.17.0,<1.0.0