feat: Blog post - Self-host Qwen 3.5 with Pulumi and Tailscale #17947
Conversation
…and Tailscale

Adds a complete walkthrough for deploying a self-hosted LLM inference stack on a local k3s cluster using Pulumi. Includes a reusable `LlmServer` ComponentResource, Tailscale networking, Open WebUI, and an example program.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cnunciato i wasn't sure about the "copying the example" experience so i'm open to suggestions.
Docs Review - PR #17947

Overall this is a well-structured, practical blog post with a complete working Pulumi program. A few issues to address before merging:

Issues

1. **Config default mismatch - `contextSize` (bug).** Pulumi.yaml line 28 sets the `contextSize` default to 16384, but main.py line 24 uses a fallback of 65536. Since Pulumi.yaml provides a default, the or-65536 branch is dead code. However, the blog post code snippet (index.md line 239) also shows 65536, which will confuse readers. Pick one value and make it consistent across Pulumi.yaml, main.py, and the blog narrative. The blog text (index.md line 99) mentions 65536 in the community-recommended parameters section, so the Pulumi.yaml default is likely the one that should change.
2. **Missing newline - tailscale.svg.** The diff shows "No newline at end of file" for .claude/commands/blog-meta-image/assets/logos/tailscale.svg. All new files should end with a newline.
3. **Screenshots missing 1px gray borders.** Per STYLE-GUIDE.md, partial screenshots should have a 1px gray (#999999) border. Both opencode.png and conduit.png are screenshots that would benefit from borders to distinguish them from the page background.
4. **related.yaml `ai` tag - unrelated change.** The diff modifies the global `ai` tag related posts (lines 412-419), replacing four existing entries with four completely different ones. This changes related-post suggestions for every blog post tagged `ai`, not just this new post. Was this intentional? If so, the rationale should be in the PR description. If not, revert this hunk and keep only the new self-host-qwen-llama-cpp-k8s-tailscale-pulumi block.
5. **meta_image text does not match the blog title.** The meta.png reads "Use Your GPU With Qwen 3.5 With Pulumi And Tailscale" but the actual blog title is "Use Your GPU For Your Agents: Self-Host Qwen 3.5 with Pulumi and Tailscale". Social sharing works best when the image and title align.
6. **Conduit screenshot is in Spanish.** The conduit.png screenshot shows a Spanish-language UI. For an English-language blog, consider replacing with an English-locale screenshot.
7. **First mentions missing hyperlinks.** Per blog review criteria, the first mention of every tool/technology should be hyperlinked. Missing links for:
8. **opencode.png uses Markdown image syntax instead of Hugo figure shortcode.** At index.md line 375, opencode.png uses raw Markdown image syntax, while conduit.png correctly uses the figure shortcode. For consistency and to support width control/borders, consider using figure for both.

Minor / Suggestions
Publishing Readiness Checklist
Mention me (@claude) if you would like additional reviews or help fixing any of the above. |
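Issue 1 above is easy to see in miniature. The following is a hypothetical sketch, not the actual `pulumi.Config` API: it only simulates the lookup order (stack value, else Pulumi.yaml default) to show why the in-code `or 65536` fallback can never fire once Pulumi.yaml declares a default.

```python
# Simulated Pulumi config resolution: stack config wins, else the
# Pulumi.yaml default. Names mirror the review comment; this is NOT
# the pulumi SDK, just an illustration of the dead-code pattern.
YAML_DEFAULTS = {"contextSize": 16384}

def get_int(key, stack_config=None):
    stack_config = stack_config or {}
    return stack_config.get(key, YAML_DEFAULTS.get(key))

# The main.py pattern under review: because the YAML default means
# get_int() never returns None, the `or 65536` branch is unreachable.
context_size = get_int("contextSize") or 65536
print(context_size)  # → 16384, the YAML default always wins
```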
Your site preview for commit ff086a8 is ready! 🎉 http://www-testing-pulumi-docs-origin-pr-17947-ff086a89.s3-website.us-west-2.amazonaws.com
Your site preview for commit c295385 is ready! 🎉 http://www-testing-pulumi-docs-origin-pr-17947-c295385b.s3-website.us-west-2.amazonaws.com
- Fix contextSize default mismatch between Pulumi.yaml and code
- Add defaults for modelFile and llmNodePort, remove mmproj from config
- Add hyperlink for first mention of GGUF format
- Add trailing newline to tailscale.svg
- Skip program in CI (requires k8s cluster and Tailscale credentials)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Your site preview for commit bbfafb6 is ready! 🎉 http://www-testing-pulumi-docs-origin-pr-17947-bbfafb61.s3-website.us-west-2.amazonaws.com
Good signs overall. The ComponentResource pattern, GPU sizing table, and phone-access angle all clicked.
The main issue: a few security defaults contradict the "private and secure" pitch. And some missing context will leave readers guessing where they shouldn't have to. None of this requires a rewrite — all targeted fixes. Details in the inline comments below.
> This post walks through a Kubernetes deployment on a Linux home server. It was tested on a Ryzen 9 5950x with 32 GB DDR4 and an RTX 3080 10 GB, which is high-end 2020 consumer hardware comparable to a mid-range build today. If your rig is in the same ballpark, this setup will likely work for you. If you are on a Mac with an M-series chip, you can run the same model locally with [mlx-lm](https://github.com/ml-explore/mlx-lm) instead.
> [Qwen 3.5](https://qwen.ai/blog?id=qwen3.5) is an Apache 2.0-licensed model family from Alibaba. The 35B-A3B variant uses a Mixture-of-Experts (MoE) architecture that activates only 3 billion parameters per token. Thanks to quantized [GGUF](https://huggingface.co/docs/hub/en/gguf) models, models that would normally require datacenter hardware fit on consumer GPUs with acceptable quality loss.
GGUF is a file format, not a quantization method.
The r/LocalLLaMA crowd will notice this. Getting it right costs one sentence.
Suggested change:

> [Qwen 3.5](https://qwen.ai/blog?id=qwen3.5) is an Apache 2.0-licensed model family from Alibaba. The 35B-A3B variant uses a Mixture-of-Experts (MoE) architecture that activates only 3 billion parameters per token. Thanks to quantized models distributed in the [GGUF](https://huggingface.co/docs/hub/en/gguf) format, models that would normally require datacenter hardware fit on consumer GPUs with acceptable quality loss. GGUF is the file format; quantization (e.g., Q4_K_M) is the compression that shrinks the model by reducing numerical precision.
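To make the format-vs-quantization distinction concrete: GGUF is a container whose files begin with the 4-byte magic `GGUF`, while the quantization level is a property of the tensors stored inside. A minimal sketch, assuming the GGUF v3 header layout (magic, uint32 version, uint64 tensor count, uint64 metadata KV count); the header bytes here are synthetic, not from a real model file.

```python
import struct

def read_gguf_version(data: bytes) -> int:
    # A GGUF file begins with the 4-byte magic b"GGUF" then a uint32
    # version. Quantization (Q4_K_M, q8_0, ...) lives in the tensor
    # metadata inside the file, not in the format itself.
    magic, version = struct.unpack_from("<4sI", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return version

# Synthetic v3 header for illustration: magic, version, tensor count,
# metadata key-value count (both zero here).
fake_header = struct.pack("<4sIQQ", b"GGUF", 3, 0, 0)
print(read_gguf_version(fake_header))  # → 3
```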
> This walkthrough defaults to **Q4_K_M** because it delivers strong quality while fitting on widely available consumer hardware. Both NVIDIA and AMD GPUs work; adjust the `gpuVendor` config value for your hardware.
> With [community-recommended llama.cpp parameters](https://www.reddit.com/r/LocalLLaMA/comments/1rg4zqv/followup_qwen3535ba3b_7_communityrequested/) (`--fit-target`, `-fa on`, `--no-mmap`, `-ctk q8_0`, `-ctv q8_0`), the reference hardware (RTX 3080 10 GB) achieves around 600 tok/s prompt processing and 45 tok/s generation. These flags are already configured in the Pulumi program.
Performance claims need measurement conditions.
Unqualified tok/s numbers invite "well actually" replies. Adding conditions makes the claim defensible.
Suggested change:

> With [community-recommended llama.cpp parameters](https://www.reddit.com/r/LocalLLaMA/comments/1rg4zqv/followup_qwen3535ba3b_7_communityrequested/) (`--fit-target`, `-fa on`, `--no-mmap`, `-ctk q8_0`, `-ctv q8_0`), the reference hardware (RTX 3080 10 GB) achieves around 600 tok/s prompt processing and 45 tok/s generation at short-to-medium prompt lengths (~1K tokens) with a mostly empty KV cache. Throughput drops as context fills — expect roughly half the generation speed around the 8K mark. These flags are already configured in the Pulumi program.
> #### Open WebUI and Tailscale networking
> Open WebUI connects to the LLM server via its cluster-internal URL and disables authentication since it is only reachable through the tailnet.
Be upfront about WEBUI_AUTH=false.
NodePort binds to 0.0.0.0 by default, so the UI may be reachable from your LAN without Tailscale. Worth calling out.
Suggested change:

> Open WebUI connects to the LLM server via its cluster-internal URL. Authentication is disabled (`WEBUI_AUTH=false`) because access is gated by the tailnet. **Note:** NodePort binds to `0.0.0.0` by default, so the UI may also be reachable from your LAN without Tailscale. For shared tailnets or multi-user setups, keep authentication enabled and configure an admin account on first launch.
> Any device on your tailnet can reach the chat interface at `http://<hostname>:30000` without exposing anything to the public internet.
This is the biggest gap between what the post claims and what the code does. The whole security model rests on "only reachable through Tailscale," but NodePort bypasses that for anyone on the same network.
Suggested change:

> Any device on your tailnet can reach the chat interface at `http://<hostname>:30000`. **Heads up:** k3s NodePort services bind to `0.0.0.0`, which means devices on your LAN can also reach port 30000 — not just tailnet members. To lock this down, set `--nodeport-addresses=100.64.0.0/10` in your k3s server flags, or switch to a `ClusterIP` service with the [Tailscale Kubernetes operator](https://tailscale.com/kb/1236/kubernetes-operator) as ingress.
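The reason `--nodeport-addresses=100.64.0.0/10` works as a lockdown: Tailscale assigns node addresses from the CGNAT range, so restricting kube-proxy's NodePort bind to that range leaves ordinary LAN interfaces unserved. A quick stdlib sketch of the membership check (the sample IPs are made up for illustration):

```python
import ipaddress

# Tailscale hands out addresses from the CGNAT block 100.64.0.0/10.
# If kube-proxy only binds NodePorts to addresses in this range, the
# service answers on the Tailscale interface but not on LAN IPs.
TAILNET_RANGE = ipaddress.ip_network("100.64.0.0/10")

def nodeport_serves(node_ip: str) -> bool:
    return ipaddress.ip_address(node_ip) in TAILNET_RANGE

print(nodeport_serves("100.101.102.103"))  # typical tailnet address: True
print(nodeport_serves("192.168.1.50"))     # LAN address: False
```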
```json
{
    "action": "accept",
    "src": ["autogroup:member"],
    "dst": ["*:*"],
```
ACL is more permissive than described.
The post says "Tailscale ACLs which allow only you to access the service," but `autogroup:member` + `dst: *:*` is every human user on the tailnet, to every device, on every port. Fine if you're the only user. Not fine if your partner or roommate is on the same tailnet.

Also: `import_` will replace your existing tailnet ACL on first deploy. If you have custom rules, export them first and merge by hand.

Consider adding a note above the code block calling this out, and scoping the `dst` field to specific tags and ports.
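One way a tighter policy could look, sketched as plain JSON built in Python. The tag name, email, and ports below are placeholders, and the exact schema should be verified against Tailscale's ACL documentation before use:

```python
import json

# Scope access to one user and two service ports on hosts tagged
# tag:llm-server, instead of autogroup:member -> *:*.
# All identifiers here are placeholders for your own tailnet.
policy = {
    "tagOwners": {"tag:llm-server": ["you@example.com"]},
    "acls": [
        {
            "action": "accept",
            "src": ["you@example.com"],
            "dst": ["tag:llm-server:30000", "tag:llm-server:8080"],
        }
    ],
}
print(json.dumps(policy, indent=2))
```

Compared with the quoted hunk, the blast radius of a compromised tailnet account shrinks from every device and port to two ports on explicitly tagged hosts.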
> ## Prerequisites
> Before you start, make sure you have:
Set expectations on prereq time.
Every persona flagged the gap between "a single `pulumi up`" and the actual prereq work. Consider adding something like:

> Budget 1-2 hours for first-time GPU + k3s setup. GPU drivers, container toolkit, and runtime config involve kernel modules and at least one reboot. The Pulumi program itself deploys in under 5 minutes.
1-2 hours seems extreme; it took me a few minutes to run everything, but maybe 15 minutes would set a reasonable expectation.
> - A [Tailscale account](https://login.tailscale.com/start) (free tier works)
> ## The Pulumi program
The home-lab audience will immediately ask: why Pulumi over `kubectl apply`?

Consider adding a short paragraph here:

> Why Pulumi here? You could deploy these manifests with `kubectl apply`. Pulumi buys you three things: (1) the Tailscale ACL, K8s resources, and config live in one stack, so `pulumi destroy` cleans up everything; (2) the ComponentResource lets you swap models or GPU vendors by changing config, no YAML editing; (3) the Tailscale auth key is encrypted in state, not sitting in a plaintext file. If you already run Flux or ArgoCD, you can export the manifests with `pulumi stack export` and feed them into your existing pipeline.
> - Persistent model storage that survives pod restarts
> - Everything running on a local Kubernetes cluster you control
> If you outgrow your local GPU, the same Pulumi program can be adapted to target a cloud Kubernetes cluster. Swap your kubeconfig for a managed K8s service with GPU nodes and `pulumi up` again.
Two things missing from the conclusion that every persona asked about:

- **Break-even math.** Without it, "no cloud costs" hangs in the air. Something like: *If you already own a GPU and spend more than $30-50/month on API calls, self-hosting pays for the electricity pretty quickly (an RTX 3080 under load costs roughly $10-15/month). The privacy and offline benefits apply regardless of the math.*
- **How to update models.** One sentence answers the obvious follow-up: *To swap in a new model or quantization, change the `model` and `modelFile` config values and run `pulumi up`. The pod restarts and pulls the new GGUF file.*
I left out the math on purpose because it varies wildly with each person's machine, the chosen model, where they live, etc., so I don't think we can give a reliable number here. Cost is probably not the biggest incentive; you can probably run this and other models pretty cheaply online if you want to.
> - An OpenAI-compatible API running on your own GPU via llama.cpp
> - A browser-based chat UI accessible from any device on your tailnet
> - Tailscale ACLs which allow only you to access the service
This claim doesn't match the ACL code, which grants autogroup:member (all tailnet users) access to *:* (all devices, all ports). Consider: "Tailscale ACLs scoping access to your tailnet members" or, better, actually tighten the ACL to match this claim.
|
Thanks for the feedback, Engin! I'll definitely improve the tailscale/networking stuff and evaluate the other suggestions.
- Clarify GGUF is a file format, not a quantization method
- Add note about ACL permissiveness for shared tailnets
- Add prereq time expectations
- Add why-Pulumi justification
- Add how to update models in conclusion
- Warn about NodePort LAN exposure and how to restrict to Tailscale
- Update publish date to March 20th

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dirien fixed!
Your site preview for commit 51be878 is ready! 🎉 http://www-testing-pulumi-docs-origin-pr-17947-51be878b.s3-website.us-west-2.amazonaws.com
@cnunciato if we don't want to publish this i'm fine with that, i can just close it so i don't keep bugging you
@sicarul Not at all, apologies for the delay on it. I'll give it a read today!
With the release of gemma4 i'm wondering if i should re-create this setup with gemma4 instead, but i won't have access to my hardware to redo it again until May.
Working on feedback for this now -- just a heads up.
Summary
static/programs/self-host-qwen-llm-python/) with an `LlmServer` ComponentResource, `import_`/`retain_on_delete`, and agent integration

🤖 Generated with Claude Code