feat: Blog post - Self-host Qwen 3.5 with Pulumi and Tailscale #17947
Conversation
…and Tailscale

Adds a complete walkthrough for deploying a self-hosted LLM inference stack on a local k3s cluster using Pulumi. Includes a reusable `LlmServer` ComponentResource, Tailscale networking, Open WebUI, and an example program.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cnunciato i wasn't sure about the "copying the example" experience so i'm open to suggestions.
Docs Review - PR #17947

Overall this is a well-structured, practical blog post with a complete working Pulumi program. A few issues to address before merging:

Issues

1. **Config default mismatch - `contextSize` (bug).** Pulumi.yaml line 28 sets the `contextSize` default to 16384, but main.py line 24 uses a fallback of 65536. Since Pulumi.yaml provides a default, the or-65536 branch is dead code. However, the blog post code snippet (index.md line 239) also shows 65536, which will confuse readers. Pick one value and make it consistent across Pulumi.yaml, main.py, and the blog narrative. The blog text (index.md line 99) mentions 65536 in the community-recommended parameters section, so the Pulumi.yaml default is likely the one that should change.
2. **Missing newline - tailscale.svg.** The diff shows "No newline at end of file" for .claude/commands/blog-meta-image/assets/logos/tailscale.svg. All new files should end with a newline.
3. **Screenshots missing 1px gray borders.** Per STYLE-GUIDE.md, partial screenshots should have a 1px gray (#999999) border. Both opencode.png and conduit.png are screenshots that would benefit from borders to distinguish them from the page background.
4. **related.yaml `ai` tag - unrelated change.** The diff modifies the global `ai` tag related posts (lines 412-419), replacing four existing entries with four completely different ones. This changes related-post suggestions for every blog post tagged `ai`, not just this new post. Was this intentional? If so, the rationale should be in the PR description. If not, revert this hunk and keep only the new self-host-qwen-llama-cpp-k8s-tailscale-pulumi block.
5. **meta_image text does not match the blog title.** The meta.png reads "Use Your GPU With Qwen 3.5 With Pulumi And Tailscale" but the actual blog title is "Use Your GPU For Your Agents: Self-Host Qwen 3.5 with Pulumi and Tailscale". Social sharing works best when the image and title align.
6. **Conduit screenshot is in Spanish.** The conduit.png screenshot shows a Spanish-language UI. For an English-language blog, consider replacing with an English-locale screenshot.
7. **First mentions missing hyperlinks.** Per blog review criteria, the first mention of every tool/technology should be hyperlinked. Missing links for:
8. **opencode.png uses Markdown image syntax instead of Hugo figure shortcode.** At index.md line 375, opencode.png uses raw Markdown image syntax, while conduit.png correctly uses the figure shortcode. For consistency and to support width control/borders, consider using figure for both.

Minor / Suggestions
Publishing Readiness Checklist
Mention me (@claude) if you would like additional reviews or help fixing any of the above. |
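Issue 1 above is easy to see in miniature. The following is a hypothetical sketch, not the actual `pulumi.Config` API: it only simulates the lookup order (stack value, else Pulumi.yaml default) to show why the in-code `or 65536` fallback can never fire once Pulumi.yaml declares a default.

```python
# Simulated Pulumi config resolution: stack config wins, else the
# Pulumi.yaml default. Names mirror the review comment; this is NOT
# the pulumi SDK, just an illustration of the dead-code pattern.
YAML_DEFAULTS = {"contextSize": 16384}

def get_int(key, stack_config=None):
    stack_config = stack_config or {}
    return stack_config.get(key, YAML_DEFAULTS.get(key))

# The main.py pattern under review: because the YAML default means
# get_int() never returns None, the `or 65536` branch is unreachable.
context_size = get_int("contextSize") or 65536
print(context_size)  # → 16384, the YAML default always wins
```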
Your site preview for commit ff086a8 is ready! 🎉 http://www-testing-pulumi-docs-origin-pr-17947-ff086a89.s3-website.us-west-2.amazonaws.com
Your site preview for commit c295385 is ready! 🎉 http://www-testing-pulumi-docs-origin-pr-17947-c295385b.s3-website.us-west-2.amazonaws.com
- Fix contextSize default mismatch between Pulumi.yaml and code
- Add defaults for modelFile and llmNodePort, remove mmproj from config
- Add hyperlink for first mention of GGUF format
- Add trailing newline to tailscale.svg
- Skip program in CI (requires k8s cluster and Tailscale credentials)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Your site preview for commit bbfafb6 is ready! 🎉 http://www-testing-pulumi-docs-origin-pr-17947-bbfafb61.s3-website.us-west-2.amazonaws.com
Good signs overall. The ComponentResource pattern, GPU sizing table, and phone-access angle all clicked.
The main issue: a few security defaults contradict the "private and secure" pitch. And some missing context will leave readers guessing where they shouldn't have to. None of this requires a rewrite — all targeted fixes. Details in the inline comments below.
> This post walks through a Kubernetes deployment on a Linux home server. It was tested on a Ryzen 9 5950x with 32 GB DDR4 and an RTX 3080 10 GB, which is high-end 2020 consumer hardware comparable to a mid-range build today. If your rig is in the same ballpark, this setup will likely work for you. If you are on a Mac with an M-series chip, you can run the same model locally with [mlx-lm](https://github.com/ml-explore/mlx-lm) instead.
> [Qwen 3.5](https://qwen.ai/blog?id=qwen3.5) is an Apache 2.0-licensed model family from Alibaba. The 35B-A3B variant uses a Mixture-of-Experts (MoE) architecture that activates only 3 billion parameters per token. Thanks to quantized [GGUF](https://huggingface.co/docs/hub/en/gguf) models, models that would normally require datacenter hardware fit on consumer GPUs with acceptable quality loss.
GGUF is a file format, not a quantization method.
The r/LocalLLaMA crowd will notice this. Getting it right costs one sentence.
Suggested change:

> [Qwen 3.5](https://qwen.ai/blog?id=qwen3.5) is an Apache 2.0-licensed model family from Alibaba. The 35B-A3B variant uses a Mixture-of-Experts (MoE) architecture that activates only 3 billion parameters per token. Thanks to quantized models distributed in the [GGUF](https://huggingface.co/docs/hub/en/gguf) format, models that would normally require datacenter hardware fit on consumer GPUs with acceptable quality loss. GGUF is the file format; quantization (e.g., Q4_K_M) is the compression that shrinks the model by reducing numerical precision.
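To make the format-vs-quantization distinction concrete: GGUF is a container whose files begin with the 4-byte magic `GGUF`, while the quantization level is a property of the tensors stored inside. A minimal sketch, assuming the GGUF v3 header layout (magic, uint32 version, uint64 tensor count, uint64 metadata KV count); the header bytes here are synthetic, not from a real model file.

```python
import struct

def read_gguf_version(data: bytes) -> int:
    # A GGUF file begins with the 4-byte magic b"GGUF" then a uint32
    # version. Quantization (Q4_K_M, q8_0, ...) lives in the tensor
    # metadata inside the file, not in the format itself.
    magic, version = struct.unpack_from("<4sI", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return version

# Synthetic v3 header for illustration: magic, version, tensor count,
# metadata key-value count (both zero here).
fake_header = struct.pack("<4sIQQ", b"GGUF", 3, 0, 0)
print(read_gguf_version(fake_header))  # → 3
```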
> This walkthrough defaults to **Q4_K_M** because it delivers strong quality while fitting on widely available consumer hardware. Both NVIDIA and AMD GPUs work; adjust the `gpuVendor` config value for your hardware.
> With [community-recommended llama.cpp parameters](https://www.reddit.com/r/LocalLLaMA/comments/1rg4zqv/followup_qwen3535ba3b_7_communityrequested/) (`--fit-target`, `-fa on`, `--no-mmap`, `-ctk q8_0`, `-ctv q8_0`), the reference hardware (RTX 3080 10 GB) achieves around 600 tok/s prompt processing and 45 tok/s generation. These flags are already configured in the Pulumi program.
Performance claims need measurement conditions.
Unqualified tok/s numbers invite "well actually" replies. Adding conditions makes the claim defensible.
Suggested change:

> With [community-recommended llama.cpp parameters](https://www.reddit.com/r/LocalLLaMA/comments/1rg4zqv/followup_qwen3535ba3b_7_communityrequested/) (`--fit-target`, `-fa on`, `--no-mmap`, `-ctk q8_0`, `-ctv q8_0`), the reference hardware (RTX 3080 10 GB) achieves around 600 tok/s prompt processing and 45 tok/s generation at short-to-medium prompt lengths (~1K tokens) with a mostly empty KV cache. Throughput drops as context fills — expect roughly half the generation speed around the 8K mark. These flags are already configured in the Pulumi program.
> #### Open WebUI and Tailscale networking
> Open WebUI connects to the LLM server via its cluster-internal URL and disables authentication since it is only reachable through the tailnet.
Be upfront about WEBUI_AUTH=false.
NodePort binds to 0.0.0.0 by default, so the UI may be reachable from your LAN without Tailscale. Worth calling out.
Suggested change:

> Open WebUI connects to the LLM server via its cluster-internal URL. Authentication is disabled (`WEBUI_AUTH=false`) because access is gated by the tailnet. **Note:** NodePort binds to `0.0.0.0` by default, so the UI may also be reachable from your LAN without Tailscale. For shared tailnets or multi-user setups, keep authentication enabled and configure an admin account on first launch.
> Any device on your tailnet can reach the chat interface at `http://<hostname>:30000` without exposing anything to the public internet.
This is the biggest gap between what the post claims and what the code does. The whole security model rests on "only reachable through Tailscale," but NodePort bypasses that for anyone on the same network.
Suggested change:

> Any device on your tailnet can reach the chat interface at `http://<hostname>:30000`. **Heads up:** k3s NodePort services bind to `0.0.0.0`, which means devices on your LAN can also reach port 30000 — not just tailnet members. To lock this down, set `--nodeport-addresses=100.64.0.0/10` in your k3s server flags, or switch to a `ClusterIP` service with the [Tailscale Kubernetes operator](https://tailscale.com/kb/1236/kubernetes-operator) as ingress.
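The reason `--nodeport-addresses=100.64.0.0/10` works as a lockdown: Tailscale assigns node addresses from the CGNAT range, so restricting kube-proxy's NodePort bind to that range leaves ordinary LAN interfaces unserved. A quick stdlib sketch of the membership check (the sample IPs are made up for illustration):

```python
import ipaddress

# Tailscale hands out addresses from the CGNAT block 100.64.0.0/10.
# If kube-proxy only binds NodePorts to addresses in this range, the
# service answers on the Tailscale interface but not on LAN IPs.
TAILNET_RANGE = ipaddress.ip_network("100.64.0.0/10")

def nodeport_serves(node_ip: str) -> bool:
    return ipaddress.ip_address(node_ip) in TAILNET_RANGE

print(nodeport_serves("100.101.102.103"))  # typical tailnet address: True
print(nodeport_serves("192.168.1.50"))     # LAN address: False
```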
```json
{
    "action": "accept",
    "src": ["autogroup:member"],
    "dst": ["*:*"],
```
ACL is more permissive than described.
The post says "Tailscale ACLs which allow only you to access the service," but `autogroup:member` + `dst: *:*` is every human user on the tailnet, to every device, on every port. Fine if you're the only user. Not fine if your partner or roommate is on the same tailnet.

Also: `import_` will replace your existing tailnet ACL on first deploy. If you have custom rules, export them first and merge by hand.

Consider adding a note above the code block calling this out, and scoping the `dst` field to specific tags and ports.
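One way a tighter policy could look, sketched as plain JSON built in Python. The tag name, email, and ports below are placeholders, and the exact schema should be verified against Tailscale's ACL documentation before use:

```python
import json

# Scope access to one user and two service ports on hosts tagged
# tag:llm-server, instead of autogroup:member -> *:*.
# All identifiers here are placeholders for your own tailnet.
policy = {
    "tagOwners": {"tag:llm-server": ["you@example.com"]},
    "acls": [
        {
            "action": "accept",
            "src": ["you@example.com"],
            "dst": ["tag:llm-server:30000", "tag:llm-server:8080"],
        }
    ],
}
print(json.dumps(policy, indent=2))
```

Compared with the quoted hunk, the blast radius of a compromised tailnet account shrinks from every device and port to two ports on explicitly tagged hosts.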
> ## Prerequisites
> Before you start, make sure you have:
Set expectations on prereq time.
Every persona flagged the gap between "a single `pulumi up`" and the actual prereq work. Consider adding something like:

> Budget 1-2 hours for first-time GPU + k3s setup. GPU drivers, container toolkit, and runtime config involve kernel modules and at least one reboot. The Pulumi program itself deploys in under 5 minutes.
1-2 hours seems extreme; it took me a few minutes to run everything, but maybe 15 minutes would set a reasonable expectation.
> - A [Tailscale account](https://login.tailscale.com/start) (free tier works)
> ## The Pulumi program
The home-lab audience will immediately ask: why Pulumi over `kubectl apply`?

Consider adding a short paragraph here:

> Why Pulumi here? You could deploy these manifests with `kubectl apply`. Pulumi buys you three things: (1) the Tailscale ACL, K8s resources, and config live in one stack, so `pulumi destroy` cleans up everything; (2) the ComponentResource lets you swap models or GPU vendors by changing config, no YAML editing; (3) the Tailscale auth key is encrypted in state, not sitting in a plaintext file. If you already run Flux or ArgoCD, you can export the manifests with `pulumi stack export` and feed them into your existing pipeline.
> - Persistent model storage that survives pod restarts
> - Everything running on a local Kubernetes cluster you control
> If you outgrow your local GPU, the same Pulumi program can be adapted to target a cloud Kubernetes cluster. Swap your kubeconfig for a managed K8s service with GPU nodes and `pulumi up` again.
Two things missing from the conclusion that every persona asked about:

- **Break-even math.** Without it, "no cloud costs" hangs in the air. Something like: *If you already own a GPU and spend more than $30-50/month on API calls, self-hosting pays for the electricity pretty quickly (an RTX 3080 under load costs roughly $10-15/month). The privacy and offline benefits apply regardless of the math.*
- **How to update models.** One sentence answers the obvious follow-up: *To swap in a new model or quantization, change the `model` and `modelFile` config values and run `pulumi up`. The pod restarts and pulls the new GGUF file.*
I left out the math on purpose because it varies wildly with each person's machine, the chosen model, where they live, etc., so I don't think we can give a reliable number here. Cost is probably not the biggest incentive; you can probably run this and other models pretty cheaply online if you want to.
> - An OpenAI-compatible API running on your own GPU via llama.cpp
> - A browser-based chat UI accessible from any device on your tailnet
> - Tailscale ACLs which allow only you to access the service
This claim doesn't match the ACL code, which grants autogroup:member (all tailnet users) access to *:* (all devices, all ports). Consider: "Tailscale ACLs scoping access to your tailnet members" or, better, actually tighten the ACL to match this claim.
|
Thanks for the feedback, Engin! I'll definitely improve the tailscale/networking stuff and evaluate the other suggestions.
- Clarify GGUF is a file format, not a quantization method
- Add note about ACL permissiveness for shared tailnets
- Add prereq time expectations
- Add why-Pulumi justification
- Add how to update models in conclusion
- Warn about NodePort LAN exposure and how to restrict to Tailscale
- Update publish date to March 20th

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dirien fixed!
Your site preview for commit 51be878 is ready! 🎉 http://www-testing-pulumi-docs-origin-pr-17947-51be878b.s3-website.us-west-2.amazonaws.com
@cnunciato if we don't want to publish this i'm fine with that, i can just close it so i don't keep bugging you
@sicarul Not at all, apologies for the delay on it. I'll give it a read today!
With the release of gemma4 i'm wondering if i should re-create this setup with gemma4 instead, but i won't have access to my hardware to redo it again until May.
Working on feedback for this now -- just a heads up.
Summary
static/programs/self-host-qwen-llm-python/) with an `LlmServer` ComponentResource, `import_`/`retain_on_delete`, and agent integration

🤖 Generated with Claude Code