
feat: Blog post - Self-host Qwen 3.5 with Pulumi and Tailscale#17947

Open
sicarul wants to merge 5 commits into master from docs-blogpost-self-llm

Conversation

@sicarul
Contributor

@sicarul sicarul commented Mar 12, 2026

Summary

  • Adds a blog post walking through self-hosting Qwen 3.5 35B-A3B on a local k3s cluster with llama.cpp, Open WebUI, and Tailscale
  • Includes a reusable Pulumi Python program (static/programs/self-host-qwen-llm-python/) with an LlmServer ComponentResource
  • Covers GPU sizing, NVIDIA/AMD setup, Tailscale ACL adoption with import_/retain_on_delete, and agent integration
  • Adds related posts entry and meta image logo assets

🤖 Generated with Claude Code

…and Tailscale

Adds a complete walkthrough for deploying a self-hosted LLM inference stack
on a local k3s cluster using Pulumi. Includes a reusable LlmServer
ComponentResource, Tailscale networking, Open WebUI, and an example program.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sicarul
Contributor Author

sicarul commented Mar 12, 2026

@cnunciato I wasn't sure about the "copying the example" experience, so I'm open to suggestions.

@claude
Contributor

claude Bot commented Mar 12, 2026

Docs Review - PR #17947

Overall this is a well-structured, practical blog post with a complete working Pulumi program. A few issues to address before merging:


Issues

1. Config default mismatch - contextSize (bug)

Pulumi.yaml line 28 sets contextSize default to 16384, but main.py line 24 uses a fallback of 65536. Since Pulumi.yaml provides a default, the or-65536 branch is dead code. However, the blog post code snippet (index.md line 239) also shows 65536, which will confuse readers. Pick one value and make it consistent across Pulumi.yaml, main.py, and the blog narrative. The blog text (index.md line 99) mentions 65536 in the community-recommended parameters section, so the Pulumi.yaml default is likely the one that should change.
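To make the dead-code point concrete, here is a minimal stand-in for the config lookup (the `get_int` helper below is a simplified sketch of `pulumi.Config.get_int` behavior, not the real API; the key name comes from the PR):

```python
# Simplified stand-in for pulumi.Config.get_int, illustrating the review
# point: when Pulumi.yaml declares a default, the program never sees None.
def get_int(stack_config: dict, key: str, yaml_default: int) -> int:
    # Pulumi resolves the project-file default before the program runs.
    return stack_config.get(key, yaml_default)

stack_config = {}  # the user set nothing with `pulumi config set`
context_size = get_int(stack_config, "contextSize", 16384) or 65536
print(context_size)  # 16384 -- the `or 65536` branch is unreachable
```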

2. Missing newline - tailscale.svg

The diff shows No newline at end of file for .claude/commands/blog-meta-image/assets/logos/tailscale.svg. All new files should end with a newline.

3. Screenshots missing 1px gray borders

Per STYLE-GUIDE.md, partial screenshots should have a 1px gray (#999999) border. Both opencode.png and conduit.png are screenshots that would benefit from borders to distinguish them from the page background.

4. related.yaml ai tag - unrelated change

The diff modifies the global ai tag related posts (lines 412-419), replacing four existing entries with four completely different ones. This changes related-post suggestions for every blog post tagged ai, not just this new post. Was this intentional? If so, the rationale should be in the PR description. If not, revert this hunk and keep only the new self-host-qwen-llama-cpp-k8s-tailscale-pulumi block.

5. meta_image text does not match the blog title

The meta.png reads "Use Your GPU With Qwen 3.5 With Pulumi And Tailscale" but the actual blog title is "Use Your GPU For Your Agents: Self-Host Qwen 3.5 with Pulumi and Tailscale". Social sharing works best when the image and title align.

6. Conduit screenshot is in Spanish

The conduit.png screenshot shows a Spanish-language UI. For an English-language blog, consider replacing with an English-locale screenshot.

7. First mentions missing hyperlinks

Per blog review criteria, the first mention of every tool/technology should be hyperlinked. Missing links for:

  • k3s - first mentioned in the intro paragraph (index.md line 62) without a link (linked later in prerequisites, but first mention should carry the link)
  • GGUF - mentioned at index.md line 64 but not linked

8. opencode.png uses Markdown image syntax instead of Hugo figure shortcode

At index.md line 375, opencode.png uses raw Markdown image syntax, while conduit.png correctly uses the figure shortcode. For consistency and to support width control/borders, consider using figure for both.


Minor / Suggestions

  • index.md line 103: The llmfit install uses curl-pipe-sh. Consider adding a note that users should review scripts before piping to shell, or link to the project install docs.
  • index.md line 373: "OpenClaw" links to https://github.com/openclaw/openclaw. Verify this URL is correct and the repo is publicly accessible.
  • index.md line 184: "Git does not natively support cloning a single folder". This phrasing could be simplified to just provide the command with a brief inline comment.

Publishing Readiness Checklist

  • more-break present after intro
  • meta_image set and not the default placeholder
  • meta_image text matches the blog title (see issue 5)
  • Author profile exists (data/team/team/pablo-seibelt.toml)
  • All links verified (OpenClaw repo needs verification)
  • Code examples have language specifiers on all fenced blocks
  • No animated GIFs used as meta_image
  • Screenshots have 1px gray borders (see issue 3)
  • Title over 60 chars with allow_long_title true set
  • meta_desc present and reasonable length
  • Complete program structure in /static/programs/ (Pulumi.yaml, main.py, llm_server.py, requirements.txt)

Mention me (@claude) if you would like additional reviews or help fixing any of the above.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@pulumi-bot
Collaborator

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@pulumi-bot
Collaborator

- Fix contextSize default mismatch between Pulumi.yaml and code
- Add defaults for modelFile and llmNodePort, remove mmproj from config
- Add hyperlink for first mention of GGUF format
- Add trailing newline to tailscale.svg
- Skip program in CI (requires k8s cluster and Tailscale credentials)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@pulumi-bot
Collaborator

Contributor

@dirien dirien left a comment


Good signs overall. The ComponentResource pattern, GPU sizing table, and phone-access angle all clicked.

The main issue: a few security defaults contradict the "private and secure" pitch. And some missing context will leave readers guessing where they shouldn't have to. None of this requires a rewrite — all targeted fixes. Details in the inline comments below.


This post walks through a Kubernetes deployment on a Linux home server. It was tested on a Ryzen 9 5950X with 32 GB DDR4 and an RTX 3080 10 GB, which is high-end 2020 consumer hardware comparable to a mid-range build today. If your rig is in the same ballpark, this setup will likely work for you. If you are on a Mac with an M-series chip, you can run the same model locally with [mlx-lm](https://github.com/ml-explore/mlx-lm) instead.

[Qwen 3.5](https://qwen.ai/blog?id=qwen3.5) is an Apache 2.0-licensed model family from Alibaba. The 35B-A3B variant uses a Mixture-of-Experts (MoE) architecture that activates only 3 billion parameters per token. Thanks to quantized [GGUF](https://huggingface.co/docs/hub/en/gguf) models, models that would normally require datacenter hardware fit on consumer GPUs with acceptable quality loss.
Contributor


GGUF is a file format, not a quantization method.

The r/LocalLLaMA crowd will notice this. Getting it right costs one sentence.

Suggested change
[Qwen 3.5](https://qwen.ai/blog?id=qwen3.5) is an Apache 2.0-licensed model family from Alibaba. The 35B-A3B variant uses a Mixture-of-Experts (MoE) architecture that activates only 3 billion parameters per token. Thanks to quantized [GGUF](https://huggingface.co/docs/hub/en/gguf) models, models that would normally require datacenter hardware fit on consumer GPUs with acceptable quality loss.
[Qwen 3.5](https://qwen.ai/blog?id=qwen3.5) is an Apache 2.0-licensed model family from Alibaba. The 35B-A3B variant uses a Mixture-of-Experts (MoE) architecture that activates only 3 billion parameters per token. Thanks to quantized models distributed in the [GGUF](https://huggingface.co/docs/hub/en/gguf) format, models that would normally require datacenter hardware fit on consumer GPUs with acceptable quality loss. GGUF is the file format; quantization (e.g., Q4_K_M) is the compression that shrinks the model by reducing numerical precision.
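A back-of-envelope check makes the format-vs-quantization distinction tangible (rough numbers; Q4_K_M averages about 4.5 bits per weight in practice, and real file sizes vary by a GB or two):

```python
# Rough file-size math for a 35B-parameter model: GGUF is the container,
# the quantization (Q4_K_M) is what actually shrinks the weights.
params = 35e9
fp16_gb = params * 16 / 8 / 1e9     # 16-bit weights: ~70 GB
q4_k_m_gb = params * 4.5 / 8 / 1e9  # ~4.5 bits/weight: ~20 GB
print(f"FP16 ~{fp16_gb:.0f} GB, Q4_K_M ~{q4_k_m_gb:.1f} GB")
```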


This walkthrough defaults to **Q4_K_M** because it delivers strong quality while fitting on widely available consumer hardware. Both NVIDIA and AMD GPUs work; adjust the `gpuVendor` config value for your hardware.

With [community-recommended llama.cpp parameters](https://www.reddit.com/r/LocalLLaMA/comments/1rg4zqv/followup_qwen3535ba3b_7_communityrequested/) (`--fit-target`, `-fa on`, `--no-mmap`, `-ctk q8_0`, `-ctv q8_0`), the reference hardware (RTX 3080 10 GB) achieves around 600 tok/s prompt processing and 45 tok/s generation. These flags are already configured in the Pulumi program.
Contributor


Performance claims need measurement conditions.

Unqualified tok/s numbers invite "well actually" replies. Adding conditions makes the claim defensible.

Suggested change
With [community-recommended llama.cpp parameters](https://www.reddit.com/r/LocalLLaMA/comments/1rg4zqv/followup_qwen3535ba3b_7_communityrequested/) (`--fit-target`, `-fa on`, `--no-mmap`, `-ctk q8_0`, `-ctv q8_0`), the reference hardware (RTX 3080 10 GB) achieves around 600 tok/s prompt processing and 45 tok/s generation. These flags are already configured in the Pulumi program.
With [community-recommended llama.cpp parameters](https://www.reddit.com/r/LocalLLaMA/comments/1rg4zqv/followup_qwen3535ba3b_7_communityrequested/) (`--fit-target`, `-fa on`, `--no-mmap`, `-ctk q8_0`, `-ctv q8_0`), the reference hardware (RTX 3080 10 GB) achieves around 600 tok/s prompt processing and 45 tok/s generation at short-to-medium prompt lengths (~1K tokens) with a mostly empty KV cache. Throughput drops as context fills — expect roughly half the generation speed around the 8K mark. These flags are already configured in the Pulumi program.


#### Open WebUI and Tailscale networking

Open WebUI connects to the LLM server via its cluster-internal URL and disables authentication since it is only reachable through the tailnet.
Contributor


Be upfront about WEBUI_AUTH=false.

NodePort binds to 0.0.0.0 by default, so the UI may be reachable from your LAN without Tailscale. Worth calling out.

Suggested change
Open WebUI connects to the LLM server via its cluster-internal URL and disables authentication since it is only reachable through the tailnet.
Open WebUI connects to the LLM server via its cluster-internal URL. Authentication is disabled (`WEBUI_AUTH=false`) because access is gated by the tailnet. **Note:** NodePort binds to `0.0.0.0` by default, so the UI may also be reachable from your LAN without Tailscale. For shared tailnets or multi-user setups, keep authentication enabled and configure an admin account on first launch.
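For reference, the env block in question looks roughly like this (plain dicts for clarity; the in-cluster hostname is illustrative, while `WEBUI_AUTH` and `OPENAI_API_BASE_URL` are Open WebUI's own variables):

```python
# Illustrative container env for Open WebUI; the service hostname is an
# assumption, adjust to match your cluster.
open_webui_env = [
    {"name": "OPENAI_API_BASE_URL",
     "value": "http://llm-server.default.svc.cluster.local:8080/v1"},
    # Safe only when access is genuinely tailnet-gated; for shared tailnets
    # set this to "true" and create an admin account on first launch.
    {"name": "WEBUI_AUTH", "value": "false"},
]
```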

)
```

Any device on your tailnet can reach the chat interface at `http://<hostname>:30000` without exposing anything to the public internet.
Contributor


This is the biggest gap between what the post claims and what the code does. The whole security model rests on "only reachable through Tailscale," but NodePort bypasses that for anyone on the same network.

Suggested change
Any device on your tailnet can reach the chat interface at `http://<hostname>:30000` without exposing anything to the public internet.
Any device on your tailnet can reach the chat interface at `http://<hostname>:30000`. **Heads up:** k3s NodePort services bind to `0.0.0.0`, which means devices on your LAN can also reach port 30000 — not just tailnet members. To lock this down, set `--nodeport-addresses=100.64.0.0/10` in your k3s server flags, or switch to a `ClusterIP` service with the [Tailscale Kubernetes operator](https://tailscale.com/kb/1236/kubernetes-operator) as ingress.
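Either fix is a small change. A sketch of the two options (the ClusterIP spec is shown as a plain dict for clarity; the selector name is illustrative):

```python
# Option 1 (per the review): start k3s with
#   --kube-proxy-arg=nodeport-addresses=100.64.0.0/10
# so NodePort services only bind addresses in the Tailscale CGNAT range.
#
# Option 2: a ClusterIP service, unreachable from the LAN entirely; expose
# it over Tailscale with the Kubernetes operator instead.
webui_service_spec = {
    "type": "ClusterIP",
    "selector": {"app": "open-webui"},  # selector name is illustrative
    "ports": [{"port": 8080, "targetPort": 8080}],
}
```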

{
"action": "accept",
"src": ["autogroup:member"],
"dst": ["*:*"],
Contributor


ACL is more permissive than described.

The post says "Tailscale ACLs which allow only you to access the service," but autogroup:member + dst: *:* is every human user on the tailnet, to every device, on every port. Fine if you're the only user. Not fine if your partner or roommate is on the same tailnet.

Also: import_ will replace your existing tailnet ACL on first deploy. If you have custom rules, export them first and merge by hand.

Consider adding a note above the code block calling this out, and scoping the dst field to specific tags and ports.
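A tightened policy might look like this (a sketch only: the tag name and port numbers are assumptions, and `tagOwners` must list whoever is allowed to apply the tag):

```python
import json

# Hypothetical tightened ACL: tailnet members reach only the tagged LLM
# host, and only on the two service ports, instead of every device on
# every port (autogroup:member -> *:*).
acl_policy = {
    "tagOwners": {"tag:llm-server": ["autogroup:admin"]},
    "acls": [{
        "action": "accept",
        "src": ["autogroup:member"],
        "dst": ["tag:llm-server:30000", "tag:llm-server:8080"],
    }],
}
acl_json = json.dumps(acl_policy, indent=2)  # serialize for the ACL resource
```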


## Prerequisites

Before you start, make sure you have:
Contributor


Set expectations on prereq time.

Every persona flagged the gap between "a single pulumi up" and the actual prereq work. Consider adding something like:

Budget 1-2 hours for first-time GPU + k3s setup. GPU drivers, container toolkit, and runtime config involve kernel modules and at least one reboot. The Pulumi program itself deploys in under 5 minutes.

Contributor Author


1-2 hours seems extreme; it took me a few minutes to run everything, but maybe 15 minutes would set reasonable expectations.


- A [Tailscale account](https://login.tailscale.com/start) (free tier works)

## The Pulumi program
Contributor


The home-lab audience will immediately ask: why Pulumi over kubectl apply?

Consider adding a short paragraph here:

Why Pulumi here? You could deploy these manifests with kubectl apply. Pulumi buys you three things: (1) the Tailscale ACL, K8s resources, and config live in one stack, so pulumi destroy cleans up everything; (2) the ComponentResource lets you swap models or GPU vendors by changing config, no YAML editing; (3) the Tailscale auth key is encrypted in state, not sitting in a plaintext file. If you already run Flux or ArgoCD, you can export the manifests with pulumi stack export and feed them into your existing pipeline.
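Point (2) can be made concrete with a small sketch (config keys mirror the program's Pulumi.yaml; the model repo and file names are illustrative):

```python
# Same LlmServer component, two stacks: swapping model or GPU vendor is a
# config change, not a YAML edit. Values here are illustrative placeholders.
base = {
    "model": "Qwen/Qwen3.5-35B-A3B-GGUF",       # illustrative repo name
    "modelFile": "Qwen3.5-35B-A3B-Q4_K_M.gguf",  # illustrative file name
    "gpuVendor": "nvidia",
    "contextSize": 16384,
}
# e.g. `pulumi config set gpuVendor amd` on a second stack
amd_stack = {**base, "gpuVendor": "amd"}
```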

- Persistent model storage that survives pod restarts
- Everything running on a local Kubernetes cluster you control

If you outgrow your local GPU, the same Pulumi program can be adapted to target a cloud Kubernetes cluster. Swap your kubeconfig for a managed K8s service with GPU nodes and `pulumi up` again.
Contributor


Two things missing from the conclusion that every persona asked about:

  1. Break-even math. Without it, "no cloud costs" hangs in the air. Something like: If you already own a GPU and spend more than $30-50/month on API calls, self-hosting pays for the electricity pretty quickly (an RTX 3080 under load costs roughly $10-15/month). The privacy and offline benefits apply regardless of the math.

  2. How to update models. One sentence answers the obvious follow-up: To swap in a new model or quantization, change the model and modelFile config values and run pulumi up. The pod restarts and pulls the new GGUF file.

Contributor Author


I left out the math on purpose because it varies wildly with each person's machine, the chosen model, where they live, and so on, so I don't think we can give a reliable number here. Cost is probably not the biggest incentive anyway; you can run this and other models pretty cheaply online if you want to.


- An OpenAI-compatible API running on your own GPU via llama.cpp
- A browser-based chat UI accessible from any device on your tailnet
- Tailscale ACLs which allow only you to access the service
Contributor


This claim doesn't match the ACL code, which grants autogroup:member (all tailnet users) access to *:* (all devices, all ports). Consider: "Tailscale ACLs scoping access to your tailnet members" or, better, actually tighten the ACL to match this claim.

@sicarul
Contributor Author

sicarul commented Mar 16, 2026

Thanks for the feedback, Engin! I'll definitely improve the Tailscale/networking stuff and evaluate the other suggestions.

- Clarify GGUF is a file format, not a quantization method
- Add note about ACL permissiveness for shared tailnets
- Add prereq time expectations
- Add why-Pulumi justification
- Add how to update models in conclusion
- Warn about NodePort LAN exposure and how to restrict to Tailscale
- Update publish date to March 20th

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sicarul
Contributor Author

sicarul commented Mar 18, 2026

@dirien fixed!

@pulumi-bot
Collaborator

@sicarul
Contributor Author

sicarul commented Apr 1, 2026

@cnunciato If we don't want to publish this, I'm fine with that; I can just close it so I don't keep bugging you.

@cnunciato
Contributor

@sicarul Not at all, apologies for the delay on it. I'll give it a read today!

@sicarul
Contributor Author

sicarul commented Apr 4, 2026

With the release of Gemma 4, I'm wondering if I should re-create this setup with it instead, but I won't have access to my hardware to redo it until May.

@cnunciato
Contributor

Working on feedback for this now -- just a heads up.
