Native MLX inference server for Apple Silicon #103

michaelneale wants to merge 196 commits into `main`
Conversation
I will align #56 to better suit this PR; I abstracted out backends a little too early.
Notes requested by mic: TurboQuant KV cache compression for the MLX path

Crawled TheTom/turboquant_plus, an implementation of TurboQuant (ICLR 2026) KV cache compression for llama.cpp. The core findings are directly applicable to our MLX inference path since we own the full KV cache lifecycle.

**Why this matters**

At 32K context, the KV cache is multiple GB. TurboQuant achieves 3.8–5.1× compression with ~1% PPL impact. That means 4× the context length, or 4× the concurrent users, from the same Mac.

**The algorithm is simple**

PolarQuant (the production variant; they dropped QJL, the 1-bit error correction, because it hurts quality through softmax amplification):
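As a minimal sketch of the quantize/dequantize pair, assuming 2-bit Lloyd-Max centroids for a unit Gaussian and a per-vector RMS scale standing in for the stored norms (names and layout are illustrative, not the actual TurboQuant implementation):

```rust
/// Lloyd-Max optimal 2-bit quantization levels for a unit Gaussian
/// (classic Max-quantizer values). Illustrative; not the exact
/// PolarQuant constants.
const CENTROIDS: [f32; 4] = [-1.5104, -0.4528, 0.4528, 1.5104];

/// Quantize one vector: divide by its RMS, then map each component to the
/// nearest centroid index (2 bits each). Returns (indices, scale).
fn quantize(v: &[f32]) -> (Vec<u8>, f32) {
    let rms = (v.iter().map(|x| x * x).sum::<f32>() / v.len() as f32).sqrt();
    let scale = if rms > 0.0 { rms } else { 1.0 };
    let indices = v
        .iter()
        .map(|&x| {
            let x = x / scale;
            // nearest-centroid search over the 4 constants
            let mut best = 0usize;
            for j in 1..CENTROIDS.len() {
                if (x - CENTROIDS[j]).abs() < (x - CENTROIDS[best]).abs() {
                    best = j;
                }
            }
            best as u8
        })
        .collect();
    (indices, scale)
}

/// Dequantize: centroid lookup times the stored per-vector scale.
fn dequantize(indices: &[u8], scale: f32) -> Vec<f32> {
    indices.iter().map(|&i| CENTROIDS[i as usize] * scale).collect()
}
```

Because the centroids are compile-time constants, dequantization is a table lookup plus one multiply per element, which is why the decode-path cost stays negligible.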
The centroids are just constants: Lloyd-Max optimal values for a Gaussian distribution. Literally a …

**Where it plugs in**

The current store path:

```rust
// Current: store raw
self.keys = Some(new_k.clone());

// With TurboQuant: quantize on store, dequant on read
// - store the compressed representation (indices + norms) instead of full arrays
// - dequant before returning to scaled_dot_product_attention
```

All the ops are expressible through …

**Key findings from TurboQuant research**

**Pragmatic path**

Start with pure … This is completely independent of the llama.cpp path; they share nothing. ggml manages its own KV cache with C structs and Metal shaders; MLX manages its own with Rust arrays. The only shared thing would be the centroid constants.
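To put a number on "multiple GB at 32K": a back-of-envelope sizing helper. The shape parameters below are assumptions (roughly Qwen2.5-32B-like: 64 layers, 8 KV heads, head dim 128, f16); real values come from the model's config.json.

```rust
/// Back-of-envelope KV cache size in bytes for one full-length sequence.
/// The factor of 2 accounts for keys + values.
fn kv_cache_bytes(layers: u64, kv_heads: u64, head_dim: u64, seq_len: u64, bytes_per_el: u64) -> u64 {
    2 * layers * kv_heads * head_dim * seq_len * bytes_per_el
}
```

With the assumed shapes, a 32K-token f16 cache is `kv_cache_bytes(64, 8, 128, 32_768, 2)` = 8 GiB, so a ~4× compressor brings it to ~2 GiB, which is where the "4× context or 4× users" framing comes from.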
In-process Metal GPU inference via mlx-rs (Rust → mlx-c), no Python/subprocess. Drop-in replacement for llama-server when the model is a safetensors directory.

- mlx/model.rs: Qwen2/Llama transformer, quantized matmul, RoPE, KV cache
- mlx/server.rs: OpenAI-compatible HTTP (streaming SSE + blocking)
- Auto-detect safetensors dirs, skip rpc-server (solo-only)
- cfg(target_os = "macos"): zero impact on Linux builds
Without causal masking during prefill, larger models (32B) produced degenerate output (repeating template tokens). Use MLX's built-in ScaledDotProductAttentionMask::Causal for multi-token prefill. Tested: Qwen2.5-32B-Instruct-4bit — correct output, ~23 tok/s.
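The real fix uses the mlx-rs built-in mask, but the shape of the bug is easy to show in plain Rust: during a multi-token prefill, query position i must not attend to key positions after i, or every prompt token sees the whole prompt at once. A minimal additive-mask construction (illustrative only; the MLX kernel does the equivalent internally):

```rust
/// Additive causal mask for a prefill of `q_len` new tokens appended after
/// `offset` already-cached positions: 0.0 where attention is allowed,
/// -inf where a query would look at a future key. Illustrative sketch;
/// the actual path uses ScaledDotProductAttentionMask::Causal.
fn causal_mask(q_len: usize, offset: usize) -> Vec<Vec<f32>> {
    let k_len = offset + q_len;
    (0..q_len)
        .map(|q| {
            (0..k_len)
                .map(|k| if k <= offset + q { 0.0 } else { f32::NEG_INFINITY })
                .collect()
        })
        .collect()
}
```

Note the single-token decode case (`q_len == 1`) produces an all-zero row, which is why decoding worked even while prefill was broken.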
- EOS detection now uses config.json `eos_token_id` instead of hardcoded token IDs. Fixes premature stop on Qwen (token 2 = '#', not `</s>`).
- Reverted chunked prefill: the concatenation-based KV cache makes it worse, not better. Need a pre-allocated KV cache for real improvement.
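One wrinkle worth noting: Hugging Face configs ship `eos_token_id` as either a single integer or a list (some Llama 3 configs use a list). A small sketch of handling both shapes (the enum and function names are hypothetical):

```rust
/// config.json's eos_token_id may be one id or a list of ids.
enum EosIds {
    One(u32),
    Many(Vec<u32>),
}

/// Stop check against whatever shape the config provided.
fn is_eos(ids: &EosIds, token: u32) -> bool {
    match ids {
        EosIds::One(id) => *id == token,
        EosIds::Many(v) => v.contains(&token),
    }
}
```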
Replace concatenation-based KV cache with pre-allocated buffers using slice assignment (`index_mut`). Allocates in steps of 256 positions and grows with concatenation only when needed.

Critical fix: zeros pre-allocation now uses the same dtype as the incoming KV tensors (typically f16 from quantized matmul), eliminating f32→f16 type promotion on every attention call. Prefill chunk size set to 2048, matching mlx-lm's default.

Results on Qwen2.5-32B-Instruct-4bit:
- 2k TTFT: 14.2s → 11.8s (matches mlx-lm reference)
- 8k TTFT: 85.7s → 57.1s (matches mlx-lm reference)
- Decode 200 tokens: 12.5s → 8.7s (~23 tok/s, faster than llama-server)
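The allocation policy above reduces to simple bookkeeping: round capacity up to 256-position steps and reallocate only when appended tokens overflow it. A sketch with hypothetical names (the real cache holds MLX arrays and writes via `index_mut` slice assignment):

```rust
const STEP: usize = 256;

/// Tracks filled length vs. allocated capacity for one layer's K/V buffers.
struct KvCacheMeta {
    len: usize,      // positions currently filled
    capacity: usize, // positions allocated
}

impl KvCacheMeta {
    fn new() -> Self {
        Self { len: 0, capacity: 0 }
    }

    /// Append `n` positions; returns true when the buffers must grow,
    /// i.e. the only time a concat/copy actually happens.
    fn push(&mut self, n: usize) -> bool {
        let needed = self.len + n;
        let grew = needed > self.capacity;
        if grew {
            // round the new capacity up to a multiple of STEP
            self.capacity = (needed + STEP - 1) / STEP * STEP;
        }
        self.len = needed;
        grew
    }
}
```

Most decode steps (`n == 1`) therefore touch no allocation at all, which is where the decode speedup comes from.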
Keep KV caches + token IDs from the previous request. On the next request, find the longest common prefix and skip re-prefilling those tokens; only forward the new suffix. This is the standard approach used by llama-server (slot reuse), mlx-lm, and Ollama. Critical for agent workloads where each turn shares the same system prompt + conversation history.

Results on Qwen2.5-32B-Instruct-4bit, ~1.2k token prompt:
- Cold TTFT: 6.2s
- Warm TTFT: 0.44s (14× faster, only 9 new tokens to prefill)

Also adds `KVCache::trim_to()` for rewinding the cache to a prefix length.
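The reuse decision is just a longest-common-prefix over token IDs. A sketch (hypothetical names) of the warm-path computation that precedes the cache rewind:

```rust
/// Length of the longest shared prefix between the cached token IDs and
/// the incoming prompt. Everything past this point must be re-prefilled.
fn common_prefix_len(cached: &[u32], prompt: &[u32]) -> usize {
    cached
        .iter()
        .zip(prompt.iter())
        .take_while(|(a, b)| a == b)
        .count()
}

/// Warm-start plan: (cache positions to keep, tokens to prefill).
/// At least one prompt token is always forwarded so the model produces
/// logits for the next prediction -- an assumption of this sketch,
/// though it matches how llama-server and mlx-lm handle full-match reuse.
fn plan_prefill(cached: &[u32], prompt: &[u32]) -> (usize, usize) {
    let keep = common_prefix_len(cached, prompt).min(prompt.len().saturating_sub(1));
    (keep, prompt.len() - keep)
}
```

On a multi-turn agent request this is why only the 9 new suffix tokens get forwarded: the shared system prompt and history match the cache and are kept in place.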
Updated after the latest MLX UX, templating, smoke-test, and catalog work:
Still deferred intentionally: actual Gemma MLX model loading/runtime support. That work is tracked separately in #142; only Gemma prompt/template rendering support is included in this PR.
Pull request overview
Adds a macOS-only native MLX (Metal) inference backend that can serve supported safetensors model directories in-process, integrating with existing runtime/model resolution flows and adding CI smoke coverage for real MLX models.
Changes:
- Introduces a new `mlx` backend (server, sampling, HF template rendering) and wires it into runtime election/local serving paths on macOS.
- Extends model resolution + CLI UX to support MLX artifacts, repo shorthand resolution, and explicit `--gguf`/`--mlx` preferences plus `--gguf-file`/`--mlx-file`.
- Adds MLX catalog entries + download sidecar handling, and a macOS CI smoke matrix that runs real inference.
Reviewed changes
Copilot reviewed 28 out of 29 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| scripts/ci-mlx-smoke-test.sh | Adds a macOS MLX end-to-end smoke script (startup, inference, template checks). |
| mesh-llm/src/runtime/mod.rs | Adds CLI validation, MLX file handling, repo shorthand resolution preference, and skips rpc-server for MLX. |
| mesh-llm/src/runtime/local.rs | Normalizes MLX model names and launches MLX in-process server when applicable. |
| mesh-llm/src/models/resolve.rs | Adds repo shorthand resolution with artifact selection + preference (--gguf/--mlx). |
| mesh-llm/src/models/prompt.rs | Adds prompt behavior inference from HF templates / fallback heuristics. |
| mesh-llm/src/models/mod.rs | Exposes prompt behavior + format preference in the models module API. |
| mesh-llm/src/models/catalog.rs | Adds MLX sidecars + safetensors shard discovery for HF downloads. |
| mesh-llm/src/models/catalog.json | Adds MLX catalog entries (Qwen/Llama/Gemma MLX variants). |
| mesh-llm/src/models/capabilities.rs | Extends capability inference to include .jinja template sources. |
| mesh-llm/src/mlx/* | Adds MLX backend implementation: HTTP server, sampling, template rendering, and test corpus. |
| mesh-llm/src/mesh/mod.rs | Improves served model descriptor inference using known model paths. |
| mesh-llm/src/lib.rs | Enables the mlx module on macOS builds. |
| mesh-llm/src/inference/launch.rs | Adds an “in-process” inference server handle for MLX lifecycle shutdown. |
| mesh-llm/src/inference/election.rs | Routes MLX models to the in-process MLX server on macOS. |
| mesh-llm/src/cli/* | Adds new CLI flags and backend preference options for model download/load commands. |
| mesh-llm/Cargo.toml | Adds macOS-only deps for MLX + template rendering. |
| README.md | Documents MLX catalog usage, repo shorthand, and new flags. |
| MLX_ROADMAP.md | Adds roadmap / scope tracking doc for MLX backend work. |
| AGENTS.md | Adds guidance to avoid parallel Rust build/test/format in a worktree. |
| .github/workflows/ci.yml | Adds a macOS MLX CI job matrix running real inference via the smoke script. |
```
# Conflicts:
#	scripts/package-release.sh
```

```
# Conflicts:
#	mesh-llm/src/runtime/local.rs
```
In-process Metal GPU inference via `mlx-rs` for Apple Silicon. This branch is ported onto the current `main` tree and now includes the follow-up runtime, UX, template, download, catalog, documentation, and smoke-test work needed to make MLX practical to use.

What
- New `mlx` backend under `mesh-llm/src/mlx/`
- Drop-in replacement for `llama-server`
- Skips `rpc-server` for MLX-backed local serving paths

Added Since The Initial Port
- `--gguf-file` / `--mlx-file` for explicit artifact files
- `--gguf` / `--mlx` format preferences
- `--mlx` for MLX runtime selection from `--model`
- Download sidecars: `config.json`, `tokenizer.json`, `tokenizer_config.json` when present, `chat_template.json` when present, `chat_template.jinja` when present, `model.safetensors.index.json`
- `README.md` and `MLX_ROADMAP.md` updated to match current MLX behavior and scope

CLI Examples
Compatibility Matrix
Prompt / Serving Improvements
- Chat templates rendered with `minijinja`, with fallback to built-in templates if compile/render fails
- `.jinja`-backed Qwen 3 and Qwen 3 coder templates
- Sampling parameters: `temperature`, `top_p`, `top_k`, `seed`, `stop`

Testing / Smoke Coverage
- macOS CI gated behind `ENABLE_MACOS_CI=true` so repos without mac runners do not queue forever

Risks / Mitigations
- `--mlx-file`; detection now prefers config metadata and only falls back to header inspection in `mlx/model.rs`. Repo shorthand also supports explicit `--mlx`/`--gguf` selection when a Hugging Face repo is ambiguous.
- `InferenceServerHandle` lifecycle, with explicit shutdown handling added; verified with `cargo test -p mesh-llm --lib`, `just build`, unit/request-shape smokes, live local MLX smokes, and the macOS end-to-end MLX/GGUF smoke matrices.
- `--mlx` (or `--mlx-file`) and prints an experimental warning that points users at the issues page.

Follow-up Fixes Included
- EOS detection from `config.json`

Deferred

- Gemma MLX model loading/runtime support (tracked in #142)
Verification
- `cargo test -p mesh-llm --lib`
- `cargo test -p mesh-llm --lib models::resolve::tests -- --nocapture`
- `cargo test -p mesh-llm --lib runtime::tests -- --nocapture`
- `just build`

Live MLX smokes against:
- `mlx-community/Qwen2.5-0.5B-Instruct-4bit`
- `mlx-community/Qwen3-0.6B-4bit`
- `mlx-community/JOSIE-IT1-Qwen3-0.6B-4bit`
- `mlx-community/Llama-3.2-1B-Instruct-4bit`
- `mlx-community/gemma-2-2b-it-4bit`
- `mlx-community/gemma-3-1b-it-qat-4bit`
- `mlx-community/GLM-4-9B-0414-4bit`
- `mlx-community/LFM2-350M-4bit`
- `unsloth/gemma-4-E4B-it-UD-MLX-4bit`

Notes
- MLX code is gated behind `cfg(target_os = "macos")`, so non-macOS builds keep the existing backend behavior
- `MLX_ROADMAP.md` now tracks the remaining family/runtime gaps and parity work