
Native MLX inference server for Apple Silicon#103

Draft
michaelneale wants to merge 196 commits into main from mlx-rs-native

Conversation

@michaelneale
Collaborator

@michaelneale michaelneale commented Mar 31, 2026

In-process Metal GPU inference via mlx-rs for Apple Silicon. This branch is ported onto the current main tree and now includes the follow-up runtime, UX, template, download, catalog, documentation, and smoke-test work needed to make MLX practical to use.

What

  • adds a macOS-only mlx backend under mesh-llm/src/mlx/
  • detects supported MLX safetensors model directories and serves them in-process instead of launching llama-server
  • skips rpc-server for MLX-backed local serving paths
  • integrates MLX with the current runtime, election, and local model-loading flow
  • uses the current inference-server lifecycle model so in-process MLX servers shut down cleanly

Added Since The Initial Port

  • adds explicit raw file flags:
    • --gguf-file
    • --mlx-file
  • adds backend preference flags for ambiguous Hugging Face repos:
    • --gguf
    • --mlx
  • now requires explicit --mlx for MLX runtime selection from --model
  • prints a one-time experimental startup warning for MLX and points users at the GitHub issues page
  • supports Hugging Face repo shorthand for both runtime load and download when the repo has one clear primary artifact
  • downloads MLX sidecars automatically for MLX snapshot refs:
    • config.json
    • tokenizer.json
    • tokenizer_config.json when present
    • chat_template.json when present
    • chat_template.jinja when present
    • shard files from model.safetensors.index.json
  • adds explicit MLX catalog entries for supported model families
  • broadens HF template compatibility for real Qwen 3, Qwen 3 coder, Kimi, Llama, Gemma, Mistral, GLM, GPT-OSS, and related MLX repos
  • fixes MLX token streaming/decoding for split UTF-8 byte fallback output
  • refreshes README.md and MLX_ROADMAP.md to match current MLX behavior and scope

CLI Examples

# Use an MLX catalog model
mesh-llm --model Qwen3-4B-MLX --mlx
mesh-llm --model Gemma-2-2B-it-MLX --mlx
mesh-llm --model GLM-4-9B-0414-MLX --mlx
mesh-llm --model LFM2-350M-MLX --mlx

# Download an MLX catalog model
mesh-llm models download Qwen3-4B-MLX
mesh-llm models download LFM2-350M-MLX

# Load an exact Hugging Face MLX artifact
mesh-llm --model mlx-community/Qwen2.5-0.5B-Instruct-4bit/model.safetensors.index.json --mlx

# Load from an unambiguous Hugging Face repo shorthand
mesh-llm --model mlx-community/Qwen2.5-0.5B-Instruct-4bit --mlx

# Download from an unambiguous Hugging Face repo shorthand
mesh-llm models download mlx-community/Qwen2.5-0.5B-Instruct-4bit

# Force GGUF or MLX when a repo has multiple candidates
mesh-llm --model some-org/some-repo --gguf
mesh-llm --model some-org/some-repo --mlx

# Raw local files remain explicit
mesh-llm --gguf-file ~/models/custom.gguf
mesh-llm --mlx-file ~/models/qwen-mlx/model.safetensors

Compatibility Matrix

| Family / capability | GGUF / llama backend | Native MLX backend |
| --- | --- | --- |
| Llama text | Supported | Supported |
| Qwen 2.x text | Supported | Supported |
| Qwen 3.x text | Supported | Supported |
| Gemma 2 text | Supported | Supported |
| Gemma 3 text | Supported | Supported |
| Gemma 4 text | Supported | Supported |
| GLM4 text | Supported | Supported |
| LFM2 text | Not a current GGUF target in this PR | Supported |
| DeepSeekV3 / Kimi K2 text | Supported where GGUF/runtime path exists | Supported in native MLX runtime base; compile/test verified, not live-smoked in CI because public MLX repos are very large |
| Kimi Linear | Not claimed here | Supported in native MLX runtime; compile/test verified, not live-smoked yet |
| gpt-oss | Not claimed here | Supported in native MLX runtime; compile/test verified, not live-smoked yet |
| Vision | Supported on llama-backed path | Not supported yet |
| Audio | Not a first-class runtime path today | Not supported |
| Distributed / split serving | Supported on llama-backed path | Not supported yet; tracked in #146 |

Prompt / Serving Improvements

  • uses Hugging Face chat templates when available for MLX models
  • renders HF templates with minijinja, with fallback to built-in templates if compile/render fails
  • adds Gemma prompt-template support on the rendering side
  • improves compatibility for real .jinja-backed Qwen 3 and Qwen 3 coder templates
  • adds reasoning-control support across Qwen3, GLM, Kimi, gpt-oss, and LFM2 template families
  • adds family-aware MLX reasoning-output handling so supported reasoning families default to direct answers when thinking is off
  • fixes streaming token decode for split UTF-8 / byte-fallback output in the MLX server path
  • honors common decoding controls in the MLX server path:
    • temperature
    • top_p
    • top_k
    • seed
    • stop
  • adds clearer logging around HF template load vs fallback behavior
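
To illustrate how two of the honored decoding controls compose, here is a hedged sketch of temperature scaling followed by top_k filtering over raw logits. The function name and shapes are assumptions for illustration, not the actual mesh-llm server code:

```rust
// Illustrative only: scale logits by temperature, keep the top_k largest,
// and softmax over the survivors to get a sampling distribution.
fn top_k_probs(logits: &[f32], temperature: f32, top_k: usize) -> Vec<(usize, f32)> {
    // Scale by temperature (guarding against division by zero).
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature.max(1e-6)).collect();
    // Sort token indices by descending scaled logit and keep the top k.
    let mut idx: Vec<usize> = (0..scaled.len()).collect();
    idx.sort_by(|&a, &b| scaled[b].partial_cmp(&scaled[a]).unwrap());
    idx.truncate(top_k.max(1));
    // Numerically stable softmax over the surviving logits.
    let max = scaled[idx[0]];
    let exps: Vec<f32> = idx.iter().map(|&i| (scaled[i] - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    idx.into_iter().zip(exps.into_iter().map(|e| e / sum)).collect()
}
```

A seeded RNG would then draw from this distribution; top_p filtering would be a further cumulative-probability cut over the same sorted list.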

Testing / Smoke Coverage

  • adds a macOS MLX smoke matrix in CI covering:
    • Qwen 2.5
    • Qwen 3
    • JOSIE / Qwen 3
    • Llama 3.2
    • Gemma 2
    • Gemma 3
    • Gemma 4
    • GLM4
    • LFM2
  • adds a matching macOS GGUF smoke matrix for the comparable llama-backed families where practical
  • adds prompt-suite smoke coverage so both MLX and GGUF paths get multiple trivial-output probes rather than a single happy-path prompt
  • adds unit and request-shape smoke coverage for:
    • Llama-style prompts
    • Qwen-style prompts
    • Gemma-style prompts
    • tool-aware HF chat templates
    • split UTF-8 streaming decode
    • MLX reasoning-output handling
  • adds a real HF template corpus covering 15 MLX repos
  • macOS CI jobs are gated behind ENABLE_MACOS_CI=true so repos without mac runners do not queue forever

Risks / Mitigations

| Risk | Recommendation | Mitigation |
| --- | --- | --- |
| Safetensors auto-detection is still a real decision point, but it is now materially narrower than before. The branch now prefers trusted config metadata and only falls back to safetensors-header inspection when needed. | Keep the current approach: prefer HF/config metadata first, use header inspection only as fallback, and fail clearly when a safetensors model is present but not supported by the MLX backend. | ✅ Added explicit --mlx-file; detection now prefers config metadata and only falls back to header inspection in mlx/model.rs. Repo shorthand also supports explicit --mlx / --gguf selection when a Hugging Face repo is ambiguous. |
| MLX bypasses the existing rpc/split path for supported MLX models on macOS. | Treat this as an intentional current limitation, not an accidental regression. Keep it documented and track future distributed MLX support separately. See #146. | ✅ This is by design in the current rollout. Supported MLX models route to the local MLX backend, while GGUF models continue to use the llama.cpp path that supports rpc/split distribution. Follow-up tracked in #146. |
| MLX serving behavior is not yet feature-parity with llama.cpp. | Merge with clear scope: position MLX as a supported backend for a subset of models/workflows, not a full behavioral clone yet. Track full parity as future work rather than blocking on it. | ✅ This now covers Llama, Qwen2, Qwen3, Gemma2/3/4, GLM4, and LFM2 text serving, plus Hugging Face chat templates, reasoning-aware template and response controls, and streaming-safe token decoding. |
| Some larger MLX families are now compile/runtime-supported but not practical to live-smoke in CI because the public repos are too large or not yet operationally reasonable for the current matrix. | Be explicit about the difference between compile/test validation and live-smoke validation. Add real live validation later when smaller practical targets exist or dedicated hardware is available. | ✅ DeepSeekV3 / Kimi-K2, gpt-oss, and Kimi Linear are covered by model tests, but not yet part of the live smoke matrix. |
| In-process MLX server lifecycle is new control-path logic compared with the older subprocess-based llama flow. | Keep the current shutdown/test coverage, and add targeted regression tests later around startup/shutdown, reload, and interrupted requests if MLX becomes a maintained backend. | ✅ Adapted MLX to the current InferenceServerHandle lifecycle, added explicit shutdown handling, and verified with cargo test -p mesh-llm --lib, just build, unit/request-shape smokes, live local MLX smokes, and the macOS end-to-end MLX/GGUF smoke matrices. |
| MLX could be chosen accidentally when a user really intended the more mature GGUF path. | Make MLX explicit and set expectations clearly when users opt in. | ✅ MLX runtime selection now requires explicit --mlx (or --mlx-file) and prints an experimental warning that points users at the issues page. |

Follow-up Fixes Included

  • causal mask fix for prefill on larger models
  • EOS token loading from config.json
  • pre-allocated KV cache with dtype fixes
  • prompt KV cache reuse between requests
  • support for Gemma 2 / 3 / 4 text, GLM4 text, LFM2 text, gpt-oss, and Kimi Linear in the native MLX runtime
  • explicit MLX runtime opt-in and updated documentation

Deferred

Verification

  • cargo test -p mesh-llm --lib
  • cargo test -p mesh-llm --lib models::resolve::tests -- --nocapture
  • cargo test -p mesh-llm --lib runtime::tests -- --nocapture
  • just build
  • live local MLX smokes for:
    • mlx-community/Qwen2.5-0.5B-Instruct-4bit
    • mlx-community/Qwen3-0.6B-4bit
    • mlx-community/JOSIE-IT1-Qwen3-0.6B-4bit
    • mlx-community/Llama-3.2-1B-Instruct-4bit
    • mlx-community/gemma-2-2b-it-4bit
    • mlx-community/gemma-3-1b-it-qat-4bit
    • mlx-community/GLM-4-9B-0414-4bit
    • mlx-community/LFM2-350M-4bit
    • unsloth/gemma-4-E4B-it-UD-MLX-4bit

Notes

  • guarded by cfg(target_os = "macos"), so non-macOS builds keep the existing backend behavior
  • GGUF models continue to use the existing llama.cpp path
  • supported MLX models on macOS use the local-only MLX path today
  • proxy routing stays unchanged; MLX serves the same local OpenAI-compatible HTTP surface used by the existing runtime
  • MLX_ROADMAP.md now tracks the remaining family/runtime gaps and parity work

@i386
Collaborator

i386 commented Mar 31, 2026

I will align #56 to better suit this PR - I abstracted out backends a little too early.

@michaelneale
Collaborator Author

Notes requested by mic: TurboQuant KV cache compression for the MLX path

Crawled TheTom/turboquant_plus — an implementation of TurboQuant (ICLR 2026) KV cache compression for llama.cpp. The core findings are directly applicable to our MLX inference path since we own the full KV cache lifecycle.

Why this matters

At 32K context, KV cache is multiple GB. TurboQuant achieves 3.8–5.1× compression with ~1% PPL impact. That means 4× the context length or 4× concurrent users from the same Mac.

The algorithm is simple

PolarQuant (the production variant — they dropped QJL, the 1-bit error correction, because it hurts quality through softmax amplification):

  1. WHT rotation — Walsh-Hadamard butterfly + random sign flips. Makes any distribution Gaussian. O(d log d). ~40 lines.
  2. Extract norm — store ||v|| per block, normalize.
  3. Quantize — snap each rotated coordinate to nearest centroid from a precomputed table (8 entries for 3-bit, 16 for 4-bit). Pack indices into bytes.
  4. Dequant — unpack, lookup centroids, apply norm correction (rescale so reconstructed vector has original norm), inverse WHT.
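
Step 1 can be sketched in plain Rust. This is an illustrative in-place Walsh-Hadamard butterfly only, under assumed shapes; the real kernel would also apply the random sign flips before the transform, and would likely be a fused Metal kernel rather than scalar CPU code:

```rust
// In-place Walsh-Hadamard transform over a power-of-two-length slice.
// With the 1/sqrt(d) normalization the transform is its own inverse,
// so applying it twice recovers the original vector.
fn wht_inplace(v: &mut [f32]) {
    let d = v.len();
    assert!(d.is_power_of_two());
    let mut h = 1;
    while h < d {
        // Butterfly stage: combine elements h apart within each 2h block.
        for block in v.chunks_mut(h * 2) {
            for i in 0..h {
                let (a, b) = (block[i], block[i + h]);
                block[i] = a + b;
                block[i + h] = a - b;
            }
        }
        h *= 2;
    }
    // Orthonormal scaling: the unnormalized transform satisfies H * H = d * I.
    let scale = 1.0 / (d as f32).sqrt();
    for x in v.iter_mut() {
        *x *= scale;
    }
}
```

The O(d log d) cost claimed above falls out directly: log2(d) stages of d/2 add/subtract pairs each.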

The centroids are just constants — Lloyd-Max optimal values for a Gaussian distribution. Literally a [f32; 8] or [f32; 16].
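
The centroid step could look roughly like this. A hedged sketch, not the TurboQuant reference code: the table holds the standard 8-level (3-bit) Lloyd-Max optimum for a unit Gaussian, to about four decimal places:

```rust
// Precomputed Lloyd-Max centroids for a unit Gaussian, 3-bit (8 levels).
const CENTROIDS_3BIT: [f32; 8] = [
    -2.1520, -1.3439, -0.7560, -0.2451, 0.2451, 0.7560, 1.3439, 2.1520,
];

// Quantize: snap each coordinate to the index of its nearest centroid.
fn quantize(v: &[f32]) -> Vec<u8> {
    v.iter()
        .map(|&x| {
            CENTROIDS_3BIT
                .iter()
                .enumerate()
                .min_by(|(_, a), (_, b)| {
                    (x - *a).abs().partial_cmp(&(x - *b).abs()).unwrap()
                })
                .map(|(i, _)| i as u8)
                .unwrap()
        })
        .collect()
}

// Dequantize: table lookup. Norm correction would be applied afterward.
fn dequantize(idx: &[u8]) -> Vec<f32> {
    idx.iter().map(|&i| CENTROIDS_3BIT[i as usize]).collect()
}
```

In the actual path this would run on rotated, normalized blocks, and the lookup would be a vectorized `take` rather than a scalar scan.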

Where it plugs in

The current KVCache struct stores Option<Array> (full precision, concatenated every token). The compression slots into update():

// Current: store raw
self.keys = Some(new_k.clone());

// With TurboQuant: quantize on store, dequant on read
// Store compressed representation (indices + norms) instead of full arrays
// Dequant before returning to scaled_dot_product_attention

All the ops are expressible through mlx_rs::ops:

  • WHT butterfly → sequence of add/subtract on array slices (or a small custom Metal kernel)
  • Sign flips → elementwise multiply with precomputed ±1 vector
  • Centroid lookup → take op on a tiny table
  • Norm extraction → mlx_rs::ops::sqrt + sum of squares

Key findings from TurboQuant research

  1. Asymmetric K/V is important — K precision dominates quality (softmax amplifies K errors exponentially). V tolerates aggressive compression (errors scale linearly). Best config: keep K at higher precision, compress V harder.

  2. Sparse V dequant — skip V dequantization when attention weight < 1e-6. At 32K context, 90%+ of weights are negligible. +22.8% decode speedup, zero quality cost. This would require either a custom attention implementation or a hook into scaled_dot_product_attention — can't be done through the opaque mlx_rs::fast::scaled_dot_product_attention call.

  3. Boundary V — first 2 + last 2 layers get higher V precision, rest compressed aggressively. Recovers 37–91% of quality gap. 15 lines of code, just a per-layer policy decision in the cache.

  4. Block size 128 — matching head_dim gives 5.12× compression (one norm per rotation group, zero redundancy).
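
Finding 3 really is just a per-layer policy. A minimal sketch, with illustrative bit widths (the exact precisions per tier are an assumption, not from the paper):

```rust
// "Boundary V" policy: first 2 and last 2 layers keep V at higher
// precision; interior layers are compressed more aggressively.
#[derive(Debug, PartialEq, Clone, Copy)]
enum VPrecision {
    Bits4, // boundary layers: higher-precision V
    Bits3, // interior layers: aggressive compression
}

fn v_precision(layer: usize, n_layers: usize) -> VPrecision {
    if layer < 2 || layer >= n_layers.saturating_sub(2) {
        VPrecision::Bits4
    } else {
        VPrecision::Bits3
    }
}
```

The cache would consult this once per layer at allocation time; nothing in the hot path changes.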

Pragmatic path

Start with pure mlx_rs::ops quantized KV cache in KVCache::update() — get the memory savings immediately. ~100-150 lines of Rust. Fused Metal kernels and sparse V can come later.
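
Part of those lines is the "pack indices into bytes" step from the algorithm above. A hedged sketch using 4-bit indices (two per byte) for clarity; 3-bit packing is analogous but crosses byte boundaries:

```rust
// Pack 4-bit centroid indices two per byte: low nibble first.
fn pack_4bit(indices: &[u8]) -> Vec<u8> {
    indices
        .chunks(2)
        .map(|pair| (pair[0] & 0x0F) | (pair.get(1).copied().unwrap_or(0) << 4))
        .collect()
}

// Unpack n indices back out of the packed byte stream.
fn unpack_4bit(packed: &[u8], n: usize) -> Vec<u8> {
    let mut out = Vec::with_capacity(n);
    for &b in packed {
        out.push(b & 0x0F);
        if out.len() < n {
            out.push(b >> 4);
        }
    }
    out.truncate(n);
    out
}
```

At 4 bits per coordinate plus one norm per 128-element block, this is where the ~4x memory figure comes from relative to f16 storage.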

This is completely independent of the llama.cpp path — they share nothing. ggml manages its own KV cache with C structs and Metal shaders; MLX manages its own with Rust arrays. The only shared thing would be the centroid constants.

michaelneale and others added 6 commits April 3, 2026 08:45
In-process Metal GPU inference via mlx-rs (Rust → mlx-c), no Python/subprocess.
Drop-in replacement for llama-server when model is a safetensors directory.

- mlx/model.rs: Qwen2/Llama transformer, quantized matmul, RoPE, KV cache
- mlx/server.rs: OpenAI-compatible HTTP (streaming SSE + blocking)
- Auto-detect safetensors dirs, skip rpc-server (solo-only)
- cfg(target_os = macos) — zero impact on Linux builds
Without causal masking during prefill, larger models (32B) produced
degenerate output (repeating template tokens). Use MLX's built-in
ScaledDotProductAttentionMask::Causal for multi-token prefill.

Tested: Qwen2.5-32B-Instruct-4bit — correct output, ~23 tok/s.
- EOS detection now uses config.json eos_token_id instead of hardcoded
  token IDs. Fixes premature stop on Qwen (token 2 = '#', not </s>).
- Reverted chunked prefill — concatenation-based KV cache makes it
  worse not better. Need pre-allocated KV cache for real improvement.
Replace concatenation-based KV cache with pre-allocated buffers using
slice assignment (index_mut). Allocates in steps of 256 positions and
grows with concatenation only when needed.

Critical fix: zeros pre-allocation now uses the same dtype as the
incoming KV tensors (typically f16 from quantized matmul), eliminating
f32→f16 type promotion on every attention call.

Prefill chunk size set to 2048 matching mlx-lm's default.

Results on Qwen2.5-32B-Instruct-4bit:
- 2k TTFT: 14.2s → 11.8s (matches mlx-lm reference)
- 8k TTFT: 85.7s → 57.1s (matches mlx-lm reference)
- Decode 200 tokens: 12.5s → 8.7s (~23 tok/s, faster than llama-server)
Keep KV caches + token IDs from the previous request. On the next
request, find the longest common prefix and skip re-prefilling those
tokens — only forward the new suffix.

This is the standard approach used by llama-server (slot reuse),
mlx-lm, and Ollama. Critical for agent workloads where each turn
shares the same system prompt + conversation history.

Results on Qwen2.5-32B-Instruct-4bit, ~1.2k token prompt:
- Cold TTFT: 6.2s
- Warm TTFT: 0.44s (14x faster, only 9 new tokens to prefill)

Also adds KVCache::trim_to() for rewinding cache to a prefix length.
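
The prefix-matching step this commit describes can be sketched as follows (hypothetical helper, not the actual mesh-llm code): find the longest common prefix between the cached token IDs and the new prompt, then trim the cache to that length and prefill only the suffix.

```rust
// Length of the longest shared prefix between the previous request's
// token IDs and the new prompt's token IDs.
fn common_prefix_len(cached: &[u32], prompt: &[u32]) -> usize {
    cached
        .iter()
        .zip(prompt.iter())
        .take_while(|(a, b)| a == b)
        .count()
}
```

In the warm-TTFT example above, this is what reduces a ~1.2k-token prefill to 9 new tokens.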
@i386
Copy link
Copy Markdown
Collaborator

i386 commented Apr 2, 2026

Updated after the latest MLX UX, templating, smoke-test, and catalog work:

| Risk | Recommendation | Mitigation |
| --- | --- | --- |
| Safetensors auto-detection is still a real decision point, but it is now materially narrower than before. The branch now prefers trusted config metadata and only falls back to safetensors-header inspection when needed. | Keep the current approach: prefer HF/config metadata first, use header inspection only as fallback, and fail clearly when a safetensors model is present but not supported by the MLX backend. | ✅ Added explicit --mlx-file; detection now prefers config metadata and only falls back to header inspection in mlx/model.rs. Repo shorthand also supports explicit --mlx / --gguf selection when a Hugging Face repo is ambiguous. |
| MLX bypasses the existing rpc/split path for supported MLX models on macOS. | Treat this as an intentional current limitation, not an accidental regression. Keep it documented and track future distributed MLX support separately. See #146. | This is by design in the current rollout. Supported MLX models route to the local MLX backend, while GGUF models continue to use the llama.cpp path that supports rpc/split distribution. Follow-up tracked in #146. |
| MLX serving behavior is not yet feature-parity with llama.cpp. Current path is narrower in areas like chat templating and sampling behavior. | Merge with clear scope: position MLX as a supported backend for a subset of models/workflows, not a full behavioral clone yet. Track full parity as future work rather than blocking on it. | ✅ This is materially improved: the MLX path now uses Hugging Face chat templates when available via minijinja, falls back cleanly to built-in templates, supports Gemma prompt rendering on the template side, and honors common decoding controls (temperature, top_p, top_k, seed, stop). |
| In-process MLX server lifecycle is new control-path logic compared with the older subprocess-based llama flow. | Keep the current shutdown/test coverage, and add targeted regression tests later around startup/shutdown, reload, and interrupted requests if MLX becomes a maintained backend. | ✅ Adapted MLX to the current InferenceServerHandle lifecycle, added explicit shutdown handling, and verified with cargo test -p mesh-llm --lib, just build, unit/request-shape smokes, and the new macOS end-to-end MLX smoke path. |

Still deferred intentionally: actual Gemma MLX model loading/runtime support. That work is tracked separately in #142; only Gemma prompt/template rendering support is included in this PR.

@i386 i386 requested a review from Copilot April 3, 2026 07:09
@i386 i386 marked this pull request as ready for review April 3, 2026 07:09
Contributor

Copilot AI left a comment


Pull request overview

Adds a macOS-only native MLX (Metal) inference backend that can serve supported safetensors model directories in-process, integrating with existing runtime/model resolution flows and adding CI smoke coverage for real MLX models.

Changes:

  • Introduces a new mlx backend (server, sampling, HF template rendering) and wires it into runtime election/local serving paths on macOS.
  • Extends model resolution + CLI UX to support MLX artifacts, repo shorthand resolution, and explicit --gguf/--mlx preferences plus --gguf-file/--mlx-file.
  • Adds MLX catalog entries + download sidecar handling, and a macOS CI smoke matrix that runs real inference.

Reviewed changes

Copilot reviewed 28 out of 29 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| scripts/ci-mlx-smoke-test.sh | Adds a macOS MLX end-to-end smoke script (startup, inference, template checks). |
| mesh-llm/src/runtime/mod.rs | Adds CLI validation, MLX file handling, repo shorthand resolution preference, and skips rpc-server for MLX. |
| mesh-llm/src/runtime/local.rs | Normalizes MLX model names and launches MLX in-process server when applicable. |
| mesh-llm/src/models/resolve.rs | Adds repo shorthand resolution with artifact selection + preference (--gguf/--mlx). |
| mesh-llm/src/models/prompt.rs | Adds prompt behavior inference from HF templates / fallback heuristics. |
| mesh-llm/src/models/mod.rs | Exposes prompt behavior + format preference in the models module API. |
| mesh-llm/src/models/catalog.rs | Adds MLX sidecars + safetensors shard discovery for HF downloads. |
| mesh-llm/src/models/catalog.json | Adds MLX catalog entries (Qwen/Llama/Gemma MLX variants). |
| mesh-llm/src/models/capabilities.rs | Extends capability inference to include .jinja template sources. |
| mesh-llm/src/mlx/* | Adds MLX backend implementation: HTTP server, sampling, template rendering, and test corpus. |
| mesh-llm/src/mesh/mod.rs | Improves served model descriptor inference using known model paths. |
| mesh-llm/src/lib.rs | Enables the mlx module on macOS builds. |
| mesh-llm/src/inference/launch.rs | Adds an "in-process" inference server handle for MLX lifecycle shutdown. |
| mesh-llm/src/inference/election.rs | Routes MLX models to the in-process MLX server on macOS. |
| mesh-llm/src/cli/* | Adds new CLI flags and backend preference options for model download/load commands. |
| mesh-llm/Cargo.toml | Adds macOS-only deps for MLX + template rendering. |
| README.md | Documents MLX catalog usage, repo shorthand, and new flags. |
| MLX_ROADMAP.md | Adds roadmap / scope tracking doc for MLX backend work. |
| AGENTS.md | Adds guidance to avoid parallel Rust build/test/format in a worktree. |
| .github/workflows/ci.yml | Adds a macOS MLX CI job matrix running real inference via the smoke script. |


Comment threads:

  • scripts/ci-mlx-smoke-test.sh (outdated)
  • mesh-llm/src/runtime/local.rs (outdated)
  • mesh-llm/src/runtime/mod.rs (outdated)
  • mesh-llm/src/runtime/mod.rs (outdated)
  • mesh-llm/src/mlx/server.rs
  • mesh-llm/src/mlx/server.rs (outdated)
  • mesh-llm/src/mlx/server.rs
  • mesh-llm/src/runtime/mod.rs (outdated)
i386 and others added 30 commits April 10, 2026 16:12
# Conflicts:
#	scripts/package-release.sh
# Conflicts:
#	mesh-llm/src/runtime/local.rs

Labels

blocker blocking other PRs experimental
