
Native MLX inference server for Apple Silicon#103

Draft
michaelneale wants to merge 196 commits into main from mlx-rs-native

Conversation

@michaelneale
Collaborator

@michaelneale michaelneale commented Mar 31, 2026

In-process Metal GPU inference via mlx-rs for Apple Silicon. This branch is ported onto the current main tree and now includes the follow-up runtime, UX, template, download, catalog, documentation, and smoke-test work needed to make MLX practical to use.

What

  • adds a macOS-only mlx backend under mesh-llm/src/mlx/
  • detects supported MLX safetensors model directories and serves them in-process instead of launching llama-server
  • skips rpc-server for MLX-backed local serving paths
  • integrates MLX with the current runtime, election, and local model-loading flow
  • uses the current inference-server lifecycle model so in-process MLX servers shut down cleanly

Added Since The Initial Port

  • adds explicit raw file flags:
    • --gguf-file
    • --mlx-file
  • adds backend preference flags for ambiguous Hugging Face repos:
    • --gguf
    • --mlx
  • now requires explicit --mlx for MLX runtime selection from --model
  • prints a one-time experimental startup warning for MLX and points users at the GitHub issues page
  • supports Hugging Face repo shorthand for both runtime load and download when the repo has one clear primary artifact
  • downloads MLX sidecars automatically for MLX snapshot refs:
    • config.json
    • tokenizer.json
    • tokenizer_config.json when present
    • chat_template.json when present
    • chat_template.jinja when present
    • shard files from model.safetensors.index.json
  • adds explicit MLX catalog entries for supported model families
  • broadens HF template compatibility for real Qwen 3, Qwen 3 coder, Kimi, Llama, Gemma, Mistral, GLM, GPT-OSS, and related MLX repos
  • fixes MLX token streaming/decoding for split UTF-8 byte fallback output
  • refreshes README.md and MLX_ROADMAP.md to match current MLX behavior and scope

CLI Examples

# Use an MLX catalog model
mesh-llm --model Qwen3-4B-MLX --mlx
mesh-llm --model Gemma-2-2B-it-MLX --mlx
mesh-llm --model GLM-4-9B-0414-MLX --mlx
mesh-llm --model LFM2-350M-MLX --mlx

# Download an MLX catalog model
mesh-llm models download Qwen3-4B-MLX
mesh-llm models download LFM2-350M-MLX

# Load an exact Hugging Face MLX artifact
mesh-llm --model mlx-community/Qwen2.5-0.5B-Instruct-4bit/model.safetensors.index.json --mlx

# Load from an unambiguous Hugging Face repo shorthand
mesh-llm --model mlx-community/Qwen2.5-0.5B-Instruct-4bit --mlx

# Download from an unambiguous Hugging Face repo shorthand
mesh-llm models download mlx-community/Qwen2.5-0.5B-Instruct-4bit

# Force GGUF or MLX when a repo has multiple candidates
mesh-llm --model some-org/some-repo --gguf
mesh-llm --model some-org/some-repo --mlx

# Raw local files remain explicit
mesh-llm --gguf-file ~/models/custom.gguf
mesh-llm --mlx-file ~/models/qwen-mlx/model.safetensors

Compatibility Matrix

| Family / capability | GGUF / llama backend | Native MLX backend |
| --- | --- | --- |
| Llama text | Supported | Supported |
| Qwen 2.x text | Supported | Supported |
| Qwen 3.x text | Supported | Supported |
| Gemma 2 text | Supported | Supported |
| Gemma 3 text | Supported | Supported |
| Gemma 4 text | Supported | Supported |
| GLM4 text | Supported | Supported |
| LFM2 text | Not a current GGUF target in this PR | Supported |
| DeepSeekV3 / Kimi K2 text | Supported where GGUF/runtime path exists | Supported in native MLX runtime base; compile/test verified, not live-smoked in CI because public MLX repos are very large |
| Kimi Linear | Not claimed here | Supported in native MLX runtime; compile/test verified, not live-smoked yet |
| gpt-oss | Not claimed here | Supported in native MLX runtime; compile/test verified, not live-smoked yet |
| Vision | Supported on llama-backed path | Not supported yet |
| Audio | Not a first-class runtime path today | Not supported |
| Distributed / split serving | Supported on llama-backed path | Not supported yet; tracked in #146 |

Prompt / Serving Improvements

  • uses Hugging Face chat templates when available for MLX models
  • renders HF templates with minijinja, with fallback to built-in templates if compile/render fails
  • adds Gemma prompt-template support on the rendering side
  • improves compatibility for real .jinja-backed Qwen 3 and Qwen 3 coder templates
  • adds reasoning-control support across Qwen3, GLM, Kimi, gpt-oss, and LFM2 template families
  • adds family-aware MLX reasoning-output handling so supported reasoning families default to direct answers when thinking is off
  • fixes streaming token decode for split UTF-8 / byte-fallback output in the MLX server path
  • honors common decoding controls in the MLX server path:
    • temperature
    • top_p
    • top_k
    • seed
    • stop
  • adds clearer logging around HF template load vs fallback behavior
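
To illustrate how two of the honored decoding controls compose, here is a hedged sketch of temperature scaling followed by top_k filtering over raw logits. The function name and shapes are assumptions for illustration, not the actual mesh-llm server code:

```rust
// Illustrative only: scale logits by temperature, keep the top_k largest,
// and softmax over the survivors to get a sampling distribution.
fn top_k_probs(logits: &[f32], temperature: f32, top_k: usize) -> Vec<(usize, f32)> {
    // Scale by temperature (guarding against division by zero).
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature.max(1e-6)).collect();
    // Sort token indices by descending scaled logit and keep the top k.
    let mut idx: Vec<usize> = (0..scaled.len()).collect();
    idx.sort_by(|&a, &b| scaled[b].partial_cmp(&scaled[a]).unwrap());
    idx.truncate(top_k.max(1));
    // Numerically stable softmax over the surviving logits.
    let max = scaled[idx[0]];
    let exps: Vec<f32> = idx.iter().map(|&i| (scaled[i] - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    idx.into_iter().zip(exps.into_iter().map(|e| e / sum)).collect()
}
```

A seeded RNG would then draw from this distribution; top_p filtering would be a further cumulative-probability cut over the same sorted list.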

Testing / Smoke Coverage

  • adds a macOS MLX smoke matrix in CI covering:
    • Qwen 2.5
    • Qwen 3
    • JOSIE / Qwen 3
    • Llama 3.2
    • Gemma 2
    • Gemma 3
    • Gemma 4
    • GLM4
    • LFM2
  • adds a matching macOS GGUF smoke matrix for the comparable llama-backed families where practical
  • adds prompt-suite smoke coverage so both MLX and GGUF paths get multiple trivial-output probes rather than a single happy-path prompt
  • adds unit and request-shape smoke coverage for:
    • Llama-style prompts
    • Qwen-style prompts
    • Gemma-style prompts
    • tool-aware HF chat templates
    • split UTF-8 streaming decode
    • MLX reasoning-output handling
  • adds a real HF template corpus covering 15 MLX repos
  • macOS CI jobs are gated behind ENABLE_MACOS_CI=true so repos without mac runners do not queue forever

Risks / Mitigations

| Risk | Recommendation | Mitigation |
| --- | --- | --- |
| Safetensors auto-detection is still a real decision point, but it is now materially narrower than before. The branch now prefers trusted config metadata and only falls back to safetensors-header inspection when needed. | Keep the current approach: prefer HF/config metadata first, use header inspection only as fallback, and fail clearly when a safetensors model is present but not supported by the MLX backend. | ✅ Added explicit --mlx-file; detection now prefers config metadata and only falls back to header inspection in mlx/model.rs. Repo shorthand also supports explicit --mlx / --gguf selection when a Hugging Face repo is ambiguous. |
| MLX bypasses the existing rpc/split path for supported MLX models on macOS. | Treat this as an intentional current limitation, not an accidental regression. Keep it documented and track future distributed MLX support separately. See #146. | ✅ This is by design in the current rollout. Supported MLX models route to the local MLX backend, while GGUF models continue to use the llama.cpp path that supports rpc/split distribution. Follow-up tracked in #146. |
| MLX serving behavior is not yet feature-parity with llama.cpp. | Merge with clear scope: position MLX as a supported backend for a subset of models/workflows, not a full behavioral clone yet. Track full parity as future work rather than blocking on it. | ✅ This now covers Llama, Qwen2, Qwen3, Gemma2/3/4, GLM4, and LFM2 text serving, plus Hugging Face chat templates, reasoning-aware template and response controls, and streaming-safe token decoding. |
| Some larger MLX families are now compile/runtime-supported but not practical to live-smoke in CI because the public repos are too large or not yet operationally reasonable for the current matrix. | Be explicit about the difference between compile/test validation and live-smoke validation. Add real live validation later when smaller practical targets exist or dedicated hardware is available. | ✅ DeepSeekV3 / Kimi-K2, gpt-oss, and Kimi Linear are covered by model tests, but not yet part of the live smoke matrix. |
| In-process MLX server lifecycle is new control-path logic compared with the older subprocess-based llama flow. | Keep the current shutdown/test coverage, and add targeted regression tests later around startup/shutdown, reload, and interrupted requests if MLX becomes a maintained backend. | ✅ Adapted MLX to the current InferenceServerHandle lifecycle, added explicit shutdown handling, and verified with cargo test -p mesh-llm --lib, just build, unit/request-shape smokes, live local MLX smokes, and the macOS end-to-end MLX/GGUF smoke matrices. |
| MLX could be chosen accidentally when a user really intended the more mature GGUF path. | Make MLX explicit and set expectations clearly when users opt in. | ✅ MLX runtime selection now requires explicit --mlx (or --mlx-file) and prints an experimental warning that points users at the issues page. |

Follow-up Fixes Included

  • causal mask fix for prefill on larger models
  • EOS token loading from config.json
  • pre-allocated KV cache with dtype fixes
  • prompt KV cache reuse between requests
  • support for Gemma 2 / 3 / 4 text, GLM4 text, LFM2 text, gpt-oss, and Kimi Linear in the native MLX runtime
  • explicit MLX runtime opt-in and updated documentation

Deferred

Verification

  • cargo test -p mesh-llm --lib
  • cargo test -p mesh-llm --lib models::resolve::tests -- --nocapture
  • cargo test -p mesh-llm --lib runtime::tests -- --nocapture
  • just build
  • live local MLX smokes for:
    • mlx-community/Qwen2.5-0.5B-Instruct-4bit
    • mlx-community/Qwen3-0.6B-4bit
    • mlx-community/JOSIE-IT1-Qwen3-0.6B-4bit
    • mlx-community/Llama-3.2-1B-Instruct-4bit
    • mlx-community/gemma-2-2b-it-4bit
    • mlx-community/gemma-3-1b-it-qat-4bit
    • mlx-community/GLM-4-9B-0414-4bit
    • mlx-community/LFM2-350M-4bit
    • unsloth/gemma-4-E4B-it-UD-MLX-4bit

Notes

  • guarded by cfg(target_os = "macos"), so non-macOS builds keep the existing backend behavior
  • GGUF models continue to use the existing llama.cpp path
  • supported MLX models on macOS use the local-only MLX path today
  • proxy routing stays unchanged; MLX serves the same local OpenAI-compatible HTTP surface used by the existing runtime
  • MLX_ROADMAP.md now tracks the remaining family/runtime gaps and parity work

@i386
Collaborator

i386 commented Mar 31, 2026

I will align #56 to better suit this PR - I abstracted out backends a little too early.

@michaelneale
Collaborator Author

Notes requested by mic: TurboQuant KV cache compression for the MLX path

Crawled TheTom/turboquant_plus — an implementation of TurboQuant (ICLR 2026) KV cache compression for llama.cpp. The core findings are directly applicable to our MLX inference path since we own the full KV cache lifecycle.

Why this matters

At 32K context, KV cache is multiple GB. TurboQuant achieves 3.8–5.1× compression with ~1% PPL impact. That means 4× the context length or 4× concurrent users from the same Mac.

The algorithm is simple

PolarQuant (the production variant — they dropped QJL, the 1-bit error correction, because it hurts quality through softmax amplification):

  1. WHT rotation — Walsh-Hadamard butterfly + random sign flips. Makes any distribution Gaussian. O(d log d). ~40 lines.
  2. Extract norm — store ||v|| per block, normalize.
  3. Quantize — snap each rotated coordinate to nearest centroid from a precomputed table (8 entries for 3-bit, 16 for 4-bit). Pack indices into bytes.
  4. Dequant — unpack, lookup centroids, apply norm correction (rescale so reconstructed vector has original norm), inverse WHT.
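
Step 1 can be sketched in plain Rust. This is an illustrative in-place Walsh-Hadamard butterfly only, under assumed shapes; the real kernel would also apply the random sign flips before the transform, and would likely be a fused Metal kernel rather than scalar CPU code:

```rust
// In-place Walsh-Hadamard transform over a power-of-two-length slice.
// With the 1/sqrt(d) normalization the transform is its own inverse,
// so applying it twice recovers the original vector.
fn wht_inplace(v: &mut [f32]) {
    let d = v.len();
    assert!(d.is_power_of_two());
    let mut h = 1;
    while h < d {
        // Butterfly stage: combine elements h apart within each 2h block.
        for block in v.chunks_mut(h * 2) {
            for i in 0..h {
                let (a, b) = (block[i], block[i + h]);
                block[i] = a + b;
                block[i + h] = a - b;
            }
        }
        h *= 2;
    }
    // Orthonormal scaling: the unnormalized transform satisfies H * H = d * I.
    let scale = 1.0 / (d as f32).sqrt();
    for x in v.iter_mut() {
        *x *= scale;
    }
}
```

The O(d log d) cost claimed above falls out directly: log2(d) stages of d/2 add/subtract pairs each.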

The centroids are just constants — Lloyd-Max optimal values for a Gaussian distribution. Literally a [f32; 8] or [f32; 16].
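
The centroid step could look roughly like this. A hedged sketch, not the TurboQuant reference code: the table holds the standard 8-level (3-bit) Lloyd-Max optimum for a unit Gaussian, to about four decimal places:

```rust
// Precomputed Lloyd-Max centroids for a unit Gaussian, 3-bit (8 levels).
const CENTROIDS_3BIT: [f32; 8] = [
    -2.1520, -1.3439, -0.7560, -0.2451, 0.2451, 0.7560, 1.3439, 2.1520,
];

// Quantize: snap each coordinate to the index of its nearest centroid.
fn quantize(v: &[f32]) -> Vec<u8> {
    v.iter()
        .map(|&x| {
            CENTROIDS_3BIT
                .iter()
                .enumerate()
                .min_by(|(_, a), (_, b)| {
                    (x - *a).abs().partial_cmp(&(x - *b).abs()).unwrap()
                })
                .map(|(i, _)| i as u8)
                .unwrap()
        })
        .collect()
}

// Dequantize: table lookup. Norm correction would be applied afterward.
fn dequantize(idx: &[u8]) -> Vec<f32> {
    idx.iter().map(|&i| CENTROIDS_3BIT[i as usize]).collect()
}
```

In the actual path this would run on rotated, normalized blocks, and the lookup would be a vectorized `take` rather than a scalar scan.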

Where it plugs in

The current KVCache struct stores Option<Array> (full precision, concatenated every token). The compression slots into update():

// Current: store raw
self.keys = Some(new_k.clone());

// With TurboQuant: quantize on store, dequant on read
// Store compressed representation (indices + norms) instead of full arrays
// Dequant before returning to scaled_dot_product_attention

All the ops are expressible through mlx_rs::ops:

  • WHT butterfly → sequence of add/subtract on array slices (or a small custom Metal kernel)
  • Sign flips → elementwise multiply with precomputed ±1 vector
  • Centroid lookup → take op on a tiny table
  • Norm extraction → mlx_rs::ops::sqrt + sum of squares

Key findings from TurboQuant research

  1. Asymmetric K/V is important — K precision dominates quality (softmax amplifies K errors exponentially). V tolerates aggressive compression (errors scale linearly). Best config: keep K at higher precision, compress V harder.

  2. Sparse V dequant — skip V dequantization when attention weight < 1e-6. At 32K context, 90%+ of weights are negligible. +22.8% decode speedup, zero quality cost. This would require either a custom attention implementation or a hook into scaled_dot_product_attention — can't be done through the opaque mlx_rs::fast::scaled_dot_product_attention call.

  3. Boundary V — first 2 + last 2 layers get higher V precision, rest compressed aggressively. Recovers 37–91% of quality gap. 15 lines of code, just a per-layer policy decision in the cache.

  4. Block size 128 — matching head_dim gives 5.12× compression (one norm per rotation group, zero redundancy).
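
Finding 3 really is just a per-layer policy. A minimal sketch, with illustrative bit widths (the exact precisions per tier are an assumption, not from the paper):

```rust
// "Boundary V" policy: first 2 and last 2 layers keep V at higher
// precision; interior layers are compressed more aggressively.
#[derive(Debug, PartialEq, Clone, Copy)]
enum VPrecision {
    Bits4, // boundary layers: higher-precision V
    Bits3, // interior layers: aggressive compression
}

fn v_precision(layer: usize, n_layers: usize) -> VPrecision {
    if layer < 2 || layer >= n_layers.saturating_sub(2) {
        VPrecision::Bits4
    } else {
        VPrecision::Bits3
    }
}
```

The cache would consult this once per layer at allocation time; nothing in the hot path changes.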

Pragmatic path

Start with pure mlx_rs::ops quantized KV cache in KVCache::update() — get the memory savings immediately. ~100-150 lines of Rust. Fused Metal kernels and sparse V can come later.
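
Part of those lines is the "pack indices into bytes" step from the algorithm above. A hedged sketch using 4-bit indices (two per byte) for clarity; 3-bit packing is analogous but crosses byte boundaries:

```rust
// Pack 4-bit centroid indices two per byte: low nibble first.
fn pack_4bit(indices: &[u8]) -> Vec<u8> {
    indices
        .chunks(2)
        .map(|pair| (pair[0] & 0x0F) | (pair.get(1).copied().unwrap_or(0) << 4))
        .collect()
}

// Unpack n indices back out of the packed byte stream.
fn unpack_4bit(packed: &[u8], n: usize) -> Vec<u8> {
    let mut out = Vec::with_capacity(n);
    for &b in packed {
        out.push(b & 0x0F);
        if out.len() < n {
            out.push(b >> 4);
        }
    }
    out.truncate(n);
    out
}
```

At 4 bits per coordinate plus one norm per 128-element block, this is where the ~4x memory figure comes from relative to f16 storage.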

This is completely independent of the llama.cpp path — they share nothing. ggml manages its own KV cache with C structs and Metal shaders; MLX manages its own with Rust arrays. The only shared thing would be the centroid constants.

michaelneale and others added 6 commits April 3, 2026 08:45
In-process Metal GPU inference via mlx-rs (Rust → mlx-c), no Python/subprocess.
Drop-in replacement for llama-server when model is a safetensors directory.

- mlx/model.rs: Qwen2/Llama transformer, quantized matmul, RoPE, KV cache
- mlx/server.rs: OpenAI-compatible HTTP (streaming SSE + blocking)
- Auto-detect safetensors dirs, skip rpc-server (solo-only)
- cfg(target_os = macos) — zero impact on Linux builds
Without causal masking during prefill, larger models (32B) produced
degenerate output (repeating template tokens). Use MLX's built-in
ScaledDotProductAttentionMask::Causal for multi-token prefill.

Tested: Qwen2.5-32B-Instruct-4bit — correct output, ~23 tok/s.
- EOS detection now uses config.json eos_token_id instead of hardcoded
  token IDs. Fixes premature stop on Qwen (token 2 = '#', not </s>).
- Reverted chunked prefill — concatenation-based KV cache makes it
  worse not better. Need pre-allocated KV cache for real improvement.
Replace concatenation-based KV cache with pre-allocated buffers using
slice assignment (index_mut). Allocates in steps of 256 positions and
grows with concatenation only when needed.

Critical fix: zeros pre-allocation now uses the same dtype as the
incoming KV tensors (typically f16 from quantized matmul), eliminating
f32→f16 type promotion on every attention call.

Prefill chunk size set to 2048 matching mlx-lm's default.

Results on Qwen2.5-32B-Instruct-4bit:
- 2k TTFT: 14.2s → 11.8s (matches mlx-lm reference)
- 8k TTFT: 85.7s → 57.1s (matches mlx-lm reference)
- Decode 200 tokens: 12.5s → 8.7s (~23 tok/s, faster than llama-server)
Keep KV caches + token IDs from the previous request. On the next
request, find the longest common prefix and skip re-prefilling those
tokens — only forward the new suffix.

This is the standard approach used by llama-server (slot reuse),
mlx-lm, and Ollama. Critical for agent workloads where each turn
shares the same system prompt + conversation history.

Results on Qwen2.5-32B-Instruct-4bit, ~1.2k token prompt:
- Cold TTFT: 6.2s
- Warm TTFT: 0.44s (14x faster, only 9 new tokens to prefill)

Also adds KVCache::trim_to() for rewinding cache to a prefix length.
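
The prefix-matching step this commit describes can be sketched as follows (hypothetical helper, not the actual mesh-llm code): find the longest common prefix between the cached token IDs and the new prompt, then trim the cache to that length and prefill only the suffix.

```rust
// Length of the longest shared prefix between the previous request's
// token IDs and the new prompt's token IDs.
fn common_prefix_len(cached: &[u32], prompt: &[u32]) -> usize {
    cached
        .iter()
        .zip(prompt.iter())
        .take_while(|(a, b)| a == b)
        .count()
}
```

In the warm-TTFT example above, this is what reduces a ~1.2k-token prefill to 9 new tokens.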
@i386
Copy link
Copy Markdown
Collaborator

i386 commented Apr 2, 2026

Updated after the latest MLX UX, templating, smoke-test, and catalog work:

| Risk | Recommendation | Mitigation |
| --- | --- | --- |
| Safetensors auto-detection is still a real decision point, but it is now materially narrower than before. The branch now prefers trusted config metadata and only falls back to safetensors-header inspection when needed. | Keep the current approach: prefer HF/config metadata first, use header inspection only as fallback, and fail clearly when a safetensors model is present but not supported by the MLX backend. | ✅ Added explicit --mlx-file; detection now prefers config metadata and only falls back to header inspection in mlx/model.rs. Repo shorthand also supports explicit --mlx / --gguf selection when a Hugging Face repo is ambiguous. |
| MLX bypasses the existing rpc/split path for supported MLX models on macOS. | Treat this as an intentional current limitation, not an accidental regression. Keep it documented and track future distributed MLX support separately. See #146. | This is by design in the current rollout. Supported MLX models route to the local MLX backend, while GGUF models continue to use the llama.cpp path that supports rpc/split distribution. Follow-up tracked in #146. |
| MLX serving behavior is not yet feature-parity with llama.cpp. Current path is narrower in areas like chat templating and sampling behavior. | Merge with clear scope: position MLX as a supported backend for a subset of models/workflows, not a full behavioral clone yet. Track full parity as future work rather than blocking on it. | ✅ This is materially improved: the MLX path now uses Hugging Face chat templates when available via minijinja, falls back cleanly to built-in templates, supports Gemma prompt rendering on the template side, and honors common decoding controls (temperature, top_p, top_k, seed, stop). |
| In-process MLX server lifecycle is new control-path logic compared with the older subprocess-based llama flow. | Keep the current shutdown/test coverage, and add targeted regression tests later around startup/shutdown, reload, and interrupted requests if MLX becomes a maintained backend. | ✅ Adapted MLX to the current InferenceServerHandle lifecycle, added explicit shutdown handling, and verified with cargo test -p mesh-llm --lib, just build, unit/request-shape smokes, and the new macOS end-to-end MLX smoke path. |

Still deferred intentionally: actual Gemma MLX model loading/runtime support. That work is tracked separately in #142; only Gemma prompt/template rendering support is included in this PR.

@i386 i386 requested a review from Copilot April 3, 2026 07:09
@i386 i386 marked this pull request as ready for review April 3, 2026 07:09
Contributor

Copilot AI left a comment


Pull request overview

Adds a macOS-only native MLX (Metal) inference backend that can serve supported safetensors model directories in-process, integrating with existing runtime/model resolution flows and adding CI smoke coverage for real MLX models.

Changes:

  • Introduces a new mlx backend (server, sampling, HF template rendering) and wires it into runtime election/local serving paths on macOS.
  • Extends model resolution + CLI UX to support MLX artifacts, repo shorthand resolution, and explicit --gguf/--mlx preferences plus --gguf-file/--mlx-file.
  • Adds MLX catalog entries + download sidecar handling, and a macOS CI smoke matrix that runs real inference.

Reviewed changes

Copilot reviewed 28 out of 29 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| scripts/ci-mlx-smoke-test.sh | Adds a macOS MLX end-to-end smoke script (startup, inference, template checks). |
| mesh-llm/src/runtime/mod.rs | Adds CLI validation, MLX file handling, repo shorthand resolution preference, and skips rpc-server for MLX. |
| mesh-llm/src/runtime/local.rs | Normalizes MLX model names and launches MLX in-process server when applicable. |
| mesh-llm/src/models/resolve.rs | Adds repo shorthand resolution with artifact selection + preference (--gguf/--mlx). |
| mesh-llm/src/models/prompt.rs | Adds prompt behavior inference from HF templates / fallback heuristics. |
| mesh-llm/src/models/mod.rs | Exposes prompt behavior + format preference in the models module API. |
| mesh-llm/src/models/catalog.rs | Adds MLX sidecars + safetensors shard discovery for HF downloads. |
| mesh-llm/src/models/catalog.json | Adds MLX catalog entries (Qwen/Llama/Gemma MLX variants). |
| mesh-llm/src/models/capabilities.rs | Extends capability inference to include .jinja template sources. |
| mesh-llm/src/mlx/* | Adds MLX backend implementation: HTTP server, sampling, template rendering, and test corpus. |
| mesh-llm/src/mesh/mod.rs | Improves served model descriptor inference using known model paths. |
| mesh-llm/src/lib.rs | Enables the mlx module on macOS builds. |
| mesh-llm/src/inference/launch.rs | Adds an "in-process" inference server handle for MLX lifecycle shutdown. |
| mesh-llm/src/inference/election.rs | Routes MLX models to the in-process MLX server on macOS. |
| mesh-llm/src/cli/* | Adds new CLI flags and backend preference options for model download/load commands. |
| mesh-llm/Cargo.toml | Adds macOS-only deps for MLX + template rendering. |
| README.md | Documents MLX catalog usage, repo shorthand, and new flags. |
| MLX_ROADMAP.md | Adds roadmap / scope tracking doc for MLX backend work. |
| AGENTS.md | Adds guidance to avoid parallel Rust build/test/format in a worktree. |
| .github/workflows/ci.yml | Adds a macOS MLX CI job matrix running real inference via the smoke script. |


Comment threads:

  • scripts/ci-mlx-smoke-test.sh (outdated)
  • mesh-llm/src/runtime/local.rs (outdated)
  • mesh-llm/src/runtime/mod.rs (outdated)
  • mesh-llm/src/runtime/mod.rs (outdated)
  • mesh-llm/src/mlx/server.rs
  • mesh-llm/src/mlx/server.rs (outdated)
  • mesh-llm/src/mlx/server.rs
  • mesh-llm/src/runtime/mod.rs (outdated)
i386 and others added 30 commits April 10, 2026 16:12
# Conflicts:
#	scripts/package-release.sh
# Conflicts:
#	mesh-llm/src/runtime/local.rs

Labels

blocker blocking other PRs experimental
