
feat: prompt prefix caching with TTL eviction and TurboQuant support#995

Open
eloe wants to merge 7 commits into Blaizzy:main from eloe:upstream/prompt-cache

Conversation


eloe commented Apr 9, 2026

Summary

Adds cross-request KV cache reuse to the server, significantly reducing time-to-first-token for multi-turn conversations and agentic workflows where the system prompt is stable across turns.

How it works

  1. PromptCacheState tracks token IDs + KV cache between requests
  2. On each request, find_prefix_length() computes the common prefix with the cached state
  3. The KV cache is trimmed to the prefix length and reused, skipping re-computation of shared tokens
  4. Cache entries are keyed per model (default) or per prompt_cache_key for multi-session routing
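
As a rough illustration of step 2, the prefix computation can be sketched like this (the name `find_prefix_length` comes from the description above; the body is a minimal sketch, not the PR's actual code):

```python
def find_prefix_length(cached_tokens, new_tokens):
    """Length of the longest common prefix of two token-ID sequences.
    Illustrative sketch of the behavior described above."""
    n = min(len(cached_tokens), len(new_tokens))
    i = 0
    while i < n and cached_tokens[i] == new_tokens[i]:
        i += 1
    return i
```

On a multi-turn conversation the stable system prompt yields a long common prefix, so only the new suffix tokens need prefilling.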

Features

  • TurboQuant compatible: uses cache.trim() instead of raw array slicing, so it works with QuantizedKVCache, RotatingKVCache, and ArraysCache (hybrid models such as Qwen3.5)
  • TTL eviction: Background asyncio task evicts stale entries after --prompt-cache-ttl seconds (default 300s), freeing GPU memory
  • LRU cap: Max 64 cache entries with least-recently-used eviction to prevent unbounded memory growth
  • Cache key routing: Optional prompt_cache_key field enables per-session cache isolation (useful for multi-user gateways)
  • cached_tokens reporting: usage.input_tokens_details.cached_tokens in responses
  • Graceful recovery: Cache mismatches (shape errors, stale state) are caught and fall back to fresh generation
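
The TTL + LRU policy above could be sketched as follows (a minimal illustration of the described behavior — 64-entry cap, idle-based eviction, TTL=0 disables expiry; class and attribute names are assumptions, not the PR's implementation):

```python
from collections import OrderedDict
import time

class PromptCacheStore:
    """Sketch of the eviction policy described above, not mlx-vlm's code."""

    def __init__(self, max_entries=64, ttl=300.0):
        self.max_entries = max_entries
        self.ttl = ttl
        self._entries = OrderedDict()  # key -> (state, last_used)

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        state, _ = item
        self._entries.move_to_end(key)  # mark as most recently used
        self._entries[key] = (state, time.monotonic())
        return state

    def put(self, key, state):
        self._entries[key] = (state, time.monotonic())
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least-recently-used

    def evict_stale(self, now=None):
        if self.ttl <= 0:  # TTL=0 means no expiry
            return
        now = time.monotonic() if now is None else now
        for k in [k for k, (_, t) in self._entries.items() if now - t > self.ttl]:
            del self._entries[k]
```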

Configuration

| Flag | Env var | Default | Description |
| --- | --- | --- | --- |
| `--prompt-cache-ttl` | `PROMPT_CACHE_TTL` | `300` | Seconds before idle cache eviction; `0` disables expiry |
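
The flag/env-var precedence in the table could be wired up roughly like this (a sketch assuming argparse; the parsing details are not taken from the PR):

```python
import argparse
import os

def build_parser():
    """Sketch: the CLI flag wins; PROMPT_CACHE_TTL supplies the default;
    otherwise fall back to 300 seconds."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--prompt-cache-ttl",
        type=float,
        default=float(os.environ.get("PROMPT_CACHE_TTL", 300)),
        help="Seconds before idle cache eviction; 0 disables expiry",
    )
    return parser
```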

Tests

9 tests covering cache state lifecycle, persistence, isolation, prefix matching, and cleanup:

```shell
python -m pytest mlx_vlm/tests/ -k "PromptCache" -v
```

Related

This builds on the same direction as #946 by @trevorgordon981 (basic PromptCacheState wiring) and #944 by @damonvjanis (TurboQuant trim fix). This PR includes both of those fixes plus TTL eviction, LRU capping, per-session cache key routing, cached_tokens reporting, graceful stale-cache recovery, and 9 tests.

eloe and others added 7 commits April 8, 2026 20:32
Wire the existing PromptCacheState from generate.py into both
/v1/responses and /chat/completions endpoints. On repeated requests,
the KV cache from the previous generation is reused for matching
prefix tokens, skipping redundant prefill computation.

This is especially impactful for agentic workflows where the system
prompt (~15K tokens) is the same across requests — only new user
messages need prefilling, reducing latency from ~35s to ~2-3s on
follow-up turns.

Changes:
- Import PromptCacheState from generate.py
- Add get_prompt_cache_state() keyed by model name
- Pass prompt_cache_state to all 4 generate/stream_generate call sites
- Clear prompt cache on model unload

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The prompt cache prefix reuse code assumed all cache layers have
mx.array keys with .shape. TurboQuantKVCache stores keys as
TurboQuantMSEState objects which don't support slicing.

Now checks for .shape before attempting to trim, and falls back
to updating just the offset for quantized cache layers.

Fixes: 'TurboQuantMSEState' object has no attribute 'shape' error
when prompt caching is used with --kv-bits 3.5 --kv-quant-scheme turboquant.
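
The fallback this commit describes might look roughly like the following (attribute names are illustrative assumptions, not TurboQuantKVCache's real API; `trim(n)` is assumed to drop `n` tokens from the end, per the commit text):

```python
def reuse_cached_prefix(cache_layers, prefix_len):
    """Sketch of the guard described above: trim layers whose keys expose
    .shape; for quantized layers, just rewind the logical offset."""
    for layer in cache_layers:
        keys = getattr(layer, "keys", None)
        if keys is not None and hasattr(keys, "shape"):
            # Standard KV cache: drop everything past the shared prefix.
            layer.trim(layer.offset - prefix_len)
        else:
            # Quantized (TurboQuant-style) layer: keys are not sliceable,
            # so only the offset counter is adjusted.
            layer.offset = prefix_len
```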

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_tokens

Improve prompt prefix caching to work with all KV cache types:

- Use trim(n) method for prefix reuse instead of manual array slicing.
  Works with both standard KVCache and TurboQuantKVCache.
- Accept optional cache_key parameter for per-session cache routing
  (supports OpenClaw prompt_cache_key and Hermes patterns).
- Add cached_tokens field to GenerationResult, populated from
  reused_prefix_len for cache hit reporting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tches

Two fixes for prompt cache stability:

1. Require >= 50% prefix match (min 512 tokens) before reusing KV cache.
   Short matches on quantized caches (TurboQuant) produce corrupted
   repetitive output because trim() only adjusts offset without clearing
   stale quantized data.

2. Skip cache save for requests < 1024 tokens. Agent frameworks send
   short probe/capability-check requests that would evict the valuable
   cached system prompt KV state.

Also uses trim() method for cache trimming (TurboQuant compatible)
instead of manual array slicing.
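
The reuse gate described in point 1 reduces to a small predicate (a sketch of the stated thresholds; the function name is hypothetical):

```python
def should_reuse_prefix(prefix_len, prompt_len, min_tokens=512, min_ratio=0.5):
    """Only reuse the cached KV state when the matched prefix is at least
    min_tokens long AND covers at least min_ratio of the new prompt,
    per the thresholds described in this commit."""
    return prefix_len >= min_tokens and prefix_len >= min_ratio * prompt_len
```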

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prompt cache fixes:
- Wrap cache reuse in try/except — if trim() fails (e.g.,
  broadcast_shapes from stale KV state), invalidate cache and
  fall back to fresh generation instead of crashing
- Add cache shape validation before generate_step — detect
  seq length mismatches early and rebuild cache
- Add PromptCacheState.invalidate() method to clear stale state

Error handling:
- Sanitize all error messages sent to clients — no more raw
  MLX errors like "[broadcast_shapes] Shapes (12716) and (1840)
  cannot be broadcast" leaking through to Telegram/API users
- Streaming and non-streaming paths both sanitized
- Full errors still logged server-side for debugging

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PromptCacheState now tracks last_used and created_at timestamps.
A background asyncio task runs every 60s and evicts entries idle
longer than --prompt-cache-ttl (default 300s, configurable via
CLI or PROMPT_CACHE_TTL env var). TTL=0 disables expiry.

This prevents stale KV caches from holding GPU memory indefinitely
(e.g., a 45K-context conversation cache sitting unused for hours).
Eviction triggers gc.collect() + mx.clear_cache() to reclaim VRAM.
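
The eviction pass described here could be factored like this (a sketch under assumed names; the real code also calls mx.clear_cache() after evicting, per the message above):

```python
import asyncio
import gc
import time

def evict_idle_entries(entries, ttl, now=None):
    """One eviction pass over {key: (state, last_used)}; returns evicted keys.
    Illustrative only -- the dict layout and names are assumptions."""
    if ttl <= 0:  # TTL=0 disables expiry
        return []
    now = time.monotonic() if now is None else now
    stale = [k for k, (_, last) in entries.items() if now - last > ttl]
    for k in stale:
        del entries[k]
    if stale:
        gc.collect()  # reclaim memory; the PR also calls mx.clear_cache()
    return stale

async def eviction_loop(entries, ttl, interval=60.0, rounds=None):
    """Background task mirroring the 60 s cadence described above;
    rounds=None runs forever, a number bounds it for testing."""
    while rounds is None or rounds > 0:
        await asyncio.sleep(interval)
        evict_idle_entries(entries, ttl)
        if rounds is not None:
            rounds -= 1
```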

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Simulates actual failure scenarios:
- 13-hour idle Telegram conversation (the broadcast_shapes bug)
- Active conversation (30s between messages, never evicted)
- Multiple users with different cache keys (only stale evicted)
- TTL boundary (299s idle on 300s TTL — not evicted)
- Invalidation verification (cache/token_ids set to None)
- Short TTL for dev/testing (5s evicts after 10s idle)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gofastercloud added a commit to gofastercloud/yabby that referenced this pull request Apr 11, 2026
Two-part patch that makes prompt prefix caching work end-to-end on
mlx_vlm.server with TurboQuant KV. Measured ~24x speedup on
back-to-back agent turns and ~38x speedup on direct chat/completions.

Part 1 vendors Blaizzy/mlx-vlm#995 runtime diff (generate.py +
server.py, skipping tests) under docs/upstream-prs/03-.../pr995.diff.
apply-mlx-patches.sh applies via 'patch -p1' with a --dry-run
precheck and an idempotency guard (grep for _prompt_cache_states).

Part 2 fixes PR #995's shape-check guard: c.keys.shape[2] reflects
the TurboQuant group size (1024), not the actual token offset,
causing every cache-reuse attempt to fail and silently defeating
the cache. Fix is a ~6-line inline replacement: trust c.offset
first, fall back to shape-based comparison only when offset is
unset. Marker 'LOCAL PATCH (OpenDeclawed, 2026-04-11): TurboQuant'
makes it idempotent.

Full write-up in docs/upstream-prs/03-mlx-vlm-prompt-cache-turboquant/README.md.
@gofastercloud

+1 from production use — this PR is doing real work for us. A huge thank-you to @eloe for pushing it upstream, and one bug report attached below that I think is worth folding in before merge.

Empirical result

We're running mlx_vlm.server 0.4.4 with --kv-bits 3 --kv-quant-scheme turboquant on an M4 Pro 24 GB Mac Mini, serving mlx-community/gemma-4-26b-a4b-it-4bit as the backend for a local AI agent stack with a ~21K-token system prompt per turn. Before this PR, every agent turn re-prefilled the full system prompt at ~500 tok/s → ~42 s of prefill before any useful work. Five tool-call round-trips in a single task meant 3.5 minutes of pure prefill, routinely blowing the agent's turn timeout.

With this PR applied (plus the TurboQuant shape-check fix described below), measured on two back-to-back identical agent turns through our gateway:

| run | prompt size | duration | prefill rate |
| --- | --- | --- | --- |
| turn 1 (cold) | 25,906 tok | 144,708 ms (~2:24) | ~500 tok/s |
| turn 2 (warm) | 25,906 tok | 5,949 ms (~5.9 s) | cache hit (no prefill) |

~24× speedup on end-to-end agent turns. For direct /v1/chat/completions (non-streaming) the numbers are even cleaner: 13.9 s cold → 0.36 s warm on a 6,647-token prompt with the same prompt_cache_key. The prefill progress bars vanish from mlx-vlm.log on warm calls — the server skips straight to decode.

Without this PR, none of that is possible on our hardware. The agent-use case on a single-machine deployment becomes actually viable with prefix caching where before it was drowning in prefill time. This is exactly the kind of thing a server for local agent use needs, and I hope it lands.

Bug report — TurboQuant shape-check incompatibility

generate.py around L780 (post-patch), inside the cache validation that guards against stale KV state:

```python
if reused_prefix_len > 0:
    try:
        for c in kwargs["prompt_cache"]:
            if hasattr(c, "keys") and c.keys is not None:
                expected_seq = reused_prefix_len
                actual_seq = c.keys.shape[2] if len(c.keys.shape) >= 3 else c.offset
                if actual_seq != expected_seq:
                    raise ValueError(
                        f"Cache shape mismatch: expected seq={expected_seq}, got {actual_seq}"
                    )
```

For unquantized KV caches this is correct: c.keys is (batch, kv_heads, seq_len, head_dim) and shape[2] equals the token count, so the comparison is meaningful.

For TurboQuant-quantized KV caches, however, c.keys has a grouped/chunked layout where shape[2] reflects the group size (typically 1024), not the actual token offset. The check therefore fails on every call for quantized caches — even when the cache is byte-identical to what was just stored — and falls into the invalidate + rebuild branch. The net effect is that prompt caching looks functionally correct on any prose test (the code paths all run) but delivers zero measured speedup on a TurboQuant-configured server, with logs like this on every call:

```
[prompt_cache] Cache validation failed, rebuilding: Cache shape mismatch: expected seq=6182, got 1024
[prompt_cache] Cache validation failed, rebuilding: Cache shape mismatch: expected seq=6200, got 1024
[prompt_cache] Cache validation failed, rebuilding: Cache shape mismatch: expected seq=6218, got 1024
```

The got 1024 is the dead giveaway — that's the quant group size, not the offset.

Suggested fix

Trust c.offset first when present (it's the authoritative counter for both unquantized and quantized caches under the existing MLX cache abstractions), and only fall back to shape-based comparison when offset is unavailable:

```python
if reused_prefix_len > 0:
    try:
        for c in kwargs["prompt_cache"]:
            if hasattr(c, "keys") and c.keys is not None:
                expected_seq = reused_prefix_len
                if hasattr(c, "offset") and c.offset is not None:
                    actual_seq = c.offset
                else:
                    actual_seq = c.keys.shape[2] if len(c.keys.shape) >= 3 else 0
                if actual_seq != expected_seq:
                    raise ValueError(
                        f"Cache shape mismatch: expected seq={expected_seq}, got {actual_seq}"
                    )
```

That's the ~6-line change we're carrying locally. We went from 0× speedup (with the upstream check) to 24× speedup (with c.offset first) on an otherwise-identical setup. The non-TurboQuant path is unchanged because c.offset is already set on those caches too and points to the same value that shape[2] would have.

Happy to open a follow-up PR against your branch with the one-line fix if that's useful, or you might prefer to fold it in here since it's a one-character conceptual change and the commit already advertises TurboQuant support. Whatever's easier for you to land.

Context

We carry both this PR's full runtime diff and the TurboQuant shape-fix as local patches, vendored in-tree with a dedicated README and retirement plan. Happy to share the full apply-mlx-patches.sh harness if anyone else is running into the same TurboQuant issue — the patch file will be at docs/upstream-prs/03-mlx-vlm-prompt-cache-turboquant/ once we push it public.

Again — huge thanks for this PR. This is going to unblock a lot of single-machine agent stacks.

gofastercloud added a commit to gofastercloud/yabby that referenced this pull request Apr 11, 2026
Three documentation updates covering the 2026-04-11 round of fixes
(read tool, weather skill, session maintenance, and the prompt-cache
win from Blaizzy/mlx-vlm#995 + local TurboQuant fix).

## CHANGELOG.md

New sections under [Unreleased]:

- **Added — prompt prefix caching (2026-04-11):** vendored
  Blaizzy/mlx-vlm#995 diff at
  docs/upstream-prs/03-mlx-vlm-prompt-cache-turboquant/pr995.diff,
  plus our TurboQuant shape-check fix, plus the measured 24x
  end-to-end and ~100x prefill-rate speedups. Links to the
  upstream +1 comment at
  Blaizzy/mlx-vlm#995 (comment).
- **Added — service-restored Telegram alert (2026-04-11):** the
  fail→pass transition path in health-check.sh.
- **Fixed — skill reliability round 2 (2026-04-11):** the full
  session-bloat-masquerading-as-read-tool-bug diagnosis with the
  four co-dependent fixes (session.maintenance tuning,
  agent-exec.sh fresh-session default, weather HTTPS override,
  earlier prompt cache work).
- **Verified (2026-04-11) block** with measured durations for
  read / weather / prompt cache / allowlist / service-restored.

## docs/dress-rehearsal.md

New troubleshooting section "Tool calls always 'fail' with no error
detail" at the top of the Troubleshooting chapter. Walks through the
session-bloat diagnosis with copy-pasteable diagnostic commands (ls,
wc, grep for Metal OOM in the err log) and the manual recovery snippet
(drop the main session entry from sessions.json, archive the bloated
transcript, restart gateway). Points to commit 4f92a81 for the
durable fix.

## docs/upstream-prs/03-mlx-vlm-prompt-cache-turboquant/DURABILITY.md

New file documenting exactly which committed file is the source of
truth for each of the five fixes in the cluster, how they flow
through the next reset-local-state.sh + install.sh + wizard.py
rebuild, and what to retire when Blaizzy/mlx-vlm#995 + the
TurboQuant fix actually merge upstream. Also lists things that
deliberately WON'T survive a rebuild (hot-patched live openclaw.json,
workspace skill copies, ephemeral seed file) and why that's fine.

Retirement trail table maps each upstream PR to the specific
apply-mlx-patches.sh function to drop when it merges. The
check_upstream_mlx_prs() drift-check from update.sh (63753b6)
covers the monitoring side.