feat: prompt prefix caching with TTL eviction and TurboQuant support #995
eloe wants to merge 7 commits into Blaizzy:main
Conversation
Wire the existing PromptCacheState from generate.py into both /v1/responses and /chat/completions endpoints. On repeated requests, the KV cache from the previous generation is reused for matching prefix tokens, skipping redundant prefill computation. This is especially impactful for agentic workflows where the system prompt (~15K tokens) is the same across requests — only new user messages need prefilling, reducing latency from ~35s to ~2-3s on follow-up turns.

Changes:
- Import PromptCacheState from generate.py
- Add get_prompt_cache_state() keyed by model name
- Pass prompt_cache_state to all 4 generate/stream_generate call sites
- Clear prompt cache on model unload

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
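As a rough sketch of the per-model wiring described above — the fields on PromptCacheState and the module-level registry here are assumptions for illustration, not the PR's exact code:

```python
import time
from dataclasses import dataclass, field


@dataclass
class PromptCacheState:
    # Cached KV state plus the token ids it was built from (illustrative fields).
    cache: object = None
    token_ids: list = field(default_factory=list)
    last_used: float = field(default_factory=time.time)


# One cache state per loaded model, so switching models never reuses
# KV tensors built for a different model.
_prompt_cache_states: dict = {}


def get_prompt_cache_state(model_name: str) -> PromptCacheState:
    # Create lazily on the first request for this model.
    if model_name not in _prompt_cache_states:
        _prompt_cache_states[model_name] = PromptCacheState()
    return _prompt_cache_states[model_name]


def clear_prompt_cache(model_name: str) -> None:
    # Called on model unload so stale KV state is dropped with the model.
    _prompt_cache_states.pop(model_name, None)
```

All four generate/stream_generate call sites would then fetch the state with the same key, which is what makes the prefix reuse cross-request.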
The prompt cache prefix reuse code assumed all cache layers have mx.array keys with .shape. TurboQuantKVCache stores keys as TurboQuantMSEState objects, which don't support slicing. Now checks for .shape before attempting to trim, and falls back to updating just the offset for quantized cache layers.

Fixes: 'TurboQuantMSEState' object has no attribute 'shape' error when prompt caching is used with --kv-bits 3.5 --kv-quant-scheme turboquant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
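The guard described in this commit can be sketched as a small helper — a minimal sketch assuming keys/values shaped (batch, heads, seq, head_dim) on regular layers; the function name and return value are illustrative, not the commit's actual code:

```python
def trim_cache_layer(layer, num_to_trim: int) -> bool:
    """Roll back num_to_trim tokens from one KV cache layer.

    Returns True if the key/value arrays were actually sliced,
    False if only the offset was adjusted (quantized layers).
    """
    new_len = layer.offset - num_to_trim
    sliceable = layer.keys is not None and hasattr(layer.keys, "shape")
    if sliceable:
        # Regular KVCache: keys/values are arrays shaped
        # (batch, heads, seq, head_dim); cut the seq axis.
        layer.keys = layer.keys[..., :new_len, :]
        layer.values = layer.values[..., :new_len, :]
    # TurboQuant state objects can't be sliced, so for them only the
    # offset moves back; stale quantized data past the offset is
    # simply never read on the happy path.
    layer.offset = new_len
    return sliceable
```

A later commit in this PR notes the offset-only fallback is why short prefix matches on quantized caches can corrupt output, which motivates the minimum-match thresholds.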
…_tokens

Improve prompt prefix caching to work with all KV cache types:
- Use trim(n) method for prefix reuse instead of manual array slicing. Works with both standard KVCache and TurboQuantKVCache.
- Accept optional cache_key parameter for per-session cache routing (supports OpenClaw prompt_cache_key and Hermes patterns).
- Add cached_tokens field to GenerationResult, populated from reused_prefix_len, for cache hit reporting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
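The cached_tokens reporting amounts to surfacing reused_prefix_len in an OpenAI-style usage block; a minimal sketch — the helper name is illustrative, and the field path follows the PR description's usage.input_tokens_details.cached_tokens:

```python
def build_usage(prompt_tokens: int, completion_tokens: int,
                reused_prefix_len: int) -> dict:
    # Report cache hits the way OpenAI-compatible clients expect,
    # so gateways can verify the prefix cache is actually working.
    return {
        "input_tokens": prompt_tokens,
        "output_tokens": completion_tokens,
        "input_tokens_details": {"cached_tokens": reused_prefix_len},
    }
```

A client seeing cached_tokens near the full prompt length on a follow-up turn confirms the prefill was skipped.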
…tches

Two fixes for prompt cache stability:
1. Require >= 50% prefix match (min 512 tokens) before reusing KV cache. Short matches on quantized caches (TurboQuant) produce corrupted repetitive output because trim() only adjusts the offset without clearing stale quantized data.
2. Skip cache save for requests < 1024 tokens. Agent frameworks send short probe/capability-check requests that would evict the valuable cached system-prompt KV state.

Also uses the trim() method for cache trimming (TurboQuant compatible) instead of manual array slicing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
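The two thresholds above reduce to a pair of pure predicates; a sketch with the constants taken from the commit message (function names are illustrative):

```python
MIN_PREFIX_TOKENS = 512    # below this, quantized-cache trim() is unsafe
MIN_MATCH_FRACTION = 0.5   # require at least half the prompt to match
MIN_SAVE_TOKENS = 1024     # don't let short probes evict a big cache


def should_reuse_prefix(match_len: int, prompt_len: int) -> bool:
    # Short partial matches on TurboQuant caches corrupt output because
    # trim() only moves the offset, so only reuse when the match is both
    # long in absolute terms and a large fraction of the new prompt.
    if match_len < MIN_PREFIX_TOKENS:
        return False
    return match_len >= MIN_MATCH_FRACTION * prompt_len


def should_save_cache(prompt_len: int) -> bool:
    # Agent frameworks fire off short capability probes; saving those
    # would evict the valuable system-prompt KV state.
    return prompt_len >= MIN_SAVE_TOKENS
```

Keeping these as pure functions of token counts makes the policy trivially unit-testable, separate from any MLX state.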
Prompt cache fixes:
- Wrap cache reuse in try/except — if trim() fails (e.g., broadcast_shapes from stale KV state), invalidate the cache and fall back to fresh generation instead of crashing
- Add cache shape validation before generate_step — detect sequence-length mismatches early and rebuild the cache
- Add PromptCacheState.invalidate() method to clear stale state

Error handling:
- Sanitize all error messages sent to clients — no more raw MLX errors like "[broadcast_shapes] Shapes (12716) and (1840) cannot be broadcast" leaking through to Telegram/API users
- Streaming and non-streaming paths both sanitized
- Full errors still logged server-side for debugging

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
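The sanitization side can be as small as one helper — a minimal sketch, with an assumed function name and generic client message; the key property is that raw backend errors go to the log, never to the client:

```python
import logging

logger = logging.getLogger(__name__)


def safe_error_message(exc: Exception) -> str:
    # Log the raw error (e.g. MLX's "[broadcast_shapes] Shapes (12716)
    # and (1840) cannot be broadcast") server-side for debugging...
    logger.error("generation failed: %r", exc)
    # ...but return only a generic message, so backend internals never
    # reach Telegram or API clients in either streaming path.
    return "Internal error during generation. Please retry."
```

Both the streaming and non-streaming handlers would route their except blocks through the same helper so the two paths can't drift.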
PromptCacheState now tracks last_used and created_at timestamps. A background asyncio task runs every 60s and evicts entries idle longer than --prompt-cache-ttl (default 300s, configurable via CLI or PROMPT_CACHE_TTL env var). TTL=0 disables expiry. This prevents stale KV caches from holding GPU memory indefinitely (e.g., a 45K-context conversation cache sitting unused for hours). Eviction triggers gc.collect() + mx.clear_cache() to reclaim VRAM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
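The eviction policy described above can be sketched as a pure expiry check plus a background loop — an illustrative sketch, not the PR's exact task (the gc.collect()/mx.clear_cache() step is noted as a comment since it needs MLX):

```python
import asyncio
import time
from typing import Optional


def is_expired(last_used: float, ttl: float, now: Optional[float] = None) -> bool:
    # ttl == 0 disables expiry entirely; otherwise an entry expires
    # only when it has been idle strictly longer than ttl seconds.
    if ttl <= 0:
        return False
    now = time.time() if now is None else now
    return now - last_used > ttl


async def evict_stale_caches(states: dict, ttl: float, interval: float = 60.0):
    # Background task: every `interval` seconds, drop entries idle
    # longer than `ttl` so stale KV caches release their memory.
    while True:
        await asyncio.sleep(interval)
        now = time.time()
        stale = [k for k, s in states.items()
                 if is_expired(s.last_used, ttl, now)]
        for key in stale:
            del states[key]
        # The real server additionally runs gc.collect() and
        # mx.clear_cache() here to reclaim VRAM after eviction.
```

The strict `>` comparison matches the test scenario in the next commit: 299s idle on a 300s TTL is not evicted.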
Simulates actual failure scenarios:
- 13-hour idle Telegram conversation (the broadcast_shapes bug)
- Active conversation (30s between messages, never evicted)
- Multiple users with different cache keys (only stale entries evicted)
- TTL boundary (299s idle on 300s TTL — not evicted)
- Invalidation verification (cache/token_ids set to None)
- Short TTL for dev/testing (5s TTL evicts after 10s idle)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two-part patch that makes prompt prefix caching work end-to-end on mlx_vlm.server with TurboQuant KV. Measured ~24x speedup on back-to-back agent turns and ~38x speedup on direct chat/completions.

Part 1 vendors the Blaizzy/mlx-vlm#995 runtime diff (generate.py + server.py, skipping tests) under docs/upstream-prs/03-.../pr995.diff. apply-mlx-patches.sh applies it via 'patch -p1' with a --dry-run precheck and an idempotency guard (grep for _prompt_cache_states).

Part 2 fixes PR #995's shape-check guard: c.keys.shape[2] reflects the TurboQuant group size (1024), not the actual token offset, causing every cache-reuse attempt to fail and silently defeating the cache. The fix is a ~6-line inline replacement: trust c.offset first, fall back to the shape-based comparison only when the offset is unset. The marker 'LOCAL PATCH (OpenDeclawed, 2026-04-11): TurboQuant' makes it idempotent.

Full write-up in docs/upstream-prs/03-mlx-vlm-prompt-cache-turboquant/README.md.
+1 from production use — this PR is doing real work for us. A huge thank-you to @eloe for pushing it upstream, and one bug report attached below that I think is worth folding in before merge.

Empirical result

We're running mlx_vlm.server with TurboQuant KV cache quantization. With this PR applied (plus the TurboQuant shape-check fix described below), measured on two back-to-back identical agent turns through our gateway:
~24× speedup on end-to-end agent turns, and ~38× on direct chat/completions. Without this PR, none of that is possible on our hardware. The agent use case on a single-machine deployment becomes actually viable with prefix caching, where before it was drowning in prefill time. This is exactly the kind of thing a server for local agent use needs, and I hope it lands.

Bug report — TurboQuant shape-check incompatibility
```python
if reused_prefix_len > 0:
    try:
        for c in kwargs["prompt_cache"]:
            if hasattr(c, "keys") and c.keys is not None:
                expected_seq = reused_prefix_len
                actual_seq = c.keys.shape[2] if len(c.keys.shape) >= 3 else c.offset
                if actual_seq != expected_seq:
                    raise ValueError(
                        f"Cache shape mismatch: expected seq={expected_seq}, got {actual_seq}"
                    )
```

For unquantized KV caches this is correct: `c.keys.shape[2]` matches the number of cached tokens. For TurboQuant-quantized KV caches, however, `c.keys.shape[2]` reflects the TurboQuant group size (1024), not the actual token offset. The result is that the check raises on every request, so every cache-reuse attempt fails and the cache is silently defeated.

Suggested fix

Trust `c.offset` first, and fall back to the shape-based comparison only when the offset is unset:

```python
if reused_prefix_len > 0:
    try:
        for c in kwargs["prompt_cache"]:
            if hasattr(c, "keys") and c.keys is not None:
                expected_seq = reused_prefix_len
                if hasattr(c, "offset") and c.offset is not None:
                    actual_seq = c.offset
                else:
                    actual_seq = c.keys.shape[2] if len(c.keys.shape) >= 3 else 0
                if actual_seq != expected_seq:
                    raise ValueError(
                        f"Cache shape mismatch: expected seq={expected_seq}, got {actual_seq}"
                    )
```

That's the ~6-line change we're carrying locally. We went from 0× speedup (with the upstream check) to 24× speedup (with `c.offset` trusted first).

Happy to open a follow-up PR against your branch with the fix if that's useful, or you might prefer to fold it in here since it's a one-character conceptual change and the commit already advertises TurboQuant support. Whatever's easier for you to land.

Context

We carry both this PR's full runtime diff and the TurboQuant shape-fix as local patches, vendored in-tree with a dedicated README and retirement plan. Happy to share the full write-up if useful.

Again — huge thanks for this PR. This is going to unblock a lot of single-machine agent stacks.
Three documentation updates covering the 2026-04-11 round of fixes (read tool, weather skill, session maintenance, and the prompt-cache win from Blaizzy/mlx-vlm#995 + local TurboQuant fix).

## CHANGELOG.md

New sections under [Unreleased]:
- **Added — prompt prefix caching (2026-04-11):** vendored Blaizzy/mlx-vlm#995 diff at docs/upstream-prs/03-mlx-vlm-prompt-cache-turboquant/pr995.diff, plus our TurboQuant shape-check fix, plus the measured 24x end-to-end and ~100x prefill-rate speedups. Links to the upstream +1 comment at Blaizzy/mlx-vlm#995 (comment).
- **Added — service-restored Telegram alert (2026-04-11):** the fail→pass transition path in health-check.sh.
- **Fixed — skill reliability round 2 (2026-04-11):** the full session-bloat-masquerading-as-read-tool-bug diagnosis with the four co-dependent fixes (session.maintenance tuning, agent-exec.sh fresh-session default, weather HTTPS override, earlier prompt cache work).
- **Verified (2026-04-11)** block with measured durations for read / weather / prompt cache / allowlist / service-restored.

## docs/dress-rehearsal.md

New troubleshooting section "Tool calls always 'fail' with no error detail" at the top of the Troubleshooting chapter. Walks through the session-bloat diagnosis with copy-pasteable diagnostic commands (ls, wc, grep for Metal OOM in the err log) and the manual recovery snippet (drop the main session entry from sessions.json, archive the bloated transcript, restart gateway). Points to commit 4f92a81 for the durable fix.

## docs/upstream-prs/03-mlx-vlm-prompt-cache-turboquant/DURABILITY.md

New file documenting exactly which committed file is the source of truth for each of the five fixes in the cluster, how they flow through the next reset-local-state.sh + install.sh + wizard.py rebuild, and what to retire when Blaizzy/mlx-vlm#995 + the TurboQuant fix actually merge upstream.

Also lists things that deliberately WON'T survive a rebuild (hot-patched live openclaw.json, workspace skill copies, ephemeral seed file) and why that's fine. The retirement trail table maps each upstream PR to the specific apply-mlx-patches.sh function to drop when it merges. The check_upstream_mlx_prs() drift-check from update.sh (63753b6) covers the monitoring side.
Summary
Adds cross-request KV cache reuse to the server, significantly reducing time-to-first-token for multi-turn conversations and agentic workflows where the system prompt is stable across turns.
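The reuse above hinges on computing how many leading tokens of the new prompt match the cached one; a minimal sketch of that computation (the PR's actual find_prefix_length() may differ in details, e.g. how it handles an exact full match):

```python
def find_prefix_length(cached_tokens: list, new_tokens: list) -> int:
    # Count leading token ids shared by the cached prompt and the
    # incoming one; only this many positions of KV state can be
    # reused, and everything after must be re-prefilled.
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    # Never report the entire new prompt as cached: at least the last
    # token must still be processed to produce next-token logits.
    # (This cap is a common convention, assumed here for illustration.)
    return min(n, len(new_tokens) - 1) if new_tokens else 0
```

For an agent turn where only the trailing user message changes, this returns nearly the full prompt length, which is exactly the case the ~15K-token stable system prompt benefits from.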
How it works
- `find_prefix_length()` computes the common prefix with the cached state
- `prompt_cache_key` for multi-session routing

Features
- `cache.trim()` instead of raw array slicing; works with `QuantizedKVCache`, `RotatingKVCache`, and `ArraysCache` (hybrid models like Qwen3.5)
- Idle caches are evicted after `--prompt-cache-ttl` seconds (default 300s), freeing GPU memory
- `prompt_cache_key` field enables per-session cache isolation (useful for multi-user gateways)
- Cache hits are reported via `usage.input_tokens_details.cached_tokens` in responses

Configuration
- `--prompt-cache-ttl` — CLI flag; cache idle TTL in seconds (default 300, 0 disables expiry)
- `PROMPT_CACHE_TTL` — environment-variable equivalent

Tests
9 tests covering cache state lifecycle, persistence, isolation, prefix matching, and cleanup.
Related
This builds in the same direction as #946 by @trevorgordon981 (basic PromptCacheState wiring) and #944 by @damonvjanis (TurboQuant trim fix). This PR includes both of those fixes plus TTL eviction, LRU capping, per-session cache-key routing, cached_tokens reporting, graceful stale-cache recovery, and 9 tests.