
feat: prompt prefix caching with TTL eviction and TurboQuant support#995

Open
eloe wants to merge 7 commits into Blaizzy:main from eloe:upstream/prompt-cache

Conversation


eloe commented Apr 9, 2026

Summary

Adds cross-request KV cache reuse to the server, significantly reducing time-to-first-token for multi-turn conversations and agentic workflows where the system prompt is stable across turns.

How it works

  1. PromptCacheState tracks token IDs + KV cache between requests
  2. On each request, find_prefix_length() computes the common prefix with the cached state
  3. The KV cache is trimmed to the prefix length and reused, skipping re-computation of shared tokens
  4. Cache entries are keyed per model (default) or per prompt_cache_key for multi-session routing
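
As a rough illustration of step 2, the prefix computation can be sketched like this (the name `find_prefix_length` comes from the description above; the body is a minimal sketch, not the PR's actual code):

```python
def find_prefix_length(cached_tokens, new_tokens):
    """Length of the longest common prefix of two token-ID sequences.
    Illustrative sketch of the behavior described above."""
    n = min(len(cached_tokens), len(new_tokens))
    i = 0
    while i < n and cached_tokens[i] == new_tokens[i]:
        i += 1
    return i
```

On a multi-turn conversation the stable system prompt yields a long common prefix, so only the new suffix tokens need prefilling.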

Features

  • TurboQuant compatible: uses cache.trim() instead of raw array slicing, so it works with QuantizedKVCache, RotatingKVCache, and ArraysCache (hybrid models such as Qwen3.5)
  • TTL eviction: Background asyncio task evicts stale entries after --prompt-cache-ttl seconds (default 300s), freeing GPU memory
  • LRU cap: Max 64 cache entries with least-recently-used eviction to prevent unbounded memory growth
  • Cache key routing: Optional prompt_cache_key field enables per-session cache isolation (useful for multi-user gateways)
  • cached_tokens reporting: usage.input_tokens_details.cached_tokens in responses
  • Graceful recovery: Cache mismatches (shape errors, stale state) are caught and fall back to fresh generation
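
The TTL + LRU policy above could be sketched as follows (a minimal illustration of the described behavior — 64-entry cap, idle-based eviction, TTL=0 disables expiry; class and attribute names are assumptions, not the PR's implementation):

```python
from collections import OrderedDict
import time

class PromptCacheStore:
    """Sketch of the eviction policy described above, not mlx-vlm's code."""

    def __init__(self, max_entries=64, ttl=300.0):
        self.max_entries = max_entries
        self.ttl = ttl
        self._entries = OrderedDict()  # key -> (state, last_used)

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        state, _ = item
        self._entries.move_to_end(key)  # mark as most recently used
        self._entries[key] = (state, time.monotonic())
        return state

    def put(self, key, state):
        self._entries[key] = (state, time.monotonic())
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least-recently-used

    def evict_stale(self, now=None):
        if self.ttl <= 0:  # TTL=0 means no expiry
            return
        now = time.monotonic() if now is None else now
        for k in [k for k, (_, t) in self._entries.items() if now - t > self.ttl]:
            del self._entries[k]
```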

Configuration

| Flag | Env var | Default | Description |
| --- | --- | --- | --- |
| `--prompt-cache-ttl` | `PROMPT_CACHE_TTL` | `300` | Seconds before idle cache eviction; `0` disables expiry |
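
The flag/env-var precedence in the table could be wired up roughly like this (a sketch assuming argparse; the parsing details are not taken from the PR):

```python
import argparse
import os

def build_parser():
    """Sketch: the CLI flag wins; PROMPT_CACHE_TTL supplies the default;
    otherwise fall back to 300 seconds."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--prompt-cache-ttl",
        type=float,
        default=float(os.environ.get("PROMPT_CACHE_TTL", 300)),
        help="Seconds before idle cache eviction; 0 disables expiry",
    )
    return parser
```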

Tests

9 tests covering cache state lifecycle, persistence, isolation, prefix matching, and cleanup:

```shell
python -m pytest mlx_vlm/tests/ -k "PromptCache" -v
```

Related

This builds on the same direction as #946 by @trevorgordon981 (basic PromptCacheState wiring) and #944 by @damonvjanis (TurboQuant trim fix). This PR includes both of those fixes plus TTL eviction, LRU capping, per-session cache key routing, cached_tokens reporting, graceful stale-cache recovery, and 9 tests.

eloe and others added 7 commits April 8, 2026 20:32
Wire the existing PromptCacheState from generate.py into both
/v1/responses and /chat/completions endpoints. On repeated requests,
the KV cache from the previous generation is reused for matching
prefix tokens, skipping redundant prefill computation.

This is especially impactful for agentic workflows where the system
prompt (~15K tokens) is the same across requests — only new user
messages need prefilling, reducing latency from ~35s to ~2-3s on
follow-up turns.

Changes:
- Import PromptCacheState from generate.py
- Add get_prompt_cache_state() keyed by model name
- Pass prompt_cache_state to all 4 generate/stream_generate call sites
- Clear prompt cache on model unload

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The prompt cache prefix reuse code assumed all cache layers have
mx.array keys with .shape. TurboQuantKVCache stores keys as
TurboQuantMSEState objects which don't support slicing.

Now checks for .shape before attempting to trim, and falls back
to updating just the offset for quantized cache layers.

Fixes: 'TurboQuantMSEState' object has no attribute 'shape' error
when prompt caching is used with --kv-bits 3.5 --kv-quant-scheme turboquant.
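
The fallback this commit describes might look roughly like the following (attribute names are illustrative assumptions, not TurboQuantKVCache's real API; `trim(n)` is assumed to drop `n` tokens from the end, per the commit text):

```python
def reuse_cached_prefix(cache_layers, prefix_len):
    """Sketch of the guard described above: trim layers whose keys expose
    .shape; for quantized layers, just rewind the logical offset."""
    for layer in cache_layers:
        keys = getattr(layer, "keys", None)
        if keys is not None and hasattr(keys, "shape"):
            # Standard KV cache: drop everything past the shared prefix.
            layer.trim(layer.offset - prefix_len)
        else:
            # Quantized (TurboQuant-style) layer: keys are not sliceable,
            # so only the offset counter is adjusted.
            layer.offset = prefix_len
```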

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_tokens

Improve prompt prefix caching to work with all KV cache types:

- Use trim(n) method for prefix reuse instead of manual array slicing.
  Works with both standard KVCache and TurboQuantKVCache.
- Accept optional cache_key parameter for per-session cache routing
  (supports OpenClaw prompt_cache_key and Hermes patterns).
- Add cached_tokens field to GenerationResult, populated from
  reused_prefix_len for cache hit reporting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tches

Two fixes for prompt cache stability:

1. Require >= 50% prefix match (min 512 tokens) before reusing KV cache.
   Short matches on quantized caches (TurboQuant) produce corrupted
   repetitive output because trim() only adjusts offset without clearing
   stale quantized data.

2. Skip cache save for requests < 1024 tokens. Agent frameworks send
   short probe/capability-check requests that would evict the valuable
   cached system prompt KV state.

Also uses trim() method for cache trimming (TurboQuant compatible)
instead of manual array slicing.
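
The reuse gate described in point 1 reduces to a small predicate (a sketch of the stated thresholds; the function name is hypothetical):

```python
def should_reuse_prefix(prefix_len, prompt_len, min_tokens=512, min_ratio=0.5):
    """Only reuse the cached KV state when the matched prefix is at least
    min_tokens long AND covers at least min_ratio of the new prompt,
    per the thresholds described in this commit."""
    return prefix_len >= min_tokens and prefix_len >= min_ratio * prompt_len
```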

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prompt cache fixes:
- Wrap cache reuse in try/except — if trim() fails (e.g.,
  broadcast_shapes from stale KV state), invalidate cache and
  fall back to fresh generation instead of crashing
- Add cache shape validation before generate_step — detect
  seq length mismatches early and rebuild cache
- Add PromptCacheState.invalidate() method to clear stale state

Error handling:
- Sanitize all error messages sent to clients — no more raw
  MLX errors like "[broadcast_shapes] Shapes (12716) and (1840)
  cannot be broadcast" leaking through to Telegram/API users
- Streaming and non-streaming paths both sanitized
- Full errors still logged server-side for debugging

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PromptCacheState now tracks last_used and created_at timestamps.
A background asyncio task runs every 60s and evicts entries idle
longer than --prompt-cache-ttl (default 300s, configurable via
CLI or PROMPT_CACHE_TTL env var). TTL=0 disables expiry.

This prevents stale KV caches from holding GPU memory indefinitely
(e.g., a 45K-context conversation cache sitting unused for hours).
Eviction triggers gc.collect() + mx.clear_cache() to reclaim VRAM.
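
The eviction pass described here could be factored like this (a sketch under assumed names; the real code also calls mx.clear_cache() after evicting, per the message above):

```python
import asyncio
import gc
import time

def evict_idle_entries(entries, ttl, now=None):
    """One eviction pass over {key: (state, last_used)}; returns evicted keys.
    Illustrative only -- the dict layout and names are assumptions."""
    if ttl <= 0:  # TTL=0 disables expiry
        return []
    now = time.monotonic() if now is None else now
    stale = [k for k, (_, last) in entries.items() if now - last > ttl]
    for k in stale:
        del entries[k]
    if stale:
        gc.collect()  # reclaim memory; the PR also calls mx.clear_cache()
    return stale

async def eviction_loop(entries, ttl, interval=60.0, rounds=None):
    """Background task mirroring the 60 s cadence described above;
    rounds=None runs forever, a number bounds it for testing."""
    while rounds is None or rounds > 0:
        await asyncio.sleep(interval)
        evict_idle_entries(entries, ttl)
        if rounds is not None:
            rounds -= 1
```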

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Simulates actual failure scenarios:
- 13-hour idle Telegram conversation (the broadcast_shapes bug)
- Active conversation (30s between messages, never evicted)
- Multiple users with different cache keys (only stale evicted)
- TTL boundary (299s idle on 300s TTL — not evicted)
- Invalidation verification (cache/token_ids set to None)
- Short TTL for dev/testing (5s evicts after 10s idle)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gofastercloud added a commit to gofastercloud/yabby that referenced this pull request Apr 11, 2026
Two-part patch that makes prompt prefix caching work end-to-end on
mlx_vlm.server with TurboQuant KV. Measured ~24x speedup on
back-to-back agent turns and ~38x speedup on direct chat/completions.

Part 1 vendors Blaizzy/mlx-vlm#995 runtime diff (generate.py +
server.py, skipping tests) under docs/upstream-prs/03-.../pr995.diff.
apply-mlx-patches.sh applies via 'patch -p1' with a --dry-run
precheck and an idempotency guard (grep for _prompt_cache_states).

Part 2 fixes PR #995's shape-check guard: c.keys.shape[2] reflects
the TurboQuant group size (1024), not the actual token offset,
causing every cache-reuse attempt to fail and silently defeating
the cache. Fix is a ~6-line inline replacement: trust c.offset
first, fall back to shape-based comparison only when offset is
unset. Marker 'LOCAL PATCH (OpenDeclawed, 2026-04-11): TurboQuant'
makes it idempotent.

Full write-up in docs/upstream-prs/03-mlx-vlm-prompt-cache-turboquant/README.md.
@gofastercloud

+1 from production use — this PR is doing real work for us. A huge thank-you to @eloe for pushing it upstream, and one bug report attached below that I think is worth folding in before merge.

Empirical result

We're running mlx_vlm.server 0.4.4 with --kv-bits 3 --kv-quant-scheme turboquant on an M4 Pro 24 GB Mac Mini, serving mlx-community/gemma-4-26b-a4b-it-4bit as the backend for a local AI agent stack with a ~21K-token system prompt per turn. Before this PR, every agent turn re-prefilled the full system prompt at ~500 tok/s → ~42 s of prefill before any useful work. Five tool-call round-trips in a single task meant 3.5 minutes of pure prefill, routinely blowing the agent's turn timeout.

With this PR applied (plus the TurboQuant shape-check fix described below), measured on two back-to-back identical agent turns through our gateway:

| run | prompt size | duration | prefill rate |
| --- | --- | --- | --- |
| turn 1 (cold) | 25,906 tok | 144,708 ms (~2:24) | ~500 tok/s |
| turn 2 (warm) | 25,906 tok | 5,949 ms (~5.9 s) | cache hit (no prefill) |

~24× speedup on end-to-end agent turns. For direct /v1/chat/completions (non-streaming) the numbers are even cleaner: 13.9 s cold → 0.36 s warm on a 6,647-token prompt with the same prompt_cache_key. The prefill progress bars vanish from mlx-vlm.log on warm calls — the server skips straight to decode.

Without this PR, none of that is possible on our hardware. The agent-use case on a single-machine deployment becomes actually viable with prefix caching where before it was drowning in prefill time. This is exactly the kind of thing a server for local agent use needs, and I hope it lands.

Bug report — TurboQuant shape-check incompatibility

generate.py around L780 (post-patch), inside the cache validation that guards against stale KV state:

```python
if reused_prefix_len > 0:
    try:
        for c in kwargs["prompt_cache"]:
            if hasattr(c, "keys") and c.keys is not None:
                expected_seq = reused_prefix_len
                actual_seq = c.keys.shape[2] if len(c.keys.shape) >= 3 else c.offset
                if actual_seq != expected_seq:
                    raise ValueError(
                        f"Cache shape mismatch: expected seq={expected_seq}, got {actual_seq}"
                    )
```

For unquantized KV caches this is correct: c.keys is (batch, kv_heads, seq_len, head_dim) and shape[2] equals the token count, so the comparison is meaningful.

For TurboQuant-quantized KV caches, however, c.keys has a grouped/chunked layout where shape[2] reflects the group size (typically 1024), not the actual token offset. The check therefore fails on every call for quantized caches — even when the cache is byte-identical to what was just stored — and falls into the invalidate + rebuild branch. The net effect is that prompt caching looks functionally correct on any prose test (the code paths all run) but delivers zero measured speedup on a TurboQuant-configured server, with logs like this on every call:

```
[prompt_cache] Cache validation failed, rebuilding: Cache shape mismatch: expected seq=6182, got 1024
[prompt_cache] Cache validation failed, rebuilding: Cache shape mismatch: expected seq=6200, got 1024
[prompt_cache] Cache validation failed, rebuilding: Cache shape mismatch: expected seq=6218, got 1024
```

The got 1024 is the dead giveaway — that's the quant group size, not the offset.

Suggested fix

Trust c.offset first when present (it's the authoritative counter for both unquantized and quantized caches under the existing MLX cache abstractions), and only fall back to shape-based comparison when offset is unavailable:

```python
if reused_prefix_len > 0:
    try:
        for c in kwargs["prompt_cache"]:
            if hasattr(c, "keys") and c.keys is not None:
                expected_seq = reused_prefix_len
                if hasattr(c, "offset") and c.offset is not None:
                    actual_seq = c.offset
                else:
                    actual_seq = c.keys.shape[2] if len(c.keys.shape) >= 3 else 0
                if actual_seq != expected_seq:
                    raise ValueError(
                        f"Cache shape mismatch: expected seq={expected_seq}, got {actual_seq}"
                    )
```

That's the ~6-line change we're carrying locally. We went from 0× speedup (with the upstream check) to 24× speedup (with c.offset first) on an otherwise-identical setup. The non-TurboQuant path is unchanged because c.offset is already set on those caches too and points to the same value that shape[2] would have.

Happy to open a follow-up PR against your branch with the one-line fix if that's useful, or you might prefer to fold it in here since it's a one-character conceptual change and the commit already advertises TurboQuant support. Whatever's easier for you to land.

Context

We carry both this PR's full runtime diff and the TurboQuant shape-fix as local patches, vendored in-tree with a dedicated README and retirement plan. Happy to share the full apply-mlx-patches.sh harness if anyone else is running into the same TurboQuant issue — the patch file will be at docs/upstream-prs/03-mlx-vlm-prompt-cache-turboquant/ once we push it public.

Again — huge thanks for this PR. This is going to unblock a lot of single-machine agent stacks.

gofastercloud added a commit to gofastercloud/yabby that referenced this pull request Apr 11, 2026
Three documentation updates covering the 2026-04-11 round of fixes
(read tool, weather skill, session maintenance, and the prompt-cache
win from Blaizzy/mlx-vlm#995 + local TurboQuant fix).

## CHANGELOG.md

New sections under [Unreleased]:

- **Added — prompt prefix caching (2026-04-11):** vendored
  Blaizzy/mlx-vlm#995 diff at
  docs/upstream-prs/03-mlx-vlm-prompt-cache-turboquant/pr995.diff,
  plus our TurboQuant shape-check fix, plus the measured 24x
  end-to-end and ~100x prefill-rate speedups. Links to the
  upstream +1 comment at
  Blaizzy/mlx-vlm#995 (comment).
- **Added — service-restored Telegram alert (2026-04-11):** the
  fail→pass transition path in health-check.sh.
- **Fixed — skill reliability round 2 (2026-04-11):** the full
  session-bloat-masquerading-as-read-tool-bug diagnosis with the
  four co-dependent fixes (session.maintenance tuning,
  agent-exec.sh fresh-session default, weather HTTPS override,
  earlier prompt cache work).
- **Verified (2026-04-11) block** with measured durations for
  read / weather / prompt cache / allowlist / service-restored.

## docs/dress-rehearsal.md

New troubleshooting section "Tool calls always 'fail' with no error
detail" at the top of the Troubleshooting chapter. Walks through the
session-bloat diagnosis with copy-pasteable diagnostic commands (ls,
wc, grep for Metal OOM in the err log) and the manual recovery snippet
(drop the main session entry from sessions.json, archive the bloated
transcript, restart gateway). Points to commit 4f92a81 for the
durable fix.

## docs/upstream-prs/03-mlx-vlm-prompt-cache-turboquant/DURABILITY.md

New file documenting exactly which committed file is the source of
truth for each of the five fixes in the cluster, how they flow
through the next reset-local-state.sh + install.sh + wizard.py
rebuild, and what to retire when Blaizzy/mlx-vlm#995 + the
TurboQuant fix actually merge upstream. Also lists things that
deliberately WON'T survive a rebuild (hot-patched live openclaw.json,
workspace skill copies, ephemeral seed file) and why that's fine.

Retirement trail table maps each upstream PR to the specific
apply-mlx-patches.sh function to drop when it merges. The
check_upstream_mlx_prs() drift-check from update.sh (63753b6)
covers the monitoring side.