fix(anthropic): prevent cache_control overload with LiteLLM-style proxies#13477

Open
DoubleWhopperS wants to merge 1 commit into NousResearch:main from DoubleWhopperS:fix/cache-control-third-party-proxies

Conversation

@DoubleWhopperS

TL;DR

When Hermes uses provider: anthropic + a non-api.anthropic.com base_url (e.g., llm.echo.tech or any self-hosted LiteLLM proxy), every tool-heavy turn deterministically fails with:

HTTP 400 "A maximum of 4 blocks with cache_control may be provided. Found 5"

Root cause: the proxy injects its own cache_control markers server-side before forwarding to Anthropic/Bedrock. Whatever we send from the client stacks on top of what the proxy adds, blowing past the 4-breakpoint limit.

This PR adds two layered defenses:

  1. _strip_all_cache_control in prompt_caching.py — scrubs accumulated markers from session history before reapplying fresh ones each turn (prevents intra-client accumulation across turns).
  2. _cap_cache_control_markers in anthropic_adapter.py, invoked at the end of build_anthropic_kwargs — caps total markers in the outgoing request; strips all of them when _is_third_party_anthropic_endpoint(base_url) is true.

Reproduction

  1. Configure Hermes:
    model:
      default: "claude-sonnet-4-6"
      provider: "anthropic"
      base_url: "https://llm.echo.tech"    # or any LiteLLM proxy
  2. Run any multi-turn task that fires ≥4 consecutive tool calls (e.g., a skill that uses terminal repeatedly).
  3. Observe the HTTP 400 above on turn 5–7, depending on how aggressively the proxy injects markers.

Debug trail (for reviewers who want the receipts)

Narrowing the root cause took four iterations. Each is concrete: comparing the client-side marker count against the server's "Found N" number is what pinpointed the proxy as the party adding markers:

| Phase | Fix attempted | Markers in dump | Server says | Verdict |
|-------|---------------|-----------------|-------------|---------|
| 1 | Scrub accumulated markers (_strip_all_cache_control on load) | 5 → 4 | Found 5 | Still over; the client isn't the only source |
| 2 | Cap at end of build_anthropic_kwargs (max_markers=4) | 4 | Found 5 | Proxy is adding 1 |
| 3 | Cap to 3 as a stress test | 3 | Found 5 | Proxy is adding 2 on some turns |
| 4 | Cap to 0 for third-party proxies only | 0 | (no error) | ✅ pass through no markers; the proxy manages its own |
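The "Markers in dump" column comes from counting cache_control entries in the outgoing kwargs right before the request ships. A minimal counter along these lines reproduces it; the exact Hermes payload structure (system as a block list, messages as block lists) is an assumption:

```python
# Hypothetical debug helper: count cache_control markers in an outgoing
# Anthropic-style request dict. Payload shape is assumed, not from the repo.

def count_cache_control_markers(kwargs):
    total = 0
    system = kwargs.get("system")
    if isinstance(system, list):
        total += sum(1 for b in system
                     if isinstance(b, dict) and "cache_control" in b)
    for msg in kwargs.get("messages", []):
        content = msg.get("content")
        if isinstance(content, list):
            total += sum(1 for b in content
                         if isinstance(b, dict) and "cache_control" in b)
    return total
```

When this client-side number is strictly lower than the server's "Found N", the difference is exactly what the proxy injected.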

Cost: we lose client-side prompt-prefix caching hints on LiteLLM-style endpoints. That's acceptable — LiteLLM proxies typically implement their own server-side prompt caching anyway, and correctness beats an optimization that deterministically 400s.

Native Anthropic and OpenRouter paths keep the existing 4-marker strategy unchanged.

Related

A cleaner long-term fix may be to teach _anthropic_prompt_cache_policy to return (False, False) for third-party LiteLLM-style endpoints — i.e., don't emit markers at all rather than cap them after the fact. Happy to follow up with that approach if the maintainers prefer. This PR keeps the change minimal and scoped to the one call site that actually ships requests.

Test plan

  • Existing Hermes deploy on llm.echo.tech: 8-tool-call skill invocation, completed in 24s with zero HTTP 400s. Before this patch, the same skill failed on turn 5.
  • python3 -m py_compile agent/anthropic_adapter.py agent/prompt_caching.py — clean
  • Unit test to be added if maintainers request — the behavior is a post-transform on kwargs and is trivially testable.

Commit message

When talking to a third-party Anthropic-compatible endpoint (LiteLLM,
self-hosted proxies, etc.) the proxy injects its own cache_control
markers before forwarding to Anthropic/Bedrock. Whatever the client
sends stacks on top, deterministically tripping the 4-breakpoint limit
with HTTP 400 "A maximum of 4 blocks with cache_control may be provided".

- prompt_caching.py: _strip_all_cache_control scrubs accumulated markers
  from session history before reapplying fresh ones each turn.
- anthropic_adapter.py: _cap_cache_control_markers is invoked at the end
  of build_anthropic_kwargs. For third-party endpoints detected via
  _is_third_party_anthropic_endpoint(base_url), all markers are stripped
  so the proxy manages caching on its side.

Native Anthropic and OpenRouter paths keep the existing 4-marker
strategy unchanged.
