fix(anthropic): prevent cache_control overload with LiteLLM-style proxies (#13477)
Open

DoubleWhopperS wants to merge 1 commit into NousResearch:main
When talking to a third-party Anthropic-compatible endpoint (LiteLLM, self-hosted proxies, etc.), the proxy injects its own cache_control markers before forwarding to Anthropic/Bedrock. Whatever the client sends stacks on top, deterministically tripping the 4-breakpoint limit with HTTP 400 "A maximum of 4 blocks with cache_control may be provided".

- prompt_caching.py: _strip_all_cache_control scrubs accumulated markers from session history before reapplying fresh ones each turn.
- anthropic_adapter.py: _cap_cache_control_markers is invoked at the end of build_anthropic_kwargs. For third-party endpoints detected via _is_third_party_anthropic_endpoint(base_url), all markers are stripped so the proxy manages caching on its side. Native Anthropic and OpenRouter paths keep the existing 4-marker strategy unchanged.
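The stacking failure mode can be illustrated with a toy marker count over a simplified Anthropic-style payload (the helper and the payload shapes here are illustrative, not code from the PR):

```python
# Hypothetical helper: counts cache_control breakpoints in an
# Anthropic-style request, to show how client + proxy markers stack.
def count_cache_control(system, messages):
    count = sum(1 for block in system if "cache_control" in block)
    for msg in messages:
        content = msg.get("content", [])
        if isinstance(content, list):
            count += sum(
                1
                for block in content
                if isinstance(block, dict) and "cache_control" in block
            )
    return count

# The client sends the usual 4 breakpoints...
system = [{"type": "text", "text": "You are Hermes.",
           "cache_control": {"type": "ephemeral"}}]
messages = [
    {"role": "user", "content": [{"type": "text", "text": "hi",
                                  "cache_control": {"type": "ephemeral"}}]},
    {"role": "assistant", "content": [{"type": "text", "text": "hello",
                                       "cache_control": {"type": "ephemeral"}}]},
    {"role": "user", "content": [{"type": "text", "text": "run the tool",
                                  "cache_control": {"type": "ephemeral"}}]},
]
# ...and a LiteLLM-style proxy injects two more server-side before forwarding:
messages[0]["content"].append({"type": "text", "text": "(proxy)",
                               "cache_control": {"type": "ephemeral"}})
messages[-1]["content"].append({"type": "text", "text": "(proxy)",
                                "cache_control": {"type": "ephemeral"}})

print(count_cache_control(system, messages))  # 6: two past the limit of 4
```

Anthropic rejects any request whose total breakpoint count exceeds 4, regardless of which party added the markers, which is why the failure is deterministic rather than flaky.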
TL;DR
When Hermes uses provider: anthropic + a non-api.anthropic.com base_url (e.g., llm.echo.tech or any self-hosted LiteLLM), every tool-heavy turn deterministically fails with HTTP 400 "A maximum of 4 blocks with cache_control may be provided".

Root cause: the proxy injects its own cache_control markers server-side before forwarding to Anthropic/Bedrock. Whatever we send from the client stacks on top of what the proxy adds, blowing past the 4-breakpoint limit.

This PR adds two layered defenses:
- _strip_all_cache_control in prompt_caching.py: scrubs accumulated markers from session history before reapplying fresh ones each turn (prevents intra-client accumulation across turns).
- _cap_cache_control_markers in anthropic_adapter.py, invoked at the end of build_anthropic_kwargs: caps total markers in the outgoing request; strips all of them when _is_third_party_anthropic_endpoint(base_url) is true.

Reproduction
terminal repeatedly).

Debug trail (for reviewers who want the receipts)
Narrowing the root cause took three iterations. Each is concrete: comparing the client-side marker count against the server's "Found N" number is what pinpointed the proxy as the party adding the extra markers.
- _strip_all_cache_control on load)
- build_anthropic_kwargs (max_markers=4)

Cost: we lose client-side prompt-prefix caching hints on LiteLLM-style endpoints. That's acceptable: LiteLLM proxies typically implement their own server-side prompt caching anyway, and correctness beats an optimization that deterministically 400s.
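For reference, the endpoint detection could plausibly look like a hostname check; this is a sketch under that assumption, and the shipped _is_third_party_anthropic_endpoint may differ:

```python
from urllib.parse import urlparse

# Endpoints we treat as first-party Anthropic (assumption for this sketch).
KNOWN_FIRST_PARTY = {"api.anthropic.com"}

def is_third_party_anthropic_endpoint(base_url):
    """Return True when base_url points at a LiteLLM-style proxy rather
    than Anthropic itself. Sketch only; not the PR's exact implementation."""
    if not base_url:
        return False  # no override: the SDK talks to Anthropic directly
    host = urlparse(base_url).hostname or ""
    return host not in KNOWN_FIRST_PARTY

print(is_third_party_anthropic_endpoint("https://api.anthropic.com"))  # False
print(is_third_party_anthropic_endpoint("https://llm.echo.tech/v1"))   # True
```

A hostname allowlist keeps the default path (no base_url, or the official endpoint) on the existing 4-marker strategy while routing every other host through the strip-everything behavior.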
Native Anthropic and OpenRouter paths keep the existing 4-marker strategy unchanged.
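The two layered defenses described above might be sketched as follows; helper names match the PR's description, but the bodies are illustrative assumptions, not the actual diff:

```python
def strip_all_cache_control(messages):
    """Remove every cache_control marker from session history, so fresh
    markers can be reapplied each turn (sketch of _strip_all_cache_control)."""
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict):
                    block.pop("cache_control", None)
    return messages

def cap_cache_control_markers(messages, max_markers=4):
    """Keep at most max_markers cache_control entries in the outgoing
    request, dropping the earliest extras; max_markers=0 strips everything,
    which is the behavior used for third-party proxies (sketch of
    _cap_cache_control_markers)."""
    marked = [
        block
        for msg in messages
        for block in (msg.get("content") or [])
        if isinstance(block, dict) and "cache_control" in block
    ]
    excess = max(0, len(marked) - max_markers)
    for block in marked[:excess]:
        block.pop("cache_control", None)
    return messages
```

Dropping the earliest markers first preserves the most recent breakpoints, which cover the longest reusable prompt prefix on native Anthropic paths.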
Related
A cleaner long-term fix may be to teach _anthropic_prompt_cache_policy to return (False, False) for third-party LiteLLM-style endpoints, i.e., don't emit markers at all rather than cap them after the fact. Happy to follow up with that approach if the maintainers prefer. This PR keeps the change minimal and scoped to the one call site that actually ships requests.

Test plan
- Against llm.echo.tech: an 8-tool-call skill invocation completed in 24s with zero HTTP 400s. Before this patch, the same skill failed on turn 5.
- python3 -m py_compile agent/anthropic_adapter.py agent/prompt_caching.py: clean.
- The new capping logic operates on kwargs and is trivially testable.
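The cleaner long-term fix mentioned under Related could look roughly like this; the policy name comes from the PR, but the signature, the (cache_system, cache_messages) reading of the tuple, and the hostname check are all assumptions for illustration:

```python
from urllib.parse import urlparse

def anthropic_prompt_cache_policy(base_url):
    """Hypothetical sketch: decide whether to emit cache_control markers at
    all, instead of capping them after the fact. Returns a pair of flags,
    read here as (cache_system, cache_messages)."""
    host = urlparse(base_url or "https://api.anthropic.com").hostname
    if host != "api.anthropic.com":
        # Third-party LiteLLM-style proxy: emit no markers, let the proxy
        # manage caching entirely on its side.
        return (False, False)
    # Native endpoint: keep the existing 4-marker strategy upstream.
    return (True, True)
```

Deciding at policy time keeps build_anthropic_kwargs free of post-hoc stripping, at the cost of threading base_url into one more call site.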