Prompt caching is the fastest LLM cost win available right now. It requires zero changes to your feature code, activates within minutes of configuration, and consistently delivers 50–90% cost reductions on the parts of your prompt that repeat across requests — which, for any mature AI feature, is most of it.
Both OpenAI and Anthropic offer caching natively. They work differently, price differently, and have different gotchas. This guide covers both in full — with real implementation patterns, the failure modes that kill cache hit rates, and the math to know exactly what you'll save.
Key Takeaways
- OpenAI automatic caching: 50–80% off on prompts ≥1,024 tokens — no code changes required
- Anthropic explicit caching: 90% off on cache reads ($0.30/M vs $3.00/M for Claude Sonnet 4.6), but requires a cache_control marker on the content you want cached
- ProjectDiscovery boosted their cache hit rate from 7% to 84%, cutting LLM costs by 59–70% (ProjectDiscovery Engineering, 2026)
- Average prompt size grew 4× since early 2024 — from ~1,500 to 6,000 tokens (OpenRouter State of AI, 2025) — making caching more valuable every month
What Is Prompt Caching and Why Does It Matter Now?
Prompt caching stores the processed representation of a prompt prefix (typically the system prompt and any static context) in the model's key-value (KV) cache on the provider's servers. When the next request arrives with the same prefix, the provider skips reprocessing those tokens and bills them at a steep discount. The response itself is still generated fresh on every call.
The reason it matters more in 2026 than it did two years ago: prompts are getting longer. In 2025, OpenRouter's State of AI — an analysis of 100 trillion real API tokens — found that the average prompt token count per request grew nearly 4× between early 2024 and late 2025, from roughly 1,500 tokens to 6,000. Longer system prompts, richer context, larger tool definition blocks — all of them are now substantial enough to benefit from caching.
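If your system prompt is 3,000 tokens and you send 5,000 requests per day, you're currently paying to process 15 million tokens per day, or roughly 450 million tokens per month, on that single static prompt. With caching, every request that hits the cache bills those tokens at the discounted cache-read rate instead of the full input rate.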
From our data: The teams seeing the highest cache savings aren't the ones with the most clever optimization — they're the ones with the largest system prompts. A 10,000-token system prompt cached at 80% hit rate saves more per month than a perfectly compressed 500-token prompt at 100% hit rate.
This post is part of our Complete Guide to LLM API Cost Management.
OpenAI Prompt Caching: How It Actually Works
OpenAI's caching is fully automatic — no API parameters, no code changes, no opt-in required. When you send a request with a prompt prefix that matches a recently cached prefix, the system automatically charges the reduced rate.
What gets cached: The initial portion of your prompt — typically the system message and any static context that appears first. The system hashes the first N tokens and checks for a match on a server that already holds those tokens in memory.
Minimum tokens: 1,024 tokens. The cache activates in 128-token increments above that floor. A 900-token system prompt won't cache — a 1,100-token one will.
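The threshold rule can be sketched in a few lines. This only illustrates the arithmetic: cacheable_tokens is a hypothetical helper, not part of any OpenAI SDK, and it assumes the cached portion is the largest 128-token step that fits in the prompt once the 1,024-token floor is met.

```python
MIN_CACHEABLE = 1024  # floor stated in the text
INCREMENT = 128       # cache grows in 128-token steps above the floor


def cacheable_tokens(prompt_tokens: int) -> int:
    """Upper bound on how many leading prompt tokens can be served from cache."""
    if prompt_tokens < MIN_CACHEABLE:
        return 0  # below the floor, nothing is cached
    extra_steps = (prompt_tokens - MIN_CACHEABLE) // INCREMENT
    return MIN_CACHEABLE + extra_steps * INCREMENT


for n in (900, 1100, 2000):
    print(n, cacheable_tokens(n))
```

Under these assumptions a 1,100-token prompt caches its first 1,024 tokens, so padding a prompt slightly past the floor buys very little; the payoff comes from large static prefixes.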
Pricing:
- GPT-4o: $0.40/M cached input tokens vs $2.00/M standard (80% off)
- GPT-4o-mini: $0.075/M cached vs $0.15/M standard (50% off)
- GPT-4.1 family: $0.50/M cached vs $2.00/M (75% off)
TTL: 5–10 minutes of inactivity, extending to at most 1 hour during off-peak periods. As long as requests to the same prefix keep arriving before the entry expires, the cache stays warm.
Latency bonus: Up to 80% latency reduction on cached prefix tokens — the model skips the attention computation entirely for those tokens.
Source: OpenAI Prompt Caching Docs, 2026.
Anthropic Prompt Caching: Explicit Control, Bigger Discount
Anthropic's caching requires an explicit cache_control marker on the content block you want to cache — but the discount is steeper: 90% off on cache reads (Anthropic Claude Docs, February 2026).
How to enable it:
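# Python example using the Anthropic SDK
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
user_message = "Where is my order?"  # placeholder for the per-request input

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful customer support agent...",
            # Marks the end of the cacheable prefix
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)
# A non-zero cache_read_input_tokens means the prefix was served from cache
print(response.usage)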
Minimum tokens by model:
- Claude Sonnet 4.6, Sonnet 4.5, Opus 4.1: 1,024 tokens
- Claude Haiku 3.5: 2,048 tokens
- Claude Opus 4.5–4.8: 4,096 tokens
Prompts below the minimum are silently skipped — no error is thrown. If you're not seeing cache hits, check whether both cache_creation_input_tokens and cache_read_input_tokens in the response are returning 0.
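Because the skip is silent, a small check in your logging path can help. The sketch below assumes the response's usage block has been converted to a plain dict; diagnose_cache and its return labels are hypothetical names chosen for illustration, and the prompt token count and model minimum are values you supply yourself.

```python
def diagnose_cache(usage: dict, prompt_tokens: int, min_tokens: int) -> str:
    """Classify one Anthropic response by what its usage counters imply."""
    created = usage.get("cache_creation_input_tokens", 0) or 0
    read = usage.get("cache_read_input_tokens", 0) or 0
    if read > 0:
        return "hit"            # prefix was served from cache
    if created > 0:
        return "write"          # prefix was cached by this request
    if prompt_tokens < min_tokens:
        return "below-minimum"  # too short for this model, silently skipped
    return "inactive"           # long enough, but no cache activity at all


usage = {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 0}
print(diagnose_cache(usage, prompt_tokens=1500, min_tokens=2048))
```

An "inactive" result on a prompt above the minimum usually points at a missing cache_control marker rather than at prompt length.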
Pricing:
- Cache read: 0.1× base rate → $0.30/M for Claude Sonnet 4.6 (vs $3.00/M standard) — 90% off
- Cache write (5-min TTL): 1.25× base rate
- Cache write (1-hour TTL): 2× base rate
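A cache write costs more than a standard call: 0.25× extra with the 5-minute TTL and 1× extra with the 1-hour TTL. Each cache read saves 0.9× of the base rate, so a 5-minute write pays for itself on the first read and a 1-hour write on the second. At 5+ reads on the same prefix, it's clearly profitable either way.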
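The break-even point follows from the multipliers listed above. As a rough sketch, reads_to_break_even (a hypothetical helper) divides the premium paid on the write by the discount earned on each read. It assumes every read hits the same prefix before the entry expires and ignores the uncached remainder of the prompt.

```python
import math


def reads_to_break_even(write_multiplier: float, read_multiplier: float = 0.1) -> int:
    """Smallest number of cache reads that repays the write premium."""
    extra_write_cost = write_multiplier - 1.0  # premium over an uncached call
    saving_per_read = 1.0 - read_multiplier    # discount earned on each cache hit
    return math.ceil(extra_write_cost / saving_per_read)


print(reads_to_break_even(1.25))  # 5-minute TTL write
print(reads_to_break_even(2.0))   # 1-hour TTL write
```

With these multipliers the 5-minute write is recovered on the first read and the 1-hour write on the second, so the longer TTL only makes sense when traffic to the prefix is too sparse to keep a 5-minute entry warm.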
February 2026 change: As of February 5, 2026, cache is isolated at workspace level (not org level) on the Claude API. If you share an Anthropic org across teams, each workspace now has its own cache namespace. Plan your cache architecture accordingly.
How Much Will You Actually Save?
The savings depend on three variables: prompt size, request volume, and cache hit rate. Here's the formula:
Monthly savings = (prompt_tokens / 1,000,000)
× daily_requests × 30
× (standard_rate − cache_rate)
× hit_rate
Example: Claude Sonnet 4.6, 10,000-token system prompt, 2,000 requests/day, 75% hit rate:
(10,000 / 1M) × 2,000 × 30 × ($3.00 − $0.30) × 0.75
= 0.01 × 2,000 × 30 × $2.70 × 0.75
= $1,215/month saved
For a $99/month cost monitoring tool, that's 12× ROI on prompt caching alone.
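The formula translates directly into a small function you can run against your own numbers. This is a minimal sketch: monthly_savings is a hypothetical name, rates are in dollars per million tokens, and like the formula above it ignores the cache-write premium.

```python
def monthly_savings(
    prompt_tokens: int,
    daily_requests: int,
    standard_rate: float,
    cache_rate: float,
    hit_rate: float,
    days: int = 30,
) -> float:
    """Estimated monthly savings in dollars from caching a static prefix."""
    # Millions of prefix tokens sent per month
    monthly_tokens_m = prompt_tokens / 1_000_000 * daily_requests * days
    return monthly_tokens_m * (standard_rate - cache_rate) * hit_rate


# The Claude Sonnet 4.6 example from the text
print(round(monthly_savings(10_000, 2_000, 3.00, 0.30, 0.75), 2))  # 1215.0
```

Varying hit_rate is the most informative experiment, since savings scale linearly with it and it is the variable the gotchas below affect.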
In 2026, ProjectDiscovery raised their Anthropic prompt cache hit rate from 7% to 84% by restructuring prompts to move dynamic content after the static cacheable prefix, ultimately serving 9.8 billion tokens from cache in production and cutting total LLM spend by 59–70% (ProjectDiscovery Engineering, How We Cut LLM Cost With Prompt Caching, 2026).
The 5 Gotchas That Kill Your Cache Hit Rate
Implementing caching is straightforward. Getting a high hit rate is where most teams lose the savings.
1. Dynamic content in the cacheable prefix
The most common mistake is including anything that changes per request (timestamps, user names, session IDs) before the cache marker. These invalidate the cache on every single call. Move all dynamic content to the end of the prompt, after the static system prompt and tool definitions.
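One way to enforce this ordering is to build every request from a single function that keeps the static text in the cached system block and pushes per-request values into the user message. The sketch below only builds the request arguments; build_request, STATIC_SYSTEM_PROMPT, and the bracketed metadata format are hypothetical choices for illustration.

```python
STATIC_SYSTEM_PROMPT = "You are a helpful customer support agent..."  # placeholder text


def build_request(user_name: str, timestamp: str, question: str) -> dict:
    """Keep the cached prefix byte-identical across requests."""
    return {
        "system": [
            {
                "type": "text",
                "text": STATIC_SYSTEM_PROMPT,  # never interpolate into this
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {
                "role": "user",
                # Dynamic values live after the cache marker
                "content": f"[user: {user_name}] [time: {timestamp}]\n{question}",
            }
        ],
    }


a = build_request("Ana", "2026-01-01T09:00:00Z", "Where is my order?")
b = build_request("Ben", "2026-01-01T09:05:00Z", "Can I get a refund?")
print(a["system"] == b["system"])  # True
```

The returned dict could be unpacked into a call such as client.messages.create(model=..., max_tokens=..., **request), with model and max_tokens supplied by the caller.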
2. Tool definition changes
On Anthropic, any change to your tool definitions (name, description, input schema) invalidates the entire cache for that prefix. If you're doing A/B testing on tool descriptions or deploying tool updates frequently, you'll see cache misses spike. Treat tool definitions as a static artifact: version them, and don't change them between requests.
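A cheap way to notice accidental tool drift is to log a fingerprint of the tool definitions with each request. The sketch below is an assumption about how you might do that, not a provider feature: tools_fingerprint is a hypothetical helper that hashes the serialized definitions, and it deliberately does not sort keys so that any change to what is sent, including reordering, changes the fingerprint.

```python
import hashlib
import json


def tools_fingerprint(tools: list) -> str:
    """Stable hash of the tool definitions exactly as they will be serialized."""
    serialized = json.dumps(tools, separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


tools_v1 = [{
    "name": "lookup_order",
    "description": "Look up an order by ID.",
    "input_schema": {"type": "object", "properties": {"order_id": {"type": "string"}}},
}]
tools_v2 = [dict(tools_v1[0], description="Look up an order by its ID.")]

print(tools_fingerprint(tools_v1) == tools_fingerprint(tools_v1))  # True
print(tools_fingerprint(tools_v1) == tools_fingerprint(tools_v2))  # False
```

Correlating fingerprint changes with cache-miss spikes makes it easier to tell a tool deployment apart from an unrelated regression.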
3. Parallel requests before cache is warm
On Anthropic, the cache entry only becomes available after the first response begins streaming. If you send 10 parallel requests to the same prefix before that happens, all 10 will miss the cache and you'll pay for 10 cache writes instead of one write and nine reads. Pre-warm the cache with a single minimal request (for example, max_tokens: 1) and wait for it to return before your traffic arrives.
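The warm-up pattern amounts to one blocking call followed by the parallel batch. The sketch below keeps the API calls abstract so the ordering is the only thing it demonstrates: prewarm_then_fan_out, warm_up, and send are hypothetical names, and in practice both callables would wrap requests that share the same cached prefix.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def prewarm_then_fan_out(
    warm_up: Callable[[], object],
    send: Callable[[str], str],
    user_messages: Sequence[str],
    max_workers: int = 8,
) -> List[str]:
    """Run one warm-up request to completion, then send the rest in parallel."""
    warm_up()  # blocks until it returns, so the cache entry exists afterwards
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as user_messages
        return list(pool.map(send, user_messages))


# Stand-ins for real API calls, used only to show the ordering
log = []
results = prewarm_then_fan_out(
    warm_up=lambda: log.append("warm-up"),
    send=lambda msg: (log.append(msg), msg.upper())[1],
    user_messages=["a", "b", "c"],
)
print(log[0], results)
```

Waiting for the warm-up to return fully is more conservative than strictly necessary, but it avoids having to detect when the first response has started.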
4. The 20-block lookback limit (Anthropic)
Anthropic only looks back about 20 content blocks from a cache_control breakpoint when searching for prior cache entries. In long conversations, if you don't add cache_control breakpoints at multiple places, the oldest context falls out of the lookback window and misses the cache. For conversations longer than 20 blocks, add explicit breakpoints roughly every 15 blocks.
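Choosing where to put breakpoints can be reduced to a small index calculation. The sketch below works backward from the newest block in 15-block strides. It assumes a cap of four breakpoints per request, which is my understanding of Anthropic's limit and worth confirming in the current docs; breakpoint_indices is a hypothetical helper.

```python
from typing import List


def breakpoint_indices(n_blocks: int, stride: int = 15, max_breakpoints: int = 4) -> List[int]:
    """Indices of content blocks that should carry a cache_control marker."""
    indices = []
    i = n_blocks - 1  # always mark the newest block
    while i >= 0 and len(indices) < max_breakpoints:
        indices.append(i)
        i -= stride   # step back, staying inside the lookback window
    return sorted(indices)


print(breakpoint_indices(10))  # [9]
print(breakpoint_indices(50))  # [4, 19, 34, 49]
```

With a cap on breakpoints, very long conversations cannot be covered end to end, so the oldest blocks are the ones left outside the lookback window.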
5. OpenAI's >15 req/min overflow
At high concurrency on the same prefix, OpenAI's routing can split requests across multiple servers. At sustained rates above 15 req/min for the same prefix, some requests hit servers that don't have that prefix cached yet, producing unexpected cache misses. There's no workaround other than spreading load with slight delays or using the batch API.
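Spreading load can be as simple as assigning each request for a given prefix the next free time slot. The sketch below is one possible pacing scheme under the 15 req/min figure above, not an OpenAI feature: PrefixPacer and prefix_key are hypothetical names, and the caller is expected to sleep for the returned delay before sending.

```python
class PrefixPacer:
    """Hands out evenly spaced send times per prompt prefix."""

    def __init__(self, max_per_minute: int = 15):
        self.min_interval = 60.0 / max_per_minute
        self._next_slot = {}  # prefix_key -> earliest time the next request may go

    def delay_for(self, prefix_key: str, now: float) -> float:
        """Seconds the caller should wait before sending this request."""
        slot = max(now, self._next_slot.get(prefix_key, now))
        self._next_slot[prefix_key] = slot + self.min_interval
        return slot - now


pacer = PrefixPacer(max_per_minute=15)
# Three requests for the same prefix arriving at the same instant
print([pacer.delay_for("support-v3", now=0.0) for _ in range(3)])  # [0.0, 4.0, 8.0]
```

In production now would typically come from time.monotonic(); passing it in explicitly keeps the pacing logic easy to test.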
Frequently Asked Questions
How do I know if my prompts are actually being cached?
On OpenAI, check the usage object in the response for cached_tokens. On Anthropic, look for cache_creation_input_tokens and cache_read_input_tokens in the usage block. If both are 0 on every request, the cache isn't activating — usually because your prompt is below the minimum token threshold or dynamic content is appearing before the cacheable section.
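To turn those counters into a single number per request, you can compute the share of input tokens served from cache. The sketch below assumes the usage object has been converted to a plain dict (for example via the SDK object's model_dump(), if available in your SDK version). It also assumes OpenAI's prompt_tokens already includes cached tokens while Anthropic's input_tokens excludes them; cached_fraction is a hypothetical helper.

```python
def cached_fraction(usage: dict, provider: str) -> float:
    """Share of this request's input tokens that were read from cache."""
    if provider == "openai":
        total = usage.get("prompt_tokens", 0)
        details = usage.get("prompt_tokens_details") or {}
        cached = details.get("cached_tokens", 0)
    elif provider == "anthropic":
        cached = usage.get("cache_read_input_tokens", 0)
        written = usage.get("cache_creation_input_tokens", 0)
        total = usage.get("input_tokens", 0) + written + cached
    else:
        raise ValueError(f"unknown provider: {provider}")
    return cached / total if total else 0.0


openai_usage = {"prompt_tokens": 2000, "prompt_tokens_details": {"cached_tokens": 1536}}
anthropic_usage = {"input_tokens": 50, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 950}
print(cached_fraction(openai_usage, "openai"), cached_fraction(anthropic_usage, "anthropic"))
```

Averaging this value per feature over a day gives a hit-rate figure you can plug into the savings formula.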
Is prompt caching available on all models?
OpenAI: available on GPT-4o, GPT-4o-mini, GPT-4.1, and o1/o3 series. Not available on GPT-3.5-turbo. Anthropic: available on Claude Sonnet 4.6, Haiku 3.5, and Opus 4.x. Minimum token requirements vary by model: Opus 4.1 needs 1,024 tokens, while Opus 4.5 and later need 4,096. Always check the current model docs before expecting cache behavior.
Does caching affect response quality?
No. The model processes cached tokens identically to fresh tokens — it simply skips the re-computation of attention for those tokens. The output is mathematically equivalent. Caching is a compute optimization, not an approximation.
What happens to cached prompts when I update my system prompt?
Any change to the cached prefix — even a single character — invalidates the cache completely. There's no partial cache invalidation. Plan system prompt updates as deployment events: update during off-peak hours and expect a brief period of full-price requests while the new prefix warms up.
Can I cache conversation history, not just the system prompt?
Yes. Both providers support caching message history, not just the system message. On Anthropic, add cache_control to message blocks you want to cache. On OpenAI, the prefix matching is automatic — if the beginning of your messages array matches a prior cached request exactly, it caches. This is powerful for document Q&A use cases where you load the same document into context on each query.
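On Anthropic, marking history for caching means attaching cache_control to a content block inside the messages list. A minimal sketch, assuming each message has non-empty content and that string content can be rewritten as a single text block; with_history_breakpoint is a hypothetical helper that leaves the original list untouched.

```python
import copy


def with_history_breakpoint(messages: list) -> list:
    """Return a copy of messages with a cache marker on the final content block."""
    marked = copy.deepcopy(messages)
    if not marked:
        return marked
    last = marked[-1]
    if isinstance(last["content"], str):
        # cache_control lives on content blocks, so expand plain strings
        last["content"] = [{"type": "text", "text": last["content"]}]
    last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return marked


history = [
    {"role": "user", "content": "Here is the contract text: ..."},
    {"role": "assistant", "content": "Got it. What would you like to know?"},
    {"role": "user", "content": "What is the termination clause?"},
]
marked = with_history_breakpoint(history)
print(marked[-1]["content"][-1]["cache_control"])
```

Each new turn moves the marker to the latest message, so the next request can read everything up to the previous marker from cache.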
The Bottom Line
Prompt caching is the lowest-effort, highest-return optimization available for any LLM feature with a system prompt over 1,024 tokens. OpenAI gives you 50–80% off automatically. Anthropic gives you 90% off if you add one field to your API call.
The gotchas are real but avoidable. Understand them before you configure caching, and cache hit rates of 60–80% are a realistic target within a week of deployment.
Track your cache hit rates, write costs, and read costs separately so you know whether the cache is paying off — and by how much. A metering layer like Tokonomics surfaces this per-feature so you can optimize each prompt independently.
Read next: The Complete Guide to LLM API Cost Management for the full optimization playbook.
Sources: OpenAI Prompt Caching Docs | Anthropic Prompt Caching Docs | ProjectDiscovery Engineering | OpenRouter State of AI 2025 | arXiv: Don't Break the Cache
All sources retrieved June 2026.
About the authors: Written by the engineering team behind Tokonomics.