← All articles
PricingInfrastructurePlaybook

LLM Prompt Caching in 2026: The Setup, the Math, and Three Ways It Quietly Fails

LLM Prompt Caching in 2026: The Setup, the Math, and Three Ways It Quietly Fails

The marketing line on prompt caching is simple: 90% off your input tokens on cache hits. Every major provider — Anthropic, OpenAI, Google — advertises this number, and on the right workload it's literally true. The reality on most production workloads is more nuanced. The 90% is the discount on cached reads; it isn't the discount on your bill. The actual reduction depends on what fraction of your prompt is reusable, how often the same prefix is hit within the cache window, whether your cached content meets the minimum token threshold, and a handful of provider-specific gotchas that don't show up on the pricing page. For workloads that do most things right, caching delivers 50–80% reduction in input cost without changing models, evals, or routing. For workloads that get the structure wrong, it delivers nothing — or worse, costs more than not caching at all because the cache writes are billed at a premium and the reads never happen. This article walks through what the math actually looks like with real workloads, the four patterns that produce high hit rates in production, and the three failure modes that quietly kill the savings.

What caching is, and the breakeven math

The mechanic is the same across all three major providers, with implementation differences that matter at the margin. The provider stores the computed key-value tensors of a prompt prefix in fast-access memory; on subsequent requests that share the same prefix byte-for-byte, the model skips the prefill computation and reads the precomputed state. The customer-facing result is that the cached portion of the input is billed at a fraction of the standard input rate. The mechanism behind that fraction is real GPU compute being saved on the provider's side; the savings are passed through as the cache discount.

Anthropic's pricing structure is the most explicit about the trade-offs. A cache write — the first time the system stores a prefix — costs 1.25× the standard input token price for a 5-minute TTL, or 2× for a 1-hour TTL. Every subsequent cache read costs 0.1× the standard input price, a 90% discount. The breakeven math falls out cleanly: on the 5-minute cache, the write costs an extra 0.25× compared to a non-cached request, and each read saves 0.9×, so the cache pays for itself after the first read. On the 1-hour cache, the write costs an extra 1× and each read saves 0.9×, so it pays for itself after the second read. Below those thresholds, caching is more expensive than not caching; above them, every additional read is pure savings. This is why Anthropic's documentation says "caching pays off after just one cache read for the 5-minute duration, or after two cache reads for the 1-hour duration" — the math is exactly that simple.

OpenAI took a different approach: caching is automatic on all GPT-4o and newer models, with no cache_control parameter to set and no write premium to pay. Cached input tokens are charged at 10% of the standard input rate, the same 90% discount as Anthropic, but you don't pay extra for the first request — the cache write is free. The trade-off is that you have less control over what gets cached and when, and the in-memory cache evicts after 5–10 minutes of inactivity (with an extended-retention option that lasts up to 24 hours on certain models). For most workloads, the OpenAI model is operationally simpler, but it produces lower hit rates on workloads where the prefix is just barely above the 1024-token minimum or where eviction during off-peak hours is a problem.

Google Gemini's caching has the most complex pricing model and the highest ceiling of the three. Implicit caching (automatic on 2.5+ models since May 2025) works like OpenAI's: free, automatic, no guaranteed savings. Explicit caching guarantees the discount but charges a separate hourly storage fee — $1 per million cached tokens per hour for Flash models, $4.50 per million per hour for Pro models. The discount is 90% on Gemini 2.5 models and 75% on Gemini 3 Pro, applied to cache reads. The breakeven on explicit caching is therefore not "one read per write" — it's "one read per hour per million cached tokens" on Flash, or roughly four reads per hour on Pro. For workloads that hit the same large context many times per hour, Gemini's explicit caching is the cheapest of the three; for workloads that hit it once or twice, the hourly storage eats most of the savings, and implicit caching (free but no guarantee) is the better choice. The mental model is genuinely different, and it's the place where teams familiar with Anthropic or OpenAI caching get the math wrong on Gemini first.

The four patterns that actually produce high cache hit rates

Real-world cache hit rates cluster around four patterns. Hitting one of them well is the difference between a 50–80% input cost reduction and a number close to zero. The patterns are well documented in production case studies — ProjectDiscovery published a case going from 7% to 74% cache hit rate after one structural fix — but they're rarely spelled out in marketing copy.

Stable system prompts in front of variable user content. This is the most common high-hit-rate pattern. Every request includes the same multi-thousand-token system prompt at the top, followed by the user's request at the bottom. The system prompt — model instructions, tone guidelines, format specifications, examples — gets cached once and reused on every subsequent request within the cache window. A typical chat application with a 3,000-token system prompt and 500-token user messages will see 70–85% of its input tokens cached after the first request, because 3,000 of every 3,500 tokens are the stable prefix. The structural requirement is that the system prompt is genuinely identical byte-for-byte across requests; injecting a timestamp, a user ID, or a session token into the system prompt invalidates the cache and drops the hit rate to zero. Move all dynamic content into the user message, and the cache works as advertised.

RAG with stable retrieved documents. Retrieval-augmented generation workloads cache exceptionally well when the popular documents are popular enough to be retrieved repeatedly. A customer support bot that retrieves the same top-50 help articles for 80% of incoming queries will see the document portion of its prompt cached on every repeat retrieval. The structure is system_prompt + retrieved_documents + user_query, with cache_control placed after the documents. The document content might change across queries, but each individual document is identical when retrieved twice within the cache window. Production hit rates on RAG workloads typically land at 50–70% — lower than pure stable-system-prompt workloads, because document retrieval has more variance — but the documents themselves are usually larger than the system prompt, so the absolute token savings are bigger. RAG is the workload type where caching most often produces the biggest dollar reduction in absolute terms.

Tool definitions and few-shot examples in agent workflows. Multi-step agent workflows tend to call the same tool definitions and the same few-shot example library on every step of the loop. A 10-step agent run with a 5,000-token tool schema and a 3,000-token example block will cache 8,000 tokens × 9 reuses = 72,000 cached read tokens, versus 80,000 standard input tokens without caching. At Claude Sonnet rates, that's a $0.21 savings per loop on the input alone. Multiply across thousands of loops per day and the savings compound quickly. The structural requirement is to put the cache_control breakpoint after the tools and examples, so the cache covers the stable prefix without including the dynamic agent state at the bottom of the prompt. ProjectDiscovery's published case — 7% to 74% hit rate — was an agent workload where they moved one dynamic field out of the cacheable prefix. The lesson is that the cache architecture is determined by where you put the breakpoints, not by which model you choose.

Multi-turn conversations. Each turn in a conversation includes the full history of the prior turns. Without caching, this means the second turn re-sends turn 1's content, the third turn re-sends turns 1 and 2, and so on — a quadratic growth in input tokens that wrecks economics on long conversations. With caching, each turn caches the conversation history up to the most recent message, and the next turn reads it back at 10% of the input price. A 20-turn conversation with 500 input tokens per turn and 200 output tokens per turn goes from a non-cached cost of roughly 100,000 input tokens to a cached cost of roughly 18,000 input tokens — an 82% reduction on the input side. Conversations are the workload where caching most reliably produces the headline 90% savings number; the structure is naturally suited to the cache mechanism.

Three worked examples with real numbers

Numbers throughout use Anthropic's 5-minute TTL pricing on Claude Sonnet 4.6 ($3.00 input / $15.00 output base; $3.75 cache write / $0.30 cache read). The same patterns apply to OpenAI and Gemini at adjusted rates, but Anthropic's explicit pricing makes the math cleanest to see.

Workload A: Customer support chatbot with 4,000-token system prompt, 50,000 conversations/day. Each conversation averages 8 turns, with the system prompt unchanged across all conversations. Without caching, every turn re-sends the 4,000-token system prompt — that's 4,000 × 8 turns × 50,000 conversations = 1.6B input tokens per day going to the system prompt alone, plus user/assistant content on top. At $3/Mtok, system prompt cost alone is $4,800/day or about $144,000/month. With 5-minute caching: the system prompt is written once and read on every subsequent turn within 5 minutes (each conversation easily fits inside that window). Effective per-conversation: 1 cache write at $3.75/Mtok × 4K = $0.015, plus 7 cache reads at $0.30/Mtok × 4K = $0.0084. Total: $0.0234 per conversation × 50K = $1,170/day, or about $35,100/month. Savings: $108,900/month, a 76% reduction on the system prompt portion of the input cost.

Workload B: RAG-powered analysis tool with 100,000-token document context, 200 queries/day. A research analyst tool that retrieves a 100K-token policy document on most queries and asks questions against it. Without caching: 200 queries × 100K = 20M input tokens/day on the document alone, costing $60/day or $1,800/month. With 1-hour caching (because queries are spread across the work day, not bunched within 5 minutes): cache write at 2× = $6/Mtok × 100K = $0.60 per write. If the document is queried 20 times before falling out of the cache, the math is 1 write + 19 reads: $0.60 + 19 × ($0.30/Mtok × 100K) = $0.60 + $5.70 = $6.30 for 20 queries. Versus non-cached: 20 × ($3/Mtok × 100K) = $60. Savings: 90% on those 20 queries. Across 200 queries/day at this hit pattern, monthly cost drops from $1,800 to roughly $180 — exactly the 90% headline number, because the workload is structured precisely the way caching wants it. RAG with a stable document and high re-query rate is where caching's advertised number is most honest.

Workload C: Multi-step agent with 6,000-token tool schema, 30,000 loops/month. A production agent with 8 steps per loop and a tool schema that's identical across steps. Without caching: 8 × 6,000 × 30,000 = 1.44B input tokens on the tool schema alone per month, costing $4,320. With 5-minute caching (each loop fits inside 5 minutes): 1 write + 7 reads per loop at the schema portion. Per-loop schema cost: $3.75/Mtok × 6K = $0.0225 + 7 × ($0.30/Mtok × 6K) = $0.0225 + $0.0126 = $0.0351 per loop. Across 30,000 loops: $1,053. Savings: $3,267/month, a 75% reduction on the schema portion alone.

These three examples, summarized:

WorkloadVolumeWithout cachingWith cachingReduction
Support chatbot system prompt50K conv/day$144,000/mo$35,100/mo76%
RAG analysis tool documents200 q/day$1,800/mo$180/mo90%
Agent tool schemas30K loops/mo$4,320/mo$1,053/mo75%

What the math shows: caching savings are bounded by what fraction of the input is reusable, not by the headline 90% discount. Workload B hits the 90% number because the 100K-token document dominates each request; Workloads A and C hit 75–76% because dynamic user content is a meaningful share of total input. None of them hit zero, and none of them hit 95%. The realistic working range on cache-friendly workloads is 50–85% input cost reduction. Articles that quote 90%+ are quoting workload B's structural specifics; articles that quote 30–40% are usually quoting workloads where the cacheable prefix is thinner than the marketing implies.

The three quiet failures

Three failure modes that consistently surprise teams who turned on caching and didn't see the bill drop.

The minimum token threshold isn't on the marketing page. Anthropic requires a minimum cacheable prefix size that varies by model: 1,024 tokens for Haiku 4.5, 2,048 tokens for Sonnet 4.6, and 4,096 tokens for Opus 4.7. OpenAI's threshold is 1,024 tokens across all GPT-4o and newer models. Gemini's implicit caching threshold is 1,024–2,048 tokens depending on model, and explicit caching requires 32,768 tokens on Pro models. Place a cache_control breakpoint after a 1,500-token system prompt on Sonnet 4.6, and no cache gets created — the system silently treats the request as non-cached and bills you at the standard rate. The cache_creation_input_tokens field in the response will report zero, which is the only way to know it failed. Most teams don't log this field, discover the issue weeks later, and waste the intervening time thinking caching is broken when it never activated. The fix is straightforward (extend the prefix above the threshold) but the diagnosis requires knowing what to look for.

Cache position is rigid; dynamic content before the breakpoint kills the cache. Caching only works on the prefix of the prompt — content from the start of the request up to the cache_control marker. Anything after the marker is not cached. More importantly, anything dynamic before the marker invalidates the entire cache for that request. A common failure pattern: a team puts a timestamp, a user ID, or a session token at the top of the system prompt for "logging purposes," and every request now has a unique prefix that never matches a previously cached one. Hit rate stays at zero, the team blames the cache mechanism, and the fix turns out to be moving one field. ProjectDiscovery's documented case was exactly this — a single dynamic field at the wrong position dropped them from 74% theoretical hit rate to 7% actual. The discipline is: dynamic content goes at the bottom of the prompt, after the cache breakpoint; stable content goes above. This isn't on the marketing page either.

TTL gotchas, especially Anthropic's 5-minute default and Gemini's storage cost. Anthropic's default TTL is 5 minutes, which is fine for chat applications with continuous traffic and disastrous for low-traffic services. A backend service that runs once every 7 minutes has a 0% cache hit rate at the 5-minute TTL — every request triggers a fresh cache write, and the cache always evicts before the next read. The fix is the 1-hour TTL ("cache_control": {"type": "ephemeral", "ttl": "1h"}), which costs 2× cache write but holds for 60 minutes. For services with traffic patterns measured in minutes-between-requests rather than requests-per-second, the 1-hour TTL is the correct default; teams that don't change the TTL from the 5-minute default end up paying cache write premiums on every request and getting no reads. Gemini explicit caching has the inverse problem: storage costs are charged per hour regardless of whether you're using the cache, so a long TTL with low traffic accumulates storage fees that wipe out the discount on reads. The right TTL on every provider is a function of your actual traffic pattern, and the default is rarely it.

The combined effect of these three failure modes is that teams typically see 0–20% effective savings on their first caching attempt, fix the configuration, and end up at 50–85% on the second attempt. The infrastructure works as advertised; the surrounding choices determine whether the advertising matches the bill.

Cross-provider comparison

The same workload behaves differently on each provider's caching implementation. Side-by-side reference for the major dimensions:

DimensionAnthropic ClaudeOpenAIGoogle Gemini
ActivationExplicit (cache_control)AutomaticImplicit (auto) + Explicit (paid)
Cache read discount90% (0.1× input)90% (0.1× input)90% on 2.5+ / 75% on 3 Pro
Cache write premium1.25× (5min) / 2× (1hr)NoneNone (implicit) / Standard rate (explicit)
Storage costNoneNone$1–4.50/Mtok/hour (explicit only)
Min cacheable tokens1,024–4,0961,0241,024–32,768
Default TTL5 minutes5–10 minutes1 hour
Extended TTL1 hour24 hoursConfigurable
Breakeven1 read (5min) / 2 reads (1hr)Automatic~1 read/hr/Mtok stored
Cache control positionUp to 4 explicit breakpointsAuto from prompt startAt cache creation

Two points worth understanding from this table. First, OpenAI's automatic caching is the lowest-overhead option but produces lower hit rates than well-configured Anthropic explicit caching, because the system can't know which prefix you intended to cache versus which prefix happens to repeat. For workloads where the same prefix is used many times, OpenAI's hit rate is fine; for workloads where the prefix varies subtly (different system prompts per user tier, for example), explicit caching gives you more control and higher hit rate at the cost of more setup. Second, Gemini's explicit caching is the only mechanism with a separate storage cost, which makes its breakeven analysis genuinely different from the other two. For workloads with large contexts (>1M tokens cached) and low query rates, Gemini explicit caching can lose money on storage that the cached read discount doesn't recover. The mental model "cache always saves money" is wrong on Gemini in a way it isn't on Anthropic or OpenAI.

When caching saves nothing (or costs more than not caching)

Three real situations where the math says don't cache. The honest version of any caching article has to include these.

Short prompts under the minimum threshold. A workload built around 500-token user prompts and a 300-token system prompt has nothing to cache — the total prefix is under every provider's minimum. The right answer is to leave caching off and focus on cheaper models or batch processing. Adding cache_control here adds nothing but configuration noise.

Dynamic-only workloads with no repeated structure. Search query reformulation, where every prompt is a different user query against a fresh search context, has no stable prefix. Same for one-shot translation, summarization of unique input documents, or any workload where the input is genuinely different on every call. Caching does nothing because there's nothing to reuse. Routing to a cheaper model is the right cost lever for these workloads, not caching.

Low-traffic production services with sparse usage. A backend service that processes one request per 15 minutes will never hit the cache because the TTL evicts between requests. On Anthropic, the 1-hour TTL costs 2× write and never reaches breakeven if there's only one request in the hour; you've paid the write premium and gotten no reads. On Gemini explicit caching, the storage cost compounds across hours of idle time. The right answer here is to leave caching off; the workload is too sparse for the mechanism to pay off, and the configuration is just adding write costs.

A team with one or more of these patterns dominating their traffic should focus optimization elsewhere. For everyone else, caching is the highest-leverage cost optimization available, and it ships before any model migration.

Caching, routing, and batch: how they stack

These three optimizations apply to different parts of the pricing equation, and they compound. Caching reduces the per-token rate on the input side (90% off cached reads). Routing reduces the per-token rate by sending traffic to cheaper models (40–80% reduction depending on tier). Batch processing reduces the per-token rate on async work (50% off both input and output across all major providers). All three apply to different cost dimensions, so they multiply rather than substitute.

A worked example on Workload A from earlier (customer support chatbot, $144,000/month uncached). Apply caching alone: 76% reduction → $35,100/month. Apply routing on top, moving 40% of traffic to Gemini 3 Flash at 80% cheaper while keeping caching active on the Sonnet routes: combined effect is roughly $24,000/month — an 83% total reduction. Apply batch processing on the async portion of the workload (assume 30% of conversations are non-real-time, like end-of-day report generation): another 15% reduction on that portion, taking the total to roughly $21,000/month — an 85% reduction from the starting baseline.

The important point is the order. Caching first, because it requires no eval re-run and applies to whatever model is already in production. Routing second, because the eval work is real but the savings on volume routes are worth it. Batch processing third, because it's the easiest of the three to add and works orthogonally to both. A team that does only one of the three is leaving 60–70% of the available savings on the table; a team that does all three on a cache-friendly workload runs at roughly 15–20% of the original bill.

The aggregator question

A practical issue that doesn't show up in most caching articles: when you call Claude or GPT through an aggregator gateway rather than directly, does caching still work? The answer is "it depends on the aggregator," and the difference is large enough to be worth checking before committing to a gateway.

Aggregators that do caching pass-through correctly forward the cache_control parameter (for Anthropic) or rely on the upstream provider's automatic caching (for OpenAI), and report cache_creation_input_tokens and cache_read_input_tokens in the response so the customer can verify the cache is working. The discount applies to the gateway-priced tokens, which means if the gateway sells Claude Sonnet at $2.55/Mtok input (a 15% structural discount), cache reads run at $0.255/Mtok — the cache discount stacks with the structural discount. This is the configuration that actually saves money.

Aggregators that strip cache headers during translation, or that don't surface the cache metrics in the response, effectively turn caching off. The customer can't tell from the request side; only the bill reveals that 100% of input tokens are being charged at the standard rate. For a workload that would otherwise hit 70% cache utilization, this is a 60% increase in input cost — large enough to outweigh whatever per-token discount the aggregator advertises. The diagnostic is to send a known-cacheable request (long system prompt, repeated immediately) and check whether cache_read_input_tokens shows up in the response. If it doesn't, caching isn't working through that gateway.

The right question to ask any aggregator: does cache_control pass through, are cache_creation_input_tokens and cache_read_input_tokens reported in the response, and does the cached-read discount apply to the gateway's published rate? The answer should be yes on all three. If any are no, the structural pricing the gateway advertises gets eaten by the missed caching savings, and the gateway is more expensive than calling the upstream provider directly for any cache-friendly workload.

A caching playbook that actually works

Five steps, in order. Skip the audit step and the rest doesn't matter.

  1. Audit your current cache hit rate. On Anthropic, log cache_creation_input_tokens, cache_read_input_tokens, and input_tokens on every response, and compute (cache_read / (cache_read + cache_create + input)) over the last 7 days. On OpenAI, log cached_tokens / prompt_tokens. On Gemini, log cachedContentTokenCount / total_input_tokens. If the rate is below 30% on a workload that should be cache-friendly (long system prompt, RAG, or agent loops), there's a structural fix worth running before any other optimization.
  2. Identify the longest stable prefix in your prompts. What's the largest portion of your input that's identical across requests within any 5-minute window? System prompt, tool schema, retrieved documents, conversation history — pick the longest one that genuinely doesn't change. That's where the cache_control breakpoint goes.
  3. Move dynamic content below the breakpoint. Timestamps, user IDs, session tokens, request-specific metadata — relocate all of it to the user message portion of the prompt, after any cache_control marker. Re-deploy and re-measure hit rate. ProjectDiscovery went from 7% to 74% on this single change; expect double-digit improvements on similar workloads.
  4. Match TTL to traffic pattern. Continuous traffic (>1 request per minute on a given prefix) → 5-minute TTL. Bursty traffic (clusters of requests with multi-minute gaps) → 1-hour TTL on Anthropic, extended retention on OpenAI, 1-hour TTL on Gemini explicit. Sparse traffic (less than 1 request per hour) → caching probably isn't worth it; focus optimization elsewhere.
  5. Verify the discount applies if you're using an aggregator. Send a known-cacheable test request through your gateway and check the response for cache_creation_input_tokens and cache_read_input_tokens. If they don't appear, caching isn't passing through. Either fix the gateway configuration or call the provider directly for cache-sensitive workloads.

If steps 1 and 2 reveal that the workload genuinely doesn't have a cacheable prefix, the answer isn't to force caching — it's to focus on routing or batch processing instead. If they reveal a clear cacheable prefix, the savings are typically the largest single optimization available, and shipping the cache configuration is a one-day project that pays for itself in the first week.


If your team is paying full input rates on workloads with long stable prefixes — RAG, agent loops, long system prompts, multi-turn conversations — the missing optimization is usually one configuration change away. Sign in to TokenMart and the dashboard reports cache hit metrics alongside per-model spend, so the diagnostic in step 1 of the playbook becomes a chart instead of a logging project. For workloads that pass the audit and route through TokenMart, the cache discount stacks with the structural pricing — the same Claude Sonnet 4.6 that runs at $2.55/Mtok input with the structural discount runs at $0.255/Mtok on cache reads, which is the configuration most production teams haven't yet enabled. That's the version of the caching analysis worth running before the next round of cost optimization.

FAQ

How much does prompt caching actually save?
Claude and OpenAI both charge 10% of standard input price on cache hits — a 90% discount. Real-world cache hit rates in production land between 40% and 80% on cache-friendly workloads, which translates to 35–75% reduction in input cost (output cost is unchanged). Gemini's discount is also up to 90% on 2.5+ models but charges hourly storage on top, which complicates the math. On any provider, the savings depend more on how much of your prompt is reusable than on the headline discount number.
Should I use prompt caching or smart routing first?
Caching first, routing second. Caching applies to whatever model you're already using, requires no eval re-run, and saves 50–80% on input cost without touching the prompt library. Routing requires evaluating cheaper models against your task, which takes longer and carries quality regression risk. A workload running on Claude Sonnet with no caching wastes more money on uncached repeated prompts than it would on routing decisions, and the caching fix is faster to ship.
What types of workloads benefit most from prompt caching?
Workloads with large stable prefixes and frequent repeat use — RAG applications retrieving the same popular documents, agent workflows reusing the same tool definitions and system prompts, multi-turn conversations where each turn includes the prior history, and any application calling the same lengthy system prompt across many requests. The break-even on Claude is one cache read per write; the break-even on OpenAI is automatic; the break-even on Gemini explicit caching is roughly one read per hour per million cached tokens.
Why is my Claude cache hit rate low?
Three common causes. (1) Your cached prefix is below the minimum token threshold — Sonnet 4.6 requires 2,048 tokens, Opus 4.7 requires 4,096. (2) Dynamic content (timestamps, user IDs, session tokens) appears before the cache_control breakpoint, invalidating the cache. (3) The 5-minute TTL has expired between requests; if your usage cadence is irregular, the cache evicts. ProjectDiscovery documented a public case going from 7% to 74% hit rate by relocating one dynamic field — fix the prefix, not the model.
Do aggregator gateways support prompt caching?
It varies. Some pass through the cache_control parameter and report cache_creation_input_tokens and cache_read_input_tokens in the response — the caching works as if you were calling the upstream provider directly, and the discount applies to the gateway-priced tokens. Others strip the cache header during translation, in which case caching is lost. Confirm in writing before relying on it; the difference can be 50–80% of your input bill.
SAVE ON EVERY TOKENSHIP IN MINUTES★ MEMBER PRICE
OPEN 24/7

Stop paying retail for AI.

One API key. Every frontier model. Up to 75% off list price, billed to the token. Connect once. Start saving immediately.

No commitment · No minimums · Cancel anytime