What We Learned About Prompt Caching in Production

When we started optimizing our AI pipeline costs, we expected model selection to be the biggest lever. It was not. The bigger win came from a relatively small change to how we structured our prompts. This post is about that change.

How Prompt Caching Works

Most AI providers cache the prefix of your prompt and reuse it across calls. If you send the same context repeatedly, you only pay full price the first time. Subsequent calls that share that prefix get a significant discount on the cached portion, typically around 75-90% off those input tokens depending on the provider.

The catch is that caching only works on the stable part of your prompt. If your first sentence changes on every call, the cache never hits because the prefix never matches.

How We Were Doing It Wrong

We had several prompts that looked roughly like this: the first line was something specific to the current request, followed by several paragraphs of instructions and context that never changed. That is about the worst structure you can have for caching. The variable data at the top invalidated the prefix on every single call, so we were paying full price every time even though most of the prompt was identical across requests.

The fix was simple. We moved the static instructions into a system prompt and pushed the variable, request-specific data into the user prompt. The content did not change, just where it lived. The system prompt became long and stable, the user prompt became short and variable, and suddenly the cache had something consistent to work with.

Provider Differences Worth Knowing

Not all providers handle caching the same way, and it matters for how you structure your implementation.

OpenAI and Google both do automatic prefix caching. You do not have to opt in or change anything beyond getting your prompt structure right. If your prefix is stable, they cache it and apply the discount automatically.

Anthropic is different. With Claude, you explicitly mark what you want cached using cache control parameters, and you can control how long the cache persists. The tradeoff is that Anthropic charges for cache writes, so you are making a deliberate decision about what is worth caching versus what is not. For long, stable system prompts that get reused across many calls, it is clearly worth it. For shorter or more variable content, you need to think about whether the write cost is justified by the read savings.

What We Saw

We did not have usage logging in place when we first made the prompt restructuring change, so we cannot cleanly separate how much of the initial cost reduction came from caching versus the model switch we made at the same time. That is something I would do differently if starting over. But we do have logging now, and the per-call numbers are pretty clear.

On Claude Sonnet 4.6, a typical call runs about 1,519 input tokens. With caching, 1,473 of those, about 97%, are served from cache. Without caching that call would cost around $0.0063. With caching it comes in around $0.0024, a roughly 63% reduction on that call.

On Gemini 3.1 Flash-Lite, we have a prompt that runs around 10,400 input tokens. Once the cache is warm, about 8,164 of those tokens are cached, a 78% hit rate. Without caching the call costs $0.002644. With caching it drops to $0.000798, about a 70% reduction.

One thing we noticed with Gemini: the first call after a cold cache pays full price, and the cache warms on the second call. In our logs we can see two calls 26 seconds apart for the same prompt, one at full price and the next at the cached rate. At volume that cold start cost is negligible, but it is worth knowing it exists.

What I can say is that the prompt restructuring itself was not a large amount of work. We had a handful of prompts to update, and the changes were minor in each case. Moving a few lines of variable data out of the top of a user prompt is not a refactor, it is an afternoon.

The Part Most Teams Skip

A lot of teams discover caching later than they should because the default assumption is that prompt structure is about output quality, not cost. You think about what instructions to include, how to phrase them, what examples to add. You are not necessarily thinking about whether your variable data is at the top or the bottom.

If it does not change between calls, it belongs in the system prompt. Keep it stable, keep it consistent, and put your request-specific content in the user prompt where it belongs anyway. If you are on OpenAI or Google, that alone may be enough to start seeing cache hits without any further changes. If you are on Anthropic, take a few minutes to understand the cache control parameters and mark your stable content explicitly.

The effort is low and the payoff is real.