Guides

Prompt cache

When two requests share a prefix — the same system prompt, the same long document at the start of the user message, the same earlier turns of a conversation — the gateway can serve the shared portion from cache on the second call and bill it at a discounted rate. You do not need to opt in or send a cache directive on the OpenAI surface; on most catalog models the cache activates automatically.

This page explains which models cache, how to read the cache fields on the response, and how to structure prompts so the cache hits.

What we measured

Probe (2026-05-27) sent the same 2048-token prompt twice in quick succession against each catalog model, then read usage.prompt_tokens_details.cached_tokens on the second response.

Model	Cache on second call	Status
`claude-haiku-4-5`	25178 / 25686 prompt_tokens	✓
`claude-sonnet-4-6`	24901 / 25231	✓
`claude-opus-4-7`	0 / 3233	✗
`deepseek-v4-flash`	2048 / 2171	✓
`deepseek-v4-pro`	2048 / 2171	✓
`glm-5.1`	2048 / 2093	✓
`kimi-k2.6`	2048 / 2098	✓

Six of seven probed models auto-cache on the OpenAI surface. claude-opus-4-7 is the outlier: identical prompts return cached_tokens: 0. This is tracked internally as gateway bug GW-005. If you need prompt caching on opus today, use the Anthropic surface with explicit cache_control markers.

Common cache chunk size across the six working models is 2048 tokens. Prefixes shorter than that may not cache reliably.

Reading cache fields

Every chat-completions response carries the cache breakdown in usage.prompt_tokens_details:

json

{
  "usage": {
    "prompt_tokens": 25231,
    "completion_tokens": 96,
    "total_tokens": 25327,
    "prompt_tokens_details": {
      "cached_tokens": 24901,
      "text_tokens": 0,
      "audio_tokens": 0,
      "image_tokens": 0
    }
  }
}

The fields you care about:

prompt_tokens_details.cached_tokens — how many tokens of the

prompt came from cache. Multiply this by the cached-input rate (see pricing).

prompt_tokens minus cached_tokens — the non-cached portion,

billed at the regular input rate.

On claude-* models, claude_cache_creation_5_m_tokens and claude_cache_creation_1_h_tokens appear in usage when the gateway creates a cache entry from scratch. These count as cache-write costs; see Anthropic's prompt-cache pricing docs for the rate.

Claude inflation caveat

On claude-haiku-4-5 and claude-sonnet-4-6, the prompt_tokens counter inflates ~10× on cache-hit calls relative to the actual unique prompt size. The probe above shows 25686 reported prompt_tokens for a 2048-token prompt sent twice — the inflation reflects Anthropic's internal accounting (cumulative cache reads across the cache window), not a duplication bug.

Implication for cost code: do not compute billable input as prompt_tokens × input_rate. Use:

text

billable_input = (prompt_tokens - cached_tokens) × input_rate
              + cached_tokens × cached_input_rate

The non-claude models (deepseek, glm, kimi) do not exhibit this inflation; their prompt_tokens stays at the actual unique prompt size regardless of cache state.

What gets cached

The gateway caches prefixes — the leading run of tokens shared by two requests. Three patterns reliably persist a cache entry:

At request boundaries. The end of a user message and the end

of an assistant response each create a cache anchor. A multi-turn conversation that grows by one turn reuses the previous turns' anchor.

Common prefix detection. When two requests share a leading

prefix that diverges in the middle, the gateway carves the shared prefix into its own cache entry available to future requests.

Fixed token intervals. For long inputs (long system prompt,

long document context), the gateway creates intermediate cache anchors every N tokens so prefixes can hit even before reaching a message boundary.

A second request hits the cache only if its prefix exactly matches a persisted cache anchor up to the divergence point.

Structuring prompts for cache

Two practical rules:

Put stable content first. System prompt, persistent context

(a document the user is asking about), long instructions — put them at the start of the messages array. Volatile content (the current user question) goes at the end.

Keep prefixes byte-identical. A single character change in the

shared prefix (different whitespace, different timestamp embedded in the system prompt) invalidates the cache for that prefix.

Example: a customer-support assistant that loads a knowledge-base article and answers user questions about it. The article and system prompt go in messages 1 and 2 (always identical); user questions go in message 3+. Second and subsequent questions hit cache for messages 1+2.

Cache eviction

The cache works on a best-effort basis. The gateway does not guarantee a hit rate, and unused cache entries are evicted after some period (the upstream provider's window — Anthropic's 5-minute and 1-hour cache windows are reflected in the claude_cache_creation_5_m_tokens / _1_h_tokens fields).

For workloads where cache hits matter for cost, run a warmup request before the latency-sensitive call.

Cache and output determinism

Cache hits affect the input side only. Output is still generated by the model and remains subject to temperature, top_p, and the model's own non-determinism. Two cache-hit requests with the same seed and temperature: 0 should produce similar outputs; cache does not increase determinism.

Explicit cache control on the Anthropic surface

On the Anthropic surface, you can mark specific content blocks with cache_control: {"type": "ephemeral"} to force a cache entry. This is the standard Anthropic pattern; ezrouter forwards the directive to upstream claude models. The OpenAI surface does not accept cache_control — auto-caching is the only option there.

Pricing — input vs cached-input

rates per model.

Token usage — reading

usage block fields.

Anthropic API — explicit cache control.