Response Caching

Beta
Cache responses for identical API requests to save time and money
Beta

Response caching is currently in beta. The API and behavior may change.

Response caching allows you to cache responses for identical API requests. When a cached response is available, OpenRouter returns it immediately from cache with no billing (all billable usage counters are reported as 0), reducing both latency and cost.

Both streaming and non-streaming requests are eligible for caching. Only successful (200 OK) responses are cached. Error responses, rate limit responses, and partial results are never cached. Responses containing tool calls are cached normally since they are part of a successful completion. For streaming requests, the cached response is replayed through the same streaming pipeline, so the client receives the same content chunks on a cache hit. The id field, created timestamp, and X-Generation-Id response header in each chunk reflect the new cache-hit generation record, not the original.

Enabling Caching

There are two ways to enable response caching:

1. Per-Request via Headers

Add the X-OpenRouter-Cache header to enable caching for individual requests:

$curl -i https://openrouter.ai/api/v1/chat/completions \
> -H "Authorization: Bearer {{API_KEY_REF}}" \
> -H "Content-Type: application/json" \
> -H "X-OpenRouter-Cache: true" \
> -d '{
> "model": "google/gemini-2.5-flash",
> "messages":
> [
> {
> "role": "user",
> "content": "What is the meaning of life?"
> }
> ]
> }'

The first request results in a cache MISS. The response is stored and billed normally:

Response Headers (MISS)
1HTTP/2 200
2X-OpenRouter-Cache-Status: MISS
3X-OpenRouter-Cache-TTL: 300
Response Body (MISS)
1{
2 "id": "gen-abc123",
3 "model": "google/gemini-2.5-flash",
4 "choices": ["..."],
5 "usage": {
6 "prompt_tokens": 15,
7 "completion_tokens": 120,
8 "total_tokens": 135
9 }
10}

Sending the same request again returns a cache HIT with zeroed usage and no billing. Each cache hit receives its own unique generation ID (note gen-def456 below, different from the original gen-abc123):

Response Headers (HIT)
1HTTP/2 200
2X-OpenRouter-Cache-Status: HIT
3X-OpenRouter-Cache-Age: 12
4X-OpenRouter-Cache-TTL: 288
5X-Generation-Id: gen-def456
Response Body (HIT)
1{
2 "id": "gen-def456",
3 "created": 1746000012,
4 "model": "google/gemini-2.5-flash",
5 "choices": ["..."],
6 "usage": {
7 "prompt_tokens": 0,
8 "completion_tokens": 0,
9 "total_tokens": 0
10 }
11}

2. Via Presets

You can enable caching for all requests that use a specific preset by configuring these fields in the preset:

FieldTypeDescription
cache_enabledbooleanEnable caching for all requests using this preset
cache_ttl_secondsnumberDefault TTL for cached responses (1-86400 seconds, default 300)

When cache_enabled is set on a preset, caching is automatically applied to every request that references that preset. No X-OpenRouter-Cache header is required.

Example preset configuration:

1{
2 "name": "cached-tests",
3 "cache_enabled": true,
4 "cache_ttl_seconds": 600
5}

How It Works

Two requests are considered identical when they share the same API key, model, endpoint type, streaming mode, and request body (including all parameters). When caching is enabled, OpenRouter generates a cache key from these inputs. If an identical request has been made before and the cached response has not expired, the cached response is returned immediately. Changing any of these–including the model, endpoint, or switching between streaming and non-streaming–produces a different cache key and a cache miss.

Cache is scoped to your API key. Different API keys, even under the same account or organization, do not share cache. Rotating your API key will result in an empty cache for the new key.

Non-determinism: Cached responses are returned verbatim regardless of stochastic parameters like temperature. If you need fresh responses, use X-OpenRouter-Cache-Clear: true or a short TTL.

Cache Key Details

The cache key is derived from your API key, model, endpoint type, streaming mode, and a SHA-256 hash of the request body. Streaming and non-streaming requests are cached separately, so a stream: true request will not return a cached non-streaming response and vice versa. The request body is normalized before hashing, so extra whitespace does not affect the cache key. However, the property order of the JSON body is significant:

  • Different property ordering in logically identical JSON (e.g. {"model":"x","messages":[]} vs {"messages":[],"model":"x"}) will produce different cache keys
  • Omitting optional fields vs. explicitly sending defaults (e.g. temperature: 1.0) produces different keys
  • Attribution headers (e.g. HTTP-Referer, X-Title) and provider-specific headers are not part of the cache key
  • Multimodal requests (images, audio, video, file attachments) are eligible for caching. The full request body, including base64-encoded content, is included in the hash

Precedence

Request headers and preset configuration interact as follows:

  1. If a preset explicitly sets cache_enabled: false, caching is disabled regardless of request headers–the header cannot override a preset opt-out
  2. X-OpenRouter-Cache: false header disables caching even if the preset enables it
  3. X-OpenRouter-Cache: true enables caching when the preset does not configure caching (i.e. cache_enabled is absent)–but cannot override a preset that explicitly sets cache_enabled: false (rule 1 takes precedence)
  4. X-OpenRouter-Cache-TTL header overrides the preset cache_ttl_seconds (default: 300 seconds)
  5. If neither header nor preset is set, caching is off

Concurrent Requests

If two identical requests arrive simultaneously before the first response is written to cache, both result in a cache MISS and are billed independently. There is no request coalescing.

Supported Endpoints

EndpointAPI Format
/api/v1/chat/completionsOpenAI Chat Completions
/api/v1/responsesOpenAI Responses
/api/v1/messagesAnthropic Messages
/api/v1/embeddingsOpenAI Embeddings

Cache keys include an endpoint type discriminator, so requests to different endpoints with identical bodies will not collide.

Provider caching: Some providers offer their own prompt caching (e.g. Anthropic prompt caching, OpenAI cached context). Provider caching is separate from OpenRouter response caching and the two can be used together. OpenRouter caching operates at the request level before the call reaches the provider, while provider caching operates within the provider’s infrastructure.

Request Headers

HeaderValueDescription
X-OpenRouter-CachetrueEnable caching for this request
X-OpenRouter-CachefalseDisable caching for this request (overrides preset)
X-OpenRouter-Cache-TTL<seconds>Custom TTL (1-86400 seconds, default 300)
X-OpenRouter-Cache-CleartrueForce a cache refresh for this request

TTL values that cannot be parsed as an integer (i.e., do not begin with digits) are ignored and fall through to the preset or default TTL. Values beginning with digits are accepted even if they contain trailing non-numeric characters (e.g., 60abc is treated as 60); decimal values are truncated (e.g., 1.5 is treated as 1). Numeric values outside the valid range are clamped to [1, 86400].

Response Headers

HeaderValueDescription
X-OpenRouter-Cache-StatusHIT or MISSWhether the response was served from cache
X-OpenRouter-Cache-Age<seconds>How long the response has been cached (on HIT only)
X-OpenRouter-Cache-TTL<seconds>Remaining TTL on HIT; full TTL on MISS

The X-Generation-Id header is also present on every response (cached or not) and is not specific to caching. On a cache hit, the generation ID is unique to that hit–it is not reused from the original response.

TTL (Time-to-Live)

The TTL controls how long a cached response remains valid.

  • Default: 300 seconds (5 minutes)
  • Range: 1 second to 86400 seconds (24 hours)

You can customize the TTL per-request using the X-OpenRouter-Cache-TTL header, or set a default TTL in your preset configuration.

Cache Clearing

To force a fresh response for a specific request, send the X-OpenRouter-Cache-Clear: true header alongside X-OpenRouter-Cache: true (or with a preset that has cache_enabled: true). This deletes the existing cached entry for that cache key, makes a new request to the provider, and stores the new response. X-OpenRouter-Cache-Clear has no effect unless caching is enabled for the request. This does not clear all cached entries–only the one matching the current request.

The new cache entry uses the TTL from the current request’s X-OpenRouter-Cache-TTL header, the preset cache_ttl_seconds, or the default (300 seconds), following the standard precedence rules.

Billing

Cache hits are free. No tokens are consumed and all billable usage counters are reported as 0. For chat completions and Responses endpoints, usage.prompt_tokens, usage.completion_tokens, and usage.total_tokens are zeroed. For the Embeddings endpoint, usage.prompt_tokens and usage.total_tokens are zeroed (completion_tokens is not present in embeddings responses). For the Anthropic Messages endpoint, usage.input_tokens and usage.output_tokens are zeroed. You are only billed for the original request that populates the cache (a cache MISS).

Cache hits do not count toward provider rate limits since the request never reaches a provider.

Limitations

  • Disabled for account-level Zero Data Retention (ZDR): Response caching is not available when account-level ZDR is enforced, since caching requires temporarily storing response data. Per-request provider.zdr does not affect cache eligibility.
  • Concurrent identical requests: If two identical requests arrive before the first response is cached, both result in a MISS. See Concurrent Requests.
  • Cache eviction: Cached responses may be evicted before TTL expiry under memory pressure. There is no limit on the number of entries you can cache, but eviction under pressure means entries are not guaranteed to survive their full TTL.

Data Retention

Cached responses are stored in edge infrastructure, retained only for the TTL duration, and automatically evicted upon expiry. Cached data is accessible only via the API key that triggered the caching–no other key, account, or organization can retrieve it. Cached data is not used for training or shared with third parties.

Use Cases

Agent Workflows

When an agent workflow fails partway through, you can resume from the point of failure without re-running and re-paying for identical earlier requests. Enable caching at the start of the workflow and all prior steps return immediately from cache on retry.

Unit Testing

Get repeatable responses for your test suite. After the initial run populates the cache, subsequent identical requests return the same cached response every time at zero cost. For deterministic first-run results, use temperature: 0 or a fixed seed.

Repeated Identical Requests

If your application makes the same request multiple times (same model, same messages, same parameters), caching ensures only the first call hits the provider. Subsequent identical calls return immediately from cache at zero cost.

Monitoring Cache Effectiveness

Cache hit and miss status is visible in your Activity log. Each cached request appears as a separate entry with a cache indicator, and you can filter the log to show only cached or non-cached requests. Every cache hit receives its own unique generation ID, so you can track individual cached responses independently.