Prompt Caching: Cost & Performance Analysis Across Providers

Prompt caching is a critical new innovation for language model inference - saving developers up to 90% and making long context inputs suddenly viable. Compare features and pricing across all major AI providers below.

Caching requires exact prompt matches and varies by provider - some like OpenAI and DeepSeek offer automatic caching, while others including Google, Anthropic, and Amazon require manual setup. Learn more about how it works in our introduction to prompt caching below.

Pricing

Pricing: Input, Cached Hit and Output

Price: USD per 1M tokens
Reasoning models are indicated by a lightbulb icon

Price per token included in the request/message sent to the API, represented as USD per million Tokens.

Price per token for cached prompts (previously processed), typically offering a significant discount compared to regular input price, represented as USD per million tokens. The values shown here are the cache hit price; cache write and cache storage are billed separately and vary by provider — see "Cache pricing by provider" for detail.

The blended bar shown here uses cache hit price only. Other caching costs differ by provider:

  • Anthropic: charges a separate cache write fee, with different rates for 5-minute and 1-hour TTLs (1-hour TTL is more expensive). Blended price charts use Anthropic cache write price for the input leg.
  • Google (Vertex/Gemini): charges a per-hour cache storage fee in addition to cache hit pricing. Some providers also use tiered pricing for prompts above 200K tokens.
  • OpenAI, DeepSeek, others: typically charge only cache hit pricing with no write or storage fee.

See Prompt Caching for the full breakdown.

Price per token generated by the model (received from the API), represented as USD per million Tokens.

Figures represent performance of the model's first-party API (e.g. OpenAI for o1) or the median across providers where a first-party API is not available (e.g. Meta's Llama models).

Prompt Caching API Specifications

Provider / ModelInput (standard)Cache writeCache hitCache storageOutput (standard)Auto-EnabledMin tokensCache TTLNotes
OpenAI
OpenAI
  • Cache read tokens are 50% cheaper than base input tokens
  • Cache persists up to one hour during off-peak periods
GPT-5.5 (xhigh)$5.00-$0.50-$30.00
10245-10 minutes
GPT-5.5 (high)$5.00-$0.50-$30.00
10245-10 minutes
GPT-5.5 (medium)$5.00-$0.50-$30.00
10245-10 minutes
GPT-5.5 (low)$5.00-$0.50-$30.00
10245-10 minutes
GPT-5.5 (Non-reasoning)$5.00-$0.50-$30.00
10245-10 minutes
GPT-5.4 mini (xhigh)$0.75-$0.07-$4.50---
GPT-5.4 nano (xhigh)$0.20-$0.02-$1.25---
GPT-5.4 nano (medium)$0.20-$0.02-$1.25---
GPT-5.4 mini (medium)$0.75-$0.07-$4.50---
GPT-5.4 nano (Non-Reasoning)$0.20-$0.02-$1.25---
GPT-5.4 mini (Non-Reasoning)$0.75-$0.07-$4.50---
GPT-5.4 (xhigh)$2.50-$0.25-$15.00---

Tiered pricing:

  • ≤272K:

    • Cache hit: $0.25
  • 272K:

    • Cache hit: $0.5
GPT-5.4 (low)$2.50-$0.25-$15.00
10245-10 minutes

Tiered pricing:

  • ≤272K:

    • Cache hit: $0.25
  • 272K:

    • Cache hit: $0.5
GPT-5.4 (Non-reasoning)$2.50-$0.25-$15.00---

Tiered pricing:

  • ≤272K:

    • Cache hit: $0.25
  • 272K:

    • Cache hit: $0.5
GPT-5.3 Codex (xhigh)$1.75-$0.17-$14.00---
GPT-5.2 (xhigh)$1.75-$0.17-$14.00---
GPT-5.2 Codex (xhigh)$1.75-$0.17-$14.00---
GPT-5.2 (medium)$1.75-$0.17-$14.00---
GPT-5.2 (Non-reasoning)$1.75-$0.17-$14.00---
GPT-5.1 (high)$1.25-$0.13-$10.00
10245-10 minutes
GPT-5.1 (Non-reasoning)$1.25-$0.13-$10.00
10245-10 minutes
GPT-5 Codex (high)$1.25-$0.13-$10.00
10245-10 minutes
GPT-5 (high)$1.25-$0.13-$10.00
10245-10 minutes
GPT-5 (medium)$1.25-$0.13-$10.00
10245-10 minutes
GPT-5 mini (high)$0.25-$0.03-$2.00
10245-10 minutes
GPT-5 (low)$1.25-$0.13-$10.00
10245-10 minutes
GPT-5 mini (medium)$0.25-$0.03-$2.00
10245-10 minutes
GPT-5 nano (high)$0.05-$0.01-$0.40
10245-10 minutes
GPT-5 nano (medium)$0.05-$0.01-$0.40
10245-10 minutes
GPT-5 (minimal)$1.25-$0.13-$10.00
10245-10 minutes
GPT-5 (ChatGPT)$1.25-$0.13-$10.00
10245-10 minutes
GPT-5 mini (minimal)$0.25-$0.03-$2.00
10245-10 minutes
GPT-5 nano (minimal)$0.05-$0.01-$0.40
10245-10 minutes
o3$2.00-$0.50-$8.00
10245-10 minutes
o4-mini (high)$1.10-$0.28-$4.40
10245-10 minutes
GPT-4.1$2.00-$0.50-$8.00
10245-10 minutes
GPT-4.1 mini$0.40-$0.10-$1.60
10245-10 minutes
GPT-4.1 nano$0.10-$0.03-$0.40
10245-10 minutes
o3-mini$1.10-$0.55-$4.40
10245-10 minutes
o3-mini (high)$1.10-$0.55-$4.40
10245-10 minutes
o1$15.00-$7.50-$60.00
10245-10 minutes
GPT-4o (Nov '24)$2.50-$1.50-$10.00
10245-10 minutes
GPT-4o (Aug '24)$2.50-$1.25-$10.00
10245-10 minutes
GPT-4o mini$0.15-$0.07-$0.60
10245-10 minutes
Anthropic
Anthropic
  • Cache read tokens are 90% cheaper than base input tokens
  • Cache write tokens are 25% more expensive than base input tokens
Claude Opus 4.7 (Adaptive Reasoning, Max Effort)$6.25$6.25$0.50-$25.00---
Claude Opus 4.7 (Non-reasoning, High Effort)$6.25$6.25$0.50-$25.00---
Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)$3.75$3.75$0.30-$15.00---

1h cache write: $6

Claude Sonnet 4.6 (Non-reasoning, High Effort)$3.75$3.75$0.30-$15.00---

1h cache write: $6

Claude Sonnet 4.6 (Non-reasoning, Low Effort)$3.75$3.75$0.30-$15.00---

1h cache write: $6

Claude Opus 4.6 (Adaptive Reasoning, Max Effort)$6.25$6.25$0.50-$25.00---

1h cache write: $10

Claude Opus 4.6 (Non-reasoning, High Effort)$6.25$6.25$0.50-$25.00---

1h cache write: $10

Claude Opus 4.5 (Reasoning)$6.25$6.25$0.50-$25.00
10245 minutes

1h cache write: $10

Claude Opus 4.5 (Non-reasoning)$6.25$6.25$0.50-$25.00
10245 minutes

1h cache write: $10

Claude 4.5 Haiku (Reasoning)$1.25$1.25$0.10-$5.00
40965 minutes

1h cache write: $2

Claude 4.5 Haiku (Non-reasoning)$1.25$1.25$0.10-$5.00
40965 minutes

1h cache write: $2

Claude 4.5 Sonnet (Reasoning)$3.75$3.75$0.30-$15.00
10245 minutes

Tiered pricing:

  • ≤200K:

    • Cache hit: $0.30
    • 5m cache write: $3.75
    • 1h cache write: $6.00
  • 200K:

    • Cache hit: $0.60
    • 5m cache write: $7.50
    • 1h cache write: $12.00
Claude 4.5 Sonnet (Non-reasoning)$3.75$3.75$0.30-$15.00
10245 minutes

Tiered pricing:

  • ≤200K:

    • Cache hit: $0.30
    • 5m cache write: $3.75
    • 1h cache write: $6.00
  • 200K:

    • Cache hit: $0.60
    • 5m cache write: $7.50
    • 1h cache write: $12.00
Claude 4.1 Opus (Reasoning)$18.75$18.75$1.50-$75.00
10245 minutes

1h cache write: $30

Claude 4.1 Opus (Non-reasoning)$18.75$18.75$1.50-$75.00
10245 minutes

1h cache write: $30

Claude 4 Opus (Reasoning)$18.75$18.75$1.50-$75.00
10245 minutes

1h cache write: $30.0

Claude 4 Sonnet (Reasoning)$3.75$3.75$0.30-$15.00
10245 minutes

Tiered pricing:

  • ≤200K:

    • Cache hit: $0.30
    • 5m cache write: $3.75
    • 1h cache write: $6.00
  • 200K:

    • Cache hit: $0.60
    • 5m cache write: $7.50
    • 1h cache write: $12.00
Claude 4 Sonnet (Non-reasoning)$3.75$3.75$0.30-$15.00
10245 minutes

Tiered pricing:

  • ≤200K:

    • Cache hit: $0.30
    • 5m cache write: $3.75
    • 1h cache write: $6.00
  • 200K:

    • Cache hit:$0.60
    • 5m cache write: $7.50
    • 1h cache write: $12.00
Claude 4 Opus (Non-reasoning)$18.75$18.75$1.50-$75.00
10245 minutes

1h cache write: $30

Claude 3 Opus$18.75$18.75$1.50-$75.00
10245 minutes
Google
Google
  • Google supports caching for Gemini models and Anthropic's Claude models.
  • Pricing and usage differs between model families.
Gemini 3.1 Flash-Lite Preview$0.25-$0.03$1.00$1.50
--
Gemini 3.1 Pro Preview$2.00-$0.20$4.50$12.00
--

Tiered pricing:

  • ≤200K:

    • Cache hit: $0.20
  • 200K:

    • Cache hit: $0.40
Gemini 3.1 Pro Preview$2.00-$0.20$4.50$12.00
--

Tiered pricing:

  • ≤200K:

    • Cache hit: $0.20
  • 200K:

    • Cache hit: $0.40
Gemini 3 Flash Preview (Reasoning)$0.50-$0.05$1.00$3.00
204860 minutes
Gemini 3 Flash Preview (Non-reasoning)$0.50-$0.05$1.00$3.00
204860 minutes
Gemini 3 Pro Preview (high)$2.00-$0.20-$12.00
--

Tiered pricing:

  • ≤200K:

    • Cache hit: $0.20
  • 200K:

    • Cache hit: $0.40
Gemini 3 Pro Preview (high)$2.00-$0.20-$12.00
--

Tiered pricing:

  • ≤200K:

    • Cache hit: $0.20
  • 200K:

    • Cache hit: $0.40
Gemini 3 Pro Preview (low)$2.00-$0.20-$12.00
--

Tiered pricing:

  • ≤200K:

    • Cache hit: $0.20
  • 200K:

    • Cache hit: $0.40
Gemini 3 Pro Preview (low)$2.00-$0.20-$12.00
--

Tiered pricing:

  • ≤200K:

    • Cache hit: $0.20
  • 200K:

    • Cache hit: $0.40
Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning)$0.10-$0.01$1.00$0.40
204860 minutes
Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning)$0.10-$0.01$1.00$0.40
204860 minutes
Gemini 2.5 Flash-Lite (Reasoning)$0.10-$0.01$1.00$0.40
204860 minutes
Gemini 2.5 Flash-Lite (Non-reasoning)$0.10-$0.01$1.00$0.40
204860 minutes
Gemini 2.5 Pro$1.25-$0.13$4.50$10.00
204860 minutes
Gemini 2.5 Pro$1.25-$0.13$4.50$10.00
204860 minutes
Gemini 2.5 Flash (Reasoning)$0.30-$0.03$1.00$2.50
204860 minutes
Gemini 2.5 Flash (Reasoning)$0.30-$0.03$1.00$2.50
204860 minutes
Gemini 2.5 Flash (Non-reasoning)$0.30-$0.03$1.00$2.50
204860 minutes
Gemini 2.5 Flash (Non-reasoning)$0.30-$0.03$1.00$2.50
204860 minutes
Gemini 2.5 Pro Preview (May' 25)$1.25-$0.13$4.50$10.00
204860 minutes
Gemini 2.0 Flash (Feb '25)$0.15-$0.03$1.00$0.60
204860 minutes
Claude Opus 4.7 (Adaptive Reasoning, Max Effort)$6.25$6.25$0.50-$25.00---
Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)$3.75$3.75$0.30-$15.00---

1h cache write: $6

Claude Sonnet 4.6 (Non-reasoning, High Effort)$3.75$3.75$0.30-$15.00---

1h cache write: $6

Claude Opus 4.6 (Adaptive Reasoning, Max Effort)$6.25$6.25$0.50-$25.00---

1h cache write: $10

Claude Opus 4.6 (Non-reasoning, High Effort)$6.25$6.25$0.50-$25.00---

1h cache write: $10

Claude Opus 4.5 (Reasoning)$6.25$6.25$0.50-$25.00
10245 minutes

1h cache write: $10

Claude Opus 4.5 (Non-reasoning)$6.25$6.25$0.50-$25.00
10245 minutes

1h cache write: $10

Claude 4.5 Haiku (Reasoning)$1.25$1.25$0.10-$5.00
40965 minutes

1h cache write: $2

Claude 4.5 Haiku (Non-reasoning)$1.25$1.25$0.10-$5.00
40965 minutes

1h cache write: $2

Claude 4.5 Sonnet (Reasoning)$3.75$3.75$0.30-$15.00
10245 minutes

Tiered pricing:

  • ≤200K:

    • Cache hit: $0.30
    • 5m cache write: $3.75
    • 1h cache write: $6.00
  • 200K:

    • Cache hit: $0.60
    • 5m cache write: $7.50
    • 1h cache write: $12.00
Claude 4.5 Sonnet (Non-reasoning)$3.75$3.75$0.30-$15.00
10245 minutes

Tiered pricing:

  • ≤200K:

    • Cache hit: $0.30
    • 5m cache write: $3.75
    • 1h cache write: $6.00
  • 200K:

    • Cache hit: $0.60
    • 5m cache write: $7.50
    • 1h cache write: $12.00
Claude 4.1 Opus (Reasoning)$18.75$18.75$1.50-$75.00
10245 minutes

1h cache write: $30

Claude 4.1 Opus (Non-reasoning)$18.75$18.75$1.50-$75.00
10245 minutes

1h cache write: $30

Claude 4 Opus (Reasoning)$18.75$18.75$1.50-$75.00
10245 minutes

1h cache write: $30

Claude 4 Sonnet (Reasoning)$3.75$3.75$0.30-$15.00
10245 minutes

Tiered pricing:

  • ≤200K:

    • Cache hit: $0.30
    • 5m cache write: $3.75
    • 1h cache write: $6.00
  • 200K:

    • Cache hit: $0.60
    • 5m cache write: $7.50
    • 1h cache write: $12.00
Claude 4 Opus (Non-reasoning)$18.75$18.75$1.50-$75.00
10245 minutes

1h cache write: $30

Claude 4 Sonnet (Non-reasoning)$3.75$3.75$0.30-$15.00
10245 minutes

Tiered pricing:

  • ≤200K:

    • Cache hit: $0.30
    • 5m cache write: $3.75
    • 1h cache write: $6.00
  • 200K:

    • Cache hit:$0.60
    • 5m cache write: $7.50
    • 1h cache write: $12.00
Claude 3.7 Sonnet (Non-reasoning)$3.75$3.75$0.30-$15.00
10245 minutes
Claude 3.5 Haiku$1.00$1.00$0.08-$4.00
20485 minutes
GLM-5 (Reasoning)$1.00-$0.10-$3.20---
DeepSeek V3.2 (Reasoning)$0.56-$0.06-$1.68---
xAI
xAI

-Prompt caching is not 100% guaranteed.

Grok 4.3$1.25$1.25$0.20-$2.50---

For requests greater than 200k tokens, pricing is $2.50 per 1M input tokens, $0.40 per 1M cached input tokens, and $5.00 per 1M output tokens

Grok 4.20 0309 v2 (Reasoning)$2.00-$0.20-$6.00---
Grok 4.20 0309 v2 (Non-reasoning)$2.00-$0.20-$6.00---
Grok 4.20 0309 (Reasoning)$2.00-$0.20-$6.00---
Grok 4.20 0309 (Non-reasoning)$2.00-$0.20-$6.00---
Grok 4.1 Fast (Reasoning)$0.20-$0.05-$0.50---
Grok 4 Fast (Reasoning)$0.20-$0.05-$0.50
--
Grok 4 Fast (Non-reasoning)$0.20-$0.05-$0.50
--
Grok Code Fast 1$0.20-$0.02-$1.50
--
Grok 4$3.00-$0.75-$15.00
--
Grok 3 mini Reasoning (high)$0.30-$0.07-$0.50
--
Grok 3 mini Reasoning (high)$0.60-$0.07-$4.00
--
Grok 3$3.00-$0.75-$15.00
--
Grok 3$5.00-$0.07-$25.00
--
Amazon Bedrock
Amazon Bedrock
  • Amazon supports caching for Nova models and Anthropic's Claude models.
  • Pricing and usage differs between model families.
Nova 2.0 Pro Preview (medium)$1.25-$0.31-$10.00---
Nova Premier$2.50-$0.62-$12.50
10005 minutes
Nova Pro$0.80-$0.20-$3.20
10005 minutes

Only supported on US East (N. Virginia) region

Nova Lite$0.06-$0.01-$0.24
10005 minutes

Only supported on US East (N. Virginia) region

Nova Micro$0.04-$0.01-$0.14
10005 minutes

Only supported on US East (N. Virginia) region

Claude Opus 4.7 (Adaptive Reasoning, Max Effort)$6.25$6.25$0.50-$25.00---
Claude Opus 4.7 (Non-reasoning, High Effort)$5.00-$0.50-$25.00---
Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)$3.75$3.75$0.30-$15.00---

1h cache write: $6

Claude Sonnet 4.6 (Non-reasoning, High Effort)$3.75$3.75$0.30-$15.00---

1h cache write: $6

Claude Opus 4.6 (Adaptive Reasoning, Max Effort)$6.25$6.25$0.50-$25.00---

1h cache write: $10

Claude Opus 4.6 (Non-reasoning, High Effort)$6.25$6.25$0.50-$25.00---

1h cache write: $10

Claude Opus 4.5 (Reasoning)$6.25$6.25$0.50-$25.00
10245 minutes

1h cache write: $10

Claude Opus 4.5 (Non-reasoning)$6.25$6.25$0.50-$25.00
10245 minutes

1h cache write: $10

Claude 4.5 Haiku (Reasoning)$1.25$1.25$0.10-$5.00
40965 minutes
Claude 4.5 Haiku (Non-reasoning)$1.25$1.25$0.10-$5.00
40965 minutes
Claude 4.5 Sonnet (Reasoning)$4.12$4.12$0.30-$15.00
10245 minutes

Tiered pricing:

  • ≤200K:

    • Cache hit: $0.30
    • 5m cache write: $3.75
    • 1h cache write: $6.00
  • 200K:

    • Cache hit: $0.60
    • 5m cache write: $7.50
    • 1h cache write: $12.00
Claude 4.5 Sonnet (Non-reasoning)$4.12$4.12$0.30-$15.00
10245 minutes

Tiered pricing:

  • ≤200K:

    • Cache hit: $0.30
    • 5m cache write: $3.75
    • 1h cache write: $6.00
  • 200K:

    • Cache hit: $0.60
    • 5m cache write: $7.50
    • 1h cache write: $12.00
Claude 4.1 Opus (Reasoning)$18.75$18.75$1.50-$75.00
10245 minutes
Claude 4.1 Opus (Non-reasoning)$18.75$18.75$1.50-$75.00
10245 minutes
Claude 4 Opus (Reasoning)$18.75$18.75$1.50-$75.00
10245 minutes
Claude 4 Sonnet (Reasoning)$3.75$3.75$0.30-$15.00
10245 minutes

Tiered pricing:

  • ≤200K:

    • Cache hit: $0.30
    • 5m cache write: $3.75
  • 200K:

    • Cache hit: $0.60
    • 5m cache write: $7.50
Claude 4 Opus (Non-reasoning)$18.75$18.75$1.50-$75.00
10245 minutes
Claude 4 Sonnet (Non-reasoning)$3.75$3.75$0.30-$15.00
10245 minutes

Tiered pricing:

  • ≤200K:

    • Cache hit: $0.30
    • 5m cache write: $3.75
  • 200K:

    • Cache hit:$0.60
    • 5m cache write: $7.50
Claude 3.5 Haiku$1.00$1.00$0.08-$4.00
20485 minutes
  • 1h cache write: $1.60
  • Only supported on US West (Oregon) region
Claude 3.5 Sonnet (Oct '24)$3.75$3.75$0.30-$15.00
10245 minutes

Only supported on US West (Oregon) region

Claude 3.5 Sonnet (June '24)$3.75$3.75$0.30-$15.00
10245 minutes

Only supported on US West (Oregon) region

Microsoft Azure
Microsoft Azure

OpenAI models:

  • Cache read tokens are 50% cheaper than base input tokens (Standard deployments)
  • Cache persists up to one hour during off-peak periods
Claude Opus 4.7 (Adaptive Reasoning, Max Effort)$6.25$6.25$0.50-$25.00---
Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)$3.75$3.75$0.30-$15.00---

1h cache write: $6

Claude Sonnet 4.6 (Non-reasoning, High Effort)$3.75$3.75$0.30-$15.00---

1h cache write: $6

Claude Opus 4.6 (Adaptive Reasoning, Max Effort)$6.25$6.25$0.50-$25.00---

1h cache write: $10

Claude Opus 4.6 (Non-reasoning, High Effort)$6.25$6.25$0.50-$25.00---

1h cache write: $10

GPT-5.4 mini (xhigh)$0.75-$0.07-$4.50---
GPT-5.4 (xhigh)$2.50-$0.25-$15.00---

Tiered pricing:

  • ≤272K:

    • Cache hit: $0.25
  • 272K:

    • Cache hit: $0.5
GPT-5.2 (xhigh)$1.75-$0.17-$14.00---
GPT-5.2 Codex (xhigh)$1.75-$0.18-$14.00---
GPT-5.1 (high)$1.25-$0.13-$10.00---
GPT-5 Codex (high)$1.25-$0.13-$10.00---
GPT-5 (high)$1.25-$0.13-$10.00
10245-10 minutes
GPT-5 (medium)$1.25-$0.13-$10.00
10245-10 minutes
GPT-5 mini (high)$0.25-$0.03-$2.00
10245-10 minutes
GPT-5 (low)$1.25-$0.13-$10.00
10245-10 minutes
GPT-5 mini (medium)$0.25-$0.03-$2.00
10245-10 minutes
GPT-5 nano (high)$0.05-$0.01-$0.40
10245-10 minutes
GPT-5 nano (medium)$0.05-$0.01-$0.40
10245-10 minutes
GPT-5 (minimal)$1.25-$0.13-$10.00
10245-10 minutes
GPT-5 mini (minimal)$0.25-$0.03-$2.00
10245-10 minutes
GPT-5 nano (minimal)$0.05-$0.01-$0.40
10245-10 minutes
o3$2.00-$0.50-$8.00
10245-10 minutes
o4-mini (high)$1.10-$0.28-$4.40
10245-10 minutes
GPT-4.1$2.00-$0.50-$8.00
10245-10 minutes
GPT-4.1 mini$0.40-$0.10-$1.60
10245-10 minutes
GPT-4.1 nano$0.10-$0.03-$0.40
10245-10 minutes
o3-mini$1.10-$0.55-$4.40
10245-10 minutes
o3-mini (high)$1.10-$0.55-$4.40
10245-10 minutes
o1$15.00-$7.50-$60.00
10245-10 minutes
GPT-4o (Nov '24)$2.50-$1.25-$10.00
10245-10 minutes
o1-preview$16.50-$8.25-$66.00
10245-10 minutes
GPT-4o (Aug '24)$2.50-$1.25-$10.00
10245-10 minutes
GPT-4o mini$0.15-$0.07-$0.60
10245-10 minutes
GPT-4o (May '24)$5.00-$2.50-$15.00
10245-10 minutes
DeepSeek
DeepSeek
  • Cache read tokens are 50% cheaper on average (up to 90% with cache optimization)
  • Implements Context Caching on Disk technology
  • No guarantee of 100% cache hits
DeepSeek V4 Pro (Reasoning, Max Effort)$1.74-$0.01-$3.48
64-
DeepSeek V4 Pro (Reasoning, High Effort)$1.74-$0.01-$3.48
64-
DeepSeek V4 Flash (Reasoning, Max Effort)$0.14-$0.00-$0.28
64-
DeepSeek V4 Flash (Reasoning, High Effort)$0.14-$0.00-$0.28
64-
DeepSeek V4 Pro (Non-reasoning)$1.74-$0.01-$3.48
--
DeepSeek V4 Flash (Non-reasoning)$0.14-$0.00-$0.28
--
DeepSeek V3.2 (Reasoning)$0.28-$0.01-$0.42---
DeepSeek V3.2 Exp (Reasoning)$0.28-$0.03-$0.42
64-
DeepSeek V3.2 Exp (Non-reasoning)$0.28-$0.03-$0.42
64-
Alibaba Cloud
Alibaba Cloud

-Alibaba offers two cache types: implcit and explicit.

-Implicit cache is automatically enabled. It is billed at 20% of the standard input token price.

-Explicit cache must be activated. It creates a cache for specific content to ensure a deterministic hit within its 5-minute validity period. Tokens used to create the cache are billed at 125% of the standard input token price, while subsequent cache hits are billed at 10% of that price.

Qwen3.6 Max Preview$1.30$1.63$0.13-$7.80---
Qwen3.6 Plus$0.50$0.63$0.05-$3.00---
Qwen3 Max Thinking$1.20-$0.12-$6.00---
Baseten
Baseten
NVIDIA Nemotron 3 Super 120B A12B (Reasoning)$0.30-$0.06-$0.75---
GLM-5 (Reasoning)$0.95-$0.20-$3.15---
GLM-5 (Non-reasoning)$0.95-$0.25-$3.15---
GLM-4.7 (Reasoning)$0.60-$0.12-$2.20---
GLM-4.7 (Non-reasoning)$0.60-$0.12-$2.20---
Kimi K2.5 (Reasoning)$0.60-$0.12-$3.00---
Kimi K2.5 (Non-reasoning)$0.60-$0.12-$3.00---
DeepSeek V3.1 (Non-reasoning)$0.50-$0.25-$1.50---
Cloudflare
Cloudflare

-Prefix caching is enabled by deafult. To maximize cache hit rates, a header must be send.

Kimi K2.6$0.95-$0.16-$4.00---
CoreWeave
CoreWeave
Kimi K2.6$0.95-$0.16-$4.00---
Kimi K2.5 (Reasoning)$0.50-$0.10-$2.85---
Gemma 4 31B (Reasoning)$0.30-$0.13-$1.25---
Databricks
Databricks
GPT-5.2 (xhigh)$1.75$1.75$0.17-$14.00---
GPT-5.1 (high)$1.25$1.25$0.13-$10.00---
GPT-5 (high)$1.25$1.25$0.13-$10.00---
Gemini 3 Pro Preview (high)$2.00$2.50$0.25-$12.00---
DeepInfra
DeepInfra

-Prompt caching is automatic — no extra parameters required.

DeepSeek V4 Pro (Reasoning, Max Effort)$1.74-$0.14-$3.48---
DeepSeek V4 Pro (Reasoning, High Effort)$1.74-$0.14-$3.48---
DeepSeek V4 Flash (Reasoning, Max Effort)$0.14-$0.03-$0.28---
DeepSeek V4 Flash (Reasoning, High Effort)$0.14-$0.03-$0.28---
DeepSeek V3.2 (Reasoning)$0.26-$0.13-$0.38---
DeepSeek V3.2 (Non-reasoning)$0.26-$0.13-$0.38---
DeepSeek V3.1 Terminus (Non-reasoning)$0.21-$0.13-$0.79---
DeepSeek V3.1 (Non-reasoning)$0.21-$0.13-$0.79---
DeepSeek R1 0528 (May '25)$0.50-$0.35-$2.15---
DeepSeek V3 0324$0.20-$0.13-$0.77---
Kimi K2.6$0.75-$0.15-$3.50---
Kimi K2.5 (Reasoning)$0.45-$0.07-$2.25---
Qwen3.6 35B A3B (Non-reasoning)$0.19-$0.20-$1.00---
Qwen3.5 122B A10B (Reasoning)$0.29-$0.14-$2.90---
Qwen3.5 35B A3B (Reasoning)$0.22-$0.10-$2.20---
Qwen3.5 122B A10B (Non-reasoning)$0.29-$0.14-$2.40---
Qwen3.5 35B A3B (Non-reasoning)$0.18-$0.10-$1.00---
Qwen3.5 397B A17B (Reasoning)$0.54-$0.27-$3.40---
Qwen3.5 397B A17B (Non-reasoning)$0.54-$0.27-$3.40---
Qwen3 235B A22B 2507 (Reasoning)$0.23-$0.20-$2.30---
Qwen3 Coder 480B A35B Instruct$0.30-$0.02-$1.00---
Qwen2.5 Instruct 72B$0.36-$0.12-$0.40---
GLM-5.1 (Reasoning)$1.05-$0.26-$3.50---
GLM-5.1 (Non-reasoning)$1.40-$0.26-$4.40---
GLM-5 (Reasoning)$0.80-$0.16-$2.56---
GLM-5 (Non-reasoning)$0.60-$0.12-$2.08---
GLM-4.7-Flash (Reasoning)$0.06-$0.01-$0.40---
GLM-4.7 (Reasoning)$0.40-$0.08-$1.75---
GLM-4.7 (Non-reasoning)$0.40-$0.08-$1.75---
Gemma 4 31B (Reasoning)$0.13-$0.02-$0.38---
MiniMax-M2.5$0.15-$0.03-$1.15---
Fireworks
Fireworks

-Prompt caching is enabled by default.

-The default discount is 50%, but the exact discount varies by model.

DeepSeek V4 Pro (Reasoning, Max Effort)$1.74-$0.14-$3.48---
DeepSeek V4 Pro (Reasoning, High Effort)$1.74-$0.14-$3.48
--
DeepSeek V3.2 (Reasoning)$0.56-$0.28-$1.68---
Kimi K2.6$0.95-$0.16-$4.00---
Kimi K2.5 (Reasoning)$0.45-$0.07-$2.25---
Kimi K2.5 (Non-reasoning)$0.45-$0.10-$2.25---
GLM-5.1 (Reasoning)$1.40-$0.26-$4.40---
GLM-5 (Reasoning)$1.00-$0.20-$3.20---
GLM-5 (Non-reasoning)$1.00-$0.20-$3.20---
MiniMax-M2.7$0.30-$0.06-$1.20---
MiniMax-M2.5$0.30-$0.03-$1.20---
FriendliAI
FriendliAI
GLM-5.1 (Reasoning)$1.40-$0.26-$4.40---
GLM-5.1 (Non-reasoning)$1.40-$0.26-$4.40---
GLM-5 (Reasoning)$1.00-$0.50-$3.20---
MiniMax-M2.5$0.30-$0.06-$1.20---
Kimi K2.5 (Reasoning)$0.60-$0.60-$3.00---
Groq
Groq

-The minimum cacheable prompt length varies by model, ranging from 128 to 1024 tokens depending on the specific model used.

-All cached data automatically expires after 2 hours without use.

gpt-oss-120B (high)$0.15-$0.07-$0.60
-2 hrs
Inception
Inception
Mercury 2$0.25-$0.03-$0.75---
Kimi
Kimi
Kimi K2.6$0.95-$0.16-$4.00---
Kimi K2.6 (Non-reasoning)$0.95-$0.16-$4.00---
Kimi K2.5 (Reasoning)$0.60-$0.10-$3.00---
Kimi K2.5 (Non-reasoning)$0.60-$0.10-$3.00---
Kimi K2 Thinking$1.15-$0.15-$8.00---
Kimi K2 Thinking$0.60-$0.15-$2.50---
Kimi K2$0.60-$0.15-$2.50---
MiniMax
MiniMax

-MiniMax supports Anthropic API compatible caching that is managed through explicit cache_control settings.

MiniMax-M2.7$0.30$0.38$0.06-$1.20---
MiniMax-M2.5$0.30$0.38$0.03-$1.20---
Novita
Novita
DeepSeek V4 Pro (Reasoning, Max Effort)$1.74-$0.14-$3.48---
DeepSeek V4 Pro (Reasoning, High Effort)$1.74-$0.14-$3.48---
DeepSeek V4 Flash (Reasoning, Max Effort)$0.14-$0.03-$0.28---
DeepSeek V4 Flash (Reasoning, High Effort)$0.14-$0.03-$0.28---
DeepSeek V3.2 (Reasoning)$0.27-$0.13-$0.40---
Ling 2.6 Flash$0.10-$0.02-$0.30---
Kimi K2.6$0.95-$0.16-$4.00---
Kimi K2.5 (Reasoning)$0.60-$0.10-$3.00---
Kimi K2.5 (Non-reasoning)$0.60-$0.10-$3.00---
GLM-5.1 (Reasoning)$1.40-$0.26-$4.40---
GLM-5.1 (Non-reasoning)$1.40-$0.26-$4.40---
GLM-5 (Reasoning)$1.00-$0.20-$3.20---
GLM-5 (Non-reasoning)$1.00-$0.20-$3.20---
MiniMax-M2.7$0.30-$0.06-$1.20---
MiniMax-M2.5$0.30-$0.03-$1.20---
Qwen3.5 35B A3B (Reasoning)$0.25-$0.25-$2.00---
KAT-Coder-Pro V1$0.30-$0.04-$1.20---
Parasail
Parasail
DeepSeek V4 Flash (Reasoning, Max Effort)$0.14-$0.07-$0.28---
DeepSeek V4 Flash (Reasoning, High Effort)$0.14-$0.07-$0.28---
DeepSeek V3.2 (Reasoning)$0.28-$0.13-$0.45---
Kimi K2.6$0.60-$0.20-$2.80---
Kimi K2.5 (Reasoning)$0.50-$0.20-$2.50---
Qwen3.6 35B A3B (Reasoning)$0.35-$0.20-$2.00---
Qwen3.6 35B A3B (Non-reasoning)$0.35-$0.20-$2.00---
Qwen3.5 397B A17B (Reasoning)$0.60-$0.30-$3.60---
Qwen3 Coder Next$0.15-$0.10-$0.80---
Qwen3 VL 235B A22B Instruct$0.21-$0.10-$1.90---
Qwen3 Next 80B A3B Instruct$0.25-$0.07-$1.10---
Qwen3 235B A22B 2507 Instruct$0.15-$0.05-$0.85---
GLM-5.1 (Reasoning)$1.40-$0.26-$4.40---
GLM-5.1 (Non-reasoning)$1.40-$0.26-$4.40---
GLM-5 (Reasoning)$1.00-$0.20-$3.20---
GLM-4.7 (Reasoning)$0.45-$0.11-$2.10---
GLM-4.7 (Non-reasoning)$0.45-$0.11-$2.10---
Gemma 4 31B (Reasoning)$0.14-$0.07-$0.40---
Gemma 4 26B A4B (Reasoning)$0.13-$0.05-$0.40---
Gemma 3 27B Instruct$0.25-$0.04-$0.40---
Trinity Large Thinking$0.22-$0.06-$0.85---
MiniMax-M2.5$0.17-$0.03-$1.20---
gpt-oss-120B (high)$0.10-$0.06-$0.75---
gpt-oss-120B (low)$0.15-$0.06-$0.60---
Llama 4 Maverick$0.35-$0.17-$0.85---
Llama 3.3 Instruct 70B$0.22-$0.11-$0.50---
SiliconFlow
SiliconFlow
DeepSeek V4 Pro (Reasoning, Max Effort)$1.74-$0.14-$3.48
--
DeepSeek V4 Pro (Reasoning, High Effort)$1.74-$0.14-$3.48---
DeepSeek V4 Flash (Reasoning, Max Effort)$0.14-$0.03-$0.28---
DeepSeek V4 Flash (Reasoning, High Effort)$0.14-$0.03-$0.28---
DeepSeek V3.2 (Reasoning)$0.27-$0.14-$0.42---
DeepSeek V3.2 (Non-reasoning)$0.27-$0.14-$0.42---
Kimi K2.6$1.40-$0.26-$4.40---
Kimi K2.5 (Reasoning)$0.23-$0.07-$3.00---
GLM-5.1 (Reasoning)$1.40-$0.26-$4.40---
GLM-5.1 (Non-reasoning)$1.40-$0.26-$4.40---
GLM-5 (Reasoning)$1.00-$0.20-$3.20---
GLM-5 (Non-reasoning)$1.00-$0.20-$3.20---
GLM-4.7 (Reasoning)$0.42-$0.11-$2.20---
GLM-4.7 (Non-reasoning)$0.42-$0.11-$2.20---
MiniMax-M2.5$0.20-$0.03-$1.00---
StreamLake
StreamLake
KAT Coder Pro V2$0.30-$0.06-$1.20---
Together.ai
Together.ai
DeepSeek V4 Pro (Reasoning, Max Effort)$2.10-$0.20-$4.40---
DeepSeek V4 Pro (Reasoning, High Effort)$2.10-$0.20-$4.40---
MiniMax-M2.7$0.30-$0.06-$1.20---
MiniMax-M2.5$0.30-$0.06-$1.20---
Qwen3.5 397B A17B (Reasoning)$0.60-$0.60-$3.60---
Xiaomi
Xiaomi
MiMo-V2.5-Pro$1.00-$0.20-$3.00---
MiMo-V2.5$0.40-$0.08-$2.00---
MiMo-V2.5-Pro (Non-reasoning)$1.00-$0.20-$3.00---
MiMo-V2-Omni-0327$0.40-$0.08-$2.00---
MiMo-V2-Flash (Feb 2026)$0.10-$0.01-$0.30---

Introduction to Prompt Caching

What is Prompt Caching?

Prompt caching is a critical new innovation for language model inference - saving developers up to 90% and making long context inputs suddenly viable. Getting your approach to caching right can deliver huge cost savings on input tokens and meaningful performance benefits.

When you send a prompt, the system first checks if that exact prompt has been processed before. If found (cache hit), it returns the stored response instead of generating a new one. If not found (cache miss), the prompt is processed normally, and the response is stored for future use.

Key Metrics to Watch

  • Input Price: The standard price you pay for input tokens
  • Cache Write Price: What you pay to save prompt tokens into the cache; sometimes higher than standard input pricing
  • Cache Hit Price: Discounted rate for prompt tokens that hit the cache
  • Cache Storage Price: Hourly cost per million cached tokens (currently unique to Google)
  • Cache TTL: The time cached tokens remain available, ranging from hours to days
  • Cache Minimum Tokens: Minimum matching token count required before a cache hit is served

How Does Prompt Caching Work?

When you send a prompt to a transformer-based language model, the attention layers process each input token into key (K) and value (V) vectors that are stored in the KV cache. By keeping these values in memory, processing on input tokens can be avoided when identical input tokens are sent into the model again.

Until recently, leveraging the speed and cost benefits of caching was only available for dedicated deployments. Now, serverless API providers—including the frontier labs—have begun passing on some of the cost benefits of caching to developers.

Optimal Use Cases

  • System instructions: Large system prompts that must be included across many interactions
  • Chat history: Conversation context that accompanies each new user turn
  • Per-user personalized context: Extensive user memories or profiles that enable deep personalization

Implementation Considerations

  • Activation method varies by provider - some require manual setup while others offer automatic caching
  • Cache hit discounts range from 50-90% off standard input token pricing - this really is worth the time to get right
  • Caching improves performance for very long prompts (50k+ tokens); benchmarks coming soon