Skip to content

Prompt Caching

Candela supports Anthropic prompt caching out of the box — automatically injecting cache control headers, tracking cache hit/miss metrics, and calculating the true cost of cached vs. uncached tokens.

When enabled, Anthropic caches the system prompt and early conversation turns at the API level. Subsequent requests that share the same prefix receive a cache hit, dramatically reducing both latency and cost.

Candela manages this transparently:

  1. Injects cache_control markers into eligible message blocks
  2. Tracks cache_creation_input_tokens and cache_read_input_tokens from the response
  3. Applies the correct pricing multiplier based on your TTL setting
  4. Reports cache savings in the dashboard and trace details

Anthropic offers two cache TTL (time-to-live) options with different pricing:

TTLWrite CostRead CostBest For
5 minutes (default)1.25× input price0.1× input priceShort interactive sessions, chat
1 hour2.0× input price0.1× input priceLong coding sessions, agents, batch processing

For Claude Sonnet 4 (claude-sonnet-4-20250514, $3/MTok input):

ScenarioTokensTTLCost
Cache write (first request)10,0005m$0.0375 (10K × $3 × 1.25 / 1M)
Cache write (first request)10,0001h$0.06 (10K × $3 × 2.0 / 1M)
Cache read (subsequent)10,000any$0.003 (10K × $3 × 0.1 / 1M)
No cache (baseline)10,000$0.03 (10K × $3 / 1M)

After just 2 cache reads, the 5m TTL breaks even. After 3 reads, the 1h TTL breaks even — and you get 12× longer cache retention.

In Settings → Prompt Caching, toggle between:

  • Standard (5 min) — lower upfront cost, suitable for short sessions
  • Extended (1 hour) — higher upfront cost, ideal for long coding sessions with Claude Code

The setting takes effect immediately for all subsequent proxy requests.

Toggle the cache TTL programmatically:

Terminal window
# Set 1-hour TTL
curl -X POST http://localhost:8181/_local/api/config \
-H "Content-Type: application/json" \
-d '{"anthropic_cache_ttl": "1h"}'
# Check current config
curl http://localhost:8181/_local/api/config

Set the default in ~/.config/candela/config.yaml:

proxy:
anthropic:
cache_mode: auto # off | auto | system-only
cache_ttl: 5m # 5m (default) | 1h
cache_modeBehavior
offNo cache headers injected
autoCache system prompt + early turns automatically
system-onlyOnly cache the system prompt

Candela tracks cache performance across all Anthropic requests:

MetricDescription
Cache hit ratePercentage of input tokens served from cache
Cache savingsUSD saved vs. full-price input tokens
Write tokensTokens written to cache (charged at 1.25× or 2.0×)
Read tokensTokens read from cache (charged at 0.1×)

These metrics appear in:

  • Dashboard — aggregate cache savings in the cost overview
  • Trace detail — per-request cache breakdown
  • Models view — per-model cache hit rates

For Google Gemini models, cached content is priced as a fraction of the standard input rate. Configure the price multiplier:

Terminal window
# Set Gemini cache price multiplier (0.25 = cached tokens cost 25% of base price)
curl -X POST http://localhost:8181/_local/api/config \
-H "Content-Type: application/json" \
-d '{"gemini_cache_discount": 0.25}'

The current multiplier is reflected in the GET /_local/api/config response, so clients can display the active configuration.

SymptomCauseFix
Cache hit rate is 0%cache_mode set to offSet to auto in config or desktop settings
High write costsUsing 1h TTL with short sessionsSwitch to 5m TTL if sessions are under 5 minutes
Cache not persistingTTL expired between requestsIncrease TTL or reduce time between requests
Cost shows $0 for cache tokensModel not in pricing tableCheck server logs for unrecognized model warnings