Prompt Caching
Candela supports Anthropic prompt caching out of the box — automatically injecting cache control headers, tracking cache hit/miss metrics, and calculating the true cost of cached vs. uncached tokens.
How Anthropic Prompt Caching Works
Section titled “How Anthropic Prompt Caching Works”When enabled, Anthropic caches the system prompt and early conversation turns at the API level. Subsequent requests that share the same prefix receive a cache hit, dramatically reducing both latency and cost.
Candela manages this transparently:
- Injects
cache_controlmarkers into eligible message blocks - Tracks
cache_creation_input_tokensandcache_read_input_tokensfrom the response - Applies the correct pricing multiplier based on your TTL setting
- Reports cache savings in the dashboard and trace details
Cache TTL Modes
Section titled “Cache TTL Modes”Anthropic offers two cache TTL (time-to-live) options with different pricing:
| TTL | Write Cost | Read Cost | Best For |
|---|---|---|---|
| 5 minutes (default) | 1.25× input price | 0.1× input price | Short interactive sessions, chat |
| 1 hour | 2.0× input price | 0.1× input price | Long coding sessions, agents, batch processing |
Cost Calculation Example
Section titled “Cost Calculation Example”For Claude Sonnet 4 (claude-sonnet-4-20250514, $3/MTok input):
| Scenario | Tokens | TTL | Cost |
|---|---|---|---|
| Cache write (first request) | 10,000 | 5m | $0.0375 (10K × $3 × 1.25 / 1M) |
| Cache write (first request) | 10,000 | 1h | $0.06 (10K × $3 × 2.0 / 1M) |
| Cache read (subsequent) | 10,000 | any | $0.003 (10K × $3 × 0.1 / 1M) |
| No cache (baseline) | 10,000 | — | $0.03 (10K × $3 / 1M) |
After just 2 cache reads, the 5m TTL breaks even. After 3 reads, the 1h TTL breaks even — and you get 12× longer cache retention.
Configuring Cache TTL
Section titled “Configuring Cache TTL”Candela Desktop
Section titled “Candela Desktop”In Settings → Prompt Caching, toggle between:
- Standard (5 min) — lower upfront cost, suitable for short sessions
- Extended (1 hour) — higher upfront cost, ideal for long coding sessions with Claude Code
The setting takes effect immediately for all subsequent proxy requests.
Runtime API
Section titled “Runtime API”Toggle the cache TTL programmatically:
# Set 1-hour TTLcurl -X POST http://localhost:8181/_local/api/config \ -H "Content-Type: application/json" \ -d '{"anthropic_cache_ttl": "1h"}'
# Check current configcurl http://localhost:8181/_local/api/configConfig File
Section titled “Config File”Set the default in ~/.config/candela/config.yaml:
proxy: anthropic: cache_mode: auto # off | auto | system-only cache_ttl: 5m # 5m (default) | 1hcache_mode | Behavior |
|---|---|
off | No cache headers injected |
auto | Cache system prompt + early turns automatically |
system-only | Only cache the system prompt |
Cache Metrics in the Dashboard
Section titled “Cache Metrics in the Dashboard”Candela tracks cache performance across all Anthropic requests:
| Metric | Description |
|---|---|
| Cache hit rate | Percentage of input tokens served from cache |
| Cache savings | USD saved vs. full-price input tokens |
| Write tokens | Tokens written to cache (charged at 1.25× or 2.0×) |
| Read tokens | Tokens read from cache (charged at 0.1×) |
These metrics appear in:
- Dashboard — aggregate cache savings in the cost overview
- Trace detail — per-request cache breakdown
- Models view — per-model cache hit rates
Gemini Cache Price Multiplier
Section titled “Gemini Cache Price Multiplier”For Google Gemini models, cached content is priced as a fraction of the standard input rate. Configure the price multiplier:
# Set Gemini cache price multiplier (0.25 = cached tokens cost 25% of base price)curl -X POST http://localhost:8181/_local/api/config \ -H "Content-Type: application/json" \ -d '{"gemini_cache_discount": 0.25}'The current multiplier is reflected in the GET /_local/api/config response, so clients can display the active configuration.
Troubleshooting
Section titled “Troubleshooting”| Symptom | Cause | Fix |
|---|---|---|
| Cache hit rate is 0% | cache_mode set to off | Set to auto in config or desktop settings |
| High write costs | Using 1h TTL with short sessions | Switch to 5m TTL if sessions are under 5 minutes |
| Cache not persisting | TTL expired between requests | Increase TTL or reduce time between requests |
| Cost shows $0 for cache tokens | Model not in pricing table | Check server logs for unrecognized model warnings |