System Prompt Caching
System prompt caching (and context caching) allows you to cache large blocks of static context—such as system instructions, boilerplate codebase references, background documentation, or long conversation histories—at the LLM provider level. Subsequent requests that reuse this context bypass full-cost token parsing, resulting in significantly reduced API costs and sub-second time-to-first-token latency.
Candela integrates with prompt caching transparently, auto-normalizing cache metrics and applying the correct pricing discounts directly to user budget calculations.
Provider Comparison
Section titled “Provider Comparison”While the goal of prompt caching is the same for all providers, the implementation details, billing structures, and minimum token thresholds differ significantly.
| Feature | Anthropic Claude | Google Gemini (1.5/2.5/3.x) |
|---|---|---|
| Caching Technology | Prompt Caching (Prefix Caching) | Context Caching (Persistent Memory) |
| Minimum Prompt Size | 1,024 tokens (Sonnet/Haiku) 2,048 tokens (Opus) | 32,768 tokens (All models) |
| Write Cost (Cache Create) | 1.25× base price (5m TTL) 2.0× base price (1h TTL) | No surcharge (Free to create) |
| Read Cost (Cache Hit) | 90% off base price (0.1× cost) | 75% off base price (0.25× cost) |
| Cache Lifetime (TTL) | Sliding window: 5 mins to 1 hour | User-configurable (Default: 300s / 5m) |
| Ideal Use Case | Fast developer loops, agent actions | Large documents, codebase context, media analysis |
How It Works
Section titled “How It Works”Anthropic Claude
Section titled “Anthropic Claude”When a request is sent, Candela automatically injects cache_control headers into eligible sections of your messages (e.g. system prompt and early turns) if cache_mode is set to auto. Anthropic charges an upfront write surcharge to build the cache, but subsequent turns inside the TTL enjoy a 90% read discount.
Google Gemini
Section titled “Google Gemini”Google Gemini caching is managed as Context Caching in Vertex AI. Because Google does not charge a write surcharge, creating a cache is extremely cost-effective for large payloads. Cached tokens are charged at a flat 75% discount off standard input rates. Candela automatically reads Google’s cachedContentTokenCount response metadata to apply the discount.
Configuring Caching in Candela
Section titled “Configuring Caching in Candela”You can configure caching defaults in your configuration file, or dynamically adjust them at runtime.
Configuration File (config.yaml)
Section titled “Configuration File (config.yaml)”Edit your config.yaml to define default caching behaviors:
proxy: vertex_ai: prompt_caching: true # Enable cache header injection for Anthropic cache_ttl: 5m # TTL for Claude: 5m (1.25x write) or 1h (2.0x write)
# Gemini Caching Discount Override # 0.25 = 75% off cached tokens (Default, matches Google Vertex AI list prices) # 0.00 = Cached tokens are free # 1.00 = No cache discount applied gemini_cache_discount: 0.25Runtime API Updates
Section titled “Runtime API Updates”You can toggle caching settings on-the-fly without restarting the candela-server proxy:
# Set Anthropic cache TTL to 1 hour (ideal for long-running agent tasks)curl -X POST http://localhost:8181/_local/api/config \ -H "Content-Type: application/json" \ -d '{"proxy": {"vertex_ai": {"cache_ttl": "1h"}}}'# Update Gemini cache discount to 90% (0.10) for custom corporate pricing agreementscurl -X POST http://localhost:8181/_local/api/config \ -H "Content-Type: application/json" \ -d '{"gemini_cache_discount": 0.10}'Cost Optimization Strategy
Section titled “Cost Optimization Strategy”To get the most out of system prompt caching:
- Structure your Prompts: Put static instructions, system definitions, tools/functions, and reference documents at the very beginning of your prompt. Put the fast-moving user query at the very end.
- Combine Small System Prompts: If your system prompt is just under the 1,024/2,048 token threshold for Claude, consider adding developer guidelines or schemas to push it past the minimum size and activate caching.
- Choose the Right TTL:
- Use 5 minutes for quick chat sessions.
- Use 1 hour for developer loops (e.g., using Cline/Zed/Cursor with Candela) where files are constantly re-read over an extended period.
- Use Gemini for Large Multi-modal Files: If you are feeding entire PDFs or codebase dumps (exceeding 32k tokens), routing them to Gemini models utilizing Vertex AI Context Caching will yield the highest cost savings since there is no cache write surcharge.