Pricing & Cost Calculator
Candela’s cost engine calculates the USD cost of every proxied LLM request based on token counts and model pricing. Pricing is built-in for popular models and can be overridden via config for enterprise discounts.
How It Works
Section titled “How It Works”- Proxy captures
input_tokensandoutput_tokensfrom the LLM response - Cache normalization — adjusts input tokens to account for prompt caching discounts (see below)
- Cost engine resolves pricing: config override → built-in default → model-name fallback
- Applies model-level discount, then global discount
- Final cost is recorded on the span and deducted from the user’s budget
cost = (normalized_input / 1M × input_per_million + output_tokens / 1M × output_per_million) × (1 - model_discount) × (1 - global_discount)Local models (Ollama, vLLM, LM Studio) always return $0.00 — they run on your hardware.
Supported Providers
Section titled “Supported Providers”Candela routes to cloud models through Vertex AI (Google’s unified AI platform), which provides access to Google, Anthropic, and open-source models — all authenticated via Application Default Credentials (ADC). OpenAI models are called directly via API key.
Any model available in your Vertex AI Model Garden or via direct API is supported. Candela ships with built-in pricing for the most common models:
| Provider | Key Models | Auth |
|---|---|---|
| Google Gemini | Gemini 3.5 Pro/Flash, Gemini 3.1 Pro/Flash, Gemini 2.5 Pro/Flash, Gemini 2.0 Flash | ADC (Vertex AI) |
| Anthropic | Claude Sonnet, Claude Opus, Claude Haiku | ADC (Vertex AI) |
| OpenAI | GPT-4o, o3, o1 | API key |
| Local | Llama, Qwen, Mistral, DeepSeek — anything in Ollama | Free ($0.00) |
Custom Pricing Overrides
Section titled “Custom Pricing Overrides”Override pricing for negotiated rates or enterprise discounts in your config.yaml:
pricing: # Global discount applied to ALL models (0.0–1.0) discount_percent: 0.15 # 15% off all models
# Per-model overrides (takes priority over built-in defaults) models: - provider: google model: gemini-3.5-flash input_per_million: 0.40 # negotiated rate (list: $0.50) output_per_million: 2.40 # negotiated rate (list: $3.00) discount_percent: 0.0 # no additional discount
- provider: openai model: gpt-4o input_per_million: 2.00 output_per_million: 8.00 discount_percent: 0.10 # additional 10% model discountPricing Resolution Order
Section titled “Pricing Resolution Order”- Config override (exact
provider/modelmatch) - Built-in default (exact
provider/modelmatch) - Model-name fallback (provider-agnostic, alphabetical provider tie-break)
This means gemini-2.5-pro will match even if the request doesn’t specify a provider.
Unknown Models
Section titled “Unknown Models”If a cloud model has no pricing configured, Candela:
- Logs a warning (once per model, not per request)
- Records cost as
$0.00 - The request still succeeds — pricing gaps don’t block traffic
Runtime Pricing Updates
Section titled “Runtime Pricing Updates”Pricing can be updated at runtime via the Go API (useful for dynamic pricing from a config service):
calc.SetPricing(costcalc.ModelPricing{ Provider: "google", Model: "gemini-2.5-pro", InputPerMillion: 5.00, OutputPerMillion: 25.00,})
calc.SetGlobalDiscount(0.20) // 20% off everythingPrompt Cache Normalization
Section titled “Prompt Cache Normalization”When LLMs use prompt caching, providers charge less for cached tokens than fresh input. Candela automatically normalizes cached tokens to their cost-equivalent value before calculating the final price.
Cache Discount Rates
Section titled “Cache Discount Rates”| Provider | Cache Read | Cache Write | Source |
|---|---|---|---|
| Anthropic | 90% off (0.1×) | 25% surcharge (1.25×) | Anthropic pricing |
| Google Gemini 2.5+/3.x/3.5 | 90% off (0.1×) | No surcharge | Vertex AI pricing |
| Google Gemini 2.0 | 75% off (0.25×) | No surcharge | Vertex AI pricing |
| OpenAI | 50% off (0.5×) | No surcharge | OpenAI pricing |
How It Works
Section titled “How It Works”Each provider reports cache tokens differently. Candela handles both modes automatically:
- Inclusive (OpenAI, Google) —
input_tokensincludes cached tokens as a subset. Candela subtracts them and re-adds at the discounted rate. - Additive (Anthropic) —
input_tokensis only fresh tokens. Cache tokens are separate fields. Candela adds their discounted equivalents on top.
# OpenAI/Google (inclusive): input_tokens = 1000 (includes 800 cached)normalized = (1000 - 800) + (800 × 0.5) = 200 + 400 = 600
# Anthropic (additive): input_tokens = 200 (fresh only), cache_read = 800normalized = 200 + (800 × 0.1) = 200 + 80 = 280Custom Cache Discount Overrides
Section titled “Custom Cache Discount Overrides”Override cache rates per provider via the Go API (e.g., for negotiated enterprise pricing or new provider integrations):
calc.SetCacheDiscount("anthropic", costcalc.CacheDiscountConfig{ ReadDiscount: 0.1, // 90% off cache reads CreateMultiplier: 1.25, // 25% surcharge on cache writes InputIncludesCache: false, // Anthropic uses additive mode})Tiered Pricing
Section titled “Tiered Pricing”Some models have different rates based on input context length. Candela supports this automatically:
| Model | Threshold | Below | Above |
|---|---|---|---|
| Gemini 2.5 Pro | 200K tokens | $1.25 / $10.00 (in/out per M) | $2.50 / $15.00 |
Tiered pricing is also configurable per-model:
pricing: models: - provider: google model: gemini-2.5-pro input_per_million: 1.25 output_per_million: 10.00 input_per_million_high: 2.50 output_per_million_high: 15.00 tier_threshold_tokens: 200000