Operations Runbook
Day-to-day operations guide for running Candela in production on Google Cloud.
Health Checks
```bash
# Local
curl http://localhost:8181/healthz

# Production (requires auth)
curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  https://candela-xxx.a.run.app/healthz
```

Response:

```
{"status": "ok"}
{"status": "error", "detail": "..."}
```

Monitoring

Key Metrics

| Metric | Source | Alert Threshold |
|---|---|---|
| Request latency (p99) | Cloud Run metrics | > 5s |
| Error rate (5xx) | Cloud Run metrics | > 5% |
| Container startup time | Cloud Run metrics | > 30s |
| BigQuery write errors | Application logs | Any |
| Auth failures | "all auth strategies failed" | > 10/min |
| Circuit breaker trips | "circuit breaker tripped" | Any |
| Budget thresholds | "🔔 budget alert" | At 80%, 90%, 100% |
| Span buffer full | "span processor buffer full" | Any |
| Tetragon audit stream | "tetragon audit stream" | Disconnected |
| gRPC audit sink errors | "audit sink write failed" | Any |
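
The budget row above fires at 80%, 90%, and 100% of a limit. As a rough illustration of what "crossing" a threshold means, the sketch below reports which thresholds a single spend update moves past. `crossedThresholds` and its parameters are hypothetical names chosen for this example; the actual Candela logic may differ.

```go
package main

import "fmt"

// alertThresholds are the budget fractions listed in the Key Metrics table.
var alertThresholds = []float64{0.80, 0.90, 1.00}

// crossedThresholds returns the thresholds that a spend update moves past:
// those above the previous usage fraction and at or below the new one.
// Hypothetical helper, not taken from the Candela source.
func crossedThresholds(prevSpend, newSpend, limit float64) []float64 {
	var crossed []float64
	if limit <= 0 {
		return crossed // no limit configured, nothing to alert on
	}
	prev, cur := prevSpend/limit, newSpend/limit
	for _, t := range alertThresholds {
		if prev < t && cur >= t {
			crossed = append(crossed, t)
		}
	}
	return crossed
}

func main() {
	// A $100 budget moving from $75 to $92 spent crosses 80% and 90%.
	for _, t := range crossedThresholds(75, 92, 100) {
		fmt.Printf("budget alert: %.0f%% of limit reached\n", t*100)
	}
}
```

Reporting only newly crossed thresholds means each alert appears once per period, which keeps an "Any"-style log alert from firing on every request after the first crossing.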
Log-Based Alerts
```bash
# Budget threshold alert
gcloud logging metrics create candela-budget-alert \
  --description="Candela budget threshold reached" \
  --log-filter='resource.type="cloud_run_revision" AND textPayload=~"budget alert"'

# Circuit breaker alert
gcloud logging metrics create candela-circuit-breaker \
  --description="Candela circuit breaker tripped" \
  --log-filter='resource.type="cloud_run_revision" AND textPayload=~"circuit breaker tripped"'
```

Structured Log Fields

Candela uses `slog` with JSON output:

| Field | Description |
|---|---|
| `provider` | LLM provider name |
| `model` | Model name |
| `tokens` | Total token count |
| `cost_usd` | Calculated cost |
| `latency` | Request duration |
| `user_id` | Authenticated user |
| `request_id` | Unique request ID |
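
For orientation, here is a minimal sketch of how a record with these fields could be emitted using Go's standard `log/slog` JSON handler (Go 1.21+). The field names come from the table; the message text, the values, and the Go types chosen for each field are assumptions, and `logRequest` and `sampleLogLine` are hypothetical helpers rather than Candela code.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log/slog"
	"time"
)

// logRequest writes one JSON log line with the fields from the table above.
// The message text and all values are invented for illustration.
func logRequest(w io.Writer) {
	logger := slog.New(slog.NewJSONHandler(w, nil))
	logger.Info("request completed",
		slog.String("provider", "google"),
		slog.String("model", "gemini-3.5-flash"),
		slog.Int("tokens", 1234),
		slog.Float64("cost_usd", 0.0021),
		slog.Duration("latency", 850*time.Millisecond),
		slog.String("user_id", "user-123"),
		slog.String("request_id", "req-abc"),
	)
}

// sampleLogLine captures the line in memory so it can be inspected.
func sampleLogLine() string {
	var buf bytes.Buffer
	logRequest(&buf)
	return buf.String()
}

func main() {
	fmt.Print(sampleLogLine())
}
```

The handler adds `time`, `level`, and `msg` keys alongside the custom fields, so log filters can target either group.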
Deployment
Manual Deploy to Cloud Run

```bash
PROJECT=your-gcp-project
REGION=us-central1

# Build and push
gcloud builds submit --project $PROJECT --config deploy/cloudbuild.yaml .

# Deploy
gcloud run services update candela \
  --project $PROJECT --region $REGION \
  --image $REGION-docker.pkg.dev/$PROJECT/candela/candela-server:latest
```

Rolling Back

```bash
# List revisions
gcloud run revisions list --project $PROJECT --region $REGION --service candela

# Route 100% traffic to a previous revision
gcloud run services update-traffic candela \
  --project $PROJECT --region $REGION \
  --to-revisions=candela-00042-abc=100
```

BigQuery Operations
Cost Queries

```sql
-- Total cost by user, last 7 days
SELECT
  user_id,
  SUM(gen_ai_cost_usd) as total_cost,
  COUNT(*) as call_count,
  SUM(gen_ai_total_tokens) as total_tokens
FROM `candela.spans`
WHERE start_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY user_id
ORDER BY total_cost DESC
```

Cost Optimization

| Optimization | Impact | Status |
|---|---|---|
| Time partitioning (`start_time`, DAY) | ~70% scan cost reduction | ✅ Configured |
| Clustering (`project_id`, `trace_id`) | ~50% for filtered queries | ✅ Configured |
| Partition expiration | Storage savings | Set in Terraform |
| BI Engine reservation | Sub-second dashboards | Enable in BQ console |
Incident Response
Backend Not Starting

- Check Cloud Run logs: `gcloud run services logs read candela --project $PROJECT --region $REGION`
- Common causes:
  - Missing env vars → check `entrypoint.sh` substitution
  - Firestore connection failed → check project ID and IAM
  - BigQuery auth failed → check service account roles
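
Missing environment variables are easier to diagnose when the process names them at startup. The sketch below shows one way such a pre-flight check could look. `missingEnv` is a hypothetical helper, and the variable names in `main` are placeholders, not Candela's actual configuration keys.

```go
package main

import (
	"fmt"
	"os"
)

// missingEnv returns the names in required that lookup reports as unset or empty.
// Passing the lookup function in keeps the check easy to exercise without
// touching the real process environment.
func missingEnv(required []string, lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range required {
		if v, ok := lookup(name); !ok || v == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Placeholder names; substitute the variables your deployment needs.
	required := []string{"EXAMPLE_PROJECT_ID", "EXAMPLE_DATASET"}
	if missing := missingEnv(required, os.LookupEnv); len(missing) > 0 {
		fmt.Println("missing env vars:", missing)
		return
	}
	fmt.Println("all required env vars are set")
}
```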
High Latency
- Filter logs by `provider` to identify a slow upstream
- Check circuit breaker state in logs
- Check BigQuery slot usage (if using BQ as reader)
- Check Cloud Run instance count (may need `min-instances > 0`)
Budget Not Enforcing
- Check the Firestore `budgets/{userId}` document
- Verify `period_start` is in the current period
- Inspect the `grants/` subcollection for grant absorption
- Search logs for `"failed to deduct spend"`
Proxy Returns 502
- Check upstream provider status (OpenAI, Vertex AI, Anthropic)
- Look for `"circuit breaker tripped"` logs
- Check ADC token refresh: `"failed to get ADC token"`
- Verify `vertex_ai.project_id` and region in config
Tetragon Audit Pipeline
- Verify Tetragon is running: `kubectl get pods -n kube-system -l app.kubernetes.io/name=tetragon`
- Check gRPC audit stream connection: search logs for `"tetragon audit stream"`
- Inspect `MultiSink` routing: each audit event should fan out to all configured sinks
- If events are missing, check `CloseSend()` / graceful shutdown logs for premature stream termination
- Verify `TracingPolicy` is applied: `kubectl get tracingpolicies`
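
To make the fan-out expectation concrete, here is a minimal sketch of a multi-sink writer that delivers every event to every sink and keeps going when one of them fails. The `AuditEvent`, `Sink`, and `MultiSink` definitions below are hypothetical stand-ins written for this example; Candela's real types and error handling may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// AuditEvent and Sink are hypothetical stand-ins for the real audit types.
type AuditEvent struct {
	Process string
	Action  string
}

type Sink interface {
	Write(ev AuditEvent) error
}

// MultiSink forwards each event to every configured sink.
type MultiSink struct {
	sinks []Sink
}

// Write delivers ev to all sinks, continuing past failures, and returns the
// combined errors (nil when every sink succeeded).
func (m MultiSink) Write(ev AuditEvent) error {
	var errs []error
	for _, s := range m.sinks {
		if err := s.Write(ev); err != nil {
			errs = append(errs, fmt.Errorf("audit sink write failed: %w", err))
		}
	}
	return errors.Join(errs...)
}

// memorySink records events in memory.
type memorySink struct{ events []AuditEvent }

func (s *memorySink) Write(ev AuditEvent) error {
	s.events = append(s.events, ev)
	return nil
}

// failingSink always returns an error, simulating an unavailable backend.
type failingSink struct{}

func (failingSink) Write(AuditEvent) error { return errors.New("sink unavailable") }

func main() {
	a, b := &memorySink{}, &memorySink{}
	m := MultiSink{sinks: []Sink{a, failingSink{}, b}}
	err := m.Write(AuditEvent{Process: "curl", Action: "exec"})
	fmt.Println(len(a.events), len(b.events), err)
}
```

With this shape, a missing event in one sink but not the others points at that sink rather than at the stream, which is the distinction the checklist above is trying to draw.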
Maintenance
Updating Model Pricing & Adding New Models

To add new models (like Gemini 3.5 Flash) or update built-in model pricing, you have two options:
Option A: Update Code Defaults (Requires Build & Redeploy)
This is the recommended long-term approach for adding new models, because the proxy then ships with correct built-in default rates.

- Modify Defaults: Update the list of models in `pkg/costcalc/calculator.go` within `loadDefaults()`.
- Write Tests: Add test cases checking the pricing calculation logic in `pkg/costcalc/calculator_test.go`.
- Run Tests: Verify correctness locally:
  ```bash
  go test ./pkg/costcalc -v
  ```
- Build and Redeploy: Run the build pipeline and redeploy to Google Cloud Run:
  ```bash
  # Build the container image
  gcloud builds submit --project $PROJECT --config deploy/cloudbuild.yaml .
  # Redeploy the Cloud Run service to apply the update
  gcloud run services update candela \
    --project $PROJECT --region $REGION \
    --image $REGION-docker.pkg.dev/$PROJECT/candela/candela-server:latest
  ```
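
As a rough guide to what a pricing test might assert, the sketch below computes cost from per-million-token rates and checks it against hand-calculated values in a table-driven loop. `ModelPrice` and `Cost` are hypothetical and assume simple linear pricing; the real types and function names in `pkg/costcalc` may differ.

```go
package main

import "fmt"

// ModelPrice holds USD rates per one million tokens. Hypothetical type; the
// real definitions live in pkg/costcalc and may differ.
type ModelPrice struct {
	InputPerMillion  float64
	OutputPerMillion float64
}

// Cost returns the USD cost of one call, assuming linear per-token pricing.
func (p ModelPrice) Cost(inputTokens, outputTokens int) float64 {
	return float64(inputTokens)/1e6*p.InputPerMillion +
		float64(outputTokens)/1e6*p.OutputPerMillion
}

func main() {
	price := ModelPrice{InputPerMillion: 0.50, OutputPerMillion: 3.00}
	cases := []struct {
		name          string
		input, output int
		want          float64
	}{
		{"no tokens", 0, 0, 0},
		{"one million input tokens", 1_000_000, 0, 0.50},
		{"mixed usage", 200_000, 50_000, 0.25},
	}
	for _, c := range cases {
		got := price.Cost(c.input, c.output)
		// Compare with a tolerance because float arithmetic is not exact.
		if diff := got - c.want; diff > 1e-9 || diff < -1e-9 {
			fmt.Printf("FAIL %s: got %.6f, want %.6f\n", c.name, got, c.want)
			continue
		}
		fmt.Printf("ok   %s: $%.4f\n", c.name, got)
	}
}
```

In a real `_test.go` file the same table would sit inside a `func TestXxx(t *testing.T)` and call `t.Errorf` instead of printing.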
Option B: Configure Runtime Overrides (No Code Changes Required)
You can override model pricing, or add temporary support for a new model, without code changes or a rebuild by modifying your active configuration:

- Config File (`config.yaml`): Add per-model overrides under the `pricing.models` block:
  ```yaml
  pricing:
    models:
      - provider: google
        model: gemini-3.5-flash
        input_per_million: 0.40   # Negotiated rate (List: $0.50)
        output_per_million: 2.40  # Negotiated rate (List: $3.00)
  ```
  Note: If you update `config.yaml` for a deployed service, redeploy or restart the Cloud Run service to load the new config.
- Runtime Configuration Endpoint: Update configuration parameters and pricing overrides at runtime, without restarting the service:
  ```bash
  curl -X POST http://localhost:8181/_local/api/config \
    -H "Content-Type: application/json" \
    -d '{"pricing": {"models": [{"provider": "google", "model": "gemini-3.5-flash", "input_per_million": 0.40, "output_per_million": 2.40}]}}'
  ```
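
Hand-written JSON bodies are easy to get wrong once several models are involved. The sketch below builds the same request body from typed values using `encoding/json`. The JSON keys are taken from the examples above; the Go type names and `buildPricingPatch` are hypothetical and exist only for this illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// modelOverride mirrors one entry of pricing.models in the examples above.
type modelOverride struct {
	Provider         string  `json:"provider"`
	Model            string  `json:"model"`
	InputPerMillion  float64 `json:"input_per_million"`
	OutputPerMillion float64 `json:"output_per_million"`
}

type pricingBlock struct {
	Models []modelOverride `json:"models"`
}

type configPatch struct {
	Pricing pricingBlock `json:"pricing"`
}

// buildPricingPatch returns the JSON request body for a set of overrides.
func buildPricingPatch(models ...modelOverride) (string, error) {
	body, err := json.Marshal(configPatch{Pricing: pricingBlock{Models: models}})
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	body, err := buildPricingPatch(modelOverride{
		Provider:         "google",
		Model:            "gemini-3.5-flash",
		InputPerMillion:  0.40,
		OutputPerMillion: 2.40,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```

The printed body can be piped to the endpoint with `curl -X POST ... -H "Content-Type: application/json" -d @-`, which reads the request data from standard input.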
Database Migrations
All backends auto-provision their schema on startup:

| Backend | Strategy | Notes |
|---|---|---|
| DuckDB | Auto CREATE TABLE | No manual migrations |
| SQLite | Auto CREATE TABLE | No manual migrations |
| BigQuery | Auto schema update | Column additions are backward-compatible |
| Firestore | Schema-less | Field additions are backward-compatible |
Related
- Deployment Architecture: Production topology
- Storage & CQRS: Backend configuration
- Security: Authentication and authorization