A complete guide to LLM response caching. Learn exact-match caching, semantic caching, prompt caching, and implementation strategies that can reduce AI API costs by 20-40%.

Caching is the easiest way to cut LLM costs. If you're making the same API calls twice, you're wasting money. Here's how to implement effective caching.
Average savings: 20-40% on API costs
Bonus: 10x faster response times
Implementation time: 30 minutes
Without caching:
With caching:
Cache identical prompts and return the same response.
Best for:
Implementation:
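import { createClient } from 'redis';
import crypto from 'crypto';
import OpenAI from 'openai';

const redis = createClient();
await redis.connect(); // node-redis v4+ needs an explicit connection
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function getCachedResponse(prompt) {
  // Hash the prompt so keys have a fixed length regardless of prompt size
  const key = crypto.createHash('md5').update(prompt).digest('hex');
  const cached = await redis.get(`llm:${key}`);
  if (cached) {
    return JSON.parse(cached);
  }

  // Cache miss: call the LLM
  const response = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages: [{ role: 'user', content: prompt }]
  });

  // Cache for 1 hour
  await redis.setEx(`llm:${key}`, 3600, JSON.stringify(response));
  return response;
}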
Cache hit rate: 15-30% for typical applications
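The function above hashes only the prompt text. If the same prompt is sent with a different model or temperature, it would reuse the wrong cached answer, and prompts that differ only in whitespace would miss. A minimal sketch of a canonical key input follows. The helper name `cacheKeyInput`, the request shape, and the default temperature of 1 are assumptions for illustration, not part of any library.

```javascript
// Hypothetical helper: builds a canonical string to hash as the cache key.
// Assumption: requests look like { model, temperature, messages }.
function cacheKeyInput({ model, temperature = 1, messages }) {
  const normalized = messages.map(({ role, content }) => ({
    role,
    // Collapse runs of whitespace so trivially different prompts share a key.
    content: content.trim().replace(/\s+/g, ' '),
  }));
  // Building the object with a fixed property order keeps the output stable.
  return JSON.stringify({ model, temperature, messages: normalized });
}
```

Pass the returned string to `crypto.createHash(...).update(...)` in place of the raw prompt.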
Cache similar prompts, not just exact matches.
Best for:
Example:
All three get the same cached response.
Implementation:
import { CostLens } from 'costlens';

const client = new CostLens({
  apiKey: process.env.OPTIRELAY_API_KEY,
  cache: {
    type: 'semantic',
    similarity: 0.95, // 95% similarity threshold
    ttl: 3600
  }
});

// Automatically uses semantic caching
const response = await client.chat.completions.create({
  model: 'gpt-3.5-turbo',
  messages: [{ role: 'user', content: prompt }]
});
Cache hit rate: 40-60% for typical applications
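To make the idea concrete, here is a minimal sketch of what a semantic cache does under the hood: embed each prompt as a vector, then treat any stored prompt whose cosine similarity meets the threshold as a hit. This is an illustration only, not how CostLens is implemented. The `SemanticCache` class is hypothetical, and `embed` is an assumed async function that maps text to a numeric vector (for example, a call to an embeddings API).

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical in-memory semantic cache.
class SemanticCache {
  constructor(embed, threshold = 0.95) {
    this.embed = embed;         // assumed: async (text) => number[]
    this.threshold = threshold; // minimum similarity that counts as a hit
    this.entries = [];          // { vector, response }
  }

  // Returns the response of the most similar stored prompt, or null on a miss.
  async get(prompt) {
    const vector = await this.embed(prompt);
    let best = null;
    let bestScore = this.threshold;
    for (const entry of this.entries) {
      const score = cosineSimilarity(vector, entry.vector);
      if (score >= bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best ? best.response : null;
  }

  async set(prompt, response) {
    this.entries.push({ vector: await this.embed(prompt), response });
  }
}
```

The linear scan is fine for a few thousand entries. Larger caches usually need a vector index instead.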
Anthropic's native prompt caching for long contexts.
Best for:
Pricing:
Manual Implementation Challenges:
The Easier Way:
CostLens handles Anthropic's prompt caching automatically without manual configuration:
import { CostLens } from 'costlens';

const client = new CostLens({
  apiKey: process.env.COSTLENS_API_KEY,
  enableCache: true,
});

// Automatic caching - no cache_control markers needed
const response = await client.chat({
  messages: [
    { role: 'system', content: 'Long system prompt here...' },
    { role: 'user', content: 'Question' }
  ]
});
Savings: 30-50% for long-context applications with zero configuration
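For comparison, this is roughly what the manual `cache_control` markers look like when you call Anthropic's Messages API yourself. The sketch only builds the request body. The helper name `buildCachedRequest` and its parameters are assumptions for illustration, and the model name is left to the caller.

```javascript
// Hypothetical helper: builds a Messages API request body with the long
// system prompt marked as cacheable.
function buildCachedRequest({ model, maxTokens, systemPrompt, question }) {
  return {
    model,
    max_tokens: maxTokens,
    system: [
      {
        type: 'text',
        text: systemPrompt,
        // Marks the prompt prefix ending at this block as cacheable.
        cache_control: { type: 'ephemeral' },
      },
    ],
    messages: [{ role: 'user', content: question }],
  };
}
```

The returned object can be passed to `messages.create(...)` in Anthropic's official SDK. Only the prefix up to the marker is cached, so keep the stable content first and the per-request question last.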
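// Short TTL for dynamic content
await cache.setEx(key, 300, value); // 5 minutes
// Long TTL for static content
await cache.setEx(key, 86400, value); // 24 hours
// Invalidate when data changes
async function updateProduct(id, data) {
  await db.products.update(id, data);
  // DEL does not expand wildcards, so look up the matching keys first.
  // KEYS blocks the server while it scans; prefer SCAN on large datasets.
  const keys = await cache.keys(`product:${id}:*`);
  if (keys.length > 0) {
    await cache.del(keys); // Clear all cached responses for this product
  }
}
// Once a maxmemory limit is set, Redis evicts the least recently used keys
await redis.configSet('maxmemory-policy', 'allkeys-lru');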
Not all responses should be cached. Don't cache:
Stale data = bad user experience. Balance freshness vs cost.
Pre-populate cache with common queries during off-peak hours.
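A warm-up job can be as simple as looping over your most common queries and filling in whatever is missing. The sketch below assumes a `cache` object with async `get` and `set` methods and a `fetchResponse` function that calls the LLM. All three names are hypothetical.

```javascript
// Hypothetical warm-up routine: fills the cache for queries not yet stored.
// Returns the number of entries it had to fetch.
async function warmCache(queries, cache, fetchResponse) {
  let warmed = 0;
  for (const query of queries) {
    // Skip entries that are already cached so warm-up never pays twice.
    if ((await cache.get(query)) != null) continue;
    await cache.set(query, await fetchResponse(query));
    warmed++;
  }
  return warmed;
}
```

Running the queries one at a time keeps the job well under API rate limits, at the cost of a longer warm-up.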
Monitor cache memory usage. Set limits to prevent OOM errors.
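For an in-process cache, one way to set such a limit is to cap the number of entries and evict the least recently used one when the cap is exceeded. A minimal sketch, assuming a hypothetical `LruCache` class built on `Map` (which iterates keys in insertion order):

```javascript
// Minimal LRU cache with a fixed maximum number of entries.
class LruCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.map = new Map();
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert to mark the key as most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      this.map.delete(this.map.keys().next().value);
    }
  }
}
```

Capping by entry count is only a proxy for memory. If responses vary a lot in size, track an approximate byte total instead.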
// L1: In-memory (fastest)
const memoryCache = new Map();
// L2: Redis (fast)
const redis = createClient();
await redis.connect();
// L3: LLM API (slowest, most expensive)
async function getResponse(prompt) {
  // Check L1
  if (memoryCache.has(prompt)) {
    return memoryCache.get(prompt);
  }
  // Check L2
  const cached = await redis.get(prompt);
  if (cached) {
    memoryCache.set(prompt, cached); // Promote to L1
    return cached;
  }
  // Call LLM (llm.complete is a placeholder that returns the response as a string)
  const response = await llm.complete(prompt);
  // Store in both caches
  memoryCache.set(prompt, response);
  await redis.setEx(prompt, 3600, response);
  return response;
}
Performance:
Track these metrics:
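const averageCostPerRequest = 0.002; // USD placeholder; replace with your own measured average
const metrics = {
  hits: 0,
  misses: 0,
  // Guard against division by zero before any request has been recorded
  hitRate: () => {
    const total = metrics.hits + metrics.misses;
    return total === 0 ? 0 : metrics.hits / total;
  },
  savings: () => metrics.hits * averageCostPerRequest
};
// Log every hour
setInterval(() => {
  console.log(`Cache hit rate: ${(metrics.hitRate() * 100).toFixed(1)}%`);
  console.log(`Estimated savings: $${metrics.savings().toFixed(2)}`);
}, 3600000);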
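Counters like these are only useful if every lookup updates them. One way to guarantee that is to route all lookups through a single wrapper. The sketch below assumes a `cache` object with async `get` and `set` methods and a `fetchResponse` function that calls the LLM. The names, including `createTrackedLookup`, are hypothetical.

```javascript
// Hypothetical factory: wraps cache lookups and counts hits and misses.
function createTrackedLookup(cache, fetchResponse) {
  const stats = { hits: 0, misses: 0 };

  async function lookup(prompt) {
    const cached = await cache.get(prompt);
    if (cached != null) {
      stats.hits++;
      return cached;
    }
    // Miss: pay for the request once, then store the result.
    stats.misses++;
    const response = await fetchResponse(prompt);
    await cache.set(prompt, response);
    return response;
  }

  return { lookup, stats };
}
```

Because the counting happens in one place, the hit rate cannot drift from what the cache actually served.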
Don't want to build this yourself? CostLens handles it automatically:
import { CostLens } from 'costlens';

const client = new CostLens({
  apiKey: process.env.OPTIRELAY_API_KEY,
  cache: true // That's it!
});
Features:
Caching is the lowest-hanging fruit for LLM cost optimization:
Start with exact match caching, then add semantic caching for bigger savings.
Want automatic caching? Try CostLens free - caching included out of the box.