LLM Caching: Save 40% on OpenAI and Anthropic Costs
Caching is the easiest way to cut LLM costs. If you're making the same API calls twice, you're wasting money. Here's how to implement effective caching.
Why Cache LLM Responses?
- Average savings: 20-40% on API costs
- Bonus: 10x faster response times
- Implementation time: ~30 minutes
Real Example
Without caching:
- 10,000 requests/day
- 30% are duplicates
- Cost: $300/month
With caching:
- 7,000 unique requests
- 3,000 served from cache
- Cost: $210/month
- Savings: $90/month ($1,080/year)
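The arithmetic behind these numbers, as a quick sanity check (the per-request cost is implied by the $300 baseline, not stated in any pricing sheet):

const requestsPerDay = 10_000;
const days = 30;
const duplicateRate = 0.30;
const costPerRequest = 300 / (requestsPerDay * days); // $0.001, implied by the $300 baseline

const baseline = requestsPerDay * days * costPerRequest;                        // $300
const withCache = requestsPerDay * days * (1 - duplicateRate) * costPerRequest; // $210
console.log(`Monthly savings: $${baseline - withCache}`); // $90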
Types of LLM Caching
1. Exact Match Caching
Cache identical prompts and return the same response.
Best for:
- FAQ systems
- Product descriptions
- Common queries
- Static content generation
Implementation:
import OpenAI from 'openai';
import { createClient } from 'redis';
import crypto from 'crypto';

const openai = new OpenAI();
const redis = createClient();
await redis.connect();

async function getCachedResponse(prompt) {
  // Hash the prompt into a stable, compact cache key
  // (MD5 is fine here: it's a cache key, not a security boundary)
  const key = crypto.createHash('md5').update(prompt).digest('hex');
  const cached = await redis.get(`llm:${key}`);
  if (cached) {
    return JSON.parse(cached);
  }
  // Cache miss: call the LLM
  const response = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages: [{ role: 'user', content: prompt }]
  });
  // Cache for 1 hour
  await redis.setEx(`llm:${key}`, 3600, JSON.stringify(response));
  return response;
}
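Callers then swap a direct API call for the wrapper:

const answer = await getCachedResponse('What is your refund policy?');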
Cache hit rate: 15-30% for typical applications
2. Semantic Caching
Cache similar prompts, not just exact matches.
Best for:
- Customer support (similar questions)
- Search queries
- Paraphrased requests
Example:
- "How do I reset my password?"
- "What's the process for password reset?"
- "I forgot my password, help"
All three get the same cached response.
Implementation:
import { CostLens } from 'costlens';

const client = new CostLens({
  apiKey: process.env.COSTLENS_API_KEY,
  cache: {
    type: 'semantic',
    similarity: 0.95, // 95% similarity threshold
    ttl: 3600 // seconds
  }
});

// Automatically uses semantic caching
const response = await client.chat.completions.create({
  model: 'gpt-3.5-turbo',
  messages: [{ role: 'user', content: prompt }]
});
Cache hit rate: 40-60% for typical applications
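If you'd rather build semantic matching yourself, here is a minimal sketch using OpenAI embeddings, cosine similarity, and an in-memory store (the store and helper names are illustrative, not from any library):

import OpenAI from 'openai';

const openai = new OpenAI();
const store = []; // illustrative in-memory store of { embedding, response }

// Cosine similarity between two embedding vectors
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function embed(text) {
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text
  });
  return data[0].embedding;
}

// Return a cached response if any stored prompt is similar enough
async function semanticGet(prompt, threshold = 0.95) {
  const embedding = await embed(prompt);
  for (const entry of store) {
    if (cosine(embedding, entry.embedding) >= threshold) {
      return entry.response;
    }
  }
  return null; // miss: caller calls the LLM, then semanticSet(...)
}

async function semanticSet(prompt, response) {
  store.push({ embedding: await embed(prompt), response });
}

A production version would replace the linear scan with a vector index, but the matching logic is the same.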
3. Prompt Caching (Anthropic)
Anthropic's native prompt caching for long contexts.
Best for:
- Long documents (>10K tokens)
- Repeated system prompts
- RAG applications
Pricing:
- Cache writes: 25% surcharge on input tokens
- Cache reads: 90% discount (cached tokens cost 10% of the base input price)
- Break-even: a single reuse already more than covers the write surcharge
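A back-of-envelope check on those multipliers (costs are relative to the base input-token price):

// Cost multipliers relative to the base input-token price
const writeCost = 1.25; // first call pays the cache-write surcharge
const readCost = 0.10;  // later calls read the cached prefix

const cachedTotal = (n) => writeCost + (n - 1) * readCost; // n calls with caching
const uncachedTotal = (n) => n * 1.0;                      // n calls without

console.log(cachedTotal(2), uncachedTotal(2)); // 1.35 vs 2.00: ahead after one reuse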
Manual Implementation Challenges (illustrated in the sketch below):
- Requires adding cache control markers to every prompt
- Must manage cache ordering (cached content must be first)
- Need to track cache hit rates manually
- Complex to calculate actual ROI
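For a sense of what the manual route involves, here is roughly what one cached call looks like with Anthropic's SDK (the model name is just an example):

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

const response = await anthropic.messages.create({
  model: 'claude-3-5-sonnet-latest', // example model
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: 'Long system prompt here...',
      // Mark this prefix as cacheable; later calls with the same prefix hit the cache
      cache_control: { type: 'ephemeral' }
    }
  ],
  messages: [{ role: 'user', content: 'Question' }]
});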
The Easier Way:
CostLens handles Anthropic's prompt caching automatically without manual configuration:
import { CostLens } from 'costlens';

const client = new CostLens({
  apiKey: process.env.COSTLENS_API_KEY,
  enableCache: true,
});

// Automatic caching - no cache_control markers needed
const response = await client.chat({
  messages: [
    { role: 'system', content: 'Long system prompt here...' },
    { role: 'user', content: 'Question' }
  ]
});
Savings: 30-50% for long-context applications with zero configuration
Cache Strategy by Use Case
Customer Support Chatbot
- Strategy: Semantic caching
- TTL: 24 hours
- Expected hit rate: 50%
- Savings: 40%
Content Generation
- Strategy: Exact match caching
- TTL: 7 days
- Expected hit rate: 20%
- Savings: 15%
Document Analysis
- Strategy: Prompt caching (Anthropic)
- TTL: 5 minutes
- Expected hit rate: 80%
- Savings: 50%
Real-time Chat
- Strategy: No caching (responses must be unique)
- Alternative: Cache common responses only
Cache Invalidation Strategies
Time-based (TTL)
// Short TTL for dynamic content
await redis.setEx(key, 300, value); // 5 minutes

// Long TTL for static content
await redis.setEx(key, 86400, value); // 24 hours
Event-based
// Invalidate when the underlying data changes
async function updateProduct(id, data) {
  await db.products.update(id, data);
  // Redis DEL does not expand wildcards; scan for matching keys instead
  for await (const key of redis.scanIterator({ MATCH: `product:${id}:*` })) {
    await redis.del(key);
  }
}
LRU (Least Recently Used)
// Let Redis evict the least recently used keys when memory is full
await redis.configSet('maxmemory-policy', 'allkeys-lru');
Common Caching Mistakes
❌ Mistake 1: Caching Everything
Not all responses should be cached; a simple guard is sketched after this list. Don't cache:
- Personalized responses
- Time-sensitive data
- User-specific content
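A simple guard, assuming your app can flag these cases upstream (the request fields here are hypothetical):

// Hypothetical request flags - adapt to however your app marks these cases
function shouldCache(request) {
  if (request.userId) return false;        // personalized / user-specific content
  if (request.timeSensitive) return false; // news, prices, availability
  return true;
}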
❌ Mistake 2: TTL Too Long
Stale data = bad user experience. Balance freshness vs cost.
❌ Mistake 3: No Cache Warming
A cold cache means every request after a deploy or restart is a miss. Pre-populate the cache with common queries during off-peak hours.
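A minimal warm-up job, reusing the getCachedResponse helper from earlier (the query list is illustrative):

// Warm-up job for off-peak hours (e.g. a nightly cron task)
const commonQueries = [
  'How do I reset my password?',
  'What are your support hours?'
];

async function warmCache() {
  for (const prompt of commonQueries) {
    await getCachedResponse(prompt); // misses populate the cache
  }
}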
❌ Mistake 4: Ignoring Cache Size
Monitor cache memory usage. Set limits to prevent OOM errors.
Advanced: Multi-tier Caching
// L1: In-memory (fastest)
const memoryCache = new Map();

// L2: Redis (fast)
const redis = createClient();
await redis.connect();

// L3: LLM API (slowest, most expensive)
async function getResponse(prompt) {
  // Check L1
  if (memoryCache.has(prompt)) {
    return memoryCache.get(prompt);
  }
  // Check L2
  const cached = await redis.get(prompt);
  if (cached) {
    memoryCache.set(prompt, cached); // Promote to L1
    return cached;
  }
  // Call LLM (your client of choice)
  const response = await llm.complete(prompt);
  // Store in both caches
  memoryCache.set(prompt, response);
  await redis.setEx(prompt, 3600, response);
  return response;
}
Performance:
- L1 hit: <1ms
- L2 hit: 5-10ms
- L3 miss: 500-2000ms
Monitoring Cache Performance
Track these metrics:
const metrics = {
  hits: 0,   // increment on every cache hit
  misses: 0, // increment on every cache miss
  hitRate: () =>
    metrics.hits + metrics.misses === 0
      ? 0
      : metrics.hits / (metrics.hits + metrics.misses),
  // averageCostPerRequest: your measured mean cost per LLM call
  savings: () => metrics.hits * averageCostPerRequest
};

// Log every hour
setInterval(() => {
  console.log(`Cache hit rate: ${(metrics.hitRate() * 100).toFixed(1)}%`);
  console.log(`Estimated savings: $${metrics.savings().toFixed(2)}`);
}, 3600000);
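To populate those counters, increment them wherever the cache is consulted; for example, a small wrapper around the Redis lookup (assuming the redis client and metrics object from the earlier snippets):

// Count hits and misses around any cache lookup
async function instrumentedGet(key) {
  const cached = await redis.get(key);
  if (cached !== null) {
    metrics.hits++;
  } else {
    metrics.misses++;
  }
  return cached;
}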
CostLens: Automatic Caching
Don't want to build this yourself? CostLens handles it automatically:
import { CostLens } from 'costlens';

const client = new CostLens({
  apiKey: process.env.COSTLENS_API_KEY,
  cache: true // That's it!
});
Features:
- ✅ Semantic caching
- ✅ Automatic TTL optimization
- ✅ Multi-tier caching
- ✅ Cache analytics dashboard
- ✅ Works with OpenAI, Anthropic, and more
Conclusion
Caching is the lowest-hanging fruit for LLM cost optimization:
- Easy to implement (30 minutes)
- Immediate savings (20-40%)
- Better performance (10x faster)
- No quality trade-offs
Start with exact match caching, then add semantic caching for bigger savings.
Want automatic caching? Try CostLens free - caching included out of the box.