RAG systems have hidden costs most developers miss. Here's a transparent breakdown of what you're actually paying for.
Retrieval-Augmented Generation (RAG) has become the standard architecture for building AI applications with custom knowledge. But the cost structure is more complex than simple LLM calls.
Let me break down exactly where your money goes—with real numbers from production systems.
A typical RAG request involves multiple billable operations:
1. Embedding Generation
2. Vector Database Query
3. LLM Inference
4. (Optional) Reranking
Each layer has its own pricing model. Let's examine them.
Every document you index and every query you process requires embeddings.
OpenAI text-embedding-3-small:
OpenAI text-embedding-3-large:
Example calculation:
Key insight: Embedding costs are usually negligible compared to LLM inference. Don't over-optimize here.
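To see why embedding spend rarely matters, here is a minimal sketch of the arithmetic. The per-million-token prices are OpenAI's list prices as I understand them at the time of writing, so verify them before budgeting. The 50-tokens-per-query figure is an assumption chosen for illustration.

```typescript
// USD per 1M tokens. Assumed list prices; check the current pricing page.
const EMBEDDING_PRICE_PER_MTOK = {
  "text-embedding-3-small": 0.02,
  "text-embedding-3-large": 0.13,
} as const;

type EmbeddingModel = keyof typeof EMBEDDING_PRICE_PER_MTOK;

// Embedding cost in USD for a given number of tokens.
function embeddingCost(model: EmbeddingModel, tokens: number): number {
  return (tokens / 1e6) * EMBEDDING_PRICE_PER_MTOK[model];
}

// Assumption: 100K queries per month at roughly 50 tokens each.
const monthlyQueryTokens = 100000 * 50;
const smallCost = embeddingCost("text-embedding-3-small", monthlyQueryTokens);
const largeCost = embeddingCost("text-embedding-3-large", monthlyQueryTokens);

console.log(`small: $${smallCost.toFixed(2)}/month, large: $${largeCost.toFixed(2)}/month`);
```

Even with the larger model, query embeddings at this volume stay well under a dollar a month.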
Pinecone (Serverless):
Typical usage (100K queries/month):
Weaviate Cloud:
Self-hosted (AWS):
Key insight: Vector DB costs are predictable and relatively low. Storage and queries are cheap—compute for reranking is where costs can spike.
This is where 90%+ of your RAG costs come from.
Typical RAG prompt structure:
Output: ~500 tokens average
Cost per query (Claude 3.5 Sonnet):
At 100K queries/month: $1,440
Cost per query (GPT-4o):
At 100K queries/month: $1,900
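The monthly totals above can be reproduced with a short calculation. This is a sketch under stated assumptions: the 2,300 input tokens per query is back-derived from the $1,440 figure rather than quoted from anywhere, and the prices are the list prices that match these totals ($3/$15 per million tokens for Claude 3.5 Sonnet, $5/$15 for GPT-4o at launch). Current prices may differ.

```typescript
interface ModelPrice {
  name: string;
  inputPerMTok: number;  // USD per 1M input tokens
  outputPerMTok: number; // USD per 1M output tokens
}

// Assumed list prices; check the providers' pricing pages before relying on them.
const MODELS: ModelPrice[] = [
  { name: "Claude 3.5 Sonnet", inputPerMTok: 3, outputPerMTok: 15 },
  { name: "GPT-4o", inputPerMTok: 5, outputPerMTok: 15 },
];

// Cost in USD of one request with the given token counts.
function costPerQuery(price: ModelPrice, inputTokens: number, outputTokens: number): number {
  return (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) / 1e6;
}

const INPUT_TOKENS = 2300; // assumption, back-derived from the monthly total
const OUTPUT_TOKENS = 500; // average output size stated above

MODELS.forEach((model) => {
  const perQuery = costPerQuery(model, INPUT_TOKENS, OUTPUT_TOKENS);
  const monthly = perQuery * 100000;
  console.log(`${model.name}: $${perQuery.toFixed(4)} per query, $${monthly.toFixed(0)} per 100K queries`);
});
```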
More context can produce better answers, but LLM costs grow in direct proportion to the tokens you send.
3 chunks (2K tokens):
10 chunks (6K tokens):
20 chunks (12K tokens):
Key decision: Find the minimum context that maintains quality. Every extra chunk costs you.
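To put numbers on that decision, the sketch below prices the three context sizes listed above. It assumes Claude 3.5 Sonnet list prices ($3 input, $15 output per million tokens) and 500 output tokens, and it ignores the system prompt and the user question so that only the effect of chunk count is visible.

```typescript
// Assumed Claude 3.5 Sonnet list prices, USD per 1M tokens.
const INPUT_PER_MTOK = 3;
const OUTPUT_PER_MTOK = 15;
const OUTPUT_TOKENS = 500; // average output size used in this article

// Cost of one query whose prompt holds only the retrieved context.
function queryCost(contextTokens: number): number {
  return (contextTokens * INPUT_PER_MTOK + OUTPUT_TOKENS * OUTPUT_PER_MTOK) / 1e6;
}

const scenarios = [
  { chunks: 3, contextTokens: 2000 },
  { chunks: 10, contextTokens: 6000 },
  { chunks: 20, contextTokens: 12000 },
];

scenarios.forEach((s) => {
  const perQuery = queryCost(s.contextTokens);
  console.log(
    `${s.chunks} chunks: $${perQuery.toFixed(4)} per query, $${(perQuery * 100000).toFixed(0)} per 100K queries`
  );
});
```

Under these assumptions the three scenarios come to roughly $0.0135, $0.0255, and $0.0435 per query. The context portion scales linearly with chunk count while the output portion stays fixed.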
Reranking improves relevance by scoring retrieved chunks before sending to the LLM.
Cohere Rerank API:
Cost impact:
ROI calculation:
If reranking lets you use 3 chunks instead of 5:
Reranking often pays for itself.
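A rough break-even check for the 5-to-3 chunk scenario might look like the following. The inputs are assumptions: about 600 tokens per chunk (implied by 10 chunks being roughly 6K tokens), $3 per million input tokens, and $0.001 per rerank call (implied by a $100 bill for 100K queries). Plug in your own numbers.

```typescript
interface RerankScenario {
  chunksBefore: number;       // chunks sent to the LLM without reranking
  chunksAfter: number;        // chunks sent after reranking
  tokensPerChunk: number;     // average chunk size in tokens
  inputPricePerMTok: number;  // LLM input price, USD per 1M tokens
  rerankCostPerQuery: number; // rerank API cost per query, USD
}

// Net USD saved per query: LLM input savings minus the rerank fee.
// A negative result means reranking costs more than it saves.
function rerankNetSavings(s: RerankScenario): number {
  const tokensSaved = (s.chunksBefore - s.chunksAfter) * s.tokensPerChunk;
  const llmSavings = (tokensSaved * s.inputPricePerMTok) / 1e6;
  return llmSavings - s.rerankCostPerQuery;
}

const netPerQuery = rerankNetSavings({
  chunksBefore: 5,
  chunksAfter: 3,
  tokensPerChunk: 600,       // assumption
  inputPricePerMTok: 3,      // assumed Claude 3.5 Sonnet input price
  rerankCostPerQuery: 0.001, // assumption
});

console.log(`net savings: $${netPerQuery.toFixed(4)} per query, $${(netPerQuery * 100000).toFixed(0)} per 100K queries`);
```

With these inputs the LLM savings ($0.0036 per query) exceed the rerank fee ($0.001), so reranking nets about $0.0026 per query. With a cheaper model or smaller chunks the result can flip negative, which is why it is worth running the check.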
Customer support chatbot (100K queries/month):
Embeddings (queries): $0.10
Vector DB (Pinecone): $0.30
Reranking (Cohere): $100.00
LLM (Claude 3.5 Sonnet): $1,440.00
--------------------------------
Total: $1,540.40/month
LLM inference = 93.5% of total cost.
Before: Retrieve 10 chunks, send all to LLM
After: Retrieve 10 chunks, rerank, send top 3
Savings: ~40% on LLM costs
Implementation:
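// Retrieve more candidates than the LLM will ultimately see
const candidates = await vectorDB.query(embedding, { topK: 10 });

// Rerank and keep the 3 most relevant chunks
const reranked = await cohere.rerank({
  query: userQuery,
  documents: candidates,
  topN: 3,
});

// Send only the top 3 to the LLM
// (assumes each reranked result exposes its chunk text as `document`)
const context = reranked.map((r) => r.document).join('\n');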
If your system prompt and instructions are constant, cache them.
With Anthropic's prompt caching:
Savings: Minimal per query, but adds up at scale.
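As a sketch of how the savings work out: Anthropic bills cache writes at 1.25x the base input price and cache reads at 0.1x. The code below assumes a $3 per million base input price and hypothetical token counts (1,200 static prefix tokens, 2,000 dynamic tokens). `buildRequest` shows the request body shape with the system block marked cacheable; it does not call the API. Note that Anthropic documents a minimum cacheable prompt length, so a very short system prompt may not be eligible.

```typescript
// Assumed Claude 3.5 Sonnet input price and Anthropic's cache multipliers.
const BASE_INPUT_PER_MTOK = 3;       // USD per 1M input tokens
const CACHE_WRITE_MULTIPLIER = 1.25; // request that writes the cache
const CACHE_READ_MULTIPLIER = 0.1;   // later requests that hit the cache

type CacheMode = "none" | "write" | "read";

// Input cost in USD when the static prefix is uncached, written, or read from cache.
function inputCost(prefixTokens: number, dynamicTokens: number, mode: CacheMode): number {
  const multiplier =
    mode === "write" ? CACHE_WRITE_MULTIPLIER : mode === "read" ? CACHE_READ_MULTIPLIER : 1;
  return ((prefixTokens * multiplier + dynamicTokens) * BASE_INPUT_PER_MTOK) / 1e6;
}

// Messages API request body with the static system block marked cacheable.
function buildRequest(systemPrompt: string, userContent: string) {
  return {
    model: "claude-3-5-sonnet-20241022",
    max_tokens: 500,
    system: [{ type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } }],
    messages: [{ role: "user", content: userContent }],
  };
}

// Hypothetical sizes: 1,200 static prefix tokens, 2,000 tokens of context and question.
const uncached = inputCost(1200, 2000, "none");
const cachedRead = inputCost(1200, 2000, "read");
console.log(`saved per cache hit: $${(uncached - cachedRead).toFixed(5)}`);
```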
Not every query needs your most expensive model.
Simple queries (FAQ, definitions):
Complex queries (analysis, reasoning):
Learn more about choosing between GPT-4o and Claude 3.5 Sonnet for different use cases.
Implementation with CostLens:
import { CostLens } from 'costlens';

const client = new CostLens({
  apiKey: process.env.COSTLENS_API_KEY,
  smartRouting: true,
});

// CostLens analyzes query complexity and routes appropriately
const response = await client.chat({
  messages: [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: `Context: ${context}\n\nQuestion: ${query}` },
  ],
});
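If you want to prototype routing by hand first, a crude heuristic is enough to gauge how much traffic could go to a cheaper model. The sketch below is hypothetical throughout: the keyword list, the 30-word threshold, and the tier names are illustrative assumptions, not how CostLens or any other router classifies queries.

```typescript
type Tier = "cheap" | "premium";

// Words that loosely suggest analysis or reasoning. Purely illustrative.
const REASONING_HINTS = ["why", "compare", "analyze", "explain", "trade-off", "debug"];
const LONG_QUERY_WORDS = 30; // arbitrary threshold

// Route long or reasoning-heavy queries to the premium tier, everything else to the cheap tier.
function pickTier(query: string): Tier {
  const text = query.toLowerCase();
  const wordCount = text.split(/\s+/).filter(Boolean).length;
  const needsReasoning = REASONING_HINTS.some((hint) => text.includes(hint));
  return wordCount > LONG_QUERY_WORDS || needsReasoning ? "premium" : "cheap";
}

console.log(pickTier("What is the refund policy?"));
console.log(pickTier("Compare the Pro and Team plans and explain the trade-offs"));
```

Logging the chosen tier alongside answer quality for a week or two gives a baseline before committing to a more sophisticated classifier.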
Many queries are similar or identical.
Strategy:
Typical cache hit rate: 30-50%
Savings: 30-50% on LLM costs
import { CostLens } from 'costlens';

const client = new CostLens({
  apiKey: process.env.COSTLENS_API_KEY,
  enableCache: true,
  cacheTTL: 3600, // 1 hour
});

// Automatic caching of similar queries
const response = await client.chat({
  messages: [...],
});
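For intuition about what a response cache does, here is a minimal in-memory sketch. It is an assumption-laden stand-in, not the CostLens implementation: it matches queries exactly after normalizing case and whitespace, so catching merely similar queries would need embedding-based matching on top. The class name and the injectable clock are illustrative choices.

```typescript
interface CacheEntry {
  value: string;
  expiresAt: number; // epoch milliseconds
}

class ResponseCache {
  private entries = new Map<string, CacheEntry>();
  private ttlMs: number;
  private now: () => number;
  hits = 0;
  misses = 0;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Normalize so trivially different phrasings share one key.
  private key(query: string): string {
    return query.trim().toLowerCase().replace(/\s+/g, " ");
  }

  get(query: string): string | undefined {
    const k = this.key(query);
    const entry = this.entries.get(k);
    if (entry && entry.expiresAt > this.now()) {
      this.hits++;
      return entry.value;
    }
    this.entries.delete(k); // drop expired entries lazily
    this.misses++;
    return undefined;
  }

  set(query: string, value: string): void {
    this.entries.set(this.key(query), { value, expiresAt: this.now() + this.ttlMs });
  }

  hitRate(): number {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}

const demoCache = new ResponseCache(3600 * 1000); // 1 hour TTL
demoCache.set("What is RAG?", "Retrieval-Augmented Generation is ...");
console.log(demoCache.get("  what is   rag? ") !== undefined); // served from cache, no LLM call
```

Every hit skips an LLM call entirely, so the LLM bill shrinks roughly in proportion to the measured `hitRate()`.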
Track these metrics:
1. Cost per query
2. Average context size
3. Cache hit rate
4. Model distribution
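All four metrics can be derived from one log record per query. The sketch below assumes you already capture the model name, billed cost, context size, and cache status for each request; the `QueryRecord` shape and field names are hypothetical.

```typescript
interface QueryRecord {
  model: string;
  costUsd: number;       // billed cost of this query (0 for cache hits)
  contextTokens: number; // tokens of retrieved context sent to the LLM
  cacheHit: boolean;
}

interface RagMetrics {
  costPerQuery: number;
  avgContextTokens: number;
  cacheHitRate: number;
  modelShare: Record<string, number>; // fraction of queries per model
}

function summarize(records: QueryRecord[]): RagMetrics {
  const n = records.length;
  if (n === 0) {
    return { costPerQuery: 0, avgContextTokens: 0, cacheHitRate: 0, modelShare: {} };
  }
  let totalCost = 0;
  let totalContext = 0;
  let hits = 0;
  const counts: Record<string, number> = {};
  for (const r of records) {
    totalCost += r.costUsd;
    totalContext += r.contextTokens;
    if (r.cacheHit) hits++;
    counts[r.model] = (counts[r.model] ?? 0) + 1;
  }
  const modelShare: Record<string, number> = {};
  for (const model of Object.keys(counts)) {
    modelShare[model] = (counts[model] ?? 0) / n;
  }
  return {
    costPerQuery: totalCost / n,
    avgContextTokens: totalContext / n,
    cacheHitRate: hits / n,
    modelShare,
  };
}

// Illustrative records only, not real production data.
console.log(
  summarize([
    { model: "premium", costUsd: 0.0144, contextTokens: 2000, cacheHit: false },
    { model: "cheap", costUsd: 0.001, contextTokens: 1200, cacheHit: false },
    { model: "cheap", costUsd: 0, contextTokens: 0, cacheHit: true },
  ])
);
```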
SaaS documentation chatbot:
Before optimization:
After optimization:
Savings: $810/month ($9,720 annually)
RAG cost hierarchy:
Focus your optimization efforts on:
Don't waste time on:
Track your costs with CostLens to see exactly where your money goes. Most teams discover they can cut RAG costs by 50-70% without sacrificing quality.