If you're an engineering team building Retrieval-Augmented Generation (RAG) systems, you've probably hit a wall. Your LLM bills are through the roof, and you're scrambling to cut OpenAI API costs or find a cheaper alternative to gpt-4o. But what if the real problem isn't the generative LLM itself, but the hidden costs lurking in your embedding models and inefficient retrieval?
I've seen it repeatedly: teams meticulously track LLM inference tokens, yet overlook the compounding expenses of document embedding and the retrieval process that feeds those expensive LLMs. A senior AI engineer recently shared his experience of cutting a startup's LLM token costs by 90%, saving over $1 million a year. He did this largely by tackling a "very suboptimal architecture" in their RAG agent that was "burning tokens like crazy" at over $3,700 a day. This isn't an isolated incident; it's a common struggle. Many teams initially fixate on vector database costs, only to find the bigger bill comes from elsewhere.
True RAG cost optimization starts long before the LLM generates its first word. It begins with a ruthless focus on your embedding strategy and retrieval efficiency.
OpenAI's text-embedding-3-small model is priced at a seemingly negligible $0.02 per million tokens for standard use, or $0.01 per million tokens with their Batch API. text-embedding-3-large costs $0.13 per million tokens (or $0.065 with Batch API), and the legacy text-embedding-ada-002 is $0.10 per million tokens (or $0.05 with Batch API). These numbers appear trivial on their own. As one analysis points out, "Embedding a million documents at 500 tokens each costs about $10 with text-embedding-3-small. That's a rounding error."
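The arithmetic behind that quote is easy to check. The sketch below uses only the per-million-token prices listed above; the function name and the corpus size are illustrative assumptions, not part of any provider SDK.

```python
# Prices in USD per 1M tokens, as quoted in the paragraph above.
PRICE_PER_MILLION = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "text-embedding-ada-002": 0.10,
}


def embedding_cost(num_docs: int, tokens_per_doc: int, model: str,
                   batch_api: bool = False) -> float:
    """Return the USD cost of embedding a corpus once (hypothetical helper)."""
    total_tokens = num_docs * tokens_per_doc
    cost = total_tokens / 1_000_000 * PRICE_PER_MILLION[model]
    # The Batch API rates quoted above are half the standard rates.
    return cost / 2 if batch_api else cost


# One million documents at 500 tokens each, as in the quoted analysis.
for model in PRICE_PER_MILLION:
    print(model, round(embedding_cost(1_000_000, 500, model), 2))
```

With text-embedding-3-small this comes to about $10 at the standard rate, which matches the quoted figure.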
But here's the catch: the problem isn't the per-token cost; it's the cumulative cost. You're embedding your entire corpus, re-embedding updates, and generating embeddings for every single user query. Most teams building RAG systems get the budget wrong by a factor of two or three because they fixate on these small per-token rates and "then act surprised when the real costs show up six months later in storage overruns, reindexing cycles, and an engineer spending half their week babysitting retrieval quality."
Consider a typical RAG setup:
Beyond the embedding models themselves, inefficient retrieval practices further inflate your RAG bill:
gpt-4o or Claude Opus 4.7. Simple factual questions can often be handled by cheaper, faster models. OpenAI's gpt-4o-mini is a far more cost-efficient alternative, priced at $0.15 input / $0.60 output per 1M tokens. Similarly, Anthropic's Claude Haiku 4.5 is priced at $1.00 input / $5.00 output per 1M tokens. These are significantly cheaper than gpt-4o ($2.50 input / $10.00 output per 1M tokens) or Claude Opus 4.7 ($5.00 input / $25.00 output per 1M tokens).
The "latency iceberg" analogy holds true for costs too: the visible LLM inference is just the tip. Embedding operations and vector search are silent killers of performance and budget, sometimes accounting for a substantial portion of total latency.
The path to true RAG cost optimization isn't about chasing the absolute cheapest LLM for generation. It's about a holistic approach that prioritizes lean embedding choices and intelligent retrieval strategies first. I advocate for these principles, learned from experience:
Default to text-embedding-3-small. Always.
For 95% of use cases, the cheapest OpenAI embedding model, text-embedding-3-small ($0.02 per 1M tokens), offers an excellent cost-to-performance ratio. It costs 6.5 times less than text-embedding-3-large ($0.13 per 1M tokens), with only a marginal quality difference on most tasks. Test text-embedding-3-small on your actual retrieval tasks before considering a more expensive model. You'll likely be surprised by its capabilities.
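One way to run that test is a small recall@k harness over a handful of labeled queries. This is a minimal sketch: `toy_embed` is a hypothetical stand-in for a call to whichever embedding model you are evaluating, and the documents and queries are invented for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence

Embedder = Callable[[str], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(embed: Embedder, docs: Dict[str, str],
                labeled_queries: Dict[str, str], k: int = 3) -> float:
    """Fraction of queries whose known-relevant doc id appears in the top k."""
    doc_vectors = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for query, relevant_id in labeled_queries.items():
        q = embed(query)
        ranked = sorted(doc_vectors,
                        key=lambda d: cosine(q, doc_vectors[d]), reverse=True)
        hits += relevant_id in ranked[:k]
    return hits / len(labeled_queries)


# Toy word-count embedder (assumption): swap in a real model call here.
VOCAB = ["refund", "invoice", "password", "reset", "shipping", "delay"]


def toy_embed(text: str) -> List[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


docs = {
    "billing": "refund and invoice policy",
    "account": "password reset steps",
    "logistics": "shipping delay notices",
}
labeled = {
    "how do i reset my password": "account",
    "where is my refund": "billing",
}
score = recall_at_k(toy_embed, docs, labeled, k=1)
print(f"recall@1 = {score:.2f}")
```

Running the same harness once per candidate model, on your own queries, gives a like-for-like number to weigh against the price difference.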
Batch Aggressively for Ingestion and Queries.
Never embed documents one at a time. Use batching for your initial corpus ingestion and, where feasible, for query embeddings. This significantly reduces API overhead. OpenAI's Batch API also provides a 50% discount for asynchronous embedding jobs that complete within 24 hours, which suits bulk ingestion better than live queries. Anthropic offers a 50% discount for batch processing as well.
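A batching loop for ingestion can be sketched as follows. `embed_batch` is a hypothetical callable standing in for one API request that accepts a list of inputs, and the batch size of 256 is an arbitrary assumption; real limits depend on the provider.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_corpus(texts: Sequence[str],
                 embed_batch: Callable[[Sequence[str]], List[Vector]],
                 batch_size: int = 256) -> List[Vector]:
    """Embed all texts with one request per batch instead of one per text."""
    vectors: List[Vector] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


# Fake embedder (assumption) that records how many inputs each request had.
calls: List[int] = []


def fake_embed_batch(batch: Sequence[str]) -> List[Vector]:
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


texts = [f"document {i}" for i in range(1000)]
vectors = embed_corpus(texts, fake_embed_batch, batch_size=256)
print(len(vectors), "vectors from", len(calls), "requests")
```

Here 1,000 documents turn into 4 requests rather than 1,000, which is where the reduction in per-request overhead comes from.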
Optimize Retrieval Context, Not Just Token Count.
Quality beats quantity. Retrieve fewer, more relevant chunks. Implement dynamic context sizing to adjust retrieval based on query complexity. Simple, factual queries need minimal context; complex analytical questions might warrant more. Consider relevance thresholding to drop chunks below a certain similarity score, reducing the number of tokens sent to the LLM.
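Relevance thresholding and dynamic context sizing can be combined in a few lines. The sketch below is illustrative only: the `Chunk` type, the 0.75 threshold, the token budget, and the word-count heuristic are all assumptions to tune against your own data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    score: float  # similarity to the query, higher is more relevant
    tokens: int


def chunk_limit_for(query: str) -> int:
    """Crude complexity heuristic (assumption): short queries get fewer chunks."""
    return 2 if len(query.split()) <= 8 else 6


def select_context(chunks: List[Chunk], max_chunks: int,
                   min_score: float = 0.75,
                   token_budget: int = 1500) -> List[Chunk]:
    """Keep the best chunks above the threshold that fit the token budget."""
    kept: List[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        # Stop once scores fall below the threshold or enough chunks are kept.
        if chunk.score < min_score or len(kept) >= max_chunks:
            break
        # Skip a chunk that would overflow the budget; a smaller one may fit.
        if used + chunk.tokens > token_budget:
            continue
        kept.append(chunk)
        used += chunk.tokens
    return kept


retrieved = [
    Chunk("refund policy overview", 0.91, 400),
    Chunk("full terms of service", 0.82, 1200),
    Chunk("refund window details", 0.78, 300),
    Chunk("unrelated shipping note", 0.60, 200),
]
query = "What is the refund window?"
selected = select_context(retrieved, max_chunks=chunk_limit_for(query))
print([c.text for c in selected], sum(c.tokens for c in selected), "tokens")
```

In this toy run, 2,100 retrieved tokens shrink to 700 before anything reaches the LLM.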
Implement Smart Model Routing for LLM Generation.
After optimizing your embeddings and retrieval, route simpler RAG queries to more cost-effective LLMs. For example, use OpenAI's gpt-4o-mini ($0.15 input / $0.60 output per 1M tokens) or Anthropic's Claude Haiku 4.5 ($1.00 input / $5.00 output per 1M tokens) for basic Q&A. Reserve gpt-4o ($2.50 input / $10.00 output per 1M tokens) or Claude Sonnet 4.6 ($3.00 input / $15.00 output per 1M tokens) for complex reasoning tasks. This can reduce LLM generation costs by over 50%. Prompt caching, especially for Anthropic models, can also provide up to 90% savings on repeated input contexts.
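A router does not need to be sophisticated to be useful. The following is a minimal sketch using the OpenAI prices quoted above; the keyword list and thresholds are illustrative assumptions, and a production router would more likely rely on a classifier or on measured answer quality.

```python
from typing import Dict, Tuple

# USD per 1M tokens as (input, output), taken from the prices quoted above.
PRICES: Dict[str, Tuple[float, float]] = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}

# Hypothetical signals that a query needs heavier reasoning.
COMPLEX_HINTS = ("compare", "analyze", "why", "explain", "trade-off", "summarize")


def route(query: str, num_chunks: int) -> str:
    """Pick a model name from simple heuristics on the query and its context."""
    q = query.lower()
    is_complex = (
        num_chunks > 4
        or len(q.split()) > 25
        or any(hint in q for hint in COMPLEX_HINTS)
    )
    return "gpt-4o" if is_complex else "gpt-4o-mini"


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the listed per-million-token prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


for query, chunks in [
    ("What is the refund window?", 2),
    ("Compare the 2023 and 2024 pricing policies and explain the differences", 6),
]:
    model = route(query, chunks)
    print(model, f"${request_cost(model, 2000, 300):.5f}")
```

For a request with 2,000 input and 300 output tokens, the listed prices work out to $0.008 on gpt-4o and $0.00048 on gpt-4o-mini, so every simple query routed to the smaller model matters at volume.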
Measure and Monitor Continuously.
You can't optimize what you don't measure. Track embedding costs, retrieval latency, and LLM token usage across different parts of your RAG pipeline. This visibility surfaces anomalies early and identifies which architectural choices are truly driving costs. "The actual RAG implementation cost is shaped by decisions that interact across your entire stack: how you chunk documents, what embedding dimensions you choose, whether you rerank, how often your corpus changes, and whether anyone's tracking cost-per-query alongside latency."
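Per-stage tracking can start as a simple in-process counter before you reach for a full observability stack. This sketch assumes hypothetical stage names and uses the text-embedding-3-small and gpt-4o prices quoted earlier; the token volumes are invented for illustration.

```python
from collections import defaultdict
from typing import Dict


class CostTracker:
    """Accumulates token counts per pipeline stage and converts them to USD."""

    def __init__(self, prices_per_million: Dict[str, float]) -> None:
        self.prices = prices_per_million
        self.tokens: Dict[str, int] = defaultdict(int)

    def record(self, stage: str, tokens: int) -> None:
        self.tokens[stage] += tokens

    def cost_by_stage(self) -> Dict[str, float]:
        return {stage: count / 1_000_000 * self.prices[stage]
                for stage, count in self.tokens.items()}

    def cost_per_query(self, num_queries: int) -> float:
        return sum(self.cost_by_stage().values()) / num_queries


# Stage names are assumptions; prices are USD per 1M tokens from the text.
tracker = CostTracker({
    "query_embedding": 0.02,   # text-embedding-3-small
    "llm_input": 2.50,         # gpt-4o input
    "llm_output": 10.00,       # gpt-4o output
})

# Illustrative totals for 1,000 queries.
tracker.record("query_embedding", 20_000)
tracker.record("llm_input", 2_000_000)
tracker.record("llm_output", 300_000)

for stage, cost in tracker.cost_by_stage().items():
    print(f"{stage}: ${cost:.4f}")
print(f"cost per query: ${tracker.cost_per_query(1000):.6f}")
```

In this made-up workload the retrieved context sent as LLM input is the largest line item, which is the kind of pattern this visibility is meant to surface.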
For example, a RAG system that was initially burning $3,700 a day in token costs was optimized to cut 90% of its expenses. This was achieved by implementing caching, prompt routing to cheaper models, and replacing an expensive LLM-as-a-judge step with an open-source re-ranker model. These are real, tangible savings that go beyond just swapping out a GPT-4 call for a GPT-3.5 one.
Focusing on these often-overlooked areas of your RAG stack can significantly reduce your OpenAI API costs and other provider expenses, making your AI applications both performant and economical.