Stop Burning Cash: Slash LLM API Costs with Smart Routing and Semantic Caching
We were overpaying for redundant LLM calls. Here's how dynamic routing and semantic caching cut our API costs by up to 80%.
Every developer diving deep into building with Large Language Models (LLMs) eventually hits the same wall: the API bill. It's a painful reality, watching those costs climb as your application scales. We've all felt the sting – from "Why is GPT-4o so expensive for simple prompts?" to "Our RAG system is bleeding money on repeated queries." If you're nodding along, know this: you're not alone.
I remember a discussion on Hacker News where a developer asked, "How can ChatGPT serve 700M users when I can't run one GPT-4 locally?" While the conversation focused on raw infrastructure, the underlying truth resonated deeply with us: efficient LLM usage at scale demands more than just throwing requests at the most powerful model. That same logic applies directly to your API bill. My team realized we were significantly overpaying for redundant, low-value LLM calls.
The Hidden Culprit: Overpaying for Repetitive Calls
Let's be blunt: you're probably burning cash by sending identical or semantically similar requests to powerful, expensive LLMs like OpenAI's GPT-5.4 or Anthropic's Claude Opus 4.7. This was a hard lesson for us, but once we saw it, it was undeniable. Think about common scenarios in many applications:
- RAG systems: Multiple users often ask very similar questions over the same set of documents. Why pay for repeated retrieval and generation if the core intent and context are the same?
- User support chatbots: FAQs, common troubleshooting steps, and repetitive user inquiries can hit your LLM API countless times a day.
- Data processing pipelines: Extracting the same entity type from a batch of similar documents or categorizing content that largely overlaps.
The default developer instinct is often to send every request to the most capable (and thus, most expensive) model. This is a critical mistake. As Adnan Masood, PhD, aptly pointed out on Medium, common anti-patterns include "No caching despite repeat prompts" and "Single-model for everything".
Consider the providers' own pricing. OpenAI's GPT-5.4 costs $2.50 per million input tokens, but its "Cached input" rate drops to $0.25 per million tokens, a 90% discount. Anthropic offers a similar 90% discount on prompt cache reads for models like Claude Haiku 4.5, Sonnet 4.6, and Opus 4.7. These discounts apply to input tokens in a prompt prefix the provider has recently seen, such as a long system prompt or a shared block of document context. They aren't minor savings; they're massive levers for cost control. Yet many teams aren't fully leveraging them, often due to perceived implementation complexity.
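To make the cached-input arithmetic concrete, here is a minimal sketch. The prices are the GPT-5.4 figures quoted above; the workload (1,000 requests sharing an 8,000-token prefix, each with a 500-token unique question) is a made-up assumption, and `input_cost_usd` is a hypothetical helper, not a provider API. It also ignores any cache-write surcharge and assumes every request after warm-up hits the cache, so treat the result as an upper bound.

```python
def input_cost_usd(total_tokens, cached_tokens, price_per_mtok, cached_price_per_mtok):
    """Input cost for one request when part of the prompt bills at the cached rate."""
    if not 0 <= cached_tokens <= total_tokens:
        raise ValueError("cached_tokens must be between 0 and total_tokens")
    uncached_tokens = total_tokens - cached_tokens
    return (
        uncached_tokens * price_per_mtok + cached_tokens * cached_price_per_mtok
    ) / 1_000_000


# Hypothetical workload: these numbers are assumptions for illustration only.
REQUESTS = 1_000
PREFIX_TOKENS = 8_000    # shared system prompt + document context
QUESTION_TOKENS = 500    # unique part of each request
PRICE, CACHED_PRICE = 2.50, 0.25  # USD per 1M input tokens, as quoted in this article

total = PREFIX_TOKENS + QUESTION_TOKENS
no_cache = REQUESTS * input_cost_usd(total, 0, PRICE, CACHED_PRICE)
with_cache = REQUESTS * input_cost_usd(total, PREFIX_TOKENS, PRICE, CACHED_PRICE)
savings = 1 - with_cache / no_cache

print(f"without cache: ${no_cache:.2f}")
print(f"with cache:    ${with_cache:.2f}")
print(f"input savings: {savings:.1%}")
```

Note that the overall saving lands below 90% because the unique question tokens still bill at the full rate; the longer the shared prefix relative to the unique part, the closer you get to the headline discount.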
The Solution: Dynamic Routing & Semantic Caching
The real answer to this cost bloat isn't just blindly switching to a "cheaper alternative" for every task. It's about building intelligent infrastructure: dynamic model routing combined with semantic caching. This combination allows you to pay only for the "thinking" that's genuinely new and valuable.
1. Dynamic Model Routing: Send the Right Request to the Right Model
Dynamic routing means automatically directing each LLM query to the most cost-effective model that can still deliver the required quality for that specific task. Not every request needs the immense reasoning power of GPT-5.4 or Claude Opus 4.7.
- Simple classifications, summarizations, or entity extractions can often be handled perfectly by smaller, faster, and significantly cheaper models. For example, OpenAI's GPT-5.4 mini charges $0.75 per million input tokens, and the highly efficient GPT-4o-mini charges $0.15 per million input tokens. For many tasks, we've found GPT-4o-mini to be a superior replacement for GPT-3.5 Turbo, offering better performance at a lower cost.
- Complex reasoning, multi-turn conversations, or demanding code generation might genuinely require a flagship model like GPT-5.4 ($2.50 input / $15.00 output per MTok) or Claude Opus 4.7 ($5.00 input / $25.00 output per MTok).
We've seen that even a basic routing strategy can drastically reduce costs. At the input prices above, moving an appropriate task from GPT-5.4 ($2.50/1M tokens) to GPT-5.4 mini ($0.75/1M tokens) cuts its input cost by 70%, and moving it to GPT-4o-mini ($0.15/1M tokens) cuts it by 94%, in our experience without a noticeable drop in quality for the end user. This isn't theoretical; it's a strategy we've implemented for LLM cost optimization.
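A basic routing rule can be as small as the sketch below. This is an illustration, not our production router: the task categories, the `max_simple_chars` cutoff, and the `route` function are all hypothetical, and the model labels and prices are simply the ones quoted in this article rather than verified API identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str                  # label from this article, not a verified API model ID
    input_usd_per_mtok: float  # input price quoted in this article


CHEAP = ModelTier("gpt-4o-mini", 0.15)
FLAGSHIP = ModelTier("gpt-5.4", 2.50)

# Hypothetical task labels the calling application attaches to each request.
SIMPLE_TASKS = {"classification", "summarization", "entity_extraction"}


def route(task_type: str, prompt: str, max_simple_chars: int = 4_000) -> ModelTier:
    """Pick the cheapest tier that is assumed good enough for the task.

    Simple task types with short prompts go to the cheap tier; anything
    unknown, long, or complex falls through to the flagship model.
    """
    if task_type in SIMPLE_TASKS and len(prompt) <= max_simple_chars:
        return CHEAP
    return FLAGSHIP


print(route("classification", "Is this email spam? ...").name)
print(route("code_generation", "Refactor this module ...").name)
```

Defaulting to the flagship model for anything the rules don't recognize is a deliberate choice here: a misrouted hard task costs you quality, while a misrouted easy task only costs you money.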
2. Semantic Caching: Eliminate Redundant API Calls
Semantic caching takes traditional caching a crucial step further. Instead of just matching exact strings, it understands the meaning of a query. If a new request is semantically similar to one already processed and cached, the system returns the cached response instead of making another expensive LLM API call. This has been a game-changer for us in reducing our OpenAI and Anthropic bills.
Provider-side prompt caching is a separate, complementary mechanism. For example, Anthropic offers a 90% discount on prompt cache reads. That discount applies to calls you still make, whereas a semantic cache hit avoids the call entirely. Both directly impact your bottom line.
How it works in practice:
- Incoming Query: A user sends a request to your LLM-powered application.
- Semantic Similarity Check: Our internal LLM gateway first checks if a semantically similar query has been processed recently and stored in the cache.
- Cache Hit (No LLM Call): If a sufficiently similar query is found, the cached response is immediately returned. This completely bypasses an expensive API call to the LLM provider, so the only cost is the similarity check itself. (Separately, both OpenAI and Anthropic offer significant discounts for cached input tokens on the calls you do make.)
- Cache Miss & Model Routing: If no semantic match is found, our dynamic router steps in to decide the optimal LLM based on the query's complexity, urgency, and required capabilities.
- API Call & Cache Update: The chosen LLM generates a response, which is then stored in the semantic cache for future similar queries.
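The steps above can be sketched in a few dozen lines. To keep it self-contained, this example uses a toy bag-of-words "embedding" and an in-memory list; a real system would swap in an embedding model and a vector store. `toy_embed`, `SemanticCache`, `answer`, the 0.8 threshold, and the `call_llm` callable are all assumptions for illustration, not our gateway's actual interface.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed cutoff; tune it against real traffic
        self._entries = []          # list of (embedding, response) pairs

    def lookup(self, query):
        """Return the best cached response at or above the threshold, else None."""
        query_vec = toy_embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self._entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query, response):
        self._entries.append((toy_embed(query), response))


def answer(query, cache, call_llm):
    """Check the cache first; on a miss, call the model and cache the result."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, True      # cache hit: no LLM call made
    response = call_llm(query)   # on a miss, model routing would happen in here
    cache.store(query, response)
    return response, False


# Usage with a fake model so the sketch runs without any API key.
llm_calls = []

def fake_llm(query):
    llm_calls.append(query)
    return f"answer #{len(llm_calls)}"

cache = SemanticCache()
first = answer("How do I reset my password?", cache, fake_llm)
second = answer("How can I reset my password", cache, fake_llm)
third = answer("What is your refund policy?", cache, fake_llm)
print(first, second, third, len(llm_calls))
```

The threshold is the main design decision. Set it too low and users get answers to questions they didn't ask; set it too high and the cache degrades into exact-match. The linear scan is also only reasonable for a small cache.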
Our Take: Stop Paying for "Thinking" You've Already Paid For
My team found many developers, ourselves included initially, were caught in a cycle of overspending because our LLM architecture treated every prompt as a unique, high-value task demanding the most expensive model. This completely overlooked the significant overlap and repetition inherent in real-world applications.
Our core goal became ensuring that every dollar spent on LLM APIs contributes to genuinely new, valuable output. If an LLM is "thinking" about something it has already "thought" about (or a much cheaper model could easily handle), you're simply wasting money.
Real Numbers (May 2026):
- OpenAI GPT-5.4 Input: $2.50 / 1M tokens
- OpenAI GPT-5.4 Cached Input: $0.25 / 1M tokens (90% savings!)
- Anthropic Claude Sonnet 4.6 Input: $3.00 / 1M tokens
- Anthropic Claude Sonnet 4.6 Cached Input: $0.30 / 1M tokens (90% savings!)
- OpenAI GPT-4o-mini vs. GPT-5.4: For tasks where GPT-4o-mini is sufficient, its input price of $0.15/1M tokens makes it 94% cheaper on input than GPT-5.4's $2.50/1M tokens.
The evidence is overwhelming: ignoring dynamic model routing and robust semantic caching means leaving massive cost savings on the table.
How We Tackled Overspending: Building Our LLM Optimization Layer
This problem was so pressing that my team built an internal LLM optimization layer. It wasn't about a fancy new product, but about solving a real-world engineering challenge. This layer sits in front of our LLM API calls, giving us the control and intelligence we desperately needed without complex infrastructure overhead.
Here's how it works for us:
- Real-time LLM Cost Tracking: We get immediate visibility into token usage and costs, broken down by model, user, and even specific application features. This quickly highlighted where redundant calls were happening.
- Dynamic Model Routing: We configured rules to automatically route requests to the most appropriate model. This means simple classification tasks go to a cheaper, faster model like GPT-4o-mini or Claude Haiku 4.5, while complex agentic workflows hit GPT-5.4 or Claude Opus 4.7.
- Semantic Prompt Caching: Our system implements a robust semantic cache. It intercepts incoming requests, performs a semantic similarity check against previous interactions, and if a match is found, serves a cached response. This dramatically reduces our actual calls to OpenAI and Anthropic APIs.
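As a rough idea of what the cost-tracking piece involves, here is a minimal sketch that attributes spend to a (model, feature) pair. The `CostTracker` class, the feature names, and the token counts are hypothetical; the per-token prices are the flagship figures quoted earlier in this article, and the same pattern extends to a per-user key.

```python
from collections import defaultdict

# (input, output) USD per 1M tokens, as quoted in this article.
# The dictionary keys are labels, not verified API model IDs.
PRICES = {
    "gpt-5.4": (2.50, 15.00),
    "claude-opus-4.7": (5.00, 25.00),
}


class CostTracker:
    def __init__(self, prices):
        self.prices = prices
        self.totals = defaultdict(float)  # (model, feature) -> USD spent

    def record(self, model, feature, input_tokens, output_tokens):
        """Add one call's cost to the running totals and return that cost."""
        in_price, out_price = self.prices[model]
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        self.totals[(model, feature)] += cost
        return cost

    def by_feature(self):
        """Roll totals up per application feature, across all models."""
        rollup = defaultdict(float)
        for (_model, feature), cost in self.totals.items():
            rollup[feature] += cost
        return dict(rollup)


# Made-up usage numbers for illustration.
tracker = CostTracker(PRICES)
tracker.record("gpt-5.4", "support_chat", input_tokens=1_000_000, output_tokens=100_000)
tracker.record("claude-opus-4.7", "agent", input_tokens=200_000, output_tokens=40_000)
tracker.record("gpt-5.4", "agent", input_tokens=400_000, output_tokens=0)
print(tracker.by_feature())
```

In practice the token counts would come from the usage data the provider returns with each response, so the tracker never has to estimate them.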
By implementing these strategies with our internal toolkit, we've consistently seen our LLM API bills cut by 40-80%, often with no perceptible difference in user experience. It required some upfront engineering, but the ROI has been phenomenal.
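If you want a back-of-the-envelope estimate for your own traffic, the sketch below combines the two levers. It is a simplification under stated assumptions: input tokens only, equal-sized requests, a cache hit costs nothing (embedding and storage costs are ignored), and the hit rate and routing share are invented inputs, not our measured numbers. `blended_cost_ratio` is a hypothetical helper.

```python
def blended_cost_ratio(cache_hit_rate, cheap_share, cheap_price, flagship_price):
    """Fraction of the baseline bill that remains after caching and routing.

    Baseline: every request goes to the flagship model with no cache.
    cache_hit_rate: share of requests served from the semantic cache.
    cheap_share: share of the remaining requests routed to the cheap model.
    """
    for name, value in (("cache_hit_rate", cache_hit_rate), ("cheap_share", cheap_share)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    miss_rate = 1.0 - cache_hit_rate
    avg_price = cheap_share * cheap_price + (1.0 - cheap_share) * flagship_price
    return miss_rate * avg_price / flagship_price


# Assumed inputs: 30% cache hits, half of the misses routed to the cheap model.
ratio = blended_cost_ratio(0.30, 0.50, cheap_price=0.15, flagship_price=2.50)
print(f"remaining cost: {ratio:.1%}, savings: {1 - ratio:.1%}")
```

The two levers multiply rather than add, which is why moderate hit rates combined with moderate routing shares can still produce a large overall reduction.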
Stop accepting inflated bills. It's time to implement intelligent strategies that truly optimize your LLM spend, ensuring every token counts.