How to Reduce OpenAI API Costs in 2026 (We Cut Ours by 43%)
5 techniques that took our OpenAI bill from $4,100 to $2,337/month. Code examples, real numbers, no fluff.
In January we were spending $4,100/month on OpenAI. By March, we'd cut it to $2,337 without dropping a single feature or degrading quality. Here's exactly what we did.
The Problem
Our app made ~80K OpenAI requests/month. Everything went to GPT-5.4 because it was "good enough" and nobody had time to optimize. Classic developer move — ship first, optimize later.
Then we actually looked at the data:
- 52% of requests were simple tasks (summarization, classification, extraction)
- 31% were moderate (content generation, Q&A)
- 17% were genuinely complex (code generation, multi-step reasoning)
We were paying $2.50/$15.00 per million input/output tokens for tasks that a $0.05/$0.40 model handles perfectly. That's a 50x overpay on input tokens (37.5x on output) for half our traffic.
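To make the overpay concrete, here is a small sketch of the arithmetic. The prices are the ones quoted above; the 500-input/50-output request shape and the names `Pricing` and `requestCost` are assumptions for illustration only.

```typescript
// Prices in USD per 1M tokens, taken from the figures quoted above.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

const flagship: Pricing = { inputPerMillion: 2.5, outputPerMillion: 15.0 };
const cheap: Pricing = { inputPerMillion: 0.05, outputPerMillion: 0.4 };

// Cost of a single request in USD.
function requestCost(p: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens * p.inputPerMillion + outputTokens * p.outputPerMillion) / 1_000_000;
}

// Hypothetical request shape: 500 input tokens, 50 output tokens.
const flagshipCost = requestCost(flagship, 500, 50);
const cheapCost = requestCost(cheap, 500, 50);

// Price ratios between the two models: about 50x on input, 37.5x on output.
const inputRatio = flagship.inputPerMillion / cheap.inputPerMillion;
const outputRatio = flagship.outputPerMillion / cheap.outputPerMillion;
```

The blended ratio for a real request sits between the two, depending on how output-heavy the task is.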
1. Smart Model Routing (Saved 34%)
The biggest win. Route simple prompts to cheap models, keep complex ones on GPT-5.4.
import { CostLens } from 'costlens';
import OpenAI from 'openai';
const costlens = new CostLens({
  apiKey: process.env.COSTLENS_API_KEY,
  smartRouting: true
});
const openai = costlens.wrapOpenAI(new OpenAI());
// Your code doesn't change. CostLens intercepts and routes:
const resp = await openai.chat.completions.create({
  model: 'gpt-5.4',
  messages: [{ role: 'user', content: 'Classify this support ticket: ...' }]
});
// → Routed to GPT-4.1-nano. Same result. $0.05 instead of $2.50 per 1M input.
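How CostLens decides what counts as "simple" isn't covered here, so the following is only a rough sketch of what a complexity-based router could look like if you wanted to prototype one yourself. The keyword lists, the 2,000-character threshold, and the names `classifyPrompt` and `pickModel` are assumptions for illustration, not the SDK's implementation.

```typescript
type Tier = "simple" | "moderate" | "complex";

// Model names follow the article; everything else here is hypothetical.
const MODEL_FOR_TIER: Record<Tier, string> = {
  simple: "gpt-4.1-nano",
  moderate: "gpt-5.4-mini",
  complex: "gpt-5.4",
};

const SIMPLE_HINTS = ["classify", "summarize", "extract"];
const COMPLEX_HINTS = ["write code", "refactor", "debug", "step by step"];

// Assumed heuristic: keyword match first, then fall back to prompt length.
function classifyPrompt(prompt: string): Tier {
  const text = prompt.toLowerCase();
  if (COMPLEX_HINTS.some((hint) => text.includes(hint))) return "complex";
  if (SIMPLE_HINTS.some((hint) => text.includes(hint))) return "simple";
  return text.length > 2000 ? "complex" : "moderate";
}

// Map a prompt to the cheapest model tier the heuristic considers adequate.
function pickModel(prompt: string): string {
  return MODEL_FOR_TIER[classifyPrompt(prompt)];
}
```

Complex hints are checked first so that a borderline prompt errs toward the stronger model; a misroute in that direction costs money, while the opposite costs quality.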
Our routing breakdown after optimization:
- GPT-4.1-nano (simple): 42K requests → $84/month
- GPT-5.4-mini (moderate): 25K requests → $563/month
- GPT-5.4 (complex): 13K requests → $780/month
Before: $4,100. After routing: $1,427 on model costs alone.
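As a quick sanity check on the breakdown above, this sketch totals the three tiers and derives an average cost per request for each. The numbers are the ones listed; the `RouteStats` type and variable names are just for illustration.

```typescript
interface RouteStats {
  model: string;
  requests: number;
  monthlyCostUsd: number;
}

// Figures copied from the routing breakdown above.
const routes: RouteStats[] = [
  { model: "GPT-4.1-nano", requests: 42_000, monthlyCostUsd: 84 },
  { model: "GPT-5.4-mini", requests: 25_000, monthlyCostUsd: 563 },
  { model: "GPT-5.4", requests: 13_000, monthlyCostUsd: 780 },
];

// Totals across all tiers: 80,000 requests and $1,427.
const totalRequests = routes.reduce((sum, r) => sum + r.requests, 0);
const totalCostUsd = routes.reduce((sum, r) => sum + r.monthlyCostUsd, 0);

// Average USD per request for each tier.
const costPerRequest = routes.map((r) => ({
  model: r.model,
  usd: r.monthlyCostUsd / r.requests,
}));
```

The per-request averages make the gap obvious: the complex tier costs about 30 times more per call than the simple tier.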
2. Response Caching (Saved 12%)
We had thousands of near-identical requests. Same system prompt + similar user inputs = same output. Why pay twice?
const costlens = new CostLens({
  apiKey: process.env.COSTLENS_API_KEY,
  enableCache: true, // Semantic similarity matching
  smartRouting: true
});
Our cache hit rate stabilized at ~18% after a week. That's 18% of requests that cost zero tokens.
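If you want to see the idea without any SDK, here is a minimal sketch of a response cache. Note that it uses exact matching on a normalized prompt, which is a simpler stand-in for the semantic similarity matching mentioned above; the `ResponseCache` class and the normalization rules are assumptions for illustration.

```typescript
// Collapse case and whitespace so trivially different prompts share a key.
// Assumption: case does not matter for your prompts. Drop toLowerCase() if it does.
function cacheKey(model: string, prompt: string): string {
  return `${model}::${prompt.trim().toLowerCase().replace(/\s+/g, " ")}`;
}

class ResponseCache {
  private store = new Map<string, string>();
  hits = 0;
  misses = 0;

  // Look up a cached response and count the lookup as a hit or a miss.
  get(model: string, prompt: string): string | undefined {
    const cached = this.store.get(cacheKey(model, prompt));
    if (cached === undefined) {
      this.misses++;
    } else {
      this.hits++;
    }
    return cached;
  }

  // Store a fresh response after paying for the API call.
  set(model: string, prompt: string, response: string): void {
    this.store.set(cacheKey(model, prompt), response);
  }

  // Fraction of lookups served from the cache so far.
  hitRate(): number {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}
```

On a miss you would call the API and `set` the result; on a hit you skip the call entirely. A production version would also need expiry and a size limit.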
A developer on HN reported similar results: "We added semantic caching and our bill dropped 15% overnight. The embarrassing part is how many identical requests we were making."
3. Prompt Optimization (Saved 8%)
Shorter prompts = fewer input tokens. We audited our top 10 prompts by volume:
Before (our support classifier — 180 tokens):
You are an expert customer support classifier. Your job is to analyze incoming support tickets and categorize them into the appropriate category. Please read the following ticket carefully and determine which category it belongs to. The categories are: billing, technical, feature-request, bug-report, general. Respond with only the category name.
After (42 tokens):
Classify this support ticket into one category: billing, technical, feature-request, bug-report, general. Reply with the category only.
Same accuracy. 77% fewer input tokens on our highest-volume prompt. At 42K requests/month, that adds up.
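Here is the arithmetic behind that claim as a short sketch. The token counts and request volume come from the figures above; `monthlySavingsUsd` is a hypothetical helper so you can plug in whatever input price applies to the model serving this prompt.

```typescript
const tokensBefore = 180;
const tokensAfter = 42;
const requestsPerMonth = 42_000;

// 138 tokens saved per request, roughly a 77% reduction.
const tokensSavedPerRequest = tokensBefore - tokensAfter;
const reductionPct = (tokensSavedPerRequest / tokensBefore) * 100;

// 5,796,000 input tokens saved per month on this one prompt.
const tokensSavedPerMonth = tokensSavedPerRequest * requestsPerMonth;

// USD saved per month at a given input price (per 1M tokens).
function monthlySavingsUsd(inputPricePerMillion: number): number {
  return (tokensSavedPerMonth / 1_000_000) * inputPricePerMillion;
}
```

The dollar value depends heavily on which model the prompt runs on, which is why trimming prompts matters most for anything still routed to the expensive tier.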
4. Batch API for Non-Urgent Work (Saved 6%)
OpenAI's Batch API gives 50% off for requests that can wait up to 24 hours. We moved our nightly report generation and content indexing to batch:
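import fs from 'node:fs';
// Upload a JSONL file with one request per line (the file name is illustrative)
const file = await openai.files.create({
  file: fs.createReadStream('batch_requests.jsonl'),
  purpose: 'batch'
});
// Batch requests get 50% discount
const batch = await openai.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h'
});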
~8K requests/month moved to batch = 50% off those requests.
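The batch input file is JSONL, with one request object per line carrying a `custom_id`, `method`, `url`, and `body`. As a rough sketch, a helper like the one below could build that file content from a list of prompts; the function name `toBatchJsonl` and the job shape are assumptions for illustration.

```typescript
// One line of a Batch API input file for the chat completions endpoint.
interface BatchLine {
  custom_id: string;
  method: "POST";
  url: "/v1/chat/completions";
  body: {
    model: string;
    messages: { role: "system" | "user"; content: string }[];
  };
}

// Hypothetical job shape: an ID you choose plus the prompt text.
interface BatchJob {
  id: string;
  content: string;
}

// Serialize jobs to JSONL: one JSON object per line, no trailing newline.
function toBatchJsonl(model: string, jobs: BatchJob[]): string {
  return jobs
    .map((job): BatchLine => ({
      custom_id: job.id,
      method: "POST",
      url: "/v1/chat/completions",
      body: { model, messages: [{ role: "user", content: job.content }] },
    }))
    .map((line) => JSON.stringify(line))
    .join("\n");
}
```

The `custom_id` is what lets you match results back to your own records, since batch output is not guaranteed to come back in input order.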
5. Budget Alerts + Kill Switch (Saved Us From a $2K Spike)
Not a cost reduction technique per se, but it prevented a disaster. One of our background jobs had a bug that caused infinite retries. Without alerts, it would have burned through $2K+ in a weekend.
We set daily budget caps per API key:
- Production key: $150/day max
- Development key: $20/day max
- Background jobs: $50/day max
When the runaway job hit $50, it got auto-blocked and we got a Slack alert. Fixed the bug Monday morning. Total damage: $50 instead of $2,000+.
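For a sense of how little logic a kill switch needs, here is a minimal in-memory sketch. The caps match the ones listed above; the `DailyBudgetGuard` class, its method names, and the alert callback are assumptions for illustration, and a real setup would persist spend somewhere shared and post to Slack in the callback.

```typescript
type AlertFn = (key: string, spentUsd: number, capUsd: number) => void;

// Daily caps in USD, as listed above.
const DAILY_CAPS_USD: Record<string, number> = {
  production: 150,
  development: 20,
  background: 50,
};

class DailyBudgetGuard {
  private spent = new Map<string, number>();
  private alerted = new Set<string>();
  private caps: Record<string, number>;
  private alert: AlertFn;

  constructor(caps: Record<string, number>, alert: AlertFn) {
    this.caps = caps;
    this.alert = alert;
  }

  // Call before each request: true while the key is under its daily cap.
  isAllowed(key: string): boolean {
    const cap = this.caps[key];
    return cap === undefined || (this.spent.get(key) ?? 0) < cap;
  }

  // Call after each request with its cost; alerts once when the cap is reached.
  record(key: string, costUsd: number): void {
    const total = (this.spent.get(key) ?? 0) + costUsd;
    this.spent.set(key, total);
    const cap = this.caps[key];
    if (cap !== undefined && total >= cap && !this.alerted.has(key)) {
      this.alerted.add(key);
      this.alert(key, total, cap);
    }
  }

  // Call from a daily scheduler to start a fresh budget window.
  resetDay(): void {
    this.spent.clear();
    this.alerted.clear();
  }
}
```

The important property is that a retry loop checks `isAllowed` before every call, so a runaway job stops itself at the cap instead of waiting for a human to notice.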
Results Summary
| Technique | Monthly Savings | Effort |
|---|---|---|
| Smart routing | $1,394 (34%) | 10 min (SDK install) |
| Response caching | $492 (12%) | 5 min (config flag) |
| Prompt optimization | $328 (8%) | 2 hours (audit top prompts) |
| Batch API | $246 (6%) | 1 hour (move non-urgent jobs) |
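| Net total | $1,763 (43%) | ~3.5 hours |
Final bill: $2,337/month (down from $4,100). The per-technique rows sum to more than the net reduction, so read them as overlapping estimates rather than additive savings.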
For Coding Agent Users
If you use Claude Code, Kiro, or Cursor, you're probably burning through tokens without visibility. Our MCP server shows your spend in real-time inside the agent:
npx @lens360/mcp-server
It tracks cost per session, suggests cheaper models for simple prompts, and gives you a get_spend_summary tool right in your editor. No dashboard tab-switching needed.
Start Here
The highest-ROI move is routing. Install the SDK and enable smart routing; routing alone cut our bill by 34%, though your savings will depend on how much of your traffic is simple:
npm install costlens
import { CostLens } from 'costlens';
import OpenAI from 'openai';
const costlens = new CostLens({ apiKey: 'cl_...', smartRouting: true });
const openai = costlens.wrapOpenAI(new OpenAI());
// Done. Every request is now cost-optimized.
See your cost breakdown in 2 minutes. Start free →