The range is 800x
The most important thing about AI inference pricing in 2026 is not any single number. It is the range.
GPT-4.1 input: $25.00 per 1M tokens. Mistral Small 3.1: $0.03 per 1M tokens. That is an 800x spread before you touch orchestration, retries, or tool calls.
Most teams pick a model and never revisit the decision. That is how AI bills become surprises.
Current pricing (April 2026)
| Provider | Model | Input/1M tokens | Output/1M tokens | Source |
|---|---|---|---|---|
| OpenAI | GPT-4.1 | $25.00 | $12.00 | OpenAI Pricing |
| OpenAI | GPT-4.1 mini | $5.00 | $3.20 | OpenAI Pricing |
| OpenAI | GPT-4.1 nano | $1.50 | $0.80 | OpenAI Pricing |
| Anthropic | Claude Sonnet 4 | $3.00 | $15.00 | Anthropic Pricing |
| Anthropic | Claude Opus 4.6 | $5.00 | $25.00 | Anthropic Pricing |
| Google | Gemini 2.5 Pro | $1.25 | $10.00-$15.00 | Google Gemini Pricing |
| Mistral | Small 3.1 | $0.03-$0.20 | $0.11-$0.60 | Mistral public pricing |
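The spread in the table is easiest to feel as code. A minimal sketch of a per-call cost function using the table's list prices (illustrative only; the model keys and the choice of the low end of Mistral's range are this sketch's assumptions, not provider identifiers):

```python
# Cost per 1M tokens (input, output), taken from the table above.
# Keys are informal labels for this sketch, not official model IDs.
PRICES = {
    "gpt-4.1":           (25.00, 12.00),
    "gpt-4.1-nano":      (1.50, 0.80),
    "claude-sonnet-4":   (3.00, 15.00),
    "mistral-small-3.1": (0.03, 0.11),  # low end of the published range
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call at list price."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# The same 10k-in / 1k-out call across the price range:
print(round(call_cost("gpt-4.1", 10_000, 1_000), 4))            # 0.262
print(round(call_cost("mistral-small-3.1", 10_000, 1_000), 6))  # 0.00041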
Two things stand out. First, the range is massive. Second, output is often priced well above input: Claude Opus 4.6 output costs five times its input rate, and Gemini 2.5 Pro's output tops out at twelve times its input rate. If your agent generates long answers, output dominates the bill.
Why agents make it worse
A single prompt is not a production workload. A production agent typically does this per user action:
| Step | What happens | Tokens added |
|---|---|---|
| 1 | Read user request | ~200 |
| 2 | Classify intent | ~500 input + ~100 output |
| 3 | Plan steps | ~800 input + ~300 output |
| 4 | Call tool, read result | ~1,000 input |
| 5 | Re-prompt with tool output | Full context again |
| 6 | Retry on failure | Everything again |
| 7 | Final answer | ~500 output |
One user action becomes 4-7 model calls. Each call includes the growing conversation. Context compounds.
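The compounding is worth simulating. A toy model of the loop above, assuming (as the agent pattern implies) that every call re-sends the full conversation so far; the per-step token counts loosely follow the table and are illustrative:

```python
def agent_input_tokens(steps, base_context=200):
    """Total input tokens billed across an agent loop when each call
    re-sends the growing conversation. `steps` lists the tokens each
    step appends to the context (illustrative numbers)."""
    total_input = 0
    context = base_context
    for added in steps:
        context += added        # new tokens appended this step
        total_input += context  # full context is re-sent on each call
    return total_input

# Roughly: classify, plan, tool result, final answer.
print(agent_input_tokens([600, 1_100, 1_000, 500]))  # 9000
```

Four steps that add only ~3,200 tokens of new content bill roughly 9,000 input tokens, because earlier context is paid for again on every call.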
Anthropic’s pricing docs are explicit: tool use adds system prompt tokens, and tool use requests are priced by total input tokens, output tokens, and server-side tool usage.
Example at GPT-4.1 pricing:
- 4 agent steps, 2 tool calls, 1 retry
- ~8,000 input tokens at $25/1M = $0.20
- ~700 output tokens at $12/1M = $0.008
- Per task: ~$0.21
- At 1,000 tasks/day: ~$210/day = $6,300/month
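The same arithmetic as a script, so the monthly figure can be re-derived when prices or volumes change (rates and token counts are the example's, not measurements):

```python
IN_RATE, OUT_RATE = 25.00, 12.00   # GPT-4.1 list price, $ per 1M tokens

input_tokens, output_tokens = 8_000, 700
per_task = input_tokens / 1e6 * IN_RATE + output_tokens / 1e6 * OUT_RATE
print(round(per_task, 3))                 # 0.208
print(round(per_task * 1_000 * 30))       # ~6,252 per month at 1,000 tasks/day
```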
That is just model cost. Add retrieval, embedding, logging, and infrastructure.
The metric that matters: cost per task
Token cost is an engineering metric. Business cost is a product metric.
| Wrong metric | Right metric |
|---|---|
| Cost per token | Cost per resolved ticket |
| Cost per API call | Cost per qualified lead |
| Monthly model spend | Cost per document processed |
| Average latency | Cost per successful outcome |
A lower token rate is not a better product if it takes 3x as many calls to do the same job. A cheap model that increases retries can cost more than an expensive model that gets it right the first time.
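One hedged way to make that comparison concrete: if failed tasks are retried until they succeed, expected attempts follow a geometric distribution, so expected cost per success is cost per call divided by success rate. The per-call costs and success rates below are hypothetical:

```python
def cost_per_success(cost_per_call: float, success_rate: float) -> float:
    """Expected cost per successful outcome when failures are retried.
    Expected attempts = 1 / success_rate (geometric distribution)."""
    return cost_per_call / success_rate

# Hypothetical: a frontier model at $0.21/call succeeding 95% of the time
# vs. a model a third the price succeeding only 30% of the time.
print(cost_per_success(0.21, 0.95))  # ~0.221 per success
print(cost_per_success(0.07, 0.30))  # ~0.233 per success
```

Under these assumed numbers the "expensive" model is cheaper per resolved task, which is exactly why cost per token is the wrong metric.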
Four ways to cut costs that actually work
1. Caching
Highest-ROI optimization. Under Anthropic’s cache pricing, a cache read costs 10% of the standard input price. Reuse long system prompts, policy text, product docs, and session history.
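A back-of-envelope model of the savings, assuming the 10%-of-input-price cache read from Anthropic's published pricing and ignoring cache-write surcharges (both simplifications of this sketch):

```python
def input_cost_with_cache(tokens: int, rate_per_m: float,
                          cached_fraction: float, hit_rate: float = 1.0) -> float:
    """Input cost in dollars when part of the prompt is served from cache.
    Assumes cache reads cost 10% of the standard input rate; cache-write
    surcharges are deliberately ignored in this sketch."""
    cached = tokens * cached_fraction * hit_rate
    fresh = tokens - cached
    return (fresh * rate_per_m + cached * rate_per_m * 0.10) / 1e6

# 8,000-token prompt at $3/1M with a 6,000-token cached system prompt:
print(round(input_cost_with_cache(8_000, 3.00, cached_fraction=0.75), 6))
print(round(input_cost_with_cache(8_000, 3.00, cached_fraction=0.0), 6))
```

Caching three quarters of the prompt cuts the example's input cost from $0.024 to $0.0078 per call, a ~67% reduction, which is why long static prefixes should always be cached.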
2. Model routing
Not every request needs the best model.
| Task | Model class | Why |
|---|---|---|
| Classification | Nano/Small | Fast, cheap, low risk |
| Extraction | Small/Mid | High precision, short output |
| Drafting | Mid | Balance quality and cost |
| Hard reasoning | Frontier | Only when needed |
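The routing table reduces to a small lookup. A minimal sketch (the tier names and task-type strings are this example's, not a real API; the safe-default choice is the one design decision worth noting):

```python
# Task type -> model tier, mirroring the table above.
ROUTES = {
    "classification":  "nano",
    "extraction":      "small",
    "drafting":        "mid",
    "hard_reasoning":  "frontier",
}

def route(task_type: str) -> str:
    """Pick the model class for a task type. Unknown task types fall
    back to the frontier tier: over-spending beats silently failing."""
    return ROUTES.get(task_type, "frontier")

print(route("classification"))  # nano
print(route("unknown_task"))    # frontier
```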
3. Prompt optimization
Most prompts are bloated. Cut: repeated instructions, full conversation history when only the last turn matters, examples that do not improve accuracy, large schemas for tiny tasks.
4. Retry control
Retries are silent budget killers. Set hard limits: max retries per task, escalation after failure, fallback model, timeout budgets.
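Those limits can all live in one wrapper. A sketch, assuming a hypothetical interface where `call` and `fallback` are zero-argument functions that return a result or raise on failure:

```python
import time

def call_with_budget(call, max_retries=2, fallback=None, timeout_s=30.0):
    """Retry with hard limits: bounded attempts, a wall-clock budget,
    and an optional fallback model once the primary is exhausted.
    `call`/`fallback` are zero-arg callables (hypothetical interface)."""
    deadline = time.monotonic() + timeout_s
    for _attempt in range(max_retries + 1):
        if time.monotonic() >= deadline:
            break                       # timeout budget exceeded
        try:
            return call()
        except Exception:
            continue                    # bounded retry, not infinite
    if fallback is not None and time.monotonic() < deadline:
        return fallback()               # escalate to fallback model
    raise RuntimeError("retry budget exhausted")
```

The point is that every failure path has a ceiling: attempts, time, and one escalation, after which the task fails loudly instead of burning tokens.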
Self-hosting break-even
| Factor | API wins | Self-hosting wins |
|---|---|---|
| Volume | Spiky, unpredictable | High and steady |
| Quality needs | Frontier required | Mid-tier acceptable |
| Team size | Small, no infra team | Has platform engineering |
| Product maturity | Early, still iterating | Stable, known workload |
| Latency | Acceptable | Critical |
Self-hosting includes: GPU rental/depreciation, serving stack, autoscaling, observability, incident handling, model upgrades, security overhead. A cheap GPU that sits idle is expensive.
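A first-order break-even check follows from those factors: self-hosting wins once its fixed monthly costs are spread over enough tasks. The dollar figures below are hypothetical placeholders, and real comparisons must also discount for quality differences between tiers:

```python
def breakeven_tasks_per_month(api_cost_per_task: float,
                              fixed_monthly: float,
                              self_host_marginal: float = 0.0) -> float:
    """Monthly task volume above which self-hosting is cheaper.
    `fixed_monthly` bundles GPU rental/depreciation plus the serving,
    autoscaling, observability, and on-call overhead listed above."""
    return fixed_monthly / (api_cost_per_task - self_host_marginal)

# $0.21/task via API vs. a hypothetical $10,000/month fixed self-host cost:
print(round(breakeven_tasks_per_month(0.21, 10_000)))  # ~47,619 tasks/month
```

Below that volume, the idle GPU in the last sentence is doing exactly what the table warns about.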
What to instrument
Track at minimum per request:
| Field | Why |
|---|---|
| task_type | Know which workflows cost what |
| model | Catch model misrouting |
| prompt_tokens + completion_tokens | Raw cost calculation |
| tool_calls | Agent complexity |
| retries | Silent cost multiplier |
| cache_hit | Optimization effectiveness |
| outcome_status | Connect cost to value |
| latency | User experience impact |
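The table maps directly onto a per-request log record. A sketch using a plain dataclass (field names follow the table; the example values and the JSON-lines destination are assumptions):

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class InferenceRecord:
    """One row per model call, matching the fields in the table above."""
    task_type: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    tool_calls: int
    retries: int
    cache_hit: bool
    outcome_status: str   # e.g. "resolved", "escalated", "failed"
    latency_ms: int

rec = InferenceRecord("support_ticket", "gpt-4.1-mini",
                      5_200, 640, 2, 0, True, "resolved", 1_850)
print(json.dumps(asdict(rec)))   # one JSON line per request, ship to logging
```

With `outcome_status` in every row, cost per resolved ticket is a single group-by away instead of a quarterly archaeology project.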
Without this, you know what the vendor charged you. You do not know what the product cost you.
The cost paradox
Inference is getting cheaper per token. Production spend often rises anyway.
Why? Because cheaper models encourage more usage. Teams add more steps, more context, more agents, more retries. Unit price falls. Total bill rises.
The FinOps Foundation’s AI guidance makes this point: AI needs the same cost discipline as cloud, but with different meters, faster cost shifts, and much weaker native visibility.
If you run AI in production, you need AI FinOps. Not as a spreadsheet exercise. As an engineering discipline.