The range is 800x
The most important thing about AI inference pricing in 2026 is not any single number. It is the range.
GPT-4.1 input: $25.00 per 1M tokens. Mistral Small 3.1: $0.03 per 1M tokens. That is an 800x spread before you touch orchestration, retries, or tool calls.
Most teams pick a model and never revisit the decision. That is how AI bills become surprises.
Current pricing (April 2026)
| Provider | Model | Input/1M tokens | Output/1M tokens | Source |
|---|---|---|---|---|
| OpenAI | GPT-4.1 | $25.00 | $12.00 | OpenAI Pricing |
| OpenAI | GPT-4.1 mini | $5.00 | $3.20 | OpenAI Pricing |
| OpenAI | GPT-4.1 nano | $1.50 | $0.80 | OpenAI Pricing |
| Anthropic | Claude Sonnet 4 | $3.00 | $15.00 | Anthropic Pricing |
| Anthropic | Claude Opus 4.6 | $5.00 | $25.00 | Anthropic Pricing |
| Google | Gemini 2.5 Pro | $1.25 | $10.00-$15.00 | Google Gemini Pricing |
| Mistral | Small 3.1 | $0.03-$0.20 | $0.11-$0.60 | Mistral public pricing |
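The spread in the table is easiest to feel as code. A minimal sketch of a per-call cost function using the table's list prices (illustrative only; the model keys and the choice of the low end of Mistral's range are this sketch's assumptions, not provider identifiers):

```python
# Cost per 1M tokens (input, output), taken from the table above.
# Keys are informal labels for this sketch, not official model IDs.
PRICES = {
    "gpt-4.1":           (25.00, 12.00),
    "gpt-4.1-nano":      (1.50, 0.80),
    "claude-sonnet-4":   (3.00, 15.00),
    "mistral-small-3.1": (0.03, 0.11),  # low end of the published range
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call at list price."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# The same 10k-in / 1k-out call across the price range:
print(round(call_cost("gpt-4.1", 10_000, 1_000), 4))            # 0.262
print(round(call_cost("mistral-small-3.1", 10_000, 1_000), 6))  # 0.00041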
Two things stand out. First, the range is massive. Second, output is often priced well above input: Claude Opus 4.6 output costs five times its input rate, and Gemini 2.5 Pro's output tops out at twelve times its input rate. If your agent generates long answers, output dominates the bill.
Why agents make it worse
A single prompt is not a production workload. A production agent typically does this per user action:
| Step | What happens | Tokens added |
|---|---|---|
| 1 | Read user request | ~200 |
| 2 | Classify intent | ~500 input + ~100 output |
| 3 | Plan steps | ~800 input + ~300 output |
| 4 | Call tool, read result | ~1,000 input |
| 5 | Re-prompt with tool output | Full context again |
| 6 | Retry on failure | Everything again |
| 7 | Final answer | ~500 output |
One user action becomes 4-7 model calls. Each call includes the growing conversation. Context compounds.
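The compounding is worth simulating. A toy model of the loop above, assuming (as the agent pattern implies) that every call re-sends the full conversation so far; the per-step token counts loosely follow the table and are illustrative:

```python
def agent_input_tokens(steps, base_context=200):
    """Total input tokens billed across an agent loop when each call
    re-sends the growing conversation. `steps` lists the tokens each
    step appends to the context (illustrative numbers)."""
    total_input = 0
    context = base_context
    for added in steps:
        context += added        # new tokens appended this step
        total_input += context  # full context is re-sent on each call
    return total_input

# Roughly: classify, plan, tool result, final answer.
print(agent_input_tokens([600, 1_100, 1_000, 500]))  # 9000
```

Four steps that add only ~3,200 tokens of new content bill roughly 9,000 input tokens, because earlier context is paid for again on every call.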
Anthropic’s pricing docs are explicit: tool use adds system prompt tokens, and tool use requests are priced by total input tokens, output tokens, and server-side tool usage.
Example at GPT-4.1 pricing:
- 4 agent steps, 2 tool calls, 1 retry
- ~8,000 input tokens at $25/1M = $0.20
- ~700 output tokens at $12/1M = $0.008
- Per task: ~$0.21
- At 1,000 tasks/day: ~$210/day = $6,300/month
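The same arithmetic as a script, so the monthly figure can be re-derived when prices or volumes change (rates and token counts are the example's, not measurements):

```python
IN_RATE, OUT_RATE = 25.00, 12.00   # GPT-4.1 list price, $ per 1M tokens

input_tokens, output_tokens = 8_000, 700
per_task = input_tokens / 1e6 * IN_RATE + output_tokens / 1e6 * OUT_RATE
print(round(per_task, 3))                 # 0.208
print(round(per_task * 1_000 * 30))       # ~6,252 per month at 1,000 tasks/day
```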
That is just model cost. Add retrieval, embedding, logging, and infrastructure.
The metric that matters: cost per task
Token cost is an engineering metric. Business cost is a product metric.
| Wrong metric | Right metric |
|---|---|
| Cost per token | Cost per resolved ticket |
| Cost per API call | Cost per qualified lead |
| Monthly model spend | Cost per document processed |
| Average latency | Cost per successful outcome |
A lower token rate is not a better product if it takes 3x as many calls to do the same job. A cheap model that increases retries can cost more than an expensive model that gets it right the first time.
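One hedged way to make that comparison concrete: if failed tasks are retried until they succeed, expected attempts follow a geometric distribution, so expected cost per success is cost per call divided by success rate. The per-call costs and success rates below are hypothetical:

```python
def cost_per_success(cost_per_call: float, success_rate: float) -> float:
    """Expected cost per successful outcome when failures are retried.
    Expected attempts = 1 / success_rate (geometric distribution)."""
    return cost_per_call / success_rate

# Hypothetical: a frontier model at $0.21/call succeeding 95% of the time
# vs. a model a third the price succeeding only 30% of the time.
print(cost_per_success(0.21, 0.95))  # ~0.221 per success
print(cost_per_success(0.07, 0.30))  # ~0.233 per success
```

Under these assumed numbers the "expensive" model is cheaper per resolved task, which is exactly why cost per token is the wrong metric.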
Four ways to cut costs that actually work
1. Caching
Highest-ROI optimization. Under Anthropic’s cache pricing, a cache read costs 10% of the standard input price. Reuse long system prompts, policy text, product docs, and session history.
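A back-of-envelope model of the savings, assuming the 10%-of-input-price cache read from Anthropic's published pricing and ignoring cache-write surcharges (both simplifications of this sketch):

```python
def input_cost_with_cache(tokens: int, rate_per_m: float,
                          cached_fraction: float, hit_rate: float = 1.0) -> float:
    """Input cost in dollars when part of the prompt is served from cache.
    Assumes cache reads cost 10% of the standard input rate; cache-write
    surcharges are deliberately ignored in this sketch."""
    cached = tokens * cached_fraction * hit_rate
    fresh = tokens - cached
    return (fresh * rate_per_m + cached * rate_per_m * 0.10) / 1e6

# 8,000-token prompt at $3/1M with a 6,000-token cached system prompt:
print(round(input_cost_with_cache(8_000, 3.00, cached_fraction=0.75), 6))
print(round(input_cost_with_cache(8_000, 3.00, cached_fraction=0.0), 6))
```

Caching three quarters of the prompt cuts the example's input cost from $0.024 to $0.0078 per call, a ~67% reduction, which is why long static prefixes should always be cached.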
2. Model routing
Not every request needs the best model.
| Task | Model class | Why |
|---|---|---|
| Classification | Nano/Small | Fast, cheap, low risk |
| Extraction | Small/Mid | High precision, short output |
| Drafting | Mid | Balance quality and cost |
| Hard reasoning | Frontier | Only when needed |
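The routing table reduces to a small lookup. A minimal sketch (the tier names and task-type strings are this example's, not a real API; the safe-default choice is the one design decision worth noting):

```python
# Task type -> model tier, mirroring the table above.
ROUTES = {
    "classification":  "nano",
    "extraction":      "small",
    "drafting":        "mid",
    "hard_reasoning":  "frontier",
}

def route(task_type: str) -> str:
    """Pick the model class for a task type. Unknown task types fall
    back to the frontier tier: over-spending beats silently failing."""
    return ROUTES.get(task_type, "frontier")

print(route("classification"))  # nano
print(route("unknown_task"))    # frontier
```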
3. Prompt optimization
Most prompts are bloated. Cut: repeated instructions, full conversation history when only the last turn matters, examples that do not improve accuracy, large schemas for tiny tasks.
4. Retry control
Retries are silent budget killers. Set hard limits: max retries per task, escalation after failure, fallback model, timeout budgets.
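Those limits can all live in one wrapper. A sketch, assuming a hypothetical interface where `call` and `fallback` are zero-argument functions that return a result or raise on failure:

```python
import time

def call_with_budget(call, max_retries=2, fallback=None, timeout_s=30.0):
    """Retry with hard limits: bounded attempts, a wall-clock budget,
    and an optional fallback model once the primary is exhausted.
    `call`/`fallback` are zero-arg callables (hypothetical interface)."""
    deadline = time.monotonic() + timeout_s
    for _attempt in range(max_retries + 1):
        if time.monotonic() >= deadline:
            break                       # timeout budget exceeded
        try:
            return call()
        except Exception:
            continue                    # bounded retry, not infinite
    if fallback is not None and time.monotonic() < deadline:
        return fallback()               # escalate to fallback model
    raise RuntimeError("retry budget exhausted")
```

The point is that every failure path has a ceiling: attempts, time, and one escalation, after which the task fails loudly instead of burning tokens.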
Self-hosting break-even
| Factor | API wins | Self-hosting wins |
|---|---|---|
| Volume | Spiky, unpredictable | High and steady |
| Quality needs | Frontier required | Mid-tier acceptable |
| Team size | Small, no infra team | Has platform engineering |
| Product maturity | Early, still iterating | Stable, known workload |
| Latency | Acceptable | Critical |
Self-hosting includes: GPU rental/depreciation, serving stack, autoscaling, observability, incident handling, model upgrades, security overhead. A cheap GPU that sits idle is expensive.
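A first-order break-even check follows from those factors: self-hosting wins once its fixed monthly costs are spread over enough tasks. The dollar figures below are hypothetical placeholders, and real comparisons must also discount for quality differences between tiers:

```python
def breakeven_tasks_per_month(api_cost_per_task: float,
                              fixed_monthly: float,
                              self_host_marginal: float = 0.0) -> float:
    """Monthly task volume above which self-hosting is cheaper.
    `fixed_monthly` bundles GPU rental/depreciation plus the serving,
    autoscaling, observability, and on-call overhead listed above."""
    return fixed_monthly / (api_cost_per_task - self_host_marginal)

# $0.21/task via API vs. a hypothetical $10,000/month fixed self-host cost:
print(round(breakeven_tasks_per_month(0.21, 10_000)))  # ~47,619 tasks/month
```

Below that volume, the idle GPU in the last sentence is doing exactly what the table warns about.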
What to instrument
Track at minimum per request:
| Field | Why |
|---|---|
| task_type | Know which workflows cost what |
| model | Catch model misrouting |
| prompt_tokens + completion_tokens | Raw cost calculation |
| tool_calls | Agent complexity |
| retries | Silent cost multiplier |
| cache_hit | Optimization effectiveness |
| outcome_status | Connect cost to value |
| latency | User experience impact |
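The table maps directly onto a per-request log record. A sketch using a plain dataclass (field names follow the table; the example values and the JSON-lines destination are assumptions):

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class InferenceRecord:
    """One row per model call, matching the fields in the table above."""
    task_type: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    tool_calls: int
    retries: int
    cache_hit: bool
    outcome_status: str   # e.g. "resolved", "escalated", "failed"
    latency_ms: int

rec = InferenceRecord("support_ticket", "gpt-4.1-mini",
                      5_200, 640, 2, 0, True, "resolved", 1_850)
print(json.dumps(asdict(rec)))   # one JSON line per request, ship to logging
```

With `outcome_status` in every row, cost per resolved ticket is a single group-by away instead of a quarterly archaeology project.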
Without this, you know what the vendor charged you. You do not know what the product cost you.
The cost paradox
Inference is getting cheaper per token. Production spend often rises anyway.
Why? Because cheaper models encourage more usage. Teams add more steps, more context, more agents, more retries. Unit price falls. Total bill rises.
The FinOps Foundation’s AI guidance makes this point: AI needs the same cost discipline as cloud, but with different meters, faster cost shifts, and much weaker native visibility.
If you run AI in production, you need AI FinOps. Not as a spreadsheet exercise. As an engineering discipline.