The range is 800x

The most important thing about AI inference pricing in 2026 is not any single number. It is the range.

GPT-4.1 input: $25.00 per 1M tokens. Mistral Small 3.1: $0.03 per 1M tokens. That is an 800x spread before you touch orchestration, retries, or tool calls.

Most teams pick a model and never revisit the decision. That is how AI bills become surprises.

Current pricing (April 2026)

| Provider | Model | Input /1M tokens | Output /1M tokens | Source |
|---|---|---|---|---|
| OpenAI | GPT-4.1 | $25.00 | $12.00 | OpenAI Pricing |
| OpenAI | GPT-4.1 mini | $5.00 | $3.20 | OpenAI Pricing |
| OpenAI | GPT-4.1 nano | $1.50 | $0.80 | OpenAI Pricing |
| Anthropic | Claude Sonnet 4 | $3.00 | $15.00 | Anthropic Pricing |
| Anthropic | Claude Opus 4.6 | $5.00 | $25.00 | Anthropic Pricing |
| Google | Gemini 2.5 Pro | $1.25 | $10.00-$15.00 | Google Gemini Pricing |
| Mistral | Small 3.1 | $0.03-$0.20 | $0.11-$0.60 | Mistral public pricing |

Two things stand out. First, the range is massive. Second, output is often much more expensive than input. If your agent generates long answers, output dominates the bill.
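To make the output-dominance point concrete, here is a back-of-envelope calculation at the Claude Sonnet 4 rates from the table ($3.00 input, $15.00 output per 1M tokens). The token counts are illustrative assumptions, not measurements:

```python
# Rates from the pricing table above, in $ per token.
INPUT_RATE = 3.00 / 1_000_000    # Claude Sonnet 4 input
OUTPUT_RATE = 15.00 / 1_000_000  # Claude Sonnet 4 output

prompt_tokens = 1_000      # assumption: a modest prompt
completion_tokens = 2_000  # assumption: a long generated answer

input_cost = prompt_tokens * INPUT_RATE        # $0.003
output_cost = completion_tokens * OUTPUT_RATE  # $0.030

# Output is 2x the tokens but 10x the cost at these rates.
total = input_cost + output_cost
```

Twice the tokens, ten times the cost: the 5x rate gap compounds with answer length.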

Why agents make it worse

A single prompt is not a production workload. A production agent typically does this per user action:

| Step | What happens | Tokens added |
|---|---|---|
| 1 | Read user request | ~200 |
| 2 | Classify intent | ~500 input + ~100 output |
| 3 | Plan steps | ~800 input + ~300 output |
| 4 | Call tool, read result | ~1,000 input |
| 5 | Re-prompt with tool output | Full context again |
| 6 | Retry on failure | Everything again |
| 7 | Final answer | ~500 output |

One user action becomes 4-7 model calls. Each call includes the growing conversation. Context compounds.
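A minimal simulation of that compounding, using the rough per-step figures from the table (retries omitted for brevity). The key detail is that every call re-bills the entire accumulated context:

```python
# Each tuple: (step name, fresh input tokens, output tokens).
# Figures are the rough estimates from the table above.
steps = [
    ("classify intent", 500, 100),
    ("plan steps", 800, 300),
    ("tool result", 1_000, 0),
    ("final answer", 0, 500),
]

context = 200  # the original user request rides along in every call
total_input = 0
total_output = 0

for name, fresh_in, out in steps:
    context += fresh_in
    total_input += context  # the whole context is billed again on every call
    total_output += out
    context += out          # model output joins the context for the next call

# total_input ends up around 8,100 tokens for ~2,500 tokens of "new" input.
```

Roughly 2,500 tokens of genuinely new input turn into ~8,100 billed input tokens once the context is re-sent on every call.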

Anthropic’s pricing docs are explicit: tool use adds system prompt tokens, and tool use requests are priced by total input tokens, output tokens, and server-side tool usage.

Example at GPT-4.1 pricing:

  • 4 agent steps, 2 tool calls, 1 retry
  • ~8,000 input tokens at $25/1M = $0.20
  • ~700 output tokens at $12/1M = $0.008
  • Per task: ~$0.21
  • At 1,000 tasks/day: ~$210/day = $6,300/month

That is just model cost. Add retrieval, embedding, logging, and infrastructure.
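The arithmetic in the bullets above, as a sketch you can re-run with your own token counts and rates:

```python
def task_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Model cost of one task; rates are $ per 1M tokens."""
    return input_tokens * in_rate / 1e6 + output_tokens * out_rate / 1e6

# GPT-4.1 rates from the pricing table: $25 input, $12 output per 1M tokens.
per_task = task_cost(8_000, 700, in_rate=25.00, out_rate=12.00)  # ~$0.21
per_day = per_task * 1_000   # 1,000 tasks/day -> ~$210
per_month = per_day * 30     # -> ~$6,300
```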

The metric that matters: cost per task

Token cost is an engineering metric. Business cost is a product metric.

| Wrong metric | Right metric |
|---|---|
| Cost per token | Cost per resolved ticket |
| Cost per API call | Cost per qualified lead |
| Monthly model spend | Cost per document processed |
| Average latency | Cost per successful outcome |

A lower token rate is not a better product if it takes 3x as many calls to do the same job. A cheap model that increases retries can cost more than an expensive model that gets it right the first time.
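That claim can be made concrete. The per-call costs and success rates below are illustrative assumptions, not benchmarks; the formula is the expected cost per successful outcome under capped retries:

```python
def cost_per_success(cost_per_call, success_rate, max_attempts=3):
    """Expected cost per successful outcome with up to max_attempts tries."""
    expected_cost = 0.0
    p_reached = 1.0  # probability we are still retrying at this attempt
    for _ in range(max_attempts):
        expected_cost += p_reached * cost_per_call
        p_reached *= (1 - success_rate)
    p_success = 1 - p_reached  # probability some attempt succeeded
    return expected_cost / p_success

# Hypothetical numbers: a 5x-cheaper model that succeeds far less often.
cheap = cost_per_success(cost_per_call=0.01, success_rate=0.15)     # ~$0.067
frontier = cost_per_success(cost_per_call=0.05, success_rate=0.95)  # ~$0.053
```

At these (assumed) success rates, the "cheap" model costs more per successful outcome than the model five times its price.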

Four ways to cut costs that actually work

1. Caching

Highest-ROI optimization. Anthropic’s cache pricing: a cache hit is 10% of standard input price. Reuse long system prompts, policy text, product docs, and session history.
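A sketch of the blended input rate under caching, assuming cache reads bill at 10% of the standard input price (per the pricing note above) and ignoring cache-write surcharges for simplicity. The hit ratio is an assumption you would measure:

```python
def effective_input_rate(base_rate, cache_hit_ratio, cache_read_discount=0.10):
    """Blended $/1M-token input rate when a share of input tokens hit the cache.

    cache_hit_ratio: fraction of input tokens served from cache (measured).
    cache_read_discount: cache reads billed at this fraction of the base rate;
    cache *writes* cost extra and are omitted here for simplicity.
    """
    miss_cost = base_rate * (1 - cache_hit_ratio)
    hit_cost = base_rate * cache_read_discount * cache_hit_ratio
    return miss_cost + hit_cost

# Claude Sonnet 4 input at $3.00/1M with 80% of input tokens cached:
blended = effective_input_rate(3.00, cache_hit_ratio=0.80)  # -> $0.84/1M
```

An 80% hit ratio cuts the effective input rate by 72% at these numbers, which is why long static prefixes (system prompts, policy text) are the first thing to cache.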

2. Model routing

Not every request needs the best model.

| Task | Model class | Why |
|---|---|---|
| Classification | Nano/Small | Fast, cheap, low risk |
| Extraction | Small/Mid | High precision, short output |
| Drafting | Mid | Balance quality and cost |
| Hard reasoning | Frontier | Only when needed |
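The routing table above reduces to a trivial dispatch map. The tier names are placeholders, not model recommendations; the important design choice is the default for unrouted traffic:

```python
# Hypothetical task -> tier mapping; substitute your own models.
ROUTES = {
    "classification": "nano",      # fast, cheap, low risk
    "extraction": "small",         # high precision, short output
    "drafting": "mid",             # balance quality and cost
    "hard_reasoning": "frontier",  # only when needed
}

def route(task_type: str) -> str:
    """Pick a model tier; unknown tasks fall through to the safe default."""
    return ROUTES.get(task_type, "frontier")
```

Defaulting unknown tasks to the frontier tier trades cost for safety; the opposite default trades safety for cost. Pick one deliberately.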

3. Prompt optimization

Most prompts are bloated. Cut: repeated instructions, full conversation history when only the last turn matters, examples that do not improve accuracy, large schemas for tiny tasks.
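One of those cuts, keeping only the last turn of history, is a few lines. The message shape follows the common role/content chat convention; whether the last turn actually suffices is a per-workflow judgment:

```python
def trim_history(messages, keep_last_turns=1):
    """Keep the system prompt plus only the most recent conversation turns.

    messages: list of {"role": ..., "content": ...} dicts in chat order.
    A "turn" here is a user/assistant message pair; adjust per workflow.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # keep_last_turns pairs -> the last 2 * keep_last_turns messages
    return system + rest[-2 * keep_last_turns:]
```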

4. Retry control

Retries are silent budget killers. Set hard limits: max retries per task, escalation after failure, fallback model, timeout budgets.
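All four limits fit in one small wrapper. `call_model` here is a stand-in for your API client, not a real library call; the hard caps are the point:

```python
import time

def call_with_budget(call_model, prompt, max_retries=2,
                     timeout_budget_s=30.0, fallback_model=None):
    """Hard-capped retries with a wall-clock budget and optional fallback.

    call_model(prompt, model=None) is a placeholder for your own client.
    """
    deadline = time.monotonic() + timeout_budget_s
    last_error = None
    for _ in range(1 + max_retries):      # max retries per task
        if time.monotonic() >= deadline:  # timeout budget
            break
        try:
            return call_model(prompt)
        except Exception as err:
            last_error = err
    if fallback_model is not None:        # escalation after failure
        return call_model(prompt, model=fallback_model)
    raise RuntimeError("retry budget exhausted") from last_error
```

Without the caps, a transient upstream error can multiply a task's cost by the retry count, silently.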

Self-hosting break-even

| Factor | API wins | Self-hosting wins |
|---|---|---|
| Volume | Spiky, unpredictable | High and steady |
| Quality needs | Frontier required | Mid-tier acceptable |
| Team size | Small, no infra team | Has platform engineering |
| Product maturity | Early, still iterating | Stable, known workload |
| Latency | Acceptable | Critical |

Self-hosting includes: GPU rental/depreciation, serving stack, autoscaling, observability, incident handling, model upgrades, security overhead. A cheap GPU that sits idle is expensive.
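A break-even sketch under loud assumptions: a flat hourly GPU cost, a blended API rate, and a single node that can actually serve the volume. All numbers are hypothetical, and the engineering overheads listed above are deliberately excluded, which biases this toward self-hosting:

```python
def api_cost_per_month(tokens_per_month, rate_per_million):
    """API spend at a blended $/1M-token rate (hypothetical)."""
    return tokens_per_month / 1e6 * rate_per_million

def selfhost_cost_per_month(gpu_dollars_per_hour, hours=730):
    """GPU rental only; ignores the (usually larger) engineering cost."""
    return gpu_dollars_per_hour * hours

# Assumptions: 2B tokens/month, a mid-tier API at $0.50/1M blended,
# one GPU node at $2.50/hour that can serve that volume.
api = api_cost_per_month(2_000_000_000, 0.50)  # $1,000/month
gpu = selfhost_cost_per_month(2.50)            # $1,825/month

# Tokens/month at which raw GPU rental matches the API bill:
break_even_tokens = gpu / 0.50 * 1e6           # ~3.65B tokens/month
```

At these assumed numbers the API wins below roughly 3.65B tokens a month, and that is before paying anyone to run the serving stack.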

What to instrument

Track at minimum per request:

| Field | Why |
|---|---|
| task_type | Know which workflows cost what |
| model | Catch model misrouting |
| prompt_tokens + completion_tokens | Raw cost calculation |
| tool_calls | Agent complexity |
| retries | Silent cost multiplier |
| cache_hit | Optimization effectiveness |
| outcome_status | Connect cost to value |
| latency | User experience impact |
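Those fields map directly onto a per-request log record. A dataclass sketch, with field names mirroring the table (the schema is a suggestion, not a standard):

```python
from dataclasses import dataclass, asdict

@dataclass
class RequestRecord:
    """One row per model call; emit to your logging/analytics pipeline."""
    task_type: str          # know which workflows cost what
    model: str              # catch model misrouting
    prompt_tokens: int      # raw cost calculation
    completion_tokens: int  # raw cost calculation
    tool_calls: int         # agent complexity
    retries: int            # silent cost multiplier
    cache_hit: bool         # optimization effectiveness
    outcome_status: str     # connect cost to value
    latency_ms: float       # user experience impact

rec = RequestRecord("ticket_triage", "small", 1200, 150, 1, 0,
                    True, "resolved", 840.0)
row = asdict(rec)  # plain dict, ready for structured logging
```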

Without this, you know what the vendor charged you. You do not know what the product cost you.

The cost paradox

Inference is getting cheaper per token. Production spend often rises anyway.

Why? Because cheaper models encourage more usage. Teams add more steps, more context, more agents, more retries. Unit price falls. Total bill rises.

The FinOps Foundation’s AI guidance makes this point: AI needs the same cost discipline as cloud, but with different meters, faster cost shifts, and much weaker native visibility.

If you run AI in production, you need AI FinOps. Not as a spreadsheet exercise. As an engineering discipline.