The number most teams missed
The headline number is not that AI still hallucinates.
The headline number is that, on Vectara’s current hallucination leaderboard (updated March 20, 2026), the best-performing frontier models now score below 4% on summarization factual consistency. That is a harder benchmark than the one that produced the widely cited sub-1% numbers from 2025.
Here is the current state:
| Model | Hallucination rate | Benchmark |
|---|---|---|
| antgroup/finix_s1_32b | 1.8% | Current (7,700 articles, HHEM-2.3) |
| openai/gpt-5.4-nano | 3.1% | Current |
| google/gemini-2.5-flash-lite | 3.3% | Current |
| openai/gpt-4.1 | 5.6% | Current |
| google/gemini-2.5-flash | 7.8% | Current |
| anthropic/claude-sonnet-4 | 10.3% | Current |
Source: Vectara Hallucination Leaderboard, GitHub
That matters because many internal AI risk reviews are still written as if the default failure rate sits somewhere between 15% and 25%. It does not. The best frontier models are now an order of magnitude better than where the field started.
If your organization still treats all LLMs as equally unreliable, you are probably making bad build-versus-buy decisions, overestimating operational risk in some places, and underestimating it in others.
A note on the widely cited 0.7% number
In late 2025, Gemini 2.0 Flash scored 0.7% on Vectara’s original leaderboard, which used 1,000 short documents from the CNN/Daily Mail corpus. That number was real, but the benchmark was easier.
Vectara refreshed the leaderboard in November 2025, replacing the 1,000-document dataset with over 7,700 articles spanning technology, science, medicine, law, sports, business, and education. They also upgraded the evaluator from HHEM-2.1 to HHEM-2.3. The reason: models had improved so much that the old dataset showed clustering at the top and stopped differentiating meaningfully.
On the new, harder benchmark, rates roughly doubled or tripled for the same models. GPT-4.1 went from 2.0% (old) to 5.6% (new). Claude Sonnet 4 went from 4.5% to 10.3%.
That means the 0.7% and the current 3-5% numbers are not directly comparable. The improvement is real, but the trajectory is better understood as “from 20%+ to low single digits on a harder benchmark,” not “from 20% to sub-1%.”
This distinction matters because citing the 0.7% without context is exactly the kind of imprecise risk communication this article argues against.
How hallucinations are actually measured
A lot of debate around hallucinations is messy because people measure different things.
Vectara is explicit about their methodology. Their leaderboard does not measure general intelligence or broad factual accuracy. It measures summarization factual consistency:
| Step | What happens |
|---|---|
| 1 | The model receives a source document |
| 2 | It is asked to summarize only the facts in that document |
| 3 | The output runs through HHEM (Hallucination Evaluation Model) |
| 4 | HHEM scores each summary 0 to 1. Values below 0.5 count as hallucinated |
Source: Vectara blog
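Mechanically, the thresholding in step 4 reduces to a simple rule: count a summary as hallucinated when its consistency score falls below 0.5, then report the flagged fraction. The sketch below assumes scores already produced by an HHEM-style evaluator; the `leaderboard_rate` helper and the sample scores are illustrative, not Vectara's code.

```python
HALLUCINATION_THRESHOLD = 0.5  # per Vectara: scores below 0.5 count as hallucinated

def leaderboard_rate(scores):
    """Fraction of summaries flagged as hallucinated.

    `scores` are consistency scores in [0, 1]; the real pipeline
    obtains them from the HHEM-2.3 evaluator model.
    """
    flagged = sum(1 for s in scores if s < HALLUCINATION_THRESHOLD)
    return flagged / len(scores)

# Illustrative scores, not real benchmark output: 2 of 5 fall below 0.5
scores = [0.97, 0.88, 0.42, 0.91, 0.30]
print(f"{leaderboard_rate(scores):.0%}")  # → 40%
```

The binary threshold is worth noting: a summary scoring 0.49 and one scoring 0.05 both count as a single hallucination, so the leaderboard rate says nothing about severity.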
A model can be excellent at summarization and still be bad at free-form legal advice, medical triage, or financial reasoning. Summarization benchmarks are useful because they are controlled. They are not the same as production.
Other benchmark families take different approaches. TRUE, TrueTeacher, AlignScore, MiniCheck, RAGTruth, and FaithBench are all part of the broader evaluation ecosystem for hallucination detection. Academic studies in legal and medical domains, including Stanford's, often rely on human expert annotation instead, because a domain error can be legally or clinically significant.
So when someone says “the hallucination rate is X%,” the first question should be: on what task, with what evaluator, and on what dataset? That is not a gotcha. It is the whole point.
What RAG changes, and what it does not
Retrieval-augmented generation helps because it gives the model actual source material to work from, rather than generating from parametric memory alone.
Vectara’s leaderboard is relevant here because it tests summarization over provided documents. That makes it a reasonable proxy for RAG-like enterprise workflows where the model distills source context rather than inventing.
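That task setup is straightforward to reproduce in internal evaluations: constrain the model to a provided passage before asking for a summary. A minimal sketch, where the prompt wording is illustrative rather than Vectara's actual prompt:

```python
def grounded_summary_prompt(document: str) -> str:
    """Build a summarization prompt that restricts the model to the
    provided source text, mirroring the leaderboard's task setup.
    The exact wording is illustrative, not Vectara's prompt."""
    return (
        "Summarize the following passage using only facts stated in it. "
        "Do not add outside knowledge. If a fact is not in the passage, "
        "leave it out.\n\n"
        f"<passage>\n{document}\n</passage>"
    )

prompt = grounded_summary_prompt("Acme Corp reported Q3 revenue of $12M.")
```

Grounding the prompt this way is what makes summarization benchmarks a proxy for RAG pipelines: in both cases the model's job is distillation of supplied context, not recall from parametric memory.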
A 2025 JMIR Cancer study (62 cancer-related questions, reviewed by 2 clinicians) tested GPT-4 and GPT-3.5 with and without retrieval:
| Setup | Hallucination rate |
|---|---|
| GPT-4 with cancer-specific reference sources | 0% |
| GPT-3.5 with cancer-specific reference sources | 6% |
| GPT-4 with Google search results | 6% |
| GPT-3.5 with Google search results | 10% |
| Conventional chatbots without retrieval | ~40% |
The same study found that RAG-based systems sometimes refused to answer when information was missing, which reduced hallucination but also reduced response rate.
That is the tradeoff most internal discussions miss. RAG can reduce hallucinations, but it often does so by constraining the model to the retrieved context, making it more likely to say “I don’t know,” and lowering coverage on edge cases. That is still a good trade in many enterprise settings. It is not a good trade if your product promise requires every question to get an answer.
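One way to keep that tradeoff visible in your own evaluations is to report a combined figure rather than hallucination rate alone. This is a toy metric, not a standard one, and it assumes the hallucination rate is measured over answered queries only:

```python
def answered_and_correct(hallucination_rate: float, refusal_rate: float) -> float:
    """Share of all queries that get an answer AND the answer is grounded.
    Assumes hallucination_rate is measured over answered queries only."""
    answered = 1.0 - refusal_rate
    return answered * (1.0 - hallucination_rate)

# A low hallucination rate with heavy refusal can serve fewer users than
# a slightly higher rate with broad coverage (illustrative numbers):
strict = answered_and_correct(hallucination_rate=0.02, refusal_rate=0.30)  # 0.686
loose = answered_and_correct(hallucination_rate=0.06, refusal_rate=0.05)   # 0.893
```

Whether `strict` or `loose` is the better system depends entirely on the product promise, which is exactly why a single hallucination number is not enough.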
The domain gap is the real story
The biggest mistake in many risk reviews is treating all use cases as if they live on the same curve. They do not.
| Domain | Hallucination rate | Source |
|---|---|---|
| General summarization (frontier models) | 3-6% | Vectara leaderboard, March 2026 |
| General summarization (mid-tier models) | 7-12% | Vectara leaderboard, March 2026 |
| Legal AI research tools | 17-33% | Stanford HAI, study by Magesh et al. |
| Medical with strong RAG | 0-6% | JMIR Cancer, 2025 |
| Medical without retrieval | ~40% | JMIR Cancer, 2025 |
Legal
Stanford HAI’s legal benchmark tested three commercial legal AI products (Lexis+ AI, Westlaw AI-Assisted Research, Ask Practical Law AI). Each hallucinated on 17% to 33% of benchmark queries, roughly 1 in 6 to 1 in 3. The study also found that vendors claiming “hallucination-free” status could not substantiate those claims in closed systems.
Source: Magesh et al., arXiv:2405.20362
Medical
The JMIR Cancer study showed how quickly rates move depending on reference quality. With reliable cancer-specific sources, hallucination dropped to zero for GPT-4. With weaker retrieval, rates rose. Conventional chatbot behavior was much worse. Medical use cases are therefore not “hallucination solved” just because a vendor added RAG.
Finance
Finance is harder to pin to a single rate because benchmarks vary by task, but it behaves like legal and medical in one important respect: errors are expensive, and the cost of a wrong answer is often asymmetric. A model that is “mostly right” on general content can still be unacceptable for disclosures, filings, trade research, or customer advice.
Why organizations are still stuck on outdated numbers
| Reason | What happens |
|---|---|
| Risk policies age slower than models | First AI guidance was written when hallucination stories were everywhere. Model quality moved faster than policy review cycles |
| Teams confuse model capability with product risk | A low benchmark does not automatically mean low product risk. Risk also depends on retrieval quality, prompt design, citation handling, UI, and human review |
| People remember the worst demo | A single embarrassing hallucination dominates decision-making, even if the workload is now 95% source-grounded summarization |
| Vendors overclaim, buyers overreact | “Hallucination-free” claims cannot be verified in closed systems. The correct response is measured evaluation, not cynicism |
What this means for decision makers
The practical response is not “trust AI now.” The practical response is “update the assessment to match the task.”
Risk tiers by use case
| Risk tier | Examples | Typical hallucination exposure |
|---|---|---|
| Low-risk drafting support | Internal summaries, note-taking, first-pass rewriting, search condensation | 3-6% on frontier models, further reducible with RAG |
| Moderate-risk assistive workflows | Customer support, sales ops, internal knowledge assistants, code review suggestions | Depends heavily on retrieval quality and human review |
| High-risk regulated workflows | Legal research, medical guidance, financial advice, policy decisions, citation-sensitive content | 17-33% for legal tools, variable for medical/finance. Human review mandatory |
A model that is acceptable in the first tier may be unacceptable in the third.
Evaluate the whole system, not just the model
Your actual error rate depends on which model you use, whether retrieval is high quality, whether the model can cite sources, whether the UI separates sources from generated text, whether humans approve final outputs, and whether the system allows refusal.
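A back-of-the-envelope way to combine those factors is to compose the component rates. The sketch below assumes, unrealistically, that model errors, retrieval misses, and review catches are independent events; all rates are illustrative:

```python
def effective_error_rate(model_rate: float,
                         retrieval_miss_rate: float,
                         review_catch_rate: float) -> float:
    """Rough system-level error estimate: an output is wrong if the model
    hallucinates OR retrieval fed it the wrong context, and the error
    survives human review. Assumes the three events are independent,
    which real systems rarely guarantee."""
    generated_error = 1 - (1 - model_rate) * (1 - retrieval_miss_rate)
    return generated_error * (1 - review_catch_rate)

# Illustrative: 3% model rate, 10% retrieval misses, reviewers catch 80%
rate = effective_error_rate(0.03, 0.10, 0.80)
print(f"{rate:.1%}")  # → 2.5%
```

Even with crude inputs, the exercise makes one thing clear: retrieval quality and review coverage move the system-level rate as much as the model choice does.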
Keep the benchmark honest
When a vendor shows one number, ask:
| Question | Why it matters |
|---|---|
| What task is this? | Summarization, QA, and free-form generation have different failure modes |
| What is the evaluator? | HHEM, human annotation, and self-evaluation produce different results |
| How large is the dataset? | 1,000 documents vs 7,700 changes the difficulty |
| What is the refusal rate? | Low hallucination + high refusal = less useful than the number suggests |
| Does the benchmark reflect my domain? | General summarization rates do not predict legal or medical performance |
Where hallucinations still kill the deal
| Scenario | Why it is still a dealbreaker |
|---|---|
| Citation-sensitive legal work | Hallucinated citations can be worse than no answer. 17-33% error rate on tested legal tools |
| Medical guidance without tight retrieval | Low average rate can hide catastrophic tail risk. Misinformation does not need to be frequent to be dangerous |
| Financial recommendations and compliance | “Mostly correct” is insufficient for disclosures, filings, or customer advice |
| Autonomous agents with tools | Wrong answer + ability to act = writes to ticketing system, sends email, triggers workflow |
| Any workflow where users will not verify | If the end user is unlikely to check the output, the effective hallucination rate is higher than the benchmark |
The real conclusion
The lesson is not that hallucinations are gone. The lesson is that many organizations are still budgeting for a 2023 failure mode in a 2026 model landscape.
A mature AI risk assessment should now distinguish between:
| Factor | Why it matters |
|---|---|
| Model quality | Best frontier models are at 3-6% on summarization, not 20% |
| Task type | Summarization, legal QA, medical advice, and free-form generation have wildly different rates |
| Retrieval quality | Strong RAG can push rates to 0-6%. Weak retrieval is barely better than no retrieval |
| Refusal behavior | A model that refuses uncertain answers is safer but less useful |
| Domain sensitivity | Legal is at 17-33%. Medical without retrieval is at ~40%. General summarization is at 3-6% |
| User verification | If nobody checks the output, the effective rate is higher |
If your current policy still says “LLMs hallucinate too much” without naming the task, it is out of date. And if your procurement process is still using 2023 assumptions, you are not being cautious. You are being imprecise.
The data has moved. The risk assessment should too.
Sources
| Source | URL |
|---|---|
| Vectara Hallucination Leaderboard (current, March 2026) | github.com/vectara |
| Vectara blog, “Next Generation Hallucination Leaderboard” (Nov 2025) | vectara.com |
| Stanford HAI, “AI on Trial: Legal Models Hallucinate in 1 out of 6 (or More)” | hai.stanford.edu |
| Magesh et al., “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools” | arxiv.org |
| JMIR Cancer 2025, RAG and hallucination in cancer chatbots | pubmed.ncbi.nlm.nih.gov |