The number most teams missed
The headline number is not that AI still hallucinates.
The headline number is that, on Vectara’s current hallucination leaderboard (updated March 20, 2026), the best-performing frontier models now score below 4% on summarization factual consistency. That is a harder benchmark than the one that produced the widely cited sub-1% numbers from 2025.
Here is the current state:
| Model | Hallucination rate | Benchmark |
|---|---|---|
| antgroup/finix_s1_32b | 1.8% | Current (7,700 articles, HHEM-2.3) |
| openai/gpt-5.4-nano | 3.1% | Current |
| google/gemini-2.5-flash-lite | 3.3% | Current |
| openai/gpt-4.1 | 5.6% | Current |
| google/gemini-2.5-flash | 7.8% | Current |
| anthropic/claude-sonnet-4 | 10.3% | Current |
Source: Vectara Hallucination Leaderboard, GitHub
That matters because many internal AI risk reviews are still written as if the default failure rate sits somewhere between 15% and 25%. It does not. The best frontier models are now an order of magnitude better than where the field started.
If your organization still treats all LLMs as equally unreliable, you are probably making bad build-versus-buy decisions, overestimating operational risk in some places, and underestimating it in others.
A note on the widely cited 0.7% number
In late 2025, Gemini 2.0 Flash scored 0.7% on Vectara’s original leaderboard, which used 1,000 short documents from the CNN/Daily Mail corpus. That number was real, but the benchmark was easier.
Vectara refreshed the leaderboard in November 2025, replacing the 1,000-document dataset with over 7,700 articles spanning technology, science, medicine, law, sports, business, and education. They also upgraded the evaluator from HHEM-2.1 to HHEM-2.3. The reason: models had improved so much that the old dataset showed clustering at the top and stopped differentiating meaningfully.
On the new, harder benchmark, rates roughly doubled or tripled for the same models. GPT-4.1 went from 2.0% (old) to 5.6% (new). Claude Sonnet 4 went from 4.5% to 10.3%.
That means the 0.7% and the current 3-5% numbers are not directly comparable. The improvement is real, but the trajectory is better understood as “from 20%+ to low single digits on a harder benchmark,” not “from 20% to sub-1%.”
This distinction matters because citing the 0.7% without context is exactly the kind of imprecise risk communication this article argues against.
How hallucinations are actually measured
A lot of debate around hallucinations is messy because people measure different things.
Vectara is explicit about their methodology. Their leaderboard does not measure general intelligence or broad factual accuracy. It measures summarization factual consistency:
| Step | What happens |
|---|---|
| 1 | The model receives a source document |
| 2 | It is asked to summarize only the facts in that document |
| 3 | The output runs through HHEM (Hallucination Evaluation Model) |
| 4 | HHEM scores each summary 0 to 1. Values below 0.5 count as hallucinated |
Source: Vectara blog
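Mechanically, the thresholding in step 4 reduces to a simple rule: count a summary as hallucinated when its consistency score falls below 0.5, then report the flagged fraction. The sketch below assumes scores already produced by an HHEM-style evaluator; the `leaderboard_rate` helper and the sample scores are illustrative, not Vectara's code.

```python
HALLUCINATION_THRESHOLD = 0.5  # per Vectara: scores below 0.5 count as hallucinated

def leaderboard_rate(scores):
    """Fraction of summaries flagged as hallucinated.

    `scores` are consistency scores in [0, 1]; the real pipeline
    obtains them from the HHEM-2.3 evaluator model.
    """
    flagged = sum(1 for s in scores if s < HALLUCINATION_THRESHOLD)
    return flagged / len(scores)

# Illustrative scores, not real benchmark output: 2 of 5 fall below 0.5
scores = [0.97, 0.88, 0.42, 0.91, 0.30]
print(f"{leaderboard_rate(scores):.0%}")  # → 40%
```

The binary threshold is worth noting: a summary scoring 0.49 and one scoring 0.05 both count as a single hallucination, so the leaderboard rate says nothing about severity.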
A model can be excellent at summarization and still be bad at free-form legal advice, medical triage, or financial reasoning. Summarization benchmarks are useful because they are controlled. They are not the same as production.
Other benchmark families take different approaches. TRUE, TrueTeacher, AlignScore, MiniCheck, RAGTruth, and FaithBench are all part of the broader evaluation ecosystem for hallucination detection. Academic studies in legal and medical domains, including Stanford's, often rely on human expert annotation instead, because a domain error can be legally or clinically significant.
So when someone says “the hallucination rate is X%,” the first question should be: on what task, with what evaluator, and on what dataset? That is not a gotcha. It is the whole point.
What RAG changes, and what it does not
Retrieval-augmented generation helps because it gives the model actual source material to work from, rather than generating from parametric memory alone.
Vectara’s leaderboard is relevant here because it tests summarization over provided documents. That makes it a reasonable proxy for RAG-like enterprise workflows where the model distills source context rather than inventing.
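That task setup is straightforward to reproduce in internal evaluations: constrain the model to a provided passage before asking for a summary. A minimal sketch, where the prompt wording is illustrative rather than Vectara's actual prompt:

```python
def grounded_summary_prompt(document: str) -> str:
    """Build a summarization prompt that restricts the model to the
    provided source text, mirroring the leaderboard's task setup.
    The exact wording is illustrative, not Vectara's prompt."""
    return (
        "Summarize the following passage using only facts stated in it. "
        "Do not add outside knowledge. If a fact is not in the passage, "
        "leave it out.\n\n"
        f"<passage>\n{document}\n</passage>"
    )

prompt = grounded_summary_prompt("Acme Corp reported Q3 revenue of $12M.")
```

Grounding the prompt this way is what makes summarization benchmarks a proxy for RAG pipelines: in both cases the model's job is distillation of supplied context, not recall from parametric memory.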
A 2025 JMIR Cancer study (62 cancer-related questions, reviewed by 2 clinicians) tested GPT-4 and GPT-3.5 with and without retrieval:
| Setup | Hallucination rate |
|---|---|
| GPT-4 with cancer-specific reference sources | 0% |
| GPT-3.5 with cancer-specific reference sources | 6% |
| GPT-4 with Google search results | 6% |
| GPT-3.5 with Google search results | 10% |
| Conventional chatbots without retrieval | ~40% |
The same study found that RAG-based systems sometimes refused to answer when information was missing, which reduced hallucination but also reduced response rate.
That is the tradeoff most internal discussions miss. RAG can reduce hallucinations, but it often does so by constraining the model to the retrieved context, making it more likely to say “I don’t know,” and lowering coverage on edge cases. That is still a good trade in many enterprise settings. It is not a good trade if your product promise requires every question to get an answer.
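One way to keep that tradeoff visible in your own evaluations is to report a combined figure rather than hallucination rate alone. This is a toy metric, not a standard one, and it assumes the hallucination rate is measured over answered queries only:

```python
def answered_and_correct(hallucination_rate: float, refusal_rate: float) -> float:
    """Share of all queries that get an answer AND the answer is grounded.
    Assumes hallucination_rate is measured over answered queries only."""
    answered = 1.0 - refusal_rate
    return answered * (1.0 - hallucination_rate)

# A low hallucination rate with heavy refusal can serve fewer users than
# a slightly higher rate with broad coverage (illustrative numbers):
strict = answered_and_correct(hallucination_rate=0.02, refusal_rate=0.30)  # 0.686
loose = answered_and_correct(hallucination_rate=0.06, refusal_rate=0.05)   # 0.893
```

Whether `strict` or `loose` is the better system depends entirely on the product promise, which is exactly why a single hallucination number is not enough.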
The domain gap is the real story
The biggest mistake in many risk reviews is treating all use cases as if they live on the same curve. They do not.
| Domain | Hallucination rate | Source |
|---|---|---|
| General summarization (frontier models) | 3-6% | Vectara leaderboard, March 2026 |
| General summarization (mid-tier models) | 7-12% | Vectara leaderboard, March 2026 |
| Legal AI research tools | 17-33% | Stanford HAI, study by Magesh et al. |
| Medical with strong RAG | 0-6% | JMIR Cancer, 2025 |
| Medical without retrieval | ~40% | JMIR Cancer, 2025 |
Legal
Stanford HAI’s legal benchmark tested three commercial legal AI products (Lexis+ AI, Westlaw AI-Assisted Research, Ask Practical Law AI). Each hallucinated on 17% to 33% of benchmark queries, roughly 1 in 6 to 1 in 3. The study also found that vendors claiming “hallucination-free” status could not substantiate those claims in closed systems.
Source: Magesh et al., arXiv:2405.20362
Medical
The JMIR Cancer study showed how quickly rates move depending on reference quality. With reliable cancer-specific sources, hallucination dropped to zero for GPT-4. With weaker retrieval, rates rose. Conventional chatbot behavior was much worse. Medical use cases are therefore not “hallucination solved” just because a vendor added RAG.
Finance
Finance is harder to pin to a single rate because benchmarks vary by task, but it behaves like legal and medical in one important respect: errors are expensive, and the cost of a wrong answer is often asymmetric. A model that is “mostly right” on general content can still be unacceptable for disclosures, filings, trade research, or customer advice.
Why organizations are still stuck on outdated numbers
| Reason | What happens |
|---|---|
| Risk policies age slower than models | First AI guidance was written when hallucination stories were everywhere. Model quality moved faster than policy review cycles |
| Teams confuse model capability with product risk | A low benchmark does not automatically mean low product risk. Risk also depends on retrieval quality, prompt design, citation handling, UI, and human review |
| People remember the worst demo | A single embarrassing hallucination dominates decision-making, even if the workload is now 95% source-grounded summarization |
| Vendors overclaim, buyers overreact | “Hallucination-free” claims cannot be verified in closed systems. The correct response is measured evaluation, not cynicism |
What this means for decision makers
The practical response is not “trust AI now.” The practical response is “update the assessment to match the task.”
Risk tiers by use case
| Risk tier | Examples | Typical hallucination exposure |
|---|---|---|
| Low-risk drafting support | Internal summaries, note-taking, first-pass rewriting, search condensation | 3-6% on frontier models, further reducible with RAG |
| Moderate-risk assistive workflows | Customer support, sales ops, internal knowledge assistants, code review suggestions | Depends heavily on retrieval quality and human review |
| High-risk regulated workflows | Legal research, medical guidance, financial advice, policy decisions, citation-sensitive content | 17-33% for legal tools, variable for medical/finance. Human review mandatory |
A model that is acceptable in the first tier may be unacceptable in the third.
Evaluate the whole system, not just the model
Your actual error rate depends on which model you use, whether retrieval is high quality, whether the model can cite sources, whether the UI separates sources from generated text, whether humans approve final outputs, and whether the system allows refusal.
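A back-of-the-envelope way to combine those factors is to compose the component rates. The sketch below assumes, unrealistically, that model errors, retrieval misses, and review catches are independent events; all rates are illustrative:

```python
def effective_error_rate(model_rate: float,
                         retrieval_miss_rate: float,
                         review_catch_rate: float) -> float:
    """Rough system-level error estimate: an output is wrong if the model
    hallucinates OR retrieval fed it the wrong context, and the error
    survives human review. Assumes the three events are independent,
    which real systems rarely guarantee."""
    generated_error = 1 - (1 - model_rate) * (1 - retrieval_miss_rate)
    return generated_error * (1 - review_catch_rate)

# Illustrative: 3% model rate, 10% retrieval misses, reviewers catch 80%
rate = effective_error_rate(0.03, 0.10, 0.80)
print(f"{rate:.1%}")  # → 2.5%
```

Even with crude inputs, the exercise makes one thing clear: retrieval quality and review coverage move the system-level rate as much as the model choice does.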
Keep the benchmark honest
When a vendor shows one number, ask:
| Question | Why it matters |
|---|---|
| What task is this? | Summarization, QA, and free-form generation have different failure modes |
| What is the evaluator? | HHEM, human annotation, and self-evaluation produce different results |
| How large is the dataset? | 1,000 documents vs 7,700 changes the difficulty |
| What is the refusal rate? | Low hallucination + high refusal = less useful than the number suggests |
| Does the benchmark reflect my domain? | General summarization rates do not predict legal or medical performance |
Where hallucinations still kill the deal
| Scenario | Why it is still a dealbreaker |
|---|---|
| Citation-sensitive legal work | Hallucinated citations can be worse than no answer. 17-33% error rate on tested legal tools |
| Medical guidance without tight retrieval | Low average rate can hide catastrophic tail risk. Misinformation does not need to be frequent to be dangerous |
| Financial recommendations and compliance | “Mostly correct” is insufficient for disclosures, filings, or customer advice |
| Autonomous agents with tools | Wrong answer + ability to act = writes to ticketing system, sends email, triggers workflow |
| Any workflow where users will not verify | If the end user is unlikely to check the output, the effective hallucination rate is higher than the benchmark |
The real conclusion
The lesson is not that hallucinations are gone. The lesson is that many organizations are still budgeting for a 2023 failure mode in a 2026 model landscape.
A mature AI risk assessment should now distinguish between:
| Factor | Why it matters |
|---|---|
| Model quality | Best frontier models are at 3-6% on summarization, not 20% |
| Task type | Summarization, legal QA, medical advice, and free-form generation have wildly different rates |
| Retrieval quality | Strong RAG can push rates to 0-6%. Weak retrieval is barely better than no retrieval |
| Refusal behavior | A model that refuses uncertain answers is safer but less useful |
| Domain sensitivity | Legal is at 17-33%. Medical without retrieval is at ~40%. General summarization is at 3-6% |
| User verification | If nobody checks the output, the effective rate is higher |
If your current policy still says “LLMs hallucinate too much” without naming the task, it is out of date. And if your procurement process is still using 2023 assumptions, you are not being cautious. You are being imprecise.
The data has moved. The risk assessment should too.
Sources
| Source | URL |
|---|---|
| Vectara Hallucination Leaderboard (current, March 2026) | github.com/vectara |
| Vectara blog, “Next Generation Hallucination Leaderboard” (Nov 2025) | vectara.com |
| Stanford HAI, “AI on Trial: Legal Models Hallucinate in 1 out of 6 (or More)” | hai.stanford.edu |
| Magesh et al., “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools” | arxiv.org |
| JMIR Cancer 2025, RAG and hallucination in cancer chatbots | pubmed.ncbi.nlm.nih.gov |