AI Models Least & Most Likely to Invent Information, Based on Hallucination Rates

Source: Vectara

We look at hallucination rates for AI models from OpenAI, Google, Meta, Anthropic, and xAI.

Written By
Liz Ticong
Aug 12, 2025

OpenAI’s latest AI models are outpacing competitors from Google, Anthropic, xAI, and Meta in keeping their facts straight, according to new rankings. The results show stark differences in “hallucination rates,” or how often these AI models invent details.

The results come from Vectara’s Hughes Hallucination Evaluation Model (HHEM) Leaderboard, which measures the “ratio of summaries that hallucinate” across leading large language models. In head-to-head tests, ChatGPT models outperformed Gemini, Claude, Grok, and Meta AI, landing near the top of the accuracy race.

How the top AI tools stack up when the facts matter

Vectara’s HHEM Leaderboard is based on a large-scale test designed to determine whether AI models can adhere to the facts when summarizing real news articles. Each AI model was given the same set of short documents and scored on how often its summaries included information not found in the original text.

Refusal rates were also tracked, capturing how often an AI model declined to answer. With the conditions kept identical across the board, the results reveal which AI tools handle the truth best under the same pressure. Here’s how they performed.
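As a rough illustration of how these two rates can be tallied from per-summary judgments, here is a minimal Python sketch. It is not Vectara's code: the `SummaryResult` record and the `score` function are hypothetical names, the `hallucinated` flag is assumed to come from some external judge, and computing the hallucination rate only over summaries the model actually produced is an assumption the article does not confirm.

```python
from dataclasses import dataclass


@dataclass
class SummaryResult:
    """One model response to one source document (hypothetical record)."""
    refused: bool        # the model declined to summarize
    hallucinated: bool   # the summary added information not in the source


def score(results):
    """Return (hallucination_rate, refusal_rate) as fractions between 0 and 1.

    Assumption: the hallucination rate is taken over answered summaries only,
    while the refusal rate is taken over all documents.
    """
    total = len(results)
    if total == 0:
        raise ValueError("no results to score")

    answered = [r for r in results if not r.refused]
    refusal_rate = (total - len(answered)) / total

    if not answered:
        # Every document was refused, so there are no summaries to judge.
        return 0.0, refusal_rate

    hallucination_rate = sum(r.hallucinated for r in answered) / len(answered)
    return hallucination_rate, refusal_rate


# Toy data, not leaderboard results: four documents, one refusal, one bad summary.
sample = [
    SummaryResult(refused=False, hallucinated=False),
    SummaryResult(refused=False, hallucinated=True),
    SummaryResult(refused=False, hallucinated=False),
    SummaryResult(refused=True, hallucinated=False),
]
hallucination_rate, refusal_rate = score(sample)
print(f"hallucination rate: {hallucination_rate:.1%}, refusal rate: {refusal_rate:.1%}")
```

Keeping refusals out of the hallucination denominator means a model cannot lower its score simply by declining to answer, which is one reason the two numbers are worth reading together.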

OpenAI

OpenAI holds five of the lowest hallucination rates on the leaderboard. ChatGPT-o3 mini leads at 0.795%, followed by ChatGPT-4.5, ChatGPT-5, ChatGPT-o1 mini, and ChatGPT-4o, all clustered between 1.2% and 1.49%.

That grounding in facts made the debut of ChatGPT-5 as the default model a strong move for the AI giant, until users pushed back, demanding the return of ChatGPT-4o. CEO Sam Altman relented, letting Plus subscribers choose their model.

But there’s a trade-off. Once free users hit their GPT-5 limit, they’re switched to ChatGPT-5 mini, a sharp drop in accuracy with a 4.9% hallucination rate that’s among the highest in OpenAI’s lineup. That could mean a sudden slide in how much you can trust the answers you get.

Google

Google’s Gemini 2.5 Pro Preview and Gemini 2.5 Flash Lite scored 2.6% and 2.9%, respectively. Not as low as OpenAI’s leaders, but still well clear of the highest-risk models. Pro Preview replaced the now-retired Gemini 2.5 Pro Experimental, which had once posted one of the lowest scores on the board at 1.1%.

Anthropic

Anthropic’s newest models, Claude Opus 4.1 and Claude Sonnet 4, post hallucination rates of 4.2% and 4.5%. Those scores place both among the more error-prone entries on the board, well behind the leading ChatGPT and Gemini models.

Meta

Meta’s Llama 4 Maverick and Llama 4 Scout had hallucination rates of 4.6% and 4.7%, putting them in the same ballpark as Claude’s latest models and outside the group of most accurate performers on the board.

xAI

Grok 4 posts a high hallucination rate of 4.8%, placing it among the least accurate models on the leaderboard. Elon Musk has promoted the newly released model as “smarter than almost all graduate students, in all disciplines,” pointing to its 26.9% score on Humanity’s Last Exam.

The chatbot is also facing criticism for harmful and inappropriate outputs. This combination of a high error rate and ongoing content issues could make Grok a risky choice for factually reliable answers.
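To see the figures quoted in the sections above side by side, the short Python sketch below sorts them and converts each percentage into an approximate count per 1,000 summaries. Only models with an exact rate stated in this article are included, so the OpenAI models described only as falling between 1.2% and 1.49% are left out. The `RATES` table, `rank`, and `per_thousand` are illustrative names, not part of Vectara's tooling.

```python
# Hallucination rates in percent, exactly as quoted in this article.
# Models given only as a range (1.2% to 1.49%) are omitted.
RATES = {
    "ChatGPT-o3 mini": 0.795,
    "Gemini 2.5 Pro Preview": 2.6,
    "Gemini 2.5 Flash Lite": 2.9,
    "Claude Opus 4.1": 4.2,
    "Claude Sonnet 4": 4.5,
    "Llama 4 Maverick": 4.6,
    "Llama 4 Scout": 4.7,
    "Grok 4": 4.8,
    "ChatGPT-5 mini": 4.9,
}


def rank(rates):
    """Return (model, rate) pairs sorted from lowest to highest rate."""
    return sorted(rates.items(), key=lambda item: item[1])


def per_thousand(rate_percent):
    """Convert a percentage into an expected count per 1,000 summaries."""
    return rate_percent / 100 * 1000


for model, rate in rank(RATES):
    print(f"{model:<24}{rate:>6.3f}%  about {per_thousand(rate):.1f} per 1,000 summaries")
```

Read this way, the gap between the lowest and highest quoted rates is the difference between a handful and several dozen summaries with invented details in every thousand.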

Keeping track of truth in the age of AI

When AI gets it wrong, it can sound right. And when those made-up details slip past unnoticed, bending facts and spreading misinformation, it can lead to serious risks in areas like health, law, finance, and politics. That’s why ongoing, transparent testing is more important than ever.

Vectara’s HHEM Leaderboard updates with every model change, tracking in real time which AIs are improving and which are falling behind. As these systems weave deeper into search, messaging, and everyday tools, knowing which AI model stays closest to the truth is knowing what to trust.

In our closer look at OpenAI’s GPT-5, we focus on the AI model’s health-related benchmarks and guidelines.

Liz Ticong

Liz Ticong is a technology writer specializing in artificial intelligence, cybersecurity, software reviews, and emerging business technologies. With more than a decade of professional writing experience and over five years contributing technology content for TechnologyAdvice, she helps readers understand complex technologies and evaluate the tools that best fit their needs. Liz has extensive experience researching, testing, and analyzing software platforms, AI tools, and technology solutions. Her work includes in-depth software reviews, buyer’s guides, product comparisons, and technology news coverage designed to help businesses make informed purchasing and implementation decisions. She regularly evaluates AI applications, automation tools, cybersecurity solutions, and business software, providing practical insights based on hands-on testing and research. In addition to her work with TechnologyAdvice, Liz has contributed technology content to leading industry publications, including eWeek and TechRepublic. Her background in technical writing and software analysis enables her to translate complex technical concepts into clear, actionable guidance for both business and technology audiences. Liz holds a bachelor's degree in Broadcast Communication from the Polytechnic University of the Philippines and continues to expand her expertise through ongoing education in artificial intelligence and emerging technologies. Through her writing, she helps readers navigate a rapidly evolving technology landscape with practical, research-driven insights and real-world product analysis.