
November 16, 2025

AA-Omniscience: Knowledge and Hallucination Benchmark

Announcing AA-Omniscience, our new benchmark for knowledge and hallucination across >40 topics, where all but three models are more likely to hallucinate than give a correct answer

For up to date AA-Omniscience scores, see the AA-Omniscience evaluation page.

Embedded knowledge in language models is important for many real-world use cases. Without knowledge, models make incorrect assumptions and are limited in their ability to operate in real-world contexts. Tools like web search can help, but models need to know what to search for (e.g. a model should not search for 'Multi Client Persistence' when an MCP query clearly refers to 'Model Context Protocol').

Hallucination of factual information is a barrier to relying on models, and it has been perpetuated by every major evaluation dataset. Grading correct answers with no penalty for incorrect answers creates an incentive for models (and the labs training them) to attempt every question. This problem is clearest when it comes to knowledge: factual information should never be made up, whereas in other contexts attempts that might not work are useful (e.g. coding new features).

Omniscience Index is the key metric we report for AA-Omniscience. It tests hallucination by penalizing models that guess rather than admit they do not know the answer. AA-Omniscience shows that all but three models are more likely to hallucinate than provide a correct answer when given a difficult question. AA-Omniscience will complement the Artificial Analysis Intelligence Index by adding measurement of knowledge and of the probability of hallucination.

AA-Omniscience details:

  • 🔢 6,000 questions across 42 topics within 6 domains ('Business', 'Humanities & Social Sciences', 'Health', 'Law', 'Software Engineering', and 'Science, Engineering & Mathematics')
  • 🔍 89 sub-topics including Python data libraries, Public Policy, Taxation, and more, giving a sharper view of where models excel and where they fall short across nuanced domains
  • 🔄 Incorrect answers are penalized in our Omniscience Index metric to punish hallucinations
  • 📊 3 metrics: Accuracy (% of questions answered correctly), Hallucination Rate (incorrect answers as a % of all responses that were not correct, i.e. incorrect answers plus abstentions), and Omniscience Index (+1 for a correct answer, -1 for an incorrect answer, 0 where the model abstains rather than attempting an answer); see the sketch after this list
  • 🤗 Open-source test dataset: We're open sourcing 600 questions (10%) to help labs develop factual and reliable models. Topic distribution and model performance follow the full set (see HuggingFace repo)
  • 📃 Paper: See AA-Omniscience paper for more details
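
To make the scoring concrete, below is a minimal sketch of how the three metrics could be computed from graded responses. The label names, data layout, and per-question averaging of the Omniscience Index are illustrative assumptions, not the official evaluation harness.

```python
# Minimal sketch of the three AA-Omniscience metrics described above.
# Assumes each graded response is labelled "correct", "incorrect", or "abstained";
# the labels, data layout, and per-question averaging are assumptions for illustration.
from collections import Counter

def omniscience_metrics(labels):
    """Compute Accuracy, Hallucination Rate and Omniscience Index from grading labels."""
    counts = Counter(labels)
    correct, incorrect, abstained = counts["correct"], counts["incorrect"], counts["abstained"]
    total = correct + incorrect + abstained

    accuracy = correct / total  # share of all questions answered correctly
    # Hallucination rate: how often the model guesses when it lacks the knowledge,
    # i.e. incorrect answers as a share of all responses that were not correct.
    hallucination_rate = incorrect / (incorrect + abstained) if (incorrect + abstained) else 0.0
    # Omniscience Index: +1 per correct answer, -1 per incorrect answer, 0 per abstention,
    # averaged over all questions here (the published scaling may differ).
    omniscience_index = (correct - incorrect) / total

    return {
        "accuracy": accuracy,
        "hallucination_rate": hallucination_rate,
        "omniscience_index": omniscience_index,
    }

# Example: 3 correct, 4 incorrect, 3 abstentions out of 10 questions
print(omniscience_metrics(["correct"] * 3 + ["incorrect"] * 4 + ["abstained"] * 3))
# -> accuracy 0.30, hallucination_rate ~0.57, omniscience_index -0.10
```

Under this scoring, a model that attempts every question but is right only half the time scores no better than one that abstains on everything, which is exactly the pressure against guessing the index is designed to create.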

Key findings:

  • 🥇 Claude 4.1 Opus takes first place in Omniscience Index, followed by last week's GPT-5.1 release and Grok 4: Even the best frontier models score only slightly above 0, meaning they produce correct answers on the difficult questions that make up AA-Omniscience only marginally more often than incorrect ones. Anthropic's leadership is driven by a low hallucination rate, whereas OpenAI's and xAI's positions are primarily driven by higher accuracy (percentage correct).

Artificial Analysis Omniscience Index

  • 🥇 xAI's Grok 4 takes first place in Omniscience Accuracy (our simple 'percentage correct' metric), followed by GPT-5 and Gemini 2.5 Pro: xAI's win may be enabled by scaling total parameters and pre-training compute. Elon Musk revealed last week that Grok 4 has 3 trillion total parameters, which may make it larger than GPT-5 and other proprietary models

Artificial Analysis Omniscience Accuracy

  • 🥇 Claude sweeps the hallucination leaderboard: Anthropic takes the top three spots for lowest hallucination rate, with Claude 4.5 Haiku leading at 28%, less than a third of the rate of GPT-5 (high) and Gemini 2.5 Pro. Claude 4.5 Sonnet and Claude 4.1 Opus follow in second and third at 48%

Artificial Analysis Omniscience Hallucination Rate

  • 💭 High knowledge does not guarantee low hallucination: Hallucination rate measures how often a model guesses when it lacks the required knowledge. Models with the highest accuracy, including the GPT-5 models and Gemini 2.5 Pro, do not lead the Omniscience Index because of their tendency to guess rather than abstain. Anthropic's models tend to manage uncertainty better, with Claude 4.5 Haiku achieving the lowest hallucination rate, ahead of Claude 4.5 Sonnet and Claude 4.1 Opus (48%)

Omniscience Accuracy vs Omniscience Index

  • 📊 Models vary by domain: Models differ in their performance across the six domains of AA-Omniscience - no model dominates across all of them. While Anthropic's Claude 4.1 Opus leads in Law, Software Engineering, and Humanities & Social Sciences, OpenAI's GPT-5.1 achieves the highest reliability on Business questions, and xAI's Grok 4 performs best in Health and in Science, Engineering & Mathematics. Model choice should align with the use case rather than defaulting to the overall leader

Artificial Analysis Omniscience Index Across Domains (Normalized)

  • 📈 Larger models score higher on accuracy, but not always on reliability: Larger models tend to have higher levels of embedded knowledge, with Kimi K2 Thinking and DeepSeek R1 (0528) topping the accuracy charts ahead of smaller models. This advantage does not always hold for the Omniscience Index: Meta's Llama 3.1 405B, for example, beats the larger Kimi K2 variants thanks to one of the lowest hallucination rates among the models tested (51%)

Artificial Analysis Omniscience Index vs Total Parameters (above); Artificial Analysis Accuracy vs Total Parameters (below)

Read more about the evaluation and methodology in our AA-Omniscience paper

Explore sample questions and evaluate your model on a subset of AA-Omniscience with our HuggingFace dataset
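
As a rough starting point, the released subset can be pulled down with the Hugging Face `datasets` library. The repository id, split name, and column names below are placeholder assumptions for illustration; check the HuggingFace repo itself for the actual schema.

```python
# Hedged sketch: loading the open-sourced AA-Omniscience subset with the
# Hugging Face `datasets` library. The repo id, split name, and column names
# are placeholders -- consult the HuggingFace repo for the real schema.
from collections import Counter
from datasets import load_dataset

REPO_ID = "ArtificialAnalysis/AA-Omniscience"  # placeholder repo id

dataset = load_dataset(REPO_ID, split="train")  # split name is an assumption

print(dataset)     # number of rows and column names
print(dataset[0])  # a sample question record

# Rough look at how questions are distributed across domains
# ("domain" is an assumed column name)
print(Counter(row["domain"] for row in dataset))
```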

See detailed AA-Omniscience results on the AA-Omniscience evaluation page
