AA-Omniscience: Knowledge and Hallucination Benchmark

Name: AA-Omniscience Accuracy
Creator: Artificial Analysis
License: https://artificialanalysis.ai/docs/legal/Terms-of-Use.pdf

A benchmark measuring factual recall and hallucination across various economically relevant domains.

AA-Omniscience is a knowledge and hallucination benchmark that rewards accuracy, punishes bad guesses and provides a comprehensive view of which models produce factually reliable outputs across different domains.

The benchmark contains 6,000 questions across 6 major domains. Questions are derived from authoritative academic and industry sources and generated automatically by an LLM-based question generation agent to ensure unambiguity, scalability and factual precision.

All evaluations are conducted independently by Artificial Analysis. More information can be found on our Intelligence Benchmarking Methodology page.

Publication

View on arXiv

AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models

Declan Jackson, William Keating, George Cameron, Micah Hill-Smith.

Existing language model evaluations primarily measure general capabilities, yet reliable use of these models across a range of domains demands factual accuracy and recognition of knowledge gaps. We introduce AA-Omniscience, a benchmark designed to measure both factual recall and knowledge calibration across 6,000 questions. Questions are derived from authoritative academic and industry sources, and cover 42 economically relevant topics within six different domains. The evaluation measures a model's AA-Omniscience Index, a bounded metric (-100 to 100) measuring factual recall that jointly penalizes hallucinations and rewards abstention when uncertain, with 0 equating to a model that answers questions correctly as much as it does incorrectly. Among evaluated models, Claude 4.1 Opus attains the highest score (4.8), making it one of only three models to score above zero. These results reveal persistent factuality and calibration weaknesses across frontier models. Performance also varies by domain, with the models from three different research labs leading across the six domains. This performance variability suggests models should be chosen according to the demands of the use case rather than general performance for tasks where knowledge is important.

AA-Omniscience Methodology

ArtificialAnalysis/AA-Omniscience-Public

arXiv: 2511.13029

AA-Omniscience

Gemini 3.1 Pro Preview scores the highest on the AA-Omniscience Index with a score of 33, followed by Claude Opus 4.8 (Adaptive Reasoning, Max Effort) with a score of 27 and Claude Opus 4.7 (Adaptive Reasoning, Max Effort) with a score of 26.

AA-Omniscience Accuracy

GPT-5.5 (xhigh) scores the highest on AA-Omniscience Accuracy with a score of 57%, followed by GPT-5.5 (high) and Gemini 3 Pro Preview (high), each with a score of 56%.

AA-Omniscience Hallucination Rate

MiniCPM5-1B (Non-reasoning) has the lowest (best) AA-Omniscience Hallucination Rate with a score of 1%, followed by Command A+ with a score of 14% and Grok 4.3 (medium) with a score of 16%.

Results

AA-Omniscience Index: Results

AA-Omniscience Index (higher is better) measures knowledge reliability and hallucination. It rewards correct answers, penalizes hallucinations, and has no penalty for refusing to answer. Scores range from -100 to 100, where 0 means as many correct as incorrect answers, and negative scores mean more incorrect than correct.
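The description above can be illustrated with a small sketch. It assumes the index is computed as 100 × (correct − incorrect) / total questions, which is consistent with the stated range and zero point but is not spelled out on this page. The function name, and the treatment of partial answers and abstentions as contributing zero, are assumptions made for illustration only.

```python
def omniscience_index(correct: int, incorrect: int,
                      partial: int = 0, not_attempted: int = 0) -> float:
    """Hypothetical index: +1 per correct answer, -1 per incorrect answer,
    0 for partial answers and abstentions, scaled to the range [-100, 100]."""
    total = correct + incorrect + partial + not_attempted
    if total == 0:
        raise ValueError("at least one graded question is required")
    return 100.0 * (correct - incorrect) / total


# A model that abstains instead of guessing keeps its score from dropping.
guesser = omniscience_index(correct=30, incorrect=70)
abstainer = omniscience_index(correct=30, incorrect=10, not_attempted=60)
print(guesser, abstainer)
```

Under this assumed formula, two models with the same number of correct answers can differ widely in score depending on how often they answer incorrectly rather than abstain.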

AA-Omniscience Index vs. Artificial Analysis Intelligence Index

Axes: AA-Omniscience Index · Artificial Analysis Intelligence Index. The chart marks a "most attractive quadrant". Models are grouped by creator: Alibaba, Amazon, Anthropic, DeepSeek, Google, Kimi, MBZUAI Institute of Foundation Models, Meta, MiniMax, Mistral, NVIDIA, OpenAI, Upstage, xAI, Xiaomi, Z AI.

Artificial Analysis Intelligence Index v4.0 includes: GDPval-AA, 𝜏²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, CritPt. See Intelligence Index methodology for further details, including a breakdown of each evaluation and how we run them.

AA-Omniscience Accuracy

AA-Omniscience Accuracy (higher is better) measures the proportion of correctly answered questions out of all questions, regardless of whether the model chooses to answer.

AA-Omniscience Hallucination Rate

AA-Omniscience Hallucination Rate (lower is better) measures how often the model answers incorrectly when it should have refused or admitted to not knowing the answer. It is defined as the proportion of incorrect answers out of all non-correct responses, i.e. incorrect / (incorrect + partial answers + not attempted).
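As a minimal sketch, the accuracy and hallucination rate definitions given on this page can be computed from per-category counts as follows. The function names and the example counts are hypothetical, and returning 0.0 when every answer is correct is an assumption, since the page does not define that edge case.

```python
def accuracy(correct: int, incorrect: int, partial: int, not_attempted: int) -> float:
    """Correct answers as a share of all questions, attempted or not."""
    total = correct + incorrect + partial + not_attempted
    if total == 0:
        raise ValueError("at least one graded question is required")
    return correct / total


def hallucination_rate(incorrect: int, partial: int, not_attempted: int) -> float:
    """Incorrect answers as a share of all non-correct responses."""
    non_correct = incorrect + partial + not_attempted
    if non_correct == 0:
        return 0.0  # assumption: no non-correct responses means nothing was hallucinated
    return incorrect / non_correct


# Hypothetical counts for one model on 100 questions.
counts = {"correct": 40, "incorrect": 30, "partial": 5, "not_attempted": 25}
print(accuracy(**counts))
print(hallucination_rate(counts["incorrect"], counts["partial"], counts["not_attempted"]))
```

Note that the two metrics use different denominators: accuracy is taken over all questions, while the hallucination rate only considers the questions the model did not get right.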

Detailed Domain Results

AA-Omniscience Index Across Domains (Normalized)

AA-Omniscience Index · Scores are normalized per domain across all models tested, where green represents the highest score for that domain and red represents the lowest score for that domain.
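The page does not state the exact normalization used for the heatmap. A plausible reading, sketched below, is min-max scaling within each domain, so that the best model in a domain maps to 1 (green) and the worst to 0 (red). The function name, the data layout, and the choice of min-max scaling are all assumptions.

```python
def normalize_per_domain(scores: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Min-max scale scores[model][domain] to [0, 1] separately for each domain."""
    domains = {domain for per_model in scores.values() for domain in per_model}
    normalized: dict[str, dict[str, float]] = {model: {} for model in scores}
    for domain in domains:
        values = [per_model[domain] for per_model in scores.values() if domain in per_model]
        lowest, highest = min(values), max(values)
        span = highest - lowest
        for model, per_model in scores.items():
            if domain in per_model:
                # If every model ties in a domain, place them all at the midpoint.
                normalized[model][domain] = 0.5 if span == 0 else (per_model[domain] - lowest) / span
    return normalized


# Hypothetical index scores for three models in two domains.
example = {
    "model_a": {"Law": -10.0, "Health": 20.0},
    "model_b": {"Law": 10.0, "Health": 0.0},
    "model_c": {"Law": 0.0, "Health": 20.0},
}
print(normalize_per_domain(example))
```

Because scaling is done per domain, a normalized value only ranks a model against the other tested models in that domain and cannot be compared across domains.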

Software Engineering Deep Dive

Software Engineering AA-Omniscience Index Across Languages (Normalized)

Software Engineering AA-Omniscience Index · Scores are normalized per language across all models tested, where green represents the highest score for that language and red represents the lowest score for that language.

AA-Omniscience Index Question Breakdown

Distribution of questions across different domains and subdomains in the AA-Omniscience benchmark

Business

Humanities & Social Sciences

Science, Engineering & Mathematics

Health

Law

Software Engineering (SWE)

Model Size (Open Weights Models Only)

AA-Omniscience Index vs. Total Parameters

Axes: AA-Omniscience Index · Size in parameters (billions). The chart marks a "most attractive quadrant". Models are grouped by creator: Alibaba, DeepSeek, Google, Kimi, MBZUAI Institute of Foundation Models, Meta, MiniMax, Mistral, NVIDIA, OpenAI, Upstage, Xiaomi, Z AI.

The total number of trainable weights and biases in the model, expressed in billions. These parameters are learned during training and determine the model's ability to process and generate responses.

AA-Omniscience Accuracy vs. Total Parameters

Axes: AA-Omniscience Accuracy · Size in parameters (billions). The chart marks a "most attractive quadrant". Models are grouped by creator: Alibaba, DeepSeek, Google, Kimi, MBZUAI Institute of Foundation Models, Meta, MiniMax, Mistral, NVIDIA, OpenAI, Upstage, Xiaomi, Z AI.

AA-Omniscience Hallucination Rate vs. Total Parameters

Axes: AA-Omniscience Hallucination Rate · Size in parameters (billions). The chart marks a "most attractive quadrant". Models are grouped by creator: Alibaba, DeepSeek, Google, Kimi, MBZUAI Institute of Foundation Models, Meta, MiniMax, Mistral, NVIDIA, OpenAI, Upstage, Xiaomi, Z AI.

Token Usage

AA-Omniscience Index: Token Usage

Chart series: input tokens, reasoning tokens and answer tokens used to run the evaluation.

The total number of tokens used to run the evaluation, including input tokens (prompt), reasoning tokens (for reasoning models), and answer tokens (final response).

Cost

AA-Omniscience Index: Cost Breakdown

Chart series: input cost, reasoning cost and answer cost (USD) to run the evaluation.

The cost to run the evaluation, calculated using the model's input and output token pricing and the number of tokens used.
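The cost calculation can be sketched as below. This assumes prices are quoted in USD per million tokens and that reasoning tokens are billed at the output price, which is common but not confirmed on this page. The function name, parameter names and example figures are hypothetical.

```python
def evaluation_cost(input_tokens: int, reasoning_tokens: int, answer_tokens: int,
                    input_price_per_million: float,
                    output_price_per_million: float) -> dict[str, float]:
    """Split evaluation cost (USD) into input, reasoning and answer components."""
    input_cost = input_tokens * input_price_per_million / 1_000_000
    # Assumption: reasoning and answer tokens are both billed as output tokens.
    reasoning_cost = reasoning_tokens * output_price_per_million / 1_000_000
    answer_cost = answer_tokens * output_price_per_million / 1_000_000
    return {
        "input": input_cost,
        "reasoning": reasoning_cost,
        "answer": answer_cost,
        "total": input_cost + reasoning_cost + answer_cost,
    }


# Hypothetical token counts and prices.
print(evaluation_cost(2_000_000, 1_000_000, 500_000,
                      input_price_per_million=1.0, output_price_per_million=4.0))
```

Under these assumptions, a reasoning model can cost several times more than a non-reasoning model at the same list price, because reasoning tokens add to the billed output.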

Score vs. Release Date

AA-Omniscience Index: Score vs. Release Date

The chart marks a "most attractive region". Models are grouped by creator: Alibaba, Amazon, Anthropic, DeepSeek, Google, Kimi, MBZUAI Institute of Foundation Models, Meta, MiniMax, Mistral, NVIDIA, OpenAI, Upstage, xAI, Xiaomi, Z AI.

Explore Evaluations

Artificial Analysis Intelligence Index

A composite benchmark aggregating ten challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.

GDPval-AA Leaderboard

GDPval-AA is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with Elo ratings derived from blind pairwise comparisons.

APEX-Agents-AA Benchmark Leaderboard

Artificial Analysis' implementation of the APEX-Agents benchmark, testing AI agents on long-horizon, cross-application tasks in professional-services environments with realistic application tooling.

𝜏²-Bench Telecom Benchmark Leaderboard

A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.

Terminal-Bench Hard Benchmark Leaderboard

An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks.

SciCode Benchmark Leaderboard

A scientist-curated coding benchmark featuring 288 test set subproblems from 80 laboratory problems across 16 scientific disciplines.

Artificial Analysis Long Context Reasoning Benchmark Leaderboard

A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).

AA-Omniscience: Knowledge and Hallucination Benchmark

A benchmark measuring factual recall and hallucination across various economically relevant domains.

IFBench Benchmark Leaderboard

A benchmark evaluating precise instruction-following generalization on 58 diverse, verifiable out-of-domain constraints that test models' ability to follow specific output requirements.

Humanity's Last Exam Benchmark Leaderboard

A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities, designed to be the final closed-ended academic evaluation.

GPQA Diamond Benchmark Leaderboard

The most challenging 198 questions from GPQA, where PhD experts achieve 65% accuracy but skilled non-experts only reach 34% despite web access.

CritPt Benchmark Leaderboard

A benchmark designed to test LLMs on research-level physics reasoning tasks, featuring 71 composite research challenges.

ITBench-AA Benchmark Leaderboard

Artificial Analysis' implementation of IBM's ITBench benchmark, testing AI agents on Kubernetes incident root-cause analysis from offline incident snapshots. The agent inspects alerts, events, traces, and topology and identifies the contributing-factor entities (deployments, pods, namespaces, network policies, etc.) responsible for the failure.

Artificial Analysis Openness Index

A composite measure providing an industry standard to communicate model openness for users and developers.

MMLU-Pro Benchmark Leaderboard

An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.

Global-MMLU-Lite Benchmark Leaderboard

A lightweight, multilingual version of MMLU, designed to evaluate knowledge and reasoning skills across a diverse range of languages and cultural contexts.

LiveCodeBench Benchmark Leaderboard

A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces, evaluating code generation, self-repair, and execution.

MATH-500 Benchmark Leaderboard

A 500-problem subset from the MATH dataset, featuring competition-level mathematics across six domains including algebra, geometry, and number theory.

AIME 2025 Benchmark Leaderboard

All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000-999.

MMMU-Pro Benchmark Leaderboard

An enhanced MMMU benchmark that eliminates shortcuts and guessing strategies to more rigorously test multimodal models across 30 academic disciplines.
