Artificial Analysis Intelligence Benchmarking Methodology
Artificial Analysis Intelligence Index v4.1
Artificial Analysis Intelligence Index
Artificial Analysis Intelligence Index combines a comprehensive suite of evaluation datasets to assess language model capabilities across reasoning, knowledge, maths and programming.
It is a helpful synthesis of overall language model intelligence and can be used to compare language models. Like all evaluation metrics, it has limitations and may not apply directly to every use case. However, we are confident that it is a more useful synthesis comparison between language models than any other metric in existence today.
Artificial Analysis Intelligence Index Artificial Analysis Intelligence Index v4.1 incorporates 9 evaluations: GDPval-AA v2, 𝜏³-Banking, Terminal-Bench v2.1, SciCode, AA-LCR, AA-Omniscience, Humanity's Last Exam, GPQA Diamond, CritPt. Our methodology emphasizes fairness and real-world applicability.
We estimate a 95% confidence interval for Artificial Analysis Intelligence Index of less than ±1% - based on experiments with >10 repeats on certain models for all evaluation datasets included in Artificial Analysis Intelligence Index v4.1. Individual evaluation results may have wider than ±1% confidence intervals. We look forward to disclosing further detail from our statistical analysis in the future.
Artificial Analysis Intelligence Index is a text-only, English language evaluation suite. We benchmark models for image inputs, speech inputs and multilingual performance separately to the Intelligence Index evaluation suite.
Intelligence Index evaluation suite
Intelligence Index is calculated as a weighted average across four categories: Agents (34%), Coding (24%), Scientific Reasoning (24%) and General (18%). The weighting emphasizes agentic tasks. Category membership and per-evaluation weights are shown below.
| Category | Evaluation | Questions | Repeats | Response Type | Scoring | Intelligence Index Weighting | Tool Usage |
|---|---|---|---|---|---|---|---|
| Agents (34%) | GDPval-AA v2 | 220 tasks | 1 | Agentic task completion with file outputs | Pairwise comparison (Elo) by judge panel, anchored to human experts at 1000, frozen & scaled | 20% | ✓ |
| 𝜏³-Banking | 97 | 5 | Dual control agent-user simulation with knowledge retrieval | Backend database state evaluation, pass@1 | 14% | ✓ | |
| Coding (24%) | Terminal-Bench v2.1 | 89 | 3 | Terminal-based task execution | Test suite pass/fail, pass@1 | 16% | |
| SciCode | 288 subproblems (test set) | 3 | Python Code (must pass all unit tests) | Code execution, pass@1, sub-problem scoring with scientist-annotated background prompting | 8% | ✗ | |
| General (18%) | AA-LCR | 100 | 3 | Open Answer | Equality Checker LLM, pass@1 | 6% | ✗ |
| AA-Omniscience | 6,000 | 1 | Open Answer | Accuracy (8%) and 1 - Hallucination Rate (4%) as separate components | 12% | ✗ | |
| Scientific Reasoning (24%) | HLE (Humanity's Last Exam) | 2,158 | 1 | Open Answer | Equality Checker LLM, pass@1 | 12% | ✗ |
| GPQA Diamond | 198 | 5 | Multiple Choice (4 options) | Regex extraction, pass@1 | 6% | ✗ | |
| CritPt | 70 | 5 | Python Functions, Symbolic Expressions, Numerical Answers | Official grading server, pass@1 | 6% | ✗ |
Additional Evaluations
Beyond the Intelligence Index suite, we run a range of additional evaluations covering multilingual, visual, mathematical and other capabilities. These are reported separately and are not included in the Intelligence Index score.
Artificial Analysis Multilingual Index: Represents the multilingual ability of models. This is based on the Global-MMLU-Lite evaluation across supported languages. We support the following languages:
- 🇬🇧 English
- 🇨🇳 Chinese
- 🇮🇳 Hindi
- 🇪🇸 Spanish
- 🇫🇷 French
- 🇸🇦 Arabic
- 🇧🇩 Bengali
- 🇵🇹 Portuguese
- 🇮🇩 Indonesian
- 🇯🇵 Japanese
- 🇰🇪 Swahili
- 🇩🇪 German
- 🇰🇷 Korean
- 🇮🇹 Italian
- 🇳🇬 Yoruba
- 🇲🇲 Burmese
| Category | Evaluation | Questions | Repeats | Response Type | Scoring | Tool Usage |
|---|---|---|---|---|---|---|
| Agents | AA-Briefcase | 4 scenarios | 1 | Agentic task completion with file outputs | Pairwise comparison (Elo) and rubric-based local file grading, pass@1 | ✓ |
| APEX-Agents-AA | 452 tasks | 3 | Agentic professional-service task completion | Rubric-based local file grading, pass@1 | ✓ | |
| ITBench-AA | 59 scenarios (public + private) | 3 | Structured JSON root-cause diagnosis from offline Kubernetes incident snapshot | LLM-normalized entity matching, average precision at full recall | ✓ | |
| Coding | LiveCodeBench | 315 | 3 | Python Code (must pass all unit tests) | Code execution, pass@1 | ✗ |
| General | IFBench | 294 | 5 | Open Answer | Extraction and rule-driven assessment, pass@1 | ✗ |
| MMLU-Pro | 12,032 | 1 | Multiple Choice (10 options) | Regex extraction, pass@1 | ✗ | |
| Other | Global-MMLU-Lite | ~6,000 (~400 per language) | 1 | Multiple Choice (4 options) | Regex extraction, pass@1 | ✗ |
| MMMU Pro | 1,730 | 1 | Multiple Choice (10 options) | Regex extraction, pass@1 | ✗ |
Intelligence Evaluation Principles
Our evaluation approach is guided by four core principles:
- Standardized: All models are evaluated under identical conditions with consistent prompting strategies, temperature settings, and evaluation criteria.
- Unbiased: We employ evaluation techniques that avoid unfairly penalizing models for answers that correctly follow the instructions in our prompts. This includes using clear prompts, robust answer extraction methods, and flexible answer validation to accommodate valid variations in model outputs.
- Zero-Shot Instruction Prompted: We evaluate using clear instructions without examples or demonstrations, testing models' ability to follow directions without few-shot learning. This approach aligns with modern instruction-tuned and chat style models.
- Transparent: We fully disclose our methodology, including prompt templates, evaluation criteria, and limitations.
General Testing Parameters
We test all evals with the following settings:
- Temperature: 0 for non-reasoning models, 0.6 for reasoning models (unless another temperature is recommended by the model lab)
- Maximum output tokens:
- Non-reasoning models: 16,384 tokens (adjusted downward where models have a smaller context window, or lower maximum output tokens cap)
- Reasoning models: Maximum output tokens allowed, as disclosed by model creators (custom setting for each reasoning model)
- Code evaluation environment:
- Ubuntu 22.04 LTS
- Python 3.12
- Error handling:
- Automatic retry on API failures (up to 30 attempts)
- All questions that failed all 30 retries are manually reviewed. Results where persistent API failures have caused issues are not published. Errors where all available APIs for proprietary models block a certain question may lower scores (this effect is not material)
- Scoring methodology: We generally use pass@1 scoring across our evaluations, where a model must produce the correct answer on its first attempt. For evaluations with multiple repeats, pass@1 is calculated by aggregating results across all repeats. This is calculated as:where pi = 1 if attempt i is correct, 0 otherwise, and k is the total number of test instances across all repeats.
We maintain internal copies of all evaluation datasets. The sources of our selected datasets are listed below.
For the Artificial Analysis Intelligence Index evaluations, we use the token counts reported by each model's API provider, to accurately report the cost of running the Intelligence Index. This contrasts with the approach in performance benchmarking, where we use client-side token counts from the o200k_base tokenizer to standardize token counts for the same text across models. When reporting cache hit rates and cost, we combine provider token counts with live measurements of the model's typical cache hit rate, rather than relying on the one-off measurement when the evaluation was run.
Artificial Analysis Intelligence Index Evaluations
Evaluations that make up the current Artificial Analysis Intelligence Index, grouped by capability.
Agents
GDPval-AA v2
- Description: GDPval-AA v2 is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It assesses language models' capabilities on economically valuable tasks, covering 44 occupations across key sectors contributing to GDP in the United States.
- Changes from GDPval-AA v1: GDPval-AA v2 is a minor upgrade to the original GDPval-AA methodology used in Intelligence Index v4.0. It incorporates:
- An upgraded sandbox with new and expanded dependencies, plus fixes to minor environment issues and to prompt clarity and consistency
- Elo scores re-baselined to human expert performance at 1000
- A panel of three frontier LLM judges from leading labs, replacing a single judge
- Turn limits expanded to 250 turns to allow for even longer-horizon agent trajectories, and the ability for models to exit early where they don't believe they can complete the task
- Paper: https://arxiv.org/abs/2510.04374
- Agent harness: https://github.com/ArtificialAnalysis/Stirrup
- Dataset:
- We base our evaluation on the public gold OpenAI GDPval dataset from https://huggingface.co/datasets/openai/gdpval
- Some Microsoft Office files in the dataset had missing metadata parts or malformed relationship entries that prevented LibreOffice from opening them. We added the minimal missing metadata and fixed the malformed entries to ensure compatibility. Document body, slide content, and layout were not changed.
- Implementation: This evaluation comprises two stages:
- Task Submission – Models are given a task and required to produce one or more files.
- Pairwise Grading – A judge sampled from a panel of three frontier LLM judges blindly ranks two submissions for the same task, each created by a different model.
- Elo Calculation: After collecting pairwise rankings, we fit them to a Bradley-Terry model via maximum likelihood estimation and compute confidence intervals using the sandwich estimator to establish our final Elo metric. The Elo scale is anchored to human expert deliverables at 1000.
- Intelligence Index Integration: For inclusion in the Intelligence Index, GDPval-AA v2 Elo scores are frozen at the time of a model's addition and normalized as clamp((Elo - 500) / 2000) for inclusion in the Intelligence Index. The v2 Elo scale is anchored to human expert deliverables at 1000, while the fixed normalization range preserves stable Intelligence Index contributions over time. Artificial Analysis may update the reference parameters as models progress against the evaluation, to maintain meaningful differentiation in the Intelligence Index.
- Task Submission Details:
- All models are run using our open source agentic harness, Stirrup. Within the harness, models are given a code execution environment (E2B sandbox), and the following six tools to call at their discretion:
- Web Fetch – Fetches and extracts main content from a web page as markdown.
- Web Search – Searches the web using Brave Search API; returns the top 5 results with title, URL, and description.
- View Image – Reads and displays image files (.png, .jpg, .jpeg) from the sandbox as native image tokens for LLM consumption. This tool is only exposed to models with vision support. Images are downscaled to a maximum of 1 megapixel before being sent to the model.
- Code Exec – Executes bash commands in the sandbox via the
code_exectool; returns exit code, stdout, and stderr. - Finish – Signals task completion and specifies which files to submit.
- Abandon Task – Signals that the model does not believe it can complete the task, with a brief reason, instead of submitting files.
- For each task, a new E2B sandbox is initialized with the reference files associated with the given task and pre-installed with a range of relevant packages for the task set. We based the package collection on the disclosed environment from the original GDPval paper, expanded in v2 with additional dependencies (including a full TeX Live LaTeX toolchain and build tools).
- We prompt the agent with an instruction interpolating the relevant task prompt, reference files, and finish tool details.
- All models are run using our open source agentic harness, Stirrup. Within the harness, models are given a code execution environment (E2B sandbox), and the following six tools to call at their discretion:
- Execution Limits:
- The LLM is given 250 turns to complete the task. A single turn is defined as an assistant message and its tool calls (if any). As the model approaches the limit, it is notified of its remaining turn budget.
- The model may end the run early via the Abandon Task tool where it does not believe it can complete the task, providing a brief reason instead of submitting files.
- If the model exceeds 70% of its context window after completing a given turn, the agent asks it to summarize task state, completed work, current files, remaining steps, and important context, then clears earlier turn history while retaining the task prompt and the summary for continuation.
Task Submission System Prompt:
You are an AI agent completing a standalone professional task. Your job is to use the provided tools to produce the requested deliverables within 250 steps, then submit your work.
When you are done, call the `finish` tool as your final step with:
1. A brief summary of what you accomplished.
2. Absolute paths to every deliverable file.
If you have genuinely concluded that the task cannot be completed because required inputs are missing, a hard dependency is unavailable, or the request is incoherent, call the `abandon_task_finish` tool with a brief reason instead. Do not use it to escape difficulty.
You cannot interact with the user during the task. Make reasonable assumptions when needed and record them in your finish summary.Task Submission Prompt:
## Runtime
You are running in an isolated Linux sandbox. Use the `code_exec` tool to read, create, and modify files. Commands run as the non-root user `user` (UID 1000). Default working directory is `/home/user`.
Every command runs independently: no working directory, environment variable, or other shell state carries over from one call to the next. Prefer absolute paths for both files and commands, and do not navigate with `cd` across calls — a `cd` in one command is gone by the next, so relying on it leaves you silently operating in the wrong place. When a step genuinely needs a different directory, chain it into the same command (e.g. `cd /home/user/work && python build.py`).
A broad scientific-computing and document-processing stack is already installed, so confirm what is present before assuming a gap:
- Python 3.13 with the usual data stack (numpy, pandas, polars, scipy), plotting (matplotlib, plotly), the scikit-learn ML family, and document tooling (python-docx, python-pptx, openpyxl, PyMuPDF, pdfplumber, reportlab, weasyprint, Pillow, opencv), plus Playwright.
- System tools include LibreOffice, Pandoc, Tesseract, FFmpeg, ImageMagick, Ghostscript, TeX Live, OpenJDK, Chromium, jq, and git.
- Commands are terminated after 10 minutes. Keep them bounded, persist intermediate results to disk, and split long jobs into smaller steps.
## Reference Files Location
(This section appears only when the task includes reference files.)
The reference files for the task are available in your environment's file system.
Here are their paths:
- [absolute path to each reference file]
## Completing Your Work
In order to complete the task you must use the `finish` tool to submit your work. If you do not use the `finish` tool you will fail this task!
As a last resort if you really cannot make any meaningful progress, use `abandon_task_finish` with a brief reason instead of submitting files.
**Required in your finish call:**
1. A brief summary of what you accomplished
2. A list of **ABSOLUTE file paths** for the required output files (Do not submit folders).
## Task
Here is the task you need to complete:
[task description]
Please begin working on the task now.- Context Overflow: If the next model call (or the summarization request itself) exceeds the context window, the agent keeps unwinding earlier turns until summarization succeeds.
- Task Completion: To complete the task, the LLM must call the finish tool, providing a summary of the work done and the paths of the files it intends to submit. This tool can be used at any turn.
- Grading: We sample pairwise matches between model submissions in two stages:
- Balanced sampling: We first sample each model diversely, balancing exposure across tasks, judges, and opponents, to seed initial ratings.
- Active sampling: After the initial phase, we transition to Elo-informed sampling that prioritizes pairings between models with similar ratings to derive the most information per comparison. We maintain balanced exposure of tasks within each model throughout the process.
- Submissions are randomly anonymized as Submission A and B to mitigate any model or position bias from the grader model.
- Matches are graded by a panel of three frontier LLM judges from leading labs, each run at its default reasoning settings: GPT-5.5 (medium reasoning), Gemini 3.1 Pro Preview (high reasoning), and Claude Opus 4.8 (high effort). We sample between the judges for each comparison. The initial task, all reference files, and all submission files are parsed and provided as context to the judge.
- Document-based files (.pdf, .docx, .pptx, .xlsx, etc.) are parsed as both text and as images. We extract .zip files and parse each individual file separately. For tasks containing audio or video files, the comparison is routed to Gemini 3.1 Pro Preview, which handles these modalities natively. This context is embedded in a grading prompt that asks the judge to determine which of Submission A and B better responds to the task.
- Final Scoring: Our final Elo score is a Bradley-Terry rating computed via maximum likelihood estimation from all pairwise comparisons (ties counted as half-wins for each side), anchored to human expert deliverables at 1,000 — re-baselining the scale to human performance. The 95% confidence intervals are computed using the sandwich estimator to quantify rating uncertainty.
𝜏³-Banking
- Description: Fintech customer-support domain of the 𝜏-Knowledge framework developed by Sierra, evaluating agents that must coordinate retrieval from a large unstructured knowledge base with multi-step tool-mediated account changes
- Paper: https://arxiv.org/abs/2603.04370
- Blog: sierra.ai/blog/bench-advancing-agent-benchmarking-to-knowledge-and-voice
- Dataset: https://github.com/sierra-research/tau2-bench
- Implementation:
- Agents handle ~700 interconnected policy documents (≈195K tokens, 21 product categories) and must locate the relevant policy, reason over it, and execute a multi-step sequence of tool calls — including tools referenced only in documentation rather than explicitly listed
- We evaluate the full 𝜏³-Banking task suite (97 tasks) with 5 repeats per task and report pass@1 averaged across the repeats
- Outcomes are scored against actual backend database state — for example, whether a dispute was opened or a provisional credit issued — rather than conversational quality
- We use GPT-5.4 Mini (medium reasoning) for both the user simulator and the natural-language assertion judge
- For knowledge retrieval over the banking corpus we enable BM25 lexical search and grep (
bm25_grepmode) inside the original 𝜏-Bench harness - We apply a constraint on execution to limit steps to a maximum of 200 per task repeat (the 𝜏-Knowledge reference default for text-mode runs) and terminate a task early once it reaches a maximum of 10 tool-execution errors
Coding
Terminal-Bench v2.1
- Description: A verified refresh of Terminal-Bench, developed by Stanford University researchers, the Laude Institute, and the open source community. Keeps the same 89 curated tasks across software engineering, system administration, data processing, model training, and security, with environment and instruction fixes that make scores reflect agent capability rather than environment gaps
- Paper: https://arxiv.org/abs/2601.11868
- Leaderboard: tbench.ai/leaderboard/terminal-bench/2.1
- Implementation:
- We evaluate the full Terminal-Bench v2.1 dataset (89 tasks) using the Terminus 2 agent harness in an E2B sandbox environment, with pass@1 scoring averaged over 3 repeats per task
- Each task ships with a verification suite that the agent must satisfy by interacting with the terminal — tasks are considered successful only if every test passes
- We apply the following constraints on evaluations for the agent:
- Maximum 'episodes' (where the model reviews current state and plans a series of next actions at the terminal) are limited to 250
- Per-task agent timeout is set to two hours (7,200 seconds), or the task's own specified timeout where that is longer, well above typical task durations
- In our testing these constraints predominantly limit cases where models are stuck in an unsuccessful loop, and we see no consistent differences in performance due to these constraints
SciCode
- Description: Python programming to solve scientific computing tasks
- Paper: https://arxiv.org/abs/2407.13168
- Dataset: https://scicode-bench.github.io/
- Implementation:
- We test with scientist-annotated background information included in the prompt
- We report sub-problem level scoring
- Pass@1 evaluation criteria
General
AA-LCR
- Description: Evaluate long context performance through testing reasoning capabilities across multiple long documents (~100k tokens measured using cl100k_base tokenizer)
- Implementation:
- 100 hard text-based questions spanning 7 categories of documents (Company Reports, Industry Reports, Government Consultations, Academia, Legal, Marketing Materials, and Survey Reports)
- ~100k tokens (measured using cl100k_base tokenizer) of input per question, requiring models to support a minimum 128K context window to score on this benchmark. ~3M total unique input tokens spanning ~230 documents to run the benchmark (output tokens typically vary by model)
- Model responses are evaluated using Qwen3 235B A22B 2507 Non-Reasoning as an equality checker with pass@1 scoring
AA-Omniscience
- Description: AA-Omniscience is a knowledge and hallucination benchmark that measures factual reliability, rewards precise knowledge, and penalizes incorrect guesses or hallucinations. It provides a detailed assessment of a model’s ability to distinguish known from unknowns across diverse knowledge domains.
- Dataset: https://huggingface.co/datasets/ArtificialAnalysis/AA-Omniscience-Public
- Implementation:
- The benchmark consists of 6,000 questions covering 42 topics, including Business, Humanities and Social Sciences, Health, Law, Software Engineering, and Science, Engineering and Mathematics.
- Models are scored using the AA-Omniscience Index, which assigns points for correct answers, subtracts points for hallucinated responses, and keeps abstentions neutral, rewarding abstentions over incorrect guesses
- Each answer is graded as either
CORRECT,INCORRECT,PARTIALLY_CORRECT, orNOT_ATTEMPTEDbased on the model's response and the ground truth answer. Gemini 3 Flash Preview (Reasoning) is used as the grading model - Intelligence Index Integration: AA-Omniscience contributes two components to the Intelligence Index: (1) Accuracy - the proportion of correct answers, weighted at 8% of the overall Index, and (2) Non-Hallucination Rate - calculated as 1 minus the hallucination rate, weighted at 4% of the overall Index (a 2:1 split of AA-Omniscience's 12% share).
Scientific Reasoning
HLE (Humanity's Last Exam)
- Description: Recent frontier academic benchmark from the Centre for AI Safety (led by Dan Hendrycks)
- Paper: https://arxiv.org/abs/2501.14249v2
- Dataset: https://huggingface.co/datasets/cais/hle
- Implementation:
- 2,158 text-only questions across mathematics, humanities and the natural sciences (from the May 2025 revision which contains 2,500 total questions — we use the text-only subset for maximum comparability across models)
- We note that the HLE authors disclose that their dataset curation process involved adversarial selection of questions based on tests with GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, o1, o1-mini, and o1-preview (latter two for text-only questions only). We therefore discourage direct comparison of these models with models that were not used in the HLE curation process, as the dataset is potentially biased against the models used in the curation process.
- Evaluated with an equality checker LLM prompt, using OpenAI's GPT-4o (Aug) model per the original HLE paper, with pass@1 scoring (find prompt below)
GPQA Diamond (Graduate-Level Google-Proof Q&A Benchmark)
- Description: Scientific knowledge and reasoning benchmark
- Subset: Diamond subset (198 questions) selected for maximum accuracy and discriminative power
- Paper: https://arxiv.org/abs/2311.12022
- Dataset: https://github.com/openai/simple-evals/blob/main/gpqa_eval.py
- Key Details:
- 198 questions covering biology, physics and chemistry - we test the GPQA Diamond subset of the full GPQA dataset (448 questions total), which was defined by the original authors as the highest quality subset, where both experts answer correctly and the majority of non-experts answer incorrectly
- 4 option multiple choice format
- Regex-based answer extraction with pass@1 scoring (prompt and regex below)
CritPt
- Description: Research-level physics reasoning benchmark with unpublished, frontier physics problems spanning a wide range of subfields
- Paper: https://arxiv.org/abs/2509.26574
- Website: https://critpt.com/
- Repository: https://github.com/CritPt-Benchmark/CritPt
- Dataset: https://huggingface.co/datasets/CritPt-Benchmark/CritPt
- Implementation:
- We implement the 'challenge' level components for all 70 test-set challenges (the example challenge is excluded) in collaboration with the CritPt team
- We run 5 repeats for each question with pass@1 scoring
- The models are called with a two-step parsing approach, where the first step requests that the model complete the challenge with reasoning, and the second step formats the response into the expected code format for grading (see example prompt for parsing on the CritPt evaluation page)
- Token usage and cost estimates reflect both steps (reasoning and answer parsing)
- Answer formats include numerical values, symbolic expressions in SymPy, and Python functions (evaluated with test cases)
- The official CritPt grading server is used to assess all challenge responses for correctness. See the Artificial Analysis API documentation for access to the grading API for model evaluations
Additional Evaluation Details
Agents
AA-Briefcase
- Description: AA-Briefcase is a new benchmark for testing models on realistic knowledge work tasks in complex projects built by industry experts. Models are evaluated on multi-week knowledge work projects, each with many linked tasks and thousands of input source files. AA-Briefcase combines rubric and pairwise grading to evaluate verifiable task success, analytical quality, and presentation quality, giving a holistic view of overall agentic capability in knowledge work.
- Example dataset: https://huggingface.co/datasets/ArtificialAnalysis/AA-Briefcase-Lite
- Agent harness: https://github.com/ArtificialAnalysis/Stirrup
- Implementation:
- Each AA-Briefcase scenario is a realistic multi-week business problem, organized as a multi-week workflow that the agent works through in sequence, with 2-5 tasks per week. Although tasks within a scenario share files and context across weeks, models currently complete each task in an independent run, without carrying over their own prior submissions. The agent receives the task description and accessible source files, then produces final deliverable files without live interaction or iterative feedback during execution.
- Scenario source pools include shared files and week-specific files, mixing real, augmented, and synthetic materials. Source files are designed to include realistic professional artifacts such as Slack exports, spreadsheets, PDFs, interview transcripts, market research, standards documents, app-store pages, board materials, emails, and other business records. Later-week tasks may receive standardized base-case files (the same reference work products given to every model), so each task stays independently runnable while preserving continuity across the week.
- Model submissions are run with Stirrup in a week-scoped E2B sandbox.
- Turns: Agents run for up to 500 turns per task.
- Tools: The agent is given a single code-execution tool that runs shell commands and code inside the sandbox, plus the finish tools below (and a view-image tool when the model supports vision). The sandbox has no internet access, so the agent can only use the provided source files.
- Sandbox: Each scenario/week sandbox is built from that week's source files, with standard Python packages and system tools for document processing and scientific computing pre-installed.
- Finish tools: A finish tool, which the agent calls to submit a summary and the absolute paths of its deliverables (validated to be actual files, not directories or missing paths), and an abandon_task_finish (give-up) tool, which it calls with a reason only when it concludes the task is genuinely impossible.
- Prompts: The prompts used across generation and grading:
- Agent system prompt:
You are an AI agent working on a specific task within a multi-week simulated workplace scenario. Each task is part of a longer workflow; your job is to complete the current task using the tools provided in up to 500 steps, then submit your deliverables. When you are done you must call the `finish` tool as your final step, passing a brief summary of what you accomplished and a list of absolute paths for every deliverable file. If you have genuinely concluded that the task cannot be completed — for example because required inputs are missing, a hard dependency is unavailable, or the request itself is incoherent — call the `abandon_task_finish` tool with a brief reason instead. Do not use it to escape difficulty. You cannot interact with the user during the task. Record any clarifying assumptions you made in your finish summary. - Agent task prompt:
<execution_context> ## Sandbox You operate inside an isolated Linux container through the `code_exec` tool, which runs shell commands and lets you read, create, and edit files. Commands run as the unprivileged user `user` (UID 1000), starting from `/home/user`. Passwordless `sudo` exists but is rarely needed, since your home directory is fully writable. Every command runs independently: no working directory, environment variable, or other shell state carries over from one call to the next. Prefer absolute paths for both files and commands, and do not navigate with `cd` across calls — a `cd` in one command is gone by the next, so relying on it leaves you silently operating in the wrong place. When a step genuinely needs a different directory, chain it into the same command (e.g. `cd /home/user/work && python build.py`). ## No network The container has no outbound connectivity, and there is no proxy, allowlist, or flag that can turn it on — treat the environment as permanently offline. Anything that reaches for the internet will fail, including package installs (`pip`, `npm`, `apt`), remote `git` operations, and any HTTP/HTTPS client request from any language. Identify a network block by its error signature rather than by guessing: failed name resolution (`Could not resolve host`, `Temporary failure in name resolution`), an unreachable route (`Network is unreachable`, a refused or timed-out connection to a public host), or a stalled TLS handshake. When you see these, the failure is structural — do not retry the same call and do not hunt for a workaround (mirrors, alternate hosts, cached copies). Re-plan using only what is already installed and what ships inside your workspace. ## Filesystem - Writable: everything under `/home/user/` plus `/tmp`. Use these for deliverables, intermediate files, and caches. - Read-only inputs: - `/home/user/shared/` — reference material shared across the whole scenario - `/home/user/week/` — documents specific to this week's tasks Copy these into a working folder before transforming them rather than editing them in place. ## Runtime A broad scientific-computing and document-processing stack is already installed, so confirm what is present before assuming a gap: - Python 3.13 with the usual data stack (numpy, pandas, polars, scipy), plotting (matplotlib, plotly), the scikit-learn ML family, and document tooling (python-docx, python-pptx, openpyxl, PyMuPDF, pdfplumber, reportlab, weasyprint, Pillow, opencv), plus Playwright. - System tools include LibreOffice, Pandoc, Tesseract, FFmpeg, ImageMagick, Ghostscript, TeX Live, OpenJDK, Chromium, jq, and git. - Check availability with `pip show <pkg>` or `which <tool>` instead of installing — installs fail offline, but almost anything you would reach for is already here. - matplotlib runs headless (`MPLBACKEND=Agg`): write figures to files; never call `plt.show()`. - Commands are terminated after 20 minutes. Keep them bounded, persist intermediate results to disk, and split long jobs into smaller steps. ## Submitting your work Finish by calling the `finish` tool — anything not submitted through it is not graded. Your call must include: 1. A short summary of what you accomplished. 2. Absolute paths to every deliverable (files only, not folders). Save each deliverable directly in `/home/user` under the exact filename the task asks for — not in a subdirectory. Save deliverables as ordinary, visible files. Do not leave the only copy of your work in a dot-prefixed file or directory (e.g. `.submission.txt`, `.outputs/report.md`), including inside an archive; a `.zip` is fine when the task explicitly asks for one. Assume your files will be opened and edited by others after submission, so write them to last. If the task genuinely cannot be completed, call the `abandon_task_finish` tool with a brief reason instead. Use it only when you have concluded the work is impossible — not to escape a difficult task. </execution_context> <scenario_overview> {scenario_overview} </scenario_overview> <week_overview> {week_overview} </week_overview> <task_description> {task} </task_description> <deliverables> Submit these files, by exact name, saved directly in `/home/user`: {expected_output_filenames} </deliverables> Please begin working on the task now. - Binary rubric grading prompt:
You are grading a submitted deliverable against one binary rubric check. The user message contains: - the task instructions, - the rubric item, - the submitted artifact content. Submitted artifacts may appear as text blocks, image blocks, or parser notes for unsupported content. Use only evidence from the submitted artifact content. Do not infer facts from filenames, task instructions, or rubric text unless the submitted artifact content supports them. Beyond the task instructions and rubric in the user message, you only ever receive the submitted artifact itself, never the external source files it cites. Do not fail an item merely because you cannot open or cross-check a cited source — judge citations on whether they are present, specific, and well-formed in the submission, not on whether the source's contents can be independently confirmed. Return a strict binary judgment: - passed=true only if the pass criteria are satisfied. - passed=false if any required element is missing, materially wrong, unsupported, or not evidenced. Write concise reasoning that cites submitted artifact evidence or the absence of evidence. Do not award partial credit.
- Agent system prompt:
- Each task is graded against two styles of checks. Rubric checks are binary pass/fail criteria scored against a single submission. Pairwise checks compare two submissions for the same task and return a preferred submission or tie. There are two kinds: Analytical Quality (which output has deeper, better-structured analysis) and Presentation (which output is more professionally presented).
- All grading and pairwise comparisons are decided by one of a panel of three judges (Claude Opus 4.8 (max), GPT-5.5 (xhigh), and Gemini 3.1 Pro Preview) rather than a single judge, reducing bias toward submissions from the same model or model family. Each rubric verdict and pairwise comparison is decided by one judge sampled from the panel, with sampling balanced across checks and matches. To keep results comparable, a given rubric check is always graded by the same judge. AA-Briefcase Elo is the headline metric of this evaluation: it aggregates analytical quality Elo, presentation Elo, and rubric pass rate, with rubric performance converted into Elo via synthetic head-to-head matches using a maximum-likelihood Elo aggregation.
APEX-Agents-AA
- Description: APEX-Agents-AA is Artificial Analysis' independent implementation of Mercor's APEX-Agents benchmark. It evaluates long-horizon, cross-application agent work in professional services environments spanning investment banking, management consulting, and law
- Paper: https://arxiv.org/abs/2601.14242
- Dataset:
- We base our evaluation on the public APEX-Agents dataset from https://huggingface.co/datasets/mercor/apex-agents
- We evaluate 452 tasks from the public 480-task release (excluding Investment Banking Worlds 244 and 246, which have external runtime dependencies)
- Implementation:
- Each task is run with 3 repeats and scored using pass@1 - a repeat passes only if all rubric items are satisfied, and the leaderboard score is the average pass rate across repeats
- All models are run using our open source agentic harness, Stirrup, with a 200-turn cap per task
- Agents operate inside the Archipelago environment and access workplace tools through MCP servers exposed by its gateway
- The agent starts with a small meta-tool toolbelt and must explicitly manage MCP-backed tools using:
- List Tools – Shows which tools are currently available
- Inspect Tool – Inspects a tool before adding it
- Add Tool – Makes an MCP-backed tool available to the agent
- Remove Tool – Removes tools that are no longer needed
- The agent also receives:
- Todo Write - Creates or updates the agent's todo list. It can either replace the full list or merge updates by todo ID, and all todos must be completed or cancelled before final submission is accepted
- Finish - Submits the agent's final answer together with a completion status. It is the only way to submit a final answer, and only a completed Finish submission proceeds to grading
- MCP tool calls have a 60-second timeout. Tool outputs are truncated when needed to a 24k-token budget using a 20k-character head and 5k-character tail excerpt. Image inputs are compressed to approximately 1 MP before being returned to the model
- Grading is run locally with the Archipelago local file grader. Each repeat is graded against the task rubric using both the final answer submitted through Finish and the filesystem diff between initial and final world snapshots. A repeat passes only if every rubric item is satisfied. Gemini 3 Flash with 'low' reasoning is used as the LLM judge
ITBench-AA
- Status: Standalone evaluation (not part of Artificial Analysis Intelligence Index v4.1)
- Description: ITBench-AA is Artificial Analysis' independent implementation of IBM's ITBench benchmark, evaluating AI agents on Site Reliability Engineering (SRE): Kubernetes incident root-cause analysis.
- Paper: https://arxiv.org/abs/2502.05352
- Repository: https://huggingface.co/datasets/ArtificialAnalysis/ITBench-AA
- Agent harness: https://github.com/ArtificialAnalysis/Stirrup
- Dataset:
- We evaluate 59 Kubernetes incident tasks: 40 from IBM's public ITBench SRE release and 19 private tasks shared with us by the ITBench team. The headline score is averaged across both splits
- Each task is an offline Kubernetes incident snapshot containing alerts, events, traces, metrics, logs, and application topology, baked into a scenario-specific sandbox and mounted under
/home/user
- Implementation:
- Each task is run with 3 repeats. The primary score is precision at full recall a repeat receives 0.0 if it misses any ground-truth root-cause entity; otherwise it receives precision over the submitted entities.
- All models are run using our open-source agentic harness, Stirrup with a 100-turn cap per task. The agent loop informs the language model that its turn limit is approaching during the last 20 turns.
- The agent is given a single
run_shelltool to inspect the snapshot, plus afinishtool to submit its final answer. It must write a structured JSON diagnosis to/home/user/agent_output.jsoncontaining the minimal set of independent root-cause Kubernetes entities responsible for the incident, with reasoning and evidence for each, while excluding downstream symptoms - Grading uses an LLM judge only to normalize submitted
contributing_factorsonto the ground-truth canonical entities and alias groups. - After normalization, ground-truth alias groups are merged into scoring groups, so equivalent entities such as a pod and its corresponding deployment/service count as the same prediction. If any member of an alias group is marked as a root cause, the merged group is scored as a root-cause target; predicting multiple entities in the same alias group only counts once.
- Precision at full recall is computed as
0.0if any root-cause scoring group is missed. If no root-cause groups are missed, it istrue_positives / (true_positives + false_positives), where unmatched predictions and predictions mapped to non-root-cause groups count as false positives. - GPT-5.5 with medium reasoning effort is used as the grader model for comparing the model’s output with the ground truth for each task
- Generation prompt:
**Task**: You are an expert SRE (Site Reliability Engineer) and Kubernetes SRE Support Agent investigating a production incident from OFFLINE snapshot data. ==================================================================== # INCIDENT SNAPSHOT DATA LOCATION ==================================================================== Your incident data and working directory is located in - /home/user The final output must be written to /home/user/agent_output.json Available Python packages: - `drain3==0.9.11` - `numpy==2.4.5` - `pandas==3.0.3` Both `python` and `python3` are available and use the same environment. Your objective is to generate a **JSON diagnosis** identifying the root causes of the incident — the minimal set of independent Kubernetes entities whose failures directly explain the incident. Requirements: - Provide reasoning and evidence for every listed entity. - When the JSON file is ready, call the provided finish tool and submit `/home/user/agent_output.json`. All entities MUST use the format: `namespace/Kind/name` Examples: - `otel-demo/Deployment/ad` (Deployment named "ad" in namespace "otel-demo") - `otel-demo/Service/frontend` (Service named "frontend") - `cluster/Node/worker-node-1` (cluster-scoped resource) DO NOT include UIDs in the entity name. ==================================================================== ## Output Format ==================================================================== Output must consist solely of the final diagnosis in the specified JSON format below — do **not** include any additional text, markdown, or comments: ```json { "contributing_factors": [ { "name": "namespace/Kind/name", "reasoning": "A short, clear, human-readable explanation for why this entity is a root cause. Reference evidence where possible.", "evidence": "Concise summary of supporting facts — relevant alerts, events, logs, traces, or metrics. Plain string." } ] } ``` ==================================================================== # RULES FOR INCLUSION ==================================================================== **Only include an entity if both of the following are true:** 1. **There is qualifying evidence** — it appears in at least one of: a firing alert, a Kubernetes event, an error/warning log line, a metric anomaly, or trace evidence directly tied to the incident window. A passing mention in an unrelated log is not sufficient. 2. **It passes the irreducibility test** — you cannot fully explain its failure by pointing to another entity already in the list. Ask: *"If I remove this entity, does my explanation of the incident become incomplete?"* If yes, include it. If another entity already accounts for it, leave it out. **Do not include** downstream effects, symptoms, or intermediates — only the independent upstream causes. **Example (exhausted ResourceQuota blocking pod scheduling):** Causal chain: ResourceQuota exhausted → ReplicaSet cannot schedule pods → Deployment degraded - ✅ `otel-demo/ResourceQuota/otel-demo-mem-quota` — memory limit exhausted; directly blocks pod creation. Include. - ❌ `otel-demo/ReplicaSet/ad-7f9d4b` — failed only because the quota above was exhausted. Exclude. - ❌ `otel-demo/Deployment/ad` — degraded as a downstream consequence. Exclude. **Multiple entries are allowed only if they are truly independent** — two separate upstream causes that do not explain each other. When in doubt, prefer the most specific Kubernetes object that independently introduced the failure. ==================================================================== # INVESTIGATION WORKFLOW ==================================================================== ### Phase 1 — Context Discovery List available files (alerts, logs, events, topology). ### Phase 2 — Symptom Analysis Read all alert files. Compute: - Start time, End time, Duration, Frequency ### Phase 3 — Hypothesis Generation - Create initial hypotheses (e.g. "checkout pods OOMKilled", "redis latency spike"). - Create a validation plan for each hypothesis. ### Phase 4 — Evidence Collection Loop - Use tools (and generated python code) to gather log, event, metrics, trace evidence. - Validate or refute each hypothesis using real data. - Explain firing alerts as soon as you find supporting evidence. ### Phase 5 — Causal Chain Construction Build a causal chain like `[Config Error] → [CrashLoop] → [Service Down] → [Frontend 5xx]` ### Phase 6 — Conclusion Ensure: - All alerts are explained in the reasoning/evidence for the root causes, but do not add downstream entities only to account for alerts - All included entities pass the irreducibility test - JSON is written to `/home/user/agent_output.json` - Call the finish tool and submit the file - Grading prompt:
You are an expert AI evaluator specializing in Root Cause Analysis (RCA) for complex software systems. You will be provided with: 1. A **Ground Truth (GT)** JSON object containing entity definitions. 2. A **Generated Response** JSON object containing predicted entities. Your job is only to normalize generated entities to ground-truth entities. Ground Truth fields such as `groups`, `aliases`, `filter`, and `kind` may appear either at the top level of `GT` or under `GT.spec`. Treat `GT.spec` as the ground-truth payload when present. ----- ### Normalization Rules Before any downstream scoring can occur, you must accurately normalize entities from the `Generated Response` to the `Ground Truth`. This process must be based on **explicit evidence** from the entity's metadata. You must not infer or guess mappings based on an entity's position in a causal chain. Only normalize entities from `Generated Response.contributing_factors`. An entity from the `Generated Response` can only be mapped to a `Ground Truth` entity if a **Confident Match** can be established. **Definition of a Confident Match:** A generated entity is a confident match to a ground-truth entity only if its `name` field, or other explicit identifying metadata, clearly corresponds to the `filter` and `kind` of a ground-truth entity. **Alias Handling:** The `GT.aliases` field contains arrays of equivalent entity IDs. If a generated entity clearly matches an entity in an alias group, you may normalize it to the matching GT entity ID from that alias group. **Workload Kind Equivalence:** Treat `Deployment` and `Pod` as equivalent for normalization when the namespace and workload name correspond. For example, `otel-demo/Deployment/checkout` is a confident match for a GT `Pod` entity whose filter matches checkout pods in the `otel-demo` namespace. **Entity Name Format:** Generated entities use the format `namespace/Kind/name`. Examples: - `otel-demo/Deployment/flagd` - `otel-demo/Service/frontend` - `otel-demo/Pod/checkout-8546fdc74d-d68cn` Confident match examples: - A generated entity with `name: "otel-demo/Service/adservice"` is a confident match for the GT entity with `id: "ad-service-1"` and `filter: [".*adservice\\\\b"]`. - A generated entity with `name: "otel-demo/Service/adservice"` can match `ad-pod-1` only if the GT alias set makes that link explicit, for example `["ad-pod-1", "ad-service-1"]`. - If `GT.aliases` contains `["load-generator-pod-1", "load-generator-service-1"]`, then normalizing a generated `load-generator-service-1` match to that alias group is valid. - A generated `chaos-mesh/Schedule/...` entity whose name matches a GT filter is a confident match for the spawned chaos resource of any kind, provided name and namespace correspond. - A generated entity with `name: "67cbd7fe98a0776a"` and no other identifying evidence is not a confident match. If a generated entity does not have a confident match, leave it unmatched and set its normalized GT entity ID to `null`. Preserve the original order of the generated `contributing_factors`. ----- ### Output Format Return only a single JSON object with this shape: ```json { "contributing_factor_entities": [ { "submitted_entity_name": "namespace/Kind/name", "normalized_gt_entity_id": "ground-truth-entity-id-or-null", "reasoning": "brief explanation of why this is a confident match or why it is unmatched" } ] } ``` Rules: - Include one item for every generated entity in `contributing_factors`. - Preserve input order. - Use `normalized_gt_entity_id: null` when there is no confident match. - Return only valid JSON. Given the following Ground Truth (GT) and Generated Response, normalize the generated contributing-factor entities to the Ground Truth. ## Ground Truth (GT): ```json {ground_truth} ``` ## Generated Response: ```json {generated_response} ``` ## Task: 1. Look only at `Generated Response.contributing_factors`. 2. For each such entity, determine whether there is a confident match in the Ground Truth. 3. If there is a confident match, return the matched ground-truth entity ID. 4. If there is not a confident match, return `normalized_gt_entity_id: null`. 5. Do not score anything. Return only the normalization result JSON.
Coding
LiveCodeBench
- Status: Standalone evaluation (not part of Artificial Analysis Intelligence Index v4.1)
- Description: Python programming to solve programming scenarios derived from LeetCode, AtCoder, and Codeforces
- Paper: https://arxiv.org/abs/2403.07974
- Dataset: https://huggingface.co/datasets/livecodebench/code_generation_lite
- Key details:
- Pass@1 evaluation criteria
- We do not apply LiveCodeBench custom system prompts
General
IFBench
- Status: Standalone evaluation (not part of Artificial Analysis Intelligence Index v4.1). IFBench was removed from the Intelligence Index in v4.1, but we continue to run it on new model releases.
- Description: A benchmark that evaluates a model's ability to follow precise instructions in a single turn. It tests a wide range of skills, including counting, formatting, and sentence manipulation.
- Paper: https://arxiv.org/abs/2507.02833
- Dataset: https://huggingface.co/datasets/allenai/IFBench_test
- Implementation:
- Uses the single-turn IFBench dataset, which contains 294 questions
- We run 5 repeats for each question with pass@1 scoring
- We evaluate responses using the official source code from allenai/IFBench
- We employ the loose evaluation mode to robustly assess instruction-following, which accounts for extraneous text or formatting by checking several variations of the model's output (e.g., with and without the first and last lines, and with asterisks removed)
- Our score represents the prompt level accuracy (average across all questions and repeats)
- We do not use the multi-turn version of IFBench, which uses a different dataset
MMLU-Pro (Multi-Task Language Understanding Benchmark, Pro version)
- Status: Standalone evaluation (not part of Artificial Analysis Intelligence Index v4.1)
- Description: Comprehensive evaluation of advanced knowledge across domains, adapted from original MMLU
- Paper: https://arxiv.org/abs/2406.01574
- Dataset: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro
- Key details:
- 10 option multiple choice format
- Regex-based answer extraction with pass@1 scoring (prompt and regex below)
Other
Global-MMLU-Lite
- Status: Standalone evaluation (not part of Artificial Analysis Intelligence Index v4.1); powers the Artificial Analysis Multilingual Index
- Description: A lightweight, multilingual version of MMLU designed to evaluate knowledge and reasoning skills across a diverse range of languages and cultural contexts
- Dataset: CohereLabs/Global-MMLU-Lite
- Key details:
- ~6,000 questions (~400 per supported language)
- Multiple Choice (4 options)
- Regex extraction, pass@1
MMMU Pro
- Status: Standalone evaluation (not part of Artificial Analysis Intelligence Index v4.1); a multimodal (visual) reasoning benchmark
- Description: An enhanced MMMU benchmark that eliminates shortcuts and guessing strategies to more rigorously test multimodal models across 30 academic disciplines
- Dataset: MMMU/MMMU_Pro
- Key details:
- 1,730 questions
- Multiple Choice (10 options)
- Regex extraction, pass@1
Legacy Evaluations
Evaluations we have retired or superseded. We keep their methodology here for reference and historical comparability; they are no longer part of the Artificial Analysis Intelligence Index or our active reporting.
Terminal-Bench Hard
- Note: Superseded by Terminal-Bench v2.1, which we use going forward. Terminal-Bench Hard was a constituent of the Artificial Analysis Intelligence Index prior to v4.1
- Description: An agentic benchmark developed by Stanford University researchers, the Laude Institute, and the open source community, released in 2025. Terminal-Bench evaluates the ability of agents and models to solve a wide variety of tasks (including software engineering, system administration, and game-playing scenarios) using a terminal interface
- Page: https://www.tbench.ai/
- Dataset registry: https://www.tbench.ai/registry
- Implementation:
- We implement the 'hard' subset of the terminal-bench-core dataset, with the latest dataset version as of 14 August 2025 (commit 74221fb); we evaluate 44 tasks from this subset (a small number of tasks are excluded due to external dependency issues in the original dataset)
- We evaluate this 'hard' subset using the Terminus 2 agent harness for consistency between models, and score models based on pass@1 scoring with the overall average over 3 repeats for each task
- In the Terminal-Bench framework, each task has a specific suite of tests applied, and are considered successful if all tests pass, or unsuccessful otherwise
- We apply the following constraints on evaluations for the agent:
- Maximum 'episodes' (where the model reviews current state and plans a series of next actions at the terminal) are limited to 100
- We set a global per-task timeout of two hours (7,200 seconds); in practice the 100-episode limit is the binding constraint
- Models are limited to a maximum of 1 million cumulative input tokens per repeat of each task
- In our testing these constraints predominantly limit cases where models are stuck in an unsuccessful loop, and we see no consistent differences in performance due to these constraints
- aimo-airline-departures
- blind-maze-explorer-5x5
- cartpole-rl-training
- chem-property-targeting
- chem-rf
- circuit-fibsqrt
- cobol-modernization
- configure-git-webserver
- cross-entropy-method
- extract-moves-from-video
- feal-differential-cryptanalysis
- feal-linear-cryptanalysis
- form-filling
- git-multibranch
- gpt2-codegolf
- install-windows-xp
- make-doom-for-mips
- make-mips-interpreter
- model-extraction-relu-logits
- movie-helper
- neuron-to-jaxley-conversion
- oom
- organization-json-generator
- parallel-particle-simulator
- parallelize-graph
- password-recovery
- path-tracing
- path-tracing-reverse
- play-zork
- play-zork-easy
- polyglot-rust-c
- prove-plus-comm
- pytorch-model-cli
- rare-mineral-allocation
- recover-obfuscated-files
- reverse-engineering
- run-pdp11-code
- stable-parallel-kmeans
- super-benchmark-upet
- swe-bench-astropy-1
- swe-bench-astropy-2
- train-fasttext
- word2vec-from-scratch
- write-compressor
𝜏²-Bench Telecom
- Note: Superseded by 𝜏³-Banking, which we use going forward. 𝜏²-Bench Telecom was a constituent of the Artificial Analysis Intelligence Index prior to v4.1
- Description: Benchmark developed by Sierra for conversational AI agents in 'dual control' scenarios with language models simulating both agent and user roles to test planning, tool use, and guidance/communication
- Paper: https://arxiv.org/abs/2506.07982
- Blog: sierra.ai/resources/research/tau-squared-bench
- Dataset: https://github.com/sierra-research/tau2-bench
- Implementation:
- The 'telecom' domain introduced in 𝜏²-Bench contains 114 tasks (subsampled from a total 2,285 programmatically generated tasks), with varying 'intents' describing if the task is related to service, mobile data, or MMS issues. We evaluate the telecom domain in full with 3 repeats per task, and report the score using pass@1 scoring as the average of the 3 attempts
- In this benchmark, the outcome 'world state' decides whether the agent succeeded - for example, whether the user's cell phone data is functioning after the agent completes the task
- The full 𝜏²-Bench suite includes 3 execution modes with varying planning and communication levels in ablation studies; we implement the 'default' dual control mode with fully simulated and separate user and assistant agents
- We use Qwen3 235B A22B 2507 (Non-reasoning) for the user agent simulator to ensure consistent checkpoint availability and full control over inference settings alongside strong base intelligence
- We apply a constraint on execution to limit steps to a maximum of 100 per task repeat
MATH-500
- Note: Retired from the Artificial Analysis Intelligence Index and our active reporting.
- Description: A 500-problem subset of the MATH benchmark spanning high-school competition mathematics across a range of subjects and difficulty levels.
- Dataset: huggingface.co/datasets/HuggingFaceH4/MATH-500
AIME 2025 (American Invitational Mathematics Examination)
- Note: Retired from our active reporting; no longer part of Artificial Analysis Intelligence Index v4.1.
- Description: Advanced mathematical problem-solving dataset from the 2025 American Invitational Mathematics Examination.
- Dataset: 2025 AIME I & 2025 AIME II
- Key details:
- Strict numerical answer format (integer 1–999)
- Pass@1 scoring with 10 repeats per question
- Script-based grading with SymPy normalization + equality checker LLM as backup
Prompt Templates, Answer Extraction and Evaluation
Multiple Choice Questions (GPQA, MMLU-Pro)
We prompt multi-choice evals with the following instruction prompt. This prompt was independently developed by Artificial Analysis, and carefully validated with various ablation studies. We assess that this prompt is a clearer, and therefore fairer, approach than traditional completion-style multi-choice evaluation methodologies or other instruction prompts we tested.
GPQA uses four options (A–D). MMLU-Pro uses ten options (A–J); we use the same structure with additional choices.
Answer the following multiple choice question. The last line of your response should be in the following format: 'Answer: A/B/C/D' (e.g. 'Answer: A').
{Question}
A) {A}
B) {B}
C) {C}
D) {D}Answer the following multiple choice question. The last line of your response should be in the following format: 'Answer: A/B/C/D/E/F/G/H/I/J' (e.g. 'Answer: A').
{Question}
A) {A}
B) {B}
C) {C}
D) {D}
E) {E}
F) {F}
G) {G}
H) {H}
I) {I}
J) {J}Multiple Choice Extraction Regex
We extract multiple choice answers using a multi-stage approach to handle various answer formats. For single-letter responses, we use the letter directly. Otherwise, we first attempt to match our primary pattern that looks for formal "Answer: X" format (accounting for optional markdown formatting):
Primary Pattern:
(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])If the primary pattern fails, we attempt the following fallback patterns in sequence to catch various answer formats:
- LaTeX boxed notation (e.g., \boxed{A} or \boxed{The answer is A})
\boxed\{[^}]*([A-Z])[^}]*\} - Natural language (e.g., "answer is B")
answer is ([a-zA-Z]) - With parenthesis (e.g., "answer is (C")
answer is \\(([a-zA-Z]) - Choice format (e.g., "D) some answer text")
([A-Z])\)\s*[^A-Z]* - Explicit statement (e.g., "E is the correct answer")
([A-Z])\s+is\s+the\s+correct\s+answer - Standalone letter at end of response
([A-Z])\s*$ - Letter followed by period (e.g., "F.")
([A-Z])\s*\. - Letter followed by non-word character
([A-Z])\s*[^\w]
We always take the last match found to account for self-correction in responses.
Equality Checker LLM
For evaluations with open-ended answers (HLE, AA-LCR), we use an equality checker LLM to determine if a model's response is semantically equivalent to the correct answer. This approach uses a language model to assess whether two answers have the same meaning, even if phrased differently. The equality checker evaluates semantic equivalence rather than requiring exact string matches, which is particularly important for questions where multiple valid phrasings exist.
Different evaluations use different models as equality checkers, selected based on validation against human judgement. Specific equality checker prompts for each evaluation are detailed in the sections below.
HLE (Humanity's Last Exam)
We prompt HLE with the following text as a system prompt, with a difference between Exact Answer and Multiple Choice questions - these prompts are used with no changes from the original implementation by Hendrycks et al. For models that do not support a system prompt, we prepend the system prompt to the user prompt.
System Prompt for Exact Answer Questions:
"Your response should be in the following format:
Explanation: {your explanation for your final answer}
Exact Answer: {your succinct, final answer}
Confidence: {your confidence score between 0% and 100% for your answer}"
System Prompt for Multiple Choice Questions:
"Your response should be in the following format:
Explanation: {your explanation for your answer choice}
Answer: {your chosen answer}
Confidence: {your confidence score between 0% and 100% for your answer}"HLE Equality Checker Prompt
We use the following prompt on GPT-4o (Aug 2024) to check if the response answer is correct - this prompt is again used with no changes from the original implementation by Hendrycks et al.
Judge whether the following [response] to [question] is correct or not based on the precise and unambiguous [correct_answer] below.
[question]: {question}
[response]: {response}
Your judgement must be in the format and criteria specified below:
extracted_final_answer: The final exact answer extracted from the [response]. Put the extracted answer as 'None' if there is no exact, final answer to extract from the response.
[correct_answer]: {correct_answer}
reasoning: Explain why the extracted_final_answer is correct or incorrect based on [correct_answer], focusing only on if there are meaningful differences between [correct_answer] and the extracted_final_answer. Do not comment on any background to the problem, do not attempt to solve the problem, do not argue for any answer different than [correct_answer], focus only on whether the answers match.
correct: Answer 'yes' if extracted_final_answer matches the [correct_answer] given above, or is within a small margin of error for numerical problems. Answer 'no' otherwise, i.e. if there if there is any inconsistency, ambiguity, non-equivalency, or if the extracted answer is incorrect.
confidence: The extracted confidence score between 0|\%| and 100|\%| from [response]. Put 100 if there is no confidence score available.LCR Equality Checker Prompt
Assess whether the following CANDIDATE ANSWER is CORRECT or INCORRECT.
For the CANDIDATE ANSWER to be correct, it must be consistent with the OFFICIAL ANSWER.
The question, for reference only: {question}
The OFFICIAL ANSWER: {official_answer}
CANDIDATE ANSWER TO ASSESS: {candidate_answer}
Reply only with CORRECT or INCORRECT.Mathematical Questions (AIME 2025)
We prompt AIME with the following instruction prompt:
Solve the following math problem step by step. Put your answer inside \\boxed{{}}.
{Question}
Remember to put your answer inside \\boxed{{}}.Mathematical Equality Checker Prompt
As described above, we supplement our script-based grading with a language model equality checker. We use the following prompt with Llama 3.3 70B to check if two answers are equivalent. This prompt was developed by OpenAI and released in their simple-evals repository.
Look at the following two expressions (answers to a math problem) and judge whether they are equivalent. Only perform trivial simplifications
Examples:
Expression 1: $2x+3$
Expression 2: $3+2x$
Yes
Expression 1: 3/2
Expression 2: 1.5
Yes
Expression 1: $x^2+2x+1$
Expression 2: $y^2+2y+1$
No
Expression 1: $x^2+2x+1$
Expression 2: $(x+1)^2$
Yes
Expression 1: 3245/5
Expression 2: 649
No
(these are actually equal, don't mark them equivalent if you need to do nontrivial simplifications)
Expression 1: 2/(-3)
Expression 2: -2/3
Yes
(trivial simplifications are allowed)
Expression 1: 72 degrees
Expression 2: 72
Yes
(give benefit of the doubt to units)
Expression 1: 64
Expression 2: 64 square feet
Yes
(give benefit of the doubt to units)
---
YOUR TASK
Respond with only "Yes" or "No" (without quotes). Do not include a rationale.
Expression 1: %(expression1)s
Expression 2: %(expression2)s
Code Generation Tasks
SciCode
We prompt SciCode with the following prompt, used with no changes from the original implementation of the Scientist Annotated Background prompt by Tian et al.
PROBLEM DESCRIPTION:
You will be provided with problem steps along with background knowledge necessary for solving the problem. Your task will be to develop a Python solution focused on the next step of the problem-solving process.
PROBLEM STEPS AND FUNCTION CODE:
Here, you'll find the Python code for the initial steps of the problem-solving process. This code is integral to building the solution.
{problem_steps_str}
NEXT STEP - PROBLEM STEP AND FUNCTION HEADER:
This part will describe the next step in the problem-solving process. A function header will be provided, and your task is to develop the Python code for this next step based on the provided description and function header.
{next_step_str}
DEPENDENCIES:
Use only the following dependencies in your solution. Do not include these dependencies at the beginning of your code.
{dependencies}
RESPONSE GUIDELINES:
Now, based on the instructions and information provided above, write the complete and executable Python program for the next step in a single block.
Your response should focus exclusively on implementing the solution for the next step, adhering closely to the specified function header and the context provided by the initial steps.
Your response should NOT include the dependencies and functions of all previous steps. If your next step function calls functions from previous steps, please make sure it uses the headers provided without modification.
DO NOT generate EXAMPLE USAGE OR TEST CODE in your response. Please make sure your response python code in format of ```python```.LiveCodeBench
We prompt LiveCodeBench with the following prompt, used with no changes from the original implementation of the LiveCodeBench prompt by the original team. We note, however, that we do not apply the custom system prompts that the LiveCodeBench team uses - we do not use their generic sytem prompt nor their custom system prompts for certain models.
Questions with starter code:
### Question:
{question.question_content}
### Format: You will use the following starter code to write the solution to the problem and enclose your code within delimiters.
```python
{question.starter_code}
```
### Answer: (use the provided format with backticks)
Questions without starter code:
### Question:
{question.question_content}
### Format: Read the inputs from stdin solve the problem and write the answer to stdout (do not directly test on the sample inputs). Enclose your code within delimiters as follows. Ensure that when the python program runs, it reads the inputs, runs the algorithm and writes output to STDOUT.
```python
# YOUR CODE HERE
```
### Answer: (use the provided format with backticksCode Extraction Regex
We extract the code from the response using the following regex:
(?<=```python\n)((?:\n|.)+?)(?=\n```)Version History
Version 4.1
June 2026—current
- Upgraded GDPval-AA to GDPval-AA v2: upgraded sandbox with new and expanded dependencies, Elo scores re-baselined to human expert performance at 1000, panel of three frontier LLM judges, and turn limits expanded to 250 turns with the ability to exit early
- Replaced Terminal-Bench Hard with Terminal-Bench v2.1 (higher turn limits, no token limits)
- Replaced 𝜏²-Bench Telecom with 𝜏³-Banking
- Removed IFBench from the Intelligence Index (we continue to run it on new model releases)
- Adjusted category weights to further emphasize agentic tasks: Agents (34%), Coding (24%), Scientific Reasoning (24%), General (18%), with AA-Omniscience split into Accuracy (8%) and Non-Hallucination (4%) components
- Upgraded token and cost metrics to better reflect real costs, including cache hit rates and cache token pricing
Version 4.0.4
March 2026—June 2026
- Updated grader model for GDPval-AA to Gemini 3.1 Pro Preview after deprecation of previous grader model Gemini 3 Pro Preview
Version 4.0.3
February 2026—March 2026
- Updated grader model for Omniscience to Gemini 3 Flash Preview (Reasoning) after deprecation of previous grader model Gemini 2.5 Flash (09-2025) (Reasoning)
Version 4.0.2
January 2026—February 2026
- Re-anchored GDPval-AA Elo scores in the Intelligence Index to the latest values following a revision to improve robustness to rare code sandbox failures
Version 4.0.1
January 2026
- Refined Terminal-Bench Hard evaluation to 44 tasks, removing a small set of tasks due to external dependency issues in original dataset at the pinned commit
Version 4.0
January 2026
- Added GDPval-AA (real-world knowledge work)
- Added AA-Omniscience (knowledge and hallucination)
- Added CritPt (physics reasoning)
- Removed MMLU-Pro, LiveCodeBench, AIME 2025 from Intelligence Index
- New category-based weighting structure: Agents (25%), Coding (25%), General (25%), Scientific Reasoning (25%)
Version 3.0
2 September 2025—December 2025
- Added Terminal-Bench Hard (agentic workflows)
- Added 𝜏²-Bench Telecom (agentic workflows)
- Included MMLU-Pro and LiveCodeBench in Intelligence Index
- Updated weightings
Version 2.2
6 August 2025—1 September 2025
- Added Artificial Analysis Long Context Reasoning
- Updated weightings
Version 2.1
5 August 2025—6 August 2025
- Added IFBench
- Added AIME 2025
- Removed MATH-500
- Removed AIME 2024
- Updated weightings
Version 2.0
11 February 2025—4 August 2025
Version 1.0—1.3
January 2024—10 February 2025