All evaluations

ITBench-AA Benchmark Leaderboard

Artificial Analysis' implementation of IBM's ITBench benchmark, testing AI agents on Kubernetes incident root-cause analysis from offline incident snapshots. The agent inspects alerts, events, traces, and topology and identifies the contributing-factor entities (deployments, pods, namespaces, network policies, etc.) responsible for the failure.
See example tasks

ITBench-AA is Artificial Analysis' independent implementation of IBM's ITBench benchmark, evaluating AI agents on Site Reliability Engineering (SRE): Kubernetes incident root-cause analysis.
We evaluate 59 Kubernetes incident tasks (40 from IBM's public release and 19 private shared with us by the ITBench team), with 3 repeats per task. In each scenario the agent is given an offline snapshot of a Kubernetes incident (alerts, events, traces, metrics, and application topology) and must produce a structured JSON diagnosis identifying the root cause entities (deployments, services, pods, namespaces, network policies, etc.) responsible for the failure.
The eval is run by an agent driven by Stirrup, our open-source agent harness, in a sandboxed code-execution environment with the per-task Kubernetes snapshot mounted into the agent's workspace. The agent is given a shell access to inspect the snapshot and write its final diagnosis to a JSON output file, which is then scored against the ground truth. The headline score is the average precision at full recall. For each task, the model scores TP / (TP + FP) when there are zero false negatives, and 0 otherwise.

All evaluations are conducted independently by Artificial Analysis. More information can be found on our Intelligence Benchmarking Methodology page.

Publication

View on arXiv

ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks

Saurabh Jha, Rohan Arora, Yuji Watanabe, Takumi Yanagawa, Yinfang Chen, Jackson Clark, Bhavya Bhavya, Mudit Verma, Harshit Kumar, Hirokuni Kitahara, Noah Zheutlin, Saki Takano, Divya Pathak, Felix George, Xinbo Wu, Bekir O. Turkkan, Gerard Vanloo, Michael Nidd, Ting Dai, Oishik Chatterjee, and .

Realizing the vision of using AI agents to automate critical IT tasks depends on the ability to measure and understand effectiveness of proposed solutions. We introduce ITBench, a framework that offers a systematic methodology for benchmarking AI agents to address real-world IT automation tasks. Our initial release targets three key areas: Site Reliability Engineering (SRE), Compliance and Security Operations (CISO), and Financial Operations (FinOps). The design enables AI researchers to understand the challenges and opportunities of AI agents for IT automation with push-button workflows and interpretable metrics. ITBench includes an initial set of 94 real-world scenarios, which can be easily extended by community contributions. Our results show that state-of-the-art models face significant difficulties with ITBench, with the best-performing agent achieving only 13.8% resolution rate for SRE scenarios, 25.2% for CISO scenarios, and a notable 0% for FinOps. We expect ITBench to be a key enabler of AI-driven IT automation that is correct, safe, and fast.

ITBench-AA Average precision at full recall

Claude Opus 4.7 (Adaptive Reasoning, Max Effort) scores the highest on ITBench-AA Average precision at full recall with a score of 46.7%, followed by GPT-5.5 (xhigh) with a score of 45.8%, and Qwen3.7 Max with a score of 42.5%

Results

ITBench-AA Benchmark Leaderboard: Results

Combined public and private Site Reliability Engineering (SRE) tasks 路 Independently benchmarked by Artificial Analysis

ITBench-AA: Average Turns per Task

Average number of agent turns used per task 路 100-turn cap 路 For Site Reliability Engineering (SRE) tasks
Reasoning models are indicated by a lightbulb icon

ITBench-AA: Output Tokens per Task

Mean output tokens per task 路 Split by reasoning and answer tokens 路 Lower is better 路 For Site Reliability Engineering (SRE) tasks

Score vs. Release Date

ITBench-AA Benchmark Leaderboard: Score vs. Release Date

Most attractive region
Alibaba
Anthropic
DeepSeek
Google
Kimi
MiniMax
OpenAI
xAI
Xiaomi
Z AI

Example Tasks

Explore Evaluations

Artificial Analysis Intelligence IndexArtificial Analysis Intelligence Index

A composite benchmark aggregating ten challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.

GDPval-AA LeaderboardGDPval-AA Leaderboard

GDPval-AA is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with Elo ratings derived from blind pairwise comparisons.

APEX-Agents-AA Benchmark LeaderboardAPEX-Agents-AA Benchmark Leaderboard

Artificial Analysis' implementation of the APEX-Agents benchmark, testing AI agents on long-horizon, cross-application tasks in professional-services environments with realistic application tooling.

饾湉虏-Bench Telecom Benchmark Leaderboard饾湉虏-Bench Telecom Benchmark Leaderboard

A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.

Terminal-Bench Hard Benchmark LeaderboardTerminal-Bench Hard Benchmark Leaderboard

An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks.

SciCode Benchmark LeaderboardSciCode Benchmark Leaderboard

A scientist-curated coding benchmark featuring 288 test set subproblems from 80 laboratory problems across 16 scientific disciplines.

Artificial Analysis Long Context Reasoning Benchmark LeaderboardArtificial Analysis Long Context Reasoning Benchmark Leaderboard

A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).

AA-Omniscience: Knowledge and Hallucination BenchmarkAA-Omniscience: Knowledge and Hallucination Benchmark

A benchmark measuring factual recall and hallucination across various economically relevant domains.

IFBench Benchmark LeaderboardIFBench Benchmark Leaderboard

A benchmark evaluating precise instruction-following generalization on 58 diverse, verifiable out-of-domain constraints that test models' ability to follow specific output requirements.

Humanity's Last Exam Benchmark LeaderboardHumanity's Last Exam Benchmark Leaderboard

A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities, designed to be the final closed-ended academic evaluation.

GPQA Diamond Benchmark Leaderboard

The most challenging 198 questions from GPQA, where PhD experts achieve 65% accuracy but skilled non-experts only reach 34% despite web access.

CritPt Benchmark LeaderboardCritPt Benchmark Leaderboard

A benchmark designed to test LLMs on research-level physics reasoning tasks, featuring 71 composite research challenges.

ITBench-AA Benchmark LeaderboardITBench-AA Benchmark Leaderboard

Artificial Analysis' implementation of IBM's ITBench benchmark, testing AI agents on Kubernetes incident root-cause analysis from offline incident snapshots. The agent inspects alerts, events, traces, and topology and identifies the contributing-factor entities (deployments, pods, namespaces, network policies, etc.) responsible for the failure.

Artificial Analysis Openness IndexArtificial Analysis Openness Index

A composite measure providing an industry standard to communicate model openness for users and developers.

MMLU-Pro Benchmark LeaderboardMMLU-Pro Benchmark Leaderboard

An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.

Global-MMLU-Lite Benchmark LeaderboardGlobal-MMLU-Lite Benchmark Leaderboard

A lightweight, multilingual version of MMLU, designed to evaluate knowledge and reasoning skills across a diverse range of languages and cultural contexts.

LiveCodeBench Benchmark LeaderboardLiveCodeBench Benchmark Leaderboard

A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces, evaluating code generation, self-repair, and execution.

MATH-500 Benchmark LeaderboardMATH-500 Benchmark Leaderboard

A 500-problem subset from the MATH dataset, featuring competition-level mathematics across six domains including algebra, geometry, and number theory.

AIME 2025 Benchmark LeaderboardAIME 2025 Benchmark Leaderboard

All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000-999.

MMMU-Pro Benchmark LeaderboardMMMU-Pro Benchmark Leaderboard

An enhanced MMMU benchmark that eliminates shortcuts and guessing strategies to more rigorously test multimodal models across 30 academic disciplines.