GDPval-AA Leaderboard

GDPval-AA is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with Elo ratings derived from blind pairwise comparisons.

GDPval-AA uses 220 tasks, the publicly released gold subset of GDPval's 1,320-task full set, developed by OpenAI in collaboration with industry professionals to reflect real-world complexity.
The benchmark requires models to produce diverse outputs including documents, slides, diagrams, and spreadsheets, mirroring actual work products across finance, healthcare, legal, and other professional domains.

All evaluations are conducted independently by Artificial Analysis. More information can be found on our Intelligence Benchmarking Methodology page.

Publication

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, Jerry Tworek.

We introduce GDPval, a benchmark designed to evaluate AI models on real-world, economically valuable tasks across 44 occupations. The dataset encompasses 1,320 tasks derived from nine major industries contributing significantly to the U.S. GDP. These tasks were developed in collaboration with industry professionals averaging 14 years of experience, ensuring they accurately represent real-world complexities. The evaluation requires models to produce diverse outputs, including documents, slides, diagrams, and spreadsheets, mirroring actual work products. Initial results indicate that frontier AI models are approaching the quality of work produced by human experts, with models able to perform certain professional tasks approximately 100 times faster and at a fraction of the cost compared to human experts.

GDPval-AA

Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) scores highest on GDPval-AA with an Elo of 1932, followed by Claude Opus 4.8 (Adaptive Reasoning, Max Effort) at 1890 and GPT-5.5 (xhigh) at 1769.
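
To give a feel for what these gaps mean, the sketch below converts an Elo difference into an expected head-to-head preference rate. It assumes the conventional logistic Elo model with a 400-point scale; the page does not state which scale GDPval-AA uses, so the `scale` parameter and the function name are illustrative assumptions rather than the published method.

```python
def expected_win_probability(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Chance that A's output is preferred over B's under the standard logistic Elo model.

    The 400-point scale is the common convention and is an assumption here.
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / scale))


# Elo values taken from the leaderboard summary above.
first, second, third = 1932, 1890, 1769

p_first_vs_second = expected_win_probability(first, second)
p_first_vs_third = expected_win_probability(first, third)

print(f"1st vs 2nd: {p_first_vs_second:.2f}")
print(f"1st vs 3rd: {p_first_vs_third:.2f}")
```

Under that assumption, the 42-point lead of the top model over the second corresponds to being preferred in roughly 56% of matchups, and the 163-point lead over the third to roughly 72%.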

GDPval-AA Elo

[Chart: GDPval-AA Leaderboard. Elo scores for agentic performance on real-world work tasks using web and shell access via Stirrup, an open-source harness developed by Artificial Analysis. Series: Stirrup Agent Harness, AI Chatbot.]

Chatbots

[Chart: GDPval-AA: AI Chatbots. Elo scores for AI chatbots tested in the GDPval-AA evaluation.]

Score Comparisons

[Chart: GDPval-AA: Elo vs. Artificial Analysis Intelligence Index. Axes: GDPval-AA Elo, Artificial Analysis Intelligence Index. The most attractive quadrant is highlighted.]

Artificial Analysis Intelligence Index v4.0 includes: GDPval-AA, 𝜏²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, CritPt. See Intelligence Index methodology for further details, including a breakdown of each evaluation and how we run them.

Output Tokens

[Chart: GDPval-AA: Output Token Usage. Output tokens used to run the evaluation. Reasoning models are indicated by a lightbulb icon.]

The total number of tokens used to run the evaluation, including input tokens (prompt), reasoning tokens (for reasoning models), and answer tokens (final response).

Average Turns

[Chart: GDPval-AA: Average Turns per Task. Average number of turns per task. Reasoning models are indicated by a lightbulb icon.]

Score vs. Release Date

[Chart: GDPval-AA: Elo vs. Release Date. The most attractive region is highlighted.]

GDPval-AA Leaderboard

| Rank | Creator | Name | Elo | CI | Release Date |
|---|---|---|---|---|---|
| 1 | Anthropic | Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | 1932 | -36 / +39 | Jun 2026 |
| 2 | Anthropic | Claude Opus 4.8 (Adaptive Reasoning, Max Effort) | 1890 | -34 / +35 | May 2026 |
| 3 | OpenAI | GPT-5.5 (xhigh) | 1769 | -32 / +31 | Apr 2026 |
| 4 | Anthropic | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | 1753 | -41 / +40 | Apr 2026 |
| 5 | OpenAI | GPT-5.5 (high) | 1747 | -26 / +29 | Apr 2026 |
| 6 | Anthropic | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | 1676 | -26 / +29 | Feb 2026 |
| 7 | OpenAI | GPT-5.4 (xhigh) | 1674 | -34 / +32 | Mar 2026 |
| 8 | Anthropic | Claude Opus 4.7 (Non-reasoning, High Effort) | 1670 | -26 / +28 | Apr 2026 |
| 9 | MiniMax | MiniMax-M3 | 1668 | -26 / +29 | Jun 2026 |
| 10 | Google | Gemini 3.5 Flash (medium) | 1657 | -28 / +31 | May 2026 |
| 11 | Google | Gemini 3.5 Flash (high) | 1656 | -26 / +30 | May 2026 |
| 12 | OpenAI | GPT-5.5 (medium) | 1648 | -25 / +27 | Apr 2026 |
| 13 | Anthropic | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) | 1619 | -31 / +33 | Feb 2026 |
| 14 | Anthropic | Claude Sonnet 4.6 (Non-reasoning, High Effort) | 1596 | -25 / +26 | Feb 2026 |
| 15 | Anthropic | Claude Opus 4.6 (Non-reasoning, High Effort) | 1587 | -24 / +25 | Feb 2026 |
| 16 | Xiaomi | MiMo-V2.5-Pro | 1571 | -27 / +28 | Apr 2026 |
| 17 | DeepSeek | DeepSeek V4 Pro (Reasoning, High Effort) | 1558 | -29 / +31 | Apr 2026 |
| 18 | DeepSeek | DeepSeek V4 Pro (Reasoning, Max Effort) | 1554 | -29 / +29 | Apr 2026 |
| 19 | Xiaomi | MiMo-V2.5 | 1545 | -23 / +26 | Apr 2026 |
| 20 | Alibaba | Qwen3.7 Max | 1541 | -22 / +27 | May 2026 |
| 21 | Z AI | GLM-5.1 (Reasoning) | 1535 | -0 / +0 | Apr 2026 |
| 22 | Alibaba | Qwen3.7 Plus | 1517 | -23 / +26 | Jun 2026 |
| 23 | MiniMax | MiniMax-M2.7 | 1505 | -24 / +26 | Mar 2026 |
| 24 | Alibaba | Qwen3.6 Max Preview | 1504 | -20 / +21 | Apr 2026 |
| 25 | OpenAI | GPT-5.4 (low) | 1503 | -27 / +29 | Mar 2026 |
| 26 | xAI | Grok 4.3 (high) | 1495 | -25 / +23 | Apr 2026 |
| 27 | Z AI | GLM-5-Turbo | 1494 | -22 / +23 | Mar 2026 |
| 28 | Z AI | GLM-5.1 (Non-reasoning) | 1488 | -25 / +28 | Apr 2026 |
| 29 | Kimi | Kimi K2.6 | 1481 | -25 / +26 | Apr 2026 |
| 30 | DeepSeek | DeepSeek V4 Pro (Non-reasoning) | 1480 | -26 / +27 | Apr 2026 |
| 31 | OpenAI | GPT-5.3 Codex (xhigh) | 1479 | -25 / +24 | Feb 2026 |
| 32 | OpenAI | GPT-5.2 (xhigh) | 1466 | -24 / +26 | Dec 2025 |
| 33 | Anthropic | Claude Sonnet 4.6 (Non-reasoning, Low Effort) | 1458 | -22 / +23 | Feb 2026 |
| 34 | Anthropic | Claude Opus 4.5 (Reasoning) | 1446 | -23 / +24 | Nov 2025 |
| 35 | OpenAI | GPT-5.5 (low) | 1440 | -23 / +25 | Apr 2026 |
| 36 | Google | Gemini 3.5 Flash (minimal) | 1438 | -24 / +26 | May 2026 |
| 37 | OpenAI | GPT-5.4 mini (xhigh) | 1438 | -23 / +26 | Mar 2026 |
| 38 | Anthropic | Claude Opus 4.5 (Non-reasoning) | 1418 | -20 / +23 | Nov 2025 |
| 39 | Meta | Muse Spark | 1417 | -23 / +23 | Apr 2026 |
| 40 | DeepSeek | DeepSeek V4 Flash (Reasoning, High Effort) | 1414 | -25 / +26 | Apr 2026 |
| 41 | OpenAI | GPT-5.2 (medium) | 1404 | -21 / +23 | Dec 2025 |
| 42 | Xiaomi | MiMo-V2-Pro | 1404 | -24 / +24 | Mar 2026 |
| 43 | Alibaba | Qwen3.6 27B (Reasoning) | 1402 | -23 / +25 | Apr 2026 |
| 44 | DeepSeek | DeepSeek V4 Flash (Non-reasoning) | 1395 | -23 / +27 | Apr 2026 |
| 45 | Z AI | GLM-5 (Reasoning) | 1391 | -22 / +24 | Feb 2026 |
| 46 | DeepSeek | DeepSeek V4 Flash (Reasoning, Max Effort) | 1388 | -20 / +34 | Apr 2026 |
| 47 | Alibaba | Qwen3.6 27B (Non-reasoning) | 1384 | -23 / +23 | Apr 2026 |
| 48 | NVIDIA | Nemotron 3 Ultra 550B A55B (Reasoning) | 1380 | -23 / +25 | Jun 2026 |
| 49 | Alibaba | Qwen3.6 Plus | 1350 | -21 / +25 | Apr 2026 |
| 50 | Xiaomi | MiMo-V2-Omni-0327 | 1347 | -24 / +23 | Mar 2026 |
| 51 | OpenAI | GPT-5.4 (Non-reasoning) | 1342 | -26 / +26 | Mar 2026 |
| 52 | Z AI | GLM 5V Turbo (Reasoning) | 1328 | -22 / +23 | Apr 2026 |
| 53 | Kimi | Kimi K2.6 (Non-reasoning) | 1326 | -26 / +27 | Apr 2026 |
| 54 | Google | Gemini 3 Deep Think | 1324 | -30 / +31 | Feb 2026 |
| 55 | Z AI | GLM-5 (Non-reasoning) | 1323 | -24 / +22 | Feb 2026 |
| 56 | - | Claude Pro - 4.5 Opus (Extended Thinking) | 1319 | -41 / +38 | - |
| 57 | Anthropic | Claude 4.5 Sonnet (Reasoning) | 1316 | -21 / +25 | Sep 2025 |
| 58 | Xiaomi | MiMo-V2-Omni | 1316 | -22 / +23 | Mar 2026 |
| 59 | OpenAI | GPT-5.4 mini (medium) | 1315 | -20 / +24 | Mar 2026 |
| 60 | Google | Gemini 3.1 Pro Preview | 1314 | -26 / +27 | Feb 2026 |
| 61 | OpenAI | GPT-5.5 (Non-reasoning) | 1312 | -23 / +27 | Apr 2026 |
| 62 | xAI | Grok 4.3 (medium) | 1311 | -24 / +24 | Apr 2026 |
| 63 | Anthropic | Claude 4.5 Sonnet (Non-reasoning) | 1306 | -21 / +23 | Sep 2025 |
| 64 | xAI | Grok 4.3 (Non-reasoning) | 1302 | -23 / +25 | Apr 2026 |
| 65 | StepFun | Step 3.7 Flash | 1300 | -27 / +26 | May 2026 |
| 66 | Xiaomi | MiMo-V2.5-Pro (Non-reasoning) | 1298 | -22 / +23 | Apr 2026 |
| 67 | Alibaba | Qwen3.6 35B A3B (Reasoning) | 1297 | -24 / +24 | Apr 2026 |
| 68 | OpenAI | GPT-5 (high) | 1291 | -20 / +23 | Aug 2025 |
| 69 | OpenAI | GPT-5.2 Codex (xhigh) | 1287 | -24 / +27 | Dec 2025 |
| 70 | Kimi | Kimi K2.5 (Reasoning) | 1283 | -22 / +23 | Jan 2026 |
| 71 | Kimi | Kimi K2.5 (Non-reasoning) | 1265 | -22 / +23 | Jan 2026 |
| 72 | Tencent | Hy3-preview (Reasoning) | 1235 | -22 / +23 | Apr 2026 |
| 73 | OpenAI | GPT-5.1 (high) | 1228 | -20 / +22 | Nov 2025 |
| 74 | Tencent | Hy3-preview (Non-reasoning) | 1226 | -25 / +25 | Apr 2026 |
| 75 | Alibaba | Qwen3.6 35B A3B (Non-reasoning) | 1226 | -21 / +23 | Apr 2026 |
| 76 | OpenAI | GPT-5.2 (Non-reasoning) | 1222 | -23 / +22 | Dec 2025 |
| 77 | Alibaba | Qwen3.5 397B A17B (Non-reasoning) | 1216 | -20 / +23 | Feb 2026 |
| 78 | OpenAI | GPT-5 Codex (high) | 1216 | -22 / +23 | Sep 2025 |
| 79 | Google | Gemini 3 Flash Preview (Reasoning) | 1204 | -24 / +24 | Dec 2025 |
| 80 | OpenAI | GPT-5.4 nano (medium) | 1198 | -21 / +23 | Mar 2026 |
| 81 | DeepSeek | DeepSeek V3.2 (Reasoning) | 1197 | -24 / +22 | Dec 2025 |
| 82 | OpenAI | GPT-5.4 nano (xhigh) | 1195 | -23 / +24 | Mar 2026 |
| 83 | OpenAI | GPT-5.1 Codex (high) | 1191 | -26 / +23 | Nov 2025 |
| 84 | Alibaba | Qwen3.5 397B A17B (Reasoning) | 1190 | -23 / +22 | Feb 2026 |
| 85 | OpenAI | GPT-5 mini (high) | 1186 | -23 / +23 | Aug 2025 |
| 86 | Google | Gemini 3 Pro Preview (high) | 1184 | -22 / +22 | Nov 2025 |
| 87 | Alibaba | Qwen3.5 Omni Plus | 1183 | -20 / +22 | Mar 2026 |
| 88 | Z AI | GLM-4.7 (Reasoning) | 1182 | -23 / +22 | Dec 2025 |
| 89 | MiniMax | MiniMax-M2.5 | 1177 | -21 / +23 | Feb 2026 |
| 90 | Z AI | GLM-4.7 (Non-reasoning) | 1175 | -23 / +24 | Dec 2025 |
| 91 | Anthropic | Claude 4.5 Haiku (Reasoning) | 1171 | -24 / +25 | Oct 2025 |
| 92 | xAI | Grok 4.20 0309 v2 (Reasoning) | 1170 | -21 / +24 | Apr 2026 |
| 93 | Mistral | Mistral Medium 3.5 | 1168 | -25 / +24 | Apr 2026 |
| 94 | Google | Gemini 3 Pro Preview (low) | 1166 | -26 / +29 | Nov 2025 |
| 95 | Alibaba | Qwen3.5 27B (Reasoning) | 1160 | -21 / +23 | Feb 2026 |
| 96 | Alibaba | Qwen3.5 27B (Non-reasoning) | 1159 | -22 / +23 | Feb 2026 |
| 97 | OpenAI | GPT-5 (low) | 1150 | -23 / +23 | Aug 2025 |
| 98 | - | ChatGPT Plus - 5.1 Thinking (Extended Thinking) | 1149 | -41 / +45 | - |
| 99 | OpenAI | GPT-5.5 Instant (May 2026) | 1144 | -23 / +23 | May 2026 |
| 100 | Anthropic | Claude 4.5 Haiku (Non-reasoning) | 1136 | -26 / +27 | Oct 2025 |
| 101 | Alibaba | Qwen3 Max Thinking | 1134 | -22 / +23 | Jan 2026 |
| 102 | Anthropic | Claude 4 Sonnet (Reasoning) | 1130 | -27 / +26 | May 2025 |
| 103 | xAI | Grok 4.3 (low) | 1126 | -21 / +24 | Apr 2026 |
| 104 | Anthropic | Claude 4 Sonnet (Non-reasoning) | 1123 | -22 / +24 | May 2025 |
| 105 | InclusionAI | Ring-2.6-1T | 1121 | -24 / +26 | May 2026 |
| 106 | KwaiKAT | KAT Coder Pro V2 | 1116 | -22 / +22 | Mar 2026 |
| 107 | Alibaba | Qwen3.5 122B A10B (Reasoning) | 1115 | -22 / +23 | Feb 2026 |
| 108 | Google | Gemini 3 Flash Preview (Non-reasoning) | 1114 | -26 / +24 | Dec 2025 |
| 109 | Google | Gemma 4 31B (Reasoning) | 1113 | -22 / +23 | Apr 2026 |
| 110 | Alibaba | Qwen3.5 122B A10B (Non-reasoning) | 1112 | -24 / +22 | Feb 2026 |
| 111 | MiniMax | MiniMax-M2.1 | 1092 | -25 / +23 | Dec 2025 |
| 112 | Xiaomi | MiMo-V2-Flash (Reasoning) | 1079 | -22 / +24 | Dec 2025 |
| 113 | China Mobile | JT-35B-Flash | 1076 | -22 / +24 | May 2026 |
| 114 | DeepSeek | DeepSeek V3.1 (Non-reasoning) | 1075 | -23 / +25 | Aug 2025 |
| 115 | Google | Gemini 2.5 Flash Preview (Sep '25) (Reasoning) | 1070 | -25 / +22 | Sep 2025 |
| 116 | StepFun | Step 3.5 Flash 2603 | 1067 | -23 / +23 | Apr 2026 |
| 117 | DeepSeek | DeepSeek V3.2 Exp (Non-reasoning) | 1066 | -24 / +25 | Sep 2025 |
| 118 | Xiaomi | MiMo-V2-Flash (Non-reasoning) | 1058 | -25 / +24 | Dec 2025 |
| 119 | OpenAI | GPT-5.1 Codex mini (high) | 1053 | -22 / +25 | Nov 2025 |
| 120 | StepFun | Step 3.5 Flash | 1051 | -25 / +26 | Feb 2026 |
| 121 | Anthropic | Claude 3.7 Sonnet (Reasoning) | 1047 | -24 / +25 | Feb 2025 |
| 122 | Alibaba | Qwen3.5 35B A3B (Non-reasoning) | 1046 | -25 / +22 | Feb 2026 |
| 123 | Anthropic | Claude 3.7 Sonnet (Non-reasoning) | 1046 | -22 / +23 | Feb 2025 |
| 124 | xAI | Grok 4.1 Fast (Reasoning) | 1046 | -21 / +26 | Nov 2025 |
| 125 | InclusionAI | Ling-2.6-1T | 1044 | -22 / +23 | Apr 2026 |
| 126 | Xiaomi | MiMo-V2-Flash (Feb 2026) | 1042 | -25 / +24 | Dec 2025 |
| 127 | xAI | Grok 4.20 0309 (Reasoning) | 1040 | -22 / +23 | Mar 2026 |
| 128 | Alibaba | Qwen3 Max | 1037 | -22 / +24 | Sep 2025 |
| 129 | xAI | Grok 4.20 0309 v2 (Non-reasoning) | 1036 | -24 / +24 | Apr 2026 |
| 130 | - | Perplexity Pro - Labs | 1032 | -41 / +39 | - |
| 131 | MiniMax | MiniMax-M2 | 1031 | -26 / +26 | Oct 2025 |
| 132 | Z AI | GLM-4.6 (Reasoning) | 1029 | -27 / +29 | Sep 2025 |
| 133 | xAI | Grok 4 Fast (Reasoning) | 1015 | -22 / +23 | Sep 2025 |
| 134 | Google | Gemma 4 26B A4B (Reasoning) | 1013 | -24 / +22 | Apr 2026 |
| 135 | OpenAI | o4-mini (high) | 1007 | -23 / +24 | Apr 2025 |
| 136 | Google | Gemma 4 31B (Non-reasoning) | 1005 | -20 / +21 | Apr 2026 |
| 137 | DeepSeek | DeepSeek V3.1 Terminus (Reasoning) | 1005 | -25 / +26 | Sep 2025 |
| 138 | NVIDIA | NVIDIA Nemotron 3 Super 120B A12B (Reasoning) | 1003 | -22 / +21 | Mar 2026 |
| 139 | OpenAI | GPT-5 (medium) | 1001 | -26 / +26 | Aug 2025 |
| 140 | OpenAI | GPT-5 mini (medium) | 1001 | -24 / +25 | Aug 2025 |
| 141 | OpenAI | GPT-5.4 mini (Non-Reasoning) | 1001 | -22 / +21 | Mar 2026 |
| 142 | OpenAI | GPT-5.1 (Non-reasoning) | 1000 | -0 / +0 | Nov 2025 |
| 143 | DeepSeek | DeepSeek V3.2 Exp (Reasoning) | 999 | -24 / +24 | Sep 2025 |
| 144 | MiniMax | MiniMax M1 80k | 994 | -24 / +25 | Jun 2025 |
| 145 | xAI | Grok 4 | 991 | -22 / +24 | Jul 2025 |
| 146 | Kimi | Kimi K2 Thinking | 990 | -25 / +25 | Nov 2025 |
| 147 | Z AI | GLM-4.6 (Non-reasoning) | 985 | -23 / +23 | Sep 2025 |
| 148 | ByteDance Seed | Doubao Seed Code | 985 | -26 / +28 | Nov 2025 |
| 149 | DeepSeek | DeepSeek V3.1 Terminus (Non-reasoning) | 974 | -24 / +26 | Sep 2025 |
| 150 | Amazon | Nova 2.0 Pro Preview (medium) | 973 | -27 / +23 | Nov 2025 |
| 151 | - | Google AI Pro - Thinking with 3 Pro | 972 | -43 / +43 | - |
| 152 | Inception | Mercury 2 | 960 | -22 / +24 | Feb 2026 |
| 153 | Google | Gemma 4 26B A4B (Non-reasoning) | 949 | -24 / +24 | Apr 2026 |
| 154 | OpenAI | gpt-oss-120b (high) | 947 | -28 / +27 | Aug 2025 |
| 155 | Alibaba | Qwen3 Max Thinking (Preview) | 946 | -25 / +26 | Nov 2025 |
| 156 | OpenAI | GPT-5.4 nano (Non-Reasoning) | 940 | -32 / +29 | Mar 2026 |
| 157 | Google | Gemini 3.1 Flash-Lite | 926 | -22 / +24 | Mar 2026 |
| 158 | Cohere | Command A+ | 918 | -24 / +26 | May 2026 |
| 159 | Google | Gemini 2.5 Pro | 918 | -24 / +24 | Jun 2025 |
| 160 | Alibaba | Qwen3 Coder Next | 910 | -25 / +24 | Feb 2026 |
| 161 | xAI | Grok 4.20 0309 (Non-reasoning) | 907 | -22 / +23 | Mar 2026 |
| 162 | Alibaba | Qwen3.5 35B A3B (Reasoning) | 905 | -22 / +23 | Feb 2026 |
| 163 | Alibaba | Qwen3.5 Omni Flash | 894 | -25 / +24 | Mar 2026 |
| 164 | - | SuperGrok - Grok 4 | 882 | -46 / +40 | - |
| 165 | DeepSeek | DeepSeek V3.2 (Non-reasoning) | 875 | -29 / +25 | Dec 2025 |
| 166 | Google | Gemma 4 12B (Reasoning) | 875 | -22 / +23 | Jun 2026 |
| 167 | Arcee AI | Trinity Large Thinking | 864 | -23 / +22 | Apr 2026 |
| 168 | Mistral | Mistral Large 3 | 864 | -25 / +26 | Dec 2025 |
| 169 | Kimi | Kimi K2 0905 | 863 | -27 / +28 | Sep 2025 |
| 170 | Mistral | Mistral Small 4 (Reasoning) | 859 | -24 / +24 | Mar 2026 |
| 171 | Mistral | Devstral 2 | 855 | -24 / +25 | Dec 2025 |
| 172 | Amazon | Nova 2.0 Lite (high) | 851 | -22 / +25 | Oct 2025 |
| 173 | Google | Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) | 851 | -27 / +26 | Sep 2025 |
| 174 | Mistral | Mistral Small 4 (Non-reasoning) | 844 | -23 / +23 | Mar 2026 |
| 175 | Alibaba | Qwen3.5 9B (Non-reasoning) | 843 | -22 / +22 | Mar 2026 |
| 176 | Z AI | GLM-4.7-Flash (Reasoning) | 837 | -26 / +25 | Jan 2026 |
| 177 | LongCat | LongCat Flash Lite | 836 | -27 / +25 | Jan 2026 |
| 178 | China Mobile | JT-MINI | 831 | -23 / +23 | Apr 2026 |
| 179 | OpenAI | gpt-oss-120b (low) | 829 | -23 / +26 | Aug 2025 |
| 180 | Mistral | Devstral Small (May '25) | 828 | -27 / +25 | May 2025 |
| 181 | Multiverse Computing | HyperNova 60B 2605 | 827 | -26 / +25 | May 2026 |
| 182 | LG AI Research | K-EXAONE (Reasoning) | 824 | -27 / +25 | Dec 2025 |
| 183 | Mistral | Devstral Small 2 | 819 | -23 / +25 | Dec 2025 |
| 184 | Alibaba | Qwen3 235B A22B 2507 (Reasoning) | 819 | -25 / +23 | Jul 2025 |
| 185 | Alibaba | Qwen3 Max (Preview) | 815 | -24 / +23 | Sep 2025 |
| 186 | KwaiKAT | KAT-Coder-Pro V1 | 815 | -25 / +27 | Nov 2025 |
| 187 | LG AI Research | EXAONE 4.5 33B | 812 | -26 / +25 | Apr 2026 |
| 188 | Z AI | GLM-4.7-Flash (Non-reasoning) | 800 | -41 / +37 | Jan 2026 |
| 189 | Baidu | ERNIE 5.0 Thinking Preview | 788 | -27 / +27 | Nov 2025 |
| 190 | InclusionAI | Ling 2.6 Flash | 784 | -24 / +24 | Apr 2026 |
| 191 | Amazon | Nova 2.0 Omni (medium) | 782 | -26 / +26 | Nov 2025 |
| 192 | Alibaba | Qwen3 235B A22B 2507 Instruct | 782 | -26 / +25 | Jul 2025 |
| 193 | xAI | Grok 4.1 Fast (Non-reasoning) | 781 | -25 / +27 | Nov 2025 |
| 194 | Mistral | Mistral Medium 3.1 | 780 | -26 / +28 | Aug 2025 |
| 195 | xAI | Grok 4 Fast (Non-reasoning) | 777 | -25 / +25 | Sep 2025 |
| 196 | OpenAI | GPT-4.1 | 776 | -27 / +30 | Apr 2025 |
| 197 | Cohere | North Mini Code | 776 | -26 / +27 | Jun 2026 |
| 198 | Alibaba | Qwen3 VL 4B (Reasoning) | 776 | -39 / +40 | Oct 2025 |
| 199 | LG AI Research | K-EXAONE (Non-reasoning) | 764 | -24 / +25 | Dec 2025 |
| 200 | xAI | Grok Code Fast 1 | 762 | -25 / +28 | Aug 2025 |
| 201 | NVIDIA | Nemotron 3 Nano Omni 30B A3B Reasoning | 761 | -26 / +27 | Apr 2026 |
| 202 | ByteDance Seed | Seed-OSS-36B-Instruct | 757 | -27 / +26 | Aug 2025 |
| 203 | OpenAI | o3 | 756 | -31 / +31 | Apr 2025 |
| 204 | OpenAI | GPT-5 nano (high) | 755 | -24 / +26 | Aug 2025 |
| 205 | Alibaba | Qwen3 235B A22B (Reasoning) | 755 | -24 / +28 | Apr 2025 |
| 206 | NVIDIA | Nemotron Cascade 2 30B A3B | 754 | -24 / +25 | Mar 2026 |
| 207 | Prime Intellect | INTELLECT-3 | 748 | -26 / +26 | Nov 2025 |
| 208 | OpenAI | o3-mini (high) | 745 | -27 / +27 | Jan 2025 |
| 209 | Google | Gemini 2.5 Flash (Non-reasoning) | 738 | -26 / +27 | May 2025 |
| 210 | Sarvam | Sarvam 105B (high) | 737 | -23 / +24 | Mar 2026 |
| 211 | Alibaba | Qwen3 235B A22B (Non-reasoning) | 736 | -28 / +28 | Apr 2025 |
| 212 | OpenAI | o1 | 729 | -28 / +28 | Dec 2024 |
| 213 | Alibaba | Qwen3 Next 80B A3B (Reasoning) | 726 | -26 / +26 | Sep 2025 |
| 214 | Alibaba | Qwen3.5 9B (Reasoning) | 711 | -23 / +22 | Mar 2026 |
| 215 | Alibaba | Qwen3 VL 235B A22B (Reasoning) | 711 | -24 / +26 | Sep 2025 |
| 216 | Alibaba | Qwen3 Coder 30B A3B Instruct | 710 | -25 / +27 | Jul 2025 |
| 217 | Anthropic | Claude 3.5 Haiku | 709 | -23 / +24 | Oct 2024 |
| 218 | Google | Gemini 2.5 Flash (Reasoning) | 697 | -27 / +26 | May 2025 |
| 219 | Z AI | GLM-4.6V (Non-reasoning) | 691 | -27 / +26 | Dec 2025 |
| 220 | Mistral | Devstral Medium | 691 | -26 / +25 | Jul 2025 |
| 221 | InclusionAI | Ring-1T | 684 | -26 / +28 | Oct 2025 |
| 222 | Alibaba | Qwen3 VL 8B Instruct | 681 | -37 / +37 | Oct 2025 |
| 223 | DeepSeek | DeepSeek R1 0528 (May '25) | 680 | -28 / +28 | May 2025 |
| 224 | Naver | HyperCLOVA X SEED Think (32B) | 675 | -24 / +27 | Dec 2025 |
| 225 | Upstage | Solar Pro 3 | 675 | -27 / +26 | Apr 2026 |
| 226 | Mistral | Magistral Small 1.2 | 668 | -26 / +26 | Sep 2025 |
| 227 | Alibaba | Qwen3 VL 8B (Reasoning) | 668 | -28 / +29 | Oct 2025 |
| 228 | xAI | Grok 3 | 666 | -25 / +27 | Feb 2025 |
| 229 | Alibaba | Qwen3.5 4B (Non-reasoning) | 666 | -23 / +23 | Mar 2026 |
| 230 | Alibaba | Qwen3 VL 30B A3B (Reasoning) | 666 | -36 / +37 | Oct 2025 |
| 231 | Mistral | Magistral Medium 1 | 665 | -27 / +26 | Jun 2025 |
| 232 | Upstage | Solar Open 100B (Reasoning) | 664 | -27 / +29 | Dec 2025 |
| 233 | Alibaba | Qwen3 30B A3B 2507 (Reasoning) | 656 | -25 / +25 | Jul 2025 |
| 234 | Mistral | Ministral 3 14B | 654 | -25 / +27 | Dec 2025 |
| 235 | Amazon | Nova 2.0 Pro Preview (low) | 653 | -28 / +27 | Nov 2025 |
| 236 | OpenAI | gpt-oss-20B (high) | 647 | -23 / +25 | Aug 2025 |
| 237 | Alibaba | Qwen3 VL 32B (Reasoning) | 646 | -26 / +27 | Oct 2025 |
| 238 | Amazon | Nova 2.0 Lite (medium) | 641 | -26 / +28 | Oct 2025 |
| 239 | Korea Telecom | Mi:dm K 2.5 Pro | 641 | -25 / +28 | Dec 2025 |
| 240 | Mistral | Ministral 3 8B | 635 | -27 / +26 | Dec 2025 |
| 241 | Alibaba | Qwen3 VL 235B A22B Instruct | 634 | -35 / +38 | Sep 2025 |
| 242 | Alibaba | Qwen3 Next 80B A3B Instruct | 627 | -27 / +28 | Sep 2025 |
| 243 | Mistral | Magistral Medium 1.2 | 627 | -28 / +28 | Sep 2025 |
| 244 | OpenAI | GPT-4.1 mini | 619 | -27 / +28 | Apr 2025 |
| 245 | DeepSeek | DeepSeek V3.1 (Reasoning) | 612 | -29 / +30 | Aug 2025 |
| 246 | Z AI | GLM-4.6V (Reasoning) | 609 | -30 / +30 | Dec 2025 |
| 247 | MBZUAI Institute of Foundation Models | K2 Think V2 | 607 | -26 / +25 | Dec 2025 |
| 248 | Alibaba | Qwen3 4B 2507 (Reasoning) | 591 | -27 / +27 | Aug 2025 |
| 249 | OpenAI | GPT-5 nano (medium) | 588 | -27 / +28 | Aug 2025 |
| 250 | Mistral | Mistral Medium 3 | 584 | -26 / +26 | May 2025 |
| 251 | Nous Research | Hermes 4 - Llama-3.1 405B (Reasoning) | 584 | -23 / +23 | Aug 2025 |
| 252 | MBZUAI Institute of Foundation Models | K2-V2 (medium) | 579 | -28 / +26 | Dec 2025 |
| 253 | ServiceNow | Apriel-v1.6-15B-Thinker | 571 | -25 / +25 | Nov 2025 |
| 254 | NVIDIA | NVIDIA Nemotron 3 Nano 30B A3B (Reasoning) | 566 | -26 / +26 | Dec 2025 |
| 255 | Google | Gemini 2.0 Flash (Feb '25) | 566 | -27 / +27 | Feb 2025 |
| 256 | Mistral | Devstral Small (Jul '25) | 563 | -28 / +29 | Jul 2025 |
| 257 | Z AI | GLM-4.5-Air | 560 | -30 / +28 | Jul 2025 |
| 258 | MBZUAI Institute of Foundation Models | K2-V2 (high) | 557 | -27 / +26 | Dec 2025 |
| 259 | Google | Gemma 4 12B (Non-reasoning) | 553 | -26 / +25 | Jun 2026 |
| 260 | OpenAI | gpt-oss-20B (low) | 549 | -25 / +26 | Aug 2025 |
| 261 | IBM | Granite 4.1 8B | 543 | -24 / +26 | Apr 2026 |
| 262 | Nous Research | Hermes 4 - Llama-3.1 70B (Reasoning) | 536 | -24 / +23 | Aug 2025 |
| 263 | Kimi | Kimi K2 | 524 | -28 / +33 | Jul 2025 |
| 264 | Nous Research | Hermes 4 - Llama-3.1 70B (Non-reasoning) | 519 | -27 / +25 | Aug 2025 |
| 265 | Alibaba | Qwen3 30B A3B 2507 Instruct | 513 | -25 / +26 | Jul 2025 |
| 266 | Nous Research | Hermes 4 - Llama-3.1 405B (Non-reasoning) | 510 | -23 / +24 | Aug 2025 |
| 267 | Z AI | GLM-4.5V (Reasoning) | 509 | -23 / +23 | Aug 2025 |
| 268 | Alibaba | Qwen3.5 4B (Reasoning) | 509 | -28 / +30 | Mar 2026 |
| 269 | Amazon | Nova 2.0 Lite (low) | 508 | -29 / +29 | Oct 2025 |
| 270 | Alibaba | Qwen3 30B A3B (Reasoning) | 506 | -27 / +25 | Apr 2025 |
| 271 | Amazon | Nova Premier | 505 | -28 / +28 | Apr 2025 |
| 272 | Alibaba | Qwen3 Coder 480B A35B Instruct | 504 | -30 / +28 | Jul 2025 |
| 273 | LG AI Research | EXAONE 4.0 32B (Reasoning) | 503 | -26 / +25 | Jul 2025 |
| 274 | Allen Institute for AI | Molmo2-8B | 500 | -0 / +0 | Dec 2025 |
| 275 | DeepSeek | DeepSeek V3.2 Speciale | 500 | -0 / +0 | Dec 2025 |
| 276 | Alibaba | Qwen3 8B (Reasoning) | 496 | -27 / +30 | Apr 2025 |
| 277 | Alibaba | Qwen3 VL 30B A3B Instruct | 495 | -28 / +28 | Oct 2025 |
| 278 | IBM | Granite 4.1 30B | 494 | -24 / +25 | Apr 2026 |
| 279 | Alibaba | Qwen3 Omni 30B A3B (Reasoning) | 494 | -25 / +25 | Sep 2025 |
| 280 | Alibaba | Qwen3 32B (Reasoning) | 489 | -26 / +28 | Apr 2025 |
| 281 | Motif Technologies | Motif-2-12.7B-Reasoning | 483 | -30 / +31 | Dec 2025 |
| 282 | Mistral | Ministral 3 3B | 483 | -28 / +27 | Dec 2025 |
| 283 | NVIDIA | NVIDIA Nemotron 3 Nano 4B | 477 | -29 / +26 | Mar 2026 |
| 284 | Alibaba | Qwen3 14B (Reasoning) | 476 | -28 / +28 | Apr 2025 |
| 285 | Alibaba | Qwen3 8B (Non-reasoning) | 472 | -27 / +26 | Apr 2025 |
| 286 | Alibaba | Qwen3 14B (Non-reasoning) | 471 | -27 / +27 | Apr 2025 |
| 287 | OpenAI | GPT-5 mini (minimal) | 467 | -28 / +28 | Aug 2025 |
| 288 | Z AI | GLM-4.5 (Reasoning) | 467 | -31 / +32 | Jul 2025 |
| 289 | Z AI | GLM-4.5V (Non-reasoning) | 459 | -32 / +29 | Aug 2025 |
| 290 | Upstage | Solar Pro 2 (Reasoning) | 447 | -29 / +27 | Jul 2025 |
| 291 | Upstage | Solar Pro 2 (Non-reasoning) | 443 | -29 / +28 | Jul 2025 |
| 292 | NVIDIA | NVIDIA Nemotron Nano 9B V2 (Reasoning) | 438 | -27 / +27 | Aug 2025 |
| 293 | Meta | Llama 4 Maverick | 436 | -27 / +29 | Apr 2025 |
| 294 | Google | Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning) | 434 | -28 / +28 | Sep 2025 |
| 295 | InclusionAI | Ling-flash-2.0 | 420 | -28 / +30 | Sep 2025 |
| 296 | xAI | Grok 3 mini Reasoning (high) | 418 | -41 / +37 | Feb 2025 |
| 297 | DeepSeek | DeepSeek V3 (Dec '24) | 409 | -29 / +28 | Dec 2024 |
| 298 | DeepSeek | DeepSeek V3 0324 | 408 | -27 / +27 | Mar 2025 |
| 299 | Meta | Llama 3.3 Instruct 70B | 401 | -32 / +29 | Dec 2024 |
| 300 | InclusionAI | Ling-1T | 399 | -27 / +29 | Oct 2025 |
| 301 | Amazon | Nova Pro | 386 | -30 / +29 | Dec 2024 |
| 302 | OpenAI | GPT-5 (minimal) | 383 | -29 / +31 | Aug 2025 |
| 303 | Amazon | Nova 2.0 Lite (Non-reasoning) | 380 | -31 / +31 | Oct 2025 |
| 304 | Google | Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning) | 380 | -30 / +27 | Sep 2025 |
| 305 | Anthropic | Claude 3 Haiku | 377 | -23 / +23 | Mar 2024 |
| 306 | NVIDIA | Llama Nemotron Super 49B v1.5 (Non-reasoning) | 377 | -27 / +27 | Jul 2025 |
| 307 | OpenAI | GPT-4o (Aug '24) | 377 | -27 / +27 | Aug 2024 |
| 308 | TII UAE | Falcon-H1R-7B | 372 | -30 / +28 | Jan 2026 |
| 309 | Trillion Labs | Tri-21B-Think | 371 | -24 / +26 | Feb 2026 |
| 310 | NVIDIA | Llama Nemotron Super 49B v1.5 (Reasoning) | 366 | -31 / +28 | Jul 2025 |
| 311 | MBZUAI Institute of Foundation Models | K2-V2 (low) | 363 | -29 / +27 | Dec 2025 |
| 312 | IBM | Granite 4.1 3B | 363 | -24 / +23 | Apr 2026 |
| 313 | Amazon | Nova 2.0 Omni (low) | 359 | -33 / +30 | Nov 2025 |
| 314 | Sarvam | Sarvam 30B (high) | 358 | -23 / +23 | Mar 2026 |
| 315 | Allen Institute for AI | Olmo 3.1 32B Instruct | 354 | -27 / +28 | Jan 2026 |
| 316 | Nanbeige | Nanbeige4.1-3B | 354 | -30 / +31 | Feb 2026 |
| 317 | OpenAI | GPT-4o (Nov '24) | 348 | -24 / +23 | Nov 2024 |
| 318 | NVIDIA | NVIDIA Nemotron 3 Nano 30B A3B (Non-reasoning) | 347 | -30 / +29 | Dec 2025 |
| 319 | Amazon | Nova Lite | 342 | -31 / +31 | Dec 2024 |
| 320 | IBM | Granite 4.0 H Small | 342 | -30 / +28 | Sep 2025 |
| 321 | Alibaba | Qwen3 VL 4B Instruct | 339 | -28 / +27 | Oct 2025 |
| 322 | Amazon | Nova Micro | 338 | -31 / +29 | Dec 2024 |
| 323 | Trillion Labs | Tri-21B-think Preview | 337 | -33 / +30 | Feb 2026 |
| 324 | Mistral | Mistral Small 3.1 | 335 | -30 / +26 | Mar 2025 |
| 325 | NVIDIA | Llama 3.1 Nemotron Instruct 70B | 334 | -29 / +32 | Oct 2024 |
| 326 | Alibaba | Qwen3 30B A3B (Non-reasoning) | 331 | -31 / +28 | Apr 2025 |
| 327 | LG AI Research | EXAONE 4.0 32B (Non-reasoning) | 328 | -32 / +28 | Jul 2025 |
| 328 | NVIDIA | NVIDIA Nemotron Nano 12B v2 VL (Reasoning) | 327 | -28 / +26 | Oct 2025 |
| 329 | Mistral | Mistral Large 2 (Nov '24) | 323 | -30 / +30 | Nov 2024 |
| 330 | Alibaba | Qwen3.5 2B (Reasoning) | 319 | -25 / +24 | Mar 2026 |
| 331 | OpenAI | GPT-4.1 nano | 317 | -32 / +29 | Apr 2025 |
| 332 | Google | Gemini 2.5 Flash-Lite (Reasoning) | 316 | -30 / +30 | Jun 2025 |
| 333 | Amazon | Nova 2.0 Pro Preview (Non-reasoning) | 314 | -30 / +30 | Nov 2025 |
| 334 | Alibaba | Qwen3 0.6B (Reasoning) | 313 | -31 / +28 | Apr 2025 |
| 335 | Alibaba | Qwen3 4B 2507 Instruct | 307 | -30 / +27 | Aug 2025 |
| 336 | Amazon | Nova 2.0 Omni (Non-reasoning) | 303 | -31 / +28 | Nov 2025 |
| 337 | Google | Gemma 4 E4B (Reasoning) | 303 | -24 / +23 | Apr 2026 |
| 338 | Google | Gemini 2.5 Flash-Lite (Non-reasoning) | 303 | -28 / +30 | Jun 2025 |
| 339 | Mistral | Mistral Small 3.2 | 302 | -30 / +31 | Jun 2025 |
| 340 | NVIDIA | NVIDIA Nemotron Nano 9B V2 (Non-reasoning) | 302 | -29 / +29 | Aug 2025 |
| 341 | Alibaba | Qwen3 VL 32B Instruct | 299 | -34 / +29 | Oct 2025 |
| 342 | Alibaba | Qwen3 Omni 30B A3B Instruct | 294 | -32 / +30 | Sep 2025 |
| 343 | LG AI Research | Exaone 4.0 1.2B (Non-reasoning) | 292 | -28 / +28 | Jul 2025 |
| 344 | LG AI Research | Exaone 4.0 1.2B (Reasoning) | 291 | -29 / +30 | Jul 2025 |
| 345 | Google | Gemma 4 E4B (Non-reasoning) | 289 | -27 / +27 | Apr 2026 |
| 346 | IBM | Granite 4.0 H 350M | 289 | -28 / +30 | Oct 2025 |
| 347 | Meta | Llama 3.1 Instruct 70B | 285 | -28 / +28 | Jul 2024 |
| 348 | NVIDIA | NVIDIA Nemotron Nano 12B v2 VL (Non-reasoning) | 284 | -29 / +29 | Oct 2025 |
| 349 | Google | Gemma 3 27B Instruct | 283 | -30 / +29 | Mar 2025 |
| 350 | AI21 Labs | Jamba 1.7 Large | 281 | -30 / +30 | Jul 2025 |
| 351 | OpenAI | GPT-5 nano (minimal) | 279 | -29 / +30 | Aug 2025 |
| 352 | IBM | Granite 4.0 Micro | 279 | -30 / +27 | Sep 2025 |
| 353 | Google | Gemma 3 12B Instruct | 278 | -31 / +28 | Mar 2025 |
| 354 | Alibaba | Qwen3 0.6B (Non-reasoning) | 277 | -33 / +30 | Apr 2025 |
| 355 | Meta | Llama 3.1 Instruct 8B | 276 | -31 / +31 | Jul 2024 |
| 356 | Allen Institute for AI | Olmo 3 7B Instruct | 276 | -27 / +29 | Nov 2025 |
| 357 | Alibaba | Qwen3.5 0.8B (Reasoning) | 276 | -25 / +23 | Mar 2026 |
| 358 | Alibaba | Qwen3 1.7B (Reasoning) | 276 | -30 / +29 | Apr 2025 |
| 359 | Liquid AI | LFM2 1.2B | 273 | -30 / +28 | Jul 2025 |
| 360 | AI21 Labs | Jamba 1.7 Mini | 273 | -27 / +28 | Jul 2025 |
| 361 | Cohere | Command A | 273 | -28 / +28 | Mar 2025 |
| 362 | Google | Gemma 4 E2B (Reasoning) | 270 | -27 / +26 | Apr 2026 |
| 363 | Meta | Llama 4 Scout | 268 | -30 / +28 | Apr 2025 |
| 364 | IBM | Granite 4.0 350M | 268 | -29 / +27 | Oct 2025 |
| 365 | IBM | Granite 4.0 H 1B | 267 | -31 / +27 | Oct 2025 |
| 366 | Liquid AI | LFM2.5-1.2B-Instruct | 265 | -30 / +31 | Jan 2026 |
| 367 | InclusionAI | Ling-mini-2.0 | 259 | -24 / +24 | Sep 2025 |
| 368 | Liquid AI | LFM2 8B A1B | 259 | -31 / +29 | Oct 2025 |
| 369 | StepFun | Step3 VL 10B | 256 | -28 / +28 | Jan 2026 |
| 370 | Liquid AI | LFM2.5-1.2B-Thinking | 256 | -30 / +29 | Jan 2026 |
| 371 | AI21 Labs | Jamba Reasoning 3B | 255 | -29 / +27 | Oct 2025 |
| 372 | Liquid AI | LFM2.5-8B-A1B | 255 | -28 / +28 | May 2026 |
| 373 | Google | Gemma 3 4B Instruct | 255 | -30 / +30 | Mar 2025 |
| 374 | OpenBMB | MiniCPM-V 4.6 1.3B | 255 | -31 / +28 | May 2026 |
| 375 | IBM | Granite 4.0 1B | 255 | -29 / +32 | Oct 2025 |
| 376 | Meta | Llama 3.1 Instruct 405B | 255 | -33 / +28 | Jul 2024 |
| 377 | Alibaba | Qwen3 1.7B (Non-reasoning) | 252 | -29 / +30 | Apr 2025 |
| 378 | Google | Gemma 4 E2B (Non-reasoning) | 250 | -27 / +26 | Apr 2026 |
| 379 | DeepSeek | DeepSeek R1 (Jan '25) | 250 | -29 / +28 | Jan 2025 |
| 380 | Google | Gemma 3n E4B Instruct | 243 | -33 / +32 | Jun 2025 |
| 381 | OpenBMB | MiniCPM5-1B (Reasoning) | 238 | -31 / +29 | May 2026 |
| 382 | NVIDIA | Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) | 238 | -25 / +24 | Apr 2025 |
| 383 | Alibaba | Qwen3.5 2B (Non-reasoning) | 237 | -24 / +23 | Mar 2026 |
| 384 | Liquid AI | LFM2 2.6B | 235 | -27 / +29 | Sep 2025 |
| 385 | Alibaba | Qwen3.5 0.8B (Non-reasoning) | 234 | -25 / +24 | Mar 2026 |
| 386 | OpenBMB | MiniCPM5-1B (Non-reasoning) | 233 | -27 / +26 | May 2026 |
| 387 | Liquid AI | LFM2 24B A2B | 233 | -25 / +24 | Feb 2026 |
| 388 | Liquid AI | LFM2.5-VL-1.6B | 232 | -31 / +31 | Jan 2026 |
| 389 | Microsoft | Phi-4 Mini Instruct | 227 | -30 / +28 | Feb 2024 |
| 390 | IBM | Granite 3.3 8B (Non-reasoning) | 223 | -30 / +28 | Apr 2025 |
| 391 | Allen Institute for AI | Olmo 3.1 32B Think | 0 | -0 / +0 | Dec 2025 |
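
When reading the CI column, a quick sanity check is whether two models' intervals overlap. The sketch below assumes the CI values are offsets below and above the listed Elo; the page does not state the confidence level or how the intervals are computed, and the `Entry` type and `intervals_overlap` helper are hypothetical names used only for illustration. Overlap is a rough heuristic, not a formal significance test.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One leaderboard row; ci_minus and ci_plus are assumed to be offsets from elo."""
    name: str
    elo: int
    ci_minus: int
    ci_plus: int

    @property
    def low(self) -> int:
        return self.elo - self.ci_minus

    @property
    def high(self) -> int:
        return self.elo + self.ci_plus


def intervals_overlap(a: Entry, b: Entry) -> bool:
    """True when the two Elo intervals share at least one point."""
    return a.low <= b.high and b.low <= a.high


# Top three rows of the leaderboard above.
fable = Entry("Claude Fable 5", 1932, 36, 39)
opus = Entry("Claude Opus 4.8", 1890, 34, 35)
gpt = Entry("GPT-5.5 (xhigh)", 1769, 32, 31)

print(intervals_overlap(fable, opus))  # intervals 1896..1971 and 1856..1925
print(intervals_overlap(opus, gpt))    # intervals 1856..1925 and 1737..1800
```

Read this way, the first- and second-place intervals overlap, while second and third place are separated by a clear gap.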

Frequently Asked Questions

GDPval-AA is Artificial Analysis' evaluation based on OpenAI's GDPval dataset, which tests AI models on real-world economically valuable tasks across 44 occupations and 9 major industries.

GDPval-AA compares model submissions head-to-head on the same task. For each matchup, the two outputs are anonymized and an LLM judge picks a winner. These blind pairwise results are aggregated into an Elo rating per model.
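
As a rough illustration of how blind pairwise outcomes can be turned into ratings, the sketch below fits a Bradley-Terry model and reports strengths on an Elo-like scale. The page does not describe Artificial Analysis' actual fitting procedure, so the Bradley-Terry choice, the 400-point scale, the anchoring of one model at 1000, and every name in the code (`fit_elo`, the model labels, the match data) are assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_elo(matches, anchor, anchor_elo=1000.0, iters=500):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns Elo-like ratings shifted so that `anchor` sits at `anchor_elo`.
    This is an illustrative sketch, not the published GDPval-AA procedure.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)  # number of judged matchups per unordered pair
    models = set()
    for winner, loser in matches:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in pair_counts.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            # Clamp wins so a model with no wins keeps a finite rating.
            updated[m] = max(wins[m], 1e-6) / denom
        total = sum(updated.values())
        strength = {m: s * len(models) / total for m, s in updated.items()}

    ref = strength[anchor]
    return {m: anchor_elo + 400.0 * math.log10(s / ref) for m, s in strength.items()}


# Hypothetical judge decisions: each tuple is (preferred model, other model).
matches = (
    [("model_a", "model_b")] * 7 + [("model_b", "model_a")] * 3
    + [("model_b", "model_c")] * 6 + [("model_c", "model_b")] * 4
    + [("model_a", "model_c")] * 8 + [("model_c", "model_a")] * 2
)
ratings = fit_elo(matches, anchor="model_b")
for name, elo in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {elo:.0f}")
```

A batch fit like this does not depend on the order in which matchups were judged, which is one reason it is often preferred over sequential Elo updates for static leaderboards.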

Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) has the highest GDPval-AA score, with a GDPval-AA Elo rating of 1,932 among models with published GDPval-AA results.

GDPval-AA covers real-world professional tasks across a range of occupations and industries, producing outputs such as documents, spreadsheets, slides, and diagrams. Generating these deliverables generally requires interacting with a sandbox filesystem through shell access and using web search, capabilities the model is given through the Stirrup agentic harness.

Most benchmarks test short-answer or multiple-choice responses. GDPval-AA instead evaluates complete deliverables: models operate in an agentic environment with tools, produce file outputs, and have their submissions scored through pairwise grading on relative quality.

Explore Evaluations

Artificial Analysis Intelligence Index

A composite benchmark aggregating ten challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.

GDPval-AA Leaderboard

GDPval-AA is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with Elo ratings derived from blind pairwise comparisons.

APEX-Agents-AA Benchmark Leaderboard

Artificial Analysis' implementation of the APEX-Agents benchmark, testing AI agents on long-horizon, cross-application tasks in professional-services environments with realistic application tooling.

𝜏²-Bench Telecom Benchmark Leaderboard

A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.

Terminal-Bench Hard Benchmark Leaderboard

An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks.

SciCode Benchmark Leaderboard

A scientist-curated coding benchmark featuring 288 test set subproblems from 80 laboratory problems across 16 scientific disciplines.

Artificial Analysis Long Context Reasoning Benchmark Leaderboard

A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).

AA-Omniscience: Knowledge and Hallucination Benchmark

A benchmark measuring factual recall and hallucination across various economically relevant domains.

IFBench Benchmark Leaderboard

A benchmark evaluating precise instruction-following generalization on 58 diverse, verifiable out-of-domain constraints that test models' ability to follow specific output requirements.

Humanity's Last Exam Benchmark Leaderboard

A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities, designed to be the final closed-ended academic evaluation.

GPQA Diamond Benchmark Leaderboard

The most challenging 198 questions from GPQA, where PhD experts achieve 65% accuracy but skilled non-experts only reach 34% despite web access.

CritPt Benchmark Leaderboard

A benchmark designed to test LLMs on research-level physics reasoning tasks, featuring 71 composite research challenges.

ITBench-AA Benchmark Leaderboard

Artificial Analysis' implementation of IBM's ITBench benchmark, testing AI agents on Kubernetes incident root-cause analysis from offline incident snapshots. The agent inspects alerts, events, traces, and topology and identifies the contributing-factor entities (deployments, pods, namespaces, network policies, etc.) responsible for the failure.

Artificial Analysis Openness Index

A composite measure providing an industry standard to communicate model openness for users and developers.

MMLU-Pro Benchmark Leaderboard

An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.

Global-MMLU-Lite Benchmark Leaderboard

A lightweight, multilingual version of MMLU, designed to evaluate knowledge and reasoning skills across a diverse range of languages and cultural contexts.

LiveCodeBench Benchmark Leaderboard

A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces, evaluating code generation, self-repair, and execution.

MATH-500 Benchmark Leaderboard

A 500-problem subset from the MATH dataset, featuring competition-level mathematics across six domains including algebra, geometry, and number theory.

AIME 2025 Benchmark Leaderboard

All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000 to 999.

MMMU-Pro Benchmark Leaderboard

An enhanced MMMU benchmark that eliminates shortcuts and guessing strategies to more rigorously test multimodal models across 30 academic disciplines.