GDPval-AA Leaderboard

GDPval-AA is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with Elo ratings derived from blind pairwise comparisons.

The GDPval gold public dataset includes 220 tasks developed by OpenAI in collaboration with industry professionals to reflect real-world complexity.
The benchmark requires models to produce diverse outputs including documents, slides, diagrams, and spreadsheets, mirroring actual work products across finance, healthcare, legal, and other professional domains.

All evaluations are conducted independently by Artificial Analysis. More information can be found on our Intelligence Benchmarking Methodology page.

Publication

View on arXiv

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, Jerry Tworek.

We introduce GDPval, a benchmark designed to evaluate AI models on real-world, economically valuable tasks across 44 occupations. The dataset encompasses 1,320 tasks derived from nine major industries contributing significantly to the U.S. GDP. These tasks were developed in collaboration with industry professionals averaging 14 years of experience, ensuring they accurately represent real-world complexities. The evaluation requires models to produce diverse outputs, including documents, slides, diagrams, and spreadsheets, mirroring actual work products. Initial results indicate that frontier AI models are approaching the quality of work produced by human experts, with models able to perform certain professional tasks approximately 100 times faster and at a fraction of the cost compared to human experts.

GDPval

GPT-5.5 (xhigh) scores highest on GDPval-AA with an Elo of 1769, followed by GPT-5.5 (high) and Claude Opus 4.7 (Adaptive Reasoning, Max Effort), both with an Elo of 1753.

GDPval-AA Elo

[Chart: GDPval-AA Leaderboard. Elo scores for agentic performance on real-world work tasks using web and shell access via Stirrup, an open-source harness developed by Artificial Analysis. Series: Stirrup Agent Harness, AI Chatbot.]

Chatbots

[Chart: GDPval-AA: AI Chatbots. Elo scores for AI chatbots tested in the GDPval-AA evaluation.]

Score Comparisons

[Chart: GDPval-AA: Elo vs. Artificial Analysis Intelligence Index. Axes: GDPval-AA Elo, Artificial Analysis Intelligence Index. Highlighted: most attractive quadrant. Legend (model creators): Alibaba, Amazon, Anthropic, DeepSeek, Google, Kimi, MBZUAI Institute of Foundation Models, Meta, MiniMax, Mistral, NVIDIA, OpenAI, Upstage, xAI, Xiaomi, Z AI.]

Artificial Analysis Intelligence Index v4.0 includes: GDPval-AA, 𝜏²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, CritPt. See Intelligence Index methodology for further details, including a breakdown of each evaluation and how we run them.

Token Usage

[Chart: GDPval-AA: Token Usage. Tokens used to run the evaluation, split into input tokens, reasoning tokens, and answer tokens.]

The total number of tokens used to run the evaluation, including input tokens (prompt), reasoning tokens (for reasoning models), and answer tokens (final response).

Cost

[Chart: GDPval-AA: Cost Breakdown. Cost (USD) to run the evaluation, split into input cost, reasoning cost, and answer cost.]

The cost to run the evaluation, calculated using the model's input and output token pricing and the number of tokens used.
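The token and cost definitions above reduce to simple arithmetic, sketched here in Python. This is an illustration, not Artificial Analysis' code: the `TokenUsage` class, the `cost_breakdown` function, the token counts, and the prices are all hypothetical. It also assumes that reasoning tokens are billed at the output-token price, which the page implies by mentioning only input and output pricing but does not state outright.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Token counts for one evaluation run (hypothetical container)."""
    input_tokens: int
    reasoning_tokens: int
    answer_tokens: int

    @property
    def total_tokens(self) -> int:
        # Total = input (prompt) + reasoning + answer (final response) tokens.
        return self.input_tokens + self.reasoning_tokens + self.answer_tokens


def cost_breakdown(usage: TokenUsage,
                   input_price_per_m: float,
                   output_price_per_m: float) -> dict:
    """Split cost (USD) into input, reasoning, and answer components.

    Prices are USD per one million tokens. Assumption: reasoning tokens
    are charged at the output-token price.
    """
    per_m = 1_000_000
    input_cost = usage.input_tokens * input_price_per_m / per_m
    reasoning_cost = usage.reasoning_tokens * output_price_per_m / per_m
    answer_cost = usage.answer_tokens * output_price_per_m / per_m
    return {
        "input": input_cost,
        "reasoning": reasoning_cost,
        "answer": answer_cost,
        "total": input_cost + reasoning_cost + answer_cost,
    }


# Made-up numbers for illustration only.
example_usage = TokenUsage(input_tokens=2_000_000,
                           reasoning_tokens=500_000,
                           answer_tokens=250_000)
example_costs = cost_breakdown(example_usage,
                               input_price_per_m=1.0,
                               output_price_per_m=4.0)
```

With these made-up inputs the run uses 2,750,000 tokens and costs 5.00 USD in total: 2.00 for input, 2.00 for reasoning, and 1.00 for the answer.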

Average Turns

[Chart: GDPval-AA: Average Turns per Task. Average number of turns per task.]

Score vs. Release Date

[Chart: GDPval-AA: Elo vs. Release Date. Highlighted: most attractive region. Legend (model creators): Alibaba, Amazon, Anthropic, DeepSeek, Google, Kimi, MBZUAI Institute of Foundation Models, Meta, MiniMax, Mistral, NVIDIA, OpenAI, Upstage, xAI, Xiaomi, Z AI.]

GDPval-AA Leaderboard

Rank · Creator · Model · Elo (- / +) · Release date (- = not listed)
1 · OpenAI · GPT-5.5 (xhigh) · 1769 (-32 / +31) · Apr 2026
2 · OpenAI · GPT-5.5 (high) · 1753 (-28 / +32) · Apr 2026
3 · Anthropic · Claude Opus 4.7 (Adaptive Reasoning, Max Effort) · 1753 (-41 / +40) · Apr 2026
4 · Anthropic · Claude Opus 4.7 (Non-reasoning, High Effort) · 1683 (-26 / +28) · Apr 2026
5 · Anthropic · Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) · 1676 (-26 / +29) · Feb 2026
6 · OpenAI · GPT-5.4 (xhigh) · 1674 (-34 / +32) · Mar 2026
7 · Google · Gemini 3.5 Flash (high) · 1656 (-26 / +30) · May 2026
8 · OpenAI · GPT-5.5 (medium) · 1654 (-26 / +27) · Apr 2026
9 · Anthropic · Claude Opus 4.6 (Adaptive Reasoning, Max Effort) · 1619 (-31 / +33) · Feb 2026
10 · Anthropic · Claude Sonnet 4.6 (Non-reasoning, High Effort) · 1595 (-26 / +26) · Feb 2026
11 · Anthropic · Claude Opus 4.6 (Non-reasoning, High Effort) · 1590 (-25 / +27) · Feb 2026
12 · Xiaomi · MiMo-V2.5-Pro · 1571 (-27 / +28) · Apr 2026
13 · DeepSeek · DeepSeek V4 Pro (Reasoning, High Effort) · 1558 (-29 / +31) · Apr 2026
14 · DeepSeek · DeepSeek V4 Pro (Reasoning, Max Effort) · 1554 (-29 / +29) · Apr 2026
15 · Xiaomi · MiMo-V2.5 · 1553 (-25 / +28) · Apr 2026
16 · Alibaba · Qwen3.7 Max · 1544 (-24 / +25) · May 2026
17 · Z AI · GLM-5.1 (Reasoning) · 1535 (-0 / +0) · Apr 2026
18 · MiniMax · MiniMax-M2.7 · 1505 (-24 / +26) · Mar 2026
19 · Alibaba · Qwen3.6 Max Preview · 1504 (-20 / +21) · Apr 2026
20 · OpenAI · GPT-5.4 (low) · 1503 (-27 / +29) · Mar 2026
21 · Z AI · GLM-5-Turbo · 1497 (-24 / +25) · Mar 2026
22 · xAI · Grok 4.3 (high) · 1495 (-25 / +23) · Apr 2026
23 · Z AI · GLM-5.1 (Non-reasoning) · 1494 (-26 / +29) · Apr 2026
24 · Kimi · Kimi K2.6 · 1481 (-25 / +26) · Apr 2026
25 · OpenAI · GPT-5.3 Codex (xhigh) · 1479 (-25 / +26) · Feb 2026
26 · DeepSeek · DeepSeek V4 Pro (Non-reasoning) · 1478 (-26 / +26) · Apr 2026
27 · OpenAI · GPT-5.2 (xhigh) · 1467 (-25 / +26) · Dec 2025
28 · Anthropic · Claude Sonnet 4.6 (Non-reasoning, Low Effort) · 1455 (-25 / +23) · Feb 2026
29 · Anthropic · Claude Opus 4.5 (Reasoning) · 1451 (-26 / +27) · Nov 2025
30 · Google · Gemini 3.5 Flash (minimal) · 1443 (-25 / +27) · May 2026
31 · OpenAI · GPT-5.5 (low) · 1442 (-26 / +23) · Apr 2026
32 · OpenAI · GPT-5.4 mini (xhigh) · 1438 (-23 / +26) · Mar 2026
33 · Anthropic · Claude Opus 4.5 (Non-reasoning) · 1419 (-22 / +22) · Nov 2025
34 · Meta · Muse Spark · 1417 (-23 / +23) · Apr 2026
35 · DeepSeek · DeepSeek V4 Flash (Reasoning, High Effort) · 1414 (-25 / +26) · Apr 2026
36 · Xiaomi · MiMo-V2-Pro · 1408 (-24 / +22) · Mar 2026
37 · OpenAI · GPT-5.2 (medium) · 1404 (-22 / +23) · Dec 2025
38 · Alibaba · Qwen3.6 27B (Reasoning) · 1404 (-23 / +25) · Apr 2026
39 · Z AI · GLM-5 (Reasoning) · 1394 (-23 / +23) · Feb 2026
40 · DeepSeek · DeepSeek V4 Flash (Non-reasoning) · 1393 (-27 / +26) · Apr 2026
41 · Alibaba · Qwen3.6 27B (Non-reasoning) · 1391 (-25 / +23) · Apr 2026
42 · DeepSeek · DeepSeek V4 Flash (Reasoning, Max Effort) · 1388 (-20 / +34) · Apr 2026
43 · Alibaba · Qwen3.6 Plus · 1351 (-24 / +24) · Apr 2026
44 · Xiaomi · MiMo-V2-Omni-0327 · 1346 (-23 / +24) · Mar 2026
45 · OpenAI · GPT-5.4 (Non-reasoning) · 1342 (-26 / +26) · Mar 2026
46 · Z AI · GLM 5V Turbo (Reasoning) · 1331 (-23 / +26) · Apr 2026
47 · Kimi · Kimi K2.6 (Non-reasoning) · 1324 (-28 / +28) · Apr 2026
48 · Google · Gemini 3 Deep Think · 1324 (-30 / +31) · Feb 2026
49 · Z AI · GLM-5 (Non-reasoning) · 1323 (-21 / +25) · Feb 2026
50 · Anthropic · Claude 4.5 Sonnet (Reasoning) · 1322 (-24 / +25) · Sep 2025
51 · Xiaomi · MiMo-V2-Omni · 1320 (-24 / +22) · Mar 2026
52 · - · Claude Pro - 4.5 Opus (Extended Thinking) · 1319 (-41 / +38) · -
53 · OpenAI · GPT-5.4 mini (medium) · 1319 (-22 / +22) · Mar 2026
54 · OpenAI · GPT-5.5 (Non-reasoning) · 1314 (-24 / +25) · Apr 2026
55 · Google · Gemini 3.1 Pro Preview · 1314 (-26 / +27) · Feb 2026
56 · xAI · Grok 4.3 (medium) · 1312 (-24 / +24) · Apr 2026
57 · Anthropic · Claude 4.5 Sonnet (Non-reasoning) · 1311 (-23 / +22) · Sep 2025
58 · xAI · Grok 4.3 (Non-reasoning) · 1303 (-25 / +26) · Apr 2026
59 · Alibaba · Qwen3.6 35B A3B (Reasoning) · 1297 (-24 / +24) · Apr 2026
60 · Xiaomi · MiMo-V2.5-Pro (Non-reasoning) · 1296 (-24 / +25) · Apr 2026
61 · OpenAI · GPT-5 (high) · 1295 (-21 / +22) · Aug 2025
62 · OpenAI · GPT-5.2 Codex (xhigh) · 1289 (-27 / +28) · Dec 2025
63 · Kimi · Kimi K2.5 (Reasoning) · 1287 (-24 / +24) · Jan 2026
64 · Kimi · Kimi K2.5 (Non-reasoning) · 1265 (-26 / +24) · Jan 2026
65 · Tencent · Hy3-preview (Reasoning) · 1238 (-24 / +23) · Apr 2026
66 · OpenAI · GPT-5.1 (high) · 1228 (-21 / +24) · Nov 2025
67 · Tencent · Hy3-preview (Non-reasoning) · 1225 (-28 / +26) · Apr 2026
68 · OpenAI · GPT-5.2 (Non-reasoning) · 1223 (-23 / +22) · Dec 2025
69 · Alibaba · Qwen3.5 397B A17B (Non-reasoning) · 1222 (-23 / +23) · Feb 2026
70 · Alibaba · Qwen3.6 35B A3B (Non-reasoning) · 1222 (-24 / +24) · Apr 2026
71 · OpenAI · GPT-5 Codex (high) · 1214 (-24 / +22) · Sep 2025
72 · Google · Gemini 3 Flash Preview (Reasoning) · 1204 (-24 / +24) · Dec 2025
73 · OpenAI · GPT-5.4 nano (medium) · 1200 (-22 / +22) · Mar 2026
74 · DeepSeek · DeepSeek V3.2 (Reasoning) · 1197 (-24 / +22) · Dec 2025
75 · OpenAI · GPT-5.1 Codex (high) · 1193 (-26 / +25) · Nov 2025
76 · OpenAI · GPT-5.4 nano (xhigh) · 1191 (-25 / +23) · Mar 2026
77 · Alibaba · Qwen3.5 397B A17B (Reasoning) · 1190 (-23 / +22) · Feb 2026
78 · Z AI · GLM-4.7 (Reasoning) · 1186 (-22 / +23) · Dec 2025
79 · OpenAI · GPT-5 mini (high) · 1185 (-24 / +23) · Aug 2025
80 · Google · Gemini 3 Pro Preview (high) · 1185 (-23 / +22) · Nov 2025
81 · Alibaba · Qwen3.5 Omni Plus · 1184 (-22 / +24) · Mar 2026
82 · MiniMax · MiniMax-M2.5 · 1180 (-24 / +23) · Feb 2026
83 · Z AI · GLM-4.7 (Non-reasoning) · 1177 (-23 / +23) · Dec 2025
84 · xAI · Grok 4.20 0309 v2 (Reasoning) · 1172 (-22 / +23) · Apr 2026
85 · Anthropic · Claude 4.5 Haiku (Reasoning) · 1171 (-24 / +25) · Oct 2025
86 · Google · Gemini 3 Pro Preview (low) · 1168 (-27 / +26) · Nov 2025
87 · Mistral · Mistral Medium 3.5 · 1168 (-25 / +24) · Apr 2026
88 · Alibaba · Qwen3.5 27B (Non-reasoning) · 1162 (-24 / +21) · Feb 2026
89 · Alibaba · Qwen3.5 27B (Reasoning) · 1158 (-23 / +22) · Feb 2026
90 · OpenAI · GPT-5 (low) · 1151 (-23 / +23) · Aug 2025
91 · - · ChatGPT Plus - 5.1 Thinking (Extended Thinking) · 1149 (-41 / +45) · -
92 · Alibaba · Qwen3 Max Thinking · 1140 (-24 / +22) · Jan 2026
93 · Anthropic · Claude 4.5 Haiku (Non-reasoning) · 1136 (-27 / +26) · Oct 2025
94 · Anthropic · Claude 4 Sonnet (Reasoning) · 1134 (-28 / +26) · May 2025
95 · Anthropic · Claude 4 Sonnet (Non-reasoning) · 1129 (-25 / +23) · May 2025
96 · xAI · Grok 4.3 (low) · 1124 (-23 / +25) · Apr 2026
97 · InclusionAI · Ring-2.6-1T · 1124 (-24 / +28) · May 2026
98 · KwaiKAT · KAT Coder Pro V2 · 1119 (-24 / +23) · Mar 2026
99 · Google · Gemini 3 Flash Preview (Non-reasoning) · 1117 (-25 / +28) · Dec 2025
100 · Alibaba · Qwen3.5 122B A10B (Reasoning) · 1116 (-24 / +21) · Feb 2026
101 · Google · Gemma 4 31B (Reasoning) · 1113 (-22 / +23) · Apr 2026
102 · Alibaba · Qwen3.5 122B A10B (Non-reasoning) · 1112 (-23 / +23) · Feb 2026
103 · MiniMax · MiniMax-M2.1 · 1089 (-24 / +26) · Dec 2025
104 · DeepSeek · DeepSeek V3.1 (Non-reasoning) · 1082 (-24 / +24) · Aug 2025
105 · Xiaomi · MiMo-V2-Flash (Reasoning) · 1081 (-26 / +23) · Dec 2025
106 · China Mobile · JT-35B-Flash · 1077 (-23 / +25) · May 2026
107 · Google · Gemini 2.5 Flash Preview (Sep '25) (Reasoning) · 1073 (-24 / +23) · Sep 2025
108 · DeepSeek · DeepSeek V3.2 Exp (Non-reasoning) · 1073 (-26 / +24) · Sep 2025
109 · StepFun · Step 3.5 Flash 2603 · 1069 (-24 / +23) · Apr 2026
110 · Xiaomi · MiMo-V2-Flash (Non-reasoning) · 1064 (-27 / +24) · Dec 2025
111 · StepFun · Step 3.5 Flash · 1055 (-26 / +27) · Feb 2026
112 · OpenAI · GPT-5.1 Codex mini (high) · 1052 (-23 / +25) · Nov 2025
113 · Alibaba · Qwen3.5 35B A3B (Non-reasoning) · 1050 (-21 / +21) · Feb 2026
114 · Anthropic · Claude 3.7 Sonnet (Reasoning) · 1049 (-26 / +24) · Feb 2025
115 · Anthropic · Claude 3.7 Sonnet (Non-reasoning) · 1048 (-24 / +24) · Feb 2025
116 · InclusionAI · Ling-2.6-1T · 1046 (-23 / +24) · Apr 2026
117 · xAI · Grok 4.1 Fast (Reasoning) · 1044 (-24 / +24) · Nov 2025
118 · Xiaomi · MiMo-V2-Flash (Feb 2026) · 1044 (-27 / +26) · Dec 2025
119 · xAI · Grok 4.20 0309 (Reasoning) · 1043 (-22 / +22) · Mar 2026
120 · Alibaba · Qwen3 Max · 1040 (-24 / +24) · Sep 2025
121 · xAI · Grok 4.20 0309 v2 (Non-reasoning) · 1039 (-27 / +27) · Apr 2026
122 · MiniMax · MiniMax-M2 · 1032 (-27 / +25) · Oct 2025
123 · - · Perplexity Pro - Labs · 1032 (-41 / +39) · -
124 · Z AI · GLM-4.6 (Reasoning) · 1030 (-28 / +27) · Sep 2025
125 · Google · Gemma 4 26B A4B (Reasoning) · 1016 (-23 / +24) · Apr 2026
126 · xAI · Grok 4 Fast (Reasoning) · 1014 (-24 / +24) · Sep 2025
127 · OpenAI · o4-mini (high) · 1008 (-24 / +22) · Apr 2025
128 · OpenAI · GPT-5.4 mini (Non-Reasoning) · 1006 (-24 / +24) · Mar 2026
129 · DeepSeek · DeepSeek V3.1 Terminus (Reasoning) · 1005 (-26 / +28) · Sep 2025
130 · Google · Gemma 4 31B (Non-reasoning) · 1004 (-23 / +21) · Apr 2026
131 · NVIDIA · NVIDIA Nemotron 3 Super 120B A12B (Reasoning) · 1003 (-22 / +21) · Mar 2026
132 · OpenAI · GPT-5 mini (medium) · 1003 (-28 / +27) · Aug 2025
133 · DeepSeek · DeepSeek V3.2 Exp (Reasoning) · 1003 (-25 / +23) · Sep 2025
134 · OpenAI · GPT-5 (medium) · 1001 (-27 / +27) · Aug 2025
135 · OpenAI · GPT-5.1 (Non-reasoning) · 1000 (-0 / +0) · Nov 2025
136 · MiniMax · MiniMax M1 80k · 995 (-24 / +25) · Jun 2025
137 · Kimi · Kimi K2 Thinking · 993 (-25 / +23) · Nov 2025
138 · ByteDance Seed · Doubao Seed Code · 987 (-27 / +26) · Nov 2025
139 · xAI · Grok 4 · 985 (-24 / +24) · Jul 2025
140 · Z AI · GLM-4.6 (Non-reasoning) · 985 (-23 / +25) · Sep 2025
141 · DeepSeek · DeepSeek V3.1 Terminus (Non-reasoning) · 976 (-23 / +25) · Sep 2025
142 · Amazon · Nova 2.0 Pro Preview (medium) · 973 (-27 / +23) · Nov 2025
143 · - · Google AI Pro - Thinking with 3 Pro · 972 (-43 / +43) · -
144 · Inception · Mercury 2 · 958 (-22 / +22) · Feb 2026
145 · Google · Gemma 4 26B A4B (Non-reasoning) · 949 (-22 / +22) · Apr 2026
146 · Alibaba · Qwen3 Max Thinking (Preview) · 948 (-26 / +27) · Nov 2025
147 · OpenAI · gpt-oss-120b (high) · 947 (-28 / +27) · Aug 2025
148 · OpenAI · GPT-5.4 nano (Non-Reasoning) · 944 (-32 / +28) · Mar 2026
149 · Google · Gemini 3.1 Flash-Lite Preview · 926 (-22 / +23) · Mar 2026
150 · Cohere · Command A+ · 920 (-24 / +24) · May 2026
151 · Google · Gemini 2.5 Pro · 915 (-25 / +26) · Jun 2025
152 · Alibaba · Qwen3 Coder Next · 914 (-23 / +25) · Feb 2026
153 · xAI · Grok 4.20 0309 (Non-reasoning) · 910 (-22 / +23) · Mar 2026
154 · Alibaba · Qwen3.5 35B A3B (Reasoning) · 908 (-22 / +24) · Feb 2026
155 · Alibaba · Qwen3.5 Omni Flash · 898 (-25 / +24) · Mar 2026
156 · - · SuperGrok - Grok 4 · 882 (-46 / +40) · -
157 · DeepSeek · DeepSeek V3.2 (Non-reasoning) · 877 (-27 / +27) · Dec 2025
158 · Arcee AI · Trinity Large Thinking · 866 (-24 / +23) · Apr 2026
159 · Kimi · Kimi K2 0905 · 865 (-29 / +28) · Sep 2025
160 · Mistral · Mistral Large 3 · 864 (-25 / +23) · Dec 2025
161 · Mistral · Mistral Small 4 (Reasoning) · 863 (-24 / +23) · Mar 2026
162 · Mistral · Devstral 2 · 856 (-24 / +26) · Dec 2025
163 · Google · Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) · 853 (-27 / +27) · Sep 2025
164 · Amazon · Nova 2.0 Lite (high) · 852 (-25 / +24) · Oct 2025
165 · Mistral · Mistral Small 4 (Non-reasoning) · 846 (-24 / +21) · Mar 2026
166 · Alibaba · Qwen3.5 9B (Non-reasoning) · 843 (-23 / +24) · Mar 2026
167 · LongCat · LongCat Flash Lite · 838 (-27 / +25) · Jan 2026
168 · Z AI · GLM-4.7-Flash (Reasoning) · 838 (-25 / +24) · Jan 2026
169 · Mistral · Devstral Small (May '25) · 833 (-26 / +26) · May 2025
170 · OpenAI · gpt-oss-120b (low) · 832 (-23 / +23) · Aug 2025
171 · China Mobile · JT-MINI · 831 (-25 / +25) · Apr 2026
172 · LG AI Research · K-EXAONE (Reasoning) · 826 (-26 / +26) · Dec 2025
173 · Alibaba · Qwen3 235B A22B 2507 (Reasoning) · 822 (-24 / +25) · Jul 2025
174 · Alibaba · Qwen3 Max (Preview) · 820 (-25 / +22) · Sep 2025
175 · Mistral · Devstral Small 2 · 819 (-25 / +25) · Dec 2025
176 · KwaiKAT · KAT-Coder-Pro V1 · 819 (-26 / +25) · Nov 2025
177 · LG AI Research · EXAONE 4.5 33B · 814 (-24 / +24) · Apr 2026
178 · Z AI · GLM-4.7-Flash (Non-reasoning) · 803 (-38 / +37) · Jan 2026
179 · Baidu · ERNIE 5.0 Thinking Preview · 790 (-27 / +27) · Nov 2025
180 · xAI · Grok 4.1 Fast (Non-reasoning) · 785 (-28 / +28) · Nov 2025
181 · Amazon · Nova 2.0 Omni (medium) · 784 (-28 / +27) · Nov 2025
182 · InclusionAI · Ling 2.6 Flash · 783 (-22 / +23) · Apr 2026
183 · Mistral · Mistral Medium 3.1 · 781 (-26 / +27) · Aug 2025
184 · Alibaba · Qwen3 235B A22B 2507 Instruct · 781 (-27 / +27) · Jul 2025
185 · OpenAI · GPT-4.1 · 777 (-27 / +26) · Apr 2025
186 · xAI · Grok 4 Fast (Non-reasoning) · 777 (-25 / +27) · Sep 2025
187 · Alibaba · Qwen3 VL 4B (Reasoning) · 776 (-39 / +40) · Oct 2025
188 · NVIDIA · Nemotron 3 Nano Omni 30B A3B Reasoning · 767 (-29 / +26) · Apr 2026
189 · xAI · Grok Code Fast 1 · 765 (-27 / +28) · Aug 2025
190 · LG AI Research · K-EXAONE (Non-reasoning) · 763 (-27 / +24) · Dec 2025
191 · ByteDance Seed · Seed-OSS-36B-Instruct · 760 (-26 / +26) · Aug 2025
192 · NVIDIA · Nemotron Cascade 2 30B A3B · 759 (-25 / +23) · Mar 2026
193 · Alibaba · Qwen3 235B A22B (Reasoning) · 757 (-28 / +27) · Apr 2025
194 · OpenAI · GPT-5 nano (high) · 756 (-27 / +25) · Aug 2025
195 · OpenAI · o3 · 754 (-30 / +29) · Apr 2025
196 · Prime Intellect · INTELLECT-3 · 752 (-27 / +25) · Nov 2025
197 · OpenAI · o3-mini (high) · 748 (-25 / +27) · Jan 2025
198 · Google · Gemini 2.5 Flash (Non-reasoning) · 742 (-28 / +28) · May 2025
199 · Sarvam · Sarvam 105B (high) · 740 (-24 / +22) · Mar 2026
200 · Alibaba · Qwen3 235B A22B (Non-reasoning) · 740 (-29 / +27) · Apr 2025
201 · OpenAI · o1 · 737 (-27 / +26) · Dec 2024
202 · Alibaba · Qwen3 Next 80B A3B (Reasoning) · 727 (-27 / +25) · Sep 2025
203 · Alibaba · Qwen3.5 9B (Reasoning) · 717 (-22 / +22) · Mar 2026
204 · Alibaba · Qwen3 VL 235B A22B (Reasoning) · 716 (-27 / +27) · Sep 2025
205 · Alibaba · Qwen3 Coder 30B A3B Instruct · 713 (-27 / +24) · Jul 2025
206 · Anthropic · Claude 3.5 Haiku · 708 (-25 / +24) · Oct 2024
207 · Google · Gemini 2.5 Flash (Reasoning) · 699 (-31 / +29) · May 2025
208 · Mistral · Devstral Medium · 693 (-27 / +25) · Jul 2025
209 · Z AI · GLM-4.6V (Non-reasoning) · 693 (-28 / +26) · Dec 2025
210 · InclusionAI · Ring-1T · 688 (-26 / +26) · Oct 2025
211 · Alibaba · Qwen3 VL 8B Instruct · 684 (-36 / +39) · Oct 2025
212 · DeepSeek · DeepSeek R1 0528 (May '25) · 682 (-28 / +26) · May 2025
213 · Naver · HyperCLOVA X SEED Think (32B) · 681 (-26 / +24) · Dec 2025
214 · Upstage · Solar Pro 3 · 677 (-24 / +24) · Apr 2026
215 · Mistral · Magistral Small 1.2 · 670 (-26 / +25) · Sep 2025
216 · Alibaba · Qwen3.5 4B (Non-reasoning) · 670 (-25 / +23) · Mar 2026
217 · Alibaba · Qwen3 VL 8B (Reasoning) · 670 (-27 / +26) · Oct 2025
218 · xAI · Grok 3 · 668 (-27 / +27) · Feb 2025
219 · Alibaba · Qwen3 VL 30B A3B (Reasoning) · 668 (-38 / +37) · Oct 2025
220 · Mistral · Magistral Medium 1 · 667 (-28 / +27) · Jun 2025
221 · Upstage · Solar Open 100B (Reasoning) · 666 (-31 / +27) · Dec 2025
222 · Alibaba · Qwen3 30B A3B 2507 (Reasoning) · 664 (-26 / +24) · Jul 2025
223 · Amazon · Nova 2.0 Pro Preview (low) · 661 (-28 / +29) · Nov 2025
224 · Mistral · Ministral 3 14B · 658 (-28 / +27) · Dec 2025
225 · OpenAI · gpt-oss-20B (high) · 653 (-26 / +25) · Aug 2025
226 · Alibaba · Qwen3 VL 32B (Reasoning) · 648 (-29 / +28) · Oct 2025
227 · Amazon · Nova 2.0 Lite (medium) · 644 (-25 / +25) · Oct 2025
228 · Korea Telecom · Mi:dm K 2.5 Pro · 643 (-27 / +26) · Dec 2025
229 · Mistral · Ministral 3 8B · 640 (-29 / +29) · Dec 2025
230 · Alibaba · Qwen3 VL 235B A22B Instruct · 637 (-37 / +39) · Sep 2025
231 · Mistral · Magistral Medium 1.2 · 629 (-28 / +26) · Sep 2025
232 · Alibaba · Qwen3 Next 80B A3B Instruct · 627 (-27 / +27) · Sep 2025
233 · OpenAI · GPT-4.1 mini · 621 (-28 / +28) · Apr 2025
234 · DeepSeek · DeepSeek V3.1 (Reasoning) · 613 (-29 / +25) · Aug 2025
235 · Z AI · GLM-4.6V (Reasoning) · 610 (-29 / +29) · Dec 2025
236 · MBZUAI Institute of Foundation Models · K2 Think V2 · 608 (-27 / +26) · Dec 2025
237 · OpenAI · GPT-5 nano (medium) · 595 (-28 / +26) · Aug 2025
238 · Alibaba · Qwen3 4B 2507 (Reasoning) · 590 (-28 / +27) · Aug 2025
239 · Mistral · Mistral Medium 3 · 587 (-29 / +27) · May 2025
240 · Nous Research · Hermes 4 - Llama-3.1 405B (Reasoning) · 587 (-25 / +25) · Aug 2025
241 · MBZUAI Institute of Foundation Models · K2-V2 (medium) · 582 (-28 / +27) · Dec 2025
242 · ServiceNow · Apriel-v1.6-15B-Thinker · 575 (-27 / +27) · Nov 2025
243 · Google · Gemini 2.0 Flash (Feb '25) · 571 (-27 / +24) · Feb 2025
244 · Mistral · Devstral Small (Jul '25) · 565 (-30 / +29) · Jul 2025
245 · NVIDIA · NVIDIA Nemotron 3 Nano 30B A3B (Reasoning) · 565 (-27 / +28) · Dec 2025
246 · MBZUAI Institute of Foundation Models · K2-V2 (high) · 562 (-29 / +26) · Dec 2025
247 · Z AI · GLM-4.5-Air · 560 (-31 / +29) · Jul 2025
248 · OpenAI · gpt-oss-20B (low) · 550 (-27 / +26) · Aug 2025
249 · IBM · Granite 4.1 8B · 542 (-27 / +26) · Apr 2026
250 · Nous Research · Hermes 4 - Llama-3.1 70B (Reasoning) · 539 (-23 / +23) · Aug 2025
251 · Kimi · Kimi K2 · 527 (-34 / +31) · Jul 2025
252 · Nous Research · Hermes 4 - Llama-3.1 70B (Non-reasoning) · 523 (-24 / +24) · Aug 2025
253 · Alibaba · Qwen3 30B A3B 2507 Instruct · 516 (-27 / +28) · Jul 2025
254 · Z AI · GLM-4.5V (Reasoning) · 511 (-24 / +22) · Aug 2025
255 · Nous Research · Hermes 4 - Llama-3.1 405B (Non-reasoning) · 511 (-24 / +23) · Aug 2025
256 · Alibaba · Qwen3.5 4B (Reasoning) · 511 (-28 / +29) · Mar 2026
257 · Amazon · Nova 2.0 Lite (low) · 509 (-28 / +27) · Oct 2025
258 · Alibaba · Qwen3 Coder 480B A35B Instruct · 507 (-31 / +28) · Jul 2025
259 · Amazon · Nova Premier · 507 (-30 / +29) · Apr 2025
260 · Alibaba · Qwen3 30B A3B (Reasoning) · 503 (-27 / +26) · Apr 2025
261 · LG AI Research · EXAONE 4.0 32B (Reasoning) · 502 (-29 / +27) · Jul 2025
262 · Allen Institute for AI · Molmo2-8B · 500 (-0 / +0) · Dec 2025
263 · DeepSeek · DeepSeek V3.2 Speciale · 500 (-0 / +0) · Dec 2025
264 · Alibaba · Qwen3 8B (Reasoning) · 498 (-28 / +27) · Apr 2025
265 · Alibaba · Qwen3 VL 30B A3B Instruct · 498 (-30 / +28) · Oct 2025
266 · IBM · Granite 4.1 30B · 497 (-26 / +24) · Apr 2026
267 · Alibaba · Qwen3 Omni 30B A3B (Reasoning) · 497 (-28 / +25) · Sep 2025
268 · Alibaba · Qwen3 32B (Reasoning) · 491 (-28 / +25) · Apr 2025
269 · Motif Technologies · Motif-2-12.7B-Reasoning · 485 (-30 / +27) · Dec 2025
270 · Mistral · Ministral 3 3B · 485 (-29 / +27) · Dec 2025
271 · NVIDIA · NVIDIA Nemotron 3 Nano 4B · 478 (-30 / +30) · Mar 2026
272 · Alibaba · Qwen3 14B (Reasoning) · 477 (-26 / +28) · Apr 2025
273 · Alibaba · Qwen3 14B (Non-reasoning) · 474 (-28 / +27) · Apr 2025
274 · Alibaba · Qwen3 8B (Non-reasoning) · 470 (-27 / +27) · Apr 2025
275 · OpenAI · GPT-5 mini (minimal) · 470 (-31 / +31) · Aug 2025
276 · Z AI · GLM-4.5 (Reasoning) · 469 (-34 / +32) · Jul 2025
277 · Z AI · GLM-4.5V (Non-reasoning) · 461 (-31 / +28) · Aug 2025
278 · Upstage · Solar Pro 2 (Reasoning) · 451 (-31 / +27) · Jul 2025
279 · Upstage · Solar Pro 2 (Non-reasoning) · 447 (-29 / +27) · Jul 2025
280 · NVIDIA · NVIDIA Nemotron Nano 9B V2 (Reasoning) · 440 (-28 / +28) · Aug 2025
281 · Google · Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning) · 438 (-30 / +30) · Sep 2025
282 · Meta · Llama 4 Maverick · 438 (-29 / +28) · Apr 2025
283 · InclusionAI · Ling-flash-2.0 · 420 (-31 / +28) · Sep 2025
284 · xAI · Grok 3 mini Reasoning (high) · 420 (-39 / +39) · Feb 2025
285 · DeepSeek · DeepSeek V3 (Dec '24) · 409 (-27 / +29) · Dec 2024
286 · DeepSeek · DeepSeek V3 0324 · 408 (-29 / +29) · Mar 2025
287 · Meta · Llama 3.3 Instruct 70B · 401 (-33 / +28) · Dec 2024
288 · InclusionAI · Ling-1T · 400 (-27 / +28) · Oct 2025
289 · OpenAI · GPT-5 (minimal) · 388 (-29 / +32) · Aug 2025
290 · Amazon · Nova Pro · 388 (-29 / +28) · Dec 2024
291 · Google · Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning) · 382 (-28 / +30) · Sep 2025
292 · Amazon · Nova 2.0 Lite (Non-reasoning) · 381 (-29 / +30) · Oct 2025
293 · NVIDIA · Llama Nemotron Super 49B v1.5 (Non-reasoning) · 380 (-27 / +28) · Jul 2025
294 · Anthropic · Claude 3 Haiku · 379 (-26 / +22) · Mar 2024
295 · OpenAI · GPT-4o (Aug '24) · 378 (-27 / +27) · Aug 2024
296 · Trillion Labs · Tri-21B-Think · 374 (-25 / +24) · Feb 2026
297 · TII UAE · Falcon-H1R-7B · 373 (-32 / +29) · Jan 2026
298 · NVIDIA · Llama Nemotron Super 49B v1.5 (Reasoning) · 369 (-30 / +29) · Jul 2025
299 · MBZUAI Institute of Foundation Models · K2-V2 (low) · 368 (-30 / +28) · Dec 2025
300 · IBM · Granite 4.1 3B · 365 (-27 / +25) · Apr 2026
301 · Amazon · Nova 2.0 Omni (low) · 361 (-33 / +30) · Nov 2025
302 · Sarvam · Sarvam 30B (high) · 360 (-26 / +23) · Mar 2026
303 · Allen Institute for AI · Olmo 3.1 32B Instruct · 358 (-31 / +27) · Jan 2026
304 · Nanbeige · Nanbeige4.1-3B · 357 (-31 / +29) · Feb 2026
305 · OpenAI · GPT-4o (Nov '24) · 349 (-25 / +24) · Nov 2024
306 · NVIDIA · NVIDIA Nemotron 3 Nano 30B A3B (Non-reasoning) · 349 (-29 / +28) · Dec 2025
307 · Amazon · Nova Lite · 344 (-32 / +27) · Dec 2024
308 · IBM · Granite 4.0 H Small · 344 (-30 / +27) · Sep 2025
309 · Alibaba · Qwen3 VL 4B Instruct · 343 (-26 / +29) · Oct 2025
310 · Amazon · Nova Micro · 340 (-31 / +34) · Dec 2024
311 · Mistral · Mistral Small 3.1 · 338 (-28 / +27) · Mar 2025
312 · NVIDIA · Llama 3.1 Nemotron Instruct 70B · 337 (-30 / +31) · Oct 2024
313 · Trillion Labs · Tri-21B-think Preview · 337 (-33 / +30) · Feb 2026
314 · Alibaba · Qwen3 30B A3B (Non-reasoning) · 332 (-28 / +28) · Apr 2025
315 · LG AI Research · EXAONE 4.0 32B (Non-reasoning) · 330 (-33 / +29) · Jul 2025
316 · NVIDIA · NVIDIA Nemotron Nano 12B v2 VL (Reasoning) · 328 (-28 / +27) · Oct 2025
317 · Mistral · Mistral Large 2 (Nov '24) · 325 (-31 / +30) · Nov 2024
318 · Alibaba · Qwen3.5 2B (Reasoning) · 322 (-25 / +22) · Mar 2026
319 · Google · Gemini 2.5 Flash-Lite (Reasoning) · 320 (-32 / +27) · Jun 2025
320 · OpenAI · GPT-4.1 nano · 320 (-29 / +30) · Apr 2025
321 · Amazon · Nova 2.0 Pro Preview (Non-reasoning) · 319 (-32 / +26) · Nov 2025
322 · Alibaba · Qwen3 0.6B (Reasoning) · 315 (-30 / +29) · Apr 2025
323 · Alibaba · Qwen3 4B 2507 Instruct · 307 (-32 / +28) · Aug 2025
324 · Amazon · Nova 2.0 Omni (Non-reasoning) · 305 (-32 / +28) · Nov 2025
325 · Mistral · Mistral Small 3.2 · 304 (-31 / +30) · Jun 2025
326 · Google · Gemini 2.5 Flash-Lite (Non-reasoning) · 304 (-30 / +32) · Jun 2025
327 · NVIDIA · NVIDIA Nemotron Nano 9B V2 (Non-reasoning) · 303 (-30 / +30) · Aug 2025
328 · Google · Gemma 4 E4B (Reasoning) · 303 (-25 / +23) · Apr 2026
329 · Alibaba · Qwen3 VL 32B Instruct · 301 (-32 / +29) · Oct 2025
330 · LG AI Research · Exaone 4.0 1.2B (Non-reasoning) · 296 (-31 / +29) · Jul 2025
331 · Alibaba · Qwen3 Omni 30B A3B Instruct · 296 (-34 / +30) · Sep 2025
332 · LG AI Research · Exaone 4.0 1.2B (Reasoning) · 294 (-29 / +27) · Jul 2025
333 · IBM · Granite 4.0 H 350M · 293 (-32 / +26) · Oct 2025
334 · Google · Gemma 4 E4B (Non-reasoning) · 292 (-26 / +24) · Apr 2026
335 · Google · Gemma 3 27B Instruct · 287 (-31 / +30) · Mar 2025
336 · NVIDIA · NVIDIA Nemotron Nano 12B v2 VL (Non-reasoning) · 287 (-28 / +28) · Oct 2025
337 · AI21 Labs · Jamba 1.7 Large · 285 (-30 / +29) · Jul 2025
338 · Meta · Llama 3.1 Instruct 70B · 284 (-31 / +27) · Jul 2024
339 · OpenAI · GPT-5 nano (minimal) · 283 (-32 / +31) · Aug 2025
340 · Google · Gemma 3 12B Instruct · 280 (-31 / +28) · Mar 2025
341 · IBM · Granite 4.0 Micro · 279 (-30 / +26) · Sep 2025
342 · Alibaba · Qwen3 0.6B (Non-reasoning) · 279 (-33 / +29) · Apr 2025
343 · Meta · Llama 3.1 Instruct 8B · 278 (-34 / +29) · Jul 2024
344 · Alibaba · Qwen3.5 0.8B (Reasoning) · 277 (-25 / +24) · Mar 2026
345 · Allen Institute for AI · Olmo 3 7B Instruct · 277 (-29 / +28) · Nov 2025
346 · Cohere · Command A · 277 (-28 / +26) · Mar 2025
347 · AI21 Labs · Jamba 1.7 Mini · 276 (-31 / +28) · Jul 2025
348 · Alibaba · Qwen3 1.7B (Reasoning) · 275 (-33 / +30) · Apr 2025
349 · Liquid AI · LFM2 1.2B · 272 (-30 / +31) · Jul 2025
350 · Meta · Llama 4 Scout · 272 (-30 / +30) · Apr 2025
351 · Google · Gemma 4 E2B (Reasoning) · 272 (-26 / +23) · Apr 2026
352 · IBM · Granite 4.0 H 1B · 270 (-32 / +28) · Oct 2025
353 · IBM · Granite 4.0 350M · 269 (-33 / +29) · Oct 2025
354 · Liquid AI · LFM2.5-1.2B-Instruct · 267 (-33 / +32) · Jan 2026
355 · InclusionAI · Ling-mini-2.0 · 263 (-25 / +24) · Sep 2025
356 · OpenBMB · MiniCPM-V 4.6 1.3B · 261 (-31 / +26) · May 2026
357 · IBM · Granite 4.0 1B · 259 (-31 / +31) · Oct 2025
358 · Liquid AI · LFM2 8B A1B · 259 (-33 / +29) · Oct 2025
359 · StepFun · Step3 VL 10B · 258 (-32 / +28) · Jan 2026
360 · Alibaba · Qwen3 1.7B (Non-reasoning) · 257 (-33 / +29) · Apr 2025
361 · Meta · Llama 3.1 Instruct 405B · 256 (-31 / +29) · Jul 2024
362 · Google · Gemma 3 4B Instruct · 256 (-30 / +29) · Mar 2025
363 · AI21 Labs · Jamba Reasoning 3B · 254 (-29 / +27) · Oct 2025
364 · Google · Gemma 4 E2B (Non-reasoning) · 253 (-27 / +25) · Apr 2026
365 · Liquid AI · LFM2.5-1.2B-Thinking · 252 (-32 / +28) · Jan 2026
366 · DeepSeek · DeepSeek R1 (Jan '25) · 249 (-29 / +27) · Jan 2025
367 · Google · Gemma 3n E4B Instruct · 245 (-32 / +30) · Jun 2025
368 · NVIDIA · Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) · 239 (-25 / +24) · Apr 2025
369 · Alibaba · Qwen3.5 2B (Non-reasoning) · 238 (-24 / +25) · Mar 2026
370 · Liquid AI · LFM2 2.6B · 236 (-30 / +29) · Sep 2025
371 · Liquid AI · LFM2 24B A2B · 235 (-24 / +25) · Feb 2026
372 · Alibaba · Qwen3.5 0.8B (Non-reasoning) · 234 (-26 / +26) · Mar 2026
373 · Liquid AI · LFM2.5-VL-1.6B · 234 (-34 / +28) · Jan 2026
374 · Microsoft · Phi-4 Mini Instruct · 231 (-30 / +27) · Feb 2024
375 · IBM · Granite 3.3 8B (Non-reasoning) · 225 (-31 / +28) · Apr 2025
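
Each leaderboard entry lists an Elo followed by a "- / +" pair. The page does not define that pair. The Python sketch below assumes it gives the lower and upper margins of an uncertainty interval around the Elo, and checks whether two entries' intervals overlap. The `Entry` type and `intervals_overlap` helper are hypothetical names introduced for illustration; the numbers come from ranks 1, 3, and 4 above.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One leaderboard row (hypothetical container)."""
    model: str
    elo: int
    minus: int  # the "-" margin, assumed to be the distance to the lower bound
    plus: int   # the "+" margin, assumed to be the distance to the upper bound

    @property
    def low(self) -> int:
        return self.elo - self.minus

    @property
    def high(self) -> int:
        return self.elo + self.plus


def intervals_overlap(a: Entry, b: Entry) -> bool:
    """True if the two assumed intervals [low, high] share any point."""
    return a.low <= b.high and b.low <= a.high


rank_1 = Entry("GPT-5.5 (xhigh)", 1769, 32, 31)
rank_3 = Entry("Claude Opus 4.7 (Adaptive Reasoning, Max Effort)", 1753, 41, 40)
rank_4 = Entry("Claude Opus 4.7 (Non-reasoning, High Effort)", 1683, 26, 28)

top_overlap = intervals_overlap(rank_1, rank_3)   # [1737, 1800] vs [1712, 1793]
gap_overlap = intervals_overlap(rank_3, rank_4)   # [1712, 1793] vs [1657, 1711]
```

Under this reading, the intervals for ranks 1 and 3 overlap, so their ordering is less certain than the point scores suggest. Ranks 3 and 4 do not overlap. Overlap is only a rough heuristic, not a significance test.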

Frequently Asked Questions

GDPval-AA is Artificial Analysis' evaluation based on OpenAI's GDPval dataset, which tests AI models on real-world economically valuable tasks across 44 occupations and 9 major industries.

GDPval-AA compares model submissions head-to-head on the same task. For each matchup, the two outputs are anonymized and an LLM judge picks a winner. These blind pairwise results are aggregated into an Elo rating per model.
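
To illustrate how pairwise wins can turn into ratings, here is a minimal Python sketch of the classic sequential Elo update. It is not Artificial Analysis' procedure, which this page does not specify. The scale of 400, the K-factor of 32, the starting rating of 1000, and the names `expected_score` and `elo_from_matchups` are all assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A beats B under the conventional logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def elo_from_matchups(matchups: Iterable[Tuple[str, str]],
                      k: float = 32.0,
                      initial: float = 1000.0) -> Dict[str, float]:
    """Aggregate (winner, loser) pairs into per-model ratings.

    Each pair stands for one blind comparison in which the judge picked
    `winner`. Ratings are updated one matchup at a time.
    """
    ratings: Dict[str, float] = defaultdict(lambda: initial)
    for winner, loser in matchups:
        win_prob = expected_score(ratings[winner], ratings[loser])
        delta = k * (1.0 - win_prob)  # larger gain for an unexpected win
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical judge verdicts between placeholder models.
single = elo_from_matchups([("model_a", "model_b")])
several = elo_from_matchups([
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
])
```

After one matchup between two equally rated models, the winner moves to 1016 and the loser to 984. Sequential updates depend on the order of the matchups, so a leaderboard may instead fit all results jointly, for example with a Bradley-Terry model. Which approach GDPval-AA uses is not stated here.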

GPT-5.5 (xhigh) has the highest GDPval-AA Elo rating, 1,769, among models with published GDPval-AA results.

GDPval-AA covers real-world professional tasks across a range of occupations and industries, producing outputs such as documents, spreadsheets, slides, and diagrams. Generating these deliverables generally requires interacting with a sandbox filesystem through shell access and using web search, capabilities the model is given through the Stirrup agentic harness.

Most benchmarks test short-answer or multiple-choice responses. GDPval-AA instead evaluates complete deliverables: models operate in an agentic environment with tools, produce file outputs, and have their submissions scored through pairwise grading on relative quality.

Explore Evaluations

Artificial Analysis Intelligence Index

A composite benchmark aggregating ten challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.

APEX-Agents-AA Benchmark Leaderboard

Artificial Analysis' implementation of the APEX-Agents benchmark, testing AI agents on long-horizon, cross-application tasks in professional-services environments with realistic application tooling.

𝜏²-Bench Telecom Benchmark Leaderboard

A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.

Terminal-Bench Hard Benchmark Leaderboard

An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks.

SciCode Benchmark Leaderboard

A scientist-curated coding benchmark featuring 288 test set subproblems from 80 laboratory problems across 16 scientific disciplines.

Artificial Analysis Long Context Reasoning Benchmark Leaderboard

A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).

AA-Omniscience: Knowledge and Hallucination Benchmark

A benchmark measuring factual recall and hallucination across various economically relevant domains.

IFBench Benchmark Leaderboard

A benchmark evaluating precise instruction-following generalization on 58 diverse, verifiable out-of-domain constraints that test models' ability to follow specific output requirements.

Humanity's Last Exam Benchmark Leaderboard

A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities, designed to be the final closed-ended academic evaluation.

GPQA Diamond Benchmark Leaderboard

The most challenging 198 questions from GPQA, where PhD experts achieve 65% accuracy but skilled non-experts only reach 34% despite web access.

CritPt Benchmark Leaderboard

A benchmark designed to test LLMs on research-level physics reasoning tasks, featuring 71 composite research challenges.

Artificial Analysis Openness Index

A composite measure providing an industry standard to communicate model openness for users and developers.

MMLU-Pro Benchmark Leaderboard

An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.

Global-MMLU-Lite Benchmark Leaderboard

A lightweight, multilingual version of MMLU, designed to evaluate knowledge and reasoning skills across a diverse range of languages and cultural contexts.

LiveCodeBench Benchmark Leaderboard

A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces, evaluating code generation, self-repair, and execution.

MATH-500 Benchmark Leaderboard

A 500-problem subset from the MATH dataset, featuring competition-level mathematics across six domains including algebra, geometry, and number theory.

AIME 2025 Benchmark Leaderboard

All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000-999.

MMMU-Pro Benchmark Leaderboard

An enhanced MMMU benchmark that eliminates shortcuts and guessing strategies to more rigorously test multimodal models across 30 academic disciplines.