GDPval-AA Leaderboard

GDPval-AA is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with Elo ratings derived from blind pairwise comparisons.

The public GDPval gold subset includes 220 tasks developed by OpenAI in collaboration with industry professionals to reflect real-world complexity; the full dataset described in the paper comprises 1,320 tasks.
The benchmark requires models to produce diverse outputs including documents, slides, diagrams, and spreadsheets, mirroring actual work products across finance, healthcare, legal, and other professional domains.

All evaluations are conducted independently by Artificial Analysis. More information can be found on our Intelligence Benchmarking Methodology page.

Publication

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, Jerry Tworek.

We introduce GDPval, a benchmark designed to evaluate AI models on real-world, economically valuable tasks across 44 occupations. The dataset encompasses 1,320 tasks derived from nine major industries contributing significantly to the U.S. GDP. These tasks were developed in collaboration with industry professionals averaging 14 years of experience, ensuring they accurately represent real-world complexities. The evaluation requires models to produce diverse outputs, including documents, slides, diagrams, and spreadsheets, mirroring actual work products. Initial results indicate that frontier AI models are approaching the quality of work produced by human experts, with models able to perform certain professional tasks approximately 100 times faster and at a fraction of the cost compared to human experts.

GDPval

Claude Opus 4.8 (Adaptive Reasoning, Max Effort) scores highest on GDPval-AA with an Elo of 1890, followed by GPT-5.5 (xhigh) at 1769 and GPT-5.5 (high) at 1753.

GDPval-AA Elo

GDPval-AA Leaderboard

Elo scores for agentic performance on real-world work tasks using web and shell access via Stirrup, an open-source harness developed by Artificial Analysis.
Chart series: Stirrup Agent Harness, AI Chatbot.

Chatbots

GDPval-AA: AI Chatbots

Elo scores for AI chatbots tested in the GDPval-AA evaluation

Score Comparisons

GDPval-AA: Elo vs. Artificial Analysis Intelligence Index

Scatter plot: GDPval-AA Elo against Artificial Analysis Intelligence Index, with the most attractive quadrant highlighted.
Model creators shown: Alibaba, Amazon, Anthropic, DeepSeek, Google, Kimi, MBZUAI Institute of Foundation Models, Meta, MiniMax, Mistral, NVIDIA, OpenAI, Upstage, xAI, Xiaomi, Z AI.

Artificial Analysis Intelligence Index v4.0 includes: GDPval-AA, 𝜏²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, CritPt. See Intelligence Index methodology for further details, including a breakdown of each evaluation and how we run them.

Token Usage

GDPval-AA: Output Token Usage

Output tokens used to run the evaluation, split into reasoning tokens and answer tokens.

The total number of tokens used to run the evaluation, including input tokens (prompt), reasoning tokens (for reasoning models), and answer tokens (final response).
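The token accounting described above can be sketched as a small record type. This is only an illustration: the class name, field names, and the sample counts are hypothetical, and treating output tokens as reasoning plus answer tokens is an assumption based on the chart's two series.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Hypothetical per-run token counts; not an Artificial Analysis schema."""

    input_tokens: int      # prompt tokens sent to the model
    reasoning_tokens: int  # tokens spent on reasoning (0 for non-reasoning models)
    answer_tokens: int     # tokens in the final response

    @property
    def output_tokens(self) -> int:
        # Assumption: output tokens are reasoning tokens plus answer tokens.
        return self.reasoning_tokens + self.answer_tokens

    @property
    def total_tokens(self) -> int:
        # Total as described in the text: input + reasoning + answer.
        return self.input_tokens + self.output_tokens


# Made-up numbers, purely for illustration.
usage = TokenUsage(input_tokens=1000, reasoning_tokens=400, answer_tokens=100)
print(usage.output_tokens, usage.total_tokens)
```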

Average Turns

GDPval-AA: Average Turns per Task

Average number of turns per task

Score vs. Release Date

GDPval-AA: Elo vs. Release Date

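Scatter plot: GDPval-AA Elo against release date, with the most attractive region highlighted.
Model creators shown: Alibaba, Amazon, Anthropic, DeepSeek, Google, Kimi, MBZUAI Institute of Foundation Models, Meta, MiniMax, Mistral, NVIDIA, OpenAI, Upstage, xAI, Xiaomi, Z AI.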

GDPval-AA Leaderboard

A dash (-) marks a field the leaderboard leaves blank.

| Rank | Creator | Model | Elo | Interval (- / +) | Released |
|---|---|---|---|---|---|
| 1 | Anthropic | Claude Opus 4.8 (Adaptive Reasoning, Max Effort) | 1890 | -34 / +35 | May 2026 |
| 2 | OpenAI | GPT-5.5 (xhigh) | 1769 | -32 / +31 | Apr 2026 |
| 3 | OpenAI | GPT-5.5 (high) | 1753 | -28 / +31 | Apr 2026 |
| 4 | Anthropic | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | 1753 | -41 / +40 | Apr 2026 |
| 5 | Anthropic | Claude Opus 4.7 (Non-reasoning, High Effort) | 1678 | -27 / +28 | Apr 2026 |
| 6 | Anthropic | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | 1676 | -26 / +29 | Feb 2026 |
| 7 | OpenAI | GPT-5.4 (xhigh) | 1674 | -34 / +32 | Mar 2026 |
| 8 | Google | Gemini 3.5 Flash (high) | 1656 | -26 / +30 | May 2026 |
| 9 | Google | Gemini 3.5 Flash (medium) | 1655 | -27 / +30 | May 2026 |
| 10 | OpenAI | GPT-5.5 (medium) | 1652 | -26 / +28 | Apr 2026 |
| 11 | Anthropic | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) | 1619 | -31 / +33 | Feb 2026 |
| 12 | Anthropic | Claude Sonnet 4.6 (Non-reasoning, High Effort) | 1596 | -26 / +25 | Feb 2026 |
| 13 | Anthropic | Claude Opus 4.6 (Non-reasoning, High Effort) | 1591 | -23 / +27 | Feb 2026 |
| 14 | Xiaomi | MiMo-V2.5-Pro | 1571 | -27 / +28 | Apr 2026 |
| 15 | DeepSeek | DeepSeek V4 Pro (Reasoning, High Effort) | 1558 | -29 / +31 | Apr 2026 |
| 16 | DeepSeek | DeepSeek V4 Pro (Reasoning, Max Effort) | 1554 | -29 / +29 | Apr 2026 |
| 17 | Xiaomi | MiMo-V2.5 | 1549 | -25 / +26 | Apr 2026 |
| 18 | Alibaba | Qwen3.7 Max | 1547 | -27 / +28 | May 2026 |
| 19 | Z AI | GLM-5.1 (Reasoning) | 1535 | -0 / +0 | Apr 2026 |
| 20 | MiniMax | MiniMax-M2.7 | 1505 | -24 / +26 | Mar 2026 |
| 21 | Alibaba | Qwen3.6 Max Preview | 1504 | -20 / +21 | Apr 2026 |
| 22 | OpenAI | GPT-5.4 (low) | 1503 | -27 / +29 | Mar 2026 |
| 23 | Z AI | GLM-5-Turbo | 1496 | -24 / +22 | Mar 2026 |
| 24 | xAI | Grok 4.3 (high) | 1495 | -25 / +23 | Apr 2026 |
| 25 | Z AI | GLM-5.1 (Non-reasoning) | 1493 | -28 / +27 | Apr 2026 |
| 26 | Kimi | Kimi K2.6 | 1481 | -25 / +26 | Apr 2026 |
| 27 | OpenAI | GPT-5.3 Codex (xhigh) | 1477 | -22 / +26 | Feb 2026 |
| 28 | DeepSeek | DeepSeek V4 Pro (Non-reasoning) | 1476 | -24 / +26 | Apr 2026 |
| 29 | OpenAI | GPT-5.2 (xhigh) | 1467 | -25 / +25 | Dec 2025 |
| 30 | Anthropic | Claude Sonnet 4.6 (Non-reasoning, Low Effort) | 1455 | -25 / +25 | Feb 2026 |
| 31 | Anthropic | Claude Opus 4.5 (Reasoning) | 1453 | -24 / +25 | Nov 2025 |
| 32 | OpenAI | GPT-5.5 (low) | 1443 | -23 / +24 | Apr 2026 |
| 33 | Google | Gemini 3.5 Flash (minimal) | 1440 | -26 / +24 | May 2026 |
| 34 | OpenAI | GPT-5.4 mini (xhigh) | 1438 | -23 / +26 | Mar 2026 |
| 35 | Anthropic | Claude Opus 4.5 (Non-reasoning) | 1420 | -22 / +23 | Nov 2025 |
| 36 | Meta | Muse Spark | 1417 | -23 / +23 | Apr 2026 |
| 37 | DeepSeek | DeepSeek V4 Flash (Reasoning, High Effort) | 1414 | -25 / +26 | Apr 2026 |
| 38 | Xiaomi | MiMo-V2-Pro | 1407 | -22 / +24 | Mar 2026 |
| 39 | OpenAI | GPT-5.2 (medium) | 1405 | -23 / +23 | Dec 2025 |
| 40 | Alibaba | Qwen3.6 27B (Reasoning) | 1404 | -22 / +23 | Apr 2026 |
| 41 | Z AI | GLM-5 (Reasoning) | 1394 | -23 / +23 | Feb 2026 |
| 42 | DeepSeek | DeepSeek V4 Flash (Non-reasoning) | 1390 | -25 / +26 | Apr 2026 |
| 43 | DeepSeek | DeepSeek V4 Flash (Reasoning, Max Effort) | 1388 | -20 / +34 | Apr 2026 |
| 44 | Alibaba | Qwen3.6 27B (Non-reasoning) | 1385 | -23 / +24 | Apr 2026 |
| 45 | Alibaba | Qwen3.6 Plus | 1351 | -24 / +24 | Apr 2026 |
| 46 | Xiaomi | MiMo-V2-Omni-0327 | 1346 | -24 / +23 | Mar 2026 |
| 47 | OpenAI | GPT-5.4 (Non-reasoning) | 1342 | -26 / +26 | Mar 2026 |
| 48 | Z AI | GLM 5V Turbo (Reasoning) | 1330 | -22 / +23 | Apr 2026 |
| 49 | Kimi | Kimi K2.6 (Non-reasoning) | 1325 | -26 / +29 | Apr 2026 |
| 50 | Z AI | GLM-5 (Non-reasoning) | 1324 | -21 / +23 | Feb 2026 |
| 51 | Google | Gemini 3 Deep Think | 1324 | -30 / +31 | Feb 2026 |
| 52 | - | Claude Pro - 4.5 Opus (Extended Thinking) | 1319 | -41 / +38 | - |
| 53 | OpenAI | GPT-5.4 mini (medium) | 1319 | -22 / +24 | Mar 2026 |
| 54 | Xiaomi | MiMo-V2-Omni | 1318 | -23 / +24 | Mar 2026 |
| 55 | Anthropic | Claude 4.5 Sonnet (Reasoning) | 1318 | -24 / +25 | Sep 2025 |
| 56 | OpenAI | GPT-5.5 (Non-reasoning) | 1316 | -25 / +23 | Apr 2026 |
| 57 | Google | Gemini 3.1 Pro Preview | 1314 | -26 / +27 | Feb 2026 |
| 58 | xAI | Grok 4.3 (medium) | 1312 | -26 / +24 | Apr 2026 |
| 59 | Anthropic | Claude 4.5 Sonnet (Non-reasoning) | 1307 | -22 / +26 | Sep 2025 |
| 60 | xAI | Grok 4.3 (Non-reasoning) | 1299 | -24 / +25 | Apr 2026 |
| 61 | Alibaba | Qwen3.6 35B A3B (Reasoning) | 1298 | -22 / +23 | Apr 2026 |
| 62 | Xiaomi | MiMo-V2.5-Pro (Non-reasoning) | 1296 | -22 / +25 | Apr 2026 |
| 63 | OpenAI | GPT-5 (high) | 1294 | -21 / +22 | Aug 2025 |
| 64 | OpenAI | GPT-5.2 Codex (xhigh) | 1289 | -27 / +27 | Dec 2025 |
| 65 | Kimi | Kimi K2.5 (Reasoning) | 1285 | -23 / +23 | Jan 2026 |
| 66 | Kimi | Kimi K2.5 (Non-reasoning) | 1265 | -24 / +22 | Jan 2026 |
| 67 | Tencent | Hy3-preview (Reasoning) | 1237 | -25 / +23 | Apr 2026 |
| 68 | Tencent | Hy3-preview (Non-reasoning) | 1227 | -25 / +25 | Apr 2026 |
| 69 | OpenAI | GPT-5.1 (high) | 1227 | -21 / +22 | Nov 2025 |
| 70 | Alibaba | Qwen3.6 35B A3B (Non-reasoning) | 1223 | -24 / +24 | Apr 2026 |
| 71 | OpenAI | GPT-5.2 (Non-reasoning) | 1221 | -22 / +24 | Dec 2025 |
| 72 | Alibaba | Qwen3.5 397B A17B (Non-reasoning) | 1220 | -22 / +23 | Feb 2026 |
| 73 | OpenAI | GPT-5 Codex (high) | 1215 | -24 / +22 | Sep 2025 |
| 74 | Google | Gemini 3 Flash Preview (Reasoning) | 1204 | -24 / +24 | Dec 2025 |
| 75 | OpenAI | GPT-5.4 nano (medium) | 1200 | -22 / +23 | Mar 2026 |
| 76 | DeepSeek | DeepSeek V3.2 (Reasoning) | 1197 | -24 / +22 | Dec 2025 |
| 77 | OpenAI | GPT-5.4 nano (xhigh) | 1194 | -24 / +25 | Mar 2026 |
| 78 | OpenAI | GPT-5.1 Codex (high) | 1192 | -25 / +25 | Nov 2025 |
| 79 | Alibaba | Qwen3.5 397B A17B (Reasoning) | 1190 | -23 / +22 | Feb 2026 |
| 80 | Z AI | GLM-4.7 (Reasoning) | 1186 | -22 / +23 | Dec 2025 |
| 81 | Google | Gemini 3 Pro Preview (high) | 1185 | -23 / +23 | Nov 2025 |
| 82 | Alibaba | Qwen3.5 Omni Plus | 1185 | -23 / +24 | Mar 2026 |
| 83 | OpenAI | GPT-5 mini (high) | 1185 | -22 / +22 | Aug 2025 |
| 84 | Z AI | GLM-4.7 (Non-reasoning) | 1177 | -22 / +23 | Dec 2025 |
| 85 | MiniMax | MiniMax-M2.5 | 1176 | -24 / +22 | Feb 2026 |
| 86 | Anthropic | Claude 4.5 Haiku (Reasoning) | 1171 | -24 / +25 | Oct 2025 |
| 87 | xAI | Grok 4.20 0309 v2 (Reasoning) | 1169 | -22 / +24 | Apr 2026 |
| 88 | Mistral | Mistral Medium 3.5 | 1168 | -25 / +24 | Apr 2026 |
| 89 | Google | Gemini 3 Pro Preview (low) | 1168 | -25 / +28 | Nov 2025 |
| 90 | Alibaba | Qwen3.5 27B (Non-reasoning) | 1162 | -22 / +22 | Feb 2026 |
| 91 | Alibaba | Qwen3.5 27B (Reasoning) | 1160 | -22 / +23 | Feb 2026 |
| 92 | - | ChatGPT Plus - 5.1 Thinking (Extended Thinking) | 1149 | -41 / +45 | - |
| 93 | OpenAI | GPT-5 (low) | 1148 | -23 / +23 | Aug 2025 |
| 94 | OpenAI | GPT-5.5 Instant (May 2026) | 1143 | -23 / +23 | May 2026 |
| 95 | Alibaba | Qwen3 Max Thinking | 1138 | -23 / +24 | Jan 2026 |
| 96 | Anthropic | Claude 4.5 Haiku (Non-reasoning) | 1136 | -25 / +26 | Oct 2025 |
| 97 | Anthropic | Claude 4 Sonnet (Reasoning) | 1134 | -24 / +26 | May 2025 |
| 98 | Anthropic | Claude 4 Sonnet (Non-reasoning) | 1126 | -24 / +24 | May 2025 |
| 99 | xAI | Grok 4.3 (low) | 1126 | -24 / +22 | Apr 2026 |
| 100 | InclusionAI | Ring-2.6-1T | 1125 | -23 / +26 | May 2026 |
| 101 | KwaiKAT | KAT Coder Pro V2 | 1120 | -24 / +23 | Mar 2026 |
| 102 | Google | Gemini 3 Flash Preview (Non-reasoning) | 1116 | -26 / +26 | Dec 2025 |
| 103 | Alibaba | Qwen3.5 122B A10B (Reasoning) | 1116 | -23 / +22 | Feb 2026 |
| 104 | Google | Gemma 4 31B (Reasoning) | 1113 | -22 / +23 | Apr 2026 |
| 105 | Alibaba | Qwen3.5 122B A10B (Non-reasoning) | 1110 | -21 / +24 | Feb 2026 |
| 106 | MiniMax | MiniMax-M2.1 | 1090 | -25 / +23 | Dec 2025 |
| 107 | Xiaomi | MiMo-V2-Flash (Reasoning) | 1081 | -23 / +24 | Dec 2025 |
| 108 | DeepSeek | DeepSeek V3.1 (Non-reasoning) | 1080 | -25 / +23 | Aug 2025 |
| 109 | China Mobile | JT-35B-Flash | 1078 | -23 / +22 | May 2026 |
| 110 | Google | Gemini 2.5 Flash Preview (Sep '25) (Reasoning) | 1072 | -23 / +25 | Sep 2025 |
| 111 | StepFun | Step 3.5 Flash 2603 | 1070 | -23 / +25 | Apr 2026 |
| 112 | DeepSeek | DeepSeek V3.2 Exp (Non-reasoning) | 1069 | -25 / +24 | Sep 2025 |
| 113 | Xiaomi | MiMo-V2-Flash (Non-reasoning) | 1061 | -23 / +24 | Dec 2025 |
| 114 | StepFun | Step 3.5 Flash | 1054 | -26 / +27 | Feb 2026 |
| 115 | OpenAI | GPT-5.1 Codex mini (high) | 1053 | -27 / +23 | Nov 2025 |
| 116 | Anthropic | Claude 3.7 Sonnet (Reasoning) | 1049 | -21 / +24 | Feb 2025 |
| 117 | Alibaba | Qwen3.5 35B A3B (Non-reasoning) | 1048 | -21 / +23 | Feb 2026 |
| 118 | Anthropic | Claude 3.7 Sonnet (Non-reasoning) | 1046 | -24 / +23 | Feb 2025 |
| 119 | xAI | Grok 4.20 0309 (Reasoning) | 1046 | -21 / +23 | Mar 2026 |
| 120 | InclusionAI | Ling-2.6-1T | 1045 | -23 / +23 | Apr 2026 |
| 121 | xAI | Grok 4.1 Fast (Reasoning) | 1045 | -23 / +24 | Nov 2025 |
| 122 | Xiaomi | MiMo-V2-Flash (Feb 2026) | 1045 | -23 / +24 | Dec 2025 |
| 123 | xAI | Grok 4.20 0309 v2 (Non-reasoning) | 1039 | -28 / +27 | Apr 2026 |
| 124 | Alibaba | Qwen3 Max | 1038 | -22 / +23 | Sep 2025 |
| 125 | MiniMax | MiniMax-M2 | 1032 | -26 / +25 | Oct 2025 |
| 126 | - | Perplexity Pro - Labs | 1032 | -41 / +39 | - |
| 127 | Z AI | GLM-4.6 (Reasoning) | 1030 | -29 / +29 | Sep 2025 |
| 128 | Google | Gemma 4 26B A4B (Reasoning) | 1014 | -23 / +23 | Apr 2026 |
| 129 | xAI | Grok 4 Fast (Reasoning) | 1014 | -22 / +24 | Sep 2025 |
| 130 | OpenAI | o4-mini (high) | 1006 | -25 / +23 | Apr 2025 |
| 131 | Google | Gemma 4 31B (Non-reasoning) | 1005 | -23 / +22 | Apr 2026 |
| 132 | OpenAI | GPT-5.4 mini (Non-Reasoning) | 1004 | -23 / +23 | Mar 2026 |
| 133 | NVIDIA | NVIDIA Nemotron 3 Super 120B A12B (Reasoning) | 1003 | -22 / +21 | Mar 2026 |
| 134 | DeepSeek | DeepSeek V3.1 Terminus (Reasoning) | 1003 | -26 / +28 | Sep 2025 |
| 135 | OpenAI | GPT-5 mini (medium) | 1002 | -27 / +26 | Aug 2025 |
| 136 | DeepSeek | DeepSeek V3.2 Exp (Reasoning) | 1001 | -24 / +24 | Sep 2025 |
| 137 | OpenAI | GPT-5.1 (Non-reasoning) | 1000 | -0 / +0 | Nov 2025 |
| 138 | OpenAI | GPT-5 (medium) | 997 | -25 / +26 | Aug 2025 |
| 139 | MiniMax | MiniMax M1 80k | 996 | -24 / +24 | Jun 2025 |
| 140 | Kimi | Kimi K2 Thinking | 994 | -23 / +24 | Nov 2025 |
| 141 | xAI | Grok 4 | 990 | -23 / +23 | Jul 2025 |
| 142 | ByteDance Seed | Doubao Seed Code | 986 | -26 / +26 | Nov 2025 |
| 143 | Z AI | GLM-4.6 (Non-reasoning) | 984 | -26 / +26 | Sep 2025 |
| 144 | DeepSeek | DeepSeek V3.1 Terminus (Non-reasoning) | 976 | -22 / +25 | Sep 2025 |
| 145 | Amazon | Nova 2.0 Pro Preview (medium) | 973 | -27 / +23 | Nov 2025 |
| 146 | - | Google AI Pro - Thinking with 3 Pro | 972 | -43 / +43 | - |
| 147 | Inception | Mercury 2 | 958 | -23 / +21 | Feb 2026 |
| 148 | Google | Gemma 4 26B A4B (Non-reasoning) | 948 | -25 / +23 | Apr 2026 |
| 149 | Alibaba | Qwen3 Max Thinking (Preview) | 947 | -25 / +25 | Nov 2025 |
| 150 | OpenAI | gpt-oss-120b (high) | 947 | -28 / +27 | Aug 2025 |
| 151 | OpenAI | GPT-5.4 nano (Non-Reasoning) | 941 | -30 / +32 | Mar 2026 |
| 152 | Google | Gemini 3.1 Flash-Lite | 926 | -23 / +22 | Mar 2026 |
| 153 | Cohere | Command A+ | 919 | -25 / +24 | May 2026 |
| 154 | Google | Gemini 2.5 Pro | 917 | -24 / +25 | Jun 2025 |
| 155 | Alibaba | Qwen3 Coder Next | 914 | -25 / +26 | Feb 2026 |
| 156 | xAI | Grok 4.20 0309 (Non-reasoning) | 909 | -24 / +25 | Mar 2026 |
| 157 | Alibaba | Qwen3.5 35B A3B (Reasoning) | 907 | -23 / +23 | Feb 2026 |
| 158 | Alibaba | Qwen3.5 Omni Flash | 897 | -24 / +23 | Mar 2026 |
| 159 | - | SuperGrok - Grok 4 | 882 | -46 / +40 | - |
| 160 | DeepSeek | DeepSeek V3.2 (Non-reasoning) | 877 | -29 / +26 | Dec 2025 |
| 161 | Arcee AI | Trinity Large Thinking | 865 | -23 / +23 | Apr 2026 |
| 162 | Kimi | Kimi K2 0905 | 864 | -29 / +28 | Sep 2025 |
| 163 | Mistral | Mistral Large 3 | 863 | -25 / +23 | Dec 2025 |
| 164 | Mistral | Mistral Small 4 (Reasoning) | 861 | -23 / +23 | Mar 2026 |
| 165 | Mistral | Devstral 2 | 856 | -25 / +25 | Dec 2025 |
| 166 | Google | Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) | 853 | -28 / +26 | Sep 2025 |
| 167 | Amazon | Nova 2.0 Lite (high) | 853 | -23 / +23 | Oct 2025 |
| 168 | Mistral | Mistral Small 4 (Non-reasoning) | 845 | -23 / +23 | Mar 2026 |
| 169 | Alibaba | Qwen3.5 9B (Non-reasoning) | 844 | -22 / +22 | Mar 2026 |
| 170 | LongCat | LongCat Flash Lite | 838 | -27 / +25 | Jan 2026 |
| 171 | Z AI | GLM-4.7-Flash (Reasoning) | 837 | -27 / +26 | Jan 2026 |
| 172 | China Mobile | JT-MINI | 831 | -24 / +24 | Apr 2026 |
| 173 | Mistral | Devstral Small (May '25) | 829 | -27 / +26 | May 2025 |
| 174 | OpenAI | gpt-oss-120b (low) | 829 | -24 / +23 | Aug 2025 |
| 175 | LG AI Research | K-EXAONE (Reasoning) | 826 | -26 / +26 | Dec 2025 |
| 176 | Mistral | Devstral Small 2 | 820 | -24 / +26 | Dec 2025 |
| 177 | Alibaba | Qwen3 235B A22B 2507 (Reasoning) | 820 | -24 / +22 | Jul 2025 |
| 178 | KwaiKAT | KAT-Coder-Pro V1 | 818 | -26 / +24 | Nov 2025 |
| 179 | Alibaba | Qwen3 Max (Preview) | 817 | -24 / +22 | Sep 2025 |
| 180 | LG AI Research | EXAONE 4.5 33B | 814 | -25 / +26 | Apr 2026 |
| 181 | Z AI | GLM-4.7-Flash (Non-reasoning) | 802 | -41 / +35 | Jan 2026 |
| 182 | Baidu | ERNIE 5.0 Thinking Preview | 789 | -28 / +24 | Nov 2025 |
| 183 | xAI | Grok 4.1 Fast (Non-reasoning) | 784 | -28 / +26 | Nov 2025 |
| 184 | Amazon | Nova 2.0 Omni (medium) | 784 | -26 / +25 | Nov 2025 |
| 185 | InclusionAI | Ling 2.6 Flash | 782 | -23 / +22 | Apr 2026 |
| 186 | Mistral | Mistral Medium 3.1 | 781 | -25 / +27 | Aug 2025 |
| 187 | xAI | Grok 4 Fast (Non-reasoning) | 779 | -25 / +26 | Sep 2025 |
| 188 | Alibaba | Qwen3 235B A22B 2507 Instruct | 778 | -27 / +27 | Jul 2025 |
| 189 | OpenAI | GPT-4.1 | 777 | -27 / +27 | Apr 2025 |
| 190 | Alibaba | Qwen3 VL 4B (Reasoning) | 776 | -39 / +40 | Oct 2025 |
| 191 | NVIDIA | Nemotron 3 Nano Omni 30B A3B Reasoning | 764 | -26 / +26 | Apr 2026 |
| 192 | xAI | Grok Code Fast 1 | 763 | -26 / +26 | Aug 2025 |
| 193 | LG AI Research | K-EXAONE (Non-reasoning) | 763 | -26 / +23 | Dec 2025 |
| 194 | ByteDance Seed | Seed-OSS-36B-Instruct | 759 | -24 / +25 | Aug 2025 |
| 195 | Alibaba | Qwen3 235B A22B (Reasoning) | 756 | -29 / +27 | Apr 2025 |
| 196 | NVIDIA | Nemotron Cascade 2 30B A3B | 756 | -24 / +25 | Mar 2026 |
| 197 | OpenAI | GPT-5 nano (high) | 755 | -25 / +24 | Aug 2025 |
| 198 | OpenAI | o3 | 754 | -29 / +30 | Apr 2025 |
| 199 | Prime Intellect | INTELLECT-3 | 750 | -26 / +24 | Nov 2025 |
| 200 | OpenAI | o3-mini (high) | 747 | -27 / +28 | Jan 2025 |
| 201 | Google | Gemini 2.5 Flash (Non-reasoning) | 741 | -28 / +26 | May 2025 |
| 202 | Alibaba | Qwen3 235B A22B (Non-reasoning) | 739 | -27 / +26 | Apr 2025 |
| 203 | Sarvam | Sarvam 105B (high) | 738 | -23 / +23 | Mar 2026 |
| 204 | OpenAI | o1 | 736 | -28 / +27 | Dec 2024 |
| 205 | Alibaba | Qwen3 Next 80B A3B (Reasoning) | 726 | -24 / +26 | Sep 2025 |
| 206 | Alibaba | Qwen3.5 9B (Reasoning) | 715 | -23 / +22 | Mar 2026 |
| 207 | Alibaba | Qwen3 VL 235B A22B (Reasoning) | 713 | -24 / +25 | Sep 2025 |
| 208 | Alibaba | Qwen3 Coder 30B A3B Instruct | 710 | -24 / +25 | Jul 2025 |
| 209 | Anthropic | Claude 3.5 Haiku | 708 | -26 / +24 | Oct 2024 |
| 210 | Google | Gemini 2.5 Flash (Reasoning) | 699 | -31 / +28 | May 2025 |
| 211 | Z AI | GLM-4.6V (Non-reasoning) | 692 | -28 / +27 | Dec 2025 |
| 212 | Mistral | Devstral Medium | 690 | -27 / +26 | Jul 2025 |
| 213 | InclusionAI | Ring-1T | 687 | -27 / +28 | Oct 2025 |
| 214 | Alibaba | Qwen3 VL 8B Instruct | 683 | -37 / +38 | Oct 2025 |
| 215 | DeepSeek | DeepSeek R1 0528 (May '25) | 681 | -28 / +26 | May 2025 |
| 216 | Naver | HyperCLOVA X SEED Think (32B) | 681 | -26 / +27 | Dec 2025 |
| 217 | Upstage | Solar Pro 3 | 675 | -23 / +23 | Apr 2026 |
| 218 | Mistral | Magistral Small 1.2 | 670 | -26 / +24 | Sep 2025 |
| 219 | Alibaba | Qwen3.5 4B (Non-reasoning) | 669 | -24 / +22 | Mar 2026 |
| 220 | Alibaba | Qwen3 VL 8B (Reasoning) | 669 | -29 / +30 | Oct 2025 |
| 221 | Alibaba | Qwen3 VL 30B A3B (Reasoning) | 667 | -38 / +37 | Oct 2025 |
| 222 | xAI | Grok 3 | 667 | -26 / +28 | Feb 2025 |
| 223 | Mistral | Magistral Medium 1 | 666 | -27 / +26 | Jun 2025 |
| 224 | Upstage | Solar Open 100B (Reasoning) | 665 | -28 / +28 | Dec 2025 |
| 225 | Alibaba | Qwen3 30B A3B 2507 (Reasoning) | 662 | -25 / +25 | Jul 2025 |
| 226 | Amazon | Nova 2.0 Pro Preview (low) | 660 | -28 / +25 | Nov 2025 |
| 227 | Mistral | Ministral 3 14B | 656 | -25 / +26 | Dec 2025 |
| 228 | OpenAI | gpt-oss-20B (high) | 651 | -26 / +24 | Aug 2025 |
| 229 | Alibaba | Qwen3 VL 32B (Reasoning) | 647 | -26 / +26 | Oct 2025 |
| 230 | Amazon | Nova 2.0 Lite (medium) | 644 | -26 / +28 | Oct 2025 |
| 231 | Korea Telecom | Mi:dm K 2.5 Pro | 642 | -28 / +26 | Dec 2025 |
| 232 | Mistral | Ministral 3 8B | 639 | -27 / +27 | Dec 2025 |
| 233 | Alibaba | Qwen3 VL 235B A22B Instruct | 636 | -38 / +37 | Sep 2025 |
| 234 | Mistral | Magistral Medium 1.2 | 628 | -30 / +28 | Sep 2025 |
| 235 | Alibaba | Qwen3 Next 80B A3B Instruct | 627 | -28 / +29 | Sep 2025 |
| 236 | OpenAI | GPT-4.1 mini | 621 | -27 / +25 | Apr 2025 |
| 237 | DeepSeek | DeepSeek V3.1 (Reasoning) | 613 | -28 / +28 | Aug 2025 |
| 238 | Z AI | GLM-4.6V (Reasoning) | 609 | -29 / +28 | Dec 2025 |
| 239 | MBZUAI Institute of Foundation Models | K2 Think V2 | 608 | -27 / +24 | Dec 2025 |
| 240 | OpenAI | GPT-5 nano (medium) | 594 | -26 / +28 | Aug 2025 |
| 241 | Alibaba | Qwen3 4B 2507 (Reasoning) | 590 | -27 / +27 | Aug 2025 |
| 242 | Mistral | Mistral Medium 3 | 586 | -27 / +28 | May 2025 |
| 243 | Nous Research | Hermes 4 - Llama-3.1 405B (Reasoning) | 586 | -24 / +24 | Aug 2025 |
| 244 | MBZUAI Institute of Foundation Models | K2-V2 (medium) | 582 | -27 / +26 | Dec 2025 |
| 245 | ServiceNow | Apriel-v1.6-15B-Thinker | 574 | -27 / +27 | Nov 2025 |
| 246 | Google | Gemini 2.0 Flash (Feb '25) | 568 | -26 / +26 | Feb 2025 |
| 247 | Mistral | Devstral Small (Jul '25) | 565 | -28 / +27 | Jul 2025 |
| 248 | NVIDIA | NVIDIA Nemotron 3 Nano 30B A3B (Reasoning) | 565 | -26 / +25 | Dec 2025 |
| 249 | MBZUAI Institute of Foundation Models | K2-V2 (high) | 561 | -27 / +26 | Dec 2025 |
| 250 | Z AI | GLM-4.5-Air | 560 | -29 / +29 | Jul 2025 |
| 251 | OpenAI | gpt-oss-20B (low) | 550 | -29 / +25 | Aug 2025 |
| 252 | IBM | Granite 4.1 8B | 543 | -25 / +24 | Apr 2026 |
| 253 | Nous Research | Hermes 4 - Llama-3.1 70B (Reasoning) | 538 | -24 / +22 | Aug 2025 |
| 254 | Kimi | Kimi K2 | 527 | -31 / +34 | Jul 2025 |
| 255 | Nous Research | Hermes 4 - Llama-3.1 70B (Non-reasoning) | 523 | -26 / +25 | Aug 2025 |
| 256 | Alibaba | Qwen3 30B A3B 2507 Instruct | 517 | -28 / +28 | Jul 2025 |
| 257 | Z AI | GLM-4.5V (Reasoning) | 511 | -23 / +22 | Aug 2025 |
| 258 | Alibaba | Qwen3.5 4B (Reasoning) | 510 | -28 / +28 | Mar 2026 |
| 259 | Nous Research | Hermes 4 - Llama-3.1 405B (Non-reasoning) | 510 | -24 / +23 | Aug 2025 |
| 260 | Amazon | Nova 2.0 Lite (low) | 507 | -30 / +25 | Oct 2025 |
| 261 | Alibaba | Qwen3 Coder 480B A35B Instruct | 507 | -31 / +28 | Jul 2025 |
| 262 | Amazon | Nova Premier | 506 | -28 / +30 | Apr 2025 |
| 263 | Alibaba | Qwen3 30B A3B (Reasoning) | 505 | -27 / +27 | Apr 2025 |
| 264 | LG AI Research | EXAONE 4.0 32B (Reasoning) | 502 | -27 / +26 | Jul 2025 |
| 265 | Allen Institute for AI | Molmo2-8B | 500 | -0 / +0 | Dec 2025 |
| 266 | DeepSeek | DeepSeek V3.2 Speciale | 500 | -0 / +0 | Dec 2025 |
| 267 | Alibaba | Qwen3 Omni 30B A3B (Reasoning) | 497 | -26 / +24 | Sep 2025 |
| 268 | Alibaba | Qwen3 8B (Reasoning) | 497 | -26 / +27 | Apr 2025 |
| 269 | IBM | Granite 4.1 30B | 496 | -25 / +23 | Apr 2026 |
| 270 | Alibaba | Qwen3 VL 30B A3B Instruct | 496 | -29 / +25 | Oct 2025 |
| 271 | Alibaba | Qwen3 32B (Reasoning) | 491 | -27 / +26 | Apr 2025 |
| 272 | Motif Technologies | Motif-2-12.7B-Reasoning | 484 | -31 / +29 | Dec 2025 |
| 273 | Mistral | Ministral 3 3B | 484 | -30 / +28 | Dec 2025 |
| 274 | NVIDIA | NVIDIA Nemotron 3 Nano 4B | 479 | -29 / +26 | Mar 2026 |
| 275 | Alibaba | Qwen3 14B (Reasoning) | 478 | -27 / +26 | Apr 2025 |
| 276 | Alibaba | Qwen3 14B (Non-reasoning) | 473 | -27 / +25 | Apr 2025 |
| 277 | Alibaba | Qwen3 8B (Non-reasoning) | 472 | -26 / +24 | Apr 2025 |
| 278 | OpenAI | GPT-5 mini (minimal) | 472 | -29 / +28 | Aug 2025 |
| 279 | Z AI | GLM-4.5 (Reasoning) | 468 | -34 / +30 | Jul 2025 |
| 280 | Z AI | GLM-4.5V (Non-reasoning) | 460 | -28 / +27 | Aug 2025 |
| 281 | Upstage | Solar Pro 2 (Reasoning) | 451 | -26 / +26 | Jul 2025 |
| 282 | Upstage | Solar Pro 2 (Non-reasoning) | 446 | -28 / +27 | Jul 2025 |
| 283 | NVIDIA | NVIDIA Nemotron Nano 9B V2 (Reasoning) | 439 | -28 / +27 | Aug 2025 |
| 284 | Google | Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning) | 437 | -32 / +30 | Sep 2025 |
| 285 | Meta | Llama 4 Maverick | 437 | -28 / +26 | Apr 2025 |
| 286 | InclusionAI | Ling-flash-2.0 | 420 | -30 / +27 | Sep 2025 |
| 287 | xAI | Grok 3 mini Reasoning (high) | 420 | -38 / +37 | Feb 2025 |
| 288 | DeepSeek | DeepSeek V3 (Dec '24) | 410 | -30 / +27 | Dec 2024 |
| 289 | DeepSeek | DeepSeek V3 0324 | 406 | -32 / +29 | Mar 2025 |
| 290 | Meta | Llama 3.3 Instruct 70B | 401 | -29 / +28 | Dec 2024 |
| 291 | InclusionAI | Ling-1T | 401 | -28 / +27 | Oct 2025 |
| 292 | Amazon | Nova Pro | 388 | -28 / +28 | Dec 2024 |
| 293 | OpenAI | GPT-5 (minimal) | 386 | -31 / +27 | Aug 2025 |
| 294 | Google | Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning) | 382 | -29 / +29 | Sep 2025 |
| 295 | Amazon | Nova 2.0 Lite (Non-reasoning) | 381 | -29 / +29 | Oct 2025 |
| 296 | NVIDIA | Llama Nemotron Super 49B v1.5 (Non-reasoning) | 380 | -27 / +26 | Jul 2025 |
| 297 | Anthropic | Claude 3 Haiku | 379 | -27 / +25 | Mar 2024 |
| 298 | OpenAI | GPT-4o (Aug '24) | 378 | -29 / +27 | Aug 2024 |
| 299 | Trillion Labs | Tri-21B-Think | 373 | -26 / +23 | Feb 2026 |
| 300 | TII UAE | Falcon-H1R-7B | 373 | -32 / +29 | Jan 2026 |
| 301 | NVIDIA | Llama Nemotron Super 49B v1.5 (Reasoning) | 368 | -31 / +28 | Jul 2025 |
| 302 | MBZUAI Institute of Foundation Models | K2-V2 (low) | 366 | -29 / +28 | Dec 2025 |
| 303 | IBM | Granite 4.1 3B | 366 | -24 / +24 | Apr 2026 |
| 304 | Amazon | Nova 2.0 Omni (low) | 360 | -32 / +30 | Nov 2025 |
| 305 | Sarvam | Sarvam 30B (high) | 359 | -26 / +23 | Mar 2026 |
| 306 | Allen Institute for AI | Olmo 3.1 32B Instruct | 357 | -31 / +29 | Jan 2026 |
| 307 | Nanbeige | Nanbeige4.1-3B | 357 | -30 / +30 | Feb 2026 |
| 308 | OpenAI | GPT-4o (Nov '24) | 349 | -25 / +23 | Nov 2024 |
| 309 | NVIDIA | NVIDIA Nemotron 3 Nano 30B A3B (Non-reasoning) | 348 | -29 / +28 | Dec 2025 |
| 310 | IBM | Granite 4.0 H Small | 344 | -29 / +28 | Sep 2025 |
| 311 | Alibaba | Qwen3 VL 4B Instruct | 344 | -27 / +26 | Oct 2025 |
| 312 | Amazon | Nova Lite | 344 | -30 / +30 | Dec 2024 |
| 313 | Amazon | Nova Micro | 340 | -29 / +28 | Dec 2024 |
| 314 | Mistral | Mistral Small 3.1 | 337 | -30 / +28 | Mar 2025 |
| 315 | Trillion Labs | Tri-21B-think Preview | 337 | -33 / +30 | Feb 2026 |
| 316 | NVIDIA | Llama 3.1 Nemotron Instruct 70B | 337 | -30 / +26 | Oct 2024 |
| 317 | Alibaba | Qwen3 30B A3B (Non-reasoning) | 331 | -29 / +27 | Apr 2025 |
| 318 | LG AI Research | EXAONE 4.0 32B (Non-reasoning) | 329 | -33 / +28 | Jul 2025 |
| 319 | NVIDIA | NVIDIA Nemotron Nano 12B v2 VL (Reasoning) | 328 | -28 / +29 | Oct 2025 |
| 320 | Mistral | Mistral Large 2 (Nov '24) | 324 | -33 / +30 | Nov 2024 |
| 321 | Alibaba | Qwen3.5 2B (Reasoning) | 321 | -24 / +22 | Mar 2026 |
| 322 | Google | Gemini 2.5 Flash-Lite (Reasoning) | 320 | -29 / +30 | Jun 2025 |
| 323 | Amazon | Nova 2.0 Pro Preview (Non-reasoning) | 319 | -30 / +26 | Nov 2025 |
| 324 | OpenAI | GPT-4.1 nano | 318 | -29 / +27 | Apr 2025 |
| 325 | Alibaba | Qwen3 0.6B (Reasoning) | 315 | -29 / +29 | Apr 2025 |
| 326 | Alibaba | Qwen3 4B 2507 Instruct | 306 | -32 / +29 | Aug 2025 |
| 327 | Amazon | Nova 2.0 Omni (Non-reasoning) | 305 | -31 / +27 | Nov 2025 |
| 328 | Google | Gemini 2.5 Flash-Lite (Non-reasoning) | 305 | -31 / +29 | Jun 2025 |
| 329 | Mistral | Mistral Small 3.2 | 304 | -30 / +31 | Jun 2025 |
| 330 | NVIDIA | NVIDIA Nemotron Nano 9B V2 (Non-reasoning) | 303 | -31 / +30 | Aug 2025 |
| 331 | Google | Gemma 4 E4B (Reasoning) | 303 | -25 / +25 | Apr 2026 |
| 332 | Alibaba | Qwen3 VL 32B Instruct | 301 | -31 / +30 | Oct 2025 |
| 333 | Alibaba | Qwen3 Omni 30B A3B Instruct | 296 | -32 / +33 | Sep 2025 |
| 334 | LG AI Research | Exaone 4.0 1.2B (Non-reasoning) | 295 | -27 / +29 | Jul 2025 |
| 335 | LG AI Research | Exaone 4.0 1.2B (Reasoning) | 294 | -28 / +28 | Jul 2025 |
| 336 | Google | Gemma 4 E4B (Non-reasoning) | 293 | -28 / +27 | Apr 2026 |
| 337 | IBM | Granite 4.0 H 350M | 292 | -28 / +30 | Oct 2025 |
| 338 | Google | Gemma 3 27B Instruct | 285 | -30 / +29 | Mar 2025 |
| 339 | NVIDIA | NVIDIA Nemotron Nano 12B v2 VL (Non-reasoning) | 285 | -28 / +28 | Oct 2025 |
| 340 | AI21 Labs | Jamba 1.7 Large | 284 | -34 / +28 | Jul 2025 |
| 341 | Meta | Llama 3.1 Instruct 70B | 284 | -30 / +27 | Jul 2024 |
| 342 | OpenAI | GPT-5 nano (minimal) | 282 | -30 / +28 | Aug 2025 |
| 343 | Google | Gemma 3 12B Instruct | 281 | -30 / +29 | Mar 2025 |
| 344 | Alibaba | Qwen3 0.6B (Non-reasoning) | 279 | -35 / +31 | Apr 2025 |
| 345 | Meta | Llama 3.1 Instruct 8B | 278 | -31 / +31 | Jul 2024 |
| 346 | IBM | Granite 4.0 Micro | 278 | -32 / +27 | Sep 2025 |
| 347 | Allen Institute for AI | Olmo 3 7B Instruct | 277 | -30 / +27 | Nov 2025 |
| 348 | Alibaba | Qwen3.5 0.8B (Reasoning) | 277 | -24 / +24 | Mar 2026 |
| 349 | AI21 Labs | Jamba 1.7 Mini | 276 | -29 / +30 | Jul 2025 |
| 350 | Alibaba | Qwen3 1.7B (Reasoning) | 275 | -29 / +28 | Apr 2025 |
| 351 | Cohere | Command A | 275 | -31 / +29 | Mar 2025 |
| 352 | Meta | Llama 4 Scout | 272 | -30 / +29 | Apr 2025 |
| 353 | Liquid AI | LFM2 1.2B | 271 | -31 / +29 | Jul 2025 |
| 354 | Google | Gemma 4 E2B (Reasoning) | 270 | -26 / +23 | Apr 2026 |
| 355 | IBM | Granite 4.0 H 1B | 268 | -29 / +29 | Oct 2025 |
| 356 | IBM | Granite 4.0 350M | 268 | -31 / +30 | Oct 2025 |
| 357 | Liquid AI | LFM2.5-1.2B-Instruct | 266 | -33 / +31 | Jan 2026 |
| 358 | InclusionAI | Ling-mini-2.0 | 262 | -25 / +23 | Sep 2025 |
| 359 | Liquid AI | LFM2 8B A1B | 259 | -32 / +30 | Oct 2025 |
| 360 | StepFun | Step3 VL 10B | 259 | -30 / +28 | Jan 2026 |
| 361 | IBM | Granite 4.0 1B | 259 | -32 / +32 | Oct 2025 |
| 362 | OpenBMB | MiniCPM-V 4.6 1.3B | 259 | -33 / +29 | May 2026 |
| 363 | Google | Gemma 3 4B Instruct | 256 | -31 / +29 | Mar 2025 |
| 364 | Meta | Llama 3.1 Instruct 405B | 256 | -33 / +30 | Jul 2024 |
| 365 | AI21 Labs | Jamba Reasoning 3B | 255 | -29 / +30 | Oct 2025 |
| 366 | Alibaba | Qwen3 1.7B (Non-reasoning) | 254 | -30 / +27 | Apr 2025 |
| 367 | Google | Gemma 4 E2B (Non-reasoning) | 253 | -26 / +26 | Apr 2026 |
| 368 | Liquid AI | LFM2.5-1.2B-Thinking | 253 | -31 / +28 | Jan 2026 |
| 369 | DeepSeek | DeepSeek R1 (Jan '25) | 249 | -29 / +30 | Jan 2025 |
| 370 | Google | Gemma 3n E4B Instruct | 244 | -31 / +31 | Jun 2025 |
| 371 | NVIDIA | Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) | 239 | -24 / +24 | Apr 2025 |
| 372 | Alibaba | Qwen3.5 2B (Non-reasoning) | 239 | -26 / +21 | Mar 2026 |
| 373 | Liquid AI | LFM2 24B A2B | 236 | -26 / +24 | Feb 2026 |
| 374 | Liquid AI | LFM2 2.6B | 236 | -29 / +28 | Sep 2025 |
| 375 | Alibaba | Qwen3.5 0.8B (Non-reasoning) | 234 | -25 / +26 | Mar 2026 |
| 376 | Liquid AI | LFM2.5-VL-1.6B | 233 | -36 / +30 | Jan 2026 |
| 377 | OpenBMB | MiniCPM5-1B (Non-reasoning) | 232 | -28 / +28 | May 2026 |
| 378 | Microsoft | Phi-4 Mini Instruct | 229 | -28 / +29 | Feb 2024 |
| 379 | IBM | Granite 3.3 8B (Non-reasoning) | 225 | -30 / +26 | Apr 2025 |
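
Each leaderboard entry lists an Elo score followed by a lower and an upper offset, for example 1890 with -34 / +35. The page does not say what these offsets represent; assuming they bound an uncertainty interval around the score, a rough way to check whether two entries are distinguishable is to test whether their intervals overlap. The helper names below are hypothetical, and the three sample entries are copied from the top of the leaderboard.

```python
from typing import Tuple

Interval = Tuple[int, int]


def elo_interval(score: int, minus: int, plus: int) -> Interval:
    """Return (low, high) for a score with its listed -/+ offsets."""
    return (score - minus, score + plus)


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Entries taken from the leaderboard above.
opus_48 = elo_interval(1890, 34, 35)      # Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
gpt_55_xhigh = elo_interval(1769, 32, 31)  # GPT-5.5 (xhigh)
gpt_55_high = elo_interval(1753, 28, 31)   # GPT-5.5 (high)

print(intervals_overlap(opus_48, gpt_55_xhigh))      # first vs. second place
print(intervals_overlap(gpt_55_xhigh, gpt_55_high))  # second vs. third place
```

Under that assumption, the first-place interval sits clear of the second-place one, while second and third place overlap, so their ordering is less certain than the rank column alone suggests.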

Frequently Asked Questions

GDPval-AA is Artificial Analysis' evaluation based on OpenAI's GDPval dataset, which tests AI models on real-world economically valuable tasks across 44 occupations and 9 major industries.

GDPval-AA compares model submissions head-to-head on the same task. For each matchup, the two outputs are anonymized and an LLM judge picks a winner. These blind pairwise results are aggregated into an Elo rating per model.
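
As a rough illustration of how pairwise results can be turned into ratings, the sketch below applies classic sequential Elo updates to a list of (winner, loser) pairs. The page does not state which aggregation method Artificial Analysis uses, so the update rule, the K-factor of 16, the starting rating of 1000, and the model names are all assumptions made for this example.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rate(
    matchups: Iterable[Tuple[str, str]],
    start: float = 1000.0,  # assumed starting rating
    k: float = 16.0,        # assumed K-factor
) -> Dict[str, float]:
    """Aggregate (winner, loser) pairs into per-model Elo ratings."""
    ratings: Dict[str, float] = {}
    for winner, loser in matchups:
        ratings.setdefault(winner, start)
        ratings.setdefault(loser, start)
        # The winner gains what the loser gives up, scaled by how surprising the win was.
        delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings


# Hypothetical judged matchups between three placeholder models.
results = (
    [("model_a", "model_b")] * 3
    + [("model_b", "model_c")] * 3
    + [("model_a", "model_c")] * 3
)
ratings = rate(results)
for name, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{name}: {rating:.0f}")
```

Sequential updates depend on the order of the matchups; a batch fit such as a Bradley-Terry model avoids that, at the cost of a slightly longer implementation.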

Claude Opus 4.8 (Adaptive Reasoning, Max Effort) has the highest GDPval-AA Elo rating, 1890, among models with published GDPval-AA results.
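
To give a feel for what a rating gap means, the sketch below converts the difference between the top two published ratings (1890 and 1769) into an expected head-to-head win rate. It assumes the conventional Elo scale, where a 400-point gap corresponds to 10:1 odds; the page does not confirm that GDPval-AA ratings use this scale, and the function name is hypothetical.

```python
def win_probability(rating_a: float, rating_b: float) -> float:
    """Expected share of matchups A wins over B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings copied from the leaderboard: first place vs. second place.
p = win_probability(1890, 1769)
print(f"Expected win rate for the higher-rated model: {p:.2f}")
```

Under that assumption, the 121-point gap corresponds to the higher-rated model being preferred in roughly two out of three blind comparisons.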

GDPval-AA covers real-world professional tasks across a range of occupations and industries, producing outputs such as documents, spreadsheets, slides, and diagrams. Generating these deliverables generally requires interacting with a sandbox filesystem through shell access and using web search, capabilities the model is given through the Stirrup agentic harness.

Most benchmarks test short-answer or multiple-choice responses. GDPval-AA instead evaluates complete deliverables: models operate in an agentic environment with tools, produce file outputs, and have their submissions scored through pairwise grading on relative quality.
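
The pairwise grading step can be kept blind by hiding model names from the judge and randomizing which submission is shown first. The sketch below is a minimal illustration under those assumptions: `blind_matchup` and `longer_output_judge` are hypothetical names, and the stub judge merely prefers the longer text, standing in for the LLM judge, whose prompt and interface are not described on this page.

```python
import random
from typing import Callable, Dict

# A judge sees the task and two anonymous outputs, and answers "A" or "B".
Judge = Callable[[str, str, str], str]


def blind_matchup(
    task: str,
    submissions: Dict[str, str],  # model name -> submitted output
    judge: Judge,
    rng: random.Random,
) -> str:
    """Return the winning model's name without revealing names to the judge."""
    if len(submissions) != 2:
        raise ValueError("a matchup needs exactly two submissions")
    models = list(submissions)
    rng.shuffle(models)  # random presentation order guards against position bias
    first, second = models
    verdict = judge(task, submissions[first], submissions[second])
    if verdict == "A":
        return first
    if verdict == "B":
        return second
    raise ValueError(f"unexpected verdict: {verdict!r}")


def longer_output_judge(task: str, output_a: str, output_b: str) -> str:
    """Stub judge for illustration only: prefers the longer output."""
    return "A" if len(output_a) >= len(output_b) else "B"


winner = blind_matchup(
    "Draft a quarterly budget summary",  # hypothetical task text
    {"model_x": "short note", "model_y": "a much longer, more complete deliverable"},
    longer_output_judge,
    random.Random(0),
)
print(winner)
```

Because the winner is mapped back to a model name only after the verdict, the result is the same whichever submission happens to be shown first.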

Explore Evaluations

Artificial Analysis Intelligence Index

A composite benchmark aggregating ten challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.

APEX-Agents-AA Benchmark LeaderboardAPEX-Agents-AA Benchmark Leaderboard

Artificial Analysis' implementation of the APEX-Agents benchmark, testing AI agents on long-horizon, cross-application tasks in professional-services environments with realistic application tooling.

𝜏²-Bench Telecom Benchmark Leaderboard

A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.

Terminal-Bench Hard Benchmark Leaderboard

An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks.

SciCode Benchmark Leaderboard

A scientist-curated coding benchmark featuring 288 test set subproblems from 80 laboratory problems across 16 scientific disciplines.

Artificial Analysis Long Context Reasoning Benchmark Leaderboard

A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).

AA-Omniscience: Knowledge and Hallucination Benchmark

A benchmark measuring factual recall and hallucination across various economically relevant domains.

IFBench Benchmark Leaderboard

A benchmark evaluating precise instruction-following generalization on 58 diverse, verifiable out-of-domain constraints that test models' ability to follow specific output requirements.

Humanity's Last Exam Benchmark Leaderboard

A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities, designed to be the final closed-ended academic evaluation.

GPQA Diamond Benchmark Leaderboard

The most challenging 198 questions from GPQA, where PhD experts achieve 65% accuracy but skilled non-experts only reach 34% despite web access.

CritPt Benchmark Leaderboard

A benchmark designed to test LLMs on research-level physics reasoning tasks, featuring 71 composite research challenges.

ITBench-AA Benchmark Leaderboard

Artificial Analysis' implementation of IBM's ITBench benchmark, testing AI agents on Kubernetes incident root-cause analysis from offline incident snapshots. The agent inspects alerts, events, traces, and topology and identifies the contributing-factor entities (deployments, pods, namespaces, network policies, etc.) responsible for the failure.

Artificial Analysis Openness Index

A composite measure providing an industry standard to communicate model openness for users and developers.

MMLU-Pro Benchmark Leaderboard

An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.

Global-MMLU-Lite Benchmark Leaderboard

A lightweight, multilingual version of MMLU, designed to evaluate knowledge and reasoning skills across a diverse range of languages and cultural contexts.

LiveCodeBench Benchmark Leaderboard

A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces, evaluating code generation, self-repair, and execution.

MATH-500 Benchmark Leaderboard

A 500-problem subset from the MATH dataset, featuring competition-level mathematics across six domains including algebra, geometry, and number theory.

AIME 2025 Benchmark Leaderboard

All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000-999.

MMMU-Pro Benchmark Leaderboard

An enhanced MMMU benchmark that eliminates shortcuts and guessing strategies to more rigorously test multimodal models across 30 academic disciplines.