
GDPval-AA Leaderboard

GDPval-AA is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with Elo ratings derived from blind pairwise comparisons.

The GDPval gold public dataset includes 220 tasks developed by OpenAI in collaboration with industry professionals to reflect real-world complexity.
The benchmark requires models to produce diverse outputs including documents, slides, diagrams, and spreadsheets, mirroring actual work products across finance, healthcare, legal, and other professional domains.

All evaluations are conducted independently by Artificial Analysis. More information can be found on our Intelligence Benchmarking Methodology page.

Publication

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, Jerry Tworek.

We introduce GDPval, a benchmark designed to evaluate AI models on real-world, economically valuable tasks across 44 occupations. The dataset encompasses 1,320 tasks derived from nine major industries contributing significantly to the U.S. GDP. These tasks were developed in collaboration with industry professionals averaging 14 years of experience, ensuring they accurately represent real-world complexities. The evaluation requires models to produce diverse outputs, including documents, slides, diagrams, and spreadsheets, mirroring actual work products. Initial results indicate that frontier AI models are approaching the quality of work produced by human experts, with models able to perform certain professional tasks approximately 100 times faster and at a fraction of the cost compared to human experts.

GDPval

GPT-5.5 (xhigh) scores the highest on GDPval with an Elo score of 1772, followed by Claude Opus 4.7 (Adaptive Reasoning, Max Effort) with 1753 and GPT-5.5 (high) with 1751.
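
As a rough reading aid, an Elo gap can be converted into an expected head-to-head preference rate with the standard Elo logistic formula. The sketch below assumes the conventional 400-point scale. This page does not state the exact rating model or scale behind GDPval-AA, so the function name, the scale, and the resulting figure are illustrative assumptions rather than the published method.

```python
def expected_win_rate(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Standard Elo expected score of A against B.

    The 400-point scale is an assumption; it is the usual Elo convention,
    not something this page confirms for GDPval-AA.
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / scale))


# Scores quoted above: GPT-5.5 (xhigh) at 1772 and
# Claude Opus 4.7 (Adaptive Reasoning, Max Effort) at 1753.
p = expected_win_rate(1772, 1753)
print(f"Expected preference rate for the higher-rated model: {p:.3f}")
```

Under these assumptions a 19-point gap corresponds to a preference rate only slightly above one half, which is why small Elo differences near the top of the table should be read with caution.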

GDPval-AA Elo

GDPval-AA Leaderboard

Elo scores for agentic performance on real-world work tasks using web and shell access via Stirrup, an open-source harness developed by Artificial Analysis
Legend: Stirrup Agent Harness, AI Chatbot

Chatbots

GDPval-AA: AI Chatbots

Elo scores for AI chatbots tested in the GDPval-AA evaluation

Score Comparisons

GDPval-AA: Elo vs. Artificial Analysis Intelligence Index

[Chart: GDPval-AA Elo plotted against Artificial Analysis Intelligence Index, with the most attractive quadrant marked. Legend (model creators): Alibaba, Amazon, Anthropic, DeepSeek, Google, Kimi, MBZUAI Institute of Foundation Models, Meta, MiniMax, Mistral, NVIDIA, OpenAI, Upstage, xAI, Xiaomi, Z AI.]

Artificial Analysis Intelligence Index v4.0 includes: GDPval-AA, 𝜏²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, CritPt. See Intelligence Index methodology for further details, including a breakdown of each evaluation and how we run them.

Token Usage

GDPval-AA: Token Usage

Tokens used to run the evaluation, split into input tokens, reasoning tokens, and answer tokens

The total number of tokens used to run the evaluation, including input tokens (prompt), reasoning tokens (for reasoning models), and answer tokens (final response).

Cost

GDPval-AA: Cost Breakdown

Cost (USD) to run the evaluation, split into input cost, reasoning cost, and answer cost

The cost to run the evaluation, calculated using the model's input and output token pricing and the number of tokens used.
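
The calculation described above can be sketched as follows. This is a minimal illustration, not the exact accounting used for the charts: it assumes prices are quoted per million tokens and that reasoning tokens are billed at the output-token price, and the token counts and prices in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    input_tokens: int
    reasoning_tokens: int
    answer_tokens: int


def cost_breakdown_usd(
    usage: TokenUsage,
    input_price_per_million: float,
    output_price_per_million: float,
) -> dict[str, float]:
    """Split evaluation cost into input, reasoning, and answer components.

    Assumption: reasoning and answer tokens are both charged at the
    output-token price, and prices are in USD per one million tokens.
    """
    per_million = 1_000_000
    input_cost = usage.input_tokens / per_million * input_price_per_million
    reasoning_cost = usage.reasoning_tokens / per_million * output_price_per_million
    answer_cost = usage.answer_tokens / per_million * output_price_per_million
    return {
        "input": input_cost,
        "reasoning": reasoning_cost,
        "answer": answer_cost,
        "total": input_cost + reasoning_cost + answer_cost,
    }


# Hypothetical usage and prices, chosen only to make the arithmetic easy to follow.
usage = TokenUsage(input_tokens=10_000_000, reasoning_tokens=2_000_000, answer_tokens=1_000_000)
breakdown = cost_breakdown_usd(usage, input_price_per_million=1.00, output_price_per_million=4.00)
print(breakdown)
```

In agentic runs the input component can dominate, since each turn resends the growing conversation as prompt tokens; the split makes that visible.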

Average Turns

GDPval-AA: Average Turns per Task

Average number of turns per task

Score vs. Release Date

GDPval-AA: Elo vs. Release Date

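[Chart: GDPval-AA Elo plotted against model release date, with the most attractive region marked. Legend (model creators): Alibaba, Amazon, Anthropic, DeepSeek, Google, Kimi, MBZUAI Institute of Foundation Models, Meta, MiniMax, Mistral, NVIDIA, OpenAI, Upstage, xAI, Xiaomi, Z AI.]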

GDPval-AA Leaderboard

| Rank | Creator | Model | Elo | - / + | Released |
|---|---|---|---|---|---|
| 1 | OpenAI | GPT-5.5 (xhigh) | 1772 | -30 / +31 | Apr 2026 |
| 2 | Anthropic | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | 1753 | -41 / +40 | Apr 2026 |
| 3 | OpenAI | GPT-5.5 (high) | 1751 | -30 / +33 | Apr 2026 |
| 4 | Anthropic | Claude Opus 4.7 (Non-reasoning, High Effort) | 1687 | -28 / +31 | Apr 2026 |
| 5 | Anthropic | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | 1675 | -26 / +28 | Feb 2026 |
| 6 | OpenAI | GPT-5.4 (xhigh) | 1674 | -34 / +32 | Mar 2026 |
| 7 | OpenAI | GPT-5.5 (medium) | 1652 | -26 / +28 | Apr 2026 |
| 8 | Anthropic | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) | 1619 | -31 / +33 | Feb 2026 |
| 9 | Anthropic | Claude Sonnet 4.6 (Non-reasoning, High Effort) | 1592 | -26 / +26 | Feb 2026 |
| 10 | Anthropic | Claude Opus 4.6 (Non-reasoning, High Effort) | 1590 | -26 / +28 | Feb 2026 |
| 11 | Xiaomi | MiMo-V2.5-Pro | 1572 | -27 / +30 | Apr 2026 |
| 12 | DeepSeek | DeepSeek V4 Pro (Reasoning, High Effort) | 1558 | -29 / +31 | Apr 2026 |
| 13 | DeepSeek | DeepSeek V4 Pro (Reasoning, Max Effort) | 1554 | -29 / +29 | Apr 2026 |
| 14 | Xiaomi | MiMo-V2.5 | 1553 | -26 / +27 | Apr 2026 |
| 15 | Z AI | GLM-5.1 (Reasoning) | 1535 | -0 / +0 | Apr 2026 |
| 16 | MiniMax | MiniMax-M2.7 | 1509 | -24 / +25 | Mar 2026 |
| 17 | Alibaba | Qwen3.6 Max Preview | 1504 | -21 / +22 | Apr 2026 |
| 18 | OpenAI | GPT-5.4 (low) | 1503 | -27 / +29 | Mar 2026 |
| 19 | Z AI | GLM-5.1 (Non-reasoning) | 1497 | -28 / +30 | Apr 2026 |
| 20 | xAI | Grok 4.3 (high) | 1496 | -25 / +23 | Apr 2026 |
| 21 | Z AI | GLM-5-Turbo | 1496 | -24 / +24 | Mar 2026 |
| 22 | Kimi | Kimi K2.6 | 1483 | -26 / +27 | Apr 2026 |
| 23 | OpenAI | GPT-5.3 Codex (xhigh) | 1479 | -25 / +26 | Feb 2026 |
| 24 | DeepSeek | DeepSeek V4 Pro (Non-reasoning) | 1477 | -26 / +27 | Apr 2026 |
| 25 | OpenAI | GPT-5.2 (xhigh) | 1467 | -25 / +27 | Dec 2025 |
| 26 | Anthropic | Claude Sonnet 4.6 (Non-reasoning, Low Effort) | 1455 | -25 / +25 | Feb 2026 |
| 27 | Anthropic | Claude Opus 4.5 (Reasoning) | 1447 | -26 / +26 | Nov 2025 |
| 28 | OpenAI | GPT-5.5 (low) | 1442 | -25 / +26 | Apr 2026 |
| 29 | OpenAI | GPT-5.4 mini (xhigh) | 1437 | -24 / +27 | Mar 2026 |
| 30 | Anthropic | Claude Opus 4.5 (Non-reasoning) | 1419 | -24 / +21 | Nov 2025 |
| 31 | Meta | Muse Spark | 1418 | -24 / +24 | Apr 2026 |
| 32 | DeepSeek | DeepSeek V4 Flash (Reasoning, High Effort) | 1414 | -25 / +26 | Apr 2026 |
| 33 | Xiaomi | MiMo-V2-Pro | 1410 | -25 / +24 | Mar 2026 |
| 34 | Alibaba | Qwen3.6 27B (Reasoning) | 1408 | -24 / +25 | Apr 2026 |
| 35 | OpenAI | GPT-5.2 (medium) | 1403 | -22 / +24 | Dec 2025 |
| 36 | Z AI | GLM-5 (Reasoning) | 1396 | -23 / +23 | Feb 2026 |
| 37 | DeepSeek | DeepSeek V4 Flash (Non-reasoning) | 1394 | -28 / +26 | Apr 2026 |
| 38 | Alibaba | Qwen3.6 27B (Non-reasoning) | 1390 | -26 / +23 | Apr 2026 |
| 39 | DeepSeek | DeepSeek V4 Flash (Reasoning, Max Effort) | 1388 | -20 / +34 | Apr 2026 |
| 40 | Alibaba | Qwen3.6 Plus | 1350 | -24 / +23 | Apr 2026 |
| 41 | Xiaomi | MiMo-V2-Omni-0327 | 1344 | -22 / +24 | Mar 2026 |
| 42 | OpenAI | GPT-5.4 (Non-reasoning) | 1342 | -26 / +26 | Mar 2026 |
| 43 | Z AI | GLM 5V Turbo (Reasoning) | 1330 | -23 / +25 | Apr 2026 |
| 44 | Google | Gemini 3 Deep Think | 1324 | -30 / +31 | Feb 2026 |
| 45 | Kimi | Kimi K2.6 (Non-reasoning) | 1323 | -28 / +28 | Apr 2026 |
| 46 | Z AI | GLM-5 (Non-reasoning) | 1323 | -22 / +24 | Feb 2026 |
| 47 | Anthropic | Claude 4.5 Sonnet (Reasoning) | 1321 | -24 / +26 | Sep 2025 |
| 48 | Xiaomi | MiMo-V2-Omni | 1320 | -25 / +23 | Mar 2026 |
| 49 | OpenAI | GPT-5.4 mini (medium) | 1319 | -23 / +23 | Mar 2026 |
| 50 | - | Claude Pro - 4.5 Opus (Extended Thinking) | 1319 | -41 / +38 | - |
| 51 | Google | Gemini 3.1 Pro Preview | 1314 | -26 / +27 | Feb 2026 |
| 52 | OpenAI | GPT-5.5 (Non-reasoning) | 1314 | -24 / +26 | Apr 2026 |
| 53 | Anthropic | Claude 4.5 Sonnet (Non-reasoning) | 1311 | -23 / +23 | Sep 2025 |
| 54 | xAI | Grok 4.3 (Non-reasoning) | 1306 | -25 / +26 | Apr 2026 |
| 55 | Xiaomi | MiMo-V2.5-Pro (Non-reasoning) | 1296 | -24 / +25 | Apr 2026 |
| 56 | Alibaba | Qwen3.6 35B A3B (Reasoning) | 1296 | -24 / +25 | Apr 2026 |
| 57 | OpenAI | GPT-5 (high) | 1294 | -21 / +22 | Aug 2025 |
| 58 | OpenAI | GPT-5.2 Codex (xhigh) | 1289 | -27 / +29 | Dec 2025 |
| 59 | Kimi | Kimi K2.5 (Reasoning) | 1287 | -24 / +24 | Jan 2026 |
| 60 | Kimi | Kimi K2.5 (Non-reasoning) | 1267 | -25 / +25 | Jan 2026 |
| 61 | Tencent | Hy3-preview (Reasoning) | 1237 | -25 / +23 | Apr 2026 |
| 62 | Tencent | Hy3-preview (Non-reasoning) | 1225 | -28 / +26 | Apr 2026 |
| 63 | OpenAI | GPT-5.1 (high) | 1225 | -22 / +24 | Nov 2025 |
| 64 | Alibaba | Qwen3.5 397B A17B (Non-reasoning) | 1223 | -23 / +23 | Feb 2026 |
| 65 | OpenAI | GPT-5.2 (Non-reasoning) | 1223 | -23 / +23 | Dec 2025 |
| 66 | Alibaba | Qwen3.6 35B A3B (Non-reasoning) | 1220 | -24 / +23 | Apr 2026 |
| 67 | OpenAI | GPT-5 Codex (high) | 1213 | -24 / +23 | Sep 2025 |
| 68 | Google | Gemini 3 Flash Preview (Reasoning) | 1204 | -24 / +24 | Dec 2025 |
| 69 | OpenAI | GPT-5.4 nano (medium) | 1201 | -23 / +24 | Mar 2026 |
| 70 | DeepSeek | DeepSeek V3.2 (Reasoning) | 1197 | -24 / +22 | Dec 2025 |
| 71 | OpenAI | GPT-5.1 Codex (high) | 1192 | -26 / +26 | Nov 2025 |
| 72 | Alibaba | Qwen3.5 397B A17B (Reasoning) | 1191 | -22 / +24 | Feb 2026 |
| 73 | OpenAI | GPT-5.4 nano (xhigh) | 1190 | -26 / +24 | Mar 2026 |
| 74 | OpenAI | GPT-5 mini (high) | 1186 | -23 / +22 | Aug 2025 |
| 75 | Alibaba | Qwen3.5 Omni Plus | 1185 | -22 / +23 | Mar 2026 |
| 76 | Google | Gemini 3 Pro Preview (high) | 1185 | -22 / +23 | Nov 2025 |
| 77 | Z AI | GLM-4.7 (Reasoning) | 1184 | -23 / +23 | Dec 2025 |
| 78 | MiniMax | MiniMax-M2.5 | 1181 | -25 / +24 | Feb 2026 |
| 79 | Z AI | GLM-4.7 (Non-reasoning) | 1175 | -24 / +23 | Dec 2025 |
| 80 | xAI | Grok 4.20 0309 v2 (Reasoning) | 1173 | -22 / +23 | Apr 2026 |
| 81 | Anthropic | Claude 4.5 Haiku (Reasoning) | 1172 | -24 / +24 | Oct 2025 |
| 82 | Mistral | Mistral Medium 3.5 | 1168 | -25 / +24 | Apr 2026 |
| 83 | Google | Gemini 3 Pro Preview (low) | 1168 | -27 / +26 | Nov 2025 |
| 84 | Alibaba | Qwen3.5 27B (Non-reasoning) | 1160 | -23 / +21 | Feb 2026 |
| 85 | Alibaba | Qwen3.5 27B (Reasoning) | 1156 | -23 / +22 | Feb 2026 |
| 86 | OpenAI | GPT-5 (low) | 1150 | -23 / +23 | Aug 2025 |
| 87 | - | ChatGPT Plus - 5.1 Thinking (Extended Thinking) | 1149 | -41 / +45 | - |
| 88 | Alibaba | Qwen3 Max Thinking | 1137 | -24 / +23 | Jan 2026 |
| 89 | Anthropic | Claude 4.5 Haiku (Non-reasoning) | 1136 | -27 / +26 | Oct 2025 |
| 90 | Anthropic | Claude 4 Sonnet (Reasoning) | 1134 | -28 / +27 | May 2025 |
| 91 | Anthropic | Claude 4 Sonnet (Non-reasoning) | 1130 | -26 / +24 | May 2025 |
| 92 | KwaiKAT | KAT Coder Pro V2 | 1120 | -24 / +22 | Mar 2026 |
| 93 | Google | Gemini 3 Flash Preview (Non-reasoning) | 1116 | -25 / +28 | Dec 2025 |
| 94 | Alibaba | Qwen3.5 122B A10B (Reasoning) | 1115 | -24 / +22 | Feb 2026 |
| 95 | Google | Gemma 4 31B (Reasoning) | 1114 | -22 / +23 | Apr 2026 |
| 96 | Alibaba | Qwen3.5 122B A10B (Non-reasoning) | 1112 | -24 / +23 | Feb 2026 |
| 97 | MiniMax | MiniMax-M2.1 | 1088 | -25 / +26 | Dec 2025 |
| 98 | Xiaomi | MiMo-V2-Flash (Reasoning) | 1080 | -27 / +24 | Dec 2025 |
| 99 | DeepSeek | DeepSeek V3.1 (Non-reasoning) | 1080 | -24 / +23 | Aug 2025 |
| 100 | China Mobile | JT-35B-Flash | 1075 | -23 / +26 | - |
| 101 | DeepSeek | DeepSeek V3.2 Exp (Non-reasoning) | 1075 | -25 / +25 | Sep 2025 |
| 102 | Google | Gemini 2.5 Flash Preview (Sep '25) (Reasoning) | 1072 | -24 / +23 | Sep 2025 |
| 103 | StepFun | Step 3.5 Flash 2603 | 1069 | -24 / +23 | Apr 2026 |
| 104 | Xiaomi | MiMo-V2-Flash (Non-reasoning) | 1063 | -26 / +24 | Dec 2025 |
| 105 | StepFun | Step 3.5 Flash | 1054 | -26 / +26 | Feb 2026 |
| 106 | Alibaba | Qwen3.5 35B A3B (Non-reasoning) | 1051 | -22 / +21 | Feb 2026 |
| 107 | Anthropic | Claude 3.7 Sonnet (Reasoning) | 1049 | -26 / +23 | Feb 2025 |
| 108 | OpenAI | GPT-5.1 Codex mini (high) | 1049 | -24 / +26 | Nov 2025 |
| 109 | Anthropic | Claude 3.7 Sonnet (Non-reasoning) | 1046 | -25 / +24 | Feb 2025 |
| 110 | InclusionAI | Ling-2.6-1T | 1044 | -23 / +23 | Apr 2026 |
| 111 | xAI | Grok 4.20 0309 (Reasoning) | 1044 | -22 / +21 | Mar 2026 |
| 112 | Xiaomi | MiMo-V2-Flash (Feb 2026) | 1044 | -26 / +26 | Dec 2025 |
| 113 | xAI | Grok 4.1 Fast (Reasoning) | 1043 | -24 / +25 | Nov 2025 |
| 114 | Alibaba | Qwen3 Max | 1040 | -24 / +23 | Sep 2025 |
| 115 | xAI | Grok 4.20 0309 v2 (Non-reasoning) | 1038 | -28 / +28 | Apr 2026 |
| 116 | - | Perplexity Pro - Labs | 1032 | -41 / +39 | - |
| 117 | MiniMax | MiniMax-M2 | 1031 | -26 / +26 | Oct 2025 |
| 118 | Z AI | GLM-4.6 (Reasoning) | 1029 | -27 / +28 | Sep 2025 |
| 119 | xAI | Grok 4 Fast (Reasoning) | 1015 | -23 / +25 | Sep 2025 |
| 120 | Google | Gemma 4 26B A4B (Reasoning) | 1013 | -24 / +24 | Apr 2026 |
| 121 | OpenAI | o4-mini (high) | 1007 | -24 / +22 | Apr 2025 |
| 122 | DeepSeek | DeepSeek V3.1 Terminus (Reasoning) | 1006 | -27 / +28 | Sep 2025 |
| 123 | OpenAI | GPT-5.4 mini (Non-Reasoning) | 1005 | -24 / +25 | Mar 2026 |
| 124 | NVIDIA | NVIDIA Nemotron 3 Super 120B A12B (Reasoning) | 1004 | -23 / +21 | Mar 2026 |
| 125 | DeepSeek | DeepSeek V3.2 Exp (Reasoning) | 1004 | -26 / +23 | Sep 2025 |
| 126 | Google | Gemma 4 31B (Non-reasoning) | 1003 | -24 / +22 | Apr 2026 |
| 127 | OpenAI | GPT-5 mini (medium) | 1002 | -28 / +27 | Aug 2025 |
| 128 | OpenAI | GPT-5.1 (Non-reasoning) | 1000 | -0 / +0 | Nov 2025 |
| 129 | OpenAI | GPT-5 (medium) | 1000 | -27 / +26 | Aug 2025 |
| 130 | MiniMax | MiniMax M1 80k | 993 | -25 / +25 | Jun 2025 |
| 131 | Kimi | Kimi K2 Thinking | 993 | -25 / +23 | Nov 2025 |
| 132 | ByteDance Seed | Doubao Seed Code | 986 | -27 / +26 | Nov 2025 |
| 133 | Z AI | GLM-4.6 (Non-reasoning) | 986 | -24 / +25 | Sep 2025 |
| 134 | xAI | Grok 4 | 982 | -24 / +24 | Jul 2025 |
| 135 | DeepSeek | DeepSeek V3.1 Terminus (Non-reasoning) | 976 | -24 / +26 | Sep 2025 |
| 136 | Amazon | Nova 2.0 Pro Preview (medium) | 974 | -27 / +23 | Nov 2025 |
| 137 | - | Google AI Pro - Thinking with 3 Pro | 972 | -43 / +43 | - |
| 138 | Inception | Mercury 2 | 958 | -23 / +23 | Feb 2026 |
| 139 | Alibaba | Qwen3 Max Thinking (Preview) | 949 | -27 / +27 | Nov 2025 |
| 140 | Google | Gemma 4 26B A4B (Non-reasoning) | 948 | -22 / +21 | Apr 2026 |
| 141 | OpenAI | gpt-oss-120B (high) | 947 | -29 / +27 | Aug 2025 |
| 142 | OpenAI | GPT-5.4 nano (Non-Reasoning) | 945 | -31 / +29 | Mar 2026 |
| 143 | Google | Gemini 3.1 Flash-Lite Preview | 926 | -22 / +23 | Mar 2026 |
| 144 | Alibaba | Qwen3 Coder Next | 913 | -23 / +24 | Feb 2026 |
| 145 | Google | Gemini 2.5 Pro | 913 | -25 / +26 | Jun 2025 |
| 146 | xAI | Grok 4.20 0309 (Non-reasoning) | 910 | -23 / +23 | Mar 2026 |
| 147 | Alibaba | Qwen3.5 35B A3B (Reasoning) | 909 | -21 / +24 | Feb 2026 |
| 148 | Alibaba | Qwen3.5 Omni Flash | 898 | -25 / +25 | Mar 2026 |
| 149 | - | SuperGrok - Grok 4 | 882 | -46 / +40 | - |
| 150 | DeepSeek | DeepSeek V3.2 (Non-reasoning) | 878 | -27 / +27 | Dec 2025 |
| 151 | Arcee AI | Trinity Large Thinking | 866 | -24 / +23 | Apr 2026 |
| 152 | Kimi | Kimi K2 0905 | 865 | -29 / +29 | Sep 2025 |
| 153 | Mistral | Mistral Large 3 | 863 | -25 / +23 | Dec 2025 |
| 154 | Mistral | Mistral Small 4 (Reasoning) | 861 | -24 / +22 | Mar 2026 |
| 155 | Mistral | Devstral 2 | 856 | -25 / +26 | Dec 2025 |
| 156 | Amazon | Nova 2.0 Lite (high) | 851 | -26 / +23 | Oct 2025 |
| 157 | Google | Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) | 851 | -27 / +27 | Sep 2025 |
| 158 | Mistral | Mistral Small 4 (Non-reasoning) | 846 | -24 / +20 | Mar 2026 |
| 159 | Alibaba | Qwen3.5 9B (Non-reasoning) | 843 | -23 / +23 | Mar 2026 |
| 160 | Z AI | GLM-4.7-Flash (Reasoning) | 838 | -26 / +24 | Jan 2026 |
| 161 | LongCat | LongCat Flash Lite | 838 | -27 / +25 | Jan 2026 |
| 162 | Mistral | Devstral Small (May '25) | 833 | -26 / +27 | May 2025 |
| 163 | OpenAI | gpt-oss-120B (low) | 832 | -24 / +24 | Aug 2025 |
| 164 | China Mobile | JT-MINI | 831 | -25 / +24 | - |
| 165 | LG AI Research | K-EXAONE (Reasoning) | 826 | -27 / +26 | Dec 2025 |
| 166 | Alibaba | Qwen3 235B A22B 2507 (Reasoning) | 822 | -25 / +23 | Jul 2025 |
| 167 | Alibaba | Qwen3 Max (Preview) | 819 | -25 / +23 | Sep 2025 |
| 168 | KwaiKAT | KAT-Coder-Pro V1 | 819 | -26 / +26 | Nov 2025 |
| 169 | Mistral | Devstral Small 2 | 818 | -26 / +25 | Dec 2025 |
| 170 | LG AI Research | EXAONE 4.5 33B | 813 | -24 / +24 | Apr 2026 |
| 171 | Z AI | GLM-4.7-Flash (Non-reasoning) | 802 | -38 / +36 | Jan 2026 |
| 172 | Baidu | ERNIE 5.0 Thinking Preview | 790 | -27 / +27 | Nov 2025 |
| 173 | xAI | Grok 4.1 Fast (Non-reasoning) | 786 | -28 / +28 | Nov 2025 |
| 174 | Amazon | Nova 2.0 Omni (medium) | 784 | -27 / +27 | Nov 2025 |
| 175 | InclusionAI | Ling 2.6 Flash | 782 | -23 / +24 | Apr 2026 |
| 176 | Mistral | Mistral Medium 3.1 | 781 | -24 / +28 | Aug 2025 |
| 177 | Alibaba | Qwen3 235B A22B 2507 Instruct | 781 | -28 / +27 | Jul 2025 |
| 178 | xAI | Grok 4 Fast (Non-reasoning) | 778 | -26 / +26 | Sep 2025 |
| 179 | OpenAI | GPT-4.1 | 777 | -27 / +26 | Apr 2025 |
| 180 | Alibaba | Qwen3 VL 4B (Reasoning) | 776 | -39 / +40 | Oct 2025 |
| 181 | NVIDIA | Nemotron 3 Nano Omni 30B A3B Reasoning | 767 | -28 / +27 | Apr 2026 |
| 182 | LG AI Research | K-EXAONE (Non-reasoning) | 765 | -27 / +25 | Dec 2025 |
| 183 | xAI | Grok Code Fast 1 | 764 | -27 / +27 | Aug 2025 |
| 184 | ByteDance Seed | Seed-OSS-36B-Instruct | 760 | -27 / +26 | Aug 2025 |
| 185 | NVIDIA | Nemotron Cascade 2 30B A3B | 757 | -25 / +23 | Mar 2026 |
| 186 | Alibaba | Qwen3 235B A22B (Reasoning) | 757 | -29 / +27 | Apr 2025 |
| 187 | OpenAI | GPT-5 nano (high) | 755 | -27 / +25 | Aug 2025 |
| 188 | OpenAI | o3 | 754 | -29 / +29 | Apr 2025 |
| 189 | Prime Intellect | INTELLECT-3 | 752 | -28 / +25 | Nov 2025 |
| 190 | OpenAI | o3-mini (high) | 748 | -27 / +28 | Jan 2025 |
| 191 | Google | Gemini 2.5 Flash (Non-reasoning) | 744 | -27 / +29 | May 2025 |
| 192 | Alibaba | Qwen3 235B A22B (Non-reasoning) | 740 | -29 / +28 | Apr 2025 |
| 193 | Sarvam | Sarvam 105B (high) | 738 | -25 / +23 | Mar 2026 |
| 194 | OpenAI | o1 | 737 | -27 / +26 | Dec 2024 |
| 195 | Alibaba | Qwen3 Next 80B A3B (Reasoning) | 726 | -27 / +24 | Sep 2025 |
| 196 | Alibaba | Qwen3.5 9B (Reasoning) | 717 | -22 / +23 | Mar 2026 |
| 197 | Alibaba | Qwen3 VL 235B A22B (Reasoning) | 715 | -26 / +27 | Sep 2025 |
| 198 | Alibaba | Qwen3 Coder 30B A3B Instruct | 712 | -28 / +25 | Jul 2025 |
| 199 | Anthropic | Claude 3.5 Haiku | 708 | -25 / +24 | Oct 2024 |
| 200 | Google | Gemini 2.5 Flash (Reasoning) | 701 | -30 / +29 | May 2025 |
| 201 | Mistral | Devstral Medium | 694 | -27 / +26 | Jul 2025 |
| 202 | Z AI | GLM-4.6V (Non-reasoning) | 693 | -28 / +26 | Dec 2025 |
| 203 | InclusionAI | Ring-1T | 688 | -26 / +27 | Oct 2025 |
| 204 | Alibaba | Qwen3 VL 8B Instruct | 684 | -37 / +41 | Oct 2025 |
| 205 | DeepSeek | DeepSeek R1 0528 (May '25) | 682 | -29 / +25 | May 2025 |
| 206 | Naver | HyperCLOVA X SEED Think (32B) | 679 | -25 / +25 | Dec 2025 |
| 207 | Upstage | Solar Pro 3 | 677 | -24 / +25 | Apr 2026 |
| 208 | Mistral | Magistral Small 1.2 | 671 | -26 / +26 | Sep 2025 |
| 209 | Alibaba | Qwen3 VL 8B (Reasoning) | 670 | -27 / +26 | Oct 2025 |
| 210 | Alibaba | Qwen3.5 4B (Non-reasoning) | 670 | -25 / +23 | Mar 2026 |
| 211 | Alibaba | Qwen3 VL 30B A3B (Reasoning) | 668 | -38 / +37 | Oct 2025 |
| 212 | xAI | Grok 3 | 668 | -27 / +27 | Feb 2025 |
| 213 | Mistral | Magistral Medium 1 | 667 | -28 / +26 | Jun 2025 |
| 214 | Upstage | Solar Open 100B (Reasoning) | 666 | -30 / +28 | Dec 2025 |
| 215 | Alibaba | Qwen3 30B A3B 2507 (Reasoning) | 664 | -25 / +24 | Jul 2025 |
| 216 | Amazon | Nova 2.0 Pro Preview (low) | 661 | -29 / +29 | Nov 2025 |
| 217 | Mistral | Ministral 3 14B | 658 | -27 / +27 | Dec 2025 |
| 218 | OpenAI | gpt-oss-20B (high) | 652 | -27 / +26 | Aug 2025 |
| 219 | Alibaba | Qwen3 VL 32B (Reasoning) | 648 | -29 / +27 | Oct 2025 |
| 220 | Korea Telecom | Mi:dm K 2.5 Pro | 643 | -28 / +26 | Dec 2025 |
| 221 | Amazon | Nova 2.0 Lite (medium) | 642 | -26 / +26 | Oct 2025 |
| 222 | Mistral | Ministral 3 8B | 640 | -29 / +29 | Dec 2025 |
| 223 | Alibaba | Qwen3 VL 235B A22B Instruct | 637 | -39 / +39 | Sep 2025 |
| 224 | Mistral | Magistral Medium 1.2 | 629 | -27 / +26 | Sep 2025 |
| 225 | Alibaba | Qwen3 Next 80B A3B Instruct | 628 | -28 / +28 | Sep 2025 |
| 226 | OpenAI | GPT-4.1 mini | 620 | -28 / +27 | Apr 2025 |
| 227 | DeepSeek | DeepSeek V3.1 (Reasoning) | 613 | -29 / +25 | Aug 2025 |
| 228 | Z AI | GLM-4.6V (Reasoning) | 610 | -30 / +28 | Dec 2025 |
| 229 | MBZUAI Institute of Foundation Models | K2 Think V2 | 608 | -27 / +26 | Dec 2025 |
| 230 | OpenAI | GPT-5 nano (medium) | 595 | -29 / +26 | Aug 2025 |
| 231 | Alibaba | Qwen3 4B 2507 (Reasoning) | 591 | -28 / +27 | Aug 2025 |
| 232 | Nous Research | Hermes 4 - Llama-3.1 405B (Reasoning) | 587 | -26 / +25 | Aug 2025 |
| 233 | Mistral | Mistral Medium 3 | 587 | -28 / +27 | May 2025 |
| 234 | MBZUAI Institute of Foundation Models | K2-V2 (medium) | 581 | -28 / +28 | Dec 2025 |
| 235 | ServiceNow | Apriel-v1.6-15B-Thinker | 574 | -27 / +28 | Nov 2025 |
| 236 | Google | Gemini 2.0 Flash (Feb '25) | 571 | -27 / +26 | Feb 2025 |
| 237 | Mistral | Devstral Small (Jul '25) | 566 | -29 / +27 | Jul 2025 |
| 238 | NVIDIA | NVIDIA Nemotron 3 Nano 30B A3B (Reasoning) | 565 | -28 / +28 | Dec 2025 |
| 239 | MBZUAI Institute of Foundation Models | K2-V2 (high) | 562 | -30 / +27 | Dec 2025 |
| 240 | Z AI | GLM-4.5-Air | 561 | -31 / +28 | Jul 2025 |
| 241 | OpenAI | gpt-oss-20B (low) | 550 | -27 / +25 | Aug 2025 |
| 242 | IBM | Granite 4.1 8B | 542 | -26 / +28 | Apr 2026 |
| 243 | Nous Research | Hermes 4 - Llama-3.1 70B (Reasoning) | 539 | -23 / +24 | Aug 2025 |
| 244 | Kimi | Kimi K2 | 528 | -34 / +32 | Jul 2025 |
| 245 | Nous Research | Hermes 4 - Llama-3.1 70B (Non-reasoning) | 523 | -24 / +25 | Aug 2025 |
| 246 | Alibaba | Qwen3 30B A3B 2507 Instruct | 517 | -27 / +27 | Jul 2025 |
| 247 | Nous Research | Hermes 4 - Llama-3.1 405B (Non-reasoning) | 512 | -24 / +24 | Aug 2025 |
| 248 | Z AI | GLM-4.5V (Reasoning) | 512 | -24 / +21 | Aug 2025 |
| 249 | Alibaba | Qwen3.5 4B (Reasoning) | 511 | -29 / +29 | Mar 2026 |
| 250 | Amazon | Nova 2.0 Lite (low) | 510 | -30 / +25 | Oct 2025 |
| 251 | Alibaba | Qwen3 Coder 480B A35B Instruct | 508 | -32 / +27 | Jul 2025 |
| 252 | Amazon | Nova Premier | 507 | -29 / +29 | Apr 2025 |
| 253 | Alibaba | Qwen3 30B A3B (Reasoning) | 501 | -27 / +25 | Apr 2025 |
| 254 | LG AI Research | EXAONE 4.0 32B (Reasoning) | 500 | -28 / +27 | Jul 2025 |
| 255 | Allen Institute for AI | Molmo2-8B | 500 | -0 / +0 | Dec 2025 |
| 256 | DeepSeek | DeepSeek V3.2 Speciale | 500 | -0 / +0 | Dec 2025 |
| 257 | Alibaba | Qwen3 8B (Reasoning) | 498 | -28 / +27 | Apr 2025 |
| 258 | Alibaba | Qwen3 VL 30B A3B Instruct | 498 | -30 / +28 | Oct 2025 |
| 259 | Alibaba | Qwen3 Omni 30B A3B (Reasoning) | 497 | -28 / +25 | Sep 2025 |
| 260 | IBM | Granite 4.1 30B | 495 | -28 / +24 | Apr 2026 |
| 261 | Alibaba | Qwen3 32B (Reasoning) | 492 | -28 / +26 | Apr 2025 |
| 262 | Motif Technologies | Motif-2-12.7B-Reasoning | 485 | -30 / +28 | Dec 2025 |
| 263 | Mistral | Ministral 3 3B | 485 | -29 / +28 | Dec 2025 |
| 264 | NVIDIA | NVIDIA Nemotron 3 Nano 4B | 479 | -30 / +29 | Mar 2026 |
| 265 | Alibaba | Qwen3 14B (Reasoning) | 478 | -27 / +29 | Apr 2025 |
| 266 | Alibaba | Qwen3 14B (Non-reasoning) | 475 | -28 / +28 | Apr 2025 |
| 267 | OpenAI | GPT-5 mini (minimal) | 471 | -31 / +31 | Aug 2025 |
| 268 | Z AI | GLM-4.5 (Reasoning) | 469 | -34 / +31 | Jul 2025 |
| 269 | Alibaba | Qwen3 8B (Non-reasoning) | 469 | -26 / +27 | Apr 2025 |
| 270 | Z AI | GLM-4.5V (Non-reasoning) | 461 | -31 / +28 | Aug 2025 |
| 271 | Upstage | Solar Pro 2 (Reasoning) | 453 | -32 / +28 | Jul 2025 |
| 272 | Upstage | Solar Pro 2 (Non-reasoning) | 447 | -29 / +27 | Jul 2025 |
| 273 | NVIDIA | NVIDIA Nemotron Nano 9B V2 (Reasoning) | 439 | -28 / +28 | Aug 2025 |
| 274 | Google | Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning) | 439 | -32 / +30 | Sep 2025 |
| 275 | Meta | Llama 4 Maverick | 437 | -29 / +27 | Apr 2025 |
| 276 | InclusionAI | Ling-flash-2.0 | 421 | -31 / +29 | Sep 2025 |
| 277 | xAI | Grok 3 mini Reasoning (high) | 421 | -38 / +39 | Feb 2025 |
| 278 | DeepSeek | DeepSeek V3 (Dec '24) | 410 | -27 / +29 | Dec 2024 |
| 279 | DeepSeek | DeepSeek V3 0324 | 408 | -29 / +29 | Mar 2025 |
| 280 | InclusionAI | Ling-1T | 403 | -29 / +31 | Oct 2025 |
| 281 | Meta | Llama 3.3 Instruct 70B | 402 | -33 / +28 | Dec 2024 |
| 282 | Amazon | Nova Pro | 389 | -28 / +28 | Dec 2024 |
| 283 | OpenAI | GPT-5 (minimal) | 388 | -30 / +30 | Aug 2025 |
| 284 | Google | Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning) | 385 | -28 / +30 | Sep 2025 |
| 285 | Amazon | Nova 2.0 Lite (Non-reasoning) | 383 | -30 / +31 | Oct 2025 |
| 286 | Anthropic | Claude 3 Haiku | 380 | -26 / +22 | Mar 2024 |
| 287 | NVIDIA | Llama Nemotron Super 49B v1.5 (Non-reasoning) | 379 | -28 / +29 | Jul 2025 |
| 288 | OpenAI | GPT-4o (Aug '24) | 379 | -28 / +27 | Aug 2024 |
| 289 | Trillion Labs | Tri-21B-Think | 374 | -25 / +25 | Feb 2026 |
| 290 | TII UAE | Falcon-H1R-7B | 374 | -33 / +30 | Jan 2026 |
| 291 | NVIDIA | Llama Nemotron Super 49B v1.5 (Reasoning) | 370 | -31 / +30 | Jul 2025 |
| 292 | MBZUAI Institute of Foundation Models | K2-V2 (low) | 368 | -29 / +29 | Dec 2025 |
| 293 | IBM | Granite 4.1 3B | 367 | -26 / +25 | Apr 2026 |
| 294 | Amazon | Nova 2.0 Omni (low) | 362 | -33 / +29 | Nov 2025 |
| 295 | Sarvam | Sarvam 30B (high) | 361 | -25 / +24 | Mar 2026 |
| 296 | Allen Institute for AI | Olmo 3.1 32B Instruct | 360 | -31 / +28 | Jan 2026 |
| 297 | Nanbeige | Nanbeige4.1-3B | 358 | -31 / +30 | Feb 2026 |
| 298 | OpenAI | GPT-4o (Nov '24) | 351 | -26 / +23 | Nov 2024 |
| 299 | NVIDIA | NVIDIA Nemotron 3 Nano 30B A3B (Non-reasoning) | 350 | -29 / +28 | Dec 2025 |
| 300 | Amazon | Nova Lite | 345 | -32 / +29 | Dec 2024 |
| 301 | Alibaba | Qwen3 VL 4B Instruct | 345 | -26 / +29 | Oct 2025 |
| 302 | IBM | Granite 4.0 H Small | 343 | -29 / +27 | Sep 2025 |
| 303 | Amazon | Nova Micro | 341 | -31 / +33 | Dec 2024 |
| 304 | Mistral | Mistral Small 3.1 | 341 | -28 / +27 | Mar 2025 |
| 305 | NVIDIA | Llama 3.1 Nemotron Instruct 70B | 338 | -30 / +31 | Oct 2024 |
| 306 | Trillion Labs | Tri-21B-think Preview | 337 | -33 / +30 | Feb 2026 |
| 307 | Alibaba | Qwen3 30B A3B (Non-reasoning) | 333 | -28 / +28 | Apr 2025 |
| 308 | LG AI Research | EXAONE 4.0 32B (Non-reasoning) | 331 | -32 / +29 | Jul 2025 |
| 309 | NVIDIA | NVIDIA Nemotron Nano 12B v2 VL (Reasoning) | 330 | -27 / +27 | Oct 2025 |
| 310 | Mistral | Mistral Large 2 (Nov '24) | 326 | -31 / +29 | Nov 2024 |
| 311 | Amazon | Nova 2.0 Pro Preview (Non-reasoning) | 324 | -33 / +28 | Nov 2025 |
| 312 | Alibaba | Qwen3.5 2B (Reasoning) | 323 | -24 / +23 | Mar 2026 |
| 313 | OpenAI | GPT-4.1 nano | 322 | -29 / +29 | Apr 2025 |
| 314 | Google | Gemini 2.5 Flash-Lite (Reasoning) | 321 | -32 / +29 | Jun 2025 |
| 315 | Alibaba | Qwen3 0.6B (Reasoning) | 316 | -31 / +29 | Apr 2025 |
| 316 | Alibaba | Qwen3 4B 2507 Instruct | 308 | -30 / +30 | Aug 2025 |
| 317 | Amazon | Nova 2.0 Omni (Non-reasoning) | 306 | -32 / +28 | Nov 2025 |
| 318 | Mistral | Mistral Small 3.2 | 305 | -30 / +30 | Jun 2025 |
| 319 | Google | Gemma 4 E4B (Reasoning) | 305 | -26 / +24 | Apr 2026 |
| 320 | NVIDIA | NVIDIA Nemotron Nano 9B V2 (Non-reasoning) | 304 | -31 / +30 | Aug 2025 |
| 321 | Google | Gemini 2.5 Flash-Lite (Non-reasoning) | 303 | -30 / +31 | Jun 2025 |
| 322 | Alibaba | Qwen3 VL 32B Instruct | 302 | -31 / +29 | Oct 2025 |
| 323 | LG AI Research | Exaone 4.0 1.2B (Non-reasoning) | 299 | -31 / +29 | Jul 2025 |
| 324 | LG AI Research | Exaone 4.0 1.2B (Reasoning) | 297 | -30 / +28 | Jul 2025 |
| 325 | Alibaba | Qwen3 Omni 30B A3B Instruct | 297 | -34 / +29 | Sep 2025 |
| 326 | IBM | Granite 4.0 H 350M | 294 | -32 / +26 | Oct 2025 |
| 327 | Google | Gemma 4 E4B (Non-reasoning) | 293 | -27 / +25 | Apr 2026 |
| 328 | Meta | Llama 3.1 Instruct 70B | 288 | -30 / +28 | Jul 2024 |
| 329 | OpenAI | GPT-5 nano (minimal) | 287 | -33 / +31 | Aug 2025 |
| 330 | Google | Gemma 3 27B Instruct | 287 | -31 / +28 | Mar 2025 |
| 331 | NVIDIA | NVIDIA Nemotron Nano 12B v2 VL (Non-reasoning) | 287 | -29 / +29 | Oct 2025 |
| 332 | AI21 Labs | Jamba 1.7 Large | 285 | -29 / +31 | Jul 2025 |
| 333 | Google | Gemma 3 12B Instruct | 282 | -32 / +28 | Mar 2025 |
| 334 | IBM | Granite 4.0 Micro | 282 | -30 / +27 | Sep 2025 |
| 335 | AI21 Labs | Jamba 1.7 Mini | 280 | -31 / +27 | Jul 2025 |
| 336 | Alibaba | Qwen3 0.6B (Non-reasoning) | 280 | -33 / +30 | Apr 2025 |
| 337 | Cohere | Command A | 280 | -29 / +26 | Mar 2025 |
| 338 | Meta | Llama 3.1 Instruct 8B | 279 | -34 / +29 | Jul 2024 |
| 339 | Allen Institute for AI | Olmo 3 7B Instruct | 279 | -29 / +29 | Nov 2025 |
| 340 | Alibaba | Qwen3.5 0.8B (Reasoning) | 279 | -25 / +24 | Mar 2026 |
| 341 | Alibaba | Qwen3 1.7B (Reasoning) | 275 | -33 / +30 | Apr 2025 |
| 342 | Meta | Llama 4 Scout | 274 | -30 / +30 | Apr 2025 |
| 343 | IBM | Granite 4.0 350M | 274 | -33 / +29 | Oct 2025 |
| 344 | Liquid AI | LFM2 1.2B | 274 | -30 / +31 | Jul 2025 |
| 345 | Google | Gemma 4 E2B (Reasoning) | 273 | -26 / +25 | Apr 2026 |
| 346 | IBM | Granite 4.0 H 1B | 272 | -33 / +29 | Oct 2025 |
| 347 | Liquid AI | LFM2.5-1.2B-Instruct | 268 | -34 / +31 | Jan 2026 |
| 348 | OpenBMB | MiniCPM-V 4.6 1.3B | 266 | -31 / +27 | May 2026 |
| 349 | InclusionAI | Ling-mini-2.0 | 264 | -25 / +25 | Sep 2025 |
| 350 | IBM | Granite 4.0 1B | 260 | -31 / +30 | Oct 2025 |
| 351 | Google | Gemma 3 4B Instruct | 258 | -31 / +30 | Mar 2025 |
| 352 | Liquid AI | LFM2 8B A1B | 258 | -34 / +29 | Oct 2025 |
| 353 | AI21 Labs | Jamba Reasoning 3B | 258 | -29 / +27 | Oct 2025 |
| 354 | StepFun | Step3 VL 10B | 258 | -32 / +28 | Jan 2026 |
| 355 | Meta | Llama 3.1 Instruct 405B | 257 | -30 / +30 | Jul 2024 |
| 356 | Liquid AI | LFM2.5-1.2B-Thinking | 257 | -32 / +28 | Jan 2026 |
| 357 | Google | Gemma 4 E2B (Non-reasoning) | 256 | -26 / +25 | Apr 2026 |
| 358 | Alibaba | Qwen3 1.7B (Non-reasoning) | 256 | -33 / +30 | Apr 2025 |
| 359 | DeepSeek | DeepSeek R1 (Jan '25) | 252 | -30 / +27 | Jan 2025 |
| 360 | Google | Gemma 3n E4B Instruct | 246 | -32 / +30 | Jun 2025 |
| 361 | NVIDIA | Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) | 241 | -25 / +24 | Apr 2025 |
| 362 | Liquid AI | LFM2 2.6B | 239 | -31 / +29 | Sep 2025 |
| 363 | Alibaba | Qwen3.5 2B (Non-reasoning) | 238 | -24 / +23 | Mar 2026 |
| 364 | Liquid AI | LFM2 24B A2B | 237 | -25 / +25 | Feb 2026 |
| 365 | Alibaba | Qwen3.5 0.8B (Non-reasoning) | 235 | -25 / +26 | Mar 2026 |
| 366 | Liquid AI | LFM2.5-VL-1.6B | 234 | -33 / +27 | Jan 2026 |
| 367 | Microsoft | Phi-4 Mini Instruct | 233 | -30 / +27 | Feb 2024 |
| 368 | IBM | Granite 3.3 8B (Non-reasoning) | 226 | -32 / +28 | Apr 2025 |
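
When comparing nearby entries in the leaderboard above, it can help to check whether their score ranges overlap. The sketch below assumes the minus/plus values listed beside each Elo score are the lower and upper margins of an uncertainty interval around that score. This page does not define them, so that reading, along with the `Entry` type and helper names, is an assumption made for illustration.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    elo: int
    minus: int  # margin below the Elo score (assumed interval bound)
    plus: int   # margin above the Elo score (assumed interval bound)


def interval(entry: Entry) -> tuple[int, int]:
    """Return the (low, high) range implied by the listed margins."""
    return entry.elo - entry.minus, entry.elo + entry.plus


def intervals_overlap(a: Entry, b: Entry) -> bool:
    """True when the two ranges share at least one point."""
    low_a, high_a = interval(a)
    low_b, high_b = interval(b)
    return low_a <= high_b and low_b <= high_a


# Values copied from the first, second, and fourth leaderboard entries.
first = Entry("GPT-5.5 (xhigh)", 1772, 30, 31)
second = Entry("Claude Opus 4.7 (Adaptive Reasoning, Max Effort)", 1753, 41, 40)
fourth = Entry("Claude Opus 4.7 (Non-reasoning, High Effort)", 1687, 28, 31)

print(intervals_overlap(first, second))  # ranges 1742-1803 and 1712-1793
print(intervals_overlap(first, fourth))  # ranges 1742-1803 and 1659-1718
```

Overlapping ranges suggest that the ordering of two models may not be settled by the available comparisons, while a clear gap is stronger evidence of a real difference.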

Explore Evaluations

Artificial Analysis Intelligence Index

A composite benchmark aggregating ten challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.

GDPval-AA Leaderboard

GDPval-AA is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with Elo ratings derived from blind pairwise comparisons.

APEX-Agents-AA Benchmark Leaderboard

Artificial Analysis' implementation of the APEX-Agents benchmark, testing AI agents on long-horizon, cross-application tasks in professional-services environments with realistic application tooling.

𝜏²-Bench Telecom Benchmark Leaderboard

A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.

Terminal-Bench Hard Benchmark Leaderboard

An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks.

SciCode Benchmark Leaderboard

A scientist-curated coding benchmark featuring 288 test set subproblems from 80 laboratory problems across 16 scientific disciplines.

Artificial Analysis Long Context Reasoning Benchmark Leaderboard

A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).

AA-Omniscience: Knowledge and Hallucination Benchmark

A benchmark measuring factual recall and hallucination across various economically relevant domains.

IFBench Benchmark Leaderboard

A benchmark evaluating precise instruction-following generalization on 58 diverse, verifiable out-of-domain constraints that test models' ability to follow specific output requirements.

Humanity's Last Exam Benchmark Leaderboard

A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities, designed to be the final closed-ended academic evaluation.

GPQA Diamond Benchmark Leaderboard

The most challenging 198 questions from GPQA, where PhD experts achieve 65% accuracy but skilled non-experts only reach 34% despite web access.

CritPt Benchmark Leaderboard

A benchmark designed to test LLMs on research-level physics reasoning tasks, featuring 71 composite research challenges.

Artificial Analysis Openness Index

A composite measure providing an industry standard to communicate model openness for users and developers.

MMLU-Pro Benchmark Leaderboard

An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.

Global-MMLU-Lite Benchmark Leaderboard

A lightweight, multilingual version of MMLU, designed to evaluate knowledge and reasoning skills across a diverse range of languages and cultural contexts.

LiveCodeBench Benchmark Leaderboard

A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces, evaluating code generation, self-repair, and execution.

MATH-500 Benchmark Leaderboard

A 500-problem subset from the MATH dataset, featuring competition-level mathematics across six domains including algebra, geometry, and number theory.

AIME 2025 Benchmark Leaderboard

All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000-999.

MMMU-Pro Benchmark Leaderboard

An enhanced MMMU benchmark that eliminates shortcuts and guessing strategies to more rigorously test multimodal models across 30 academic disciplines.