GDPval-AA Leaderboard

GDPval-AA is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with Elo ratings derived from blind pairwise comparisons.

The GDPval gold public dataset includes 220 tasks developed by OpenAI in collaboration with industry professionals to reflect real-world complexity.

The benchmark requires models to produce diverse outputs including documents, slides, diagrams, and spreadsheets, mirroring actual work products across finance, healthcare, legal, and other professional domains.

All evaluations are conducted independently by Artificial Analysis. More information can be found on our Intelligence Benchmarking Methodology page.

Publication

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, Jerry Tworek.

We introduce GDPval, a benchmark designed to evaluate AI models on real-world, economically valuable tasks across 44 occupations. The dataset encompasses 1,320 tasks derived from nine major industries contributing significantly to the U.S. GDP. These tasks were developed in collaboration with industry professionals averaging 14 years of experience, ensuring they accurately represent real-world complexities. The evaluation requires models to produce diverse outputs, including documents, slides, diagrams, and spreadsheets, mirroring actual work products. Initial results indicate that frontier AI models are approaching the quality of work produced by human experts, with models able to perform certain professional tasks approximately 100 times faster and at a fraction of the cost compared to human experts.

Related links: GDPval · arXiv:2510.04374 · openai/gdpval

GDPval

Claude Opus 4.8 (Adaptive Reasoning, Max Effort) has the highest GDPval-AA Elo at 1890, followed by GPT-5.5 (xhigh) at 1769 and GPT-5.5 (high) at 1753.
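To give a feel for what an Elo gap of this size means, the sketch below converts two ratings into an expected head-to-head win rate. It assumes the conventional Elo logistic curve with a 400-point scale; the page does not state which scale GDPval-AA uses, so treat both the function name `expected_win_rate` and the `scale` default as illustrative assumptions.

```python
def expected_win_rate(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Expected probability that A beats B under the conventional Elo curve.

    The 400-point scale is an assumption, not something stated by the source.
    """
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / scale))


# Ratings taken from the summary line above.
top_rating = 1890
second_rating = 1769

win_rate = expected_win_rate(top_rating, second_rating)
print(f"Expected win rate of the top entry over the second: {win_rate:.2f}")
```

Under these assumptions a gap of roughly 120 points corresponds to the higher-rated model being preferred in about two out of three comparisons.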

[Chart: GDPval-AA Leaderboard. Elo scores for agentic performance on real-world work tasks using web and shell access via Stirrup, an open-source harness developed by Artificial Analysis.]

Chatbots

[Chart: GDPval-AA: AI Chatbots. Elo scores for AI chatbots tested in the GDPval-AA evaluation.]

Score Comparisons

[Chart: GDPval-AA: Elo vs. Artificial Analysis Intelligence Index. Axes: GDPval-AA Elo and Artificial Analysis Intelligence Index, with points grouped by model creator.]

Token Usage

[Chart: GDPval-AA: Output Token Usage. Output tokens used to run the evaluation, split into reasoning tokens and answer tokens.]

The total number of tokens used to run the evaluation, including input tokens (prompt), reasoning tokens (for reasoning models), and answer tokens (final response).

Average Turns

[Chart: GDPval-AA: Average Turns per Task. Average number of turns per task.]

Score vs. Release Date

[Chart: GDPval-AA: Elo vs. Release Date. Points grouped by model creator.]

GDPval-AA Leaderboard

Rank	Creator	Model	GDPval-AA Elo	Uncertainty (- / +)	Release date
1	Anthropic	Claude Opus 4.8 (Adaptive Reasoning, Max Effort)	1890	-34 / +35	May 2026
2	OpenAI	GPT-5.5 (xhigh)	1769	-32 / +31	Apr 2026
3	OpenAI	GPT-5.5 (high)	1753	-28 / +31	Apr 2026
4	Anthropic	Claude Opus 4.7 (Adaptive Reasoning, Max Effort)	1753	-41 / +40	Apr 2026
5	Anthropic	Claude Opus 4.7 (Non-reasoning, High Effort)	1678	-27 / +28	Apr 2026
6	Anthropic	Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)	1676	-26 / +29	Feb 2026
7	OpenAI	GPT-5.4 (xhigh)	1674	-34 / +32	Mar 2026
8	Google	Gemini 3.5 Flash (high)	1656	-26 / +30	May 2026
9	Google	Gemini 3.5 Flash (medium)	1655	-27 / +30	May 2026
10	OpenAI	GPT-5.5 (medium)	1652	-26 / +28	Apr 2026
11	Anthropic	Claude Opus 4.6 (Adaptive Reasoning, Max Effort)	1619	-31 / +33	Feb 2026
12	Anthropic	Claude Sonnet 4.6 (Non-reasoning, High Effort)	1596	-26 / +25	Feb 2026
13	Anthropic	Claude Opus 4.6 (Non-reasoning, High Effort)	1591	-23 / +27	Feb 2026
14	Xiaomi	MiMo-V2.5-Pro	1571	-27 / +28	Apr 2026
15	DeepSeek	DeepSeek V4 Pro (Reasoning, High Effort)	1558	-29 / +31	Apr 2026
16	DeepSeek	DeepSeek V4 Pro (Reasoning, Max Effort)	1554	-29 / +29	Apr 2026
17	Xiaomi	MiMo-V2.5	1549	-25 / +26	Apr 2026
18	Alibaba	Qwen3.7 Max	1547	-27 / +28	May 2026
19	Z AI	GLM-5.1 (Reasoning)	1535	-0 / +0	Apr 2026
20	MiniMax	MiniMax-M2.7	1505	-24 / +26	Mar 2026
21	Alibaba	Qwen3.6 Max Preview	1504	-20 / +21	Apr 2026
22	OpenAI	GPT-5.4 (low)	1503	-27 / +29	Mar 2026
23	Z AI	GLM-5-Turbo	1496	-24 / +22	Mar 2026
24	xAI	Grok 4.3 (high)	1495	-25 / +23	Apr 2026
25	Z AI	GLM-5.1 (Non-reasoning)	1493	-28 / +27	Apr 2026
26	Kimi	Kimi K2.6	1481	-25 / +26	Apr 2026
27	OpenAI	GPT-5.3 Codex (xhigh)	1477	-22 / +26	Feb 2026
28	DeepSeek	DeepSeek V4 Pro (Non-reasoning)	1476	-24 / +26	Apr 2026
29	OpenAI	GPT-5.2 (xhigh)	1467	-25 / +25	Dec 2025
30	Anthropic	Claude Sonnet 4.6 (Non-reasoning, Low Effort)	1455	-25 / +25	Feb 2026
31	Anthropic	Claude Opus 4.5 (Reasoning)	1453	-24 / +25	Nov 2025
32	OpenAI	GPT-5.5 (low)	1443	-23 / +24	Apr 2026
33	Google	Gemini 3.5 Flash (minimal)	1440	-26 / +24	May 2026
34	OpenAI	GPT-5.4 mini (xhigh)	1438	-23 / +26	Mar 2026
35	Anthropic	Claude Opus 4.5 (Non-reasoning)	1420	-22 / +23	Nov 2025
36	Meta	Muse Spark	1417	-23 / +23	Apr 2026
37	DeepSeek	DeepSeek V4 Flash (Reasoning, High Effort)	1414	-25 / +26	Apr 2026
38	Xiaomi	MiMo-V2-Pro	1407	-22 / +24	Mar 2026
39	OpenAI	GPT-5.2 (medium)	1405	-23 / +23	Dec 2025
40	Alibaba	Qwen3.6 27B (Reasoning)	1404	-22 / +23	Apr 2026
41	Z AI	GLM-5 (Reasoning)	1394	-23 / +23	Feb 2026
42	DeepSeek	DeepSeek V4 Flash (Non-reasoning)	1390	-25 / +26	Apr 2026
43	DeepSeek	DeepSeek V4 Flash (Reasoning, Max Effort)	1388	-20 / +34	Apr 2026
44	Alibaba	Qwen3.6 27B (Non-reasoning)	1385	-23 / +24	Apr 2026
45	Alibaba	Qwen3.6 Plus	1351	-24 / +24	Apr 2026
46	Xiaomi	MiMo-V2-Omni-0327	1346	-24 / +23	Mar 2026
47	OpenAI	GPT-5.4 (Non-reasoning)	1342	-26 / +26	Mar 2026
48	Z AI	GLM 5V Turbo (Reasoning)	1330	-22 / +23	Apr 2026
49	Kimi	Kimi K2.6 (Non-reasoning)	1325	-26 / +29	Apr 2026
50	Z AI	GLM-5 (Non-reasoning)	1324	-21 / +23	Feb 2026
51	Google	Gemini 3 Deep Think	1324	-30 / +31	Feb 2026
52		Claude Pro - 4.5 Opus (Extended Thinking)	1319	-41 / +38	-
53	OpenAI	GPT-5.4 mini (medium)	1319	-22 / +24	Mar 2026
54	Xiaomi	MiMo-V2-Omni	1318	-23 / +24	Mar 2026
55	Anthropic	Claude 4.5 Sonnet (Reasoning)	1318	-24 / +25	Sep 2025
56	OpenAI	GPT-5.5 (Non-reasoning)	1316	-25 / +23	Apr 2026
57	Google	Gemini 3.1 Pro Preview	1314	-26 / +27	Feb 2026
58	xAI	Grok 4.3 (medium)	1312	-26 / +24	Apr 2026
59	Anthropic	Claude 4.5 Sonnet (Non-reasoning)	1307	-22 / +26	Sep 2025
60	xAI	Grok 4.3 (Non-reasoning)	1299	-24 / +25	Apr 2026
61	Alibaba	Qwen3.6 35B A3B (Reasoning)	1298	-22 / +23	Apr 2026
62	Xiaomi	MiMo-V2.5-Pro (Non-reasoning)	1296	-22 / +25	Apr 2026
63	OpenAI	GPT-5 (high)	1294	-21 / +22	Aug 2025
64	OpenAI	GPT-5.2 Codex (xhigh)	1289	-27 / +27	Dec 2025
65	Kimi	Kimi K2.5 (Reasoning)	1285	-23 / +23	Jan 2026
66	Kimi	Kimi K2.5 (Non-reasoning)	1265	-24 / +22	Jan 2026
67	Tencent	Hy3-preview (Reasoning)	1237	-25 / +23	Apr 2026
68	Tencent	Hy3-preview (Non-reasoning)	1227	-25 / +25	Apr 2026
69	OpenAI	GPT-5.1 (high)	1227	-21 / +22	Nov 2025
70	Alibaba	Qwen3.6 35B A3B (Non-reasoning)	1223	-24 / +24	Apr 2026
71	OpenAI	GPT-5.2 (Non-reasoning)	1221	-22 / +24	Dec 2025
72	Alibaba	Qwen3.5 397B A17B (Non-reasoning)	1220	-22 / +23	Feb 2026
73	OpenAI	GPT-5 Codex (high)	1215	-24 / +22	Sep 2025
74	Google	Gemini 3 Flash Preview (Reasoning)	1204	-24 / +24	Dec 2025
75	OpenAI	GPT-5.4 nano (medium)	1200	-22 / +23	Mar 2026
76	DeepSeek	DeepSeek V3.2 (Reasoning)	1197	-24 / +22	Dec 2025
77	OpenAI	GPT-5.4 nano (xhigh)	1194	-24 / +25	Mar 2026
78	OpenAI	GPT-5.1 Codex (high)	1192	-25 / +25	Nov 2025
79	Alibaba	Qwen3.5 397B A17B (Reasoning)	1190	-23 / +22	Feb 2026
80	Z AI	GLM-4.7 (Reasoning)	1186	-22 / +23	Dec 2025
81	Google	Gemini 3 Pro Preview (high)	1185	-23 / +23	Nov 2025
82	Alibaba	Qwen3.5 Omni Plus	1185	-23 / +24	Mar 2026
83	OpenAI	GPT-5 mini (high)	1185	-22 / +22	Aug 2025
84	Z AI	GLM-4.7 (Non-reasoning)	1177	-22 / +23	Dec 2025
85	MiniMax	MiniMax-M2.5	1176	-24 / +22	Feb 2026
86	Anthropic	Claude 4.5 Haiku (Reasoning)	1171	-24 / +25	Oct 2025
87	xAI	Grok 4.20 0309 v2 (Reasoning)	1169	-22 / +24	Apr 2026
88	Mistral	Mistral Medium 3.5	1168	-25 / +24	Apr 2026
89	Google	Gemini 3 Pro Preview (low)	1168	-25 / +28	Nov 2025
90	Alibaba	Qwen3.5 27B (Non-reasoning)	1162	-22 / +22	Feb 2026
91	Alibaba	Qwen3.5 27B (Reasoning)	1160	-22 / +23	Feb 2026
92		ChatGPT Plus - 5.1 Thinking (Extended Thinking)	1149	-41 / +45	-
93	OpenAI	GPT-5 (low)	1148	-23 / +23	Aug 2025
94	OpenAI	GPT-5.5 Instant (May 2026)	1143	-23 / +23	May 2026
95	Alibaba	Qwen3 Max Thinking	1138	-23 / +24	Jan 2026
96	Anthropic	Claude 4.5 Haiku (Non-reasoning)	1136	-25 / +26	Oct 2025
97	Anthropic	Claude 4 Sonnet (Reasoning)	1134	-24 / +26	May 2025
98	Anthropic	Claude 4 Sonnet (Non-reasoning)	1126	-24 / +24	May 2025
99	xAI	Grok 4.3 (low)	1126	-24 / +22	Apr 2026
100	InclusionAI	Ring-2.6-1T	1125	-23 / +26	May 2026
101	KwaiKAT	KAT Coder Pro V2	1120	-24 / +23	Mar 2026
102	Google	Gemini 3 Flash Preview (Non-reasoning)	1116	-26 / +26	Dec 2025
103	Alibaba	Qwen3.5 122B A10B (Reasoning)	1116	-23 / +22	Feb 2026
104	Google	Gemma 4 31B (Reasoning)	1113	-22 / +23	Apr 2026
105	Alibaba	Qwen3.5 122B A10B (Non-reasoning)	1110	-21 / +24	Feb 2026
106	MiniMax	MiniMax-M2.1	1090	-25 / +23	Dec 2025
107	Xiaomi	MiMo-V2-Flash (Reasoning)	1081	-23 / +24	Dec 2025
108	DeepSeek	DeepSeek V3.1 (Non-reasoning)	1080	-25 / +23	Aug 2025
109	China Mobile	JT-35B-Flash	1078	-23 / +22	May 2026
110	Google	Gemini 2.5 Flash Preview (Sep '25) (Reasoning)	1072	-23 / +25	Sep 2025
111	StepFun	Step 3.5 Flash 2603	1070	-23 / +25	Apr 2026
112	DeepSeek	DeepSeek V3.2 Exp (Non-reasoning)	1069	-25 / +24	Sep 2025
113	Xiaomi	MiMo-V2-Flash (Non-reasoning)	1061	-23 / +24	Dec 2025
114	StepFun	Step 3.5 Flash	1054	-26 / +27	Feb 2026
115	OpenAI	GPT-5.1 Codex mini (high)	1053	-27 / +23	Nov 2025
116	Anthropic	Claude 3.7 Sonnet (Reasoning)	1049	-21 / +24	Feb 2025
117	Alibaba	Qwen3.5 35B A3B (Non-reasoning)	1048	-21 / +23	Feb 2026
118	Anthropic	Claude 3.7 Sonnet (Non-reasoning)	1046	-24 / +23	Feb 2025
119	xAI	Grok 4.20 0309 (Reasoning)	1046	-21 / +23	Mar 2026
120	InclusionAI	Ling-2.6-1T	1045	-23 / +23	Apr 2026
121	xAI	Grok 4.1 Fast (Reasoning)	1045	-23 / +24	Nov 2025
122	Xiaomi	MiMo-V2-Flash (Feb 2026)	1045	-23 / +24	Dec 2025
123	xAI	Grok 4.20 0309 v2 (Non-reasoning)	1039	-28 / +27	Apr 2026
124	Alibaba	Qwen3 Max	1038	-22 / +23	Sep 2025
125	MiniMax	MiniMax-M2	1032	-26 / +25	Oct 2025
126		Perplexity Pro - Labs	1032	-41 / +39	-
127	Z AI	GLM-4.6 (Reasoning)	1030	-29 / +29	Sep 2025
128	Google	Gemma 4 26B A4B (Reasoning)	1014	-23 / +23	Apr 2026
129	xAI	Grok 4 Fast (Reasoning)	1014	-22 / +24	Sep 2025
130	OpenAI	o4-mini (high)	1006	-25 / +23	Apr 2025
131	Google	Gemma 4 31B (Non-reasoning)	1005	-23 / +22	Apr 2026
132	OpenAI	GPT-5.4 mini (Non-Reasoning)	1004	-23 / +23	Mar 2026
133	NVIDIA	NVIDIA Nemotron 3 Super 120B A12B (Reasoning)	1003	-22 / +21	Mar 2026
134	DeepSeek	DeepSeek V3.1 Terminus (Reasoning)	1003	-26 / +28	Sep 2025
135	OpenAI	GPT-5 mini (medium)	1002	-27 / +26	Aug 2025
136	DeepSeek	DeepSeek V3.2 Exp (Reasoning)	1001	-24 / +24	Sep 2025
137	OpenAI	GPT-5.1 (Non-reasoning)	1000	-0 / +0	Nov 2025
138	OpenAI	GPT-5 (medium)	997	-25 / +26	Aug 2025
139	MiniMax	MiniMax M1 80k	996	-24 / +24	Jun 2025
140	Kimi	Kimi K2 Thinking	994	-23 / +24	Nov 2025
141	xAI	Grok 4	990	-23 / +23	Jul 2025
142	ByteDance Seed	Doubao Seed Code	986	-26 / +26	Nov 2025
143	Z AI	GLM-4.6 (Non-reasoning)	984	-26 / +26	Sep 2025
144	DeepSeek	DeepSeek V3.1 Terminus (Non-reasoning)	976	-22 / +25	Sep 2025
145	Amazon	Nova 2.0 Pro Preview (medium)	973	-27 / +23	Nov 2025
146		Google AI Pro - Thinking with 3 Pro	972	-43 / +43	-
147	Inception	Mercury 2	958	-23 / +21	Feb 2026
148	Google	Gemma 4 26B A4B (Non-reasoning)	948	-25 / +23	Apr 2026
149	Alibaba	Qwen3 Max Thinking (Preview)	947	-25 / +25	Nov 2025
150	OpenAI	gpt-oss-120b (high)	947	-28 / +27	Aug 2025
151	OpenAI	GPT-5.4 nano (Non-Reasoning)	941	-30 / +32	Mar 2026
152	Google	Gemini 3.1 Flash-Lite	926	-23 / +22	Mar 2026
153	Cohere	Command A+	919	-25 / +24	May 2026
154	Google	Gemini 2.5 Pro	917	-24 / +25	Jun 2025
155	Alibaba	Qwen3 Coder Next	914	-25 / +26	Feb 2026
156	xAI	Grok 4.20 0309 (Non-reasoning)	909	-24 / +25	Mar 2026
157	Alibaba	Qwen3.5 35B A3B (Reasoning)	907	-23 / +23	Feb 2026
158	Alibaba	Qwen3.5 Omni Flash	897	-24 / +23	Mar 2026
159		SuperGrok - Grok 4	882	-46 / +40	-
160	DeepSeek	DeepSeek V3.2 (Non-reasoning)	877	-29 / +26	Dec 2025
161	Arcee AI	Trinity Large Thinking	865	-23 / +23	Apr 2026
162	Kimi	Kimi K2 0905	864	-29 / +28	Sep 2025
163	Mistral	Mistral Large 3	863	-25 / +23	Dec 2025
164	Mistral	Mistral Small 4 (Reasoning)	861	-23 / +23	Mar 2026
165	Mistral	Devstral 2	856	-25 / +25	Dec 2025
166	Google	Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning)	853	-28 / +26	Sep 2025
167	Amazon	Nova 2.0 Lite (high)	853	-23 / +23	Oct 2025
168	Mistral	Mistral Small 4 (Non-reasoning)	845	-23 / +23	Mar 2026
169	Alibaba	Qwen3.5 9B (Non-reasoning)	844	-22 / +22	Mar 2026
170	LongCat	LongCat Flash Lite	838	-27 / +25	Jan 2026
171	Z AI	GLM-4.7-Flash (Reasoning)	837	-27 / +26	Jan 2026
172	China Mobile	JT-MINI	831	-24 / +24	Apr 2026
173	Mistral	Devstral Small (May '25)	829	-27 / +26	May 2025
174	OpenAI	gpt-oss-120b (low)	829	-24 / +23	Aug 2025
175	LG AI Research	K-EXAONE (Reasoning)	826	-26 / +26	Dec 2025
176	Mistral	Devstral Small 2	820	-24 / +26	Dec 2025
177	Alibaba	Qwen3 235B A22B 2507 (Reasoning)	820	-24 / +22	Jul 2025
178	KwaiKAT	KAT-Coder-Pro V1	818	-26 / +24	Nov 2025
179	Alibaba	Qwen3 Max (Preview)	817	-24 / +22	Sep 2025
180	LG AI Research	EXAONE 4.5 33B	814	-25 / +26	Apr 2026
181	Z AI	GLM-4.7-Flash (Non-reasoning)	802	-41 / +35	Jan 2026
182	Baidu	ERNIE 5.0 Thinking Preview	789	-28 / +24	Nov 2025
183	xAI	Grok 4.1 Fast (Non-reasoning)	784	-28 / +26	Nov 2025
184	Amazon	Nova 2.0 Omni (medium)	784	-26 / +25	Nov 2025
185	InclusionAI	Ling 2.6 Flash	782	-23 / +22	Apr 2026
186	Mistral	Mistral Medium 3.1	781	-25 / +27	Aug 2025
187	xAI	Grok 4 Fast (Non-reasoning)	779	-25 / +26	Sep 2025
188	Alibaba	Qwen3 235B A22B 2507 Instruct	778	-27 / +27	Jul 2025
189	OpenAI	GPT-4.1	777	-27 / +27	Apr 2025
190	Alibaba	Qwen3 VL 4B (Reasoning)	776	-39 / +40	Oct 2025
191	NVIDIA	Nemotron 3 Nano Omni 30B A3B Reasoning	764	-26 / +26	Apr 2026
192	xAI	Grok Code Fast 1	763	-26 / +26	Aug 2025
193	LG AI Research	K-EXAONE (Non-reasoning)	763	-26 / +23	Dec 2025
194	ByteDance Seed	Seed-OSS-36B-Instruct	759	-24 / +25	Aug 2025
195	Alibaba	Qwen3 235B A22B (Reasoning)	756	-29 / +27	Apr 2025
196	NVIDIA	Nemotron Cascade 2 30B A3B	756	-24 / +25	Mar 2026
197	OpenAI	GPT-5 nano (high)	755	-25 / +24	Aug 2025
198	OpenAI	o3	754	-29 / +30	Apr 2025
199	Prime Intellect	INTELLECT-3	750	-26 / +24	Nov 2025
200	OpenAI	o3-mini (high)	747	-27 / +28	Jan 2025
201	Google	Gemini 2.5 Flash (Non-reasoning)	741	-28 / +26	May 2025
202	Alibaba	Qwen3 235B A22B (Non-reasoning)	739	-27 / +26	Apr 2025
203	Sarvam	Sarvam 105B (high)	738	-23 / +23	Mar 2026
204	OpenAI	o1	736	-28 / +27	Dec 2024
205	Alibaba	Qwen3 Next 80B A3B (Reasoning)	726	-24 / +26	Sep 2025
206	Alibaba	Qwen3.5 9B (Reasoning)	715	-23 / +22	Mar 2026
207	Alibaba	Qwen3 VL 235B A22B (Reasoning)	713	-24 / +25	Sep 2025
208	Alibaba	Qwen3 Coder 30B A3B Instruct	710	-24 / +25	Jul 2025
209	Anthropic	Claude 3.5 Haiku	708	-26 / +24	Oct 2024
210	Google	Gemini 2.5 Flash (Reasoning)	699	-31 / +28	May 2025
211	Z AI	GLM-4.6V (Non-reasoning)	692	-28 / +27	Dec 2025
212	Mistral	Devstral Medium	690	-27 / +26	Jul 2025
213	InclusionAI	Ring-1T	687	-27 / +28	Oct 2025
214	Alibaba	Qwen3 VL 8B Instruct	683	-37 / +38	Oct 2025
215	DeepSeek	DeepSeek R1 0528 (May '25)	681	-28 / +26	May 2025
216	Naver	HyperCLOVA X SEED Think (32B)	681	-26 / +27	Dec 2025
217	Upstage	Solar Pro 3	675	-23 / +23	Apr 2026
218	Mistral	Magistral Small 1.2	670	-26 / +24	Sep 2025
219	Alibaba	Qwen3.5 4B (Non-reasoning)	669	-24 / +22	Mar 2026
220	Alibaba	Qwen3 VL 8B (Reasoning)	669	-29 / +30	Oct 2025
221	Alibaba	Qwen3 VL 30B A3B (Reasoning)	667	-38 / +37	Oct 2025
222	xAI	Grok 3	667	-26 / +28	Feb 2025
223	Mistral	Magistral Medium 1	666	-27 / +26	Jun 2025
224	Upstage	Solar Open 100B (Reasoning)	665	-28 / +28	Dec 2025
225	Alibaba	Qwen3 30B A3B 2507 (Reasoning)	662	-25 / +25	Jul 2025
226	Amazon	Nova 2.0 Pro Preview (low)	660	-28 / +25	Nov 2025
227	Mistral	Ministral 3 14B	656	-25 / +26	Dec 2025
228	OpenAI	gpt-oss-20B (high)	651	-26 / +24	Aug 2025
229	Alibaba	Qwen3 VL 32B (Reasoning)	647	-26 / +26	Oct 2025
230	Amazon	Nova 2.0 Lite (medium)	644	-26 / +28	Oct 2025
231	Korea Telecom	Mi:dm K 2.5 Pro	642	-28 / +26	Dec 2025
232	Mistral	Ministral 3 8B	639	-27 / +27	Dec 2025
233	Alibaba	Qwen3 VL 235B A22B Instruct	636	-38 / +37	Sep 2025
234	Mistral	Magistral Medium 1.2	628	-30 / +28	Sep 2025
235	Alibaba	Qwen3 Next 80B A3B Instruct	627	-28 / +29	Sep 2025
236	OpenAI	GPT-4.1 mini	621	-27 / +25	Apr 2025
237	DeepSeek	DeepSeek V3.1 (Reasoning)	613	-28 / +28	Aug 2025
238	Z AI	GLM-4.6V (Reasoning)	609	-29 / +28	Dec 2025
239	MBZUAI Institute of Foundation Models	K2 Think V2	608	-27 / +24	Dec 2025
240	OpenAI	GPT-5 nano (medium)	594	-26 / +28	Aug 2025
241	Alibaba	Qwen3 4B 2507 (Reasoning)	590	-27 / +27	Aug 2025
242	Mistral	Mistral Medium 3	586	-27 / +28	May 2025
243	Nous Research	Hermes 4 - Llama-3.1 405B (Reasoning)	586	-24 / +24	Aug 2025
244	MBZUAI Institute of Foundation Models	K2-V2 (medium)	582	-27 / +26	Dec 2025
245	ServiceNow	Apriel-v1.6-15B-Thinker	574	-27 / +27	Nov 2025
246	Google	Gemini 2.0 Flash (Feb '25)	568	-26 / +26	Feb 2025
247	Mistral	Devstral Small (Jul '25)	565	-28 / +27	Jul 2025
248	NVIDIA	NVIDIA Nemotron 3 Nano 30B A3B (Reasoning)	565	-26 / +25	Dec 2025
249	MBZUAI Institute of Foundation Models	K2-V2 (high)	561	-27 / +26	Dec 2025
250	Z AI	GLM-4.5-Air	560	-29 / +29	Jul 2025
251	OpenAI	gpt-oss-20B (low)	550	-29 / +25	Aug 2025
252	IBM	Granite 4.1 8B	543	-25 / +24	Apr 2026
253	Nous Research	Hermes 4 - Llama-3.1 70B (Reasoning)	538	-24 / +22	Aug 2025
254	Kimi	Kimi K2	527	-31 / +34	Jul 2025
255	Nous Research	Hermes 4 - Llama-3.1 70B (Non-reasoning)	523	-26 / +25	Aug 2025
256	Alibaba	Qwen3 30B A3B 2507 Instruct	517	-28 / +28	Jul 2025
257	Z AI	GLM-4.5V (Reasoning)	511	-23 / +22	Aug 2025
258	Alibaba	Qwen3.5 4B (Reasoning)	510	-28 / +28	Mar 2026
259	Nous Research	Hermes 4 - Llama-3.1 405B (Non-reasoning)	510	-24 / +23	Aug 2025
260	Amazon	Nova 2.0 Lite (low)	507	-30 / +25	Oct 2025
261	Alibaba	Qwen3 Coder 480B A35B Instruct	507	-31 / +28	Jul 2025
262	Amazon	Nova Premier	506	-28 / +30	Apr 2025
263	Alibaba	Qwen3 30B A3B (Reasoning)	505	-27 / +27	Apr 2025
264	LG AI Research	EXAONE 4.0 32B (Reasoning)	502	-27 / +26	Jul 2025
265	Allen Institute for AI	Molmo2-8B	500	-0 / +0	Dec 2025
266	DeepSeek	DeepSeek V3.2 Speciale	500	-0 / +0	Dec 2025
267	Alibaba	Qwen3 Omni 30B A3B (Reasoning)	497	-26 / +24	Sep 2025
268	Alibaba	Qwen3 8B (Reasoning)	497	-26 / +27	Apr 2025
269	IBM	Granite 4.1 30B	496	-25 / +23	Apr 2026
270	Alibaba	Qwen3 VL 30B A3B Instruct	496	-29 / +25	Oct 2025
271	Alibaba	Qwen3 32B (Reasoning)	491	-27 / +26	Apr 2025
272	Motif Technologies	Motif-2-12.7B-Reasoning	484	-31 / +29	Dec 2025
273	Mistral	Ministral 3 3B	484	-30 / +28	Dec 2025
274	NVIDIA	NVIDIA Nemotron 3 Nano 4B	479	-29 / +26	Mar 2026
275	Alibaba	Qwen3 14B (Reasoning)	478	-27 / +26	Apr 2025
276	Alibaba	Qwen3 14B (Non-reasoning)	473	-27 / +25	Apr 2025
277	Alibaba	Qwen3 8B (Non-reasoning)	472	-26 / +24	Apr 2025
278	OpenAI	GPT-5 mini (minimal)	472	-29 / +28	Aug 2025
279	Z AI	GLM-4.5 (Reasoning)	468	-34 / +30	Jul 2025
280	Z AI	GLM-4.5V (Non-reasoning)	460	-28 / +27	Aug 2025
281	Upstage	Solar Pro 2 (Reasoning)	451	-26 / +26	Jul 2025
282	Upstage	Solar Pro 2 (Non-reasoning)	446	-28 / +27	Jul 2025
283	NVIDIA	NVIDIA Nemotron Nano 9B V2 (Reasoning)	439	-28 / +27	Aug 2025
284	Google	Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning)	437	-32 / +30	Sep 2025
285	Meta	Llama 4 Maverick	437	-28 / +26	Apr 2025
286	InclusionAI	Ling-flash-2.0	420	-30 / +27	Sep 2025
287	xAI	Grok 3 mini Reasoning (high)	420	-38 / +37	Feb 2025
288	DeepSeek	DeepSeek V3 (Dec '24)	410	-30 / +27	Dec 2024
289	DeepSeek	DeepSeek V3 0324	406	-32 / +29	Mar 2025
290	Meta	Llama 3.3 Instruct 70B	401	-29 / +28	Dec 2024
291	InclusionAI	Ling-1T	401	-28 / +27	Oct 2025
292	Amazon	Nova Pro	388	-28 / +28	Dec 2024
293	OpenAI	GPT-5 (minimal)	386	-31 / +27	Aug 2025
294	Google	Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning)	382	-29 / +29	Sep 2025
295	Amazon	Nova 2.0 Lite (Non-reasoning)	381	-29 / +29	Oct 2025
296	NVIDIA	Llama Nemotron Super 49B v1.5 (Non-reasoning)	380	-27 / +26	Jul 2025
297	Anthropic	Claude 3 Haiku	379	-27 / +25	Mar 2024
298	OpenAI	GPT-4o (Aug '24)	378	-29 / +27	Aug 2024
299	Trillion Labs	Tri-21B-Think	373	-26 / +23	Feb 2026
300	TII UAE	Falcon-H1R-7B	373	-32 / +29	Jan 2026
301	NVIDIA	Llama Nemotron Super 49B v1.5 (Reasoning)	368	-31 / +28	Jul 2025
302	MBZUAI Institute of Foundation Models	K2-V2 (low)	366	-29 / +28	Dec 2025
303	IBM	Granite 4.1 3B	366	-24 / +24	Apr 2026
304	Amazon	Nova 2.0 Omni (low)	360	-32 / +30	Nov 2025
305	Sarvam	Sarvam 30B (high)	359	-26 / +23	Mar 2026
306	Allen Institute for AI	Olmo 3.1 32B Instruct	357	-31 / +29	Jan 2026
307	Nanbeige	Nanbeige4.1-3B	357	-30 / +30	Feb 2026
308	OpenAI	GPT-4o (Nov '24)	349	-25 / +23	Nov 2024
309	NVIDIA	NVIDIA Nemotron 3 Nano 30B A3B (Non-reasoning)	348	-29 / +28	Dec 2025
310	IBM	Granite 4.0 H Small	344	-29 / +28	Sep 2025
311	Alibaba	Qwen3 VL 4B Instruct	344	-27 / +26	Oct 2025
312	Amazon	Nova Lite	344	-30 / +30	Dec 2024
313	Amazon	Nova Micro	340	-29 / +28	Dec 2024
314	Mistral	Mistral Small 3.1	337	-30 / +28	Mar 2025
315	Trillion Labs	Tri-21B-think Preview	337	-33 / +30	Feb 2026
316	NVIDIA	Llama 3.1 Nemotron Instruct 70B	337	-30 / +26	Oct 2024
317	Alibaba	Qwen3 30B A3B (Non-reasoning)	331	-29 / +27	Apr 2025
318	LG AI Research	EXAONE 4.0 32B (Non-reasoning)	329	-33 / +28	Jul 2025
319	NVIDIA	NVIDIA Nemotron Nano 12B v2 VL (Reasoning)	328	-28 / +29	Oct 2025
320	Mistral	Mistral Large 2 (Nov '24)	324	-33 / +30	Nov 2024
321	Alibaba	Qwen3.5 2B (Reasoning)	321	-24 / +22	Mar 2026
322	Google	Gemini 2.5 Flash-Lite (Reasoning)	320	-29 / +30	Jun 2025
323	Amazon	Nova 2.0 Pro Preview (Non-reasoning)	319	-30 / +26	Nov 2025
324	OpenAI	GPT-4.1 nano	318	-29 / +27	Apr 2025
325	Alibaba	Qwen3 0.6B (Reasoning)	315	-29 / +29	Apr 2025
326	Alibaba	Qwen3 4B 2507 Instruct	306	-32 / +29	Aug 2025
327	Amazon	Nova 2.0 Omni (Non-reasoning)	305	-31 / +27	Nov 2025
328	Google	Gemini 2.5 Flash-Lite (Non-reasoning)	305	-31 / +29	Jun 2025
329	Mistral	Mistral Small 3.2	304	-30 / +31	Jun 2025
330	NVIDIA	NVIDIA Nemotron Nano 9B V2 (Non-reasoning)	303	-31 / +30	Aug 2025
331	Google	Gemma 4 E4B (Reasoning)	303	-25 / +25	Apr 2026
332	Alibaba	Qwen3 VL 32B Instruct	301	-31 / +30	Oct 2025
333	Alibaba	Qwen3 Omni 30B A3B Instruct	296	-32 / +33	Sep 2025
334	LG AI Research	Exaone 4.0 1.2B (Non-reasoning)	295	-27 / +29	Jul 2025
335	LG AI Research	Exaone 4.0 1.2B (Reasoning)	294	-28 / +28	Jul 2025
336	Google	Gemma 4 E4B (Non-reasoning)	293	-28 / +27	Apr 2026
337	IBM	Granite 4.0 H 350M	292	-28 / +30	Oct 2025
338	Google	Gemma 3 27B Instruct	285	-30 / +29	Mar 2025
339	NVIDIA	NVIDIA Nemotron Nano 12B v2 VL (Non-reasoning)	285	-28 / +28	Oct 2025
340	AI21 Labs	Jamba 1.7 Large	284	-34 / +28	Jul 2025
341	Meta	Llama 3.1 Instruct 70B	284	-30 / +27	Jul 2024
342	OpenAI	GPT-5 nano (minimal)	282	-30 / +28	Aug 2025
343	Google	Gemma 3 12B Instruct	281	-30 / +29	Mar 2025
344	Alibaba	Qwen3 0.6B (Non-reasoning)	279	-35 / +31	Apr 2025
345	Meta	Llama 3.1 Instruct 8B	278	-31 / +31	Jul 2024
346	IBM	Granite 4.0 Micro	278	-32 / +27	Sep 2025
347	Allen Institute for AI	Olmo 3 7B Instruct	277	-30 / +27	Nov 2025
348	Alibaba	Qwen3.5 0.8B (Reasoning)	277	-24 / +24	Mar 2026
349	AI21 Labs	Jamba 1.7 Mini	276	-29 / +30	Jul 2025
350	Alibaba	Qwen3 1.7B (Reasoning)	275	-29 / +28	Apr 2025
351	Cohere	Command A	275	-31 / +29	Mar 2025
352	Meta	Llama 4 Scout	272	-30 / +29	Apr 2025
353	Liquid AI	LFM2 1.2B	271	-31 / +29	Jul 2025
354	Google	Gemma 4 E2B (Reasoning)	270	-26 / +23	Apr 2026
355	IBM	Granite 4.0 H 1B	268	-29 / +29	Oct 2025
356	IBM	Granite 4.0 350M	268	-31 / +30	Oct 2025
357	Liquid AI	LFM2.5-1.2B-Instruct	266	-33 / +31	Jan 2026
358	InclusionAI	Ling-mini-2.0	262	-25 / +23	Sep 2025
359	Liquid AI	LFM2 8B A1B	259	-32 / +30	Oct 2025
360	StepFun	Step3 VL 10B	259	-30 / +28	Jan 2026
361	IBM	Granite 4.0 1B	259	-32 / +32	Oct 2025
362	OpenBMB	MiniCPM-V 4.6 1.3B	259	-33 / +29	May 2026
363	Google	Gemma 3 4B Instruct	256	-31 / +29	Mar 2025
364	Meta	Llama 3.1 Instruct 405B	256	-33 / +30	Jul 2024
365	AI21 Labs	Jamba Reasoning 3B	255	-29 / +30	Oct 2025
366	Alibaba	Qwen3 1.7B (Non-reasoning)	254	-30 / +27	Apr 2025
367	Google	Gemma 4 E2B (Non-reasoning)	253	-26 / +26	Apr 2026
368	Liquid AI	LFM2.5-1.2B-Thinking	253	-31 / +28	Jan 2026
369	DeepSeek	DeepSeek R1 (Jan '25)	249	-29 / +30	Jan 2025
370	Google	Gemma 3n E4B Instruct	244	-31 / +31	Jun 2025
371	NVIDIA	Llama 3.1 Nemotron Ultra 253B v1 (Reasoning)	239	-24 / +24	Apr 2025
372	Alibaba	Qwen3.5 2B (Non-reasoning)	239	-26 / +21	Mar 2026
373	Liquid AI	LFM2 24B A2B	236	-26 / +24	Feb 2026
374	Liquid AI	LFM2 2.6B	236	-29 / +28	Sep 2025
375	Alibaba	Qwen3.5 0.8B (Non-reasoning)	234	-25 / +26	Mar 2026
376	Liquid AI	LFM2.5-VL-1.6B	233	-36 / +30	Jan 2026
377	OpenBMB	MiniCPM5-1B (Non-reasoning)	232	-28 / +28	May 2026
378	Microsoft	Phi-4 Mini Instruct	229	-28 / +29	Feb 2024
379	IBM	Granite 3.3 8B (Non-reasoning)	225	-30 / +26	Apr 2025
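Adjacent entries in the table often sit within each other's "- / +" ranges, so a small gap in rank does not necessarily mean a meaningful difference. The sketch below parses rows in the tab-separated layout used above and checks whether two entries' ranges overlap. Reading the "- / +" column as an uncertainty interval around the Elo score is an assumption, and `Row`, `parse_row`, and `overlaps` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass
class Row:
    rank: int
    creator: str
    model: str
    elo: int
    minus: int
    plus: int
    released: str

    @property
    def low(self) -> int:
        return self.elo - self.minus

    @property
    def high(self) -> int:
        return self.elo + self.plus


def parse_row(line: str) -> Row:
    """Parse one tab-separated leaderboard row (creator may be empty)."""
    rank, creator, model, elo, interval, released = line.rstrip("\n").split("\t")
    minus_text, plus_text = (part.strip() for part in interval.split("/"))
    return Row(int(rank), creator, model, int(elo),
               abs(int(minus_text)), int(plus_text), released)


def overlaps(a: Row, b: Row) -> bool:
    """True when the two rows' [low, high] ranges intersect."""
    return a.low <= b.high and b.low <= a.high


# Three rows copied from the table above.
lines = [
    "1\tAnthropic\tClaude Opus 4.8 (Adaptive Reasoning, Max Effort)\t1890\t-34 / +35\tMay 2026",
    "2\tOpenAI\tGPT-5.5 (xhigh)\t1769\t-32 / +31\tApr 2026",
    "3\tOpenAI\tGPT-5.5 (high)\t1753\t-28 / +31\tApr 2026",
]
rows = [parse_row(line) for line in lines]

print(overlaps(rows[0], rows[1]))  # ranks 1 and 2
print(overlaps(rows[1], rows[2]))  # ranks 2 and 3
```

On these three rows, the top entry's range is separated from the second, while the second and third overlap.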

Frequently Asked Questions

What is GDPval-AA?

GDPval-AA is Artificial Analysis' evaluation based on OpenAI's GDPval dataset, which tests AI models on real-world, economically valuable tasks across 44 occupations and 9 major industries.

How does GDPval-AA decide which model did better?

GDPval-AA compares model submissions head-to-head on the same task. For each matchup, the two outputs are anonymized and an LLM judge picks a winner. These blind pairwise results are aggregated into an Elo rating per model.

Which AI model has the highest GDPval-AA score?

Claude Opus 4.8 (Adaptive Reasoning, Max Effort) has the highest GDPval-AA Elo rating, 1,890, among models with published GDPval-AA results.

What kinds of tasks are included in GDPval-AA?

GDPval-AA covers real-world professional tasks across a range of occupations and industries, producing outputs such as documents, spreadsheets, slides, and diagrams. Generating these deliverables generally requires interacting with a sandbox filesystem through shell access and using web search, capabilities the model is given through the Stirrup agentic harness.

How is GDPval-AA different from standard AI benchmarks?

Most benchmarks test short-answer or multiple-choice responses. GDPval-AA instead evaluates complete deliverables: models operate in an agentic environment with tools, produce file outputs, and have their submissions scored through pairwise grading on relative quality.
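As a rough illustration of how blind pairwise verdicts can be turned into per-model ratings, the sketch below applies sequential Elo updates to a list of (winner, loser) results. This is a simplified stand-in, not a description of Artificial Analysis' actual fitting procedure: the K-factor, the 400-point scale, the starting rating of 1000, the repeated passes over the same matches, and the model names are all assumptions made for the example.

```python
from collections import defaultdict


def expected(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Expected probability that a model rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))


def elo_from_matches(matches, k: float = 16.0, start: float = 1000.0,
                     passes: int = 20) -> dict:
    """Aggregate (winner, loser) pairs into Elo-style ratings.

    Repeating passes over the same matches is a heuristic used here to
    reduce sensitivity to match order; it is not a maximum-likelihood fit.
    """
    ratings = defaultdict(lambda: start)
    for _ in range(passes):
        for winner, loser in matches:
            p_win = expected(ratings[winner], ratings[loser])
            delta = k * (1.0 - p_win)  # larger gain for an unexpected win
            ratings[winner] += delta
            ratings[loser] -= delta
    return dict(ratings)


# Hypothetical judge verdicts: each tuple is (winner, loser) for one task.
matches = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
    ("model_c", "model_b"),
]

ratings = elo_from_matches(matches)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```

Because every update moves the same number of points from the loser to the winner, the ratings stay centred on the starting value; only the differences between models carry information.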

Explore Evaluations

Artificial Analysis Intelligence Index

A composite benchmark aggregating ten challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.

APEX-Agents-AA Benchmark Leaderboard

Artificial Analysis' implementation of the APEX-Agents benchmark, testing AI agents on long-horizon, cross-application tasks in professional-services environments with realistic application tooling.

𝜏²-Bench Telecom Benchmark Leaderboard

A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.

Terminal-Bench Hard Benchmark Leaderboard

An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks.

SciCode Benchmark Leaderboard

A scientist-curated coding benchmark featuring 288 test set subproblems from 80 laboratory problems across 16 scientific disciplines.

Artificial Analysis Long Context Reasoning Benchmark Leaderboard

A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).

AA-Omniscience: Knowledge and Hallucination Benchmark

A benchmark measuring factual recall and hallucination across various economically relevant domains.

IFBench Benchmark Leaderboard

A benchmark evaluating precise instruction-following generalization on 58 diverse, verifiable out-of-domain constraints that test models' ability to follow specific output requirements.

Humanity's Last Exam Benchmark Leaderboard

A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities, designed to be the final closed-ended academic evaluation.

GPQA Diamond Benchmark Leaderboard

The most challenging 198 questions from GPQA, where PhD experts achieve 65% accuracy but skilled non-experts only reach 34% despite web access.

CritPt Benchmark Leaderboard

A benchmark designed to test LLMs on research-level physics reasoning tasks, featuring 71 composite research challenges.

ITBench-AA Benchmark Leaderboard

Artificial Analysis' implementation of IBM's ITBench benchmark, testing AI agents on Kubernetes incident root-cause analysis from offline incident snapshots. The agent inspects alerts, events, traces, and topology and identifies the contributing-factor entities (deployments, pods, namespaces, network policies, etc.) responsible for the failure.

Artificial Analysis Openness Index

A composite measure providing an industry standard to communicate model openness for users and developers.

MMLU-Pro Benchmark Leaderboard

An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.

Global-MMLU-Lite Benchmark Leaderboard

A lightweight, multilingual version of MMLU, designed to evaluate knowledge and reasoning skills across a diverse range of languages and cultural contexts.

LiveCodeBench Benchmark Leaderboard

A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces, evaluating code generation, self-repair, and execution.

MATH-500 Benchmark Leaderboard

A 500-problem subset from the MATH dataset, featuring competition-level mathematics across six domains including algebra, geometry, and number theory.

AIME 2025 Benchmark Leaderboard

All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000-999.

MMMU-Pro Benchmark Leaderboard

An enhanced MMMU benchmark that eliminates shortcuts and guessing strategies to more rigorously test multimodal models across 30 academic disciplines.
