GDPval-AA Leaderboard
Publication: GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks (arXiv)
[Charts: GDPval-AA: AI Chatbots; GDPval-AA: Elo vs. Artificial Analysis Intelligence Index; GDPval-AA: Token Usage; GDPval-AA: Cost Breakdown; GDPval-AA: Average Turns per Task; GDPval-AA: Elo vs. Release Date]
GDPval-AA Leaderboard
| Rank | Model | GDPval-AA Elo | Interval (- / +) | Release Date | |
|---|---|---|---|---|---|
| 1 | GPT-5.5 (xhigh) | 1769 | -32 / +31 | Apr 2026 | |
| 2 | GPT-5.5 (high) | 1753 | -28 / +32 | Apr 2026 | |
| 3 | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | 1753 | -41 / +40 | Apr 2026 | |
| 4 | Claude Opus 4.7 (Non-reasoning, High Effort) | 1683 | -26 / +28 | Apr 2026 | |
| 5 | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | 1676 | -26 / +29 | Feb 2026 | |
| 6 | GPT-5.4 (xhigh) | 1674 | -34 / +32 | Mar 2026 | |
| 7 | Gemini 3.5 Flash (high) | 1656 | -26 / +30 | May 2026 | |
| 8 | GPT-5.5 (medium) | 1654 | -26 / +27 | Apr 2026 | |
| 9 | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) | 1619 | -31 / +33 | Feb 2026 | |
| 10 | Claude Sonnet 4.6 (Non-reasoning, High Effort) | 1595 | -26 / +26 | Feb 2026 | |
| 11 | Claude Opus 4.6 (Non-reasoning, High Effort) | 1590 | -25 / +27 | Feb 2026 | |
| 12 | MiMo-V2.5-Pro | 1571 | -27 / +28 | Apr 2026 | |
| 13 | DeepSeek V4 Pro (Reasoning, High Effort) | 1558 | -29 / +31 | Apr 2026 | |
| 14 | DeepSeek V4 Pro (Reasoning, Max Effort) | 1554 | -29 / +29 | Apr 2026 | |
| 15 | MiMo-V2.5 | 1553 | -25 / +28 | Apr 2026 | |
| 16 | Qwen3.7 Max | 1544 | -24 / +25 | May 2026 | |
| 17 | GLM-5.1 (Reasoning) | 1535 | -0 / +0 | Apr 2026 | |
| 18 | MiniMax-M2.7 | 1505 | -24 / +26 | Mar 2026 | |
| 19 | Qwen3.6 Max Preview | 1504 | -20 / +21 | Apr 2026 | |
| 20 | GPT-5.4 (low) | 1503 | -27 / +29 | Mar 2026 | |
| 21 | GLM-5-Turbo | 1497 | -24 / +25 | Mar 2026 | |
| 22 | Grok 4.3 (high) | 1495 | -25 / +23 | Apr 2026 | |
| 23 | GLM-5.1 (Non-reasoning) | 1494 | -26 / +29 | Apr 2026 | |
| 24 | Kimi K2.6 | 1481 | -25 / +26 | Apr 2026 | |
| 25 | GPT-5.3 Codex (xhigh) | 1479 | -25 / +26 | Feb 2026 | |
| 26 | DeepSeek V4 Pro (Non-reasoning) | 1478 | -26 / +26 | Apr 2026 | |
| 27 | GPT-5.2 (xhigh) | 1467 | -25 / +26 | Dec 2025 | |
| 28 | Claude Sonnet 4.6 (Non-reasoning, Low Effort) | 1455 | -25 / +23 | Feb 2026 | |
| 29 | Claude Opus 4.5 (Reasoning) | 1451 | -26 / +27 | Nov 2025 | |
| 30 | Gemini 3.5 Flash (minimal) | 1443 | -25 / +27 | May 2026 | |
| 31 | GPT-5.5 (low) | 1442 | -26 / +23 | Apr 2026 | |
| 32 | GPT-5.4 mini (xhigh) | 1438 | -23 / +26 | Mar 2026 | |
| 33 | Claude Opus 4.5 (Non-reasoning) | 1419 | -22 / +22 | Nov 2025 | |
| 34 | Muse Spark | 1417 | -23 / +23 | Apr 2026 | |
| 35 | DeepSeek V4 Flash (Reasoning, High Effort) | 1414 | -25 / +26 | Apr 2026 | |
| 36 | MiMo-V2-Pro | 1408 | -24 / +22 | Mar 2026 | |
| 37 | GPT-5.2 (medium) | 1404 | -22 / +23 | Dec 2025 | |
| 38 | Qwen3.6 27B (Reasoning) | 1404 | -23 / +25 | Apr 2026 | |
| 39 | GLM-5 (Reasoning) | 1394 | -23 / +23 | Feb 2026 | |
| 40 | DeepSeek V4 Flash (Non-reasoning) | 1393 | -27 / +26 | Apr 2026 | |
| 41 | Qwen3.6 27B (Non-reasoning) | 1391 | -25 / +23 | Apr 2026 | |
| 42 | DeepSeek V4 Flash (Reasoning, Max Effort) | 1388 | -20 / +34 | Apr 2026 | |
| 43 | Qwen3.6 Plus | 1351 | -24 / +24 | Apr 2026 | |
| 44 | MiMo-V2-Omni-0327 | 1346 | -23 / +24 | Mar 2026 | |
| 45 | GPT-5.4 (Non-reasoning) | 1342 | -26 / +26 | Mar 2026 | |
| 46 | GLM 5V Turbo (Reasoning) | 1331 | -23 / +26 | Apr 2026 | |
| 47 | Kimi K2.6 (Non-reasoning) | 1324 | -28 / +28 | Apr 2026 | |
| 48 | Gemini 3 Deep Think | 1324 | -30 / +31 | Feb 2026 | |
| 49 | GLM-5 (Non-reasoning) | 1323 | -21 / +25 | Feb 2026 | |
| 50 | Claude 4.5 Sonnet (Reasoning) | 1322 | -24 / +25 | Sep 2025 | |
| 51 | MiMo-V2-Omni | 1320 | -24 / +22 | Mar 2026 | |
| 52 | Claude Pro - 4.5 Opus (Extended Thinking) | 1319 | -41 / +38 | - | |
| 53 | GPT-5.4 mini (medium) | 1319 | -22 / +22 | Mar 2026 | |
| 54 | GPT-5.5 (Non-reasoning) | 1314 | -24 / +25 | Apr 2026 | |
| 55 | Gemini 3.1 Pro Preview | 1314 | -26 / +27 | Feb 2026 | |
| 56 | Grok 4.3 (medium) | 1312 | -24 / +24 | Apr 2026 | |
| 57 | Claude 4.5 Sonnet (Non-reasoning) | 1311 | -23 / +22 | Sep 2025 | |
| 58 | Grok 4.3 (Non-reasoning) | 1303 | -25 / +26 | Apr 2026 | |
| 59 | Qwen3.6 35B A3B (Reasoning) | 1297 | -24 / +24 | Apr 2026 | |
| 60 | MiMo-V2.5-Pro (Non-reasoning) | 1296 | -24 / +25 | Apr 2026 | |
| 61 | GPT-5 (high) | 1295 | -21 / +22 | Aug 2025 | |
| 62 | GPT-5.2 Codex (xhigh) | 1289 | -27 / +28 | Dec 2025 | |
| 63 | Kimi K2.5 (Reasoning) | 1287 | -24 / +24 | Jan 2026 | |
| 64 | Kimi K2.5 (Non-reasoning) | 1265 | -26 / +24 | Jan 2026 | |
| 65 | Hy3-preview (Reasoning) | 1238 | -24 / +23 | Apr 2026 | |
| 66 | GPT-5.1 (high) | 1228 | -21 / +24 | Nov 2025 | |
| 67 | Hy3-preview (Non-reasoning) | 1225 | -28 / +26 | Apr 2026 | |
| 68 | GPT-5.2 (Non-reasoning) | 1223 | -23 / +22 | Dec 2025 | |
| 69 | Qwen3.5 397B A17B (Non-reasoning) | 1222 | -23 / +23 | Feb 2026 | |
| 70 | Qwen3.6 35B A3B (Non-reasoning) | 1222 | -24 / +24 | Apr 2026 | |
| 71 | GPT-5 Codex (high) | 1214 | -24 / +22 | Sep 2025 | |
| 72 | Gemini 3 Flash Preview (Reasoning) | 1204 | -24 / +24 | Dec 2025 | |
| 73 | GPT-5.4 nano (medium) | 1200 | -22 / +22 | Mar 2026 | |
| 74 | DeepSeek V3.2 (Reasoning) | 1197 | -24 / +22 | Dec 2025 | |
| 75 | GPT-5.1 Codex (high) | 1193 | -26 / +25 | Nov 2025 | |
| 76 | GPT-5.4 nano (xhigh) | 1191 | -25 / +23 | Mar 2026 | |
| 77 | Qwen3.5 397B A17B (Reasoning) | 1190 | -23 / +22 | Feb 2026 | |
| 78 | GLM-4.7 (Reasoning) | 1186 | -22 / +23 | Dec 2025 | |
| 79 | GPT-5 mini (high) | 1185 | -24 / +23 | Aug 2025 | |
| 80 | Gemini 3 Pro Preview (high) | 1185 | -23 / +22 | Nov 2025 | |
| 81 | Qwen3.5 Omni Plus | 1184 | -22 / +24 | Mar 2026 | |
| 82 | MiniMax-M2.5 | 1180 | -24 / +23 | Feb 2026 | |
| 83 | GLM-4.7 (Non-reasoning) | 1177 | -23 / +23 | Dec 2025 | |
| 84 | Grok 4.20 0309 v2 (Reasoning) | 1172 | -22 / +23 | Apr 2026 | |
| 85 | Claude 4.5 Haiku (Reasoning) | 1171 | -24 / +25 | Oct 2025 | |
| 86 | Gemini 3 Pro Preview (low) | 1168 | -27 / +26 | Nov 2025 | |
| 87 | Mistral Medium 3.5 | 1168 | -25 / +24 | Apr 2026 | |
| 88 | Qwen3.5 27B (Non-reasoning) | 1162 | -24 / +21 | Feb 2026 | |
| 89 | Qwen3.5 27B (Reasoning) | 1158 | -23 / +22 | Feb 2026 | |
| 90 | GPT-5 (low) | 1151 | -23 / +23 | Aug 2025 | |
| 91 | ChatGPT Plus - 5.1 Thinking (Extended Thinking) | 1149 | -41 / +45 | - | |
| 92 | Qwen3 Max Thinking | 1140 | -24 / +22 | Jan 2026 | |
| 93 | Claude 4.5 Haiku (Non-reasoning) | 1136 | -27 / +26 | Oct 2025 | |
| 94 | Claude 4 Sonnet (Reasoning) | 1134 | -28 / +26 | May 2025 | |
| 95 | Claude 4 Sonnet (Non-reasoning) | 1129 | -25 / +23 | May 2025 | |
| 96 | Grok 4.3 (low) | 1124 | -23 / +25 | Apr 2026 | |
| 97 | Ring-2.6-1T | 1124 | -24 / +28 | May 2026 | |
| 98 | KAT Coder Pro V2 | 1119 | -24 / +23 | Mar 2026 | |
| 99 | Gemini 3 Flash Preview (Non-reasoning) | 1117 | -25 / +28 | Dec 2025 | |
| 100 | Qwen3.5 122B A10B (Reasoning) | 1116 | -24 / +21 | Feb 2026 | |
| 101 | Gemma 4 31B (Reasoning) | 1113 | -22 / +23 | Apr 2026 | |
| 102 | Qwen3.5 122B A10B (Non-reasoning) | 1112 | -23 / +23 | Feb 2026 | |
| 103 | MiniMax-M2.1 | 1089 | -24 / +26 | Dec 2025 | |
| 104 | DeepSeek V3.1 (Non-reasoning) | 1082 | -24 / +24 | Aug 2025 | |
| 105 | MiMo-V2-Flash (Reasoning) | 1081 | -26 / +23 | Dec 2025 | |
| 106 | JT-35B-Flash | 1077 | -23 / +25 | May 2026 | |
| 107 | Gemini 2.5 Flash Preview (Sep '25) (Reasoning) | 1073 | -24 / +23 | Sep 2025 | |
| 108 | DeepSeek V3.2 Exp (Non-reasoning) | 1073 | -26 / +24 | Sep 2025 | |
| 109 | Step 3.5 Flash 2603 | 1069 | -24 / +23 | Apr 2026 | |
| 110 | MiMo-V2-Flash (Non-reasoning) | 1064 | -27 / +24 | Dec 2025 | |
| 111 | Step 3.5 Flash | 1055 | -26 / +27 | Feb 2026 | |
| 112 | GPT-5.1 Codex mini (high) | 1052 | -23 / +25 | Nov 2025 | |
| 113 | Qwen3.5 35B A3B (Non-reasoning) | 1050 | -21 / +21 | Feb 2026 | |
| 114 | Claude 3.7 Sonnet (Reasoning) | 1049 | -26 / +24 | Feb 2025 | |
| 115 | Claude 3.7 Sonnet (Non-reasoning) | 1048 | -24 / +24 | Feb 2025 | |
| 116 | Ling-2.6-1T | 1046 | -23 / +24 | Apr 2026 | |
| 117 | Grok 4.1 Fast (Reasoning) | 1044 | -24 / +24 | Nov 2025 | |
| 118 | MiMo-V2-Flash (Feb 2026) | 1044 | -27 / +26 | Dec 2025 | |
| 119 | Grok 4.20 0309 (Reasoning) | 1043 | -22 / +22 | Mar 2026 | |
| 120 | Qwen3 Max | 1040 | -24 / +24 | Sep 2025 | |
| 121 | Grok 4.20 0309 v2 (Non-reasoning) | 1039 | -27 / +27 | Apr 2026 | |
| 122 | MiniMax-M2 | 1032 | -27 / +25 | Oct 2025 | |
| 123 | Perplexity Pro - Labs | 1032 | -41 / +39 | - | |
| 124 | GLM-4.6 (Reasoning) | 1030 | -28 / +27 | Sep 2025 | |
| 125 | Gemma 4 26B A4B (Reasoning) | 1016 | -23 / +24 | Apr 2026 | |
| 126 | Grok 4 Fast (Reasoning) | 1014 | -24 / +24 | Sep 2025 | |
| 127 | o4-mini (high) | 1008 | -24 / +22 | Apr 2025 | |
| 128 | GPT-5.4 mini (Non-Reasoning) | 1006 | -24 / +24 | Mar 2026 | |
| 129 | DeepSeek V3.1 Terminus (Reasoning) | 1005 | -26 / +28 | Sep 2025 | |
| 130 | Gemma 4 31B (Non-reasoning) | 1004 | -23 / +21 | Apr 2026 | |
| 131 | NVIDIA Nemotron 3 Super 120B A12B (Reasoning) | 1003 | -22 / +21 | Mar 2026 | |
| 132 | GPT-5 mini (medium) | 1003 | -28 / +27 | Aug 2025 | |
| 133 | DeepSeek V3.2 Exp (Reasoning) | 1003 | -25 / +23 | Sep 2025 | |
| 134 | GPT-5 (medium) | 1001 | -27 / +27 | Aug 2025 | |
| 135 | GPT-5.1 (Non-reasoning) | 1000 | -0 / +0 | Nov 2025 | |
| 136 | MiniMax M1 80k | 995 | -24 / +25 | Jun 2025 | |
| 137 | Kimi K2 Thinking | 993 | -25 / +23 | Nov 2025 | |
| 138 | Doubao Seed Code | 987 | -27 / +26 | Nov 2025 | |
| 139 | Grok 4 | 985 | -24 / +24 | Jul 2025 | |
| 140 | GLM-4.6 (Non-reasoning) | 985 | -23 / +25 | Sep 2025 | |
| 141 | DeepSeek V3.1 Terminus (Non-reasoning) | 976 | -23 / +25 | Sep 2025 | |
| 142 | Nova 2.0 Pro Preview (medium) | 973 | -27 / +23 | Nov 2025 | |
| 143 | Google AI Pro - Thinking with 3 Pro | 972 | -43 / +43 | - | |
| 144 | Mercury 2 | 958 | -22 / +22 | Feb 2026 | |
| 145 | Gemma 4 26B A4B (Non-reasoning) | 949 | -22 / +22 | Apr 2026 | |
| 146 | Qwen3 Max Thinking (Preview) | 948 | -26 / +27 | Nov 2025 | |
| 147 | gpt-oss-120b (high) | 947 | -28 / +27 | Aug 2025 | |
| 148 | GPT-5.4 nano (Non-Reasoning) | 944 | -32 / +28 | Mar 2026 | |
| 149 | Gemini 3.1 Flash-Lite Preview | 926 | -22 / +23 | Mar 2026 | |
| 150 | Command A+ | 920 | -24 / +24 | May 2026 | |
| 151 | Gemini 2.5 Pro | 915 | -25 / +26 | Jun 2025 | |
| 152 | Qwen3 Coder Next | 914 | -23 / +25 | Feb 2026 | |
| 153 | Grok 4.20 0309 (Non-reasoning) | 910 | -22 / +23 | Mar 2026 | |
| 154 | Qwen3.5 35B A3B (Reasoning) | 908 | -22 / +24 | Feb 2026 | |
| 155 | Qwen3.5 Omni Flash | 898 | -25 / +24 | Mar 2026 | |
| 156 | SuperGrok - Grok 4 | 882 | -46 / +40 | - | |
| 157 | DeepSeek V3.2 (Non-reasoning) | 877 | -27 / +27 | Dec 2025 | |
| 158 | Trinity Large Thinking | 866 | -24 / +23 | Apr 2026 | |
| 159 | Kimi K2 0905 | 865 | -29 / +28 | Sep 2025 | |
| 160 | Mistral Large 3 | 864 | -25 / +23 | Dec 2025 | |
| 161 | Mistral Small 4 (Reasoning) | 863 | -24 / +23 | Mar 2026 | |
| 162 | Devstral 2 | 856 | -24 / +26 | Dec 2025 | |
| 163 | Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) | 853 | -27 / +27 | Sep 2025 | |
| 164 | Nova 2.0 Lite (high) | 852 | -25 / +24 | Oct 2025 | |
| 165 | Mistral Small 4 (Non-reasoning) | 846 | -24 / +21 | Mar 2026 | |
| 166 | Qwen3.5 9B (Non-reasoning) | 843 | -23 / +24 | Mar 2026 | |
| 167 | LongCat Flash Lite | 838 | -27 / +25 | Jan 2026 | |
| 168 | GLM-4.7-Flash (Reasoning) | 838 | -25 / +24 | Jan 2026 | |
| 169 | Devstral Small (May '25) | 833 | -26 / +26 | May 2025 | |
| 170 | gpt-oss-120b (low) | 832 | -23 / +23 | Aug 2025 | |
| 171 | JT-MINI | 831 | -25 / +25 | Apr 2026 | |
| 172 | K-EXAONE (Reasoning) | 826 | -26 / +26 | Dec 2025 | |
| 173 | Qwen3 235B A22B 2507 (Reasoning) | 822 | -24 / +25 | Jul 2025 | |
| 174 | Qwen3 Max (Preview) | 820 | -25 / +22 | Sep 2025 | |
| 175 | Devstral Small 2 | 819 | -25 / +25 | Dec 2025 | |
| 176 | KAT-Coder-Pro V1 | 819 | -26 / +25 | Nov 2025 | |
| 177 | EXAONE 4.5 33B | 814 | -24 / +24 | Apr 2026 | |
| 178 | GLM-4.7-Flash (Non-reasoning) | 803 | -38 / +37 | Jan 2026 | |
| 179 | ERNIE 5.0 Thinking Preview | 790 | -27 / +27 | Nov 2025 | |
| 180 | Grok 4.1 Fast (Non-reasoning) | 785 | -28 / +28 | Nov 2025 | |
| 181 | Nova 2.0 Omni (medium) | 784 | -28 / +27 | Nov 2025 | |
| 182 | Ling 2.6 Flash | 783 | -22 / +23 | Apr 2026 | |
| 183 | Mistral Medium 3.1 | 781 | -26 / +27 | Aug 2025 | |
| 184 | Qwen3 235B A22B 2507 Instruct | 781 | -27 / +27 | Jul 2025 | |
| 185 | GPT-4.1 | 777 | -27 / +26 | Apr 2025 | |
| 186 | Grok 4 Fast (Non-reasoning) | 777 | -25 / +27 | Sep 2025 | |
| 187 | Qwen3 VL 4B (Reasoning) | 776 | -39 / +40 | Oct 2025 | |
| 188 | Nemotron 3 Nano Omni 30B A3B Reasoning | 767 | -29 / +26 | Apr 2026 | |
| 189 | Grok Code Fast 1 | 765 | -27 / +28 | Aug 2025 | |
| 190 | K-EXAONE (Non-reasoning) | 763 | -27 / +24 | Dec 2025 | |
| 191 | Seed-OSS-36B-Instruct | 760 | -26 / +26 | Aug 2025 | |
| 192 | Nemotron Cascade 2 30B A3B | 759 | -25 / +23 | Mar 2026 | |
| 193 | Qwen3 235B A22B (Reasoning) | 757 | -28 / +27 | Apr 2025 | |
| 194 | GPT-5 nano (high) | 756 | -27 / +25 | Aug 2025 | |
| 195 | o3 | 754 | -30 / +29 | Apr 2025 | |
| 196 | INTELLECT-3 | 752 | -27 / +25 | Nov 2025 | |
| 197 | o3-mini (high) | 748 | -25 / +27 | Jan 2025 | |
| 198 | Gemini 2.5 Flash (Non-reasoning) | 742 | -28 / +28 | May 2025 | |
| 199 | Sarvam 105B (high) | 740 | -24 / +22 | Mar 2026 | |
| 200 | Qwen3 235B A22B (Non-reasoning) | 740 | -29 / +27 | Apr 2025 | |
| 201 | o1 | 737 | -27 / +26 | Dec 2024 | |
| 202 | Qwen3 Next 80B A3B (Reasoning) | 727 | -27 / +25 | Sep 2025 | |
| 203 | Qwen3.5 9B (Reasoning) | 717 | -22 / +22 | Mar 2026 | |
| 204 | Qwen3 VL 235B A22B (Reasoning) | 716 | -27 / +27 | Sep 2025 | |
| 205 | Qwen3 Coder 30B A3B Instruct | 713 | -27 / +24 | Jul 2025 | |
| 206 | Claude 3.5 Haiku | 708 | -25 / +24 | Oct 2024 | |
| 207 | Gemini 2.5 Flash (Reasoning) | 699 | -31 / +29 | May 2025 | |
| 208 | Devstral Medium | 693 | -27 / +25 | Jul 2025 | |
| 209 | GLM-4.6V (Non-reasoning) | 693 | -28 / +26 | Dec 2025 | |
| 210 | Ring-1T | 688 | -26 / +26 | Oct 2025 | |
| 211 | Qwen3 VL 8B Instruct | 684 | -36 / +39 | Oct 2025 | |
| 212 | DeepSeek R1 0528 (May '25) | 682 | -28 / +26 | May 2025 | |
| 213 | HyperCLOVA X SEED Think (32B) | 681 | -26 / +24 | Dec 2025 | |
| 214 | Solar Pro 3 | 677 | -24 / +24 | Apr 2026 | |
| 215 | Magistral Small 1.2 | 670 | -26 / +25 | Sep 2025 | |
| 216 | Qwen3.5 4B (Non-reasoning) | 670 | -25 / +23 | Mar 2026 | |
| 217 | Qwen3 VL 8B (Reasoning) | 670 | -27 / +26 | Oct 2025 | |
| 218 | Grok 3 | 668 | -27 / +27 | Feb 2025 | |
| 219 | Qwen3 VL 30B A3B (Reasoning) | 668 | -38 / +37 | Oct 2025 | |
| 220 | Magistral Medium 1 | 667 | -28 / +27 | Jun 2025 | |
| 221 | Solar Open 100B (Reasoning) | 666 | -31 / +27 | Dec 2025 | |
| 222 | Qwen3 30B A3B 2507 (Reasoning) | 664 | -26 / +24 | Jul 2025 | |
| 223 | Nova 2.0 Pro Preview (low) | 661 | -28 / +29 | Nov 2025 | |
| 224 | Ministral 3 14B | 658 | -28 / +27 | Dec 2025 | |
| 225 | gpt-oss-20B (high) | 653 | -26 / +25 | Aug 2025 | |
| 226 | Qwen3 VL 32B (Reasoning) | 648 | -29 / +28 | Oct 2025 | |
| 227 | Nova 2.0 Lite (medium) | 644 | -25 / +25 | Oct 2025 | |
| 228 | Mi:dm K 2.5 Pro | 643 | -27 / +26 | Dec 2025 | |
| 229 | Ministral 3 8B | 640 | -29 / +29 | Dec 2025 | |
| 230 | Qwen3 VL 235B A22B Instruct | 637 | -37 / +39 | Sep 2025 | |
| 231 | Magistral Medium 1.2 | 629 | -28 / +26 | Sep 2025 | |
| 232 | Qwen3 Next 80B A3B Instruct | 627 | -27 / +27 | Sep 2025 | |
| 233 | GPT-4.1 mini | 621 | -28 / +28 | Apr 2025 | |
| 234 | DeepSeek V3.1 (Reasoning) | 613 | -29 / +25 | Aug 2025 | |
| 235 | GLM-4.6V (Reasoning) | 610 | -29 / +29 | Dec 2025 | |
| 236 | K2 Think V2 | 608 | -27 / +26 | Dec 2025 | |
| 237 | GPT-5 nano (medium) | 595 | -28 / +26 | Aug 2025 | |
| 238 | Qwen3 4B 2507 (Reasoning) | 590 | -28 / +27 | Aug 2025 | |
| 239 | Mistral Medium 3 | 587 | -29 / +27 | May 2025 | |
| 240 | Hermes 4 - Llama-3.1 405B (Reasoning) | 587 | -25 / +25 | Aug 2025 | |
| 241 | K2-V2 (medium) | 582 | -28 / +27 | Dec 2025 | |
| 242 | Apriel-v1.6-15B-Thinker | 575 | -27 / +27 | Nov 2025 | |
| 243 | Gemini 2.0 Flash (Feb '25) | 571 | -27 / +24 | Feb 2025 | |
| 244 | Devstral Small (Jul '25) | 565 | -30 / +29 | Jul 2025 | |
| 245 | NVIDIA Nemotron 3 Nano 30B A3B (Reasoning) | 565 | -27 / +28 | Dec 2025 | |
| 246 | K2-V2 (high) | 562 | -29 / +26 | Dec 2025 | |
| 247 | GLM-4.5-Air | 560 | -31 / +29 | Jul 2025 | |
| 248 | gpt-oss-20B (low) | 550 | -27 / +26 | Aug 2025 | |
| 249 | Granite 4.1 8B | 542 | -27 / +26 | Apr 2026 | |
| 250 | Hermes 4 - Llama-3.1 70B (Reasoning) | 539 | -23 / +23 | Aug 2025 | |
| 251 | Kimi K2 | 527 | -34 / +31 | Jul 2025 | |
| 252 | Hermes 4 - Llama-3.1 70B (Non-reasoning) | 523 | -24 / +24 | Aug 2025 | |
| 253 | Qwen3 30B A3B 2507 Instruct | 516 | -27 / +28 | Jul 2025 | |
| 254 | GLM-4.5V (Reasoning) | 511 | -24 / +22 | Aug 2025 | |
| 255 | Hermes 4 - Llama-3.1 405B (Non-reasoning) | 511 | -24 / +23 | Aug 2025 | |
| 256 | Qwen3.5 4B (Reasoning) | 511 | -28 / +29 | Mar 2026 | |
| 257 | Nova 2.0 Lite (low) | 509 | -28 / +27 | Oct 2025 | |
| 258 | Qwen3 Coder 480B A35B Instruct | 507 | -31 / +28 | Jul 2025 | |
| 259 | Nova Premier | 507 | -30 / +29 | Apr 2025 | |
| 260 | Qwen3 30B A3B (Reasoning) | 503 | -27 / +26 | Apr 2025 | |
| 261 | EXAONE 4.0 32B (Reasoning) | 502 | -29 / +27 | Jul 2025 | |
| 262 | Molmo2-8B | 500 | -0 / +0 | Dec 2025 | |
| 263 | DeepSeek V3.2 Speciale | 500 | -0 / +0 | Dec 2025 | |
| 264 | Qwen3 8B (Reasoning) | 498 | -28 / +27 | Apr 2025 | |
| 265 | Qwen3 VL 30B A3B Instruct | 498 | -30 / +28 | Oct 2025 | |
| 266 | Granite 4.1 30B | 497 | -26 / +24 | Apr 2026 | |
| 267 | Qwen3 Omni 30B A3B (Reasoning) | 497 | -28 / +25 | Sep 2025 | |
| 268 | Qwen3 32B (Reasoning) | 491 | -28 / +25 | Apr 2025 | |
| 269 | Motif-2-12.7B-Reasoning | 485 | -30 / +27 | Dec 2025 | |
| 270 | Ministral 3 3B | 485 | -29 / +27 | Dec 2025 | |
| 271 | NVIDIA Nemotron 3 Nano 4B | 478 | -30 / +30 | Mar 2026 | |
| 272 | Qwen3 14B (Reasoning) | 477 | -26 / +28 | Apr 2025 | |
| 273 | Qwen3 14B (Non-reasoning) | 474 | -28 / +27 | Apr 2025 | |
| 274 | Qwen3 8B (Non-reasoning) | 470 | -27 / +27 | Apr 2025 | |
| 275 | GPT-5 mini (minimal) | 470 | -31 / +31 | Aug 2025 | |
| 276 | GLM-4.5 (Reasoning) | 469 | -34 / +32 | Jul 2025 | |
| 277 | GLM-4.5V (Non-reasoning) | 461 | -31 / +28 | Aug 2025 | |
| 278 | Solar Pro 2 (Reasoning) | 451 | -31 / +27 | Jul 2025 | |
| 279 | Solar Pro 2 (Non-reasoning) | 447 | -29 / +27 | Jul 2025 | |
| 280 | NVIDIA Nemotron Nano 9B V2 (Reasoning) | 440 | -28 / +28 | Aug 2025 | |
| 281 | Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning) | 438 | -30 / +30 | Sep 2025 | |
| 282 | Llama 4 Maverick | 438 | -29 / +28 | Apr 2025 | |
| 283 | Ling-flash-2.0 | 420 | -31 / +28 | Sep 2025 | |
| 284 | Grok 3 mini Reasoning (high) | 420 | -39 / +39 | Feb 2025 | |
| 285 | DeepSeek V3 (Dec '24) | 409 | -27 / +29 | Dec 2024 | |
| 286 | DeepSeek V3 0324 | 408 | -29 / +29 | Mar 2025 | |
| 287 | Llama 3.3 Instruct 70B | 401 | -33 / +28 | Dec 2024 | |
| 288 | Ling-1T | 400 | -27 / +28 | Oct 2025 | |
| 289 | GPT-5 (minimal) | 388 | -29 / +32 | Aug 2025 | |
| 290 | Nova Pro | 388 | -29 / +28 | Dec 2024 | |
| 291 | Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning) | 382 | -28 / +30 | Sep 2025 | |
| 292 | Nova 2.0 Lite (Non-reasoning) | 381 | -29 / +30 | Oct 2025 | |
| 293 | Llama Nemotron Super 49B v1.5 (Non-reasoning) | 380 | -27 / +28 | Jul 2025 | |
| 294 | Claude 3 Haiku | 379 | -26 / +22 | Mar 2024 | |
| 295 | GPT-4o (Aug '24) | 378 | -27 / +27 | Aug 2024 | |
| 296 | Tri-21B-Think | 374 | -25 / +24 | Feb 2026 | |
| 297 | Falcon-H1R-7B | 373 | -32 / +29 | Jan 2026 | |
| 298 | Llama Nemotron Super 49B v1.5 (Reasoning) | 369 | -30 / +29 | Jul 2025 | |
| 299 | K2-V2 (low) | 368 | -30 / +28 | Dec 2025 | |
| 300 | Granite 4.1 3B | 365 | -27 / +25 | Apr 2026 | |
| 301 | Nova 2.0 Omni (low) | 361 | -33 / +30 | Nov 2025 | |
| 302 | Sarvam 30B (high) | 360 | -26 / +23 | Mar 2026 | |
| 303 | Olmo 3.1 32B Instruct | 358 | -31 / +27 | Jan 2026 | |
| 304 | Nanbeige4.1-3B | 357 | -31 / +29 | Feb 2026 | |
| 305 | GPT-4o (Nov '24) | 349 | -25 / +24 | Nov 2024 | |
| 306 | NVIDIA Nemotron 3 Nano 30B A3B (Non-reasoning) | 349 | -29 / +28 | Dec 2025 | |
| 307 | Nova Lite | 344 | -32 / +27 | Dec 2024 | |
| 308 | Granite 4.0 H Small | 344 | -30 / +27 | Sep 2025 | |
| 309 | Qwen3 VL 4B Instruct | 343 | -26 / +29 | Oct 2025 | |
| 310 | Nova Micro | 340 | -31 / +34 | Dec 2024 | |
| 311 | Mistral Small 3.1 | 338 | -28 / +27 | Mar 2025 | |
| 312 | Llama 3.1 Nemotron Instruct 70B | 337 | -30 / +31 | Oct 2024 | |
| 313 | Tri-21B-think Preview | 337 | -33 / +30 | Feb 2026 | |
| 314 | Qwen3 30B A3B (Non-reasoning) | 332 | -28 / +28 | Apr 2025 | |
| 315 | EXAONE 4.0 32B (Non-reasoning) | 330 | -33 / +29 | Jul 2025 | |
| 316 | NVIDIA Nemotron Nano 12B v2 VL (Reasoning) | 328 | -28 / +27 | Oct 2025 | |
| 317 | Mistral Large 2 (Nov '24) | 325 | -31 / +30 | Nov 2024 | |
| 318 | Qwen3.5 2B (Reasoning) | 322 | -25 / +22 | Mar 2026 | |
| 319 | Gemini 2.5 Flash-Lite (Reasoning) | 320 | -32 / +27 | Jun 2025 | |
| 320 | GPT-4.1 nano | 320 | -29 / +30 | Apr 2025 | |
| 321 | Nova 2.0 Pro Preview (Non-reasoning) | 319 | -32 / +26 | Nov 2025 | |
| 322 | Qwen3 0.6B (Reasoning) | 315 | -30 / +29 | Apr 2025 | |
| 323 | Qwen3 4B 2507 Instruct | 307 | -32 / +28 | Aug 2025 | |
| 324 | Nova 2.0 Omni (Non-reasoning) | 305 | -32 / +28 | Nov 2025 | |
| 325 | Mistral Small 3.2 | 304 | -31 / +30 | Jun 2025 | |
| 326 | Gemini 2.5 Flash-Lite (Non-reasoning) | 304 | -30 / +32 | Jun 2025 | |
| 327 | NVIDIA Nemotron Nano 9B V2 (Non-reasoning) | 303 | -30 / +30 | Aug 2025 | |
| 328 | Gemma 4 E4B (Reasoning) | 303 | -25 / +23 | Apr 2026 | |
| 329 | Qwen3 VL 32B Instruct | 301 | -32 / +29 | Oct 2025 | |
| 330 | Exaone 4.0 1.2B (Non-reasoning) | 296 | -31 / +29 | Jul 2025 | |
| 331 | Qwen3 Omni 30B A3B Instruct | 296 | -34 / +30 | Sep 2025 | |
| 332 | Exaone 4.0 1.2B (Reasoning) | 294 | -29 / +27 | Jul 2025 | |
| 333 | Granite 4.0 H 350M | 293 | -32 / +26 | Oct 2025 | |
| 334 | Gemma 4 E4B (Non-reasoning) | 292 | -26 / +24 | Apr 2026 | |
| 335 | Gemma 3 27B Instruct | 287 | -31 / +30 | Mar 2025 | |
| 336 | NVIDIA Nemotron Nano 12B v2 VL (Non-reasoning) | 287 | -28 / +28 | Oct 2025 | |
| 337 | Jamba 1.7 Large | 285 | -30 / +29 | Jul 2025 | |
| 338 | Llama 3.1 Instruct 70B | 284 | -31 / +27 | Jul 2024 | |
| 339 | GPT-5 nano (minimal) | 283 | -32 / +31 | Aug 2025 | |
| 340 | Gemma 3 12B Instruct | 280 | -31 / +28 | Mar 2025 | |
| 341 | Granite 4.0 Micro | 279 | -30 / +26 | Sep 2025 | |
| 342 | Qwen3 0.6B (Non-reasoning) | 279 | -33 / +29 | Apr 2025 | |
| 343 | Llama 3.1 Instruct 8B | 278 | -34 / +29 | Jul 2024 | |
| 344 | Qwen3.5 0.8B (Reasoning) | 277 | -25 / +24 | Mar 2026 | |
| 345 | Olmo 3 7B Instruct | 277 | -29 / +28 | Nov 2025 | |
| 346 | Command A | 277 | -28 / +26 | Mar 2025 | |
| 347 | Jamba 1.7 Mini | 276 | -31 / +28 | Jul 2025 | |
| 348 | Qwen3 1.7B (Reasoning) | 275 | -33 / +30 | Apr 2025 | |
| 349 | LFM2 1.2B | 272 | -30 / +31 | Jul 2025 | |
| 350 | Llama 4 Scout | 272 | -30 / +30 | Apr 2025 | |
| 351 | Gemma 4 E2B (Reasoning) | 272 | -26 / +23 | Apr 2026 | |
| 352 | Granite 4.0 H 1B | 270 | -32 / +28 | Oct 2025 | |
| 353 | Granite 4.0 350M | 269 | -33 / +29 | Oct 2025 | |
| 354 | LFM2.5-1.2B-Instruct | 267 | -33 / +32 | Jan 2026 | |
| 355 | Ling-mini-2.0 | 263 | -25 / +24 | Sep 2025 | |
| 356 | MiniCPM-V 4.6 1.3B | 261 | -31 / +26 | May 2026 | |
| 357 | Granite 4.0 1B | 259 | -31 / +31 | Oct 2025 | |
| 358 | LFM2 8B A1B | 259 | -33 / +29 | Oct 2025 | |
| 359 | Step3 VL 10B | 258 | -32 / +28 | Jan 2026 | |
| 360 | Qwen3 1.7B (Non-reasoning) | 257 | -33 / +29 | Apr 2025 | |
| 361 | Llama 3.1 Instruct 405B | 256 | -31 / +29 | Jul 2024 | |
| 362 | Gemma 3 4B Instruct | 256 | -30 / +29 | Mar 2025 | |
| 363 | Jamba Reasoning 3B | 254 | -29 / +27 | Oct 2025 | |
| 364 | Gemma 4 E2B (Non-reasoning) | 253 | -27 / +25 | Apr 2026 | |
| 365 | LFM2.5-1.2B-Thinking | 252 | -32 / +28 | Jan 2026 | |
| 366 | DeepSeek R1 (Jan '25) | 249 | -29 / +27 | Jan 2025 | |
| 367 | Gemma 3n E4B Instruct | 245 | -32 / +30 | Jun 2025 | |
| 368 | Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) | 239 | -25 / +24 | Apr 2025 | |
| 369 | Qwen3.5 2B (Non-reasoning) | 238 | -24 / +25 | Mar 2026 | |
| 370 | LFM2 2.6B | 236 | -30 / +29 | Sep 2025 | |
| 371 | LFM2 24B A2B | 235 | -24 / +25 | Feb 2026 | |
| 372 | Qwen3.5 0.8B (Non-reasoning) | 234 | -26 / +26 | Mar 2026 | |
| 373 | LFM2.5-VL-1.6B | 234 | -34 / +28 | Jan 2026 | |
| 374 | Phi-4 Mini Instruct | 231 | -30 / +27 | Feb 2024 | |
| 375 | Granite 3.3 8B (Non-reasoning) | 225 | -31 / +28 | Apr 2025 |
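Adjacent ranks in the table are often separated by fewer Elo points than the `- / +` column spans, so it can help to check whether two rows overlap before reading much into their order. The following is a minimal sketch, assuming the `- / +` column is an uncertainty interval around the Elo value (the page does not say how it is computed). `Row`, `parse_row`, and `intervals_overlap` are hypothetical helper names introduced for illustration.

```python
import re
from typing import NamedTuple


class Row(NamedTuple):
    rank: int
    model: str
    elo: int
    lower: int  # elo minus the "-" value (assumed lower bound)
    upper: int  # elo plus the "+" value (assumed upper bound)


# Matches the leading cells of a leaderboard row: rank, model, Elo, "-a / +b".
ROW_RE = re.compile(
    r"^\|\s*(\d+)\s*\|\s*(.+?)\s*\|\s*(\d+)\s*\|\s*-(\d+)\s*/\s*\+(\d+)\s*\|"
)


def parse_row(line: str) -> Row:
    """Parse one pipe-delimited leaderboard row into a Row."""
    match = ROW_RE.match(line)
    if match is None:
        raise ValueError(f"not a leaderboard row: {line!r}")
    rank, model, elo, minus, plus = match.groups()
    elo_value = int(elo)
    return Row(int(rank), model, elo_value,
               elo_value - int(minus), elo_value + int(plus))


def intervals_overlap(a: Row, b: Row) -> bool:
    """True when the two rows' assumed intervals share at least one point."""
    return a.lower <= b.upper and b.lower <= a.upper


first = parse_row("| 1 | GPT-5.5 (xhigh) | 1769 | -32 / +31 | Apr 2026 | |")
third = parse_row(
    "| 3 | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) "
    "| 1753 | -41 / +40 | Apr 2026 | |"
)
fourth = parse_row(
    "| 4 | Claude Opus 4.7 (Non-reasoning, High Effort) "
    "| 1683 | -26 / +28 | Apr 2026 | |"
)

print(intervals_overlap(first, third))   # ranks 1 and 3
print(intervals_overlap(first, fourth))  # ranks 1 and 4
```

Under this reading, rows whose intervals overlap are not clearly separated by the data, while non-overlapping rows are more likely to reflect a real gap.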
Frequently Asked Questions
GDPval-AA is Artificial Analysis' evaluation based on OpenAI's GDPval dataset, which tests AI models on real-world economically valuable tasks across 44 occupations and 9 major industries.
GDPval-AA compares model submissions head-to-head on the same task. For each matchup, the two outputs are anonymized and an LLM judge picks a winner. These blind pairwise results are aggregated into an Elo rating per model.
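The page does not describe the exact aggregation procedure, so the following is only a minimal sketch of how blind pairwise outcomes could be turned into ratings. It assumes a standard sequential Elo update with a 400-point logistic scale. The K-factor, starting rating, number of passes, and model names are all hypothetical values chosen for illustration, not GDPval-AA's actual settings.

```python
from collections import defaultdict


def expected_score(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Probability that A beats B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def elo_from_matchups(matchups, k=16.0, initial=1000.0, passes=20):
    """Aggregate (winner, loser) pairs into one rating per model.

    Each pair stands for one judged matchup. The list is replayed several
    times so that early matchups do not dominate the final ratings.
    """
    ratings = defaultdict(lambda: initial)
    for _ in range(passes):
        for winner, loser in matchups:
            # The winner gains more when the win was less expected.
            delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
            ratings[winner] += delta
            ratings[loser] -= delta
    return dict(ratings)


# Hypothetical judge verdicts: (winner, loser) for each anonymized matchup.
matchups = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]

ratings = elo_from_matchups(matchups)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```

A production leaderboard would more likely fit all matchups jointly (for example with a Bradley-Terry model) and bootstrap the results to get intervals, but the sequential update shows the core idea: each win shifts rating from the loser to the winner in proportion to how surprising the result was.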
GPT-5.5 (xhigh) has the highest GDPval-AA score, with an Elo rating of 1,769 among models with published GDPval-AA results.
GDPval-AA covers real-world professional tasks across a range of occupations and industries, producing outputs such as documents, spreadsheets, slides, and diagrams. Generating these deliverables generally requires interacting with a sandbox filesystem through shell access and using web search, capabilities the model is given through the Stirrup agentic harness.
Most benchmarks test short-answer or multiple-choice responses. GDPval-AA instead evaluates complete deliverables: models operate in an agentic environment with tools, produce file outputs, and have their submissions scored through pairwise grading on relative quality.
Explore Evaluations
A composite benchmark aggregating ten challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.
GDPval-AA is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with Elo ratings derived from blind pairwise comparisons.
Artificial Analysis' implementation of the APEX-Agents benchmark, testing AI agents on long-horizon, cross-application tasks in professional-services environments with realistic application tooling.
A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.
An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks.
A scientist-curated coding benchmark featuring 288 test set subproblems from 80 laboratory problems across 16 scientific disciplines.
A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).
A benchmark measuring factual recall and hallucination across various economically relevant domains.
A benchmark evaluating precise instruction-following generalization on 58 diverse, verifiable out-of-domain constraints that test models' ability to follow specific output requirements.
A frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities, designed to be the final closed-ended academic evaluation.
The most challenging 198 questions from GPQA, where PhD experts achieve 65% accuracy but skilled non-experts only reach 34% despite web access.
A benchmark designed to test LLMs on research-level physics reasoning tasks, featuring 71 composite research challenges.
A composite measure providing an industry standard to communicate model openness for users and developers.
An enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas, featuring ten answer options and deeper reasoning requirements.
A lightweight, multilingual version of MMLU, designed to evaluate knowledge and reasoning skills across a diverse range of languages and cultural contexts.
A contamination-free coding benchmark that continuously harvests fresh competitive programming problems from LeetCode, AtCoder, and CodeForces, evaluating code generation, self-repair, and execution.
A 500-problem subset from the MATH dataset, featuring competition-level mathematics across six domains including algebra, geometry, and number theory.
All 30 problems from the 2025 American Invitational Mathematics Examination, testing olympiad-level mathematical reasoning with integer answers from 000-999.
An enhanced MMMU benchmark that eliminates shortcuts and guessing strategies to more rigorously test multimodal models across 30 academic disciplines.