Artificial Analysis Openness Index
Highlights
- Olmo 3.1 32B Instruct scores highest on the Openness Index with a score of 89, followed by Olmo 3.1 32B Think and Olmo 3 7B Instruct, both also at 89
- GPT-5 nano (high) scores lowest on the Openness Index with a score of 6, followed by o3 and GPT-5 mini (high), both also at 6
Artificial Analysis Openness Index: Results
Artificial Analysis Openness Index: Components
Artificial Analysis Openness Index: Model Availability vs. Model Transparency
Artificial Analysis Openness Index: Score vs. Release Date
Artificial Analysis Openness Index vs. Artificial Analysis Intelligence Index
Openness Index Composition
Detailed methodology
Scoring methodology
Each component is scored on a 0-3 qualitative scale according to the best-fitting openness 'archetype', with each model assessed against the full set of public first-party information available.
We synthesize these underlying factors into a unified metric, the Artificial Analysis Openness Index, as follows:
- Data elements are averaged between pre- and post-training (to give a total of 6 possible points across data)
- All component scores are added together (to a maximum of 18 points)
- This score is normalized to a 0-100 scale
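To make the aggregation concrete, the following is a minimal Python sketch of the scoring arithmetic described above. The split into two data elements plus four further components is an illustrative assumption inferred from the 6-point data cap and 18-point maximum, and the component names in the example are hypothetical rather than a published component list.

```python
def openness_index(
    pre_training_data: dict[str, float],   # data elements scored 0-3 for pre-training
    post_training_data: dict[str, float],  # the same elements scored 0-3 for post-training
    other_components: dict[str, float],    # all remaining components, each scored 0-3
) -> float:
    """Combine 0-3 component scores into a 0-100 Openness Index.

    The 18-point maximum (6 possible data points plus four further 0-3
    components) follows the methodology above; the component names used
    in the example below are hypothetical.
    """
    # 1. Average each data element between pre- and post-training.
    data_points = sum(
        (pre_training_data[k] + post_training_data[k]) / 2
        for k in pre_training_data
    )
    # 2. Add all component scores together.
    total = data_points + sum(other_components.values())
    # 3. Normalize the raw point total (maximum 18) to a 0-100 scale.
    return 100 * total / 18


if __name__ == "__main__":
    # Hypothetical model scoring 16 of 18 points; component names are made up.
    score = openness_index(
        pre_training_data={"data_availability": 3, "data_documentation": 3},
        post_training_data={"data_availability": 3, "data_documentation": 3},
        other_components={"weights": 3, "license": 3, "code": 3, "methodology": 1},
    )
    print(round(score, 2))  # 88.89 -- the same rounding seen at the top of the leaderboard
```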
Where a model is derived from a third-party base model, it may be constrained by the licensing or limited disclosure of the upstream model. For incremental or update releases, we consider only disclosures that explicitly concern the new release (and allow model creators to declare which components remain consistent with an earlier release).
Openness Index Leaderboard
| 1 | Olmo 3.1 32B Instruct | 88.89 | 12.01 | 6.00 | 10.00 | 3.00 | 1.00 | 3.00 | 1.00 | |
| 2 | Olmo 3.1 32B Think | 88.89 | 14.24 | 6.00 | 10.00 | 3.00 | 1.00 | 3.00 | 1.00 | |
| 3 | Olmo 3 7B Instruct | 88.89 | 8.14 | 6.00 | 10.00 | 3.00 | 1.00 | 3.00 | 1.00 | |
| 4 | Olmo 3 7B Think | 88.89 | 16.80 | 6.00 | 10.00 | 3.00 | 1.00 | 3.00 | 1.00 | |
| 5 | Molmo 7B-D | 88.89 | 9.25 | 6.00 | 10.00 | 3.00 | 1.00 | 3.00 | 1.00 | |
| 6 | K2-V2 (low) | 88.89 | 14.44 | 6.00 | 10.00 | 3.00 | 1.00 | 3.00 | 1.00 | |
| 7 | K2-V2 (high) | 88.89 | 20.67 | 6.00 | 10.00 | 3.00 | 1.00 | 3.00 | 1.00 | |
| 8 | K2 Think V2 | 88.89 | 24.52 | 6.00 | 10.00 | 3.00 | 1.00 | 3.00 | 1.00 | |
| 9 | K2-V2 (medium) | 88.89 | 18.70 | 6.00 | 10.00 | 3.00 | 1.00 | 3.00 | 1.00 | |
| 10 | Olmo 3 32B Think | 88.89 | 18.89 | 6.00 | 10.00 | 3.00 | 1.00 | 3.00 | 1.00 | |
| 11 | OLMo 2 7B | 88.89 | 9.30 | 6.00 | 10.00 | 3.00 | 1.00 | 3.00 | 1.00 | |
| 12 | OLMo 2 32B | 88.89 | 10.57 | 6.00 | 10.00 | 3.00 | 1.00 | 3.00 | 1.00 | |
| 13 | NVIDIA Nemotron Nano 12B v2 VL (Non-reasoning) | 72.22 | 10.11 | 6.00 | 7.00 | 2.00 | 1.00 | 2.00 | 1.00 | |
| 14 | NVIDIA Nemotron 3 Nano 30B A3B (Reasoning) | 72.22 | 24.26 | 6.00 | 7.00 | 2.00 | 1.00 | 2.00 | 1.00 | |
| 15 | NVIDIA Nemotron Nano 9B V2 (Non-reasoning) | 72.22 | 13.10 | 6.00 | 7.00 | 2.00 | 1.00 | 2.00 | 1.00 | |
| 16 | NVIDIA Nemotron Nano 12B v2 VL (Reasoning) | 72.22 | 14.78 | 6.00 | 7.00 | 2.00 | 1.00 | 2.00 | 1.00 | |
| 17 | NVIDIA Nemotron Nano 9B V2 (Reasoning) | 72.22 | 14.76 | 6.00 | 7.00 | 2.00 | 1.00 | 2.00 | 1.00 | |
| 18 | Molmo2-8B | 72.22 | - | 6.00 | 7.00 | 3.00 | 1.00 | 3.00 | 1.00 | |
| 19 | Kimi Linear 48B A3B Instruct | 61.11 | 14.41 | 6.00 | 5.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 20 | ERNIE 4.5 300B A47B | 55.56 | 17.26 | 6.00 | 4.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 21 | GLM-4.5-Air | 55.56 | 23.16 | 6.00 | 4.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 22 | GLM-4.5 (Reasoning) | 55.56 | 26.21 | 6.00 | 4.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 23 | Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) | 52.78 | 14.43 | 4.00 | 5.50 | 1.00 | 0.00 | 1.00 | 1.00 | |
| 24 | Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) | 52.78 | 20.02 | 4.00 | 5.50 | 1.00 | 0.00 | 1.00 | 1.00 | |
| 25 | Llama Nemotron Super 49B v1.5 (Non-reasoning) | 52.78 | 14.51 | 4.00 | 5.50 | 1.00 | 0.00 | 1.00 | 1.00 | |
| 26 | Llama 3.3 Nemotron Super 49B v1 (Non-reasoning) | 52.78 | 14.35 | 4.00 | 5.50 | 1.00 | 0.00 | 1.00 | 1.00 | |
| 27 | Llama 3.3 Nemotron Super 49B v1 (Reasoning) | 52.78 | 18.49 | 4.00 | 5.50 | 1.00 | 0.00 | 1.00 | 1.00 | |
| 28 | Llama Nemotron Super 49B v1.5 (Reasoning) | 52.78 | 18.62 | 4.00 | 5.50 | 1.00 | 0.00 | 1.00 | 1.00 | |
| 29 | MiMo-V2-Flash (Reasoning) | 52.78 | 39.24 | 6.00 | 3.50 | 0.00 | 0.00 | 1.00 | 0.00 | |
| 30 | GLM-4.5V (Reasoning) | 52.78 | 19.27 | 6.00 | 3.50 | 1.00 | 0.00 | 0.00 | 0.00 | |
| 31 | GLM-4.5V (Non-reasoning) | 52.78 | 12.53 | 6.00 | 3.50 | 1.00 | 0.00 | 0.00 | 0.00 | |
| 32 | Gemma 3 12B Instruct | 50.00 | 8.79 | 6.00 | 3.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 33 | Gemma 3n E2B Instruct | 50.00 | 9.73 | 6.00 | 3.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 34 | Gemma 3 4B Instruct | 50.00 | 6.31 | 6.00 | 3.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 35 | Gemma 3 27B Instruct | 50.00 | 10.19 | 6.00 | 3.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 36 | Gemma 3 1B Instruct | 50.00 | 8.65 | 6.00 | 3.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 37 | Gemma 3n E4B Instruct | 50.00 | 6.30 | 6.00 | 3.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 38 | Magistral Small 1.2 | 50.00 | 22.55 | 6.00 | 3.00 | 0.00 | 0.00 | 1.00 | 1.00 | |
| 39 | DeepSeek R1 0528 (May '25) | 50.00 | 27.01 | 6.00 | 3.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 40 | Phi-4 | 50.00 | 13.18 | 6.00 | 3.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 41 | Phi-4 Multimodal Instruct | 50.00 | 10.04 | 6.00 | 3.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 42 | Phi-4 Mini Instruct | 50.00 | 10.94 | 6.00 | 3.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 43 | Qwen3 VL 8B Instruct | 50.00 | 14.25 | 6.00 | 3.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 44 | Qwen3 VL 32B (Reasoning) | 50.00 | 24.52 | 6.00 | 3.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 45 | Qwen3 VL 32B Instruct | 50.00 | 17.17 | 6.00 | 3.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 46 | Qwen3 VL 30B A3B (Reasoning) | 50.00 | 19.62 | 6.00 | 3.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 47 | Qwen3 VL 235B A22B (Reasoning) | 50.00 | 27.51 | 6.00 | 3.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 48 | Qwen3 VL 235B A22B Instruct | 50.00 | 20.58 | 6.00 | 3.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 49 | Qwen3 VL 30B A3B Instruct | 50.00 | 16.03 | 6.00 | 3.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 50 | Qwen3 VL 4B (Reasoning) | 50.00 | 14.90 | 6.00 | 3.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 51 | Qwen3 VL 8B (Reasoning) | 50.00 | 16.61 | 6.00 | 3.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 52 | Qwen3 VL 4B Instruct | 50.00 | 14.08 | 6.00 | 3.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 53 | DeepSeek R1 0528 Qwen3 8B | 47.22 | 16.43 | 6.00 | 2.50 | 0.00 | 0.00 | 1.00 | 0.00 | |
| 54 | Hermes 4 - Llama-3.1 405B (Reasoning) | 47.22 | 21.72 | 4.00 | 4.50 | 1.00 | 0.00 | 2.00 | 0.00 | |
| 55 | Hermes 4 - Llama-3.1 405B (Non-reasoning) | 47.22 | 17.12 | 4.00 | 4.50 | 1.00 | 0.00 | 2.00 | 0.00 | |
| 56 | Hermes 4 - Llama-3.1 70B (Reasoning) | 47.22 | 20.39 | 4.00 | 4.50 | 1.00 | 0.00 | 2.00 | 0.00 | |
| 57 | Hermes 4 - Llama-3.1 70B (Non-reasoning) | 47.22 | 13.55 | 4.00 | 4.50 | 1.00 | 0.00 | 2.00 | 0.00 | |
| 58 | Apriel-v1.5-15B-Thinker | 47.22 | 28.33 | 6.00 | 2.50 | 0.00 | 0.00 | 1.00 | 0.00 | |
| 59 | Gemma 3 270M | 44.44 | 8.37 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 60 | Falcon-H1R-7B | 44.44 | 15.84 | 4.00 | 4.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 61 | Llama 3.1 Nemotron Instruct 70B | 44.44 | 13.42 | 4.00 | 4.00 | 0.00 | 0.00 | 1.00 | 1.00 | |
| 62 | GLM-4.7 (Reasoning) | 44.44 | 42.05 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 63 | GLM-4.7 (Non-reasoning) | 44.44 | 34.10 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 64 | GLM-4.7-Flash (Non-reasoning) | 44.44 | 21.47 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 65 | GLM-4.7-Flash (Reasoning) | 44.44 | 30.12 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 66 | Qwen3 4B 2507 Instruct | 44.44 | 13.19 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 67 | Qwen3 235B A22B 2507 Instruct | 44.44 | 24.66 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 68 | Qwen3 Coder 30B A3B Instruct | 44.44 | 19.96 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 69 | Qwen3 Next 80B A3B (Reasoning) | 44.44 | 26.49 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 70 | Qwen3 Coder 480B A35B Instruct | 44.44 | 24.65 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 71 | Qwen3 Next 80B A3B Instruct | 44.44 | 20.08 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 72 | Qwen3 30B A3B 2507 (Reasoning) | 44.44 | 22.43 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 73 | Qwen3 30B A3B 2507 Instruct | 44.44 | 15.00 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 74 | Qwen3 Omni 30B A3B (Reasoning) | 44.44 | 15.60 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 75 | Qwen3 Omni 30B A3B Instruct | 44.44 | 10.68 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 76 | Qwen3 235B A22B 2507 (Reasoning) | 44.44 | 29.46 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 77 | Qwen3 4B 2507 (Reasoning) | 44.44 | 18.60 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 78 | Ling-mini-2.0 | 44.44 | 15.09 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 79 | Ling-flash-2.0 | 44.44 | 15.47 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 80 | Ling-1T | 44.44 | 19.01 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 81 | Devstral Small (Jul '25) | 44.44 | 15.20 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 82 | DeepSeek V3.2 Exp (Non-reasoning) | 44.44 | 28.33 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 83 | DeepSeek V3.2 Exp (Reasoning) | 44.44 | 32.90 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 84 | Kimi K2 | 44.44 | 26.19 | 4.00 | 4.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 85 | GLM-4.6 (Reasoning) | 44.44 | 32.52 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 86 | GLM-4.6 (Non-reasoning) | 44.44 | 30.15 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 87 | Seed-OSS-36B-Instruct | 44.44 | 24.99 | 6.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 88 | Granite 4.0 Micro | 41.67 | 7.66 | 6.00 | 1.50 | 1.00 | 0.00 | 0.00 | 0.00 | |
| 89 | Granite 4.0 H 350M | 41.67 | 5.31 | 6.00 | 1.50 | 1.00 | 0.00 | 0.00 | 0.00 | |
| 90 | Granite 4.0 H Small | 41.67 | 10.79 | 6.00 | 1.50 | 1.00 | 0.00 | 0.00 | 0.00 | |
| 91 | Granite 4.0 H 1B | 41.67 | 7.96 | 6.00 | 1.50 | 1.00 | 0.00 | 0.00 | 0.00 | |
| 92 | Granite 4.0 350M | 41.67 | 6.62 | 6.00 | 1.50 | 1.00 | 0.00 | 0.00 | 0.00 | |
| 93 | Granite 4.0 1B | 41.67 | 7.27 | 6.00 | 1.50 | 1.00 | 0.00 | 0.00 | 0.00 | |
| 94 | gpt-oss-20B (high) | 38.89 | 24.47 | 6.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 95 | gpt-oss-120B (high) | 38.89 | 33.25 | 6.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 96 | Llama 3.3 Instruct 70B | 38.89 | 14.23 | 4.00 | 3.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 97 | Llama 3.1 Instruct 405B | 38.89 | 14.20 | 4.00 | 3.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 98 | Llama 3.2 Instruct 90B (Vision) | 38.89 | 11.90 | 4.00 | 3.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 99 | Llama 3.2 Instruct 11B (Vision) | 38.89 | 10.89 | 4.00 | 3.00 | 1.00 | 0.00 | 1.00 | 0.00 | |
| 100 | Mistral Small 3.2 | 38.89 | 15.03 | 6.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 101 | Mistral Large 3 | 38.89 | 22.72 | 6.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 102 | R1 1776 | 38.89 | 11.99 | 6.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 103 | Reka Flash 3 | 38.89 | 14.35 | 6.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 104 | DeepHermes 3 - Mistral 24B Preview (Non-reasoning) | 38.89 | 10.89 | 6.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 105 | Cogito v2.1 (Reasoning) | 38.89 | - | 6.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 106 | Jamba Reasoning 3B | 38.89 | 10.33 | 6.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 107 | Ring-flash-2.0 | 38.89 | 20.58 | 6.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 108 | Ring-1T | 38.89 | 22.54 | 6.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 109 | DeepSeek V3.1 Terminus (Reasoning) | 38.89 | 33.79 | 6.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 110 | DeepSeek V3.1 Terminus (Non-reasoning) | 38.89 | 28.37 | 6.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 111 | DeepSeek R1 Distill Llama 70B | 36.11 | 15.95 | 4.00 | 2.50 | 0.00 | 0.00 | 1.00 | 0.00 | |
| 112 | LFM2 2.6B | 33.33 | 7.86 | 4.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 113 | LFM2 8B A1B | 33.33 | 6.85 | 4.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 114 | DeepHermes 3 - Llama-3.1 8B Preview (Non-reasoning) | 33.33 | 7.58 | 5.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 115 | Command A | 33.33 | 13.44 | 3.00 | 3.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 116 | LFM2 1.2B | 33.33 | 6.36 | 4.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 117 | HyperCLOVA X SEED Think (32B) | 30.56 | 23.72 | 4.00 | 1.50 | 1.00 | 0.00 | 0.00 | 0.00 | |
| 118 | Llama 4 Scout | 27.78 | 13.48 | 4.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 119 | Llama 4 Maverick | 27.78 | 18.30 | 4.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 120 | Magistral Medium 1.2 | 27.78 | 27.04 | 2.00 | 3.00 | 0.00 | 0.00 | 1.00 | 1.00 | |
| 121 | LFM2.5-1.2B-Thinking | 27.78 | 8.12 | 4.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 122 | LFM2.5-1.2B-Instruct | 27.78 | 7.95 | 4.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 123 | LFM2.5-VL-1.6B | 27.78 | 6.06 | 4.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 124 | MiniMax-M2.1 | 27.78 | 39.55 | 4.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 125 | Kimi K2 0905 | 27.78 | 30.81 | 4.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 126 | Kimi K2 Thinking | 27.78 | 40.70 | 4.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 127 | K-EXAONE (Reasoning) | 27.78 | 32.13 | 4.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 128 | Exaone 4.0 1.2B (Non-reasoning) | 27.78 | 8.07 | 3.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 129 | Exaone 4.0 1.2B (Reasoning) | 27.78 | 8.26 | 3.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 130 | EXAONE 4.0 32B (Non-reasoning) | 27.78 | 11.54 | 3.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 131 | EXAONE 4.0 32B (Reasoning) | 27.78 | 16.65 | 3.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 132 | MiniMax-M2 | 27.78 | 35.98 | 4.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 133 | Jamba 1.7 Mini | 22.22 | 7.33 | 4.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 134 | Jamba 1.7 Large | 22.22 | 9.27 | 4.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 135 | Qwen3 Max | 16.67 | 31.33 | 2.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 136 | Qwen3 Max Thinking (Preview) | 16.67 | 32.45 | 2.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 137 | Gemini 2.5 Flash-Lite Preview (Sep '25) (Non-reasoning) | 11.11 | 19.40 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 138 | Claude 4.5 Sonnet (Non-reasoning) | 11.11 | 37.06 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 139 | Claude 4.5 Sonnet (Reasoning) | 11.11 | 42.92 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 140 | Claude 4.5 Haiku (Non-reasoning) | 11.11 | 31.03 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 141 | Claude Opus 4.5 (Non-reasoning) | 11.11 | 43.05 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 142 | Claude Opus 4.5 (Reasoning) | 11.11 | 49.69 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 143 | Claude 4.5 Haiku (Reasoning) | 11.11 | 37.02 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 144 | Mistral Medium 3.1 | 11.11 | 21.13 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 145 | Grok 4.1 Fast (Non-reasoning) | 11.11 | 23.54 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 146 | Grok 3 mini Reasoning (high) | 11.11 | 32.02 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 147 | Nova Micro | 11.11 | 10.25 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 148 | Nova Premier | 11.11 | 18.87 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 149 | Solar Pro 2 (Reasoning) | 11.11 | 14.93 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 150 | Solar Pro 2 (Non-reasoning) | 11.11 | 13.53 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 151 | Doubao Seed Code | 11.11 | 33.50 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 152 | GPT-5 (ChatGPT) | 11.11 | 21.83 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 153 | GPT-5.1 (Non-reasoning) | 11.11 | 27.41 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 154 | Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) | 11.11 | 25.51 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 155 | Devstral Medium | 11.11 | 18.62 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 156 | Grok 4 Fast (Non-reasoning) | 11.11 | 22.64 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 157 | Nova Pro | 11.11 | 13.46 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 158 | Nova Lite | 11.11 | 12.45 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 159 | GPT-5 nano (high) | 5.56 | 26.69 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 160 | o3 | 5.56 | 40.91 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 161 | GPT-5 mini (high) | 5.56 | 41.03 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 162 | Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning) | 5.56 | 21.60 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 163 | Gemini 3 Pro Preview (high) | 5.56 | 48.44 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 164 | Gemini 2.5 Pro | 5.56 | 34.45 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 165 | Grok 4 | 5.56 | 41.43 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 166 | Grok Code Fast 1 | 5.56 | 28.67 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 167 | Grok 4.1 Fast (Reasoning) | 5.56 | 38.54 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 168 | GPT-5 Codex (high) | 5.56 | 44.52 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 169 | GPT-5 (minimal) | 5.56 | 23.74 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 170 | GPT-5 mini (minimal) | 5.56 | 20.66 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 171 | GPT-5 nano (medium) | 5.56 | 25.68 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 172 | GPT-5 nano (minimal) | 5.56 | 13.65 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 173 | GPT-5.1 (high) | 5.56 | 47.56 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 174 | GPT-5 (low) | 5.56 | 39.03 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 175 | GPT-5 (high) | 5.56 | 44.57 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 176 | GPT-5 mini (medium) | 5.56 | 38.81 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 177 | GPT-5 (medium) | 5.56 | 41.84 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 178 | Gemini 2.5 Flash Preview (Sep '25) (Reasoning) | 5.56 | 31.09 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 179 | Grok 4 Fast (Reasoning) | 5.56 | 34.93 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |