May 28, 2026
Claude Opus 4.8 takes the lead on the Artificial Analysis Intelligence Index
Claude Opus 4.8 takes the lead on the Artificial Analysis Intelligence Index at 61.4, with Anthropic retaking the #1 spot on GDPval-AA and advancing in terminal use and scientific reasoning.
To reach the leading position on the Intelligence Index, Anthropic made large improvements in both real-world agentic work and frontier academic reasoning tasks.
Key takeaways
- Claude Opus 4.8 is the new leader on the Artificial Analysis Intelligence Index. Opus 4.8 scores 61.4, up 4.1 points from Opus 4.7 and 1.2 points ahead of GPT-5.5 (xhigh), the previous Index leader.
- The new release is slightly more efficient than its predecessor on agentic tasks, but token efficiency varied by task type. Opus 4.8 used fewer turns and fewer output tokens than Opus 4.7 on GDPval-AA, but approximately the same number of output tokens across the overall Intelligence Index, where it achieved significantly higher performance.
- Anthropic retakes the lead on GDPval-AA, our primary evaluation for agentic performance on knowledge work tasks. Opus 4.8 scored an Elo of 1,890, reflecting an implied win rate of approximately 67% against GPT-5.5.
- Claude is now among the top models for scientific reasoning. Previous releases have trailed peers on complex academic reasoning tasks, but with Opus 4.8, Claude sits slightly ahead of OpenAI and Google as the leader on Humanity's Last Exam. It also scores higher than Gemini 3.1 Pro on CritPt, a frontier physics benchmark, but remains behind GPT-5.4 and GPT-5.5.
- Claude Opus 4.8 reaches #2 on AA-Omniscience, slightly ahead of Opus 4.7. Opus 4.8 scores 27.4 on the AA-Omniscience Index, behind only Gemini 3.1 Pro (32.9). Accuracy ticked up slightly to 46.6% and hallucination rate held roughly flat at 35.9%. Anthropic continues to demonstrate substantially lower hallucination rates than peer models from Google and OpenAI.
- Compared with Opus 4.7, Opus 4.8 also makes material gains on Terminal-Bench Hard (+6.8 points), τ²-Bench Telecom (+5.9 points), and IFBench (+3.6 points), with relatively flat scores across AA-LCR, GPQA, and SciCode.
Other key model details remain the same as Opus 4.7
- Context window of 1 million tokens (equivalent to Opus 4.7)
- Pricing of $5/$25 per million tokens of input/output; cache pricing remains at a 25% premium for cache writes ($6.25 per million tokens) with a 5-minute time to live, and a 90% discount for cache hits ($0.50 per million tokens)
- Effort remains the recommended way of configuring model performance and latency, with the same options as Opus 4.7. We measured the model at its 'max' effort setting to test peak performance.
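
As a rough illustration of how the listed prices combine, the sketch below estimates the cost of a single request. The per-million-token prices come from the list above; the function name `request_cost`, its parameters, and the assumption that the four token categories are billed separately and simply summed are illustrative and not taken from Anthropic's documentation.

```python
# Prices in USD per million tokens, as listed above.
INPUT_PER_M = 5.00
OUTPUT_PER_M = 25.00
CACHE_WRITE_PER_M = 6.25  # 25% premium over the input price
CACHE_HIT_PER_M = 0.50    # 90% discount on the input price


def request_cost(input_tokens: int, output_tokens: int,
                 cache_write_tokens: int = 0, cache_hit_tokens: int = 0) -> float:
    """Estimate the USD cost of one request (hypothetical helper).

    Assumes each token category is billed independently at its own rate.
    """
    million = 1_000_000
    return (
        input_tokens / million * INPUT_PER_M
        + output_tokens / million * OUTPUT_PER_M
        + cache_write_tokens / million * CACHE_WRITE_PER_M
        + cache_hit_tokens / million * CACHE_HIT_PER_M
    )


# Example: 200k uncached input tokens, 800k cached input tokens, 50k output tokens.
example_cost = request_cost(200_000, 50_000, cache_hit_tokens=800_000)
print(f"${example_cost:.2f}")
```

Under these assumptions, a request that reads most of its context from cache is dominated by output cost rather than input cost.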

Intelligence vs output tokens
Across the overall Intelligence Index, Claude Opus 4.8 used approximately the same number of output tokens as Opus 4.7. Over the same evaluations, it improved substantially on a range of benchmarks across domains, raising its Artificial Analysis Intelligence Index score by 4.1 points.

Scientific and academic reasoning
Claude Opus 4.8 makes large strides in scientific and academic reasoning capabilities. It leads Humanity's Last Exam by 1 point in a tight contest between Anthropic, Google DeepMind, and OpenAI, and Claude Opus has overtaken Gemini 3.1 Pro on CritPt, a frontier physics evaluation developed by Argonne and UIUC.

GDPval-AA
Opus 4.8 scored 1,890 on GDPval-AA at launch with its 'max' effort setting, up 137 points from Opus 4.7 and 121 points ahead of the next-best model, GPT-5.5 xhigh. Compared head-to-head on the GDPval task set, this implies a ~67% win rate against GPT-5.5 xhigh. Opus 4.8 achieves this performance in 15% fewer turns per task and with 35% fewer output tokens than Opus 4.7. However, it still uses approximately 30% more turns than OpenAI's GPT-5.5, the second-ranked model.
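
The implied win rate can be reproduced from the rating gap. The sketch below assumes the standard logistic Elo formula with a 400-point scale; the article does not state which formula GDPval-AA uses, so the function name `elo_win_probability` and the `scale` parameter are illustrative assumptions.

```python
def elo_win_probability(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win probability for the higher-rated model.

    Uses the standard logistic Elo expectation (assumed, not confirmed
    by the source): P = 1 / (1 + 10 ** (-gap / scale)).
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Opus 4.8 leads GPT-5.5 xhigh by 121 Elo points on GDPval-AA.
gap_vs_gpt55 = 121
win_rate = elo_win_probability(gap_vs_gpt55)
print(f"Implied win rate: {win_rate:.1%}")
```

With a 121-point gap, this formula gives a win rate of roughly two in three, consistent with the ~67% figure quoted above.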

Full results
Breakdown of full results for Claude Opus 4.8.

Further Benchmarks
Compare Opus 4.8 with other leading models at: https://artificialanalysis.ai/models/claude-opus-4-8