
November 25, 2025

Claude Opus 4.5: Benchmarks and Analysis

Anthropic's new Claude Opus 4.5 is the #2 most intelligent model in the Artificial Analysis Intelligence Index, narrowly behind Google's Gemini 3 Pro and tying OpenAI's GPT-5.1 (high)

Claude Opus 4.5 delivers a substantial intelligence uplift over Claude Sonnet 4.5 (+7 points on the Artificial Analysis Intelligence Index) and Claude Opus 4.1 (+11 points), establishing it as Anthropic's new leading model.

Anthropic has dramatically cut per-token pricing for Claude Opus 4.5 to $5/$25 per million input/output tokens. However, compared to the prior Claude Opus 4.1, it used 60% more tokens to complete our Intelligence Index evaluations (48M vs. 30M). The net effect is still a substantial reduction in the cost to run our Intelligence Index evaluations, from $3.1k to $1.5k, though smaller than the headline price cut implies. Even at the new pricing, Claude Opus 4.5 cost significantly more to evaluate than other frontier models, including Gemini 3 Pro (high), GPT-5.1 (high), and Claude Sonnet 4.5 (Thinking); among the models we track, only Grok 4 (Reasoning) cost more.
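
To make the price-cut-vs-token-usage tradeoff concrete, here is a minimal sketch of the cost arithmetic. The output token counts and per-token prices are those quoted above; the input token counts are illustrative assumptions, since the input/output split is not broken out in this article.

```python
def eval_cost_usd(input_tokens_m, output_tokens_m, input_price, output_price):
    """Cost to run an eval: token counts (millions) times per-million-token prices."""
    return input_tokens_m * input_price + output_tokens_m * output_price

# Output token counts and prices are from this article.
# Input token counts are illustrative assumptions, not published figures.
opus_4_5 = eval_cost_usd(input_tokens_m=60, output_tokens_m=48, input_price=5, output_price=25)
opus_4_1 = eval_cost_usd(input_tokens_m=57, output_tokens_m=30, input_price=15, output_price=75)

print(f"Claude Opus 4.5: ~${opus_4_5:,.0f}")  # ~$1,500
print(f"Claude Opus 4.1: ~${opus_4_1:,.0f}")  # ~$3,100
```

The 3x per-token price cut is partly offset by the ~60% higher token usage, which is why the realized eval cost roughly halves rather than falling by two thirds.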

Key Benchmarking Takeaways

  • 🧠 Anthropic's most intelligent model: In reasoning mode, Claude Opus 4.5 scores 70 on the Artificial Analysis Intelligence Index. This is a jump of +7 points from Claude Sonnet 4.5 (Thinking), which was released in September 2025, and +11 points from Claude Opus 4.1 (Thinking). Claude Opus 4.5 is now the second most intelligent model. It places ahead of Grok 4 (65) and Kimi K2 Thinking (67), ties GPT-5.1 (high, 70), and trails only Gemini 3 Pro (73). Claude Opus 4.5 (Thinking) scores 5% on CritPt, a frontier physics eval reflective of research assistant capabilities. It sits only behind Gemini 3 Pro (9%) and ties GPT-5.1 (high, 5%)

  • 📈 Largest increases in coding and agentic tasks: Compared to Claude Sonnet 4.5 (Thinking), the biggest uplifts appear across coding, agentic tasks, and long-context reasoning, including LiveCodeBench (+16 p.p.), Terminal-Bench Hard (+11 p.p.), τ²-Bench Telecom (+12 p.p.), AA-LCR (+8 p.p.), and Humanity's Last Exam (+11 p.p.). Claude Opus 4.5 achieves Anthropic's best scores yet across all 10 benchmarks in the Artificial Analysis Intelligence Index. It also earns the highest score on Terminal-Bench Hard (44%) of any model and ties Gemini 3 Pro on MMLU-Pro (90%)

  • 📚 Knowledge and hallucination: In our recently launched AA-Omniscience Index, which measures the embedded knowledge and hallucination tendency of language models, Claude Opus 4.5 places 2nd with a score of 10. It sits behind only Gemini 3 Pro Preview (13) and ahead of Claude Opus 4.1 (Thinking, 5) and GPT-5.1 (high, 2). Claude Opus 4.5 (Thinking) achieves the second-highest accuracy (43%) and the 4th-lowest hallucination rate (58%), trailing on hallucination rate only Claude Haiku (Thinking, 26%), Claude Sonnet 4.5 (Thinking, 48%), and GPT-5.1 (high). Claude Opus 4.5 continues to demonstrate Anthropic's leadership in AI safety with a lower hallucination rate than other frontier models such as Grok 4 and Gemini 3 Pro

  • ⚡ Non-reasoning performance: In non-reasoning mode, Claude Opus 4.5 scores 60 on the Artificial Analysis Intelligence Index and is the most intelligent non-reasoning model. It places ahead of Qwen3 Max (55), Kimi K2 0905 (50), and Claude Sonnet 4.5 (50)

  • ⚙️ Token efficiency: Anthropic continues to demonstrate impressive token efficiency. Claude Opus 4.5 improves intelligence without a significant increase in token usage (compared to Claude Sonnet 4.5, evaluated with a maximum reasoning budget of 64k tokens). Claude Opus 4.5 uses 48M output tokens to run the Artificial Analysis Intelligence Index, lower than other frontier models such as Gemini 3 Pro (high, 92M), GPT-5.1 (high, 81M), and Grok 4 (Reasoning, 120M); see the sketch after this list for a rough per-index-point comparison

  • 💲 Pricing: Anthropic has reduced the per-token pricing of Claude Opus 4.5 compared to Claude Opus 4.1. Claude Opus 4.5 is priced at $5/$25 per 1M input/output tokens (vs. $15/$75 for Claude Opus 4.1). This positions it much closer to Claude Sonnet 4.5 ($3/$15 per 1M tokens) while offering higher intelligence in thinking mode
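
One rough way to compare token efficiency across these frontier models is output tokens spent per Intelligence Index point. This is our own illustrative ratio, not an official Artificial Analysis metric; the token counts and index scores are those quoted above.

```python
# Output tokens used to run the Intelligence Index (millions) and index scores,
# as quoted in this article. Tokens-per-point is an illustrative ratio only.
models = {
    "Claude Opus 4.5 (Thinking)": (48, 70),
    "GPT-5.1 (high)": (81, 70),
    "Gemini 3 Pro (high)": (92, 73),
    "Grok 4 (Reasoning)": (120, 65),
}

for name, (tokens_m, index) in sorted(models.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{name}: {tokens_m / index:.2f}M output tokens per index point")
```

On this ratio Claude Opus 4.5 (~0.69M tokens per point) is well ahead of GPT-5.1 (~1.16M), Gemini 3 Pro (~1.26M), and Grok 4 (~1.85M).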

Key Model Details

  • 📏 Context window: 200K tokens
  • 🪙 Max output tokens: 64K tokens
  • 🌐 Availability: Claude Opus 4.5 is available via Anthropic's API, Google Vertex AI, Amazon Bedrock, and Microsoft Azure. It is also available in the Claude app and Claude Code (a minimal API sketch follows below)
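
For reference, a minimal sketch of calling the model with extended thinking via Anthropic's Python SDK. The model id string and token budgets below are assumptions for illustration; consult Anthropic's documentation for the exact identifier and limits.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-5",   # assumed model id; check Anthropic's docs
    max_tokens=16000,          # must stay within the 64K output cap noted above
    thinking={                 # enable reasoning ("thinking") mode
        "type": "enabled",
        "budget_tokens": 8000, # illustrative reasoning budget
    },
    messages=[{"role": "user", "content": "Summarize the tradeoffs of long context windows."}],
)

# Thinking and answer text arrive as separate content blocks.
for block in response.content:
    if block.type == "text":
        print(block.text)
```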

Claude Opus 4.5 delivers a substantial intelligence uplift over Claude Sonnet 4.5 and Claude Opus 4.1, placing it as the #2 most intelligent model in the Artificial Analysis Intelligence Index

A key differentiator for the Claude models remains that they are substantially more token-efficient than all other reasoning models. Claude Opus 4.5 has significantly increased intelligence without a large increase in output tokens, differing substantially from other model families that rely on greater reasoning at inference time (i.e., more output tokens). On the Output Tokens Used in Artificial Analysis Intelligence Index vs Intelligence Index chart, Claude Opus 4.5 (Thinking) sits on the Pareto frontier.
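
For readers unfamiliar with the term, a model sits on the Pareto frontier when no other model is at least as good on both axes and strictly better on one — here, fewer output tokens and a higher Intelligence Index. A minimal sketch of that dominance check, using the figures quoted in this article:

```python
# (output tokens in millions, Intelligence Index) pairs quoted in this article.
points = {
    "Claude Opus 4.5 (Thinking)": (48, 70),
    "GPT-5.1 (high)": (81, 70),
    "Gemini 3 Pro (high)": (92, 73),
    "Grok 4 (Reasoning)": (120, 65),
}

def on_pareto_frontier(name):
    tokens, score = points[name]
    # Dominated if another model uses no more tokens and scores at least as
    # high, with strict improvement on at least one axis.
    return not any(
        t <= tokens and s >= score and (t, s) != (tokens, score)
        for other, (t, s) in points.items() if other != name
    )

for name in points:
    print(name, "on frontier:", on_pareto_frontier(name))
```

On these four points, only Claude Opus 4.5 (fewest tokens at its score) and Gemini 3 Pro (highest score) survive the dominance check.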

Claude Opus 4.5 has significantly increased intelligence without a large increase in output tokens

This output token efficiency contributes to Claude Opus 4.5 (in Thinking mode) offering a better tradeoff between intelligence and cost to run the Artificial Analysis Intelligence Index than Claude Opus 4.1 (Thinking) and Grok 4 (Reasoning).

Claude Opus 4.5 offers a better tradeoff between intelligence and cost than Claude Opus 4.1 and Grok 4

Claude Opus 4.5 (Thinking) takes the #2 spot on the Artificial Analysis Omniscience Index, our new benchmark for measuring knowledge and hallucination across domains. It comes in second on both the Omniscience Index (our lead metric, which deducts points for incorrect answers) and Omniscience Accuracy (percentage correct), offering a balance of high accuracy and a low hallucination rate compared to peer models.
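
As a worked illustration of how the figures above fit together: assuming the index awards a point per correct answer, deducts a point per incorrect answer, and scores abstentions as zero, and that the hallucination rate is the share of incorrect answers among questions not answered correctly, Claude Opus 4.5's published numbers are mutually consistent. These definitional details are our reading, not an official specification.

```python
accuracy = 0.43            # fraction answered correctly (from this article)
hallucination_rate = 0.58  # incorrect answers / questions not answered correctly

# Under the assumed scoring: +1 correct, -1 incorrect, 0 abstained.
incorrect = hallucination_rate * (1 - accuracy)  # ~0.33
index = round((accuracy - incorrect) * 100)      # ~10, matching the published score

print(f"incorrect ~ {incorrect:.2f}, Omniscience Index ~ {index}")
```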

Claude Opus 4.5 (Thinking) takes the #2 spot on the Artificial Analysis Omniscience Index

Individual results across all benchmarks in our Artificial Analysis Intelligence Index. We have run all these benchmarks independently and like-for-like across all models.

Independent evaluation results across all benchmarks in our Artificial Analysis Intelligence Index

While Claude Opus 4.5 is significantly more token-efficient than nearly all other reasoning models, it used ~60% more tokens than Claude Opus 4.1. Combined with per-token pricing that remains high relative to most models, this makes Claude Opus 4.5 among the most expensive models to run the Artificial Analysis Intelligence Index, at ~$1.5k.

Due to its higher per-token pricing, Claude Opus 4.5 is among the most expensive models to run the Artificial Analysis Intelligence Index, despite its token efficiency

Compare Claude Opus 4.5 to other models on Artificial Analysis:

https://artificialanalysis.ai/models/claude-opus-4-5-thinking