January 29, 2026
Qwen3 Max Thinking Benchmarks and Analysis
Alibaba has released Qwen3-Max-Thinking with a significant intelligence upgrade, but the model still places behind several peers on the Artificial Analysis Intelligence Index and is not open weights
Qwen3-Max-Thinking scores 40 on the Artificial Analysis Intelligence Index, an 8-point jump from the Preview version (32). It places in line with MiniMax-M2.1 (40) but behind DeepSeek V3.2 (42), GLM-4.7 (42), and Kimi K2.5 (47). Qwen3-Max-Thinking is Alibaba's flagship proprietary reasoning model and performs significantly better than Qwen3 235B 2507 (Thinking, 29)
Key takeaways from our independent benchmarking:
➤ 🧠 Intelligence gains driven by general reasoning and instruction following: The 8-point jump from Preview is largely driven by improvements on general reasoning evaluations. The HLE (Humanity's Last Exam) score doubles to 26% from the Preview version, ahead of DeepSeek V3.2 (22%), GLM-4.7 (25%), and MiniMax-M2.1 (22%). The IFBench score jumps from 54% to 71%, leading peers from Chinese AI labs including Kimi K2.5 (70%) and GLM-4.7 (68%), a notable improvement in the model's ability to follow complex instructions
➤ 🛠️ Improved agentic performance: Qwen3-Max-Thinking achieves an ELO of 1170 on GDPval-AA, up from Qwen3-Max-Thinking (Preview)'s ELO of 958 (see the Elo sketch after this list). This places it ahead of MiniMax-M2.1 (1074) but behind GLM-4.7 (1192), DeepSeek V3.2 (1186), and Kimi K2.5 (1316). GDPval-AA is our leading metric for general agentic performance, measuring models on realistic knowledge work tasks such as preparing presentations and analyses. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup, our open source reference agentic harness
➤ 🪙 Token usage is in line with peers: Qwen3-Max-Thinking generated 86M output tokens (79M reasoning) to complete the Intelligence Index, up from the Preview version's 30M output tokens. This places it in line with Kimi K2.5 (89M) but well below GLM-4.7 (167M)
➤ 📉 AA-Omniscience trails peer models: Qwen3-Max-Thinking scores -34 on the AA-Omniscience Index, our knowledge evaluation measuring both accuracy and hallucination rate (see the scoring sketch after this list). This trails Kimi K2.5 (-11), DeepSeek V3.2 (-23), and MiniMax-M2.1 (-30).
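For context on what the GDPval-AA Elo gaps above mean, here is a minimal sketch assuming the ratings follow the standard 400-point logistic Elo scale (an assumption for illustration; see our methodology for the exact rating procedure). It converts rating differences into expected head-to-head win rates.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected head-to-head win probability of A over B under the
    standard Elo logistic model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Qwen3-Max-Thinking (1170) vs. its Preview version (958): a 212-point
# gap implies roughly a 77% expected preference for the new model.
print(f"{elo_expected_score(1170, 958):.2f}")   # ~0.77
# Against Kimi K2.5 (1316), the same calculation gives ~0.30.
print(f"{elo_expected_score(1170, 1316):.2f}")  # ~0.30
```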
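To illustrate how a negative AA-Omniscience score can arise, here is a minimal sketch assuming a scoring rule that rewards correct answers, penalizes incorrect (hallucinated) answers equally, and leaves abstentions neutral; both the rule and the answer split below are illustrative assumptions, not the exact AA-Omniscience definition.

```python
def omniscience_index(correct: int, incorrect: int, abstained: int) -> float:
    """Illustrative scoring rule (an assumption, not the exact AA
    definition): +1 per correct answer, -1 per incorrect (hallucinated)
    answer, 0 for abstentions, scaled to a -100..100 range."""
    total = correct + incorrect + abstained
    return 100 * (correct - incorrect) / total

# Hypothetical split showing how a score of -34 can arise: the model
# answers incorrectly on far more questions than it answers correctly.
print(omniscience_index(correct=20, incorrect=54, abstained=26))  # -34.0
```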
Key model information:
➤ 🔒 Proprietary: Like the Preview version, Qwen3-Max-Thinking is proprietary; Alibaba has not released the weights
➤ ⚙️ Context window: The model supports a 256K-token context window
➤ 📷 No multimodality: The model is text-only, with no multimodal inputs or outputs
➤ 💲 Pricing: The model is priced at $1.2/$6 per 1M input/output tokens for up to 32K input tokens, scaling to $2.4/$12 for 32K–128K and $3/$15 for 128K–256K (see the cost sketch after this list)
➤ 🌐 Availability: Qwen3-Max-Thinking is currently available in Qwen Chat and via the first-party API on Alibaba Cloud
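Here is a minimal cost sketch for the tiered pricing above, assuming (as is common for tiered APIs, but an assumption here) that a request's input-token count selects the tier and that the selected tier's rates price both input and output tokens.

```python
# Rates from the pricing above: (input-token ceiling, $/1M input, $/1M output)
TIERS = [
    (32_000, 1.2, 6.0),
    (128_000, 2.4, 12.0),
    (256_000, 3.0, 15.0),
]

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request. Assumes the input-token count selects
    the tier and that tier's rates price both input and output."""
    for ceiling, in_rate, out_rate in TIERS:
        if input_tokens <= ceiling:
            return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    raise ValueError("input exceeds the 256K context window")

# Example: 50K input + 10K output falls in the 32K-128K tier:
# 50,000 * $2.4/1M + 10,000 * $12/1M = $0.12 + $0.12 = $0.24
print(f"${request_cost(50_000, 10_000):.2f}")  # $0.24
```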
Intelligence Index

GDPval-AA Performance
Qwen3-Max-Thinking improves agentic performance compared to Qwen3-Max-Thinking (Preview) and places ahead of MiniMax-M2.1 but behind other peers

Intelligence vs Token Usage
Qwen3-Max-Thinking used ~86M output tokens to run the Artificial Analysis Intelligence Index, ~56M tokens more than Qwen3-Max-Thinking (Preview)

All Evaluations
Breakdown of full evaluation suite
See Artificial Analysis for further details and benchmarks of Qwen3-Max-Thinking: https://artificialanalysis.ai/
Want to dive deeper? Discuss this model with our Discord community: https://discord.gg/ATfzv9v9