
DeepSeek's R1 leaps over xAI, Meta and Anthropic, leaving DeepSeek tied as the world's #2 AI Lab and the undisputed open-weights leader

DeepSeek R1-0528 has jumped from 60 to 68 in the Artificial Analysis Intelligence Index, our index of 7 leading evaluations that we run independently across all leading models. That's the same magnitude of increase as the difference between OpenAI's o1 and o3 (62 to 70).

This positions DeepSeek R1 above xAI's Grok 3 mini (high), NVIDIA's Llama Nemotron Ultra, Meta's Llama 4 Maverick and Alibaba's Qwen3 235B in intelligence, and equal to Google's Gemini 2.5 Pro.

Breakdown of R1-0528's improvements:

  • 🧠 Intelligence increases across the board: Biggest jumps seen in AIME 2024 (Competition Math, +21 points), LiveCodeBench (Code generation, +15 points), GPQA Diamond (Scientific Reasoning, +10 points) and Humanity's Last Exam (Reasoning & Knowledge, +6 points)

  • 🏛️ No change to architecture: R1-0528 is a post-training update with no change to the V3/R1 architecture - it remains a large 671B model with 37B active parameters

  • 🧑‍💻 Significant leap in coding skills: R1 now matches Gemini 2.5 Pro in the Artificial Analysis Coding Index and trails only o4-mini (high) and o3

  • 🗣️ Increased token usage: R1-0528 used 99 million tokens to complete the evals in the Artificial Analysis Intelligence Index, 40% more than the original R1's 71 million tokens - i.e. the new R1 thinks for longer than the original R1. This is still not the highest token usage we have seen: Gemini 2.5 Pro uses 30% more tokens than R1-0528
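As a quick sanity check, the token-usage percentages above can be reproduced in a few lines. This is a sketch: Gemini 2.5 Pro's absolute token count is implied from the reported 30% figure, not directly measured here.

```python
# Sanity-check of the token-usage figures quoted above.
r1_original = 71e6   # tokens used by the original R1 across the Index evals
r1_0528 = 99e6       # tokens used by R1-0528
increase = (r1_0528 - r1_original) / r1_original
print(f"R1-0528 vs original R1: +{increase:.0%}")  # ~+39%, i.e. roughly 40%

# Gemini 2.5 Pro is reported to use ~30% more tokens than R1-0528;
# its absolute count below is implied, not directly reported.
gemini_implied = r1_0528 * 1.30
print(f"Implied Gemini 2.5 Pro usage: ~{gemini_implied / 1e6:.0f}M tokens")
```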

DeepSeek R1-0528-Qwen3-8B

Alongside the R1-0528 launch, DeepSeek released a distilled 8B model that aims to bring the advanced reasoning capabilities of its flagship R1 model to a smaller, more accessible size for on-device deployment. R1-0528-Qwen3-8B was trained on reasoning chain-of-thought examples from the full-size R1-0528.

DeepSeek R1-0528-Qwen3-8B achieves a score of 52 on the Artificial Analysis Intelligence Index.

Breakdown of DeepSeek-R1-0528-Qwen3-8B:

  • 🧠 Similar intelligence to Qwen3 8B: DeepSeek's new distill matches the intelligence of Qwen3 8B (Reasoning), Alibaba's post-trained version of the same base model - it scores one point higher in the Artificial Analysis Intelligence Index, but this is unlikely to translate to noticeable gains in real-world use. Unlike the hybrid approach Alibaba took with the Qwen3 series, DeepSeek's model does not support inference-time control of whether the model reasons.

  • 📈 Huge leap from DeepSeek R1 (January) distilled models: This distillation achieves intelligence equivalent to the original R1's distilled version of Qwen2.5 32B - meaning that in just five months there is now an 8B model performing as well as the Qwen2.5 32B distillation did in January.

DeepSeek R1-0528 APIs

We are now tracking 14 APIs for DeepSeek's new R1 model, including DeepSeek's first-party API and offerings from Azure, Fireworks, SambaNova, Lambda Labs, Nebius, DeepInfra, Parasail, Hyperbolic, CentML, Together AI, Novita, GMI Cloud and kluster.ai.

Key info from our benchmarking:

  • ➤ We're seeing the fastest output speeds on Fireworks (258 tokens/s), SambaNova (132 tokens/s) and Azure (~100 tokens/s)
  • ➤ We're seeing the best prices on DeepInfra ($0.5/$2.15 per million input/output tokens), followed closely by Lambda Labs ($0.5/$2.18) and DeepSeek's own first-party API ($0.55/$2.19)
  • ➤ We're seeing most providers offer a 164k context window. Providers offering this maximum context window include Lambda Labs, Nebius, Parasail, Hyperbolic, Fireworks Fast, DeepInfra, and kluster.ai. DeepSeek's own first-party API only supports a 64k context window.
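The per-token prices above can be collapsed into a single blended figure for easier comparison. A minimal sketch, assuming a 3:1 input:output token ratio - that ratio is an assumption for illustration, and real workload mixes vary:

```python
# Blended price per million tokens for the three cheapest providers above,
# assuming a 3:1 input:output token ratio (an assumption, not a reported figure).
prices = {
    "DeepInfra": (0.50, 2.15),               # (input, output) USD per 1M tokens
    "Lambda Labs": (0.50, 2.18),
    "DeepSeek first-party API": (0.55, 2.19),
}

def blended(input_price, output_price, ratio=3):
    """Weighted-average price per million tokens at ratio:1 input:output."""
    return (ratio * input_price + output_price) / (ratio + 1)

# Rank providers from cheapest to most expensive on the blended metric.
for name, (inp, out) in sorted(prices.items(), key=lambda kv: blended(*kv[1])):
    print(f"{name}: ${blended(inp, out):.3f} per 1M blended tokens")
```

At this blend the ordering matches the bullet above: DeepInfra ($0.913) edges out Lambda Labs ($0.920) and DeepSeek's first-party API ($0.960).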

DeepSeek R1 is a particularly hard model to host compared to most other open-weights models because it is so large - at 671B total parameters, it cannot fit on a single 8xH100 node in its native FP8 precision.
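A back-of-envelope calculation shows why. At one byte per weight in FP8, the parameters alone exceed a full node's HBM - this sketch assumes the nominal 80 GB H100 variant, and ignores KV cache and activations, which only make the gap worse:

```python
# Why DeepSeek R1 cannot fit on one 8xH100 node at FP8 precision.
total_params = 671e9     # total parameters
bytes_per_param = 1      # FP8 stores one byte per weight
weights_gb = total_params * bytes_per_param / 1e9

node_hbm_gb = 8 * 80     # 8x H100 with 80 GB HBM each
print(f"Weights alone: {weights_gb:.0f} GB vs node HBM: {node_hbm_gb} GB")
# Weights alone: 671 GB vs node HBM: 640 GB -- and the KV cache and
# activations still need memory on top of that.
```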

Takeaways for AI

  • 👑 The gap between open and closed models is smaller than ever: open-weights models have continued to make intelligence gains in line with proprietary models. DeepSeek's R1 release in January was the first time an open-weights model achieved the #2 position, and DeepSeek's R1 update today brings it back to the same position

  • 🇨🇳 China remains neck and neck with the US: models from China-based AI Labs have all but completely caught up to their US counterparts, and this release continues that trend. As of today, DeepSeek leads US-based AI Labs including Anthropic and Meta in the Artificial Analysis Intelligence Index

  • 🔊 Improvements driven by reinforcement learning: DeepSeek has shown substantial intelligence improvements with the same architecture and pre-training as their original DeepSeek R1 release. This highlights the continually increasing importance of post-training, particularly for reasoning models trained with reinforcement learning (RL) techniques. OpenAI disclosed a 10x scaling of RL compute between o1 and o3 - DeepSeek has just demonstrated that, so far, it can keep up with OpenAI's RL compute scaling. Scaling RL demands less compute than scaling pre-training and offers an efficient way of achieving intelligence gains, which supports AI Labs with fewer GPUs