
LLM Leaderboard - Comparison of over 100 AI models from OpenAI, Google, DeepSeek & others

Comparison and ranking of the performance of over 100 AI models (LLMs) across key metrics, including intelligence, price, speed (output speed in tokens per second, and latency as time to first token, TTFT), context window, and others. For more details, including our methodology, see our FAQs.

For a comparison of the API providers hosting the models, see

HIGHLIGHTS

Intelligence: Gemini 3.1 Pro Preview and GPT-5.3 Codex (xhigh) are the highest-intelligence models, followed by Claude Opus 4.6 (max) & Claude Sonnet 4.6 (max).

Output Speed (tokens/s): Mercury 2 (703 t/s) and Gemini 3.1 Flash-Lite Preview (379 t/s) are the fastest models, followed by Granite 4.0 H Small & Gemini 2.5 Flash-Lite (Sep).

Latency (seconds): DeepSeek R1 Distill Qwen 32B (0.38s) and NVIDIA Nemotron 3 Nano (0.38s) are the lowest-latency models, followed by Apriel-v1.5-15B-Thinker & QwQ 32B-Preview.

Price ($ per M tokens): Gemma 3n E4B ($0.03) and LFM2 24B A2B ($0.05) are the cheapest models, followed by Nova Micro & NVIDIA Nemotron Nano 9B V2.

Context Window: Llama 4 Scout (10m) and Grok 4.1 Fast (2m) are the largest-context-window models, followed by Grok 4.1 Fast & Gemini 2.0 Pro Experimental.

Key definitions

Context Window: Maximum number of combined input & output tokens. Output tokens commonly have a significantly lower limit (varying by model).

Output Speed: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API, for models which support streaming).

Latency (TTFT): Time to first token received, in seconds, after the API request is sent. For reasoning models which share reasoning tokens, this is the first reasoning token. For models which do not support streaming, this represents the time to receive the completion.
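The two speed metrics above can be computed from chunk arrival times on a streaming response. Below is a minimal Python sketch; the token stream here is simulated for illustration (a real measurement would wrap an actual streaming API client), and all names and timings are hypothetical:

```python
import time

def measure_stream(stream):
    """Measure TTFT and output speed over an iterable of token chunks.

    `stream` is any iterator yielding token chunks; it stands in for a
    streaming API response.
    """
    start = time.monotonic()
    first = None
    count = 0
    for _ in stream:
        now = time.monotonic()
        if first is None:
            first = now  # arrival of first chunk -> TTFT
        count += 1
    end = time.monotonic()
    if first is None:
        raise ValueError("stream produced no chunks")
    ttft = end - start if count == 1 else first - start
    # Output speed counts only time spent generating, i.e. after the
    # first chunk, per the definition above.
    gen_time = end - first
    speed = (count - 1) / gen_time if gen_time > 0 else float("inf")
    return ttft, speed

def fake_stream(n=50, delay=0.001):
    """Simulated model stream, standing in for a real API response."""
    for _ in range(n):
        time.sleep(delay)
        yield "tok"
```

Note that the output-speed figure deliberately excludes the time before the first chunk, matching the definitions used here.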

Price (Blended): Price per token, represented as USD per million tokens. Price is a blend of input & output token prices at a 3:1 ratio.
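The 3:1 blend is a weighted average of the input and output prices. A short sketch (the example prices are made up for illustration):

```python
def blended_price(input_price: float, output_price: float) -> float:
    """Blended USD per 1M tokens, weighting input:output at 3:1."""
    return (3 * input_price + 1 * output_price) / 4

# e.g. $0.50/M input and $1.50/M output blend to $0.75/M
print(blended_price(0.50, 1.50))
```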

Output Price: Price per token generated by the model (received from the API), represented as USD per million tokens.

Input Price: Price per token included in the request/message sent to the API, represented as USD per million tokens.

Metrics are 'live', based on the past 72 hours of measurements. Measurements are taken 8 times a day for single requests and 2 times per day for parallel requests.

Frequently Asked Questions

Gemini 3.1 Pro Preview currently ranks #1 on the Artificial Analysis LLM Leaderboard with an Intelligence Index score of 57, out of 255 models ranked.

The top models by Intelligence Index are: 1. Gemini 3.1 Pro Preview (57), 2. GPT-5.3 Codex (xhigh) (54), 3. Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (53), 4. Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) (52), 5. GPT-5.2 (xhigh) (51).

Mercury 2 is the fastest at 702.5 tokens per second, followed by Gemini 3.1 Flash-Lite Preview (379.4 t/s) and Granite 4.0 H Small (343.8 t/s).

Gemma 3n E4B Instruct is the most affordable at $0.03 per 1M tokens (blended 3:1 input-to-output), followed by LFM2 24B A2B ($0.05) and Nova Micro ($0.06).

GLM-5 (Reasoning) is the highest-ranked open weights model with an Intelligence Index score of 50. There are 160 open weights models out of 255 total on the leaderboard.

The top open weights models by Intelligence Index are: 1. GLM-5 (Reasoning) (50), 2. Kimi K2.5 (Reasoning) (47), 3. Qwen3.5 397B A17B (Reasoning) (45).

Gemini 3.1 Pro Preview leads among 132 reasoning models with an Intelligence Index score of 57. Reasoning models use extended thinking to solve complex problems before responding.

The leaderboard includes filters to narrow results by model type (reasoning vs non-reasoning), openness (open weights vs proprietary), and other criteria. You can also adjust prompt options to see how performance varies with different input lengths.

Click on any model name in the leaderboard to visit its dedicated comparison page with detailed charts covering intelligence, pricing, speed, latency, and more. You can also compare API providers for each model.