
LLM Leaderboard - Comparison of over 100 AI models from OpenAI, Google, DeepSeek & others

Comparison and ranking of the performance of over 100 AI models (LLMs) across key metrics including intelligence, price, speed (output speed in tokens per second, and latency as time to first token, TTFT), context window, and others. For more details, including our methodology, see our FAQs.

For a comparison of the API providers hosting the models, see:

HIGHLIGHTS

Intelligence: Gemini 3.1 Pro Preview and GPT-5.3 Codex (xhigh) are the highest intelligence models, followed by Claude Opus 4.6 (max) & Claude Sonnet 4.6 (max).
Output Speed (tokens/s): Mercury 2 (620 t/s) and Granite 4.0 H Small (440 t/s) are the fastest models, followed by Gemini 2.5 Flash-Lite (Sep) & Granite 3.3 8B.
Latency (seconds): Apriel-v1.5-15B-Thinker (0.18s) and LFM2 24B A2B (0.22s) are the lowest latency models, followed by QwQ 32B-Preview & Apriel-v1.6-15B-Thinker.
Price ($ per M tokens): Gemma 3n E4B ($0.03) and LFM2 24B A2B ($0.05) are the cheapest models, followed by Nova Micro & NVIDIA Nemotron Nano 9B V2.
Context Window: Llama 4 Scout (10m) and Grok 4.1 Fast (2m) are the largest context window models, followed by Gemini 2.0 Pro Experimental.

Key definitions

Context Window: Maximum number of combined input & output tokens. Output tokens commonly have a significantly lower limit (varies by model).

Output Speed: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API, for models which support streaming).

Latency (TTFT): Time to first token received, in seconds, after the API request is sent. For reasoning models which share reasoning tokens, this is the first reasoning token. For models which do not support streaming, this represents the time to receive the completion.
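The two streaming metrics above can be computed from the arrival times of streamed chunks. A minimal sketch, assuming chunk events are recorded as (timestamp, token count) pairs on the same clock as the request start; in this sketch the first chunk's tokens count toward the total while generation time is measured from the first chunk onward:

```python
def stream_metrics(request_start, events):
    """Compute TTFT and output speed from streamed chunk events.

    events: list of (timestamp_seconds, n_tokens) pairs, one per streamed
    chunk, in arrival order. Timestamps share a clock with request_start
    (e.g. time.monotonic() readings).
    """
    if not events:
        raise ValueError("no chunks received")
    first_ts, _ = events[0]
    last_ts, _ = events[-1]
    ttft = first_ts - request_start           # time to first token
    total_tokens = sum(n for _, n in events)
    generating = last_ts - first_ts           # time spent after first chunk
    speed = total_tokens / generating if generating > 0 else float("inf")
    return ttft, speed

# A request sent at t=0 whose chunks arrive at 0.5s, 1.0s, 1.5s with
# 10 tokens each gives TTFT = 0.5s and 30 tokens / 1.0s = 30 t/s.
print(stream_metrics(0.0, [(0.5, 10), (1.0, 10), (1.5, 10)]))
```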

Price (Blended): Price per token, represented as USD per million tokens. Price is a blend of input & output token prices (3:1 input-to-output ratio).
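The 3:1 blend is a weighted average: input price counts three times as heavily as output price. A small sketch with illustrative prices (the dollar figures below are hypothetical, not taken from the leaderboard):

```python
def blended_price(input_price, output_price, ratio=3):
    """Blend per-million-token prices at a given input:output ratio.

    With the leaderboard's 3:1 ratio, input tokens are weighted three
    times as heavily as output tokens.
    """
    return (ratio * input_price + output_price) / (ratio + 1)

# e.g. a hypothetical $0.04/M input, $0.16/M output price blends to
# (3 * 0.04 + 0.16) / 4 = $0.07 per million tokens
print(blended_price(0.04, 0.16))
```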

Output Price: Price per token generated by the model (received from the API), represented as USD per million tokens.

Input Price: Price per token included in the request/message sent to the API, represented as USD per million tokens.

Metrics are 'live', based on the past 72 hours of measurements. Measurements are taken 8 times per day for single requests and 2 times per day for parallel requests.

Frequently Asked Questions

Gemini 3.1 Pro Preview currently ranks #1 on the Artificial Analysis LLM Leaderboard with an Intelligence Index score of 57, out of 257 models ranked.

The top models by Intelligence Index are: 1. Gemini 3.1 Pro Preview (57), 2. GPT-5.3 Codex (xhigh) (54), 3. Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (53), 4. Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) (52), 5. GPT-5.2 (xhigh) (51).

Mercury 2 is the fastest at 620.2 tokens per second, followed by Granite 4.0 H Small (439.5 t/s) and Gemini 2.5 Flash-Lite Preview (Sep '25) (Reasoning) (410.8 t/s).

Gemma 3n E4B Instruct is the most affordable at $0.03 per 1M tokens (blended 3:1 input-to-output), followed by LFM2 24B A2B ($0.05) and Nova Micro ($0.06).

GLM-5 (Reasoning) is the highest-ranked open weights model with an Intelligence Index score of 50. There are 162 open weights models out of 257 total on the leaderboard.

The top open weights models by Intelligence Index are: 1. GLM-5 (Reasoning) (50), 2. Kimi K2.5 (Reasoning) (47), 3. Qwen3.5 397B A17B (Reasoning) (45).

Gemini 3.1 Pro Preview leads among 132 reasoning models with an Intelligence Index score of 57. Reasoning models use extended thinking to solve complex problems before responding.

The leaderboard includes filters to narrow results by model type (reasoning vs non-reasoning), openness (open weights vs proprietary), and other criteria. You can also adjust prompt options to see how performance varies with different input lengths.
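The filtering described above amounts to narrowing a table of model records by attribute. A minimal sketch, assuming hypothetical records that mirror the leaderboard's columns (the model names and scores below are placeholders, not real leaderboard data):

```python
# Hypothetical records mirroring leaderboard columns; values are illustrative.
MODELS = [
    {"name": "Model A", "reasoning": True,  "open_weights": False, "intelligence": 57},
    {"name": "Model B", "reasoning": True,  "open_weights": True,  "intelligence": 50},
    {"name": "Model C", "reasoning": False, "open_weights": True,  "intelligence": 38},
]

def filter_models(models, reasoning=None, open_weights=None):
    """Narrow a model list by type and openness, ranked by intelligence.

    Passing None for a criterion leaves it unfiltered, the way an
    unselected leaderboard filter does.
    """
    out = models
    if reasoning is not None:
        out = [m for m in out if m["reasoning"] == reasoning]
    if open_weights is not None:
        out = [m for m in out if m["open_weights"] == open_weights]
    return sorted(out, key=lambda m: m["intelligence"], reverse=True)

# Open weights models only, best first
print([m["name"] for m in filter_models(MODELS, open_weights=True)])
```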

Click on any model name in the leaderboard to visit its dedicated comparison page with detailed charts covering intelligence, pricing, speed, latency, and more. You can also compare API providers for each model.