LLM Leaderboard - Comparison of GPT-4o, Llama 3, Mistral, Gemini and over 30 models

Comparison and ranking of the performance of over 30 AI models (LLMs) across key metrics, including quality, price, speed (output speed in tokens per second and latency as time to first token, TTFT), context window, and others. For more details, including our methodology, see our FAQs.

For a comparison of the API providers hosting these models, see our API provider comparison.

HIGHLIGHTS

Quality: GPT-4o and Llama 3.1 405B are the highest quality models, followed by Claude 3.5 Sonnet & Llama 3.1 70B.
Output Speed (tokens/s): Mistral NeMo (188 t/s) and Gemini 1.5 Flash (166 t/s) are the fastest models, followed by Sonar Small & Llama 3 8B.
Latency (seconds): Phi-3 Medium 14B (0.21s) and Sonar Small (0.23s) are the lowest latency models, followed by Mistral 7B & Sonar Large.
Price ($ per M tokens): OpenChat 3.5 ($0.14) and Phi-3 Medium 14B ($0.14) are the cheapest models, followed by Gemma 7B & Llama 3 8B.
Context Window: Gemini 1.5 Pro (2m) and Gemini 1.5 Flash (1m) are the largest context window models, followed by Codestral-Mamba & Jamba Instruct.
| Model | Creator | Context window | Quality index | Price (per M tokens, blended) | Output speed (tokens/s) | Latency (s, TTFT) |
|---|---|---|---|---|---|---|
| GPT-4o | OpenAI | 128k | 100 | $7.50 | 82.3 | 0.45 |
| GPT-4 Turbo | OpenAI | 128k | 94 | $15.00 | 29.2 | 0.60 |
| GPT-4o mini | OpenAI | 128k | 88 | $0.26 | 98.9 | 0.55 |
| GPT-4 | OpenAI | 8k | 84 | $37.50 | 25.5 | 0.67 |
| GPT-3.5 Turbo Instruct | OpenAI | 4k | 60 | $1.63 | 116.3 | 0.53 |
| GPT-3.5 Turbo | OpenAI | 16k | 59 | $0.75 | 83.2 | 0.37 |
| Gemini 1.5 Pro | Google | 2m | 95 | $5.25 | 57.7 | 1.07 |
| Gemini 1.5 Flash | Google | 1m | 84 | $0.53 | 166.2 | 1.02 |
| Gemma 2 27B | Google | 8k | 78 | $0.80 | 76.9 | 0.49 |
| Gemma 2 9B | Google | 8k | 71 | $0.20 | 119.9 | 0.29 |
| Gemini 1.0 Pro | Google | 33k | 62 | $0.75 | 87.0 | 2.10 |
| Gemma 7B | Google | 8k | 45 | $0.15 | 147.5 | 0.33 |
| Llama 3.1 405B | Meta | 128k | 100 | $6.50 | 26.9 | 0.61 |
| Llama 3.1 70B | Meta | 128k | 95 | $0.88 | 57.6 | 0.43 |
| Llama 3 70B | Meta | 8k | 83 | $0.90 | 63.5 | 0.45 |
| Llama 3.1 8B | Meta | 128k | 66 | $0.18 | 147.4 | 0.30 |
| Llama 3 8B | Meta | 8k | 64 | $0.17 | 148.1 | 0.32 |
| Llama 2 Chat 70B | Meta | 4k | 57 | $1.00 | 49.7 | 0.80 |
| Mistral Large 2 | Mistral | 128k | 91 | $4.50 | 30.4 | 0.44 |
| Llama 2 Chat 13B | Meta | 4k | 39 | $0.25 | 83.7 | 0.47 |
| Llama 2 Chat 7B | Meta | 4k | 29 | $0.20 | 91.7 | 1.04 |
| Codestral | Mistral | 33k | – | $1.50 | 53.2 | 0.31 |
| Codestral-Mamba | Mistral | 256k | – | $0.25 | 95.6 | 0.43 |
| Mistral Large | Mistral | 33k | 76 | $6.00 | 34.6 | 0.56 |
| Mixtral 8x22B | Mistral | 65k | 71 | $1.20 | 66.1 | 0.32 |
| Mistral Small | Mistral | 33k | 71 | $1.50 | 56.4 | 0.95 |
| Mistral Medium | Mistral | 33k | 70 | $4.05 | 37.9 | 0.66 |
| Mistral NeMo | Mistral | 128k | 64 | $0.30 | 188.0 | 0.32 |
| Mixtral 8x7B | Mistral | 33k | 61 | $0.50 | 89.0 | 0.34 |
| Mistral 7B | Mistral | 33k | 40 | $0.18 | 104.7 | 0.29 |
| Claude 3.5 Sonnet | Anthropic | 200k | 98 | $6.00 | 78.7 | 1.14 |
| Claude 3 Opus | Anthropic | 200k | 93 | $30.00 | 25.6 | 1.95 |
| Claude 3 Sonnet | Anthropic | 200k | 80 | $6.00 | 63.3 | 0.92 |
| Claude 3 Haiku | Anthropic | 200k | 74 | $0.50 | 129.7 | 0.53 |
| Claude 2.0 | Anthropic | 100k | 70 | $12.00 | 39.9 | 1.13 |
| Claude Instant | Anthropic | 100k | 63 | $1.20 | 84.9 | 0.58 |
| Claude 2.1 | Anthropic | 200k | 55 | $12.00 | 38.7 | 1.51 |
| Command Light | Cohere | 4k | – | $0.38 | 37.1 | 0.48 |
| Command | Cohere | 4k | – | $1.44 | 23.9 | 0.45 |
| Command-R+ | Cohere | 128k | 75 | $6.00 | 60.9 | 0.47 |
| Command-R | Cohere | 128k | 63 | $0.75 | 124.0 | 0.40 |
| Sonar Large | Perplexity | 33k | – | $1.00 | 54.8 | 0.29 |
| Sonar Small | Perplexity | 33k | – | $0.20 | 157.2 | 0.23 |
| OpenChat 3.5 | OpenChat | 8k | 50 | $0.14 | 68.8 | 0.35 |
| Phi-3 Medium 14B | Microsoft Azure | 128k | – | $0.14 | 74.0 | 0.21 |
| DBRX | Databricks | 33k | 62 | $1.20 | 82.3 | 0.40 |
| Reka Core | Reka AI | 128k | 90 | $6.00 | 15.8 | 1.34 |
| Reka Flash | Reka AI | 128k | 78 | $1.10 | 31.2 | 0.84 |
| Reka Edge | Reka AI | 64k | 60 | $0.55 | 48.9 | 0.84 |
| Jamba Instruct | AI21 Labs | 256k | 63 | $0.55 | 66.8 | 0.48 |
| DeepSeek-Coder-V2 | DeepSeek | 128k | – | $0.17 | 16.5 | 1.24 |
| DeepSeek-V2 | DeepSeek | 128k | 82 | $0.17 | 16.8 | 1.15 |
| Arctic | Snowflake | 4k | 55 | $2.40 | 72.0 | 0.62 |
| Qwen2 72B | Alibaba | 128k | 83 | $0.90 | 49.7 | 0.35 |
| Yi-Large | 01.AI | 32k | 81 | $3.00 | 73.9 | 0.36 |
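
As an illustration of how the columns above can be combined, here is a minimal Python sketch that ranks a handful of models by a simple quality-per-dollar ratio. The values are hard-coded from a few rows of the table; the `Model` class and the ranking metric are assumptions for illustration, not part of the leaderboard's methodology.

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    quality: int          # quality index from the table (higher is better)
    price: float          # blended USD per million tokens
    output_speed: float   # tokens per second

# A few rows hard-coded from the table above.
MODELS = [
    Model("GPT-4o", 100, 7.50, 82.3),
    Model("Llama 3.1 405B", 100, 6.50, 26.9),
    Model("Claude 3.5 Sonnet", 98, 6.00, 78.7),
    Model("GPT-4o mini", 88, 0.26, 98.9),
    Model("Llama 3.1 8B", 66, 0.18, 147.4),
]

# Rank by quality points per dollar (per million tokens) -- an
# illustrative metric, not one used by the leaderboard itself.
for m in sorted(MODELS, key=lambda m: m.quality / m.price, reverse=True):
    print(f"{m.name:20s} quality/price = {m.quality / m.price:7.1f}")
```

Under this (deliberately crude) metric, the small models dominate: GPT-4o mini and Llama 3.1 8B deliver far more quality per dollar than the frontier models, which is the trade-off the table is designed to surface.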

Key definitions

Quality: Index represents normalized average relative performance across Chatbot Arena, MMLU & MT-Bench.
Context window: Maximum number of combined input & output tokens. Output tokens commonly have a significantly lower limit (varies by model).
Output Speed: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Latency: Time to first token (TTFT) received, in seconds, after the API request is sent.
Price: Price per token, represented as USD per million tokens. Price is a blend of input & output token prices at a 3:1 input:output ratio (see the sketch after these definitions).
Output price: Price per token generated by the model (received from the API), represented as USD per million tokens.
Input price: Price per token included in the request/message sent to the API, represented as USD per million tokens.
Time period: Metrics are 'live' and based on the past 14 days of measurements; measurements are taken 8 times per day for single requests and twice per day for parallel requests.
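
To make the price and speed definitions above concrete, here is a minimal Python sketch. The blended price follows the stated 3:1 input:output weighting; the latency/speed calculation assumes you have recorded the request, first-chunk, and completion timestamps plus the output token count. The function names and arguments are hypothetical, not the leaderboard's actual measurement code.

```python
def blended_price(input_price: float, output_price: float) -> float:
    """Blended USD per million tokens, weighting input:output at 3:1."""
    return (3 * input_price + output_price) / 4

def latency_and_speed(t_request: float, t_first_chunk: float,
                      t_done: float, output_tokens: int) -> tuple[float, float]:
    """Return (latency, output speed) per the definitions above.

    Latency is time to first token (TTFT); output speed is tokens per
    second over the generation window after the first chunk arrives
    (an approximation of the definition above). Timestamps are in
    seconds, e.g. from time.monotonic().
    """
    ttft = t_first_chunk - t_request
    tokens_per_second = output_tokens / (t_done - t_first_chunk)
    return ttft, tokens_per_second

# Sanity check against the table: GPT-4o at $5.00 input / $15.00 output
# per million tokens blends to (3 * 5.00 + 15.00) / 4 = $7.50.
assert blended_price(5.00, 15.00) == 7.50
```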

Models compared:

OpenAI: GPT-3.5 Turbo, GPT-3.5 Turbo (0125), GPT-3.5 Turbo (1106), GPT-3.5 Turbo Instruct, GPT-4, GPT-4 Turbo, GPT-4 Turbo (0125), GPT-4 Vision, GPT-4o, and GPT-4o mini
Google: Gemini 1.0 Pro, Gemini 1.5 Flash, Gemini 1.5 Pro, Gemma 2 27B, Gemma 2 9B, and Gemma 7B
Meta: Code Llama 70B, Llama 2 Chat 13B, Llama 2 Chat 70B, Llama 2 Chat 7B, Llama 3 70B, Llama 3 8B, Llama 3.1 405B, Llama 3.1 70B, and Llama 3.1 8B
Mistral: Codestral, Codestral-Mamba, Mistral 7B, Mistral Large, Mistral Large 2, Mistral Medium, Mistral NeMo, Mistral Small, Mixtral 8x22B, and Mixtral 8x7B
Anthropic: Claude 2.0, Claude 2.1, Claude 3 Haiku, Claude 3 Opus, Claude 3 Sonnet, Claude 3.5 Sonnet, and Claude Instant
Cohere: Command, Command Light, Command-R, and Command-R+
Perplexity: PPLX-70B Online, PPLX-7B-Online, Sonar Large, and Sonar Small
xAI: Grok-1
OpenChat: OpenChat 3.5
Microsoft Azure: Phi-3 Medium 14B and Phi-3 Mini
Databricks: DBRX
Reka AI: Reka Core, Reka Edge, and Reka Flash
AI21 Labs: Jamba Instruct
DeepSeek: DeepSeek-Coder-V2 and DeepSeek-V2
Snowflake: Arctic
Alibaba: Qwen2 72B
01.AI: Yi-Large