Comparison of Models: Quality, Performance & Price Analysis
Comparison and analysis of AI models across key performance metrics, including quality, price, output speed, latency, context window and others. Click on any model to see detailed metrics. For more detail, including on our methodology, see our FAQs.
Model Comparison Summary
Quality: GPT-4o and Llama 3.1 405B are the highest quality models, followed by Claude 3.5 Sonnet & Llama 3.1 70B.
Output Speed (tokens/s): Mistral NeMo (188 t/s) and Gemini 1.5 Flash (166 t/s) are the fastest models, followed by Sonar Small & Llama 3 8B.
Latency (seconds): Phi-3 Medium 14B (0.21s) and Sonar Small (0.23s) are the lowest latency models, followed by Mistral 7B & Sonar Large.
Price ($ per M tokens): OpenChat 3.5 ($0.14) and Phi-3 Medium 14B ($0.14) are the cheapest models, followed by Gemma 7B & Llama 3 8B.
Context Window: Gemini 1.5 Pro (2m) and Gemini 1.5 Flash (1m) are the largest context window models, followed by Codestral-Mamba & Jamba Instruct.
Highlights
Quality
Quality Index; Higher is better
Speed
Output Tokens per Second; Higher is better
Price
USD per 1M Tokens; Lower is better
Quality vs. Output Speed, Price
Quality: General reasoning index; Output Speed: Output Tokens per Second; Price: USD per 1M Tokens
Varied metrics by ability category; Higher is better
General Ability (Chatbot Arena)
Reasoning & Knowledge (MMLU)
Coding (HumanEval)
Different use cases warrant different evaluation tests. Chatbot Arena is a good evaluation of communication abilities, while MMLU tests reasoning and knowledge more comprehensively.
Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Output Speed (output tokens per second).
Median across providers: Figures represent median (P50) across all providers which support the model.
Quality vs. Context window, Input token price
Quality: General reasoning index; Context window: Token limit; Input Price: USD per 1M Input Tokens
Most attractive quadrant
Size represents Input Price (USD per M Input Tokens)
Quality: Index represents normalized average relative performance across Chatbot Arena, MMLU & MT-Bench.
Context window: Maximum number of combined input & output tokens. Output tokens commonly have a significantly lower limit (varies by model).
Input price: Price per token included in the request/message sent to the API, represented as USD per million Tokens.
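For concreteness, here is a minimal sketch of how a normalized-average index like the one above could be computed, assuming simple min-max scaling per benchmark followed by an unweighted average. The model names and scores are made up, and the actual methodology (see FAQs) may differ.

```python
# Hypothetical sketch of a normalized quality index.
# Assumes min-max scaling per benchmark, then an unweighted average;
# the actual methodology may differ.

def normalize(scores: dict) -> dict:
    """Min-max scale a {model: score} mapping to the 0-1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    return {model: (s - lo) / (hi - lo) for model, s in scores.items()}

def quality_index(benchmarks: list) -> dict:
    """Average each model's normalized score across all benchmarks."""
    scaled = [normalize(b) for b in benchmarks]
    return {m: sum(b[m] for b in scaled) / len(scaled) for m in benchmarks[0]}

# Made-up scores for illustration only:
arena   = {"model-a": 1250, "model-b": 1180, "model-c": 1100}  # Elo
mmlu    = {"model-a": 0.88, "model-b": 0.82, "model-c": 0.70}  # accuracy
mtbench = {"model-a": 9.1,  "model-b": 8.4,  "model-c": 7.2}   # 1-10 score
print(quality_index([arena, mmlu, mtbench]))
# {'model-a': 1.0, 'model-b': ~0.61, 'model-c': 0.0}
```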
Context window
Context window: Token limit; Higher is better
Larger context windows are relevant to RAG (Retrieval Augmented Generation) LLM workflows, which typically involve retrieving and reasoning over large amounts of data.
Context window: Maximum number of combined input & output tokens. Output tokens commonly have a significantly lower limit (varies by model).
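As a rough illustration of why the window matters for RAG, the sketch below checks whether a prompt assembled from retrieved chunks fits within a model's combined token limit. The limits and the ~4-characters-per-token heuristic are assumptions for illustration; real tokenizers and model limits vary.

```python
# Rough token-budget check for a RAG prompt (illustrative values only).
CONTEXT_WINDOW = 128_000  # combined input + output token limit (assumed)
MAX_OUTPUT = 4_096        # output limit is commonly much lower (assumed)

def approx_tokens(text: str) -> int:
    """Crude ~4-characters-per-token heuristic; real tokenizers differ."""
    return max(1, len(text) // 4)

def prompt_fits(system: str, chunks: list, question: str) -> bool:
    input_tokens = approx_tokens(system) + approx_tokens(question)
    input_tokens += sum(approx_tokens(c) for c in chunks)
    # Input plus the tokens reserved for the answer must fit the window.
    return input_tokens + MAX_OUTPUT <= CONTEXT_WINDOW
```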
While higher quality models are typically more expensive, they do not all follow the same price-quality curve.
Quality: Index represents normalized average relative performance across Chatbot Arena, MMLU & MT-Bench.
Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).
Median across providers: Figures represent median (P50) across all providers which support the model.
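The blended figure above combines input and output prices at a 3:1 ratio; a minimal worked example, assuming the ratio weights input tokens three to one against output tokens:

```python
def blended_price(input_price: float, output_price: float) -> float:
    """Blend per-1M-token prices at a 3:1 input:output ratio."""
    return (3 * input_price + output_price) / 4

# e.g. $5.00/M input and $15.00/M output blend to $7.50/M:
print(blended_price(5.00, 15.00))  # 7.5
```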
Pricing: Input and Output prices
Price: USD per 1M Tokens
Input price
Output price
Prices vary considerably, including between input and output token prices, and can differ by over an order of magnitude (>10X) between the most expensive and cheapest models.
Input price: Price per token included in the request/message sent to the API, represented as USD per million Tokens.
Output price: Price per token generated by the model (received from the API), represented as USD per million Tokens.
Median across providers: Figures represent median (P50) across all providers which support the model.
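To see how the input/output split affects real spend, here is a minimal cost calculation for a single API call; the token counts and prices are made-up examples.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """USD cost of one request, given prices in USD per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1e6

# A 10,000-token prompt with a 500-token reply at $0.50/M in, $1.50/M out:
print(f"${request_cost(10_000, 500, 0.50, 1.50):.5f}")  # $0.00575
```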
Output Speed

Output Speed: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Median across providers: Figures represent median (P50) across all providers which support the model.
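The sketch below shows how latency (time to the first chunk) and output speed (tokens per second after the first chunk) could be measured from a streaming API response; `stream_chunks`, an iterable of per-chunk token counts, is a hypothetical stand-in for whatever streaming client is in use.

```python
import time

def measure_stream(stream_chunks):
    """Return (latency_s, output_speed_tps) from a stream of token counts.

    Latency is seconds to the first chunk; output speed counts only the
    time spent generating after that first chunk arrives.
    """
    start = time.perf_counter()
    first = last = None
    tokens = 0
    for n in stream_chunks:  # hypothetical: yields tokens per chunk
        last = time.perf_counter()
        if first is None:
            first = last
        tokens += n
    latency = first - start
    output_speed = tokens / (last - first)  # assumes 2+ chunks received
    return latency, output_speed
```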
Output Speed by Input token (context) length
Output Tokens per Second; Higher is better
Short (100 tokens)
Medium (1,000 tokens)
Long (10,000 tokens)
Output Speed: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Input Token Length: Number of tokens provided in the request. See the Prompt Options above for benchmarks of different input prompt lengths across other charts.
Median across providers: Figures represent median (P50) across all providers which support the model.
Output Speed Variance
Output Tokens per Second; Results by percentile; Higher is better
Median shown as the central point; other points represent the 5th, 25th, 75th, and 95th percentiles respectively
Output Speed: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Boxplot: Shows variance of measurements
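A small sketch of how the percentile points for such a boxplot could be derived from repeated measurements (the sample values are made up):

```python
import statistics

def speed_percentiles(samples):
    """5th/25th/50th/75th/95th percentiles of repeated speed measurements."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    return {f"p{k}": cuts[k - 1] for k in (5, 25, 50, 75, 95)}

samples = [92.1, 88.4, 95.0, 90.2, 89.7, 93.5, 91.8, 87.9]  # made-up t/s
print(speed_percentiles(samples))
```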
Output Speed, Over Time
Output Tokens per Second; Higher is better
Output Speed: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Over time measurement: Median measurement per day, based on 8 measurements each day at different times. Labels represent the start of each week's measurements.
Median across providers: Figures represent median (P50) across all providers which support the model.
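A minimal sketch of that aggregation, collapsing timestamped measurements into one median per day (the timestamps and values are illustrative):

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

def daily_medians(measurements):
    """Collapse (timestamp, value) pairs, e.g. 8 per day, into a
    {date: median_value} series as plotted in the over-time charts."""
    by_day = defaultdict(list)
    for ts, value in measurements:
        by_day[ts.date()].append(value)
    return {day: median(vals) for day, vals in sorted(by_day.items())}

# Illustrative: two measurements on one day
data = [(datetime(2024, 7, 1, 3), 88.0), (datetime(2024, 7, 1, 15), 92.0)]
print(daily_medians(data))  # {datetime.date(2024, 7, 1): 90.0}
```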
Latency

Seconds to First Token Chunk Received; Lower is better
Latency: Time to first token received, in seconds, after the API request is sent.
Median across providers: Figures represent median (P50) across all providers which support the model.
Latency by Input token (context) length
Seconds to First Token Chunk Received; Lower is better
Short (100 tokens)
Medium (1,000 tokens)
Long (10,000 tokens)
Input Token Length: Number of tokens provided in the request. See the Prompt Options above for benchmarks of different input prompt lengths across other charts.
Latency: Time to first token received, in seconds, after the API request is sent.
Median across providers: Figures represent median (P50) across all providers which support the model.
Latency Variance
Seconds to First Token Chunk Received; Results by percentile; Lower is better
Median shown as the central point; other points represent the 5th, 25th, 75th, and 95th percentiles respectively
Latency: Time to first token received, in seconds, after the API request is sent.
Boxplot: Shows variance of measurements
Latency, Over Time
Seconds to First Token Chunk Received; Lower is better
Latency: Time to first token received, in seconds, after the API request is sent.
Over time measurement: Median measurement per day, based on 8 measurements each day at different times. Labels represent the start of each week's measurements.
Median across providers: Figures represent median (P50) across all providers which support the model.
Total Response Time
Time to receive 100 output tokens, calculated from the latency and output speed metrics
The speed difference between the fastest and slowest models is >3X. There is not always a correlation between parameter size and speed, nor between price and speed.
Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Output Speed (output tokens per second).
Median across providers: Figures represent median (P50) across all providers which support the model.
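The estimate combines the two measured metrics directly, as in this sketch (the example latency and speed values are made up):

```python
def total_response_time(latency_s, output_speed_tps, output_tokens=100):
    """Estimated seconds to receive `output_tokens` tokens: time to the
    first chunk plus generation time at the measured output speed."""
    return latency_s + output_tokens / output_speed_tps

# e.g. 0.5 s latency at 80 tokens/s: 0.5 + 100 / 80 = 1.75 s
print(total_response_time(0.5, 80.0))  # 1.75
```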
Total Response Time by Input token (context) length
Seconds to Output 100 Tokens; Lower is better
Short (100 tokens)
Medium (1,000 tokens)
Long (10,000 tokens)
Input Token Length: Number of tokens provided in the request. See the Prompt Options above for benchmarks of different input prompt lengths across other charts.
Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Output Speed (output tokens per second).
Median across providers: Figures represent median (P50) across all providers which support the model.
Total Response Time Variance
Total Response Time: Seconds to Output 100 Tokens; Results by percentile; Lower is better
Median shown as the central point; other points represent the 5th, 25th, 75th, and 95th percentiles respectively
Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Output Speed (output tokens per second).
Boxplot: Shows variance of measurements
Total Response Time, Over Time
Seconds to Output 100 Tokens; Lower is better
Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Output Speed (output tokens per second).
Over time measurement: Median measurement per day, based on 8 measurements each day at different times. Labels represent the start of each week's measurements.
Median across providers: Figures represent median (P50) across all providers which support the model.