Comparison of Models: Quality, Performance & Price Analysis

Comparison and analysis of AI models across key metrics, including quality, price, performance and speed (throughput in tokens per second & latency), context window, and others. Click on any model to see detailed metrics. For more details, including on our methodology, see our FAQs.

Model Comparison Summary

Quality: GPT-4o and GPT-4 Turbo are the highest quality models, followed by Claude 3 Opus & Llama 3 (70B).
Throughput (tokens/s): Gemma 7B (160 t/s) and Gemini 1.5 Flash (141 t/s) are the fastest models, followed by Llama 3 (8B) & GPT-3.5 Turbo Instruct.
Latency (seconds): Mistral 7B (0.23s) and Mistral Medium (0.24s) are the lowest latency models, followed by Mixtral 8x7B & Mixtral 8x22B.
Price ($ per M tokens): Gemma 7B ($0.15) and OpenChat 3.5 ($0.17) are the cheapest models, followed by DeepSeek-V2 & Llama 3 (8B).
Context Window: Gemini 1.5 Flash (1m) and Gemini 1.5 Pro (1m) are the largest context window models, followed by Claude 3 Opus & Claude 3 Sonnet.

Highlights

Quality: Quality Index; Higher is better
Speed: Throughput in Tokens per Second; Higher is better
Price: USD per 1M Tokens; Lower is better

Quality vs. Throughput, Price

Quality: General reasoning index, Throughput: Tokens per Second, Price: USD per 1M Tokens
Most attractive quadrant
Size represents Price (USD per M Tokens)
There is a trade-off between model quality and throughput, with higher quality models typically having lower throughput.
Quality: Index represents normalized average relative performance across Chatbot Arena, MMLU & MT-Bench.
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).
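As a rough illustration of the 3:1 blend described above, here is a minimal sketch in Python. The weighting (three parts input price to one part output price) and the prices used are assumptions for illustration, not the exact methodology.

```python
def blended_price(input_price_per_m: float, output_price_per_m: float) -> float:
    """Blend input and output token prices at an assumed 3:1 input:output ratio,
    i.e. a weighted average of three parts input price to one part output price."""
    return (3 * input_price_per_m + output_price_per_m) / 4

# Hypothetical prices in USD per 1M tokens, for illustration only.
print(blended_price(5.00, 15.00))  # -> 7.5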

Quality & Context window

Quality comparison by ability

Varied metrics by ability categorization; Higher is better
General Ability (Chatbot Arena)
Reasoning & Knowledge (MMLU)
Reasoning & Knowledge (MT Bench)
Coding (HumanEval)
Different use-cases warrant considering different evaluation tests. Chatbot Arena is a good evaluation of communication abilities, while MMLU tests reasoning and knowledge more comprehensively.
Total Response Time: Time to receive a 100-token response. Estimated based on Latency (time to receive first chunk) and Throughput (tokens per second).
Median across providers: Figures represent median (P50) across all providers which support the model.

Quality vs. Context window, Input token price

Quality: General reasoning index, Context window: Token limit, Input Price: USD per 1M Tokens
Most attractive quadrant
Size represents Input Price (USD per M Tokens)
Quality: Index represents normalized average relative performance across Chatbot Arena, MMLU & MT-Bench.
Context window: Maximum number of combined input & output tokens. Output tokens commonly have a significantly lower limit (varied by model).
Input price: Price per token included in the request/message sent to the API, represented as USD per million Tokens.
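The precise normalization behind the Quality Index is not spelled out on this page. The sketch below shows one plausible approach (min-max scaling each benchmark across models, then averaging); the scaling method, equal weighting and the scores themselves are all assumptions for illustration.

```python
# Hypothetical benchmark scores per model, for illustration only.
scores = {
    "model_a": {"chatbot_arena": 1250, "mmlu": 0.86, "mt_bench": 9.3},
    "model_b": {"chatbot_arena": 1150, "mmlu": 0.79, "mt_bench": 8.6},
    "model_c": {"chatbot_arena": 1050, "mmlu": 0.70, "mt_bench": 8.0},
}

def quality_index(scores: dict) -> dict:
    """Min-max normalize each benchmark across models, then average per model
    (an assumed stand-in for 'normalized average relative performance')."""
    benchmarks = next(iter(scores.values())).keys()
    normalized = {model: [] for model in scores}
    for bench in benchmarks:
        values = [s[bench] for s in scores.values()]
        lo, hi = min(values), max(values)
        for model, s in scores.items():
            normalized[model].append((s[bench] - lo) / (hi - lo))
    return {model: sum(v) / len(v) for model, v in normalized.items()}

print(quality_index(scores))  # model_a -> 1.0, model_c -> 0.0
```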

Context window

Context window: Token limit; Higher is better
Larger context windows are relevant to RAG (Retrieval-Augmented Generation) LLM workflows, which typically involve retrieval of, and reasoning over, large amounts of data.
Context window: Maximum number of combined input & output tokens. Output tokens commonly have a significantly lower limit (varied by model).
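To make the "combined input & output tokens" definition concrete, here is a minimal sketch that checks whether a request fits a model's context window; the token counts and the 128k limit are hypothetical.

```python
def fits_context_window(input_tokens: int, max_output_tokens: int,
                        context_window: int) -> bool:
    """Check whether combined input + output tokens fit within the context window.
    Many models also impose a separate, lower cap on output tokens (not modeled here)."""
    return input_tokens + max_output_tokens <= context_window

# Hypothetical RAG-style request: a 120k-token prompt plus a 4k-token completion
# against an assumed 128k-token context window.
print(fits_context_window(120_000, 4_000, 128_000))  # -> True
```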

Quality vs. Price

While higher quality models are typically more expensive, they do not all follow the same price-quality curve.
Quality: Index represents normalized average relative performance across Chatbot Arena, MMLU & MT-Bench.
Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).
Median across providers: Figures represent median (P50) across all providers which support the model.

Pricing: Input and Output prices

USD per 1M Tokens
Input price
Output price
Prices vary considerably, including between input and output token prices. Prices can vary by orders of magnitude (>10X) between the most expensive and cheapest models.
Input price: Price per token included in the request/message sent to the API, represented as USD per million Tokens.
Output price: Price per token generated by the model (received from the API), represented as USD per million Tokens.
Median across providers: Figures represent median (P50) across all providers which support the model.
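As a simple illustration of how the separate input and output prices combine into the cost of a single request, here is a minimal sketch; the token counts and prices are hypothetical.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request in USD, given prices quoted per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Hypothetical: 2,000 input tokens and 500 output tokens at $5 / $15 per 1M tokens.
print(request_cost(2_000, 500, 5.00, 15.00))  # -> 0.0175
```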

Performance summary

Throughput vs. Price

Higher throughput does not necessarily come at a higher price; some of the fastest models, such as Gemma 7B, are also among the cheapest.
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).

Latency vs. Throughput

Latency: Seconds to First Token Chunk Received, Throughput: Tokens per Second
Most attractive quadrant
Size represents Price (USD per M Tokens)
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Latency: Time to first chunk of tokens received, in seconds, after the API request is sent.
Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).
Median across providers: Figures represent median (P50) across all providers which support the model.
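The latency and throughput definitions above can be made concrete with a small measurement sketch. The stream below is simulated; a real harness would iterate over chunks returned by a streaming API, and the measurement details are assumptions.

```python
import time

def measure_stream(chunks):
    """Measure latency (seconds to first chunk) and throughput (tokens per second
    after the first chunk) over an iterable yielding per-chunk token counts."""
    start = time.monotonic()
    first_chunk_time = None
    tokens_after_first = 0
    for token_count in chunks:
        now = time.monotonic()
        if first_chunk_time is None:
            first_chunk_time = now
        else:
            tokens_after_first += token_count
    generating_time = time.monotonic() - first_chunk_time
    latency = first_chunk_time - start
    throughput = tokens_after_first / generating_time if generating_time > 0 else 0.0
    return latency, throughput

def fake_stream():
    """Simulated stream: ~0.3 s to the first chunk, then five 20-token chunks."""
    time.sleep(0.3)
    yield 20
    for _ in range(5):
        time.sleep(0.1)
        yield 20

latency, throughput = measure_stream(fake_stream())
print(f"latency={latency:.2f}s, throughput={throughput:.0f} tok/s")
```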

Speed

Measured by Throughput (tokens per second)

Throughput

Output Tokens per Second; Higher is better
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Median across providers: Figures represent median (P50) across all providers which support the model.

Throughput Variance

Output Tokens per Second; Results by percentile; Higher median is better
Median; other points represent the 5th, 25th, 75th and 95th percentiles respectively
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Boxplot: Shows variance of measurements
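As an illustration of how the percentile points in the variance chart might be derived from repeated measurements, here is a minimal sketch using Python's standard library; the measurements are hypothetical and the interpolation method is an assumption.

```python
import statistics

# Hypothetical throughput measurements (tokens/s) for one model.
measurements = [88, 91, 95, 97, 99, 101, 103, 104, 107, 110,
                112, 114, 116, 119, 121, 124, 128, 133, 140, 152]

# quantiles(n=20) returns 19 cut points at 5% intervals.
q = statistics.quantiles(measurements, n=20, method="inclusive")
p5, p25, p50, p75, p95 = q[0], q[4], q[9], q[14], q[18]
print(p5, p25, p50, p75, p95)
```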

Throughput, Over Time

Output Tokens per Second; Higher is better
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Over time measurement: Median measurement per day, based on 8 measurements each day at different times. Labels represent the start of each week's measurements.
Median across providers: Figures represent median (P50) across all providers which support the model.
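A minimal sketch of the over-time aggregation described above (group timestamped measurements by day, then take the daily median) might look like the following; the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import median

def daily_medians(measurements):
    """Aggregate (date, value) measurements into one median value per day."""
    by_day = defaultdict(list)
    for day, value in measurements:
        by_day[day].append(value)
    return {day: median(values) for day, values in sorted(by_day.items())}

# Hypothetical throughput samples across two days (the page reports 8 per day).
samples = [("2024-05-20", 98), ("2024-05-20", 104), ("2024-05-20", 101),
           ("2024-05-21", 95), ("2024-05-21", 99)]
print(daily_medians(samples))  # {'2024-05-20': 101, '2024-05-21': 97.0}
```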

Latency

Measured by Time (seconds) to First Token

Latency

Seconds to First Token Chunk Received; Lower is better
Latency: Time to first chunk of tokens received, in seconds, after the API request is sent.
Median across providers: Figures represent median (P50) across all providers which support the model.

Latency Variance

Seconds to First Token Chunk Received; Results by percentile; Lower median is better
Median; other points represent the 5th, 25th, 75th and 95th percentiles respectively
Latency: Time to first chunk of tokens received, in seconds, after the API request is sent.
Boxplot: Shows variance of measurements

Latency, Over Time

Seconds to First Token Chunk Received; Lower median is better
Latency: Time to first chunk of tokens received, in seconds, after the API request is sent.
Over time measurement: Median measurement per day, based on 8 measurements each day at different times. Labels represent the start of each week's measurements.
Median across providers: Figures represent median (P50) across all providers which support the model.

Total Response Time

Time to receive a 100-token output, calculated from latency and throughput metrics

Total Response Time

Seconds to Output 100 Tokens; Lower is better
The speed difference between the fastest and slowest models is >3X. There is not always a correlation between parameter size and speed, or between price and speed.
Total Response Time: Time to receive a 100-token response. Estimated based on Latency (time to receive first chunk) and Throughput (tokens per second).
Median across providers: Figures represent median (P50) across all providers which support the model.
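The Total Response Time estimate can be expressed as latency plus the time to generate the output at the measured throughput. The sketch below assumes all 100 output tokens are produced at the throughput rate, which is a simplification of the estimation method; the latency and throughput values are hypothetical.

```python
def total_response_time(latency_s: float, throughput_tok_s: float,
                        output_tokens: int = 100) -> float:
    """Estimated seconds to receive an output_tokens response: time to first chunk
    plus output_tokens generated at the measured throughput (an assumed formula)."""
    return latency_s + output_tokens / throughput_tok_s

# Hypothetical: 0.40 s latency and 90 tokens/s throughput for a 100-token response.
print(round(total_response_time(0.40, 90), 2))  # -> 1.51
```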

Total Response Time, Over Time

Seconds to Output 100 Tokens; Lower is better
Total Response Time: Time to receive a 100-token response. Estimated based on Latency (time to receive first chunk) and Throughput (tokens per second).
Over time measurement: Median measurement per day, based on 8 measurements each day at different times. Labels represent the start of each week's measurements.
Median across providers: Figures represent median (P50) across all providers which support the model.

Further details

OpenAI: GPT-4o, GPT-4 Turbo, GPT-4, GPT-3.5 Turbo, GPT-3.5 Turbo Instruct
Google: Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 1.0 Pro, Gemma 7B Instruct
Meta: Llama 3 Instruct (70B), Llama 3 Instruct (8B), Code Llama Instruct (70B), Llama 2 Chat (70B), Llama 2 Chat (13B), Llama 2 Chat (7B)
Mistral: Mixtral 8x22B Instruct, Mistral Large, Mistral Medium, Mistral Small, Mixtral 8x7B Instruct, Mistral 7B Instruct
Anthropic: Claude 3 Opus, Claude 3 Sonnet, Claude 3 Haiku, Claude 2.0, Claude 2.1, Claude Instant
Cohere: Command Light, Command, Command-R+, Command-R
Perplexity: PPLX-70B Online, PPLX-7B-Online
OpenChat: OpenChat 3.5 (1210)
Databricks: DBRX Instruct
DeepSeek: DeepSeek-V2-Chat
Snowflake: Arctic Instruct

Models compared:
OpenAI: GPT-3.5 Turbo, GPT-3.5 Turbo (0125), GPT-3.5 Turbo (1106), GPT-3.5 Turbo Instruct, GPT-4, GPT-4 Turbo, GPT-4 Turbo (0125), GPT-4 Vision, and GPT-4o
Google: Gemini 1.0 Pro, Gemini 1.5 Flash, Gemini 1.5 Pro, and Gemma 7B
Meta: Code Llama (70B), Llama 2 Chat (13B), Llama 2 Chat (70B), Llama 2 Chat (7B), Llama 3 (70B), and Llama 3 (8B)
Mistral: Mistral 7B, Mistral Large, Mistral Medium, Mistral Small, Mixtral 8x22B, and Mixtral 8x7B
Anthropic: Claude 2.0, Claude 2.1, Claude 3 Haiku, Claude 3 Opus, Claude 3 Sonnet, and Claude Instant
Cohere: Command, Command Light, Command-R, and Command-R+
Perplexity: PPLX-70B Online and PPLX-7B-Online
xAI: Grok-1
OpenChat: OpenChat 3.5
Microsoft Azure: Phi-3-mini
Databricks: DBRX
DeepSeek: DeepSeek-V2
Snowflake: Arctic