Comparison of Models: Quality, Performance & Price Analysis

Comparison and analysis of AI models across key metrics including quality, price, performance and speed (throughput in tokens per second and latency), context window, and others. Click on any model to see detailed metrics. For more details, including our methodology, see our FAQs.

Models compared:
OpenAI: GPT-3.5 Turbo, GPT-3.5 Turbo (0125), GPT-3.5 Turbo (1106), GPT-3.5 Turbo Instruct, GPT-4, GPT-4 Turbo, GPT-4 Turbo (0125), and GPT-4 Vision
Google: Gemini 1.0 Pro, Gemini 1.5 Pro, and Gemma 7B
Meta: Code Llama (70B), Llama 2 Chat (13B), Llama 2 Chat (70B), Llama 2 Chat (7B), Llama 3 (70B), and Llama 3 (8B)
Mistral: Mistral 7B, Mistral Large, Mistral Medium, Mistral Small, Mixtral 8x22B, and Mixtral 8x7B
Anthropic: Claude 2.0, Claude 2.1, Claude 3 Haiku, Claude 3 Opus, Claude 3 Sonnet, and Claude Instant
Cohere: Command, Command Light, Command-R, and Command-R+
Perplexity: PPLX-70B Online and PPLX-7B-Online
xAI: Grok-1
OpenChat: OpenChat 3.5
Microsoft Azure: Phi-3-mini
Databricks: DBRX

Model Comparison Summary

Quality: Claude 3 Opus and GPT-4 Vision are the highest quality models, followed by GPT-4 Turbo, GPT-4 & Llama 3 (70B).
Throughput (tokens/s): Llama 3 (8B) (216 t/s) and Gemma 7B (165 t/s) are the fastest models, followed by Command-R, GPT-3.5 Turbo Instruct & Mixtral 8x7B.
Latency (seconds): Command-R (0.15s) and Command-R+ (0.16s) are the lowest latency models, followed by Mistral Small, Mistral Medium & Command Light.
Blended Price ($/M tokens): Llama 3 (8B) ($0.14) and Gemma 7B ($0.15) are the cheapest models, followed by OpenChat 3.5, Llama 2 Chat (7B) & Mistral 7B.
Context Window Size: Gemini 1.5 Pro (1m) and Claude 3 Opus (200k) are the largest context window models, followed by Claude 3 Sonnet, Claude 3 Haiku & Claude 2.1.

Highlights

Quality
Quality Index; Higher is better
Speed
Throughput in Tokens per Second; Higher is better
Price
USD per 1M Tokens; Lower is better

Quality & Context window

Quality comparison by ability

Metrics vary by ability category; Higher is better
General Ability (Chatbot Arena)
Reasoning & Knowledge (MMLU)
Reasoning & Knowledge (MT Bench)
Coding (HumanEval)
Different use-cases warrant considering different evaluation tests. Chatbot Arena is a good evaluation of communication abilities while MMLU tests reasoning and knowledge more comprehensively.
Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Throughput (tokens per second).
Median across providers: Figures represent median (P50) across all providers which support the model.

Quality vs. Context window, Input token price

Quality: General reasoning index, Context window: Token limit, Input Price: USD per 1M Tokens
Most attractive quadrant
Size represents Input Price (USD per M Tokens)
Quality: Index represents normalized average relative performance across Chatbot Arena, MMLU & MT-Bench.
Context window: Maximum number of combined input & output tokens. Output tokens commonly have a significantly lower limit (varies by model).
Input price: Price per token included in the request/message sent to the API, represented as USD per million Tokens.
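The exact normalization behind the quality index is not detailed on this page; the sketch below assumes a simple min-max normalization of each benchmark's scores across the compared models, followed by an unweighted average. All model names and scores are illustrative placeholders.

```python
# Hypothetical illustration of a "normalized average" quality index.
# Assumes min-max normalization per benchmark, then a simple mean;
# the actual methodology may differ. Scores are made up.
benchmarks = {
    "chatbot_arena": {"model_a": 1250, "model_b": 1180, "model_c": 1100},
    "mmlu":          {"model_a": 0.86, "model_b": 0.79, "model_c": 0.70},
    "mt_bench":      {"model_a": 9.0,  "model_b": 8.3,  "model_c": 7.6},
}

def normalized_quality_index(benchmarks):
    models = next(iter(benchmarks.values())).keys()
    index = {}
    for model in models:
        normalized = []
        for scores in benchmarks.values():
            lo, hi = min(scores.values()), max(scores.values())
            normalized.append((scores[model] - lo) / (hi - lo))
        index[model] = sum(normalized) / len(normalized)
    return index

print(normalized_quality_index(benchmarks))
# e.g. {'model_a': 1.0, 'model_b': ~0.53, 'model_c': 0.0}
```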

Context window

Context window: Token limit; Higher is better
Larger context windows are relevant to RAG (Retrieval Augmented Generation) LLM workflows, which typically involve retrieving and reasoning over large amounts of data.
Context window: Maximum number of combined input & output tokens. Output tokens commonly have a significantly lower limit (varies by model).
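Because the context window covers input and output tokens combined, a request only fits when the prompt tokens plus the requested maximum output stay within the limit. A minimal sketch of that check, using the window sizes cited above plus an assumed 128k limit for GPT-4 Turbo (the helper and token counts are illustrative, not any provider's API):

```python
# Hypothetical check that a request fits a model's context window.
# Context window limits cover input + output tokens combined.
CONTEXT_WINDOWS = {            # illustrative limits in tokens
    "gemini-1.5-pro": 1_000_000,
    "claude-3-opus": 200_000,
    "gpt-4-turbo": 128_000,    # assumed limit for illustration
}

def fits_context(model: str, prompt_tokens: int, max_output_tokens: int) -> bool:
    """Return True if prompt + requested output fit within the model's window."""
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOWS[model]

print(fits_context("gpt-4-turbo", prompt_tokens=120_000, max_output_tokens=4_000))  # True
print(fits_context("gpt-4-turbo", prompt_tokens=126_000, max_output_tokens=4_000))  # False
```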

Quality vs. Price

While higher quality models are typically more expensive, they do not all follow the same price-quality curve.
Quality: Index represents normalized average relative performance across Chatbot Arena, MMLU & MT-Bench.
Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).
Median across providers: Figures represent median (P50) across all providers which support the model.
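The blended price above weights input and output token prices 3:1. A minimal sketch of that blend (the prices passed in are placeholders, not quoted rates):

```python
# Blended price per million tokens, assuming the 3:1 input:output weighting
# described above. Example prices are placeholders.
def blended_price(input_price_per_m: float, output_price_per_m: float) -> float:
    """Blend input and output prices at a 3:1 ratio (USD per 1M tokens)."""
    return (3 * input_price_per_m + 1 * output_price_per_m) / 4

print(blended_price(input_price_per_m=10.0, output_price_per_m=30.0))  # 15.0
```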

Pricing: Input and Output prices

USD per 1M Tokens
Input price
Output price
Prices vary considerably, including between input and output token prices. Prices can vary by more than an order of magnitude (>10X) between the most expensive and cheapest models.
Input price: Price per token included in the request/message sent to the API, represented as USD per million Tokens.
Output price: Price per token generated by the model (received from the API), represented as USD per million Tokens.
Median across providers: Figures represent median (P50) across all providers which support the model.
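Given separate input and output prices, the cost of a single request is simply the two token counts weighted by their respective per-million rates. A minimal sketch (token counts and prices are illustrative):

```python
# Cost of one API request given separate input and output prices
# (USD per 1M tokens). Values are illustrative, not quoted rates.
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# e.g. 2,000 prompt tokens and 500 generated tokens:
print(request_cost(2_000, 500, input_price_per_m=10.0, output_price_per_m=30.0))
# 0.02 + 0.015 = 0.035 USD
```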

Performance summary

Quality vs. Throughput, Price

Quality: General reasoning index, Throughput: Tokens per Second, Price: USD per 1M Tokens
Most attractive quadrant
Size represents Price (USD per M Tokens)
There is a trade-off between model quality and throughput, with higher quality models typically having lower throughput.
Quality: Index represents normalized average relative performance across Chatbot Arena, MMLU & MT-Bench.
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).

Throughput vs. Price

There is a trade-off between model quality and throughput, with higher quality models typically having lower throughput.
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).

Latency vs. Throughput

Latency: Seconds to First Token Chunk Received, Throughput: Tokens per Second
Most attractive quadrant
Size represents Price (USD per M Tokens)
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Latency: Time to first token received, in seconds, after the API request is sent.
Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).
Median across providers: Figures represent median (P50) across all providers which support the model.

Latency vs. Throughput: Provider & Model combinations

Latency: Seconds to First Token Chunk Received, Throughput: Tokens per Second
Most attractive quadrant
Size represents Price (USD per M Tokens)
GPT-4 Turbo (OpenAI)
GPT-4 Turbo (Azure)
GPT-3.5 Turbo (OpenAI)
GPT-3.5 Turbo (Azure)
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Latency: Time to first token received, in seconds, after the API request is sent.
Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).
Median across providers: Figures represent median (P50) across all providers which support the model.

Quality vs. Throughput: Provider & Model combinations

Quality: General reasoning index, Throughput: Tokens per Second, Price: USD per 1M Tokens
Most attractive quadrant
Size represents Price (USD per M Tokens)
GPT-4 Turbo (OpenAI)
GPT-4 Turbo (Azure)
GPT-3.5 Turbo (OpenAI)
GPT-3.5 Turbo (Azure)
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Latency: Time to first token received, in seconds, after the API request is sent.
Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).
Median across providers: Figures represent median (P50) across all providers which support the model.

Speed

Measured by Throughput (tokens per second)

Throughput

Output Tokens per Second; Higher is better
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Median across providers: Figures represent median (P50) across all providers which support the model.
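Both latency (time to the first chunk) and throughput (tokens per second after the first chunk) can be measured by timing a streaming response. The sketch below assumes a generic stream_completion callable that yields a token count per chunk; it is only an illustration of the definitions above, not the benchmarking harness behind these figures.

```python
# Hypothetical timing of a streaming completion. `stream_completion` is a
# placeholder for whatever streaming client is used; each chunk is assumed
# to carry a token count. Not the actual measurement harness for this page.
import time

def measure(stream_completion, prompt: str):
    start = time.monotonic()
    first_chunk_time = None
    tokens_after_first = 0

    for chunk_tokens in stream_completion(prompt):  # yields token counts per chunk
        now = time.monotonic()
        if first_chunk_time is None:
            first_chunk_time = now                  # latency: time to first chunk
        else:
            tokens_after_first += chunk_tokens

    latency = first_chunk_time - start
    generation_time = time.monotonic() - first_chunk_time
    throughput = tokens_after_first / generation_time if generation_time > 0 else 0.0
    return latency, throughput
```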

Throughput Variance

Output Tokens per Second; Results by percentile; Higher median is better
Center point represents the median; other points represent the 5th, 25th, 75th, and 95th percentiles respectively.
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Boxplot: Shows variance of measurements
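The variance charts summarize repeated measurements as a set of percentiles. A minimal sketch of computing those points with NumPy (the sample measurements are made up):

```python
# Percentile summary used by the variance charts: 5th, 25th, median (50th),
# 75th, and 95th percentiles of repeated throughput measurements.
# Sample values are made up.
import numpy as np

measurements = np.array([92, 101, 88, 110, 97, 105, 95, 120, 85, 99])  # tokens/s
p5, p25, p50, p75, p95 = np.percentile(measurements, [5, 25, 50, 75, 95])
print(f"median={p50:.1f} t/s, IQR=[{p25:.1f}, {p75:.1f}], tails=[{p5:.1f}, {p95:.1f}]")
```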

Throughput, Over Time

Output Tokens per Second; Higher is better
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Over time measurement: Median measurement per day, based on 8 measurements each day at different times. Labels mark the start of each week's measurements.
Median across providers: Figures represent median (P50) across all providers which support the model.

Latency

Measured by Time (seconds) to First Token

Latency

Seconds to First Token Chunk Received; Lower is better
Latency: Time to first token received, in seconds, after the API request is sent.
Median across providers: Figures represent median (P50) across all providers which support the model.

Latency Variance

Seconds to First Token Chunk Received; Results by percentile; Lower median is better
Center point represents the median; other points represent the 5th, 25th, 75th, and 95th percentiles respectively.
Latency: Time to first token received, in seconds, after the API request is sent.
Boxplot: Shows variance of measurements

Latency, Over Time

Seconds to First Token Chunk Received; Lower median is better
Latency: Time to first token received, in seconds, after the API request is sent.
Over time measurement: Median measurement per day, based on 8 measurements each day at different times. Labels mark the start of each week's measurements.
Median across providers: Figures represent median (P50) across all providers which support the model.

Total Response Time

Time to receive a 100-token output, estimated from latency and throughput metrics

Total Response Time

Seconds to Output 100 Tokens; Lower is better
The speed difference between the fastest and slowest models is >3X. There is not always a correlation between parameter size and speed, or between price and speed.
Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Throughput (tokens per second).
Median across providers: Figures represent median (P50) across all providers which support the model.
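A simple way to combine the two metrics, assuming all 100 output tokens are generated at the measured throughput after the first chunk arrives (example values are illustrative):

```python
# Estimated total response time for a 100-token response, combining
# latency (seconds to first chunk) and throughput (tokens per second).
# Example values are illustrative.
def total_response_time(latency_s: float, throughput_tps: float,
                        output_tokens: int = 100) -> float:
    return latency_s + output_tokens / throughput_tps

print(total_response_time(latency_s=0.5, throughput_tps=80.0))  # 0.5 + 1.25 = 1.75 s
```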

Total Response Time, Over Time

Seconds to Output 100 Tokens; Lower is better
Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Throughput (tokens per second).
Over time measurement: Median measurement per day, based on 8 measurements each day at different times. Labels mark the start of each week's measurements.
Median across providers: Figures represent median (P50) across all providers which support the model.

Further details

Further analysis is available for each of the following models:
OpenAI: GPT-4, GPT-4 Turbo, GPT-4 Turbo (Vision), GPT-3.5 Turbo, and GPT-3.5 Turbo Instruct
Meta: Llama 3 Instruct (70B), Llama 2 Chat (13B), Llama 2 Chat (70B), Llama 3 Instruct (8B), Llama 2 Chat (7B), and Code Llama Instruct (70B)
Mistral: Mistral Large, Mistral Medium, Mixtral 8x22B Instruct, Mixtral 8x7B Instruct, Mistral Small, and Mistral 7B Instruct
Google: Gemini 1.5 Pro, Gemini 1.0 Pro, and Gemma 7B Instruct
Anthropic: Claude 3 Opus, Claude 3 Sonnet, Claude 3 Haiku, Claude 2.1, Claude 2.0, and Claude Instant
Cohere: Command-R+, Command-R, Command, and Command Light
Databricks: DBRX Instruct
OpenChat: OpenChat 3.5 (1210)
Perplexity: PPLX-70B Online and PPLX-7B-Online