
Llama 3 Instruct (8B): Quality, Performance & Price Analysis

Analysis of Meta's Llama 3 Instruct (8B) and comparison to other AI models across key metrics including quality, price, performance (tokens per second and time to first token), context window and more. Click on any model to compare API providers for that model. For more details, including our methodology, see our FAQs.
Creator: Meta
License: Open
Context window: 8k

Comparison Summary

Quality:
Llama 3 (8B) is of lower quality than average, with an MMLU score of 0.684 and a Quality Index across evaluations of 65.
Price:
Llama 3 (8B) is cheaper than average, with a blended price of $0.20 per 1M tokens (3:1 input:output ratio; see the sketch below this summary).
Llama 3 (8B) input token price: $0.20 per 1M tokens; output token price: $0.20 per 1M tokens.
Speed:
Llama 3 (8B) is faster than average, with a throughput of 121.2 tokens per second.
Latency:
Llama 3 (8B) has lower latency than average, taking 0.26s to receive the first token (TTFT).
Context Window:
Llama 3 (8B) has a smaller context window than average, with a context window of 8.2k tokens.
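The blended figure above is a weighted average of input and output prices at the stated 3:1 ratio; a minimal sketch of that arithmetic in Python (the simple weighted-average form is assumed from the "blended 3:1" label):

```python
def blended_price(input_price: float, output_price: float) -> float:
    """Blend per-1M-token prices at a 3:1 input:output ratio (simple weighted average)."""
    return (3 * input_price + output_price) / 4

# Llama 3 (8B): $0.20 input and $0.20 output per 1M tokens -> $0.20 blended
print(blended_price(0.20, 0.20))  # 0.2
```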

Highlights

Quality
Quality Index; Higher is better
Speed
Throughput in Tokens per Second; Higher is better
Price
USD per 1M Tokens; Lower is better
Note: Long-prompt benchmarks are not available for this model, as they require a context window of at least 10k tokens.

Quality vs. Throughput, Price

Quality: General reasoning index; Throughput: Output Tokens per Second; Price: USD per 1M Tokens
Most attractive quadrant
Size represents Price (USD per M Tokens)
There is a trade-off between model quality and throughput, with higher quality models typically having lower throughput.
Quality: Index represents normalized average relative performance across Chatbot Arena, MMLU & MT-Bench (one possible computation is sketched below).
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).
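The page does not spell out how the Quality Index normalization works, so the sketch below only illustrates one plausible scheme: min-max scale each benchmark across the compared models, then average. The scaling ranges and every value except the 0.684 MMLU score are made up for the example.

```python
from statistics import mean

def quality_index(scores: dict[str, float], ranges: dict[str, tuple[float, float]]) -> float:
    # Illustrative only: scale each benchmark to 0-100 using assumed min/max
    # values across compared models, then take the plain average.
    normalized = [
        100 * (score - ranges[bench][0]) / (ranges[bench][1] - ranges[bench][0])
        for bench, score in scores.items()
    ]
    return mean(normalized)

# Hypothetical ranges and scores (only MMLU = 0.684 comes from this page)
print(quality_index(
    {"Chatbot Arena": 1150, "MMLU": 0.684, "MT-Bench": 8.0},
    {"Chatbot Arena": (900, 1300), "MMLU": (0.40, 0.90), "MT-Bench": (5.0, 9.5)},
))
```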

Quality & Context window

Quality comparison by ability

Varied metrics by ability categorization; Higher is better
General Ability (Chatbot Arena)
Reasoning & Knowledge (MMLU)
Reasoning & Knowledge (MT Bench)
Coding (HumanEval)
Different use-cases warrant considering different evaluation tests. Chatbot Arena is a good evaluation of communication abilities while MMLU tests reasoning and knowledge more comprehensively.
Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Throughput (tokens per second).
Median across providers: Figures represent median (P50) across all providers which support the model.

Quality vs. Context window, Input token price

Quality: General reasoning index; Context window: Token limit; Input Price: USD per 1M Input Tokens
Most attractive quadrant
Size represents Input Price (USD per M Input Tokens)
Quality: Index represents normalized average relative performance across Chatbot Arena, MMLU & MT-Bench.
Context window: Maximum number of combined input & output tokens. Output tokens commonly have a significantly lower limit (varied by model).
Input price: Price per token included in the request/message sent to the API, represented as USD per million Tokens.

Context window

Context window: Token limit; Higher is better
Larger context windows are relevant to RAG (Retrieval Augmented Generation) LLM workflows, which typically involve reasoning over and retrieval from large amounts of data.
Context window: Maximum number of combined input & output tokens. Output tokens commonly have a significantly lower limit (varied by model).
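Because the limit counts input and output tokens together, a request has to budget both. A minimal sketch, assuming the 8k window corresponds to 8,192 tokens and that the prompt's token count is already known:

```python
CONTEXT_WINDOW = 8192  # assumed token count behind the "8k" figure for Llama 3 (8B)

def max_output_tokens(prompt_tokens: int, context_window: int = CONTEXT_WINDOW) -> int:
    # Tokens left for the response once the prompt is accounted for.
    remaining = context_window - prompt_tokens
    if remaining <= 0:
        raise ValueError("Prompt alone exceeds the context window; shorten or truncate it.")
    return remaining

print(max_output_tokens(7500))  # 692 tokens available for output
```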

Quality vs. Price

While higher quality models are typically more expensive, they do not all follow the same price-quality curve.
Quality: Index represents normalized average relative performance across Chatbot Arena, MMLU & MT-Bench.
Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).
Median across providers: Figures represent median (P50) across all providers which support the model.

Pricing: Input and Output prices

Price: USD per 1M Tokens
Input price
Output price
Prices vary considerably, including between input and output token prices, and can differ by more than 10X between the most and least expensive models.
Input price: Price per token included in the request/message sent to the API, represented as USD per million Tokens.
Output price: Price per token generated by the model (received from the API), represented as USD per million Tokens.
Median across providers: Figures represent median (P50) across all providers which support the model.

 Pricing comparison of Llama 3 Instruct (8B) API providers

Performance summary

Throughput vs. Price

There is a trade-off between model quality and throughput, with higher quality models typically having lower throughput.
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).

Latency vs. Throughput

Latency: Seconds to First Token Chunk Received; Throughput: Output Tokens per Second
Most attractive quadrant
Size represents Price (USD per M Tokens)
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Latency: Time in seconds from sending the API request to receiving the first token (see the timing sketch below).
Price: Price per token, represented as USD per million Tokens. Price is a blend of Input & Output token prices (3:1 ratio).
Median across providers: Figures represent median (P50) across all providers which support the model.
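The latency and throughput definitions above can be made concrete with a small timing sketch. The stream below is a stand-in for a streaming API response (no particular provider SDK is assumed) and the token counter is a placeholder; the timing logic simply follows the definitions: latency ends at the first chunk, throughput is tokens per second from the first chunk onward.

```python
import time

def measure(stream, count_tokens):
    # stream: iterable of text chunks from a (hypothetical) streaming completion call
    # count_tokens: function mapping a text chunk to its token count
    start = time.monotonic()
    first_chunk_at = None
    tokens = 0
    for chunk in stream:
        if first_chunk_at is None:
            first_chunk_at = time.monotonic()  # latency (TTFT) ends here
        tokens += count_tokens(chunk)
    latency = first_chunk_at - start
    generation_time = max(time.monotonic() - first_chunk_at, 1e-9)
    return latency, tokens / generation_time  # (seconds to first chunk, tokens per second)

# Example with a fake stream; replace with chunks from a real streaming API call.
print(measure(iter(["Hello", " world", "!"]), count_tokens=lambda s: 1))
```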

Speed

Measured by Throughput (tokens per second)

Throughput

Output Tokens per Second; Higher is better
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Median across providers: Figures represent median (P50) across all providers which support the model.

Throughput by Input token (context) length

Output Tokens per Second; Higher is better
Short (100 tokens)
Medium (1,000 tokens)
Long (10,000 tokens)
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Input Tokens Length: Length of tokens provided in the request. See Prompt Options above to see benchmarks of different input prompt lengths across other charts.
Median across providers: Figures represent median (P50) across all providers which support the model.

Throughput Variance

Output Tokens per Second; Results by percentile; Higher is better
Median; other points represent the 5th, 25th, 75th and 95th percentiles respectively (see the sketch below)
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Boxplot: Shows variance of measurements
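The percentile points in these variance charts can be reproduced from raw per-request measurements with a few lines of NumPy; the sample values below are illustrative, not actual benchmark data.

```python
import numpy as np

# Hypothetical per-request throughput samples (tokens/s) from repeated API calls
samples = np.array([98.5, 105.2, 118.7, 121.2, 124.9, 130.3, 142.0])

# The points plotted in the variance charts: 5th, 25th, 50th (median), 75th, 95th percentiles
p5, p25, p50, p75, p95 = np.percentile(samples, [5, 25, 50, 75, 95])
print(p5, p25, p50, p75, p95)
```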

Throughput, Over Time

Output Tokens per Second; Higher is better
Throughput: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API).
Over time measurement: Median measurement per day, based on 8 measurements each day at different times. Labels represent start of week's measurements.
Median across providers: Figures represent median (P50) across all providers which support the model.

Latency

Measured by Time (seconds) to First Token

Latency

Seconds to First Token Chunk Received; Lower is better
Latency: Time in seconds from sending the API request to receiving the first token.
Median across providers: Figures represent median (P50) across all providers which support the model.

Latency by Input token (context) length

Seconds to First Token Chunk Received; Lower is better
Short (100 tokens)
Medium (1,000 tokens)
Long (10,000 tokens)
Input Tokens Length: Length of tokens provided in the request. See Prompt Options above to see benchmarks of different input prompt lengths across other charts.
Latency: Time in seconds from sending the API request to receiving the first token.
Median across providers: Figures represent median (P50) across all providers which support the model.

Latency Variance

Seconds to First Token Chunk Received; Results by percentile; Lower is better
Median; other points represent the 5th, 25th, 75th and 95th percentiles respectively
Latency: Time in seconds from sending the API request to receiving the first token.
Boxplot: Shows variance of measurements

Latency, Over Time

Seconds to First Token Chunk Received; Lower median is better
Latency: Time in seconds from sending the API request to receiving the first token.
Over time measurement: Median measurement per day, based on 8 measurements each day at different times. Labels represent start of week's measurements.
Median across providers: Figures represent median (P50) across all providers which support the model.

Total Response Time

Time to receive 100 tokens of output, estimated from latency and throughput metrics

Total Response Time

Seconds to Output 100 Tokens; Lower is better
The speed difference between the fastest and slowest models is >3X. There is not always a correlation between parameter size and speed, or between price and speed.
Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Throughput (tokens per second).
Median across providers: Figures represent median (P50) across all providers which support the model.
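Given the definition above, the estimate can be reproduced from the latency and throughput figures; a minimal sketch, assuming the simple latency-plus-generation-time form:

```python
def total_response_time(latency_s: float, throughput_tps: float, output_tokens: int = 100) -> float:
    # Estimated seconds to receive `output_tokens`: time to first chunk plus generation time.
    return latency_s + output_tokens / throughput_tps

# Using the median figures quoted above for Llama 3 (8B): 0.26s TTFT, 121.2 tokens/s
print(round(total_response_time(0.26, 121.2), 2))  # ~1.09 seconds for a 100-token response
```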

Total Response Time by Input token (context) length

Seconds to Output 100 Tokens; Lower is better
Short (100 tokens)
Medium (1,000 tokens)
Long (10,000 tokens)
Input Tokens Length: Length of tokens provided in the request. See Prompt Options above to see benchmarks of different input prompt lengths across other charts.
Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Throughput (tokens per second).
Median across providers: Figures represent median (P50) across all providers which support the model.

Total Response Time Variance

Total Response Time: Seconds to Output 100 Tokens; Results by percentile; Lower is better
Median; other points represent the 5th, 25th, 75th and 95th percentiles respectively
Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Throughput (tokens per second).
Boxplot: Shows variance of measurements

Total Response Time, Over Time

Seconds to Output 100 Tokens; Lower is better
Total Response Time: Time to receive a 100 token response. Estimated based on Latency (time to receive first chunk) and Throughput (tokens per second).
Over time measurement: Median measurement per day, based on 8 measurements each day at different times. Labels represent start of week's measurements.
Median across providers: Figures represent median (P50) across all providers which support the model.
Further details
Further analysis is available for the following models:
OpenAI: GPT-4o, GPT-4 Turbo, GPT-4, GPT-3.5 Turbo, GPT-3.5 Turbo Instruct
Google: Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 1.0 Pro, Gemma 7B Instruct
Meta: Llama 3 Instruct (70B), Llama 3 Instruct (8B), Code Llama Instruct (70B), Llama 2 Chat (70B), Llama 2 Chat (13B), Llama 2 Chat (7B)
Mistral: Mixtral 8x22B Instruct, Mistral Large, Mistral Medium, Mistral Small, Mixtral 8x7B Instruct, Mistral 7B Instruct
Anthropic: Claude 3 Opus, Claude 3 Sonnet, Claude 3 Haiku, Claude 2.0, Claude 2.1, Claude Instant
Cohere: Command Light, Command, Command-R+, Command-R
Perplexity: PPLX-70B Online, PPLX-7B-Online
OpenChat: OpenChat 3.5 (1210)
Databricks: DBRX Instruct
DeepSeek: DeepSeek-V2-Chat
Snowflake: Arctic Instruct