Methodology
Scope
Artificial Analysis performs benchmarking on large language model (LLM) inference delivered via serverless API endpoints. This page describes our LLM benchmarking methodology, including both our quality benchmarking and performance benchmarking.
We consider LLM endpoints to be serverless when customers only pay for their usage, not a fixed rate for access to a system. Typically this means that endpoints are priced on a per token basis, often with different prices for input and output tokens.
Our performance benchmarking measures the end-to-end performance experienced by customers of LLM inference services. This means that benchmark results are not intended to represent the maximum possible performance on any particular hardware platform; rather, they are intended to represent the real-world performance customers experience across providers.
We benchmark both proprietary and open weights models.
Definitions
On this page, and across the Artificial Analysis website, we use the following terms:
- Model: A large language model (LLM), including proprietary, open source and open weights models.
- Model Creator: The organization that developed and trained the model. For example, OpenAI is the creator of GPT-4 and Meta is the creator of Llama 3.
- Endpoint: A hosted instance of a model that can be accessed via an API. A single model may have multiple endpoints across different providers.
- Provider: A company that hosts and provides access to one or more model endpoints via an API. Examples include OpenAI, AWS Bedrock, Together.ai and more. Companies are often both Model Creators and Providers.
- Serverless: A cloud service provided on an as-used basis. For LLM inference APIs, this generally means pricing per token of input and output. Serverless cloud products do still run on servers!
- Open Weights: A model whose weights have been released publicly by the model's creator. We refer to 'open weights' or just 'open' models rather than 'open-source' as many open LLMs have been released with licenses that do not meet the full definition of open-source software.
- Token: Modern LLMs are built around tokens - numerical representations of words and characters. LLMs take tokens as input and generate tokens as output. Input text is translated into tokens by a tokenizer. Different LLMs use different tokenizers.
- OpenAI Tokens: Tokens as generated by OpenAI's GPT-3.5 and GPT-4 tokenizer, generally measured for Artificial Analysis benchmarking with OpenAI's tiktoken package for Python. We use OpenAI tokens as a standard unit of measurement across Artificial Analysis to allow fair comparisons between models. All 'tokens per second' metrics refer to OpenAI tokens (see the illustrative sketch after this list).
- Native Tokens: Tokens as generated by an LLM's own tokenizer. We refer to 'native tokens' to distinguish from 'OpenAI tokens'. Prices generally refer to native tokens.
- Price (Input/Output): The price charged by a provider per input token sent to the model and per output token received from the model. Prices shown are the current prices listed by providers.
- Price (Blended): To enable easier comparison, we calculate a blended price assuming a 3:1 ratio of input to output tokens.
- (3 * Input Price + Output Price) / 4
- Time to First Token (TTFT): The time in seconds between sending a request to the API and receiving the first token of the response.
- Time of First Token Arrival - Time of Request Sent
- Output Speed (output tokens per second): The average number of tokens received per second, after the first token is received.
- (Total Tokens - First Chunk Tokens) / (Time of Final Token Chunk Received - Time of First Token Chunk Received)
- Total Response Time for 100 Output Tokens: The number of seconds to generate 100 output tokens, measured end to end.
- Time to First Token + (100 / Output Speed)
- Quality Index: A simplified metric for understanding the relative quality of models. Currently calculated by normalizing and combining Chatbot Arena Elo Score, MMLU, and MT-Bench results for easy comparison of models. We find Quality Index very helpful for comparing the relative positions of models, especially when comparing quality with speed or price metrics on scatterplots, but we do not recommend citing Quality Index values directly.
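To make the definitions above concrete, the sketch below shows one way the token counts and headline metrics could be computed in Python. It is illustrative only: the use of the cl100k_base encoding and the helper function names are our assumptions, not a description of the Artificial Analysis pipeline.

```python
# Illustrative helpers for the metric definitions above.
# Assumption: 'OpenAI tokens' are counted with tiktoken's cl100k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5 / GPT-4 tokenizer

def openai_token_count(text: str) -> int:
    """Count OpenAI tokens in a piece of text."""
    return len(enc.encode(text))

def blended_price(input_price: float, output_price: float) -> float:
    """Blended price per token, assuming a 3:1 input-to-output token ratio."""
    return (3 * input_price + output_price) / 4

def time_to_first_token(t_request_sent: float, t_first_token: float) -> float:
    """TTFT in seconds."""
    return t_first_token - t_request_sent

def output_speed(total_tokens: int, first_chunk_tokens: int,
                 t_first_chunk: float, t_final_chunk: float) -> float:
    """Average output tokens per second after the first token is received."""
    return (total_tokens - first_chunk_tokens) / (t_final_chunk - t_first_chunk)

def total_response_time_100_tokens(ttft: float, speed: float) -> float:
    """End-to-end seconds to generate 100 output tokens."""
    return ttft + 100 / speed
```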
Performance Benchmarking Overview
Measuring LLM performance requires sending the LLM a prompt and measuring the characteristics of its output. We use a variety of test workloads to measure LLM performance under a range of conditions.
We benchmark the following workload types:
- 100 input token workload: Approximately 100 input tokens, 300 output tokens
- 1k input token workload: Approximately 1,000 input tokens, 1,000 output tokens (default benchmark on our website)
- 10k input token workload: Approximately 10,000 input tokens, 1,500 output tokens
- 100k input token workload: Approximately 100,000 input tokens, 2,000 output tokens
Longer prompts can result in both longer time to first token and slower output tokens per second compared to shorter prompts.
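As an illustration of how a prompt of a target size might be assembled, the hedged sketch below trims source text to approximately a target number of OpenAI tokens. The function name and approach are our assumptions, not the Artificial Analysis prompt generator; the actual benchmark prompts are generated freshly for each test run, as described below.

```python
# Hedged sketch: trim arbitrary source text to roughly a target number of
# OpenAI tokens, as counted by tiktoken. The helper is a placeholder used
# purely to illustrate the 'approximately N input tokens' workload sizes.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def build_prompt(source_text: str, target_tokens: int = 1000) -> str:
    token_ids = enc.encode(source_text)
    return enc.decode(token_ids[:target_tokens])
```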
We test two load scenarios:
- Single prompt: One prompt is sent to the model at a time
- Parallel prompts: 10 prompts are sent to the model simultaneously (please note that we consider this test to remain in beta due to the impact of rate limits)
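The parallel prompts scenario could be reproduced along the lines of the sketch below, which sends 10 concurrent requests with the OpenAI Python client and asyncio. The model name and prompt are placeholders, and retries and rate-limit handling are omitted; this is a sketch under those assumptions, not our production harness.

```python
# Hedged sketch of the parallel prompts load scenario: 10 requests sent
# concurrently. Model name and prompt are placeholders; error handling and
# rate-limit backoff are omitted for brevity.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def one_request(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content

async def parallel_run(prompt: str, concurrency: int = 10) -> list[str]:
    return await asyncio.gather(
        *(one_request(prompt) for _ in range(concurrency))
    )

# asyncio.run(parallel_run("<~1,000 token benchmark prompt goes here>"))
```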
We run our test workloads with the following frequencies:
- Our 100, 1k, and 10k input token workloads are tested 8 times per day, approximately every 3 hours
- For our parallel prompts test, we send 10 concurrent requests of our standard 1k input token workload once per day at a random time
- Our 100k input token workload is tested once per week at a random time
Every individual test run uses a unique prompt, which we generate at the time of the test and run on all endpoints we cover.
Performance measurements are represented as the median (P50) measurement over the past 14 days, unless otherwise noted, to reflect sustained changes in performance that users can expect to experience when using the API.
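As a small illustration of this aggregation, a 14-day P50 could be computed as in the sketch below; the data structure and function name are placeholders.

```python
# Hedged sketch: report the median (P50) of measurements taken over a rolling
# 14-day window. 'samples' is a placeholder structure of (timestamp, value)
# pairs with timezone-aware timestamps.
from datetime import datetime, timedelta, timezone
from statistics import median

def p50_last_14_days(samples: list[tuple[datetime, float]]) -> float:
    cutoff = datetime.now(timezone.utc) - timedelta(days=14)
    recent = [value for ts, value in samples if ts >= cutoff]
    return median(recent)  # raises if no samples fall in the window
```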
Performance Benchmarking Technical Details
Server Location: Our primary testing server is a virtual machine hosted in Google Cloud's us-central1-a zone.
Test Accounts: We conduct tests using a combination of anonymous accounts, accounts with credits, and API keys provided explicitly for benchmarking. Where our primary benchmarking is not undertaken via an anonymous account, we register a separate anonymous account and validate that performance is not being manipulated.
API Libraries: For all providers claiming compatibility with OpenAI's API, we use the official OpenAI Python library to ensure consistency across tests. For providers without OpenAI compatibility, we use their recommended client libraries.
Temperature: We use a temperature setting of 0.2 for all tests. In general, we do not observe significant effects of temperature on performance.
Token Measurement: All measurements of 'tokens' on Artificial Analysis refer to OpenAI tokens as counted by OpenAI's tiktoken library.
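Putting the above together, a single benchmark request against an OpenAI-compatible endpoint might look like the sketch below: stream the response, record when the first and last content chunks arrive, and count output tokens with tiktoken. This is a simplified illustration under those assumptions, not our production harness; the model name and prompt are placeholders.

```python
# Hedged sketch of one benchmark run against an OpenAI-compatible endpoint:
# stream the completion, time the first and last content chunks, and count
# output tokens with tiktoken. Simplified: no retries, logging, or
# per-provider handling, and it assumes the response arrives in more than
# one chunk.
import time
import tiktoken
from openai import OpenAI

client = OpenAI()  # or an OpenAI-compatible provider via base_url=...
enc = tiktoken.get_encoding("cl100k_base")

def benchmark_once(prompt: str, model: str = "gpt-4o") -> dict:
    t_sent = time.perf_counter()
    t_first = t_final = None
    first_chunk_tokens = 0
    text = ""

    stream = client.chat.completions.create(
        model=model,  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            now = time.perf_counter()
            if t_first is None:
                t_first = now
                first_chunk_tokens = len(enc.encode(delta))
            t_final = now
            text += delta

    total_tokens = len(enc.encode(text))
    ttft = t_first - t_sent
    speed = (total_tokens - first_chunk_tokens) / (t_final - t_first)
    return {"ttft_s": ttft, "output_tokens_per_s": speed}
```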
Quality Benchmarking Overview
Artificial Analysis independently runs quality evaluations on every language model endpoint covered on our site. Our current set of evaluations includes MMLU, GPQA, MATH-500, and HumanEval. Full details coming soon.
Known Limitations
Tokenizer Efficiency and Pricing: Different models use different tokenizers, which can lead to differences in the number of tokens required to represent the same text. This means that pricing is not always directly comparable across models. We are working on publishing more details on tokenizer efficiency and its impact on pricing. In the meantime, we have shared some preliminary tokenizer-efficiency-adjusted pricing analysis on Twitter.
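As a hedged illustration of the effect, the sketch below tokenizes the same text with two different tiktoken encodings, used here purely as stand-ins for two different models' tokenizers: the token counts differ, so per-token prices do not translate directly into per-request costs.

```python
# Illustration only: the same text tokenized with two different tiktoken
# encodings, standing in for two different models' tokenizers.
import tiktoken

text = "Tokenizer efficiency affects how much a given prompt actually costs."
for encoding_name in ("cl100k_base", "p50k_base"):
    enc = tiktoken.get_encoding(encoding_name)
    print(encoding_name, len(enc.encode(text)), "tokens")
```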
Quantization: Some models use quantization techniques to reduce computational requirements and increase speed. However, quantization can also affect model quality. We are moving towards full disclosure of quantization methods used by the models we benchmark.
Server Location and TTFT: Time-to-first-token (TTFT) is sensitive to server location as it includes network latency. Our primary testing server is located in Google Cloud's us-central1-a zone, which may advantage or disadvantage certain providers based on their server locations. We are considering adding additional testing locations to mitigate this effect.