Artificial Analysis Benchmarking Methodology
Scope
Artificial Analysis performs intelligence, quality, performance and price benchmarking of AI models and AI inference API endpoints. This section of our website describes our benchmarking methodology, including both our quality and performance benchmarking.
For our language model benchmarking, we consider endpoints to be serverless when customers pay only for their usage rather than a fixed rate for access to a system. Typically this means that endpoints are priced per token, often with different prices for input and output tokens.
Across all modalities, our performance benchmarking measures the end-to-end performance experienced by customers of AI inference services. This means that benchmark results are not intended to represent the maximum possible performance on any particular hardware platform; rather, they are intended to represent the real-world performance customers experience across providers.
We benchmark both proprietary and open weights models.
Methodology Details:
- Language Model Intelligence
- Language Model Performance
- Text to Image
- Speech to Text
- Text to Speech
- Speech Reasoning
Definitions
On this page, and across the Artificial Analysis website, we use the following terms:
- Model: A large language model (LLM), including proprietary, open source and open weights models.
- Model Creator: The organization that developed and trained the model. For example, OpenAI is the creator of GPT-4 and Meta is the creator of Llama 3.
- Endpoint: A hosted instance of a model that can be accessed via an API. A single model may have multiple endpoints across different providers.
- Provider: A company that hosts and provides access to one or more model endpoints via an API. Examples include OpenAI, AWS Bedrock, Together.ai and more. Companies are often both Model Creators and Providers.
- Serverless: A cloud service billed on an as-used basis; for LLM inference APIs, this generally means pricing per token of input and output. Serverless cloud products do still run on servers!
- Open Weights: A model whose weights have been released publicly by the model's creator. We refer to 'open weights' or just 'open' models rather than 'open-source' as many open LLMs have been released with licenses that do not meet the full definition of open-source software.
- Token: Modern LLMs are built around tokens - numerical representations of words and characters. LLMs take tokens as input and generate tokens as output. Input text is translated into tokens by a tokenizer. Different LLMs use different tokenizers.
- OpenAI Tokens: Tokens as generated by OpenAI's GPT-3.5 and GPT-4 tokenizer, generally measured for Artificial Analysis benchmarking with OpenAI's tiktoken package for Python. We use OpenAI tokens as a standard unit of measurement across Artificial Analysis to allow fair comparisons between models. All 'tokens per second' metrics refer to OpenAI tokens (see the token-counting sketch after this list).
- Native Tokens: Tokens as generated by an LLM's own tokenizer. We refer to 'native tokens' to distinguish from 'OpenAI tokens'. Prices generally refer to native tokens.
- Price (Input/Output): The price charged by a provider per input token sent to the model and per output token received from the model. Prices shown are the current prices listed by providers.
- Price (Blended): To enable easier comparison, we calculate a blended price assuming a 3:1 ratio of input to output tokens (see the sketch after this list).
- Latency (Time to First Token, TTFT): The time in seconds between sending a request to the API and receiving the first token of the response.
- Output Speed (output tokens per second): The average number of tokens received per second, after the first token is received.
- Total Response Time for 100 Output Tokens: The number of seconds to generate 100 output tokens, calculated synthetically from TTFT and Output Speed so that endpoints can be compared on a consistent basis (see the sketch after this list).
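To illustrate how OpenAI-token counts can be measured, here is a minimal sketch using the tiktoken package. The sample string is ours; cl100k_base is the encoding used by OpenAI's GPT-3.5 and GPT-4 models.

```python
import tiktoken

# cl100k_base is the tokenizer used by OpenAI's GPT-3.5 and GPT-4 models
enc = tiktoken.get_encoding("cl100k_base")

text = "Artificial Analysis benchmarks AI models."
tokens = enc.encode(text)
print(len(tokens))  # number of OpenAI tokens in the text
```

The same text run through a different model's native tokenizer will generally produce a different token count, which is why a single standard unit is needed for fair 'tokens per second' comparisons.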
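The blended price and the synthetic total response time are simple arithmetic over the quantities defined above. Below is a minimal Python sketch: the 3:1 weighting follows directly from the blended-price definition, while the streaming term in the total response time is our assumption (since Output Speed is measured after the first token arrives, we stream the remaining 99 tokens at that rate; the precise formulation used by Artificial Analysis may differ).

```python
def blended_price(input_price: float, output_price: float) -> float:
    """Blended price per token, weighting input:output tokens 3:1."""
    return (3 * input_price + output_price) / 4

def total_response_time(ttft_s: float, output_speed_tps: float,
                        n_tokens: int = 100) -> float:
    """Synthetic seconds to generate n_tokens output tokens.

    Assumes the first token arrives at TTFT and the remaining
    (n_tokens - 1) tokens stream at the measured output speed.
    The exact formulation used by Artificial Analysis may differ.
    """
    return ttft_s + (n_tokens - 1) / output_speed_tps

# e.g. an endpoint with 0.5 s TTFT streaming at 50 tokens/s:
# 0.5 + 99 / 50 = 2.48 s for 100 output tokens
print(total_response_time(0.5, 50.0))
```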