Methodology

Further detail coming shortly

Scope

Artificial Analysis performs benchmarking on large language model (LLM) inference delivered via serverless API endpoints. This page describes our LLM benchmarking methodology, including both our quality benchmarking and performance benchmarking.

We consider LLM endpoints to be serverless when customers only pay for their usage, not a fixed rate for access to a system. Typically this means that endpoints are priced on a per token basis, often with different prices for input and output tokens.

Our performance benchmarking measures the end-to-end performance experienced by customers of LLM inference services. Benchmark results are therefore not intended to represent the maximum possible performance on any particular hardware platform; rather, they represent the real-world performance customers experience across providers.

We benchmark both proprietary and open weights models.

Definitions

On this page, and across the Artificial Analysis website, we use the following terms:

  • Model: A large language model (LLM), including both proprietary and open weights models.
  • Model Creator: The organization that developed and trained the model. For example, OpenAI is the creator of GPT-4 and Meta is the creator of Llama 3.
  • Endpoint: A hosted instance of a model that can be accessed via an API. A single model may have multiple endpoints across different providers.
  • Provider: A company that hosts and provides access to one or more model endpoints via an API. Examples include OpenAI, AWS Bedrock, Together.ai and more. Companies are often both Model Creators and Providers.
  • Serverless: A cloud service provided on an as-used basis; for LLM inference APIs, this generally means pricing per input and output token. Serverless cloud products do still run on servers!
  • Open Weights: A model whose weights have been released publicly by the model's creator. We refer to 'open weights' or simply 'open' models rather than 'open-source' because many open LLMs have been released under licenses that do not meet the full definition of open-source software.
  • Token: Modern LLMs are built around tokens, numerical representations of words and characters. LLMs take tokens as input and generate tokens as output. Input text is translated into tokens by a tokenizer. Different LLMs use different tokenizers.
  • OpenAI Tokens: Tokens as generated by OpenAI's GPT-3.5 and GPT-4 tokenizer, generally measured for Artificial Analysis benchmarking with OpenAI's tiktoken package for Python. We use OpenAI tokens as a standard unit of measurement across Artificial Analysis to allow fair comparisons between models. All 'tokens per second' metrics refer to OpenAI tokens. An illustrative code sketch covering token counting and the metric formulas below follows this list.
  • Native Tokens: Tokens as generated by an LLM's own tokenizer. We refer to 'native tokens' to distinguish from 'OpenAI tokens'. Prices generally refer to native tokens.
  • Price (Input/Output): The price charged by a provider per input token sent to the model and per output token received from the model. Prices shown are the current prices listed by providers.
  • Price (Blended): To enable easier comparison, we calculate a blended price assuming a 3:1 ratio of input to output tokens.
  • Blended Price = (3 * Input Price + Output Price) / 4
  • Time to First Token (TTFT): The time in seconds between sending a request to the API and receiving the first token of the response.
  • TTFT = Time of First Token Arrival - Time of Request Sent
  • Output Speed (output tokens per second): The average number of tokens received per second, after the first token is received.
  • Output Speed = (Total Tokens - First Chunk Tokens) / (Time of Final Token Chunk Received - Time of First Token Chunk Received)
  • Total Response Time for 100 Output Tokens: The number of seconds to generate 100 output tokens, measured end to end.
  • Total Response Time for 100 Output Tokens = Time to First Token + (100 / Output Speed)
  • Quality Index: A simplified metric for understanding the relative quality of models, currently calculated by normalizing and combining Chatbot Arena Elo Score, MMLU, and MT-Bench scores. We find Quality Index very helpful for comparing the relative positions of models, especially when comparing quality with speed or price metrics on scatterplots, but we do not recommend citing Quality Index values directly.
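
To illustrate how these definitions fit together, here is a minimal Python sketch, assuming tiktoken's cl100k_base encoding as the OpenAI tokenizer; the prices and timings passed in at the bottom are hypothetical placeholders, and this is an illustrative sketch rather than our production benchmarking code.

    # Illustrative sketch only: the example prices and timings are hypothetical,
    # not measurements of any provider.
    import tiktoken

    encoder = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/GPT-4 tokenizer

    def count_openai_tokens(text: str) -> int:
        """Count 'OpenAI tokens', the standard unit used across Artificial Analysis."""
        return len(encoder.encode(text))

    def blended_price(input_price: float, output_price: float) -> float:
        """Blended price assuming a 3:1 ratio of input to output tokens."""
        return (3 * input_price + output_price) / 4

    def output_speed(total_tokens: int, first_chunk_tokens: int,
                     first_chunk_time: float, final_chunk_time: float) -> float:
        """Output tokens per second, measured after the first token is received."""
        return (total_tokens - first_chunk_tokens) / (final_chunk_time - first_chunk_time)

    def total_response_time_100(ttft: float, speed: float) -> float:
        """End-to-end seconds to generate 100 output tokens."""
        return ttft + 100 / speed

    print(count_openai_tokens("Hello, world!"))           # OpenAI token count of a short string
    print(blended_price(0.50, 1.50))                      # placeholder prices
    print(total_response_time_100(ttft=0.4, speed=80.0))  # placeholder timings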

Performance Benchmarking Overview

Measuring LLM performance requires sending the LLM a prompt and measuring the characteristics of its output. We use a variety of test workloads to measure LLM performance.

We test three input prompt lengths:

  • Short prompts: Approximately 100 input tokens
  • Medium prompts: Approximately 1,000 input tokens
  • Long prompts: Approximately 10,000 input tokens

Longer prompts can result in both a longer time to first token and a lower output speed (output tokens per second) compared to shorter prompts.
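
To make the prompt-length buckets concrete, the sketch below trims text to an approximate OpenAI-token budget using tiktoken; the source text and exact budgets are placeholders, and this illustrates the approach rather than our actual prompt generator.

    # Illustrative sketch: trim text to an approximate OpenAI-token budget.
    # The source text below is a placeholder, not one of our real prompts.
    import tiktoken

    encoder = tiktoken.get_encoding("cl100k_base")

    def trim_to_token_budget(text: str, target_tokens: int) -> str:
        """Return a prefix of `text` that is approximately `target_tokens` OpenAI tokens long."""
        tokens = encoder.encode(text)
        return encoder.decode(tokens[:target_tokens])

    source_text = "placeholder corpus text " * 5000
    short_prompt = trim_to_token_budget(source_text, 100)      # ~100 input tokens
    medium_prompt = trim_to_token_budget(source_text, 1_000)   # ~1,000 input tokens
    long_prompt = trim_to_token_budget(source_text, 10_000)    # ~10,000 input tokens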

We test two load scenarios:

  • Single prompt: One prompt is sent to the model at a time
  • Parallel prompts: 10 prompts are sent to the model simultaneously (please note that we consider this test to remain in beta due to the impact of provider rate limits); a simplified sketch appears at the end of this section

We sample 500 output tokens for all tests.

We run our test workloads every day, with the following frequencies:

  • Single prompts are tested 8 times per day with randomized intervals
  • Parallel prompts are tested once per day at a randomized time

Every individual test run uses a unique prompt, which we generate at the time of the test and run against all endpoints we cover.
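
As a simplified illustration of the parallel load scenario flagged above, the sketch below sends 10 simultaneous requests using asyncio and the OpenAI Python client; the model name and prompt are placeholders, and provider rate limits are not handled here.

    # Illustrative sketch of the parallel load scenario: 10 simultaneous requests.
    # The model name and prompt are placeholders; rate limits are not handled.
    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    async def run_once(prompt: str):
        return await client.chat.completions.create(
            model="gpt-3.5-turbo",  # placeholder model
            messages=[{"role": "user", "content": prompt}],
            max_tokens=500,         # we sample 500 output tokens for all tests
            temperature=0.2,
        )

    async def parallel_load(prompt: str, concurrency: int = 10):
        # Send `concurrency` identical requests simultaneously and wait for all of them.
        return await asyncio.gather(*(run_once(prompt) for _ in range(concurrency)))

    # asyncio.run(parallel_load("A unique prompt generated at test time"))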

Performance Benchmarking Technical Details

Server Location: Our primary testing server is a virtual machine hosted in Google Cloud's us-central1-a zone.

Test Accounts: We conduct tests using a combination of anonymous accounts, accounts with credits, and API keys provided explicitly for benchmarking. Where our primary benchmarking is not undertaken via an anonymous account, we register a separate anonymous account and validate that performance is not being manipulated.

API Libraries: For all providers claiming compatibility with OpenAI's API, we use the official OpenAI Python library to ensure consistency across tests. For providers without OpenAI compatibility, we use their recommended client libraries.

Temperature: We use a temperature setting of 0.2 for all tests. In general, we do not observe significant effects of temperature on performance.

Token Measurement: All measurements of 'tokens' on Artificial Analysis refer to OpenAI tokens as counted by OpenAI's tiktoken library.
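
To show how these settings come together in a single streamed measurement, here is a minimal sketch using the OpenAI Python client with temperature 0.2, 500 output tokens, and tiktoken for token counting; the model name and prompt are placeholders, and this is a simplified illustration rather than our production harness.

    # Illustrative sketch: measure TTFT and output speed from one streamed response.
    # The model name and prompt are placeholders; this is not our production harness.
    import time
    import tiktoken
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    encoder = tiktoken.get_encoding("cl100k_base")

    def measure(prompt: str, model: str = "gpt-3.5-turbo") -> dict:
        request_sent = time.perf_counter()
        first_token_time = None
        text_parts = []

        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            temperature=0.2,
            max_tokens=500,
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content if chunk.choices else None
            if delta:
                if first_token_time is None:
                    first_token_time = time.perf_counter()
                text_parts.append(delta)
        final_token_time = time.perf_counter()

        total_tokens = len(encoder.encode("".join(text_parts)))
        ttft = first_token_time - request_sent
        # Approximates the Output Speed formula by treating the first chunk as one token.
        speed = (total_tokens - 1) / (final_token_time - first_token_time)
        return {"ttft_s": ttft, "output_tokens_per_s": speed}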

Quality Benchmarking Overview

Major update to our quality benchmarking approach coming soon. See below for details of our current approach.

To provide an easy-to-understand summary metric, especially for use on charts to compare models, we calculate an overall Quality Index for each model. This index is a weighted average of normalized scores from MMLU, MT-Bench, and Chatbot Arena. Benchmark scores are sourced from model creators for MMLU, MT-Bench and HumanEval where reported, and from the LMSYS Chatbot Arena Leaderboard for Chatbot Arena scores.

We first normalize each score by scaling it between 0 and 100 based on the minimum and maximum scores observed across all models. We then take a weighted average of these normalized scores. This means that if two models perform equally well on MMLU, but one of them has a lower Chatbot Arena Elo Score, that model will have a correspondingly lower Quality Index.
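
The normalization and weighting described above can be sketched as follows; the benchmark scores and equal weights below are hypothetical placeholders rather than the values we actually use.

    # Illustrative sketch of min-max normalization and a weighted average.
    # The scores and weights below are hypothetical placeholders.

    def min_max_normalize(scores: dict) -> dict:
        """Scale scores to 0-100 based on the min and max observed across models."""
        lo, hi = min(scores.values()), max(scores.values())
        return {model: 100 * (s - lo) / (hi - lo) for model, s in scores.items()}

    # Hypothetical benchmark scores per model.
    raw_scores = {
        "mmlu":     {"model_a": 86.4, "model_b": 79.5, "model_c": 70.0},
        "arena":    {"model_a": 1250, "model_b": 1180, "model_c": 1100},
        "mt_bench": {"model_a": 9.3,  "model_b": 8.6,  "model_c": 7.9},
    }
    normalized = {name: min_max_normalize(scores) for name, scores in raw_scores.items()}

    weights = {"mmlu": 1 / 3, "arena": 1 / 3, "mt_bench": 1 / 3}  # placeholder weights

    quality_index = {
        model: sum(weights[b] * normalized[b][model] for b in weights)
        for model in raw_scores["mmlu"]
    }
    print(quality_index)  # e.g. {'model_a': 100.0, 'model_b': ..., 'model_c': 0.0}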

Known Limitations

Tokenizer Efficiency and Pricing: Different models use different tokenizers, which can lead to differences in the number of tokens required to represent the same text. This means that pricing is not always directly comparable across models. We are working on publishing more details on tokenizer efficiency and its impact on pricing. In the meantime, we have shared some preliminary tokenizer-efficiency-adjusted pricing analysis on Twitter.
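
As a simple illustration of why tokenizer efficiency matters for pricing, the sketch below counts the tokens that two different tiktoken encodings produce for the same text and applies a hypothetical per-native-token price; the price is a placeholder, not a provider list price.

    # Illustrative sketch: the same text can require different numbers of tokens
    # under different tokenizers. The price below is a hypothetical placeholder.
    import tiktoken

    text = "Artificial Analysis benchmarks LLM inference endpoints across providers."

    encodings = {
        "cl100k_base (GPT-3.5/GPT-4)": tiktoken.get_encoding("cl100k_base"),
        "r50k_base (older OpenAI models)": tiktoken.get_encoding("r50k_base"),
    }

    price_per_million_native_tokens = 1.00  # USD, placeholder

    for name, enc in encodings.items():
        n_tokens = len(enc.encode(text))
        cost = n_tokens / 1_000_000 * price_per_million_native_tokens
        print(f"{name}: {n_tokens} tokens, ${cost:.8f} for this passage")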

Quantization: Some models use quantization techniques to reduce computational requirements and increase speed. However, quantization can also affect model quality. We are moving towards full disclosure of quantization methods used by the models we benchmark.

Server Location and TTFT: Time-to-first-token (TTFT) is sensitive to server location as it includes network latency. Our primary testing server is located in Google Cloud's us-central1-a zone, which may advantage or disadvantage certain providers based on their server locations. We are considering adding additional testing locations to mitigate this effect.