Artificial Analysis Performance Benchmarking Methodology
Overview
Measuring LLM performance requires sending the model a prompt and measuring the characteristics of its output. We use a variety of test workloads to measure LLM performance under different conditions.
Workload Types:
- 100 input token workload: Approximately 100 input tokens, 300 output tokens
- 1k input token workload: Approximately 1,000 input tokens, 1,000 output tokens (default benchmark on our website)
- 10k input token workload: Approximately 10,000 input tokens, 1,500 output tokens
- 100k input token workload: Approximately 100,000 input tokens, 2,000 output tokens
Longer prompts can result in both longer time to first token and slower output tokens per second compared to shorter prompts.
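For clarity, the workload types above can be summarized as a small configuration table. The sketch below is purely illustrative; the names and structure are not our production configuration:

```python
# Approximate input/output token targets for each workload type (illustrative only).
WORKLOADS = {
    "100_input":  {"input_tokens": 100,     "output_tokens": 300},
    "1k_input":   {"input_tokens": 1_000,   "output_tokens": 1_000},  # default benchmark on our website
    "10k_input":  {"input_tokens": 10_000,  "output_tokens": 1_500},
    "100k_input": {"input_tokens": 100_000, "output_tokens": 2_000},
}
```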
Load Scenarios:
- Single prompt: One prompt is sent to the model's API at a time
- Parallel prompts: 10 prompts are sent to the model's API simultaneously
Testing Frequency:
- Our 100, 1k, and 10k input token workloads are tested 8 times per day, approximately every 3 hours
- For our parallel workload test, we send 10 concurrent requests of our standard 1k input token workload once per day at a random time (see the sketch after this list)
- Our 100k input token workload is tested once per week
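For the parallel workload test, a minimal sketch of sending 10 concurrent requests with the official OpenAI Python library's async client might look like the following. The model name, prompt, and token limit are placeholders, not our actual test harness:

```python
import asyncio

from openai import AsyncOpenAI  # official OpenAI Python library

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def run_single_prompt(prompt: str) -> str:
    # One request of the standard 1k input token workload.
    response = await client.chat.completions.create(
        model="provider-model-name",  # placeholder
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        top_p=1,
        max_tokens=1_000,
    )
    return response.choices[0].message.content

async def run_parallel_test(prompt: str, concurrency: int = 10) -> list[str]:
    # Send all requests simultaneously and wait for every response to complete.
    return await asyncio.gather(*(run_single_prompt(prompt) for _ in range(concurrency)))

# asyncio.run(run_parallel_test("...approximately 1,000 token prompt..."))
```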
Prompt Generation:
Every individual test run uses a unique prompt, which we generate at the time of the test and run against all endpoints we cover. We have tested our prompts and found them to be resistant to speculative decoding: while speculative decoding does have an impact, measured output speeds remain well below the speeds that would be expected if the draft model were matching most of the tokens generated by the main model.
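As a hypothetical illustration of the general idea (not our actual prompt generator), prepending a freshly generated random identifier to a task description is one way to guarantee that each test run's prompt is unique and cannot be served from a cache:

```python
import uuid

def make_unique_prompt(base_task: str) -> str:
    # A fresh random prefix makes the prompt unique to this test run, so providers
    # cannot return cached responses or otherwise specialize for a known prompt.
    nonce = uuid.uuid4().hex
    return f"[run-id: {nonce}] {base_task}"

prompt = make_unique_prompt("Write an essay of approximately 1,000 tokens on ...")
```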
Measurement Representation:
Performance measurements are represented as the median (P50) measurement over the past 72 hours to reflect sustained changes in performance that users can expect to experience when using the API. An exception to this is the 100k prompt length workload, which is tested once per week and is represented as the median (P50) measurement over the past 14 days.
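As a simplified sketch of how a published figure is derived, the snippet below reports the median (P50) of measurements within a trailing window; the data shape and variable names are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone
from statistics import median

def reported_value(measurements: list[tuple[datetime, float]], window_hours: int = 72) -> float:
    # Keep measurements from the trailing window (72 hours normally; 14 days for the
    # 100k workload), then report the median so short-lived spikes do not dominate.
    cutoff = datetime.now(timezone.utc) - timedelta(hours=window_hours)
    recent = [value for timestamp, value in measurements if timestamp >= cutoff]
    return median(recent)
```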
Key Definitions
- Latency: Time to First Token: The time in seconds between sending a request to the API and receiving the first token of the response. For reasoning models that return reasoning tokens, this is the first reasoning token.
- Latency: Time to First Answer Token: The time in seconds between sending a request to the API and receiving the first answer token of the response. For reasoning models, this is measured after any 'thinking' time.
- Output Speed (output tokens per second): The average number of tokens received per second, after the first token is received.
- Total Response Time for 100 Output Tokens: The number of seconds to generate 100 output tokens, calculated synthetically from TTFT and Output Speed to ensure maximum comparability across endpoints (see the sketch after these definitions).
- End-to-End Response Time: The total time to receive a complete response, including input processing time, model reasoning time, and answer generation time.
- Average Reasoning Tokens: The average number of 'reasoning' tokens a reasoning model outputs before providing an answer, calculated across a diverse set of 60 prompts. Where the average number of reasoning tokens is not available or has not yet been calculated, we assume 2k reasoning tokens. The prompts are of varied lengths and cover a range of topics, including personal queries, commercial queries, coding, math, science, and others. They are a combination of prompts written by Artificial Analysis and prompts sourced from the following evaluations: MMLU Pro, AIME 2024, HumanEval, and LiveCodeBench. These prompts can be accessed here.
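To make the relationships between these definitions concrete, the sketch below shows the timing arithmetic for a single streamed response. The exact formulas (for example, how the first token is apportioned between TTFT and the output-speed term) reflect our reading of the definitions above rather than a published implementation:

```python
def output_speed(total_output_tokens: int, ttft_s: float, total_time_s: float) -> float:
    # Average tokens per second after the first token is received: exclude the
    # first token and the time spent waiting for it.
    return (total_output_tokens - 1) / (total_time_s - ttft_s)

def total_response_time_100_tokens(ttft_s: float, speed_tokens_per_s: float) -> float:
    # Synthetic time to generate 100 output tokens, combining measured TTFT
    # and Output Speed so endpoints can be compared on a like-for-like basis.
    return ttft_s + 100 / speed_tokens_per_s

# Example: a TTFT of 0.5 s and an Output Speed of 50 tokens/s gives
# 0.5 + 100 / 50 = 2.5 seconds for 100 output tokens.
```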
Technical Details
Server Location: Our primary testing server is a virtual machine hosted in Google Cloud's us-central1-a zone.
Test Accounts: We conduct tests using a combination of anonymous accounts, accounts with credits, and API keys provided explicitly for benchmarking. Where our primary benchmarking is not undertaken via an anonymous account, we register a separate anonymous account and validate that performance is not being manipulated.
API Libraries: For all providers claiming compatibility with OpenAI's API, we use the official OpenAI Python library to ensure consistency across tests. For providers without OpenAI compatibility, we use their recommended client libraries.
Request Parameters: All test requests are sent with the following parameters:
- temperature: 0
- top_p: 1
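A minimal streaming request using the official OpenAI Python library with these parameters might look like the following; the model name and prompt are placeholders, and the timing capture is simplified:

```python
import time

from openai import OpenAI  # official OpenAI Python library

# For OpenAI-compatible providers, the client is pointed at the provider's base URL,
# e.g. OpenAI(base_url="https://provider.example/v1", api_key="...").
client = OpenAI()

start = time.perf_counter()
first_token_at = None
text_parts = []

stream = client.chat.completions.create(
    model="provider-model-name",  # placeholder
    messages=[{"role": "user", "content": "..."}],  # placeholder prompt
    temperature=0,
    top_p=1,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token received (TTFT)
        text_parts.append(chunk.choices[0].delta.content)

end = time.perf_counter()
print("TTFT (s):", first_token_at - start)
print("End-to-end response time (s):", end - start)
```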
Token Measurement: All measurements of 'tokens' on Artificial Analysis are measured as OpenAI GPT-4 tokens, as counted by OpenAI's tiktoken library (cl100k_base encoding). This standardizes the number of tokens counted across different models (with different tokenizers) so that the same text is represented by the same number of tokens.
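As a brief illustration (a sketch, not our exact harness code), re-tokenizing text with tiktoken looks like this:

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4; all providers' text is re-tokenized
# with it so that token counts are comparable across models.
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(encoding.encode(text))

print(count_tokens("Hello, world!"))  # token count under cl100k_base
```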
Known Limitations
Tokenizer Efficiency and Pricing: Different models use different tokenizers, which can lead to differences in the number of tokens required to represent the same text. This means that pricing is not always directly comparable across models. We are working on publishing more details on tokenizer efficiency and its impact on pricing. In the meantime, we have shared some preliminary tokenizer-efficiency-adjusted pricing analysis on Twitter.
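As a small illustration of the effect, the same text maps to different token counts under different tokenizers; the example below compares two OpenAI encodings available in tiktoken purely to demonstrate why per-token prices alone do not determine cost:

```python
import tiktoken

text = "Benchmarking large language models requires careful, repeatable measurement."

for name in ("cl100k_base", "o200k_base"):
    encoding = tiktoken.get_encoding(name)
    # A lower token count for the same text means fewer tokens are billed per
    # request, even if the per-token price is identical.
    print(name, len(encoding.encode(text)))
```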
Quantization: Some models use quantization techniques to reduce computational requirements and increase speed. However, quantization can also affect model quality. We are moving towards full disclosure of quantization methods used by the models we benchmark.
Server Location and TTFT: Time-to-first-token (TTFT) is sensitive to server location as it includes network latency. Our primary testing server is located in Google Cloud's us-central1-a zone, which may advantage or disadvantage certain providers based on their server locations. We are considering adding additional testing locations to mitigate this effect.