Artificial Analysis Performance Benchmarking Methodology
Overview
Measuring LLM performance requires sending the model a prompt and measuring the characteristics of its output. We use a variety of test workloads to measure LLM performance under different conditions.
Workload Types:
- 100 input token workload: Approximately 100 input tokens, 300 output tokens
- 1k input token workload: Approximately 1,000 input tokens, 1,000 output tokens (default benchmark on our website)
- 10k input token workload: Approximately 10,000 input tokens, 1,500 output tokens
- 100k input token workload: Approximately 100,000 input tokens, 2,000 output tokens
Longer prompts can result in both longer time to first token and slower output tokens per second compared to shorter prompts.
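For clarity, the workload types above can be summarized as a small configuration table. The sketch below is purely illustrative; the names and structure are not our production configuration:

```python
# Approximate input/output token targets for each workload type (illustrative only).
WORKLOADS = {
    "100_input":  {"input_tokens": 100,     "output_tokens": 300},
    "1k_input":   {"input_tokens": 1_000,   "output_tokens": 1_000},  # default benchmark on our website
    "10k_input":  {"input_tokens": 10_000,  "output_tokens": 1_500},
    "100k_input": {"input_tokens": 100_000, "output_tokens": 2_000},
}
```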
Load Scenarios:
- Single prompt: One prompt is sent to the model's API at a time
- Parallel prompts: 10 prompts are sent to the model's API simultaneously
Testing Frequency:
- Our 100, 1k, and 10k input token workloads are tested 8 times per day, approximately every 3 hours
- For our parallel workload test, we send 10 concurrent requests of our standard 1k input token workload once per day at a random time (see the sketch after this list)
- Our 100k input token workload is tested once per week
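For the parallel workload test, a minimal sketch of sending 10 concurrent requests with the official OpenAI Python library's async client might look like the following. The model name, prompt, and token limit are placeholders, not our actual test harness:

```python
import asyncio

from openai import AsyncOpenAI  # official OpenAI Python library

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def run_single_prompt(prompt: str) -> str:
    # One request of the standard 1k input token workload.
    response = await client.chat.completions.create(
        model="provider-model-name",  # placeholder
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        top_p=1,
        max_tokens=1_000,
    )
    return response.choices[0].message.content

async def run_parallel_test(prompt: str, concurrency: int = 10) -> list[str]:
    # Send all requests simultaneously and wait for every response to complete.
    return await asyncio.gather(*(run_single_prompt(prompt) for _ in range(concurrency)))

# asyncio.run(run_parallel_test("...approximately 1,000 token prompt..."))
```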
Prompt Generation:
Every individual test run uses a unique prompt, which we generate at the time of the test and run against all endpoints we cover. We have tested our prompts and found them to be resistant to speculative decoding: while speculative decoding does have an impact, measured output speeds remain well below the speeds that would be expected if the draft model were matching most of the tokens generated by the main model.
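As a hypothetical illustration of the general idea (not our actual prompt generator), prepending a freshly generated random identifier to a task description is one way to guarantee that each test run's prompt is unique and cannot be served from a cache:

```python
import uuid

def make_unique_prompt(base_task: str) -> str:
    # A fresh random prefix makes the prompt unique to this test run, so providers
    # cannot return cached responses or otherwise specialize for a known prompt.
    nonce = uuid.uuid4().hex
    return f"[run-id: {nonce}] {base_task}"

prompt = make_unique_prompt("Write an essay of approximately 1,000 tokens on ...")
```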
Measurement Representation:
Performance measurements are represented as the median (P50) measurement over the past 72 hours to reflect sustained changes in performance that users can expect to experience when using the API. An exception to this is the 100k prompt length workload, which is tested once per week and is represented as the median (P50) measurement over the past 14 days.
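As a simplified sketch of how a published figure is derived, the snippet below reports the median (P50) of measurements within a trailing window; the data shape and variable names are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone
from statistics import median

def reported_value(measurements: list[tuple[datetime, float]], window_hours: int = 72) -> float:
    # Keep measurements from the trailing window (72 hours normally; 14 days for the
    # 100k workload), then report the median so short-lived spikes do not dominate.
    cutoff = datetime.now(timezone.utc) - timedelta(hours=window_hours)
    recent = [value for timestamp, value in measurements if timestamp >= cutoff]
    return median(recent)
```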
Key Definitions
- Latency: Time to First Token: The time in seconds between sending a request to the API and receiving the first token of the response. For reasoning models that return reasoning tokens, this is the first reasoning token.
- Latency: Time to First Answer Token: The time in seconds between sending a request to the API and receiving the first answer token of the response. For reasoning models, this is measured after any 'thinking' time.
- Output Speed (output tokens per second): The average number of tokens received per second, after the first token is received.
- Total Response Time for 100 Output Tokens: The number of seconds to generate 100 output tokens, calculated synthetically from TTFT and Output Speed to ensure maximum comparability across endpoints (see the sketch after these definitions).
- End-to-End Response Time: The total time to receive a complete response, including input processing time, model reasoning time, and answer generation time.
- Average Reasoning Tokens: The average number of 'reasoning' tokens a reasoning model outputs before providing an answer, calculated across a diverse set of 60 prompts. Where the average number of reasoning tokens is not available or has not yet been calculated, we assume 2k reasoning tokens. The prompts are of varied lengths and cover a range of topics, including personal queries, commercial queries, coding, math, science, and others. They are a combination of prompts written by Artificial Analysis and prompts sourced from the following evaluations: MMLU Pro, AIME 2024, HumanEval, and LiveCodeBench. These prompts can be accessed here.
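To make the relationships between these definitions concrete, the sketch below shows the timing arithmetic for a single streamed response. The exact formulas (for example, how the first token is apportioned between TTFT and the output-speed term) reflect our reading of the definitions above rather than a published implementation:

```python
def output_speed(total_output_tokens: int, ttft_s: float, total_time_s: float) -> float:
    # Average tokens per second after the first token is received: exclude the
    # first token and the time spent waiting for it.
    return (total_output_tokens - 1) / (total_time_s - ttft_s)

def total_response_time_100_tokens(ttft_s: float, speed_tokens_per_s: float) -> float:
    # Synthetic time to generate 100 output tokens, combining measured TTFT
    # and Output Speed so endpoints can be compared on a like-for-like basis.
    return ttft_s + 100 / speed_tokens_per_s

# Example: a TTFT of 0.5 s and an Output Speed of 50 tokens/s gives
# 0.5 + 100 / 50 = 2.5 seconds for 100 output tokens.
```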
Technical Details
Server Location: Our primary testing server is a virtual machine hosted in Google Cloud's us-central1-a zone.
Test Accounts: We conduct tests using a combination of anonymous accounts, accounts with credits, and API keys provided explicitly for benchmarking. Where our primary benchmarking is not undertaken via an anonymous account, we register a separate anonymous account and validate that performance is not being manipulated.
API Libraries: For all providers claiming compatibility with OpenAI's API, we use the official OpenAI Python library to ensure consistency across tests. For providers without OpenAI compatibility, we use their recommended client libraries.
Request Parameters: All test requests are sent with the following parameters:
- temperature: 0
- top_p: 1
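A minimal streaming request using the official OpenAI Python library with these parameters might look like the following; the model name and prompt are placeholders, and the timing capture is simplified:

```python
import time

from openai import OpenAI  # official OpenAI Python library

# For OpenAI-compatible providers, the client is pointed at the provider's base URL,
# e.g. OpenAI(base_url="https://provider.example/v1", api_key="...").
client = OpenAI()

start = time.perf_counter()
first_token_at = None
text_parts = []

stream = client.chat.completions.create(
    model="provider-model-name",  # placeholder
    messages=[{"role": "user", "content": "..."}],  # placeholder prompt
    temperature=0,
    top_p=1,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token received (TTFT)
        text_parts.append(chunk.choices[0].delta.content)

end = time.perf_counter()
print("TTFT (s):", first_token_at - start)
print("End-to-end response time (s):", end - start)
```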
Token Measurement: All measurements of 'tokens' on Artificial Analysis are measured as OpenAI GPT-4 tokens, as counted by OpenAI's tiktoken library (cl100k_base encoding). This standardizes the number of tokens counted across different models (with different tokenizers) so that the same text is represented by the same number of tokens.
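As a brief illustration (a sketch, not our exact harness code), re-tokenizing text with tiktoken looks like this:

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4; all providers' text is re-tokenized
# with it so that token counts are comparable across models.
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(encoding.encode(text))

print(count_tokens("Hello, world!"))  # token count under cl100k_base
```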
Known Limitations
Tokenizer Efficiency and Pricing: Different models use different tokenizers, which can lead to differences in the number of tokens required to represent the same text. This means that pricing is not always directly comparable across models. We are working on publishing more details on tokenizer efficiency and its impact on pricing. In the meantime, we have shared some preliminary tokenizer-efficiency-adjusted pricing analysis on Twitter.
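As a small illustration of the effect, the same text maps to different token counts under different tokenizers; the example below compares two OpenAI encodings available in tiktoken purely to demonstrate why per-token prices alone do not determine cost:

```python
import tiktoken

text = "Benchmarking large language models requires careful, repeatable measurement."

for name in ("cl100k_base", "o200k_base"):
    encoding = tiktoken.get_encoding(name)
    # A lower token count for the same text means fewer tokens are billed per
    # request, even if the per-token price is identical.
    print(name, len(encoding.encode(text)))
```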
Quantization: Some models use quantization techniques to reduce computational requirements and increase speed. However, quantization can also affect model quality. We are moving towards full disclosure of quantization methods used by the models we benchmark.
Server Location and TTFT: Time-to-first-token (TTFT) is sensitive to server location as it includes network latency. Our primary testing server is located in Google Cloud's us-central1-a zone, which may advantage or disadvantage certain providers based on their server locations. We are considering adding additional testing locations to mitigate this effect.