Artificial Analysis Performance Benchmarking Methodology

Overview

Measuring LLM performance requires sending the model a prompt and measuring the characteristics of its output. We use a variety of test workloads to capture performance across a range of prompt lengths and load conditions.

We benchmark the following workload types:

  • 100 input token workload: Approximately 100 input tokens, 300 output tokens
  • 1k input token workload: Approximately 1,000 input tokens, 1,000 output tokens (default benchmark on our website)
  • 10k input token workload: Approximately 10,000 input tokens, 1,500 output tokens
  • 100k input token workload: Approximately 100,000 input tokens, 2,000 output tokens
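
For illustration, the four workload definitions above can be captured in a simple configuration structure; the names and fields below are hypothetical, not the internal configuration we use.

```python
# Hypothetical sketch of the benchmark workload definitions listed above.
# Field names and structure are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Workload:
    name: str
    input_tokens: int   # approximate prompt length in OpenAI (tiktoken) tokens
    output_tokens: int  # approximate requested completion length

WORKLOADS = [
    Workload("100-input", 100, 300),
    Workload("1k-input", 1_000, 1_000),    # default benchmark on the website
    Workload("10k-input", 10_000, 1_500),
    Workload("100k-input", 100_000, 2_000),
]
```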

Longer prompts can result in both longer time to first token and slower output tokens per second compared to shorter prompts.

We test two load scenarios:

  • Single prompt: One prompt is sent to the model at a time
  • Parallel prompts: 10 prompts are sent to the model simultaneously, as sketched after this list (please note that we consider this test to remain in beta due to the impact of rate limits)
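
A minimal sketch of the parallel prompts scenario, assuming an OpenAI-compatible endpoint and using asyncio to issue 10 requests at once; the base URL, API key, model name, and prompt are placeholders rather than our exact test harness.

```python
# Hedged sketch: send 10 concurrent requests to an OpenAI-compatible endpoint
# and time each one end to end. Endpoint, key, and model name are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_KEY")

async def timed_request(prompt: str) -> float:
    start = time.perf_counter()
    await client.chat.completions.create(
        model="example-model",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1_000,
        temperature=0.2,
    )
    return time.perf_counter() - start

async def parallel_run(prompt: str, n: int = 10) -> list[float]:
    # Launch all n requests simultaneously and collect per-request durations.
    return await asyncio.gather(*(timed_request(prompt) for _ in range(n)))

if __name__ == "__main__":
    durations = asyncio.run(parallel_run("An approximately 1,000 token benchmark prompt goes here."))
    print([f"{d:.2f}s" for d in durations])
```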

We run our test workloads with the following frequencies:

  • Our 100, 1k, and 10k input token workloads are tested 8 times per day, approximately every 3 hours
  • For our parallel prompts test, we send 10 concurrent requests of our standard 1k input token workload once per day at a random time
  • Our 100k input token workload is tested once per week

Every individual test run uses a unique prompt, which we generate at the time of the test and run on all endpoints we cover.

Performance measurements are represented as the median (P50) measurement over the past 72 hours to reflect sustained changes in performance that users can expect to experience when using the API. An exception to this is the 100k prompt length workload, which is tested once per week and is represented as the median (P50) measurement over the past 14 days.
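
The reported figure for each metric is therefore a rolling median. A minimal sketch of that aggregation, assuming measurements are stored as (timestamp, value) pairs:

```python
# Hedged sketch of the rolling-median (P50) aggregation: report the median of
# all measurements within the trailing window (72 hours for most workloads,
# 14 days for the 100k input token workload).
from datetime import datetime, timedelta, timezone
from statistics import median

def rolling_p50(measurements: list[tuple[datetime, float]],
                window: timedelta = timedelta(hours=72)) -> float:
    cutoff = datetime.now(timezone.utc) - window
    recent = [value for timestamp, value in measurements if timestamp >= cutoff]
    return median(recent)  # raises StatisticsError if the window is empty
```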

Performance Benchmarking Technical Details

Server Location: Our primary testing server is a virtual machine hosted in Google Cloud's us-central1-a zone.

Test Accounts: We conduct tests using a combination of anonymous accounts, accounts with credits, and API keys provided explicitly for benchmarking. Where our primary benchmarking is not undertaken via an anonymous account, we also register a separate anonymous account and validate that performance is not being manipulated.

API Libraries: For all providers claiming compatibility with OpenAI's API, we use the official OpenAI Python library to ensure consistency across tests. For providers without OpenAI compatibility, we use their recommended client libraries.
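
As an illustration of the approach, the sketch below streams a single completion through the official OpenAI Python library against an OpenAI-compatible endpoint, recording time to first token and output tokens per second. The base URL, model name, and the exact speed formula are assumptions, not our internal benchmarking harness.

```python
# Hedged sketch: measure TTFT and output speed for one streamed request using
# the official OpenAI Python library. Endpoint, key, and model are placeholders;
# output tokens are counted with tiktoken.
import time

import tiktoken
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_KEY")
enc = tiktoken.get_encoding("cl100k_base")

first_token_at = None
text = ""

start = time.perf_counter()
stream = client.chat.completions.create(
    model="example-model",
    messages=[{"role": "user", "content": "An approximately 1,000 token benchmark prompt goes here."}],
    max_tokens=1_000,
    temperature=0.2,  # the temperature setting used for all our tests
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first content received
        text += chunk.choices[0].delta.content
end = time.perf_counter()

ttft = first_token_at - start                   # time to first token (includes network latency)
output_tokens = len(enc.encode(text))           # OpenAI tokens via tiktoken
speed = output_tokens / (end - first_token_at)  # one common definition of output tokens per second
print(f"TTFT: {ttft:.2f}s, output speed: {speed:.1f} tokens/s")
```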

Temperature: We use a temperature setting of 0.2 for all tests. In general, we do not observe significant effects of temperature on performance.

Token Measurement: All measurements of 'tokens' on Artificial Analysis refer to OpenAI tokens as counted by OpenAI's tiktoken library.
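
For reference, a minimal example of counting tokens with tiktoken; the model name below is illustrative only.

```python
# Minimal sketch: all token counts we report are OpenAI tokens as produced by
# tiktoken. The model name here is illustrative.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")  # resolves to the cl100k_base encoding
prompt = "An approximately 1,000 token benchmark prompt would go here."
print(len(enc.encode(prompt)))  # token count used for workload sizing and speed figures
```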

Known Limitations

Tokenizer Efficiency and Pricing: Different models use different tokenizers, which can lead to differences in the number of tokens required to represent the same text. This means that pricing is not always directly comparable across models. We are working on publishing more details on tokenizer efficiency and its impact on pricing. In the meantime, we have shared some preliminary tokenizer-efficiency-adjusted pricing analysis on Twitter.
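
To illustrate the effect (not a result from our benchmarks), the sketch below counts the same text under two different OpenAI encodings available in tiktoken; models from other providers use their own tokenizers, which can diverge further.

```python
# Illustrative only: the same text can require different numbers of tokens
# under different tokenizers, so per-token prices are not directly comparable.
import tiktoken

text = "Tokenizer efficiency affects how many tokens the same text costs."
for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(text)))
```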

Quantization: Some models use quantization techniques to reduce computational requirements and increase speed. However, quantization can also affect model quality. We are moving towards full disclosure of quantization methods used by the models we benchmark.

Server Location and TTFT: Time-to-first-token (TTFT) is sensitive to server location as it includes network latency. Our primary testing server is located in Google Cloud's us-central1-a zone, which may advantage or disadvantage certain providers based on their server locations. We are considering adding additional testing locations to mitigate this effect.