AA-AgentPerf Methodology

Overview

AA-AgentPerf is a hardware benchmark that measures how many active users an inference deployment can support under realistic agentic workloads while meeting user experience performance targets (time to first token and output speed).

  • Real agentic trajectories — multi-turn coding sessions with interleaved reasoning, tool calls, and variable context lengths (not synthetic uniform prompts).
  • Sustained concurrent load — simulated users maintain continuous in-flight requests, stressing KV cache reuse, speculative decoding, and scheduler behavior.
  • Market-derived SLO tiers — performance thresholds based on Artificial Analysis serverless API benchmarking data, reflecting quality-of-service levels observed across providers.
  • Continuously updated — results are refreshed on an ongoing basis as new hardware, software stacks, and model versions become available.
  • Production-ready — models are tested with realistic optimizations enabled and with production-scale deployment topologies.
[Figure: an AI server serving nine concurrent agents, annotated with system throughput and per-user output speed.]

Each simulated user works sequentially through agentic coding trajectories — reasoning, calling tools, and editing code — while the system scales to support increasing concurrency.
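A simulated user can be modeled as a closed-loop client: it replays one trajectory turn at a time and keeps exactly one request in flight. The sketch below illustrates this under assumptions — `send_request` and the trajectory format are hypothetical stand-ins, not the benchmark harness itself.

```python
import time

def run_user(trajectory, send_request):
    """Closed-loop simulated user: issues each turn of an agentic
    trajectory in sequence, with one request in flight at a time.
    `trajectory` and `send_request` are hypothetical stand-ins for
    the real harness. Returns per-request timing records."""
    records = []
    for turn in trajectory:
        t_start = time.monotonic()
        first_token_at = None
        tokens = 0
        for _token in send_request(turn):  # streaming response
            if first_token_at is None:
                first_token_at = time.monotonic()
            tokens += 1
        t_end = time.monotonic()
        if first_token_at is None:
            continue  # no tokens returned; skip this record
        records.append({
            "ttft": first_token_at - t_start,          # time to first token
            "output_tokens": tokens,
            "decode_time": t_end - first_token_at,      # time after first token
        })
    return records
```

Because each user is closed-loop, total load scales directly with the number of simulated users, which is the quantity the benchmark searches over.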

Dataset

The AA-AgentPerf dataset contains real agentic trajectories covering several use cases, programming languages, and models. Trajectories were generated from public code repositories using top open-source models with reasoning enabled.

  • Input Sequence Length (ISL): Ranges from ~1K to ~131K tokens, with a mean of approximately 27K tokens. Trajectories are truncated to fit within the maximum recommended context length of the models under test.
  • Output Sequence Length (OSL): Median of ~150 tokens, with a P95 of ~2K tokens (for DeepSeek V3.2). Output lengths vary based on reasoning verbosity and the complexity of the tool calls being generated.
  • Languages: 12+ programming languages are represented, based on the primary language of the source repository.

A representative tuning subset is provided to test participants for configuration validation and performance tuning. The full test dataset is kept private to prevent benchmark-targeted optimizations.

Service-Level Objectives

Performance SLOs are derived from Artificial Analysis serverless API benchmarking data. For each model under test, the tiers of service currently available in the market have been identified. Vendors target each tier separately, maximizing the number of concurrent users they can support at that level of service.

Speed and latency are both calculated at the per-request level. Because agentic workloads include a large number of small-OSL requests, P25 output speed is used (rather than P5), with percentiles computed over all requests sent during a phase.
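The per-request metric and its percentile can be sketched as follows; the speed formula (output tokens after the first, divided by decode time) matches the per-request definition used here, while the function names are illustrative.

```python
def output_speed(output_tokens, first_token_time, last_token_time):
    """Per-request output speed in tokens/s, measured after the first
    token is received (the first token is excluded from the count)."""
    return (output_tokens - 1) / (last_token_time - first_token_time)

def p25(values):
    """25th percentile with linear interpolation between ranks."""
    xs = sorted(values)
    rank = 0.25 * (len(xs) - 1)
    lo, frac = int(rank), rank - int(rank)
    return xs[lo] if frac == 0 else xs[lo] + frac * (xs[lo + 1] - xs[lo])
```

A phase at a given concurrency passes a tier only if the P25 of all per-request speeds collected during that phase meets the tier's threshold.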

Model                       SLO Tier   P25 Output Speed (tokens/s)   P95 TTFT (s)
DeepSeek V3.2 (Thinking)    SLO #1     30                            15
                            SLO #2     100                           15
                            SLO #3     300                           15
gpt-oss-120b (high)         SLO #1     100                           10
                            SLO #2     250                           10
                            SLO #3     500                           10
                            SLO #4     2,000                         10


Test Execution

The number of supported users for each SLO is determined via binary search after an initial exponential ramp. Metrics are calculated using steady-state token timing information. After a phase is completed and metrics have been calculated, the system determines whether SLOs have been violated, then proceeds to the next target concurrency level.
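The search described above can be sketched as an exponential ramp that brackets the failure point, followed by a binary search inside the bracket. Here `passes_slo` is a hypothetical probe that runs a full phase at a given concurrency and reports whether all SLOs held.

```python
def max_supported_users(passes_slo, start=1, limit=4096):
    """Find the highest concurrency that still meets the SLOs.
    `passes_slo(users)` is a hypothetical phase runner (not part of
    the published harness); `limit` caps the search."""
    # Exponential ramp: double until an SLO is violated or we hit the cap.
    users = start
    best = 0
    while users <= limit and passes_slo(users):
        best = users
        users *= 2
    if best == 0:
        return 0                       # even the starting load fails
    lo, hi = best, min(users, limit)   # lo passes; hi fails or is the cap
    # Binary search the bracket until adjacent levels remain.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes_slo(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

Each probe is a full test phase, so the ramp-then-bisect strategy keeps the total number of phases logarithmic in the final user count.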

[Figure: P25 output speed (t/s) vs. concurrent users, with the SLO threshold marked at 100 t/s. An accompanying phase table lists users, P25 speed, and pass/fail per phase; the search converges on a maximum of 11 users.]

As concurrent users increase, per-request output speed degrades. Each SLO tier defines a minimum acceptable speed, determining the maximum users that tier can support.

Metrics & Results

The following timing metrics are calculated during each test phase:

  • Time to First Token (TTFT): Per-request latency from sending the request to receiving the first output token.
  • Output Speed: Per-request output tokens per second, measured after the first token is received.
  • System Output Throughput: Aggregate output tokens per second across all concurrent users.

All metrics are filtered to steady-state periods where all users have been active for at least 30 seconds at target concurrency. The primary results — maximum concurrent users and output throughput at each SLO tier — are normalized per accelerator, per kW (TDP), per rack, and per $/hr to enable fair comparison across hardware configurations.
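The per-accelerator, per-kW, per-rack, and per-$/hr normalizations can be illustrated as below; all numeric inputs in the usage example are placeholders, not measured results.

```python
def normalize(max_users, throughput_tps, num_accelerators,
              tdp_kw_per_accelerator, accelerators_per_rack, price_per_hr):
    """Normalize headline results so deployments of different sizes are
    comparable. All parameter names are illustrative, not the benchmark's
    published schema."""
    racks = num_accelerators / accelerators_per_rack
    return {
        "users_per_accelerator": max_users / num_accelerators,
        "users_per_kw": max_users / (num_accelerators * tdp_kw_per_accelerator),
        "users_per_rack": max_users / racks,
        "throughput_per_dollar_hr": throughput_tps / price_per_hr,
    }
```

For example, an 8-accelerator node supporting 88 users at 44,000 aggregate t/s would report 11 users per accelerator, with the power- and cost-normalized figures following from the node's TDP and hourly price.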

Results are published on the Artificial Analysis leaderboard with full system configurations as disclosed by the provider. Providers may review their results for factual accuracy before publication.

Submissions

If you are a hardware vendor interested in submitting your system for AA-AgentPerf evaluation and are not already in touch, reach out to agentperf@artificialanalysis.ai.