Coding Agent Index Methodology

Overview

Artificial Analysis benchmarks coding agents on end-to-end software engineering tasks. The goal is to measure how well agents complete realistic coding work, and how performance varies across outcomes, reliability, token usage, cost, and execution time.

Public results on the Coding Agent Index page are built from task-level benchmark attempts and are aggregated into per-evaluation scores, pooled efficiency metrics, and the Artificial Analysis Coding Agent Index.

This page focuses on how the public Artificial Analysis Coding Agent Index is constructed, what benchmark components are currently included, and how the public pass@1, cost, token-usage, and execution-time metrics are derived.

Artificial Analysis Coding Agent Index

The current public Artificial Analysis Coding Agent Index is a composite benchmark score built from the configured benchmark components in the public coding-agents suite.

The point of the index is not to collapse all coding work into one benchmark task type. Different coding agents can perform very differently on repository Q&A, implementation and bug-fix tasks, and terminal-heavy workflows. The index exists to summarize those different benchmark families into one top-level performance view while preserving the per-benchmark breakdowns underneath.

Index Components

The current public index includes the following benchmark components:

Evaluation             | Field                | Questions | Repeats | Response Type                   | Scoring
SWE-Bench-Pro-Hard-AA  | Code Generation      | 150       | 3       | Code patch / repository changes | Test suite pass/fail, pass@1
Terminal-Bench v2      | Agentic Terminal Use | 84*       | 3       | Terminal-based task execution   | Test suite pass/fail, pass@1
SWE-Atlas-QnA          | Repository Q&A       | 124       | 3       | Open Answer                     | Rubric-based grading, pass@1

* Terminal-Bench v2 originally contains 89 tasks; we exclude five tasks because of environment compatibility issues.

What The Index Aggregates

For each agent variant, Artificial Analysis computes a pass@1 score for each included benchmark component and then aggregates those component scores into the public index.
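
For illustration, a minimal sketch of that aggregation, assuming an equal-weighted mean across the three components; the component scores below are hypothetical, and the published index may weight components differently.

```python
# Sketch of index aggregation, assuming an equal-weighted mean across
# benchmark components. Scores are hypothetical, not published results.
component_pass_at_1 = {
    "SWE-Bench-Pro-Hard-AA": 0.41,
    "Terminal-Bench v2": 0.38,
    "SWE-Atlas-QnA": 0.67,
}

index_score = sum(component_pass_at_1.values()) / len(component_pass_at_1)
print(f"Coding Agent Index: {index_score:.3f}")
```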

The same benchmark suite also underlies the public pooled efficiency metrics on the benchmark page, including cost to run, token usage, and execution time. That means the performance and efficiency views are aligned to the same underlying benchmark coverage rather than being drawn from unrelated runs.

Scoring And Outcomes

pass@1 Results

Each evaluated attempt receives a pass@1 result from the benchmark evaluator. Test-suite evaluations are scored as pass or fail, while rubric-based evaluations can award partial credit.

Term                  | Definition
Binary pass@1         | A test-suite evaluation result where a task receives either 1 for pass or 0 for fail.
Partial-credit pass@1 | A rubric-based evaluation result where a task can receive any score between 0 and 1.
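
A small sketch of the two result types, assuming a test-suite task passes only when its full suite passes and that rubric criteria are weighted equally; both are illustrative assumptions, not the graders' actual rules.

```python
# Binary pass@1 (assumption: a task passes only if every test passes).
tests_passed = [True, True, False]
binary_pass_at_1 = 1 if all(tests_passed) else 0      # -> 0

# Partial-credit pass@1 (assumption: equally weighted rubric criteria).
criteria_satisfied = [True, True, True, False]
rubric_pass_at_1 = sum(criteria_satisfied) / len(criteria_satisfied)  # -> 0.75
```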

Per-Evaluation Scores

For each evaluation, the public benchmark score is the average of the task-level pass@1 results for a given agent variant. When an evaluation uses multiple repeats, those repeat results are included in the same average.
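
A minimal sketch of that averaging, with repeats counted as separate attempts; the task IDs and scores below are hypothetical.

```python
# Per-evaluation score: mean of task-level pass@1 results across all
# attempts, with each repeat contributing as a separate attempt.
# Attempts are (task_id, repeat, pass_at_1); values are hypothetical.
attempts = [
    ("task-001", 1, 1.0),
    ("task-001", 2, 0.0),
    ("task-001", 3, 1.0),
    ("task-002", 1, 0.5),    # rubric-graded task with partial credit
    ("task-002", 2, 0.75),
    ("task-002", 3, 0.5),
]

evaluation_score = sum(score for _, _, score in attempts) / len(attempts)
print(f"pass@1: {evaluation_score:.3f}")  # 0.625
```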

Efficiency Metrics

Cost, token usage, and execution time are reported as pooled per-task-attempt means across the current public coding-agents benchmark suite.

  • Cost to run: average pay-per-token API cost per task, based on provider token pricing rather than consumer subscription plans.
  • Token usage: average input, cache-read, cache-write, reasoning, and output tokens per task.
  • Execution time: average wall-clock runtime per task, including full task wall time and the agent wall-time subset where available.

Where telemetry is missing for a given metric, those missing values are excluded from the corresponding average rather than treated as zero.
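
A sketch of that pooling rule, with None standing in for missing telemetry; the values are hypothetical.

```python
# Pooled per-task-attempt mean: missing telemetry (None) is dropped
# from the average rather than counted as zero.
def pooled_mean(values):
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

execution_times_s = [412.0, None, 388.5, 501.2]  # hypothetical wall times
print(pooled_mean(execution_times_s))  # averages only the three present values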

In the cost metric, cached input is treated separately from uncached input where provider pricing supports that distinction, and cache-write charges are included when providers bill for creating prompt cache state. This is intended to reflect pay-per-token API pricing more closely than a flat per-token estimate.
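
As a rough sketch of that cost calculation, with illustrative per-million-token prices and hypothetical token counts (not any provider's actual price schedule):

```python
# Per-attempt cost sketch distinguishing uncached input, cached input,
# cache writes, and output tokens. Prices are illustrative USD per
# million tokens, not a real provider's schedule.
PRICES = {
    "input": 3.00,        # uncached input
    "cache_read": 0.30,   # cached input, billed at the cache-read rate
    "cache_write": 3.75,  # charged when providers bill for cache creation
    "output": 15.00,
}

def attempt_cost(tokens: dict) -> float:
    return sum(tokens[k] / 1e6 * PRICES[k] for k in PRICES)

tokens = {
    "input": 120_000,       # hypothetical counts
    "cache_read": 900_000,
    "cache_write": 80_000,
    "output": 40_000,       # includes reasoning tokens where billed as output
}
print(f"${attempt_cost(tokens):.4f}")  # $1.5300
```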

Agent Settings

Public benchmark rows represent agent variants, not just model names. Settings that can change behavior are kept distinct in reporting.

Unless otherwise specified, we use each agent's default reasoning settings so the benchmark reflects the default user experience.

Benchmarking methodology may evolve over time as new evaluations and agent variants are added, but public comparisons are intended to reflect like-for-like agent variants within the published benchmark suite.