Coding Agent Index Methodology
Overview
Artificial Analysis benchmarks coding agents on end-to-end software engineering tasks. The goal is to measure how well agents complete realistic coding work, and how performance varies across outcomes, reliability, token usage, cost, and execution time.
Public results on the Coding Agent Index page are built from task-level benchmark attempts and are aggregated into per-evaluation scores, pooled efficiency metrics, and the Artificial Analysis Coding Agent Index.
This page focuses on how the public Artificial Analysis Coding Agent Index is constructed, what benchmark components are currently included, and how the public pass@1, cost, token-usage, and execution-time metrics are derived.
Artificial Analysis Coding Agent Index
The current public Artificial Analysis Coding Agent Index is a composite benchmark score built from the configured benchmark components in the public coding-agents suite.
The point of the index is not to collapse all coding work into one benchmark task type. Different coding agents can perform very differently on repository Q&A, implementation and bug-fix tasks, and terminal-heavy workflows. The index exists to summarize those different benchmark families into one top-level performance view while preserving the per-benchmark breakdowns underneath.
Index Components
The current public index includes the following benchmark components:
| Evaluation | Field | Questions | Repeats | Response Type | Scoring |
|---|---|---|---|---|---|
| SWE-Bench-Pro-Hard-AA | Code Generation | 150 | 3 | Code patch / repository changes | Test suite pass/fail, pass@1 |
| Terminal-Bench v2 | Agentic Terminal Use | 84* | 3 | Terminal-based task execution | Test suite pass/fail, pass@1 |
| SWE-Atlas-QnA | Repository Q&A | 124 | 3 | Open Answer | Rubric-based grading, pass@1 |
* Terminal-Bench v2 originally contains 89 tasks; we exclude five tasks because of environment compatibility issues.
What The Index Aggregates
For each agent variant, Artificial Analysis computes a pass@1 score for each included benchmark component and then aggregates those component scores into the public index.
The same benchmark suite also underlies the public pooled efficiency metrics on the benchmark page, including cost to run, token usage, and execution time. That means the performance and efficiency views are aligned to the same underlying benchmark coverage rather than being drawn from unrelated runs.
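The aggregation described above can be sketched as follows. This is a minimal illustration with made-up task results, and it assumes equal weighting across the three benchmark components; the published index may weight components differently.

```python
from statistics import mean

# Hypothetical task-level pass@1 results for one agent variant, grouped by
# benchmark component. Test-suite tasks are 0/1; rubric-graded tasks
# (SWE-Atlas-QnA) can take fractional values between 0 and 1.
component_results = {
    "SWE-Bench-Pro-Hard-AA": [1, 0, 1, 1, 0],
    "Terminal-Bench v2": [1, 1, 0, 0, 0],
    "SWE-Atlas-QnA": [0.8, 1.0, 0.5, 0.0, 1.0],
}

# Per-evaluation score: mean of task-level pass@1 results.
component_scores = {name: mean(r) for name, r in component_results.items()}

# Composite index: aggregate of component scores (equal weighting assumed
# here for illustration only).
index = mean(component_scores.values())
```

The per-benchmark breakdown in `component_scores` is preserved alongside the top-level `index`, matching the structure of the public page.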
Scoring And Outcomes
pass@1 Results
Each evaluated attempt receives a pass@1 result from the benchmark evaluator. Test-suite evaluations are scored as pass or fail, while rubric-based evaluations can award partial credit.
| Term | Definition |
|---|---|
| Binary pass@1 | A test-suite evaluation result where a task receives either 1 for pass or 0 for fail. |
| Partial-credit pass@1 | A rubric-based evaluation result where a task can receive any score between 0 and 1. |
Per-Evaluation Scores
For each evaluation, the public benchmark score is the average of the task-level pass@1 results for a given agent variant. When an evaluation uses multiple repeats, those repeat results are included in the same average.
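Concretely, with repeats the task-level and repeat-level results are pooled into a single average. The sketch below uses hypothetical results for an evaluation with four tasks and three repeats each:

```python
from statistics import mean

# Hypothetical pass@1 results: 4 tasks x 3 repeats for one agent variant.
results = [
    [1, 1, 0],  # task 1: passed 2 of 3 repeats
    [0, 0, 0],  # task 2: failed every repeat
    [1, 1, 1],  # task 3: passed every repeat
    [1, 0, 1],  # task 4: passed 2 of 3 repeats
]

# Repeat results are included in the same average as the tasks themselves.
score = mean(r for task in results for r in task)
```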
Efficiency Metrics
Cost, token usage, and execution time are reported as pooled per-task-attempt means across the current public coding-agents benchmark suite.
- Cost to run: average pay-per-token API cost per task, based on provider API token pricing rather than consumer subscription plans.
- Token usage: average input, cached-input, cache-write, reasoning, and output tokens per task.
- Execution time: average wall-clock runtime per task, reported as full task wall time and, where available, the agent-only subset of that wall time.
Where telemetry is missing for a given metric, those missing values are excluded from the corresponding average rather than treated as zero.
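The missing-telemetry handling can be sketched as below. The attempt records and field names are hypothetical; the point is that `None` values are dropped from the mean rather than counted as zero:

```python
from statistics import mean

# Hypothetical per-task-attempt telemetry; None marks missing values.
attempts = [
    {"cost_usd": 0.42, "wall_time_s": 310.0},
    {"cost_usd": 0.18, "wall_time_s": None},  # timing telemetry missing
    {"cost_usd": None, "wall_time_s": 95.0},  # cost telemetry missing
]

def pooled_mean(attempts, key):
    """Average a metric over attempts, excluding missing values
    rather than treating them as zero."""
    values = [a[key] for a in attempts if a[key] is not None]
    return mean(values) if values else None

avg_cost = pooled_mean(attempts, "cost_usd")     # mean of 0.42 and 0.18
avg_time = pooled_mean(attempts, "wall_time_s")  # mean of 310.0 and 95.0
```

Treating missing values as zero would bias both averages downward; exclusion keeps each pooled mean representative of the attempts that actually reported the metric.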
In the cost metric, cached input is priced separately from uncached input where provider pricing supports that distinction, and cache-write charges are included when providers bill for creating prompt cache state. This is intended to reflect pay-per-token API pricing more closely than a flat per-token estimate.
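A cache-aware cost calculation along these lines is sketched below. The per-million-token prices and token counts are hypothetical placeholders, not any provider's actual rates; real pricing varies by provider and model.

```python
# Hypothetical per-million-token prices in USD (illustration only).
PRICES = {
    "input": 3.00,         # uncached input tokens
    "cached_input": 0.30,  # cache-read tokens, typically discounted
    "cache_write": 3.75,   # billed when prompt cache state is created
    "output": 15.00,       # output (including reasoning) tokens
}

def task_cost(usage):
    """Cost of one task attempt, pricing cached and uncached input
    separately and including cache-write charges."""
    return sum(usage.get(kind, 0) / 1_000_000 * price
               for kind, price in PRICES.items())

usage = {
    "input": 200_000,
    "cached_input": 1_500_000,
    "cache_write": 100_000,
    "output": 50_000,
}
cost = task_cost(usage)
```

Under a flat per-token estimate the 1.5M cache-read tokens would dominate the bill; pricing them at the discounted rate is what brings the estimate closer to actual pay-per-token API charges.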
Agent Settings
Public benchmark rows represent agent variants, not just model names. Settings that can change agent behavior, such as the underlying model and reasoning configuration, are reported as distinct variants.
Unless otherwise specified, we use each agent's default reasoning settings so the benchmark reflects the default user experience.
Benchmarking methodology may evolve over time as new evaluations and agent variants are added, but public comparisons are intended to reflect like-for-like agent variants within the published benchmark suite.