Artificial Analysis Coding Agent Benchmarks
We measure real-world performance of coding agents on software engineering tasks, including cost, token usage, and execution time. We compare how performance changes across agents, models, and execution settings.
To compare language models, see our model benchmarks.
Artificial Analysis Coding Agent Index
Composite index of 3 benchmarks:
- SWE-Bench-Pro-Hard-AA: Code generation, 150 questions
- Terminal-Bench v2: Agentic terminal use, 84 questions
- SWE-Atlas-QnA: Technical Q&A, 124 questions
Index represents the average pass@1 across 3 runs of each benchmark. View methodology
Highlights
Coding Agent Index
Time per Task
Cost per Task
Performance
Performance across the Artificial Analysis Coding Agent Index.
Artificial Analysis Coding Agent Index
Harness Comparison
Artificial Analysis Coding Agent Index by harness for Claude Opus 4.7.
Harness Comparison: Artificial Analysis Coding Agent Index
Token Usage
Token consumption across the Artificial Analysis Coding Agent Index, including total usage, token mix, efficiency, and per-benchmark breakdowns.
Token Usage per Task
Cost
Cost across the Artificial Analysis Coding Agent Index based on current per-token API pricing, including cache write pricing and cache discounts where available. Many users will access coding agent harnesses through subscription plan offerings rather than pay-per-token.
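As a rough illustration of how per-token pricing turns into a per-task cost figure, the sketch below applies separate rates to input, cache read, cache write, and output tokens. The token counts and prices are hypothetical placeholders, not the rates or results reported on this page.

```python
# Hypothetical illustration of per-task cost from token counts and per-token pricing.
# All prices and token counts below are made-up placeholders.

def cost_per_task(tokens: dict, prices_per_mtok: dict) -> float:
    """Estimate task cost in USD from token counts and per-million-token prices."""
    return sum(
        tokens.get(category, 0) / 1_000_000 * prices_per_mtok[category]
        for category in prices_per_mtok
    )

# Example: a task that re-read a lot of cached context and wrote a moderate patch.
tokens = {"input": 400_000, "cache_read": 2_500_000, "cache_write": 300_000, "output": 60_000}
prices = {"input": 3.00, "cache_read": 0.30, "cache_write": 3.75, "output": 15.00}  # USD per 1M tokens

print(f"Estimated cost per task: ${cost_per_task(tokens, prices):.2f}")
```

Because cache reads are priced well below fresh input tokens, two agents with similar total token usage can land at very different costs depending on how much of their context is cached.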
Cost per Task
Execution Time
Active agent runtime across the Artificial Analysis Coding Agent Index.
Time per Task
Frequently Asked Questions
The Artificial Analysis Coding Agent Index is our composite score for coding-agent performance across the public benchmark suite on this page. It currently combines SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA to capture implementation, terminal workflow, and repository-understanding performance in a single headline metric.
The current public index includes SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA. These benchmarks are combined because they stress different parts of the coding-agent workflow rather than repeating the same task format.
The public benchmark suite mixes several software engineering task styles:
- Q&A and repository-understanding tasks that focus on reading a codebase, understanding architecture or behavior, and producing a correct technical answer.
- Implementation and bug-fix tasks that require code changes and are closer to the classic make-a-patch-that-works framing.
- Terminal workflow tasks that test whether the agent can navigate a shell-driven environment, execute tools correctly, and complete a multi-step command-line workflow.
The suite also mixes effectively binary outcomes with rubric-scored partial-credit outcomes, which matters because an agent can show useful progress on a difficult task without fully solving it.
Q&A-style tasks emphasize repository understanding, code reading, tracing behavior, and producing a correct technical explanation. SWE-Bench-Pro-Hard-AA-style tasks are closer to shipping a working change: the agent has to understand the task, navigate the repository, edit files correctly, and satisfy an evaluator or test-based outcome under execution constraints. Those are related capabilities, but they are not identical. An agent can be strong at repository reasoning and still be weaker at reliable patch execution, or vice versa, which is one reason the composite index should be interpreted alongside the per-benchmark chart.
The benchmark page reports component scores as average pass@1. Each task run receives an evaluator-assigned score, which depending on the benchmark can be either binary or partial credit. A passed run is not automatically the same thing as a solved task: a run can complete cleanly and still receive a zero score. In the current methodology, a task is counted as solved only when the run passed and received a positive score. This matters especially for rubric-scored tasks such as SWE-Atlas-QnA, where partial credit can capture useful progress that would be lost in a strict pass-fail metric.
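A minimal sketch of that distinction, assuming each run records a pass/fail status and a numeric evaluator score (the field names and numbers are illustrative, not the actual telemetry schema):

```python
# Illustrative scoring logic: a run counts as solved only if it passed AND
# received a positive evaluator score. Scores may be binary or partial credit.

def is_solved(passed: bool, score: float) -> bool:
    """A cleanly completed run with a zero score is passed but not solved."""
    return passed and score > 0.0

runs = [
    {"passed": True,  "score": 1.0},   # binary benchmark, fully solved
    {"passed": True,  "score": 0.4},   # rubric-scored benchmark, partial credit
    {"passed": True,  "score": 0.0},   # completed cleanly, but evaluator gave zero
    {"passed": False, "score": 0.0},   # run failed outright
]

print([is_solved(r["passed"], r["score"]) for r in runs])  # [True, True, False, False]
```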
The index is computed from the benchmark components that make up the current public suite. For the current Artificial Analysis Coding Agent Index, the methodology is a simple average of the component benchmark scores, with each included benchmark weighted equally. Benchmark methodology can evolve as coverage improves, so scores are best compared within the published benchmark suite and its current component set rather than treated as a timeless absolute.
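As a sketch of that averaging, assuming each run's pass@1 score is already on a 0-100 scale (the benchmark names mirror the current suite; the numbers are placeholders, not real results):

```python
from statistics import mean

# Illustrative index computation: each benchmark's score is the average pass@1
# across its runs, and the index is an equally weighted average of those scores.
runs_per_benchmark = {
    "SWE-Bench-Pro-Hard-AA": [42.0, 44.5, 43.0],  # pass@1 per run, %
    "Terminal-Bench v2":     [51.0, 49.5, 50.5],
    "SWE-Atlas-QnA":         [63.0, 61.5, 62.5],
}

benchmark_scores = {name: mean(scores) for name, scores in runs_per_benchmark.items()}
coding_agent_index = mean(benchmark_scores.values())

print(benchmark_scores)
print(f"Coding Agent Index: {coding_agent_index:.1f}")
```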
Execution time on this page refers to wall-clock runtime per task, not just raw model latency. It is meant to reflect the user-facing time cost of running the whole agent workflow, including time spent reasoning, issuing tool calls, reading and writing files, executing shell steps, and waiting on model responses. An agent can therefore have a fast underlying model and still be slower overall if its workflow is longer or more tool-heavy.
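A minimal sketch of what wall-clock runtime per task means in practice: time the entire agent loop, including tool execution and waiting on model responses, rather than only the model calls. The `run_agent_on_task` function below is a hypothetical stand-in for a harness invocation, not a real API.

```python
import time

def run_agent_on_task(task_id: str) -> None:
    """Hypothetical stand-in for a full agent run: model calls, tool use, file edits, shell steps."""
    time.sleep(0.1)  # placeholder for the actual multi-step workflow

def timed_task(task_id: str) -> float:
    """Measure wall-clock runtime for the whole workflow, not just model latency."""
    start = time.monotonic()
    run_agent_on_task(task_id)
    return time.monotonic() - start

durations = [timed_task(f"task-{i}") for i in range(3)]
print(f"Average time per task: {sum(durations) / len(durations):.2f}s")
```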
Token usage is the average observed token consumption per task across the benchmark suite. On this page we break it out into input, cache, and output tokens. Input tokens are the tokens sent into the model, including prompts, instructions, tool context, and task context. Cache tokens are prompt tokens reused through prompt caching when the provider exposes that telemetry. Output tokens are the tokens generated by the model in its response. Token usage matters because it often drives cost and indicates how much context an agent consumes to get work done. Token efficiency and cost are not identical, however: providers price token categories differently, and caching can materially change the bill.
A higher index score means stronger performance across the included benchmark mix, but it does not mean the agent is best for every workflow. The index balances performance across the benchmark mix; it is not a direct measure of your specific latency, cost, tooling, or task-type priorities. Real-world choice still depends on whether your workflow looks more like repository Q&A, patching, or terminal execution, and on practical constraints such as IDE integration, model availability, and reliability.
These benchmarks measure coding-agent performance across repositories, tools, multi-step workflows, and evaluator-based outcomes. Results on this page reflect specific evaluated agent variants, not just generic product names: model choice, settings, and execution configuration can materially change outcomes, which is why a single agent family may appear in multiple variants in the results. For more background on benchmark runs, task-level scoring, and methodology, view the coding-agents benchmarking methodology page.