Artificial Analysis Coding Agent Benchmarks
We measure real-world performance of coding agents on software engineering tasks, including cost, token usage, and execution time. We compare how performance changes across agents, models, and execution settings.
To compare language models, see our model benchmarks.
Artificial Analysis Coding Agent Index
Composite index of 3 benchmarks:
- SWE-Bench-Pro-Hard-AA: Code generation, 150 questions
- Terminal-Bench v2: Agentic terminal use, 84 questions
- SWE-Atlas-QnA: Technical Q&A, 124 questions
Index represents the average pass@1 across 3 runs of each benchmark. View methodology
Highlights
Coding Agent Index
Time per Task
Cost per Task
Performance
Performance across the Artificial Analysis Coding Agent Index.
Artificial Analysis Coding Agent Index
Harness Comparison
Artificial Analysis Coding Agent Index by harness for Claude Opus 4.7.
Harness Comparison: Artificial Analysis Coding Agent Index
Token Usage
Token consumption across the Artificial Analysis Coding Agent Index, including total usage, token mix, efficiency, and per-benchmark breakdowns.
Token Usage per Task
Cost
Cost across the Artificial Analysis Coding Agent Index based on current per-token API pricing, including cache write pricing and cache discounts where available. Many users will access coding agent harnesses through subscription plan offerings rather than pay-per-token.
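As a rough illustration of how per-token pricing turns into a per-task cost figure, the sketch below applies separate rates to input, cache read, cache write, and output tokens. The token counts and prices are hypothetical placeholders, not the rates or results reported on this page.

```python
# Hypothetical illustration of per-task cost from token counts and per-token pricing.
# All prices and token counts below are made-up placeholders.

def cost_per_task(tokens: dict, prices_per_mtok: dict) -> float:
    """Estimate task cost in USD from token counts and per-million-token prices."""
    return sum(
        tokens.get(category, 0) / 1_000_000 * prices_per_mtok[category]
        for category in prices_per_mtok
    )

# Example: a task that re-read a lot of cached context and wrote a moderate patch.
tokens = {"input": 400_000, "cache_read": 2_500_000, "cache_write": 300_000, "output": 60_000}
prices = {"input": 3.00, "cache_read": 0.30, "cache_write": 3.75, "output": 15.00}  # USD per 1M tokens

print(f"Estimated cost per task: ${cost_per_task(tokens, prices):.2f}")
```

Because cache reads are priced well below fresh input tokens, two agents with similar total token usage can land at very different costs depending on how much of their context is cached.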
Cost per Task
Execution Time
Active agent runtime across the Artificial Analysis Coding Agent Index.
Time per Task
Frequently Asked Questions
The Artificial Analysis Coding Agent Index is our composite score for coding-agent performance across the public benchmark suite on this page. It currently combines SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA to capture implementation, terminal workflow, and repository-understanding performance in a single headline metric.
The current public index includes SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA. These benchmarks are combined because they stress different parts of the coding-agent workflow rather than repeating the same task format.
The public benchmark suite mixes several software engineering task styles:
- Q&A and repository-understanding tasks that focus on reading a codebase, understanding architecture or behavior, and producing a correct technical answer.
- Implementation and bug-fix tasks that require code changes and are closer to the classic make-a-patch-that-works framing.
- Terminal workflow tasks that test whether the agent can navigate a shell-driven environment, execute tools correctly, and complete a multi-step command-line workflow.
The suite also mixes effectively binary outcomes with rubric-scored partial-credit outcomes, which matters because an agent can show useful progress on a difficult task without fully solving it.
Q&A-style tasks emphasize repository understanding, code reading, tracing behavior, and producing a correct technical explanation. SWE-Bench-Pro-Hard-AA-style tasks are closer to shipping a working change: the agent has to understand the task, navigate the repository, edit files correctly, and satisfy an evaluator or test-based outcome under execution constraints. Those are related capabilities, but they are not identical. An agent can be strong at repository reasoning and still be weaker at reliable patch execution, or vice versa, which is one reason the composite index should be interpreted alongside the per-benchmark chart.
The benchmark page reports component scores as average pass@1. Each task run receives an evaluator-assigned score, which depending on the benchmark can be either binary or partial credit. A passed run is not automatically the same thing as a solved task: a run can complete cleanly and still receive a zero score. In the current methodology, a task is counted as solved only when the run passed and received a positive score. This matters especially for rubric-scored tasks such as SWE-Atlas-QnA, where partial credit can capture useful progress that would be lost in a strict pass-fail metric.
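A minimal sketch of that distinction, assuming each run records a pass/fail status and a numeric evaluator score (the field names and numbers are illustrative, not the actual telemetry schema):

```python
# Illustrative scoring logic: a run counts as solved only if it passed AND
# received a positive evaluator score. Scores may be binary or partial credit.

def is_solved(passed: bool, score: float) -> bool:
    """A cleanly completed run with a zero score is passed but not solved."""
    return passed and score > 0.0

runs = [
    {"passed": True,  "score": 1.0},   # binary benchmark, fully solved
    {"passed": True,  "score": 0.4},   # rubric-scored benchmark, partial credit
    {"passed": True,  "score": 0.0},   # completed cleanly, but evaluator gave zero
    {"passed": False, "score": 0.0},   # run failed outright
]

print([is_solved(r["passed"], r["score"]) for r in runs])  # [True, True, False, False]
```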
The index is computed from the benchmark components that make up the current public suite. For the current Artificial Analysis Coding Agent Index, the methodology is a simple average of the component benchmark scores, with each included benchmark weighted equally. Benchmark methodology can evolve as coverage improves, so scores are best compared within the published benchmark suite and its current component set rather than treated as a timeless absolute.
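As a sketch of that averaging, assuming each run's pass@1 score is already on a 0-100 scale (the benchmark names mirror the current suite; the numbers are placeholders, not real results):

```python
from statistics import mean

# Illustrative index computation: each benchmark's score is the average pass@1
# across its runs, and the index is an equally weighted average of those scores.
runs_per_benchmark = {
    "SWE-Bench-Pro-Hard-AA": [42.0, 44.5, 43.0],  # pass@1 per run, %
    "Terminal-Bench v2":     [51.0, 49.5, 50.5],
    "SWE-Atlas-QnA":         [63.0, 61.5, 62.5],
}

benchmark_scores = {name: mean(scores) for name, scores in runs_per_benchmark.items()}
coding_agent_index = mean(benchmark_scores.values())

print(benchmark_scores)
print(f"Coding Agent Index: {coding_agent_index:.1f}")
```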
Execution time on this page refers to wall-clock runtime per task, not just raw model latency. It is meant to reflect the user-facing time cost of running the whole agent workflow, including time spent reasoning, issuing tool calls, reading and writing files, executing shell steps, and waiting on model responses. An agent can therefore have a fast underlying model and still be slower overall if its workflow is longer or more tool-heavy.
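A minimal sketch of what wall-clock runtime per task means in practice: time the entire agent loop, including tool execution and waiting on model responses, rather than only the model calls. The `run_agent_on_task` function below is a hypothetical stand-in for a harness invocation, not a real API.

```python
import time

def run_agent_on_task(task_id: str) -> None:
    """Hypothetical stand-in for a full agent run: model calls, tool use, file edits, shell steps."""
    time.sleep(0.1)  # placeholder for the actual multi-step workflow

def timed_task(task_id: str) -> float:
    """Measure wall-clock runtime for the whole workflow, not just model latency."""
    start = time.monotonic()
    run_agent_on_task(task_id)
    return time.monotonic() - start

durations = [timed_task(f"task-{i}") for i in range(3)]
print(f"Average time per task: {sum(durations) / len(durations):.2f}s")
```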
Token usage is the average observed token consumption per task across the benchmark suite. On this page we break it out into input, cache, and output tokens. Input tokens are the tokens sent into the model, including prompts, instructions, tool context, and task context. Cache tokens are prompt tokens reused through prompt caching when the provider exposes that telemetry. Output tokens are the tokens generated by the model in its response. Token usage matters because it often drives cost and indicates how much context an agent consumes to get work done. Token efficiency and cost are not identical, however: providers price token categories differently, and caching can materially change the bill.
A higher index score means stronger performance across the included benchmark mix, but it does not mean the agent is best for every workflow. The index balances performance across the benchmark mix; it is not a direct measure of your specific latency, cost, tooling, or task-type priorities. Real-world choice still depends on whether your workflow looks more like repository Q&A, patching, or terminal execution, and on practical constraints such as IDE integration, model availability, and reliability.
These benchmarks measure coding-agent performance across repositories, tools, multi-step workflows, and evaluator-based outcomes. Results on this page reflect specific evaluated agent variants, not just generic product names: model choice, settings, and execution configuration can materially change outcomes, which is why a single agent family may appear in multiple variants in the results. For more background on benchmark runs, task-level scoring, and methodology, view the coding-agents benchmarking methodology page.