Coding Agent Index Methodology
Overview
Artificial Analysis benchmarks coding agents on end-to-end software engineering tasks. The goal is to measure how well agents complete realistic coding work, and how performance varies across outcome, reliability, token usage, cost, and execution time.
Public results on the Coding Agent Index page are built from task-level benchmark attempts, which are aggregated into per-evaluation scores, pooled efficiency metrics, and the Artificial Analysis Coding Agent Index.
This page focuses on how the public Artificial Analysis Coding Agent Index is constructed, what benchmark components are currently included, and how the public pass@1, cost, token-usage, and execution-time metrics are derived.
Artificial Analysis Coding Agent Index
The current public Artificial Analysis Coding Agent Index is a composite benchmark score built from the configured benchmark components in the public coding-agents suite.
The point of the index is not to collapse all coding work into one benchmark task type. Different coding agents can perform very differently on repository Q&A, implementation and bug-fix tasks, and terminal-heavy workflows. The index exists to summarize those different benchmark families into one top-level performance view while preserving the per-benchmark breakdowns underneath.
Index Components
The current public index includes the following benchmark components:
| Evaluation | Field | Tasks | Repeats | Response Type | Scoring |
|---|---|---|---|---|---|
| DeepSWE | Long-Horizon Software Engineering | 113 | 3 | Code patch / repository changes | Program verifier pass/fail, pass@1 |
| Terminal-Bench v2 | Agentic Terminal Use | 84* | 3 | Terminal-based task execution | Test suite pass/fail, pass@1 |
| SWE-Atlas-QnA | Repository Q&A | 124 | 3 | Open Answer | Rubric-based grading, pass@1 |
* Terminal-Bench v2 contains 89 tasks in its original form; we exclude five because of environment compatibility issues, leaving 84.
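The table's task and repeat counts can be cross-checked with a few lines of arithmetic. The sketch below copies the counts from the table; the dictionary layout and variable names are illustrative only, and the attempt total assumes every task is run for the full number of repeats.

```python
# Task and repeat counts copied from the Index Components table.
COMPONENTS = {
    "DeepSWE": {"tasks": 113, "repeats": 3},
    "Terminal-Bench v2": {"tasks": 89 - 5, "repeats": 3},  # five tasks excluded
    "SWE-Atlas-QnA": {"tasks": 124, "repeats": 3},
}

# Total distinct tasks across the three components.
total_tasks = sum(c["tasks"] for c in COMPONENTS.values())

# Attempts per agent variant, ASSUMING every task gets all of its repeats.
total_attempts = sum(c["tasks"] * c["repeats"] for c in COMPONENTS.values())

print(total_tasks)     # 321
print(total_attempts)  # 963
```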
Evaluated Tasks
The current public index covers 321 evaluated tasks across the 3 benchmark components.
- Add iterable collection combinators to true-myth
- Abort pending body reads on shutdown
- Add rolling min, max, median, and quantile methods
- Format BigQuery pipe syntax queries correctly
- Add unified manifest stream output across Helm commands
- Add a deterministic CookieStore with modern Set-Cookie parsing
- Preserve restored query state in persisted snapshots
- Add an error-accumulating Validated container
- Add a persistent analysis cache to Vulture
- Add multi-module memory snapshots to wazero
- Add JSONPath query APIs to orderedmap and Starlark modules
- Add a per-origin circuit breaker to ofetch
- Add stepped slices for arrays and strings
- Add `matchEach` to ts-pattern
- Harden module loading, cache introspection, and script flags
- Add action pinning linting for actions and reusable workflows
- Add typed window function builders with OVER clauses
- Add typed blend range access and blend-if compositing
- Add deterministic map conflict detection to Y.Map writes
- Add dependency-aware async initialization to the container
- Add multipart response parsing to HTTPX
- Add scoped per-rule ignore markers to Obsidian Linter
- Partition report files by launcher and expand report templates
- Add keyset cursor pagination to `$find`
- Add ShapeIndex encoding and decoding
- Add entity snapshot and rollback APIs to Koota
- Restore RichLog follow-state parity and expand reflow behavior
- Add duration-aware sharding to Vitest
- Add grouped test phases with synchronized barriers
- Add bidirectional TOML table converters
- Add go:embed directive support for interpreted packages
- Add task snapshots, inspection, and diffing to aiomonitor
- Add drift detection and compliance baselines
- Expose accumulated streamed function-call args in SDK surfaces
- Add grouping-set and window-frame SQL helpers
- Add rule evaluation profiling to Rego
- Add typed variable bindings to Anko
- Add multiplexed ordered streams over KCP
- Add single-active-consumer priority and cancel tracking to virtual transports
- Reconstruct template strings in partial evaluation output
- Add bounded-memory spilling to SCC aggregation
- Fix isolated Go-side calls for Tengo callables and closures
- Add interprocedural taint checks for Bandit injection sinks
- Add deterministic multi-key sorting to fd
- Add task graph export with JSON, DOT, and text output
- Add default arguments to Anko function parameters
- Add deprecation, sunset, and successor headers to FastAPI routes
- Add a checker for broken doc comment links
- Add boundary modes to `@stencil`
- Add retry-aware publishing audit logs
- Add method declarations and interface dispatch to Scriggo
- Add partial structuring with error recovery to cattrs
- Add conditional required attributes to schemas
- Add worktree merge conflict handling
- Add session bundle recording and replay to IPython
- Add flattened dataclass fields to Mashumaro field options
- Add build-time grammar conflict analysis to participle
- Add XML diff, patch, and merge operations to etree
- Add streaming JSON iteration to HTTPX responses
- Validate daemon watch, status, and log lifecycle
- Add recursive schema composition to Valibot
- Add input key aliases to name mapping
- Add durability callbacks and wait APIs for sync writes
- Add duration encoding to TableVectorizer
- Add safe import checkpoints and invariant validation
- Add shorthand expansion and compression to the lexer
- Add RFC 5545 timezone interoperability to dateutil recurrence parsing
- Add atomic signal selectors to Kea
- Add consistent hash policy support to TrafficPolicy
- Fix PromQL label sorting across typed and untyped values
- Format CREATE TABLE DDL and add DDL parsing helpers
- Add trap coredump generation to wasmi
- Add hierarchical evaluation cancellation to Boa
- Add value-based query predicates to Koota
- Add request coalescing to `Runnable`
- Add explicit resource management declarations to the parser
- Add lazy recursive schemas with DTO and JSON Schema export
- Persist the fitted feature schema across evaluate, predict, serve, and export
- Add composite trait aspects to Koota
- Add tube multiplexing to pwntools
- Add scoped state data to state machine callbacks and history
- Add destructuring bindings to Tengo
- Add JSON Schema refs and dependency keywords
- Add link format conversion between wiki and markdown syntax
- Add implicit HEAD and automatic OPTIONS responses to FastAPI routes
- Add transparent encryption to dump uploads
- Add conditional option dependencies to Optique
- Preserve structure needed by stylesheet selectors
- Add policy-based alerting for failures, latency, and SSL expiry
- Add incremental cache controls to Bandit
- Coalesce qualifying choices into character classes
- Preserve ANSI resets during truncation and styling
- Add async autocomplete options and fetch lifecycle handling
- Add config file parsing to Cliffy commands
- Add HTML document format handling to Dasel
- Add CSS Grid layout to the Box component
- Add `\multicolumn` column spans to array-like environments
- Add a deferred mutation buffer to batch entity changes
- Add automatic table of contents generation for Obsidian linter
- Implement a deterministic IntersectionObserver in Happy DOM
- Add pair-level relation tracking modifiers
- Add error stack serialization to SuperJSON
- Complete Kitty keyboard phases and stable fallback key metadata
- Add structured nosec directives for regions and next line
- Add bail-on-test-failure handling to Testem
- Add try/catch error recovery to expr
- Add configurable array merge strategies to Helm value coalescing
- Implement recursive agent delegation through delegate_task tool calls
- Add SSE streaming endpoints to HttpApi
- Add transactional reload status and rollback tracking to Prometheus
- Add GraphQL incremental delivery with @defer and @stream
- Add dead-lettering, TTL, and overflow handling to virtual queues
- Reuse one toolbar across multiple Quill editors
- adaptive-rejection-sampler
- bn-fit-modify
- break-filter-js-from-html
- build-cython-ext
- build-pmars
- build-pov-ray
- caffe-cifar-10
- cancel-async-tasks
- chess-best-move
- circuit-fibsqrt
- cobol-modernization
- code-from-image
- compile-compcert
- configure-git-webserver
- constraints-scheduling
- count-dataset-tokens
- crack-7z-hash
- custom-memory-heap-crash
- db-wal-recovery
- distribution-search
- dna-assembly
- dna-insert
- extract-elf
- extract-moves-from-video
- feal-differential-cryptanalysis
- financial-document-processor
- fix-code-vulnerability
- fix-git
- fix-ocaml-gc
- gcode-to-text
- git-leak-recovery
- git-multibranch
- headless-terminal
- hf-model-inference
- install-windows-3.11
- kv-store-grpc
- large-scale-text-editing
- largest-eigenval
- llm-inference-batching-scheduler
- log-summary-date-ranges
- mailman
- make-mips-interpreter
- mcmc-sampling-stan
- merge-diff-arc-agi-task
- model-extraction-relu-logits
- modernize-scientific-stack
- mteb-leaderboard
- mteb-retrieve
- multi-source-data-merger
- nginx-request-logging
- openssl-selfsigned-cert
- overfull-hbox
- password-recovery
- path-tracing
- path-tracing-reverse
- polyglot-c-py
- polyglot-rust-c
- portfolio-optimization
- protein-assembly
- prove-plus-comm
- pypi-server
- pytorch-model-cli
- pytorch-model-recovery
- qemu-alpine-ssh
- qemu-startup
- query-optimize
- raman-fitting
- regex-chess
- regex-log
- reshard-c4-data
- rstan-to-pystan
- sam-cell-seg
- sanitize-git-repo
- schemelike-metacircular-eval
- sparql-university
- sqlite-db-truncate
- sqlite-with-gcov
- torch-pipeline-parallelism
- torch-tensor-parallelism
- train-fasttext
- tune-mjcf
- video-processing
- vulnerable-secret
- winning-avg-corewars
- 6905333b74f22949d97ba998
- 6905333b74f22949d97ba999
- 6905333b74f22949d97ba99a
- 6905333b74f22949d97ba99b
- 6905333b74f22949d97ba99d
- 6905333b74f22949d97ba99f
- 6905333b74f22949d97ba9a2
- 6905333b74f22949d97ba9a3
- 6905333b74f22949d97ba9a4
- 6905333b74f22949d97ba9a5
- 6905333b74f22949d97ba9a6
- 6905333b74f22949d97ba9a7
- 6905333b74f22949d97ba9a8
- 6905333b74f22949d97ba9a9
- 6905333b74f22949d97ba9aa
- 6905333b74f22949d97ba9ab
- 6905333b74f22949d97ba9ac
- 6905333b74f22949d97ba9ad
- 6905333b74f22949d97ba9ae
- 6905333b74f22949d97ba9af
- 6905333b74f22949d97ba9b1
- 6905333b74f22949d97ba9b2
- 6905333b74f22949d97ba9b3
- 6905333b74f22949d97ba9b5
- 6905333b74f22949d97ba9b6
- 6905333b74f22949d97ba9b7
- 6905333b74f22949d97ba9b8
- 6905333b74f22949d97ba9ba
- 6905333b74f22949d97ba9bb
- 6905333b74f22949d97ba9bc
- 6905333b74f22949d97ba9bd
- 6905333b74f22949d97ba9be
- 6905333b74f22949d97ba9bf
- 6905333b74f22949d97ba9c0
- 6905333b74f22949d97ba9c1
- 6905333b74f22949d97ba9c2
- 6905333b74f22949d97ba9c3
- 6905333b74f22949d97ba9c4
- 6905333b74f22949d97ba9c5
- 6905333b74f22949d97ba9c6
- 6905333b74f22949d97ba9c8
- 6905333b74f22949d97ba9c9
- 6905333b74f22949d97ba9ca
- 6905333b74f22949d97ba9cb
- 6905333b74f22949d97ba9cc
- 6905333b74f22949d97ba9cd
- 6905333b74f22949d97ba9ce
- 6905333b74f22949d97ba9cf
- 6905333b74f22949d97ba9d0
- 6905333b74f22949d97ba9d1
- 6905333b74f22949d97ba9d2
- 6905333b74f22949d97ba9d3
- 6905333b74f22949d97ba9d4
- 6905333b74f22949d97ba9d5
- 6905333b74f22949d97ba9d6
- 6905333b74f22949d97ba9d7
- 6905333b74f22949d97ba9d8
- 6905333b74f22949d97ba9d9
- 6905333b74f22949d97ba9db
- 6905333b74f22949d97ba9dc
- 6905333b74f22949d97ba9dd
- 6905333b74f22949d97ba9de
- 6905333b74f22949d97ba9e0
- 6905333b74f22949d97ba9e1
- 6905333b74f22949d97ba9e3
- 6905333b74f22949d97ba9e4
- 6905333b74f22949d97ba9e5
- 6905333b74f22949d97ba9e7
- 6905333b74f22949d97ba9e8
- 6905333b74f22949d97ba9e9
- 6905333b74f22949d97ba9eb
- 6905333b74f22949d97ba9ee
- 6905333b74f22949d97ba9f0
- 6905333b74f22949d97ba9f1
- 6905333b74f22949d97ba9f2
- 6905333b74f22949d97ba9f4
- 6905333b74f22949d97ba9f5
- 6905333b74f22949d97ba9f7
- 6905333b74f22949d97ba9f8
- 6905333b74f22949d97ba9f9
- 6905333b74f22949d97ba9fa
- 6905333b74f22949d97ba9fb
- 6905333b74f22949d97ba9fc
- 6905333b74f22949d97ba9fd
- 6905333b74f22949d97ba9ff
- 6905333b74f22949d97baa01
- 6905333b74f22949d97baa02
- 6905333b74f22949d97baa03
- 6905333b74f22949d97baa04
- 6905333b74f22949d97baa05
- 6905333b74f22949d97baa06
- 6905333b74f22949d97baa07
- 6905333b74f22949d97baa09
- 6905333b74f22949d97baa0b
- 6905333b74f22949d97baa0c
- 6905333b74f22949d97baa0d
- 6905333b74f22949d97baa0f
- 6905333b74f22949d97baa10
- 6905333b74f22949d97baa11
- 6905333b74f22949d97baa12
- 6905333b74f22949d97baa14
- 6905333b74f22949d97baa15
- 6905333b74f22949d97baa16
- 6905333b74f22949d97baa17
- 6905333b74f22949d97baa19
- 6905333b74f22949d97baa1a
- 6905333b74f22949d97baa1b
- 6905333b74f22949d97baa1c
- 6905333b74f22949d97baa1d
- 6905333b74f22949d97baa1e
- 6905333b74f22949d97baa1f
- 6905333b74f22949d97baa20
- 6905333b74f22949d97baa21
- 6905333b74f22949d97baa22
- 6905333b74f22949d97baa23
- 6905333b74f22949d97baa24
- 6905333b74f22949d97baa25
- 6905333b74f22949d97baa26
- 6905333b74f22949d97baa27
- 6905333b74f22949d97baa28
- 6905333b74f22949d97baa2a
- 6905333b74f22949d97baa2b
- 6905333b74f22949d97baa2c
- 6905333b74f22949d97baa2d
What The Index Aggregates
For each agent variant, Artificial Analysis computes a pass@1 score for each included benchmark component and then aggregates those component scores into the public index.
The same benchmark suite also underlies the public pooled efficiency metrics on the benchmark page, including cost to run, token usage, and execution time. That means the performance and efficiency views are aligned to the same underlying benchmark coverage rather than being drawn from unrelated runs.
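This page does not state how component scores are weighted when they are combined. As an illustration only, the sketch below assumes an unweighted mean of the three component pass@1 scores. The function name `composite_index` and the example scores are hypothetical, and the published index may use a different aggregation.

```python
from statistics import fmean


def composite_index(component_scores: dict[str, float]) -> float:
    """Combine per-component pass@1 scores into one number.

    ASSUMPTION: an unweighted mean. The source only says the component
    scores are "aggregated"; it does not give the weighting.
    """
    if not component_scores:
        raise ValueError("at least one component score is required")
    return fmean(list(component_scores.values()))


# Made-up component scores for one hypothetical agent variant.
example_scores = {
    "DeepSWE": 0.40,
    "Terminal-Bench v2": 0.50,
    "SWE-Atlas-QnA": 0.60,
}

index_value = composite_index(example_scores)
print(round(index_value, 3))  # 0.5
```

Under this assumption each benchmark family contributes equally regardless of how many tasks it contains, which is one way to keep a larger component from dominating the top-level view.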
Scoring And Outcomes
pass@1 Results
Each evaluated attempt receives a pass@1 result from the benchmark evaluator. Test-suite and program-verifier evaluations are scored as pass or fail, while rubric-based evaluations can award partial credit.
| Term | Definition |
|---|---|
| Binary pass@1 | A test-suite or program-verifier evaluation result where a task attempt receives either 1 for pass or 0 for fail. |
| Partial-credit pass@1 | A rubric-based evaluation result where a task attempt can receive any score from 0 to 1. |
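The two result types in the table can be sketched as two small scoring functions. The names `binary_score` and `rubric_score` are hypothetical, and the rubric rule shown here (the fraction of equally weighted criteria that were met) is an assumption chosen for illustration; the actual rubric grading may weight criteria differently.

```python
def binary_score(passed: bool) -> float:
    """Binary pass@1: 1.0 for a pass, 0.0 for a fail."""
    return 1.0 if passed else 0.0


def rubric_score(criteria_met: list[bool]) -> float:
    """Partial-credit pass@1 in the range 0 to 1.

    ASSUMPTION: every rubric criterion carries equal weight, so the score
    is the fraction of criteria met.
    """
    if not criteria_met:
        raise ValueError("a rubric needs at least one criterion")
    return sum(criteria_met) / len(criteria_met)


print(binary_score(True))                       # 1.0
print(rubric_score([True, True, False, True]))  # 0.75
```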
Per-Evaluation Scores
For each evaluation, the public benchmark score is the average of the task-level pass@1 results for a given agent variant. When an evaluation uses multiple repeats, those repeat results are included in the same average.
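The averaging described above can be sketched as a flat mean over every attempt, with repeats pooled into the same average. The tuple layout, task names, and scores below are made up for illustration.

```python
from statistics import fmean

# (task_id, repeat_number, pass@1 result) for one agent variant on one
# evaluation. All values are invented for this example.
attempts = [
    ("task-a", 1, 1.0), ("task-a", 2, 0.0), ("task-a", 3, 1.0),
    ("task-b", 1, 0.0), ("task-b", 2, 0.0), ("task-b", 3, 0.0),
]


def evaluation_score(rows: list[tuple[str, int, float]]) -> float:
    """Average every attempt's result; repeats share the same average."""
    if not rows:
        raise ValueError("no attempts to average")
    return fmean(score for _task, _repeat, score in rows)


score = evaluation_score(attempts)
print(round(score, 4))  # 0.3333
```

When every task has the same number of repeats, this flat mean equals the mean of the per-task means, so the two readings of the averaging rule give the same number.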
Efficiency Metrics
Cost, token usage, and execution time are reported as pooled means per task attempt across the current public coding-agents benchmark suite.
- Cost to run: average pay-per-token API cost per task attempt, based on provider token pricing rather than consumer plans.
- Token usage: average input, cached-input, cache-write, reasoning, and output tokens per task attempt.
- Execution time: average wall-clock runtime per task attempt, including full task wall time and the agent wall-time subset where available.
Where telemetry is missing for a given metric, those missing values are excluded from the corresponding average rather than treated as zero.
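The difference between excluding missing telemetry and treating it as zero is easiest to see with numbers. In this minimal sketch `None` stands for a missing measurement; the function name `pooled_mean` and the cost values are hypothetical.

```python
from statistics import fmean
from typing import Optional


def pooled_mean(values: list[Optional[float]]) -> Optional[float]:
    """Mean over the attempts that reported a value; None if none did."""
    present = [v for v in values if v is not None]
    return fmean(present) if present else None


# Cost in USD for three attempts; the second attempt has no telemetry.
costs = [0.50, None, 1.50]

excluded = pooled_mean(costs)                                   # missing value dropped
as_zero = fmean([v if v is not None else 0.0 for v in costs])   # for contrast only

print(excluded)           # 1.0
print(round(as_zero, 4))  # 0.6667
```

Counting the missing attempt as zero would understate the average cost, which is why the missing value is left out of the denominator as well as the numerator.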
In the cost metric, cached input is treated separately from uncached input where provider pricing supports that distinction, and cache-write charges are included when providers bill for creating prompt cache state. This is intended to reflect pay-per-token API pricing more closely than a flat per-token estimate.
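A cost calculation along these lines might look like the sketch below. Everything in it is an assumption made for illustration: the class and field names, the prices, the idea that the four token categories are disjoint, and the choice to bill reasoning tokens at the output rate. Real provider price sheets differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens. All prices used below are hypothetical."""
    uncached_input: float
    cached_input: float
    cache_write: float
    output: float


@dataclass(frozen=True)
class Usage:
    """Token counts for one attempt. ASSUMPTION: categories do not overlap."""
    uncached_input: int
    cached_input: int
    cache_write: int
    output: int  # ASSUMPTION: reasoning tokens are billed as output tokens


def attempt_cost(usage: Usage, pricing: Pricing) -> float:
    """Price each token category at its own rate and sum the charges."""
    total = (
        usage.uncached_input * pricing.uncached_input
        + usage.cached_input * pricing.cached_input
        + usage.cache_write * pricing.cache_write
        + usage.output * pricing.output
    )
    return total / 1_000_000


pricing = Pricing(uncached_input=3.00, cached_input=0.30, cache_write=3.75, output=15.00)
usage = Usage(uncached_input=100_000, cached_input=900_000, cache_write=50_000, output=20_000)

cost = attempt_cost(usage, pricing)
print(f"{cost:.4f}")  # 1.0575
```

In this made-up example most input tokens are cache reads, so pricing them at the uncached rate would overstate the cost by a large margin.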
Agent Settings
Public benchmark rows represent agent variants, not just model names. Settings that can change behavior are kept distinct in reporting.
Unless otherwise specified, we use each agent's default reasoning settings so the benchmark reflects the default user experience.
Benchmarking methodology may evolve over time as new evaluations and agent variants are added, but public comparisons are intended to reflect like-for-like agent variants within the published benchmark suite.
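One way to keep agent variants distinct in reporting is to group results by a key that includes the settings, not just the model name. The sketch below is a hypothetical data structure; the field names `agent`, `model`, and `reasoning` are assumptions and do not describe the actual reporting schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class AgentVariant:
    """Hypothetical grouping key: one public benchmark row per distinct value."""
    agent: str
    model: str
    reasoning: str = "default"  # default reasoning settings unless specified


# (variant, pass@1 result) pairs with invented names and scores.
results = [
    (AgentVariant("example-agent", "example-model"), 1.0),
    (AgentVariant("example-agent", "example-model"), 0.0),
    (AgentVariant("example-agent", "example-model", reasoning="high"), 1.0),
]

# Group scores by the full variant key, so a changed setting gets its own row.
scores_by_variant: dict[AgentVariant, list[float]] = defaultdict(list)
for variant, score in results:
    scores_by_variant[variant].append(score)

rows = {variant: fmean(scores) for variant, scores in scores_by_variant.items()}
for variant, mean_score in rows.items():
    print(variant.agent, variant.model, variant.reasoning, mean_score)
```

Because the dataclass is frozen, it is hashable and compares by value, so the same model with a different reasoning setting lands in a separate row.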
Version History
Version 1.1
June 2026—current
- Added DeepSWE (long-horizon software engineering)
- Removed SWE-Bench-Pro-Hard-AA from the Coding Agent Index
Version 1.0
May 2026—June 2026
- Initial release with SWE-Bench-Pro-Hard-AA (code generation), Terminal-Bench v2 (agentic terminal use), and SWE-Atlas-QnA (repository Q&A)