Codex vs. Opencode

Name: Coding Agent Index
Creator: Artificial Analysis
License: https://artificialanalysis.ai/docs/legal/Terms-of-Use.pdf

Comparison between Codex and Opencode across the Artificial Analysis Coding Agent Index, including benchmark scores, cost, execution time, and token usage.

For details, see our methodology page.

Highlights

Coding Agent Index

Artificial Analysis Coding Agent Index v1.3 · Higher is better

Time per Task

Average agent wall time per task · Lower is better

Cost per Task

Average API cost per task (USD) · Lower is better

Comparison

Side-by-side comparison of Codex and Opencode.

Coding Agent Comparison

Metric	Codex GPT-5.6 Sol (max)	Opencode Muse Spark 1.1 (xhigh)	Analysis
Agent Harness	Codex	Opencode
Representative Model	GPT-5.6 Sol (max)	Muse Spark 1.1 (xhigh)
Coding Agent Index	67	54	Codex has a higher Coding Agent Index than Opencode
DeepSWE	69%	54%	Codex has a higher DeepSWE score than Opencode
Terminal-Bench v2	88%	73%	Codex has a higher Terminal-Bench v2 score than Opencode
SWE-Atlas-QnA	43%	33%	Codex has a higher SWE-Atlas-QnA score than Opencode
Cost per Task	$7.08	$1.43	Opencode has a lower cost per task than Codex
Time per Task	10.2m	12.6m	Codex has a lower time per task than Opencode
Turns per Task	114.2	55.4	Opencode uses fewer turns per task than Codex
Token Usage per Task	13.2M	12.2M	Opencode uses fewer tokens per task than Codex
Cache Hit Rate	90%	95%	Opencode has a higher cache hit rate than Codex
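
As a rough aid for reading the table, the gaps can be restated as Codex-to-Opencode ratios. The sketch below only divides values copied from the table; the dictionary layout and the function name `codex_to_opencode_ratio` are illustrative choices, not part of the index. Each column pairs a harness with a different representative model, so the ratios describe harness-and-model combinations rather than the harnesses alone.

```python
# Values copied from the comparison table: (Codex, Opencode).
METRICS = {
    "Coding Agent Index": (67, 54),
    "Cost per Task (USD)": (7.08, 1.43),
    "Time per Task (min)": (10.2, 12.6),
    "Turns per Task": (114.2, 55.4),
    "Token Usage per Task (M)": (13.2, 12.2),
}


def codex_to_opencode_ratio(metric: str) -> float:
    """Return the Codex value divided by the Opencode value, rounded to 2 places."""
    codex, opencode = METRICS[metric]
    return round(codex / opencode, 2)


if __name__ == "__main__":
    for name in METRICS:
        print(f"{name}: {codex_to_opencode_ratio(name)}x")
```

A ratio above 1 means the Codex value is larger, which is favorable for the index and unfavorable for cost, time, turns, and tokens.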

Model Variants

Evaluated model variants for Codex and Opencode.

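Model Variants

Agent Harness	Model	Coding Agent Index	DeepSWE	Terminal-Bench v2	SWE-Atlas-QnA	Cost per Task	Time per Task	Token Usage per Task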
Codex	GPT-5.6 Sol (max)	67	69%	88%	43%	$7.08	10.2m	13.2M
Codex	GPT-5.6 Sol (xhigh)	65	67%	86%	42%	$5.24	7.4m	9.9M
Codex	GPT-5.6 Sol (high)	64	65%	83%	45%	$4.14	6.3m	8.1M
Codex	GPT-5.6 Terra (max)	62	67%	84%	36%	$2.76	8.4m	9.5M
Codex	GPT-5.5 (xhigh)	61	64%	84%	36%	$5.07	10.1m	12.3M
Codex	GPT-5.6 Sol (medium)	61	64%	78%	40%	$2.99	5.2m	5.8M
Codex	GPT-5.6 Luna (max)	59	63%	80%	33%	$1.57	8.0m	15.5M
Codex	GPT-5.6 Terra (xhigh)	57	58%	81%	32%	$1.90	6.9m	6.5M
Codex	GPT-5.6 Terra (high)	56	60%	76%	31%	$1.59	6.2m	5.5M
Codex	GPT-5.6 Luna (xhigh)	55	57%	76%	31%	$1.26	6.6m	12.3M
Codex	GPT-5.5 (medium)	54	57%	76%	31%	$2.75	6.4m	7M
Codex	GPT-5.6 Sol (low)	54	53%	73%	34%	$1.72	3.7m	3.2M
Codex	GPT-5.6 Luna (high)	51	53%	72%	29%	$0.96	5.7m	9.5M
Codex	GPT-5.6 Terra (medium)	48	46%	69%	28%	$0.90	4.3m	3.1M
Codex	GPT-5.6 Sol (none)	43	35%	61%	34%	$1.40	3.4m	3.4M
Codex	GPT-5.6 Luna (medium)	42	37%	63%	27%	$0.47	3.4m	4.4M
Codex	GPT-5.4 (medium)	39	25%	70%	22%	$2.42	7.1m	5.9M
Codex	GPT-5.6 Terra (low)	37	30%	58%	23%	$0.48	2.8m	1.5M
Codex	GPT-5.6 Luna (low)	25	10%	50%	15%	$0.21	1.9m	1.5M
Codex	GPT-5.6 Terra (none)	24	13%	39%	19%	$0.37	1.8m	1.1M
Codex	GPT-5.6 Luna (none)	20	6%	37%	17%	$0.35	2.5m	3.6M
Opencode	Muse Spark 1.1 (xhigh)	54	54%	73%	33%	$1.43	12.6m	12.2M
Opencode	Opus 4.7 (medium)	50	40%	75%	35%	$2.93	12.2m	7.6M

Performance

Performance across the Artificial Analysis Coding Agent Index.

Artificial Analysis Coding Agent Index

Artificial Analysis Coding Agent Index v1.3 incorporates 3 benchmarks: DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA · Higher is better

What This Metric Means

The Artificial Analysis Coding Agent Index is a composite score built from DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA.

It is useful for quick comparison, but it should be read alongside the per-eval breakdowns. Two agents with similar index values can still have different strengths across repository tasks, terminal workflows, and rubric-based evaluations.
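
This page does not state how the three benchmarks are weighted. As a sanity check under a loudly stated assumption, the sketch below treats the composite as an unweighted mean and compares it with the published index for a few rows copied from the Model Variants table. The equal weighting and the function names are assumptions for illustration; the methodology page is the authority on the actual formula.

```python
from statistics import mean

# (variant, published index, DeepSWE %, Terminal-Bench v2 %, SWE-Atlas-QnA %)
# Rows copied from the Model Variants table.
ROWS = [
    ("Codex GPT-5.6 Sol (max)", 67, 69, 88, 43),
    ("Codex GPT-5.6 Sol (xhigh)", 65, 67, 86, 42),
    ("Codex GPT-5.6 Terra (max)", 62, 67, 84, 36),
    ("Opencode Muse Spark 1.1 (xhigh)", 54, 54, 73, 33),
    ("Opencode Opus 4.7 (medium)", 50, 40, 75, 35),
]


def unweighted_composite(scores):
    """ASSUMPTION: equal weights. The real weighting is not stated on this page."""
    return mean(scores)


def gap_to_published(row):
    """Absolute difference between the assumed composite and the published index."""
    _, published, *scores = row
    return abs(unweighted_composite(scores) - published)


if __name__ == "__main__":
    for row in ROWS:
        print(f"{row[0]}: gap {gap_to_published(row):.2f}")
```

For the rows sampled here the gap stays within one index point, which is consistent with equal weighting applied to unrounded scores but does not prove it.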

Token Usage

Token consumption across the Artificial Analysis Coding Agent Index.

Token Usage per Task

Average input, cache, and output tokens per task

Prompt cache hit rates can vary significantly by provider routing, which can materially change effective cost.

Input Tokens

Non-cached input tokens sent to the model, including prompts, instructions, tool context, and task context that were not served from prompt cache.

Cached Input Tokens

Reused prompt tokens billed through provider prompt caching when that telemetry is available, rather than being processed as a fully fresh input each time.

Why Cache Usage Varies

Some providers route repeated requests across different backend replicas. When prompt cache state is not shared consistently across those replicas, a model may receive fewer cache hits even when the benchmark task flow is otherwise identical.

We do not add custom relay headers or provider-specific affinity controls to force higher cache reuse, because that would make the benchmark less representative of a typical user setup. As a result, reported costs reflect the cache behavior observed through the configured provider path, not an optimized best-case cache scenario.
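
To make the cache hit rate concrete, here is a minimal sketch that assumes the rate is the cached share of all prompt tokens. That definition, the function name `cache_hit_rate`, and the token counts are assumptions for illustration; none of them come from the benchmark's telemetry.

```python
def cache_hit_rate(uncached_input_tokens: int, cached_input_tokens: int) -> float:
    """Share of prompt tokens served from cache (ASSUMED definition)."""
    total_prompt = uncached_input_tokens + cached_input_tokens
    if total_prompt == 0:
        return 0.0
    return cached_input_tokens / total_prompt


# HYPOTHETICAL per-task token counts, chosen only to illustrate the arithmetic.
# Same task flow, but the second case lands on replicas without shared cache state.
consistent_routing = cache_hit_rate(
    uncached_input_tokens=500_000, cached_input_tokens=9_500_000
)
scattered_routing = cache_hit_rate(
    uncached_input_tokens=3_000_000, cached_input_tokens=7_000_000
)

if __name__ == "__main__":
    print(f"consistent routing: {consistent_routing:.0%}")
    print(f"scattered routing:  {scattered_routing:.0%}")
```

Both cases send the same number of prompt tokens; only the split between cached and uncached input changes, which is what drives the difference in effective cost.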

Output Tokens

Tokens returned by the model in its visible response during the task.

Artificial Analysis Coding Agent Index vs. Total Tokens

Artificial Analysis Coding Agent Index vs. average total tokens per task

How to Read This Chart

Each point represents a coding-agent variant, plotted by Coding Agent Index against average total tokens per task. The most efficient agents combine a higher index with lower token usage.

Cost

Pay-per-token API cost across the Artificial Analysis Coding Agent Index, based on current per-token pricing.

Cost per Task

Average pay-per-token API cost per task (USD) · Lower is better

What Cost Is Measuring

This chart shows the average pay-per-token API cost per task across the Artificial Analysis Coding Agent Index, spanning DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA.

Where applicable, that cost model includes standard input pricing, discounted cached-input pricing, separate cache-write charges, and output pricing rather than treating all prompt tokens as if they were billed at the same uncached input rate.

It is intended to show pay-per-token API cost, not consumer plan pricing or the full operational cost of deploying the system in production. Infrastructure, engineering, and supervision costs are not the focus of this metric.
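
The cost model described above can be sketched as a sum over token categories, each billed at its own rate. Everything specific in this sketch is an assumption: the class and function names, the treatment of cache writes as a category separate from uncached input, and all prices and token counts, which are hypothetical and not any provider's real pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens. All values used below are HYPOTHETICAL."""
    input_per_m: float
    cached_input_per_m: float
    cache_write_per_m: float
    output_per_m: float


@dataclass(frozen=True)
class TokenUsage:
    """Token counts for one task, split by billing category."""
    uncached_input: int
    cached_input: int
    cache_write: int
    output: int


def task_cost(usage: TokenUsage, pricing: Pricing) -> float:
    """Bill each token category at its own rate."""
    return (
        usage.uncached_input * pricing.input_per_m
        + usage.cached_input * pricing.cached_input_per_m
        + usage.cache_write * pricing.cache_write_per_m
        + usage.output * pricing.output_per_m
    ) / 1_000_000


def flat_rate_cost(usage: TokenUsage, pricing: Pricing) -> float:
    """Bill every prompt token at the uncached input rate, for comparison."""
    prompt = usage.uncached_input + usage.cached_input + usage.cache_write
    return (prompt * pricing.input_per_m + usage.output * pricing.output_per_m) / 1_000_000


# HYPOTHETICAL prices and usage, for illustration only.
example_pricing = Pricing(
    input_per_m=2.00, cached_input_per_m=0.20, cache_write_per_m=2.50, output_per_m=10.00
)
example_usage = TokenUsage(
    uncached_input=400_000, cached_input=9_000_000, cache_write=500_000, output=100_000
)

if __name__ == "__main__":
    print(f"per-category cost: ${task_cost(example_usage, example_pricing):.2f}")
    print(f"flat-rate cost:    ${flat_rate_cost(example_usage, example_pricing):.2f}")
```

With a cache-heavy workload like the hypothetical one here, the gap between the two functions shows why treating all prompt tokens as uncached input would overstate cost.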

Artificial Analysis Coding Agent Index vs. Cost per Task

Artificial Analysis Coding Agent Index vs. average pay-per-token API cost per task (USD)

How to Read This Chart

Each point represents a coding-agent variant. Farther right means higher benchmark performance, while lower on the chart means lower average cost per task. The most efficient agents sit toward the lower-right: stronger results at lower cost.
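
One way to make "stronger results at lower cost" precise is a Pareto frontier: the variants for which no other variant has both an index at least as high and a cost at least as low. The sketch below applies that idea to a subset of rows copied from the Model Variants table. The frontier is an illustrative technique, not how the page defines its chart regions, and the names `Variant`, `dominates`, and `pareto_frontier` are hypothetical.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    name: str
    index: int    # Coding Agent Index, higher is better
    cost: float   # USD per task, lower is better


# A subset of rows copied from the Model Variants table.
VARIANTS = [
    Variant("Codex GPT-5.6 Sol (max)", 67, 7.08),
    Variant("Codex GPT-5.6 Sol (xhigh)", 65, 5.24),
    Variant("Codex GPT-5.6 Terra (max)", 62, 2.76),
    Variant("Codex GPT-5.5 (xhigh)", 61, 5.07),
    Variant("Codex GPT-5.6 Luna (max)", 59, 1.57),
    Variant("Codex GPT-5.6 Luna (xhigh)", 55, 1.26),
    Variant("Opencode Muse Spark 1.1 (xhigh)", 54, 1.43),
    Variant("Opencode Opus 4.7 (medium)", 50, 2.93),
]


def dominates(a: Variant, b: Variant) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.index >= b.index and a.cost <= b.cost
    strictly_better = a.index > b.index or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(variants: list[Variant]) -> list[Variant]:
    """Keep the variants that no other variant dominates."""
    return [v for v in variants if not any(dominates(o, v) for o in variants)]


if __name__ == "__main__":
    for v in pareto_frontier(VARIANTS):
        print(f"{v.name}: index {v.index}, ${v.cost:.2f}")
```

Frontier membership depends on which rows are included, so adding or removing variants can change the result.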

Execution Time

Active agent runtime across the Artificial Analysis Coding Agent Index.

Time per Task

Average agent wall time per task · Lower is better

What Execution Time Is Measuring

This chart uses agent wall time: how long the agent process was actively running on each task.

It does not include environment startup, verifier or judge time, or other harness overhead, so it is a cleaner comparison of how long the agent itself was working.
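
The measurement boundary described above can be sketched as a timer that wraps only the agent phase. This is a hypothetical harness shape, not the benchmark's actual code: `timed_task` and its three callables are invented for illustration, and the injectable `clock` exists only so the behavior can be checked deterministically.

```python
import time
from typing import Callable


def timed_task(
    start_environment: Callable[[], None],
    run_agent: Callable[[], None],
    run_verifier: Callable[[], None],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Run all three phases but return only the agent phase duration in seconds."""
    start_environment()                # excluded: environment startup
    started = clock()
    run_agent()                        # included: the agent working on the task
    agent_seconds = clock() - started
    run_verifier()                     # excluded: verifier or judge time
    return agent_seconds


if __name__ == "__main__":
    seconds = timed_task(
        start_environment=lambda: time.sleep(0.05),
        run_agent=lambda: time.sleep(0.01),
        run_verifier=lambda: time.sleep(0.05),
    )
    print(f"agent wall time: {seconds:.3f}s")
```

Because the clock is read immediately before and after `run_agent`, slow startup or slow verification does not change the reported value.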

Artificial Analysis Coding Agent Index vs. Execution Time

Artificial Analysis Coding Agent Index vs. average agent wall time per task

How to Read This Chart

Each point represents a coding-agent variant. Farther right means higher benchmark performance, while lower on the chart means shorter average agent runtime per task. Agents toward the lower-right deliver stronger results in less active agent time.
