Codex vs. Cursor CLI

Comparison between Codex and Cursor CLI across the Artificial Analysis Coding Agent Index, including benchmark scores, cost, execution time, and token usage.

For details, see our methodology page.

Highlights

Artificial Analysis Coding Agent Index · Higher is better
Mean agent wall time per task · Lower is better
Mean API cost per task (USD) · Lower is better

Comparison

Side-by-side comparison of Codex and Cursor CLI.

Coding Agent Comparison

| Metric | Codex, GPT-5.5 (xhigh) | Cursor CLI, GPT-5.5 (medium) | Analysis |
| --- | --- | --- | --- |
| Agent Harness | Codex | Cursor CLI | |
| Representative Model | GPT-5.5 (xhigh) | GPT-5.5 (medium) | |
| Coding Agent Index | 76 | 62 | Codex has a higher Coding Agent Index than Cursor CLI |
| DeepSWE | 64% | 37% | Codex has a higher DeepSWE score than Cursor CLI |
| Terminal-Bench v2 | 84% | 73% | Codex has a higher Terminal-Bench v2 score than Cursor CLI |
| SWE-Atlas-QnA | 81% | 75% | Codex has a higher SWE-Atlas-QnA score than Cursor CLI |
| Cost per Task | $5.07 | $2.01 | Cursor CLI has a lower cost per task than Codex |
| Time per Task | 10.1m | 6.6m | Cursor CLI has a lower time per task than Codex |
| Turns per Task | 106.3 | 78 | Cursor CLI takes fewer turns per task than Codex |
| Token Usage per Task | 12.3M | 4M | Cursor CLI has lower token usage per task than Codex |
| Cache Hit Rate | 94% | 89% | Codex has a higher cache hit rate than Cursor CLI |

Model Variants

Evaluated model variants for Codex and Cursor CLI.

Model Variants

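| Agent | Model | Coding Agent Index | DeepSWE | Terminal-Bench v2 | SWE-Atlas-QnA | Cost per Task | Time per Task | Token Usage per Task |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Codex | GPT-5.5 (xhigh) | 76 | 64% | 84% | 81% | $5.07 | 10.1m | 12.3M |
| Codex | GPT-5.5 (medium) | 71 | 57% | 76% | 79% | $2.75 | 6.4m | 7M |
| Codex | GPT-5.4 (medium) | 71 | - | 70% | 72% | $2.27 | 7.1m | 5.7M |
| Cursor CLI | GPT-5.5 (medium) | 62 | 37% | 73% | 75% | $2.01 | 6.6m | 4M |
| Cursor CLI | Opus 4.7 (medium) | 60 | 32% | 71% | 78% | $2.68 | 13.6m | 5.7M |
| Cursor CLI | Composer 2.5 | 52 | 16% | 67% | 72% | $0.08 | 9.7m | 3.6M |
| Cursor CLI | Composer 2.5 Fast | 52 | 16% | 67% | 72% | $0.55 | 6.8m | 4.3M |
| Cursor CLI | GPT-5.4 (medium) | 69 | - | 65% | 73% | $1.52 | 8.3m | 3.8M |
| Cursor CLI | Composer 2 | 67 | - | 64% | 69% | $0.04 | 8.6m | 2.9M |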

Performance

Performance across the Artificial Analysis Coding Agent Index.

Artificial Analysis Coding Agent Index

Composite average pass@1 across DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA · Higher is better

The Artificial Analysis Coding Agent Index is a composite score built from DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA.

It is useful for quick comparison, but it should be read alongside the per-eval breakdowns. Two agents with similar index values can still have different strengths across repository tasks, terminal workflows, and rubric-based evaluations.
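As a rough illustration of how a composite of this kind can be recomputed, the sketch below takes an unweighted mean of the per-eval pass@1 scores. The unweighted mean, and the choice to average only the evals that have a score when one is missing, are assumptions that happen to agree with the rounded table values; they are not a statement of the official formula. The function name `coding_agent_index` is hypothetical.

```python
from statistics import mean
from typing import Optional


def coding_agent_index(scores: dict[str, Optional[float]]) -> float:
    """Unweighted mean of the available per-eval pass@1 scores, in percent.

    Assumption: evals without a score (None) are skipped rather than
    counted as zero.
    """
    available = [s for s in scores.values() if s is not None]
    if not available:
        raise ValueError("at least one eval score is required")
    return mean(available)


# Rounded percentages copied from the Model Variants table.
codex_gpt55_xhigh = {"DeepSWE": 64, "Terminal-Bench v2": 84, "SWE-Atlas-QnA": 81}
cursor_gpt55_medium = {"DeepSWE": 37, "Terminal-Bench v2": 73, "SWE-Atlas-QnA": 75}
codex_gpt54_medium = {"DeepSWE": None, "Terminal-Bench v2": 70, "SWE-Atlas-QnA": 72}

for name, scores in [
    ("Codex, GPT-5.5 (xhigh)", codex_gpt55_xhigh),
    ("Cursor CLI, GPT-5.5 (medium)", cursor_gpt55_medium),
    ("Codex, GPT-5.4 (medium)", codex_gpt54_medium),
]:
    print(f"{name}: {coding_agent_index(scores):.1f}")
```

Because the inputs are already rounded, the recomputed values (about 76.3, 61.7, and 71.0) only approximate the published 76, 62, and 71.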

Token Usage

Token consumption across the Artificial Analysis Coding Agent Index.

Token Usage per Task

Mean input, cache, and output tokens per task
Prompt cache hit rates can vary significantly by provider routing, which can materially change effective cost.

Input: non-cached input tokens sent to the model, including prompts, instructions, tool context, and task context that were not served from prompt cache.

Cache: prompt tokens that were reused and billed through provider prompt caching, when that telemetry is available, rather than processed as fresh input each time.

Some providers route repeated requests across different backend replicas. When prompt cache state is not shared consistently across those replicas, a model may receive fewer cache hits even when the benchmark task flow is otherwise identical.

We do not add custom relay headers or provider-specific affinity controls to force higher cache reuse, because that would make the benchmark less representative of a typical user setup. As a result, reported costs reflect the cache behavior observed through the configured provider path, not an optimized best-case cache scenario.

Output: tokens returned by the model in its visible response during the task.
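To make the three token categories concrete, here is a minimal sketch of a per-task token record. The class name `TokenUsage`, its field names, and the token counts are hypothetical, and defining cache hit rate as cached prompt tokens divided by all prompt tokens is an assumption about how such a rate could be computed, not a confirmed definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Per-task token counts, split into the three categories above."""

    uncached_input: int  # prompt tokens not served from prompt cache
    cached_input: int    # prompt tokens served from prompt cache
    output: int          # tokens in the model's visible response

    @property
    def total(self) -> int:
        return self.uncached_input + self.cached_input + self.output

    @property
    def cache_hit_rate(self) -> float:
        # Assumed definition: share of prompt tokens served from cache.
        prompt_tokens = self.uncached_input + self.cached_input
        return self.cached_input / prompt_tokens if prompt_tokens else 0.0


# Hypothetical counts chosen for illustration only.
usage = TokenUsage(uncached_input=600_000, cached_input=9_400_000, output=100_000)
print(f"total tokens: {usage.total:,}")
print(f"cache hit rate: {usage.cache_hit_rate:.0%}")
```

Under this definition, a long agent session that resends a large, mostly unchanged context on every turn would show a high hit rate, which is why routing that breaks cache reuse can change effective cost so much.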

Artificial Analysis Coding Agent Index vs. Total Tokens

Artificial Analysis Coding Agent Index vs. mean total tokens per task

Each point represents a coding-agent variant. Farther right means higher benchmark performance, while lower on the chart means lower mean token usage per task. Agents toward the lower-right use fewer tokens for a given level of performance.

Cost

Pay-per-token API cost across the Artificial Analysis Coding Agent Index, based on current per-token pricing.

Cost per Task

Mean pay-per-token API cost per task (USD) · Lower is better

This chart shows the mean pay-per-token API cost per task across the Artificial Analysis Coding Agent Index, spanning DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA.

Where applicable, that cost model includes standard input pricing, discounted cached-input pricing, separate cache-write charges, and output pricing rather than treating all prompt tokens as if they were billed at the same uncached input rate.

It is intended to show pay-per-token API cost, not consumer plan pricing or the full operational cost of deploying the system in production. Infrastructure, engineering, and supervision costs are not the focus of this metric.
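The cost model described above can be sketched as a sum over four separately priced token categories. Everything specific here is an assumption for illustration: the `Pricing` class, the `task_cost` function, the per-million-token prices, the token counts, and the treatment of cache-write tokens as a category separate from uncached input (providers differ on this).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens. Values used below are hypothetical."""

    input: float
    cached_input: float
    cache_write: float
    output: float


def task_cost(
    pricing: Pricing,
    *,
    uncached_input: int,
    cached_input: int,
    cache_write: int,
    output: int,
) -> float:
    """Pay-per-token cost of one task, pricing each category separately."""
    total = (
        uncached_input * pricing.input
        + cached_input * pricing.cached_input
        + cache_write * pricing.cache_write
        + output * pricing.output
    )
    return total / 1_000_000


# Hypothetical prices and token counts, not taken from any provider.
pricing = Pricing(input=2.00, cached_input=0.20, cache_write=2.50, output=10.00)
tokens = dict(uncached_input=500_000, cached_input=4_000_000,
              cache_write=300_000, output=50_000)

cost = task_cost(pricing, **tokens)

# Naive alternative: bill every prompt token at the uncached input rate.
prompt_tokens = tokens["uncached_input"] + tokens["cached_input"] + tokens["cache_write"]
naive = (prompt_tokens * pricing.input + tokens["output"] * pricing.output) / 1_000_000

print(f"cache-aware cost: ${cost:.2f}")
print(f"all-uncached cost: ${naive:.2f}")
```

With these made-up numbers the cache-aware cost is well under the all-uncached figure, which illustrates why treating every prompt token as uncached input would overstate cost for agents with high cache reuse.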

Artificial Analysis Coding Agent Index vs. Cost per Task

Artificial Analysis Coding Agent Index vs. mean pay-per-token API cost per task (USD)

Each point represents a coding-agent variant. Farther right means higher benchmark performance, while lower on the chart means lower mean cost per task. The most efficient agents sit toward the lower-right: stronger results at lower cost.
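One way to make the "lower-right" idea concrete is a simple Pareto filter: keep a variant only if no other variant has an index at least as high and a cost at least as low, with one of the two strictly better. This is an illustrative framing rather than part of the benchmark methodology; the names `Variant`, `dominates`, and `pareto_frontier` are hypothetical, and the data is copied from the rounded Model Variants table.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    agent: str
    model: str
    index: float     # Coding Agent Index, higher is better
    cost_usd: float  # mean cost per task, lower is better


VARIANTS = [
    Variant("Codex", "GPT-5.5 (xhigh)", 76, 5.07),
    Variant("Codex", "GPT-5.5 (medium)", 71, 2.75),
    Variant("Codex", "GPT-5.4 (medium)", 71, 2.27),
    Variant("Cursor CLI", "GPT-5.5 (medium)", 62, 2.01),
    Variant("Cursor CLI", "Opus 4.7 (medium)", 60, 2.68),
    Variant("Cursor CLI", "Composer 2.5", 52, 0.08),
    Variant("Cursor CLI", "Composer 2.5 Fast", 52, 0.55),
    Variant("Cursor CLI", "GPT-5.4 (medium)", 69, 1.52),
    Variant("Cursor CLI", "Composer 2", 67, 0.04),
]


def dominates(a: Variant, b: Variant) -> bool:
    """True if a is at least as good as b on both axes and better on one."""
    no_worse = a.index >= b.index and a.cost_usd <= b.cost_usd
    strictly_better = a.index > b.index or a.cost_usd < b.cost_usd
    return no_worse and strictly_better


def pareto_frontier(variants: list[Variant]) -> list[Variant]:
    """Variants that no other variant dominates on index and cost."""
    return [v for v in variants
            if not any(dominates(other, v) for other in variants)]


for v in pareto_frontier(VARIANTS):
    print(f"{v.agent}, {v.model}: index {v.index}, ${v.cost_usd:.2f} per task")
```

On the rounded table values, this keeps Codex with GPT-5.5 (xhigh) and GPT-5.4 (medium), and Cursor CLI with GPT-5.4 (medium) and Composer 2. Small differences hidden by rounding could change which variants survive the filter.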

Execution Time

Active agent runtime across the Artificial Analysis Coding Agent Index.

Time per Task

Mean agent wall time per task · Lower is better

This chart uses agent wall time: how long the agent process was actively running on each task.

It does not include environment startup, verifier or judge time, or other harness overhead, so it is a cleaner comparison of how long the agent itself was working.

Artificial Analysis Coding Agent Index vs. Execution Time

Artificial Analysis Coding Agent Index vs. mean agent wall time per task

Each point represents a coding-agent variant. Farther right means higher benchmark performance, while lower on the chart means shorter mean agent runtime per task. Agents toward the lower-right deliver stronger results in less active agent time.