Claude Code vs. Gemini CLI

Name: Coding Agent Index
Creator: Artificial Analysis
License: https://artificialanalysis.ai/docs/legal/Terms-of-Use.pdf

Comparison between Claude Code and Gemini CLI across the Artificial Analysis Coding Agent Index, including benchmark scores, cost, execution time, and token usage.

For details, see our methodology page.

Highlights

Coding Agent Index

Artificial Analysis Coding Agent Index v1.3 · Higher is better

Time per Task

Average agent wall time per task · Lower is better

Cost per Task

Average API cost per task (USD) · Lower is better

Comparison

Side-by-side comparison of Claude Code and Gemini CLI.

Coding Agent Comparison

Metric	Claude Code Fable 5 (max) (with fallback)	Gemini CLI Gemini 3.1 Pro (high)	Analysis
Agent Harness	Claude Code	Gemini CLI
Representative Model	Fable 5 (max) (with fallback)	Gemini 3.1 Pro (high)
Coding Agent Index	66	30	Claude Code has a higher Coding Agent Index than Gemini CLI
DeepSWE	66%	14%	Claude Code has a higher DeepSWE score than Gemini CLI
Terminal-Bench v2	83%	68%	Claude Code has a higher Terminal-Bench v2 score than Gemini CLI
SWE-Atlas-QnA	49%	9%	Claude Code has a higher SWE-Atlas-QnA score than Gemini CLI
Cost per Task	$11.71	$2.00	Gemini CLI has a lower cost per task than Claude Code
Time per Task	23.4m	10.8m	Gemini CLI has a lower time per task than Claude Code
Turns per Task	137.8	30.7	Gemini CLI has a lower turns per task than Claude Code
Token Usage per Task	14M	4.7M	Gemini CLI has a lower token usage per task than Claude Code
Cache Hit Rate	96%	87%	Claude Code has a higher cache hit rate than Gemini CLI
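
Because the two agents differ sharply in turns per task, it can help to normalize the per-task averages by turn count. The sketch below does that with the figures from the table above. It divides one average by another, which is only an approximation of a true per-turn average, and the token figures are rounded in the source, so treat the results as rough.

```python
# Per-task averages copied from the comparison table above.
agents = {
    "Claude Code": {"cost_usd": 11.71, "minutes": 23.4, "turns": 137.8, "tokens": 14e6},
    "Gemini CLI": {"cost_usd": 2.00, "minutes": 10.8, "turns": 30.7, "tokens": 4.7e6},
}


def per_turn(row):
    """Divide each per-task average by the average number of turns per task."""
    turns = row["turns"]
    return {
        "seconds_per_turn": row["minutes"] * 60 / turns,
        "usd_per_turn": row["cost_usd"] / turns,
        "tokens_per_turn": row["tokens"] / turns,
    }


for name, row in agents.items():
    r = per_turn(row)
    print(
        f"{name}: {r['seconds_per_turn']:.1f} s, ${r['usd_per_turn']:.3f}, "
        f"{r['tokens_per_turn']:,.0f} tokens per turn"
    )
```

On these figures, Claude Code works out to roughly 10 seconds and $0.085 per turn, against roughly 21 seconds and $0.065 per turn for Gemini CLI.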

Model Variants

Evaluated model variants for Claude Code and Gemini CLI.

Model Variants


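Agent Harness	Model	Coding Agent Index	DeepSWE	Terminal-Bench v2	SWE-Atlas-QnA	Cost per Task	Time per Task	Token Usage per Task
Claude Code	Fable 5 (max) (with fallback)	66	66%	83%	49%	$11.71	23.4m	14M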
Claude Code	Opus 4.8 (max)	61	56%	79%	47%	$7.70	23.1m	18M
Claude Code	Opus 4.8 (medium)	54	49%	75%	36%	$3.26	12.4m	7.8M
Claude Code	Opus 4.7 (max)	50	40%	74%	37%	$5.63	15.7m	16M
Claude Code	GLM-5.2	43	29%	72%	29%	$6.51	25.1m	6.5M
Claude Code	Opus 4.7 (medium)	40	27%	71%	23%	$1.68	6.3m	4.6M
Claude Code	Sonnet 4.6 (medium)	38	29%	64%	20%	$2.01	13.5m	8.5M
Claude Code	GLM-5.1	36	19%	65%	25%	$4.33	19.6m	25.9M
Claude Code	Qwen3.7 Plus (thinking)	36	19%	65%	24%	$6.23	10.6m	8.7M
Claude Code	Kimi K2.6	33	17%	65%	16%	$1.19	41.0m	11.5M
Claude Code	DeepSeek V4 Pro (high)	31	9%	66%	20%	$0.27	17.9m	9.8M
Claude Code	Opus 4.6 (medium)	46	-	71%	22%	$1.28	8.0m	4.5M
Gemini CLI	Gemini 3.1 Pro (high)	30	14%	68%	9%	$2.00	10.8m	4.7M

Performance

Performance across the Artificial Analysis Coding Agent Index.

Artificial Analysis Coding Agent Index

Artificial Analysis Coding Agent Index v1.3 incorporates 3 benchmarks: DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA · Higher is better

The Artificial Analysis Coding Agent Index is a composite score built from DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA.

It is useful for quick comparison, but it should be read alongside the per-eval breakdowns. Two agents with similar index values can still have different strengths across repository tasks, terminal workflows, and rubric-based evaluations.
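
The exact weighting behind the composite is not stated here. As a rough sanity check, the sketch below assumes an equal-weight mean of the three benchmark percentages and compares it with the reported index for a few rows of the Model Variants table. The function name `unweighted_index` and the equal weighting are assumptions for illustration, not the published formula.

```python
from statistics import mean

# (reported index, DeepSWE %, Terminal-Bench v2 %, SWE-Atlas-QnA %)
# copied from the Model Variants table.
rows = {
    "Fable 5 (max) (with fallback)": (66, 66, 83, 49),
    "Opus 4.8 (max)": (61, 56, 79, 47),
    "Opus 4.8 (medium)": (54, 49, 75, 36),
    "Opus 4.7 (max)": (50, 40, 74, 37),
    "DeepSeek V4 Pro (high)": (31, 9, 66, 20),
    "Gemini 3.1 Pro (high)": (30, 14, 68, 9),
}


def unweighted_index(scores):
    """ASSUMPTION: equal-weight mean of the three benchmark percentages."""
    return mean(scores)


for name, (reported, *scores) in rows.items():
    estimate = unweighted_index(scores)
    print(f"{name}: reported {reported}, unweighted mean {estimate:.1f}")
```

For these rows the unweighted mean lands within one point of the reported index. The small gaps may come from rounding in the published percentages or from a weighting that differs from the one assumed here.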

Token Usage

Token consumption across the Artificial Analysis Coding Agent Index.

Token Usage per Task

Average input, cache, and output tokens per task

Prompt cache hit rates can vary significantly by provider routing, which can materially change effective cost.

Input tokens: non-cached input tokens sent to the model, including prompts, instructions, tool context, and task context that were not served from prompt cache.

Cached input tokens: reused prompt tokens billed through provider prompt caching when that telemetry is available, rather than being processed as fully fresh input each time.

Some providers route repeated requests across different backend replicas. When prompt cache state is not shared consistently across those replicas, a model may receive fewer cache hits even when the benchmark task flow is otherwise identical.

We do not add custom relay headers or provider-specific affinity controls to force higher cache reuse, because that would make the benchmark less representative of a typical user setup. As a result, reported costs reflect the cache behavior observed through the configured provider path, not an optimized best-case cache scenario.

Output tokens: tokens returned by the model in its visible response during the task.
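
The three token categories can be combined into a total and a cache hit rate. The sketch below is illustrative only: the `TokenUsage` class and the token counts are hypothetical, and the hit-rate formula (cached input divided by all input) is an assumption, since the exact definition is not given here.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    input_tokens: int         # non-cached input tokens
    cached_input_tokens: int  # input tokens served from prompt cache
    output_tokens: int        # tokens in the model's visible response

    @property
    def total(self) -> int:
        return self.input_tokens + self.cached_input_tokens + self.output_tokens

    @property
    def cache_hit_rate(self) -> float:
        # ASSUMPTION: hit rate = cached input / (cached + non-cached input).
        prompt = self.input_tokens + self.cached_input_tokens
        return self.cached_input_tokens / prompt if prompt else 0.0


# Hypothetical per-task counts, not measured values.
usage = TokenUsage(
    input_tokens=400_000,
    cached_input_tokens=9_500_000,
    output_tokens=100_000,
)
print(f"total={usage.total:,} cache_hit_rate={usage.cache_hit_rate:.1%}")
```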

Artificial Analysis Coding Agent Index vs. Total Tokens

Artificial Analysis Coding Agent Index vs. average total tokens per task

Each point represents a coding-agent variant. Higher on the chart means higher benchmark performance, while farther left means lower token usage. Agents toward the upper-left use fewer tokens for a given level of performance.

Cost

Pay-per-token API cost across the Artificial Analysis Coding Agent Index, based on current per-token pricing.

Cost per Task

Average pay-per-token API cost per task (USD) · Lower is better

This chart shows the average pay-per-token API cost per task across the Artificial Analysis Coding Agent Index, spanning DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA.

Where applicable, that cost model includes standard input pricing, discounted cached-input pricing, separate cache-write charges, and output pricing rather than treating all prompt tokens as if they were billed at the same uncached input rate.

It is intended to show pay-per-token API cost, not consumer plan pricing or the full operational cost of deploying the system in production. Infrastructure, engineering, and supervision costs are not the focus of this metric.
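
To illustrate why separating the four price components matters, the sketch below compares a cache-aware cost with a naive cost that bills every prompt token at the uncached input rate. All prices and token counts are hypothetical, and treating cache-write tokens as their own bucket is an assumption; providers differ in how they report them.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """USD per million tokens. All values used below are hypothetical."""
    input: float
    cached_input: float
    cache_write: float
    output: float


def task_cost(pricing, input_tokens, cached_tokens, cache_write_tokens, output_tokens):
    """Sum each token bucket at its own rate and convert to USD."""
    per_million = 1_000_000
    return (
        input_tokens * pricing.input
        + cached_tokens * pricing.cached_input
        + cache_write_tokens * pricing.cache_write
        + output_tokens * pricing.output
    ) / per_million


pricing = Pricing(input=5.00, cached_input=0.50, cache_write=6.25, output=25.00)

# Cache-aware: each bucket is billed at its own rate.
blended = task_cost(pricing, 200_000, 9_000_000, 300_000, 100_000)

# Naive: the same 9.5M prompt tokens, all billed as uncached input.
naive = task_cost(pricing, 9_500_000, 0, 0, 100_000)

print(f"cache-aware: ${blended:.2f}, naive: ${naive:.2f}")
```

With these made-up numbers the cache-aware cost is $9.88 (9.875 before rounding) against $50.00 for the naive calculation, which shows how strongly the cache hit rate can move the reported cost per task.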

Artificial Analysis Coding Agent Index vs. Cost per Task

Artificial Analysis Coding Agent Index vs. average pay-per-token API cost per task (USD)

Each point represents a coding-agent variant. Farther right means higher benchmark performance, while lower on the chart means lower average cost per task. The most efficient agents sit toward the lower-right: stronger results at lower cost.
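
One way to make "stronger results at lower cost" concrete is a Pareto frontier: the variants for which no other variant is both at least as strong and at least as cheap. The sketch below applies that idea to the index and cost columns of the Model Variants table. The `pareto_frontier` helper is written for illustration and is not part of the published methodology.

```python
# (model, Coding Agent Index, cost per task in USD) from the Model Variants table.
variants = [
    ("Fable 5 (max) (with fallback)", 66, 11.71),
    ("Opus 4.8 (max)", 61, 7.70),
    ("Opus 4.8 (medium)", 54, 3.26),
    ("Opus 4.7 (max)", 50, 5.63),
    ("GLM-5.2", 43, 6.51),
    ("Opus 4.7 (medium)", 40, 1.68),
    ("Sonnet 4.6 (medium)", 38, 2.01),
    ("GLM-5.1", 36, 4.33),
    ("Qwen3.7 Plus (thinking)", 36, 6.23),
    ("Kimi K2.6", 33, 1.19),
    ("DeepSeek V4 Pro (high)", 31, 0.27),
    ("Opus 4.6 (medium)", 46, 1.28),
    ("Gemini 3.1 Pro (high)", 30, 2.00),
]


def pareto_frontier(points):
    """Return names of variants not dominated on (higher index, lower cost)."""
    frontier = []
    for name, index, cost in points:
        dominated = any(
            o_index >= index and o_cost <= cost and (o_index > index or o_cost < cost)
            for _, o_index, o_cost in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


print(pareto_frontier(variants))
```

On the table's figures the frontier is Fable 5 (max) (with fallback), Opus 4.8 (max), Opus 4.8 (medium), Opus 4.6 (medium), Kimi K2.6, and DeepSeek V4 Pro (high). Note that the Opus 4.6 (medium) row has no DeepSWE score in the table, so its index may not be directly comparable to the others.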

Execution Time

Active agent runtime across the Artificial Analysis Coding Agent Index.

Time per Task

Average agent wall time per task · Lower is better

This chart uses agent wall time: how long the agent process was actively running on each task.

It does not include environment startup, verifier or judge time, or other harness overhead, so it is a cleaner comparison of how long the agent itself was working.

Artificial Analysis Coding Agent Index vs. Execution Time

Artificial Analysis Coding Agent Index vs. average agent wall time per task

Each point represents a coding-agent variant. Farther right means higher benchmark performance, while lower on the chart means shorter average agent runtime per task. Agents toward the lower-right deliver stronger results in less active agent time.
