Coding Index

Evaluates models' ability to solve programming problems, including those requiring scientific and research domain knowledge.

The headline score is the average of the benchmarks listed below. Each row links to its result chart further down the page when one is available, or out to the underlying benchmark.

  • Terminal-Bench Hard

    An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks.

  • SciCode

    A scientist-curated coding benchmark featuring 288 test set subproblems from 80 laboratory problems across 16 scientific disciplines.

Results

Release Date

Coding Index vs. Release Date (chart; one quadrant is labeled "Most attractive quadrant")

Token Usage

Coding Index: Output Token Composition

Tokens used to run the evaluation: the total number of tokens, including input tokens (the prompt), reasoning tokens (for reasoning models), and answer tokens (the final response).

Cost

Coding Index: Cost Breakdown

Cost (USD) to run the evaluation, calculated from the model's input and output token pricing and the number of tokens used.
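As a rough illustration of the cost calculation described above, here is a minimal Python sketch. It rests on several assumptions not stated on this page: prices are quoted in USD per million tokens, reasoning tokens are billed at the output rate, and the names TokenUsage and evaluation_cost_usd are hypothetical. The token counts and prices are made-up example values, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Token counts for one evaluation run (hypothetical structure)."""
    input_tokens: int      # prompt tokens
    reasoning_tokens: int  # zero for non-reasoning models
    answer_tokens: int     # final-response tokens

    @property
    def output_tokens(self) -> int:
        # Assumption: reasoning tokens count as output tokens for billing.
        return self.reasoning_tokens + self.answer_tokens

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


def evaluation_cost_usd(
    usage: TokenUsage,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate run cost from token counts and per-million-token prices."""
    return (
        usage.input_tokens * input_price_per_million
        + usage.output_tokens * output_price_per_million
    ) / 1_000_000


# Made-up example values, for illustration only.
usage = TokenUsage(input_tokens=2_000_000, reasoning_tokens=500_000, answer_tokens=500_000)
cost = evaluation_cost_usd(usage, input_price_per_million=1.0, output_price_per_million=4.0)
print(f"{usage.total_tokens} tokens, ${cost:.2f}")
```

If a provider bills reasoning tokens differently, only the output_tokens property would need to change.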

Frequently Asked Questions

The Coding Index is a composite benchmark from Artificial Analysis that evaluates models' ability to solve programming problems, including those requiring scientific and research domain knowledge.

The Coding Index is calculated as the average of its underlying benchmark scores, normalised to a 0–100 scale.
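The averaging step can be sketched in a few lines of Python. This assumes an unweighted mean over benchmark scores that are already on a 0–100 scale; the function name coding_index and the example scores are hypothetical and are not published results.

```python
from statistics import mean


def coding_index(scores: dict[str, float]) -> float:
    """Unweighted average of benchmark scores on a 0-100 scale."""
    if not scores:
        raise ValueError("at least one benchmark score is required")
    for name, value in scores.items():
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} score {value} is outside the 0-100 range")
    return mean(scores.values())


# Made-up scores for illustration only.
example_scores = {"Terminal-Bench Hard": 40.0, "SciCode": 50.0}
print(coding_index(example_scores))
```

Because the mean is unweighted under this assumption, a model that is strong on one benchmark and weak on the other can land on the same index value as a model that is middling on both.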

The Coding Index includes Terminal-Bench Hard and SciCode.

Among models with published results, GPT-5.5 (xhigh) currently has the highest Coding Index score, at 59.

A higher Coding Index score indicates stronger overall performance across the benchmarks that make up the index. For a specific use case, individual benchmark results may be more informative than the composite score.