Coding Index

Evaluates models' ability to solve programming problems, including those requiring scientific and research domain knowledge.

The Coding Index currently includes the following benchmarks:

Terminal-Bench Hard
An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks.
SciCode
A scientist-curated coding benchmark featuring 338 sub-tasks derived from 80 genuine laboratory problems across 16 scientific disciplines.

Coding Index

Independently benchmarked by Artificial Analysis

Coding Index vs. Release Date

Chart annotation: Most attractive region

Model creators shown: Alibaba, Amazon, Anthropic, DeepSeek, Google, Kimi, Korea Telecom, LG AI Research, MBZUAI Institute of Foundation Models

Coding Index: Output Token Composition

Tokens used to run the evaluation

Reasoning tokens

Answer tokens

The total number of tokens used to run the evaluation, including input tokens (prompt), reasoning tokens (for reasoning models), and answer tokens (final response).
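
As a rough illustration of the token accounting described above, the following Python sketch sums the three token categories and reports how the output splits between reasoning and answer tokens. The class name, field names, and numbers are hypothetical and are not taken from Artificial Analysis data.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Hypothetical container for the token counts of one evaluation run."""
    input_tokens: int      # prompt tokens
    reasoning_tokens: int  # zero for non-reasoning models
    answer_tokens: int     # final response tokens

    @property
    def output_tokens(self) -> int:
        # Output is everything the model generated: reasoning plus answer.
        return self.reasoning_tokens + self.answer_tokens

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    def reasoning_share(self) -> float:
        """Fraction of output tokens spent on reasoning (0.0 if no output)."""
        if self.output_tokens == 0:
            return 0.0
        return self.reasoning_tokens / self.output_tokens


# Made-up counts, for illustration only.
usage = TokenUsage(input_tokens=2_000_000,
                   reasoning_tokens=6_000_000,
                   answer_tokens=2_000_000)
print(usage.total_tokens)       # 10000000
print(usage.reasoning_share())  # 0.75
```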

Coding Index: Cost Breakdown

Cost (USD) to run the evaluation

Input cost

Reasoning cost

Answer cost

The cost to run the evaluation, calculated using the model's input and output token pricing and the number of tokens used.
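
The calculation described above can be sketched in a few lines of Python. This is a minimal sketch under two assumptions that the page does not state: prices are quoted in USD per million tokens, and reasoning tokens are billed at the output-token price. The function name, parameter names, and all figures are hypothetical.

```python
def evaluation_cost(
    input_tokens: int,
    reasoning_tokens: int,
    answer_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> dict[str, float]:
    """Split the evaluation cost (USD) into input, reasoning, and answer parts.

    Assumes prices are USD per one million tokens and that reasoning
    tokens are charged at the output price.
    """
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    reasoning_cost = reasoning_tokens / 1_000_000 * output_price_per_m
    answer_cost = answer_tokens / 1_000_000 * output_price_per_m
    return {
        "input": input_cost,
        "reasoning": reasoning_cost,
        "answer": answer_cost,
        "total": input_cost + reasoning_cost + answer_cost,
    }


# Made-up token counts and prices, for illustration only.
costs = evaluation_cost(
    input_tokens=2_000_000,
    reasoning_tokens=6_000_000,
    answer_tokens=2_000_000,
    input_price_per_m=3.0,
    output_price_per_m=15.0,
)
print(costs["total"])  # 126.0
```

With these made-up figures, reasoning tokens account for 90 of the 126 USD total, so a model's reasoning verbosity can dominate the cost of running the evaluation.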
