Artificial Analysis Engineering Index

Measures performance on capabilities that matter most for engineering work, including engineering knowledge, quantitative reasoning, agentic knowledge work, and terminal use. Weights reflect how often each capability appears across common engineering tasks.

The Artificial Analysis Engineering Index combines performance across benchmarks chosen for engineering work, spanning engineering knowledge, reasoning, agentic execution, and terminal use.

Because the score is a composite, a model cannot rank highly through narrow specialization in one capability. It also provides a single number for tracking model performance across engineering tasks.

Each capability sub-score is normalized to a 0-100 scale, then combined using the weights below. All underlying benchmarks are run independently by Artificial Analysis. See our Intelligence Benchmarking Methodology for how evaluations are conducted.

Capability | Weight | Evaluations
Engineering Knowledge | 35% | AA-Omniscience Science, Engineering & Mathematics Accuracy
Reasoning | 35% | HLE, GPQA Diamond, CritPt
Agentic Knowledge Work | 25% | GDPval-AA v2
Agentic Terminal Use | 5% | Terminal-Bench v2.1
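
As a rough illustration of the weighted combination described above, the following Python sketch combines four already-normalized sub-scores using the published weights. The function name `engineering_index` and the example sub-score values are hypothetical. The page does not specify how each sub-score is normalized or how the three Reasoning benchmarks are aggregated, so the sketch takes normalized sub-scores as given.

```python
# Weights as listed in the capability table (expressed as fractions of 1.0).
WEIGHTS = {
    "Engineering Knowledge": 0.35,
    "Reasoning": 0.35,
    "Agentic Knowledge Work": 0.25,
    "Agentic Terminal Use": 0.05,
}


def engineering_index(sub_scores):
    """Return the weighted average of capability sub-scores (each 0-100)."""
    if set(sub_scores) != set(WEIGHTS):
        raise ValueError("sub_scores must cover exactly the four capabilities")
    for name, score in sub_scores.items():
        if not 0 <= score <= 100:
            raise ValueError(f"{name} must be within 0-100, got {score}")
    # The weights sum to 1.0, so no further division is needed.
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


# Hypothetical sub-scores for illustration only (not real results).
example = {
    "Engineering Knowledge": 60.0,
    "Reasoning": 40.0,
    "Agentic Knowledge Work": 50.0,
    "Agentic Terminal Use": 80.0,
}
print(round(engineering_index(example), 2))
```

Because Agentic Terminal Use carries only 5% of the weight, even a large change in that sub-score moves the index by a few points at most, while Engineering Knowledge and Reasoning together account for 70%.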

Score

Artificial Analysis Engineering Index

Weighted across capabilities relevant to engineering work · Higher is better
Reasoning models are indicated by a lightbulb icon

Engineering Index: Capability Breakdown

Each capability area on a 0–100 scale after normalization · Higher is better · Incorporates 6 evaluations: AA-Omniscience, Humanity's Last Exam, GPQA Diamond, CritPt, GDPval-AA v2, Terminal-Bench v2.1

Capability Breakdown

Engineering Index: Engineering Knowledge

Models ranked by engineering knowledge (normalized 0-100) · Higher is better · Incorporates 1 evaluation: AA-Omniscience

Representative Workflows

Real-world workflows that exercise the capabilities the Engineering Index weights most heavily.

Release Date

Engineering Index vs. Release Date

Cost

Engineering Index: Cost per Task

Average cost per task (USD), broken down by input, cache hit, cache write, reasoning, and answer tokens

Average cost per task in the evaluation. Costs are split by input, cache hit, cache write, reasoning, and answer token pricing where canonical token counts are available.

Engineering Index: Total Cost

Total cost (USD) to run the evaluation

The cost to run the evaluation, calculated using the model's input and output token pricing and the number of tokens used.
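
A minimal sketch of that calculation, assuming prices are quoted in USD per million tokens and that only input and output tokens are billed (cache and reasoning token pricing are left out for simplicity). The function name and the example figures are hypothetical.

```python
def total_cost_usd(input_tokens, output_tokens,
                   input_price_per_m, output_price_per_m):
    """Total cost in USD, given token counts and per-million-token prices."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m
    return input_cost + output_cost


# Hypothetical run: 20M input tokens at $3/M and 5M output tokens at $15/M.
cost = total_cost_usd(20_000_000, 5_000_000, 3.0, 15.0)
print(f"${cost:.2f}")
```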

Speed

Engineering Index: Time per Task

Weighted average wall clock time (minutes) per task; excludes TTFT and execution time · Lower is better

The weighted average time per evaluation task. It is calculated by dividing output tokens per task by output speed for each benchmark, then weighting each result by that benchmark's relative weight in the index.
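
The calculation can be sketched as follows, assuming output speed is measured in tokens per second (so the result is in seconds) and that benchmark weights are normalized by their sum. The function name, dictionary keys, and example numbers are hypothetical.

```python
def weighted_time_per_task(benchmarks, output_speed_tps):
    """Weighted average seconds per task across benchmarks.

    benchmarks: list of dicts with "weight" and "output_tokens_per_task".
    output_speed_tps: output speed in tokens per second (assumed unit).
    """
    if output_speed_tps <= 0:
        raise ValueError("output_speed_tps must be positive")
    total_weight = sum(b["weight"] for b in benchmarks)
    if total_weight <= 0:
        raise ValueError("benchmark weights must sum to a positive value")
    weighted_seconds = sum(
        b["weight"] * b["output_tokens_per_task"] / output_speed_tps
        for b in benchmarks
    )
    return weighted_seconds / total_weight


# Hypothetical benchmarks for illustration only.
benchmarks = [
    {"weight": 0.75, "output_tokens_per_task": 2000},
    {"weight": 0.25, "output_tokens_per_task": 8000},
]
seconds = weighted_time_per_task(benchmarks, output_speed_tps=100.0)
print(f"{seconds:.1f} s ({seconds / 60:.2f} min)")
```

Since the estimate is derived from token counts and output speed alone, it excludes time to first token and tool or code execution time, consistent with the chart note.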

Output Tokens

Engineering Index: Output Tokens per Task

Output tokens used to run one task, broken down by reasoning and answer tokens

The average number of answer and reasoning tokens produced per benchmark task in this evaluation.

Frequently Asked Questions

The Engineering Index is a composite benchmark from Artificial Analysis that measures performance on capabilities that matter most for engineering work, including engineering knowledge, quantitative reasoning, agentic knowledge work, and terminal use. Weights reflect how often each capability appears across common engineering tasks.

The Engineering Index is calculated as a weighted average of capability sub-scores, each normalized to a 0–100 scale. The sub-scores and their weights are: Engineering Knowledge (35%), Reasoning (35%), Agentic Knowledge Work (25%), and Agentic Terminal Use (5%).

The Engineering Index includes six evaluations: AA-Omniscience (Science, Engineering & Mathematics Accuracy); HLE (Humanity's Last Exam); GPQA Diamond; CritPt; GDPval-AA v2; and Terminal-Bench v2.1.

Among models with published results, Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) currently has the highest Engineering Index score, at 63.

A higher Engineering Index score indicates stronger overall performance across the benchmarks that make up the index. For a specific use case, individual benchmark results may be more informative than the composite score.