Agentic Index

Measures performance in agentic workflows, focusing on behaviors like tool use, planning, autonomy, and complex problem solving.

The headline score is the average of the benchmarks listed below. Each row links to its result chart further down the page when one is available, or out to the underlying benchmark.

  • GDPval-AA

    GDPval-AA is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with Elo ratings derived from blind pairwise comparisons.

  • ๐œยฒ-Bench Telecom

    A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.
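As a rough illustration of how Elo ratings can be derived from pairwise comparisons like those GDPval-AA uses, here is a minimal sketch of the textbook Elo update. The source does not say how Artificial Analysis fits its ratings, so the function names, the K-factor of 32, and the 400-point scale are assumptions taken from the standard Elo formulation, not from the benchmark itself.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one blind pairwise comparison.

    score_a is 1.0 if A's output was preferred, 0.0 if B's was, 0.5 for a tie.
    k (assumed here to be 32) controls how far a single comparison moves a rating.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two hypothetical models start level; model A wins one comparison.
new_a, new_b = update_elo(1000.0, 1000.0, score_a=1.0)
print(new_a, new_b)  # 1016.0 984.0
```

Because the update is zero-sum, the combined rating of the two models is unchanged by a comparison; only their relative standing shifts.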

Results

Release Date

Agentic Index vs. Release Date

Token Usage

Agentic Index: Output Token Composition

Tokens used to run the evaluation

The total number of tokens used to run the evaluation, including input tokens (prompt), reasoning tokens (for reasoning models), and answer tokens (final response).

Cost

Agentic Index: Cost Breakdown

Cost (USD) to run the evaluation

The cost to run the evaluation, calculated using the model's input and output token pricing and the number of tokens used.
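The token and cost descriptions above can be sketched as a small calculation. This is an illustration only: the class and function names are hypothetical, prices are assumed to be quoted in USD per million tokens, and reasoning tokens are assumed to be billed at the output-token price, none of which the page states explicitly.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    input_tokens: int      # prompt tokens
    reasoning_tokens: int  # zero for non-reasoning models
    answer_tokens: int     # final response tokens

    @property
    def output_tokens(self) -> int:
        # Assumption: reasoning tokens count as output tokens for billing.
        return self.reasoning_tokens + self.answer_tokens

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


def evaluation_cost_usd(usage: TokenUsage,
                        input_price_per_million: float,
                        output_price_per_million: float) -> float:
    """Cost in USD given per-million-token prices (assumed pricing unit)."""
    return (usage.input_tokens * input_price_per_million
            + usage.output_tokens * output_price_per_million) / 1_000_000


# Hypothetical usage and prices, not taken from any published result.
usage = TokenUsage(input_tokens=2_000_000, reasoning_tokens=400_000, answer_tokens=100_000)
print(usage.total_tokens)                    # 2500000
print(evaluation_cost_usd(usage, 3.0, 15.0)) # 13.5
```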

Frequently Asked Questions

The Agentic Index is a composite benchmark from Artificial Analysis that measures performance in agentic workflows, focusing on behaviors like tool use, planning, autonomy, and complex problem solving.

The Agentic Index is calculated as the average of its underlying benchmark scores, normalised to a 0–100 scale.
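A minimal sketch of that averaging step, assuming each benchmark score has already been normalised to 0–100 (the page does not describe the normalisation itself, and an Elo-based result such as GDPval-AA would need one). The function name and the example scores are hypothetical.

```python
from statistics import mean


def agentic_index(normalised_scores: dict[str, float]) -> float:
    """Unweighted mean of benchmark scores that are already on a 0-100 scale."""
    if not normalised_scores:
        raise ValueError("at least one benchmark score is required")
    for name, value in normalised_scores.items():
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} is not on the 0-100 scale: {value}")
    return mean(normalised_scores.values())


# Hypothetical normalised scores, not published results.
scores = {"GDPval-AA": 60.0, "tau2-Bench Telecom": 90.0}
print(agentic_index(scores))  # 75.0
```

With only two components, a large gap between them can hide behind a middling average, which is one reason to check the per-benchmark results as well.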

The Agentic Index includes GDPval-AA and 𝜏²-Bench Telecom.

Among models with published results, Claude Opus 4.8 (Adaptive Reasoning, Max Effort) currently has the highest Agentic Index score, at 78.

A higher Agentic Index score indicates stronger overall performance across the benchmarks that make up the index. For a specific use case, individual benchmark results may be more informative than the composite score.