Agentic Index

Measures performance in agentic workflows, focusing on behaviors like tool use, planning, autonomy, and complex problem solving.

The Agentic Index currently includes the following benchmarks:

  • GDPval-AA

    GDPval-AA is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, and Elo ratings are derived from blind pairwise comparisons of their outputs.

  • 𝜏²-Bench Telecom

    A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.
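As a rough illustration of how Elo ratings can be derived from pairwise comparisons like those described for GDPval-AA, here is a minimal sketch of the classic Elo update. It is not Artificial Analysis' actual rating procedure, which is not specified here; the function names, the starting rating of 1000, and the K-factor of 32 are all assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one pairwise comparison result in place.

    `k` (the K-factor) is an assumed value, not one taken from GDPval-AA.
    """
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)  # winner scored 1, expected `expected_win`
    ratings[winner] += delta
    ratings[loser] -= delta  # zero-sum: the loser gives up what the winner gains


# Hypothetical models, both starting from an assumed rating of 1000.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
update_elo(ratings, winner="model_a", loser="model_b")
```

With equal starting ratings the expected score is 0.5, so a single win moves each rating by half the K-factor. A production system would typically process many comparisons and may fit ratings jointly rather than sequentially.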

Independently conducted by Artificial Analysis

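Agentic Index vs. Release Date

Chart: Agentic Index plotted against model release date, with the most attractive region highlighted. Model creators shown: Alibaba, Amazon, Anthropic, DeepSeek, Google, Kimi, KwaiKAT, LG AI Research, MBZUAI Institute of Foundation Models, Meta, MiniMax, Mistral, NVIDIA, OpenAI, xAI, Xiaomi, Z AI.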

Agentic Index: Output Token Composition

Chart: tokens used to run the evaluation, split into reasoning tokens and answer tokens.

The total number of tokens used to run the evaluation, including input tokens (prompt), reasoning tokens (for reasoning models), and answer tokens (final response).

Agentic Index: Cost Breakdown

Chart: cost (USD) to run the evaluation, split into input cost, reasoning cost, and answer cost.

The cost to run the evaluation, calculated using the model's input and output token pricing and the number of tokens used.
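The cost calculation described above can be sketched as follows. This is an illustrative example, not Artificial Analysis' code: the class and function names are hypothetical, prices are assumed to be quoted in USD per million tokens, and reasoning tokens are assumed to be billed at the output token price.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    input_tokens: int      # prompt tokens
    reasoning_tokens: int  # reasoning tokens (zero for non-reasoning models)
    answer_tokens: int     # final response tokens


@dataclass
class Pricing:
    input_usd_per_million: float   # assumed unit: USD per 1M input tokens
    output_usd_per_million: float  # assumed unit: USD per 1M output tokens


def cost_breakdown(usage: TokenUsage, pricing: Pricing) -> dict:
    """Split evaluation cost into input, reasoning, and answer components."""
    input_cost = usage.input_tokens * pricing.input_usd_per_million / 1_000_000
    # Assumption: reasoning tokens are charged at the output token rate.
    reasoning_cost = usage.reasoning_tokens * pricing.output_usd_per_million / 1_000_000
    answer_cost = usage.answer_tokens * pricing.output_usd_per_million / 1_000_000
    return {
        "input": input_cost,
        "reasoning": reasoning_cost,
        "answer": answer_cost,
        "total": input_cost + reasoning_cost + answer_cost,
    }


# Hypothetical usage and prices, chosen only to make the arithmetic easy to follow.
usage = TokenUsage(input_tokens=1_000_000, reasoning_tokens=2_000_000, answer_tokens=500_000)
pricing = Pricing(input_usd_per_million=1.0, output_usd_per_million=4.0)
breakdown = cost_breakdown(usage, pricing)
```

Keeping the three components separate mirrors the input, reasoning, and answer split used in the cost breakdown chart.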