Agentic Index
The Agentic Index currently includes the following benchmarks:
- GDPval-AA
GDPval-AA is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, and Elo ratings are derived from blind pairwise comparisons of their outputs.
- 𝜏²-Bench Telecom
A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.
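To illustrate how Elo ratings can be derived from blind pairwise comparisons, as described for GDPval-AA above, here is a minimal sketch of the classic sequential Elo update. The exact rating procedure used for GDPval-AA is not specified here, so the starting rating of 1000, the K-factor of 32, and all function names are assumptions chosen for illustration only.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rate(comparisons, initial: float = 1000.0, k: float = 32.0) -> dict:
    """Apply sequential Elo updates to a list of (winner, loser) pairs.

    `initial` and `k` are illustrative defaults, not values taken
    from the benchmark.
    """
    ratings: dict = {}
    for winner, loser in comparisons:
        ratings.setdefault(winner, initial)
        ratings.setdefault(loser, initial)
        # The winner gains exactly what the loser gives up.
        delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings


if __name__ == "__main__":
    # Hypothetical model names; each pair is (preferred output, other output).
    results = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
    for name, rating in sorted(rate(results).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

Sequential updates depend on the order of comparisons; a production leaderboard might instead fit all comparisons jointly, for example with a Bradley-Terry model.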
Agentic Index vs. Release Date
Agentic Index: Output Token Composition
The total number of tokens used to run the evaluation, including input tokens (prompt), reasoning tokens (for reasoning models), and answer tokens (final response).
Agentic Index: Cost Breakdown
The cost to run the evaluation, calculated using the model's input and output token pricing and the number of tokens used.
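The cost calculation described above can be sketched as follows. This is an illustrative example rather than the site's actual implementation: it assumes prices are quoted in USD per million tokens and that reasoning tokens are billed at the output-token rate, and the function and parameter names are hypothetical.

```python
def evaluation_cost(
    input_tokens: int,
    reasoning_tokens: int,
    answer_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> dict:
    """Break down the cost of an evaluation run by token type.

    Assumption: reasoning tokens are charged at the output price.
    """
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    # Reasoning and answer tokens are both treated as output tokens here.
    output_tokens = reasoning_tokens + answer_tokens
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return {
        "input": input_cost,
        "output": output_cost,
        "total": input_cost + output_cost,
    }


if __name__ == "__main__":
    # Made-up token counts and prices, for illustration only.
    print(evaluation_cost(2_000_000, 500_000, 500_000, 3.0, 15.0))
```

With these made-up figures, the input side costs 6.0 and the output side 15.0, for a total of 21.0.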