Artificial Analysis Capability Indices Methodology
Overview
Artificial Analysis Capability Indices measure how well models perform on specific use cases, ranging from broad capabilities such as coding to professional fields such as legal work.
Indices are either horizontal or vertical. Horizontal indices measure a capability that applies across domains, such as Coding or Agentic work, and are calculated as an equal-weighted average of their component benchmarks. Vertical indices target a single profession or industry, such as Legal, Healthcare & Medical, or Finance & Accounting, and weight a set of capabilities by how often each appears in real-world tasks for that field.
Every component benchmark is run independently by Artificial Analysis and normalized to a 0–100 scale before being combined. Like all evaluation metrics, capability indices have limitations and may not apply directly to every use case, but they provide a useful synthesis for comparing models on the work that matters to a given domain.
The underlying benchmarks, including how each one is run and scored, are documented in the Intelligence Benchmarking methodology.
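As a minimal sketch of the equal-weighted averaging described above for horizontal indices, the snippet below averages component scores that are assumed to be already normalized to the 0-100 scale. The function name `horizontal_index` and the example scores are hypothetical and chosen only for illustration; they are not Artificial Analysis code or published results.

```python
from statistics import fmean


def horizontal_index(component_scores: dict[str, float]) -> float:
    """Equal-weighted average of component scores already on a 0-100 scale."""
    if not component_scores:
        raise ValueError("at least one component score is required")
    for name, score in component_scores.items():
        if not 0.0 <= score <= 100.0:
            raise ValueError(f"{name} score {score} is outside the 0-100 scale")
    return fmean(component_scores.values())


# Hypothetical scores for illustration only, not published results.
coding_scores = {"Agentic Terminal Use": 40.0, "Code Generation": 50.0}
coding_index = horizontal_index(coding_scores)
print(f"Coding Index: {coding_index:.1f}")  # prints "Coding Index: 45.0"
```

Because every component carries the same weight, adding or removing a benchmark changes each remaining component's influence on the index.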
Index Breakdown
Horizontal indices assess a broad capability that applies across many roles and domains and weight their component benchmarks equally. Vertical indices assess capability within a specific industry or job, weighting each component by how frequently it appears in the real tasks of that field. These task weightings draw on an O*NET-style taxonomy of work activities. The table below shows the components and weights for each index.
| Index | Type | Component | Weight | Description |
|---|---|---|---|---|
| Agentic Index | Horizontal | Agentic Knowledge Work | 50% | Tool use, planning, and multi-step execution of real-world knowledge work (GDPval-AA v2) |
| | | Agentic Customer Interaction | 50% | Multi-turn, tool-using customer-support workflows over a knowledge base (𝜏³-Banking) |
| Coding Index | Horizontal | Agentic Terminal Use | 50% | Operating real terminal environments for builds, scripts, and debugging (Terminal-Bench v2.1) |
| | | Code Generation | 50% | Generating and executing scientific research code (SciCode) |
| Engineering Index | Vertical | Engineering Knowledge | 35% | Domain recall across civil, electrical, mechanical, and other engineering disciplines (AA-Omniscience) |
| | | Reasoning | 35% | Multi-step quantitative reasoning for derivations, sizing calculations, and design trade-offs (Humanity's Last Exam, GPQA Diamond, CritPt) |
| | | Agentic Knowledge Work | 25% | Tool use, planning, and multi-step execution of engineering deliverables (GDPval-AA v2) |
| | | Agentic Terminal Use | 5% | Operating real terminal environments for builds, scripts, system administration, and debugging (Terminal-Bench v2.1) |
| Finance & Accounting Index | Vertical | Business Knowledge | 30% | Domain recall in accounting, corporate finance, economics, and investments (AA-Omniscience) |
| | | Agentic Knowledge Work | 30% | Tool use, planning, and multi-step task execution, such as running spreadsheets, querying ERPs, or coordinating a workflow (GDPval-AA v2) |
| | | Reasoning | 20% | Multi-step quantitative and analytic reasoning, used for sensitivity analysis, valuation, and structured problem solving (Humanity's Last Exam) |
| | | Agentic Customer Interaction | 10% | Multi-turn, tool-using customer workflows over an unstructured knowledge base (𝜏³-Banking) |
| | | Long-Context | 5% | Reading and reasoning across long financial filings, deal documents, and research notes (AA-LCR) |
| | | Non-Hallucination | 5% | Avoiding fabricated figures or citations (AA-Omniscience) |
| Strategy & Ops Index | Vertical | Business Knowledge | 30% | Working knowledge of business processes, accounting basics, and operational concepts (AA-Omniscience) |
| | | Agentic Knowledge Work | 30% | Tool use, planning, and orchestrating multi-step office workflows end-to-end (GDPval-AA v2) |
| | | Agentic Customer Interaction | 30% | Multi-turn dialogue and tool use to resolve customer and stakeholder requests (𝜏³-Banking) |
| | | Instruction Following | 5% | Following exact formats and policy constraints (IFBench) |
| | | Long-Context | 5% | Holding context across long threads, policies, and records (AA-LCR) |
| Legal Index | Vertical | Legal Knowledge | 35% | Recall of statutes, doctrines, and procedure across jurisdictions (AA-Omniscience) |
| | | Agentic Knowledge Work | 25% | Running matter-management workflows, drafting pipelines, and tool-augmented research (GDPval-AA v2) |
| | | Long-Context | 10% | Reading and synthesizing across contracts, discovery productions, and case-law packets (AA-LCR) |
| | | Non-Hallucination | 10% | Avoiding fabricated case cites or invented statutes (AA-Omniscience) |
| | | Agentic Customer Interaction | 5% | Multi-turn, tool-using client intake and support workflows (𝜏³-Banking) |
| | | Reasoning | 15% | Multi-step argumentation, statutory interpretation, and weighing conflicting authorities (Humanity's Last Exam) |
| Healthcare & Medical Index | Vertical | Medical & Health Knowledge | 35% | Clinical knowledge across diagnosis, pharmacology, and care pathways (AA-Omniscience) |
| | | Agentic Knowledge Work | 25% | Tool use, planning, and orchestrating EHR/pharmacy workflows end-to-end (GDPval-AA v2) |
| | | Non-Hallucination | 15% | Avoiding fabricated drug interactions, doses, or guidelines (AA-Omniscience) |
| | | Reasoning | 15% | Multi-step clinical reasoning across biology and medicine (Humanity's Last Exam) |
| | | Agentic Customer Interaction | 10% | Multi-turn, tool-using patient and member support workflows (𝜏³-Banking) |
| Economics Index | Vertical | Economics Knowledge | 35% | Recall across micro and macroeconomics, public finance, and markets (AA-Omniscience) |
| | | Reasoning | 35% | Multi-step quantitative and analytic reasoning for modeling, estimation, and inference (Humanity's Last Exam) |
| | | Agentic Knowledge Work | 15% | Tool use, planning, and multi-step execution of analytical deliverables (GDPval-AA v2) |
| | | Long-Context | 15% | Reading and reasoning across long reports, datasets, and research notes (AA-LCR) |
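
To illustrate how a vertical index could be computed from the weights in the table, here is a rough sketch using the Legal Index rows. It assumes each component score has already been normalized to 0-100. The function name `vertical_index` and the component scores are hypothetical and used only for illustration; the weights are the ones listed in the table.

```python
import math

# Weights (in percent) taken from the Legal Index rows of the table.
LEGAL_WEIGHTS = {
    "Legal Knowledge": 35,
    "Agentic Knowledge Work": 25,
    "Long-Context": 10,
    "Non-Hallucination": 10,
    "Agentic Customer Interaction": 5,
    "Reasoning": 15,
}


def vertical_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of 0-100 component scores; weights are percentages."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same components")
    total_weight = sum(weights.values())
    if not math.isclose(total_weight, 100.0):
        raise ValueError(f"weights sum to {total_weight}, expected 100")
    weighted_sum = sum(scores[name] * weights[name] for name in weights)
    return weighted_sum / total_weight


# Hypothetical component scores for illustration only, not published results.
legal_scores = {
    "Legal Knowledge": 60.0,
    "Agentic Knowledge Work": 40.0,
    "Long-Context": 70.0,
    "Non-Hallucination": 80.0,
    "Agentic Customer Interaction": 50.0,
    "Reasoning": 20.0,
}
legal_index = vertical_index(legal_scores, LEGAL_WEIGHTS)
print(f"Legal Index: {legal_index:.1f}")  # prints "Legal Index: 51.5"
```

The check that the weights sum to 100 mirrors the table, where the component weights of every index add up to 100%.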