Artificial Analysis Capability Indices Methodology

Overview

Artificial Analysis Capability Indices measure how well models perform on specific use cases, whether broad capabilities such as coding or professional fields such as legal work.

Indices are either horizontal or vertical. Horizontal indices measure a capability that applies across domains, such as Coding or Agentic work, and are calculated as an equal-weighted average of their component benchmarks. Vertical indices target a single profession or industry, such as Legal, Healthcare & Medical, or Finance & Accounting, and weight a set of capabilities by how often each appears in real-world tasks for that field.

Every component benchmark is run independently by Artificial Analysis and normalized to a 0–100 scale before being combined. Like all evaluation metrics, capability indices have limitations and may not apply directly to every use case, but they provide a useful synthesis for comparing models on the work that matters to a given domain.
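The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not Artificial Analysis's actual implementation: the function names `equal_weighted_index` and `weighted_index` are hypothetical, the scores are invented, and every score is assumed to be already normalized to the 0–100 scale.

```python
from typing import Mapping


def equal_weighted_index(scores: Mapping[str, float]) -> float:
    """Horizontal-style index: plain mean of component scores (each 0-100)."""
    if not scores:
        raise ValueError("at least one component score is required")
    return sum(scores.values()) / len(scores)


def weighted_index(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Vertical-style index: weighted mean of component scores (each 0-100)."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same components")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the total lets weights be given as percentages or fractions.
    return sum(scores[name] * weights[name] for name in weights) / total_weight


# Invented scores, for illustration only.
horizontal = equal_weighted_index({"Benchmark A": 40.0, "Benchmark B": 60.0})
vertical = weighted_index(
    {"Knowledge": 80.0, "Reasoning": 40.0},
    {"Knowledge": 75, "Reasoning": 25},
)
print(horizontal, vertical)
```

With equal weights, `weighted_index` reduces to `equal_weighted_index`, so a horizontal index can be treated as the special case of a vertical index in which every component carries the same weight.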

The underlying benchmarks, including how each one is run and scored, are documented in the Intelligence Benchmarking methodology.

Index Breakdown

Horizontal indices assess a broad capability that applies across many roles and domains and equal-weight their component benchmarks. Vertical indices assess capability within a specific industry or job and weight each component by how frequently it appears in the real tasks of that field. The task weightings draw on an O*NET-style taxonomy of work activities. The table below shows the components, weights, and underlying benchmarks for each index.

| Index | Type | Component | Weight | Description |
|---|---|---|---|---|
| Agentic Index | Horizontal | Agentic Knowledge Work | 50% | Tool use, planning, and multi-step execution of real-world knowledge work (GDPval-AA v2) |
| | | Agentic Customer Interaction | 50% | Multi-turn, tool-using customer-support workflows over a knowledge base (𝜏³-Banking) |
| Coding Index | Horizontal | Agentic Terminal Use | 50% | Operating real terminal environments for builds, scripts, and debugging (Terminal-Bench v2.1) |
| | | Code Generation | 50% | Generating and executing scientific research code (SciCode) |
| Engineering Index | Vertical | Engineering Knowledge | 35% | Domain recall across civil, electrical, mechanical, and other engineering disciplines (AA-Omniscience) |
| | | Reasoning | 35% | Multi-step quantitative reasoning for derivations, sizing calculations, and design trade-offs (Humanity's Last Exam, GPQA Diamond, CritPt) |
| | | Agentic Knowledge Work | 25% | Tool use, planning, and multi-step execution of engineering deliverables (GDPval-AA v2) |
| | | Agentic Terminal Use | 5% | Operating real terminal environments for builds, scripts, system administration, and debugging (Terminal-Bench v2.1) |
| Finance & Accounting Index | Vertical | Business Knowledge | 30% | Domain recall in accounting, corporate finance, economics, and investments (AA-Omniscience) |
| | | Agentic Knowledge Work | 30% | Tool use, planning, and multi-step task execution, such as running spreadsheets, querying ERPs, or coordinating a workflow (GDPval-AA v2) |
| | | Reasoning | 20% | Multi-step quantitative and analytic reasoning, used for sensitivity analysis, valuation, and structured problem solving (Humanity's Last Exam) |
| | | Agentic Customer Interaction | 10% | Multi-turn, tool-using customer workflows over an unstructured knowledge base (𝜏³-Banking) |
| | | Long-Context | 5% | Reading and reasoning across long financial filings, deal documents, and research notes (AA-LCR) |
| | | Non-Hallucination | 5% | Avoiding fabricated figures or citations (AA-Omniscience) |
| Strategy & Ops Index | Vertical | Business Knowledge | 30% | Working knowledge of business processes, accounting basics, and operational concepts (AA-Omniscience) |
| | | Agentic Knowledge Work | 30% | Tool use, planning, and orchestrating multi-step office workflows end-to-end (GDPval-AA v2) |
| | | Agentic Customer Interaction | 30% | Multi-turn dialogue and tool use to resolve customer and stakeholder requests (𝜏³-Banking) |
| | | Instruction Following | 5% | Following exact formats and policy constraints (IFBench) |
| | | Long-Context | 5% | Holding context across long threads, policies, and records (AA-LCR) |
| Legal Index | Vertical | Legal Knowledge | 35% | Recall of statutes, doctrines, and procedure across jurisdictions (AA-Omniscience) |
| | | Agentic Knowledge Work | 25% | Running matter-management workflows, drafting pipelines, and tool-augmented research (GDPval-AA v2) |
| | | Long-Context | 10% | Reading and synthesizing across contracts, discovery productions, and case-law packets (AA-LCR) |
| | | Non-Hallucination | 10% | Avoiding fabricated case cites or invented statutes (AA-Omniscience) |
| | | Agentic Customer Interaction | 5% | Multi-turn, tool-using client intake and support workflows (𝜏³-Banking) |
| | | Reasoning | 15% | Multi-step argumentation, statutory interpretation, and weighing conflicting authorities (Humanity's Last Exam) |
| Healthcare & Medical Index | Vertical | Medical & Health Knowledge | 35% | Clinical knowledge across diagnosis, pharmacology, and care pathways (AA-Omniscience) |
| | | Agentic Knowledge Work | 25% | Tool use, planning, and orchestrating EHR/pharmacy workflows end-to-end (GDPval-AA v2) |
| | | Non-Hallucination | 15% | Avoiding fabricated drug interactions, doses, or guidelines (AA-Omniscience) |
| | | Reasoning | 15% | Multi-step clinical reasoning across biology and medicine (Humanity's Last Exam) |
| | | Agentic Customer Interaction | 10% | Multi-turn, tool-using patient and member support workflows (𝜏³-Banking) |
| Economics Index | Vertical | Economics Knowledge | 35% | Recall across micro and macroeconomics, public finance, and markets (AA-Omniscience) |
| | | Reasoning | 35% | Multi-step quantitative and analytic reasoning for modeling, estimation, and inference (Humanity's Last Exam) |
| | | Agentic Knowledge Work | 15% | Tool use, planning, and multi-step execution of analytical deliverables (GDPval-AA v2) |
| | | Long-Context | 15% | Reading and reasoning across long reports, datasets, and research notes (AA-LCR) |