
Humanities Index

Measures performance on capabilities that matter most for humanities and social sciences work, including domain knowledge, agentic research, and long-context reading. Scores average evaluations spanning agentic benchmarks and the AA-Omniscience Humanities & Social Sciences domain across history, literature, philosophy, politics, and religion.

The headline score is the average of the benchmarks listed below. Each row links to its result chart further down the page when one is available, or out to the underlying benchmark.

  • GDPval-AA v2

    GDPval-AA v2 is Artificial Analysis' evaluation framework for OpenAI's GDPval dataset. It tests AI models on real-world tasks across 44 occupations and 9 major industries. Models are given shell access and web browsing capabilities in an agentic loop via Stirrup to solve tasks, with Elo ratings derived from blind pairwise comparisons.

  • 𝜏²-Bench Telecom

    A dual-control conversational AI benchmark simulating technical support scenarios where both agent and user must coordinate actions to resolve telecom service issues.

  • AA-LCR

    A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens (measured using the cl100k_base tokenizer).

  • AA-Omniscience

    A benchmark measuring factual recall and hallucination across various economically relevant domains.
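
GDPval-AA v2 above reports Elo ratings derived from blind pairwise comparisons. The exact rating procedure is not described on this page, so the following is only a minimal sketch of the textbook Elo update. The K-factor of 32, the starting rating of 1000, the model names, and the comparison outcomes are all assumptions made up for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one blind pairwise comparison.

    k=32 is an assumed K-factor, not a value taken from the benchmark.
    """
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical models and outcomes: (model A, model B, was A preferred?)
ratings = {"model_x": 1000.0, "model_y": 1000.0}
comparisons = [
    ("model_x", "model_y", True),
    ("model_x", "model_y", True),
    ("model_x", "model_y", False),
]

for a, b, a_won in comparisons:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], a_won)

print(ratings)
```

Each update moves both ratings by the same amount in opposite directions, so the total rating across the two models is preserved. A real pipeline may instead fit all comparisons jointly; this sketch only shows how pairwise preferences can turn into a rating.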

Score

Humanities Index

Average of all evaluations on this page · Higher is better
Reasoning models are indicated by a lightbulb icon

Release Date

Humanities Index vs. Release Date

Speed

Humanities Index vs. Output Speed

Humanities Index · Output tokens per second

Model quality and output speed trade off against each other: higher-intelligence models typically generate output more slowly.

Output speed is the number of tokens per second received while the model is generating tokens (i.e., after the first chunk has been received from the API, for models that support streaming).
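
As a rough illustration of the output speed definition above, the sketch below divides the output token count by the time elapsed after the first chunk arrives. The function name, the timestamps, and the choice to count every output token (including those in the first chunk) are assumptions, not details taken from this page.

```python
def output_tokens_per_second(
    first_chunk_time: float,
    last_chunk_time: float,
    output_tokens: int,
) -> float:
    """Tokens per second over the generation window.

    Times are seconds on a shared clock. The window starts when the first
    streamed chunk is received, so time-to-first-token is excluded.
    """
    generation_seconds = last_chunk_time - first_chunk_time
    if generation_seconds <= 0:
        raise ValueError("last chunk must arrive after the first chunk")
    return output_tokens / generation_seconds


# Hypothetical request: first chunk at 0.5 s, last chunk at 5.5 s, 500 tokens.
print(output_tokens_per_second(0.5, 5.5, 500))  # 100.0
```

Measuring from the first chunk rather than from the request start keeps network and queueing latency out of the speed figure.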

Pricing

Pricing: Input and Output Prices

USD per 1M tokens (blended)

Price per token included in the request or message sent to the API, expressed in USD per million tokens.

Figures represent median (P50) measurement over the past 72 hours to reflect sustained changes in performance.
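
A minimal sketch of a rolling P50 like the one described above, assuming measurements arrive as (timestamp, value) pairs. The function name, the sample data, and the inclusive window boundaries are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median


def p50_over_window(samples, now, window_hours=72):
    """Median of the values measured within the last `window_hours`.

    `samples` is an iterable of (timestamp, value) pairs.
    """
    cutoff = now - timedelta(hours=window_hours)
    recent = [value for timestamp, value in samples if cutoff <= timestamp <= now]
    if not recent:
        raise ValueError("no measurements inside the window")
    return median(recent)


# Hypothetical measurements; the 80-hour-old sample falls outside the window.
now = datetime(2025, 1, 4)
samples = [
    (now - timedelta(hours=80), 10.0),
    (now - timedelta(hours=60), 90.0),
    (now - timedelta(hours=30), 100.0),
    (now - timedelta(hours=1), 300.0),
]
print(p50_over_window(samples, now))  # 100.0
```

Using the median rather than the mean means the single 300.0 spike does not pull the figure upward, which matches the stated aim of reflecting sustained changes.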

Humanities Index vs. Price

Humanities Index · USD per 1M tokens (blended)

While higher-intelligence models are typically more expensive, they do not all follow the same price-quality curve.

Token Usage

Humanities Index: Output Token Composition

Tokens used to run the evaluation

The total number of tokens used to run the evaluation, including input tokens (prompt), reasoning tokens (for reasoning models), and answer tokens (final response).

Cost

Humanities Index: Cost Breakdown

Cost (USD) to run the evaluation

The cost to run the evaluation, calculated using the model's input and output token pricing and the number of tokens used.
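
The cost calculation above can be sketched as follows. This assumes prices are quoted in USD per million tokens and that reasoning tokens are billed at the output price; both the billing assumption and the example figures are hypothetical rather than taken from this page.

```python
def evaluation_cost_usd(
    input_tokens: int,
    reasoning_tokens: int,
    answer_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost in USD to run an evaluation.

    Assumption: reasoning and answer tokens are both charged as output tokens.
    """
    output_tokens = reasoning_tokens + answer_tokens
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Hypothetical run: 2M input, 0.5M reasoning, 0.5M answer tokens,
# priced at $3 per 1M input tokens and $15 per 1M output tokens.
print(evaluation_cost_usd(2_000_000, 500_000, 500_000, 3.0, 15.0))  # 21.0
```

Because reasoning tokens count toward output here, a reasoning model can cost noticeably more than its answer length alone would suggest.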

Frequently Asked Questions

The Humanities Index is a composite benchmark from Artificial Analysis that measures performance on capabilities that matter most for humanities and social sciences work, including domain knowledge, agentic research, and long-context reading. Scores average evaluations spanning agentic benchmarks and the AA-Omniscience Humanities & Social Sciences domain across history, literature, philosophy, politics, and religion.

The Humanities Index is calculated as the average of its underlying benchmark scores, normalised to a 0–100 scale.
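
A minimal sketch of that calculation, assuming each benchmark score has already been normalised to the 0–100 scale and that all benchmarks are weighted equally. The normalisation procedure itself is not specified here, and the scores below are placeholder values, not published results.

```python
from statistics import mean


def humanities_index(normalised_scores: dict) -> float:
    """Simple average of benchmark scores already on a 0-100 scale."""
    for name, score in normalised_scores.items():
        if not 0.0 <= score <= 100.0:
            raise ValueError(f"{name} is not on the 0-100 scale: {score}")
    return mean(list(normalised_scores.values()))


# Placeholder scores for illustration only.
example_scores = {
    "GDPval-AA v2": 40.0,
    "tau2-Bench Telecom": 60.0,
    "AA-LCR": 50.0,
    "AA-Omniscience": 70.0,
}
print(humanities_index(example_scores))  # 55.0
```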

The Humanities Index includes GDPval-AA v2, 𝜏²-Bench Telecom, AA-LCR, and AA-Omniscience.

Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) currently has the highest Humanities Index score among models with published results, with a score of 53.

A higher Humanities Index score indicates stronger overall performance across the benchmarks that make up the index. For a specific use case, individual benchmark results may be more informative than the composite score.