Artificial Analysis Intelligence Benchmarking Methodology

Version 2.0 (Released February 2025)

Artificial Analysis Intelligence Index

Artificial Analysis Intelligence Index combines a comprehensive suite of evaluation datasets to assess language model capabilities across reasoning, knowledge, maths and programming.

It is a helpful synthesis of overall language model intelligence and can be used to compare language models. Like all evaluation metrics, it has limitations and may not apply directly to every use case. However, we are confident that it provides a more useful synthesized comparison of language models than any other single metric available today.

Artificial Analysis independently benchmarks dozens of language models using our implementations of 8 evaluation datasets, as summarized below. Our methodology emphasizes fairness and real-world applicability.

We estimate a 95% confidence interval of less than ±1% for Artificial Analysis Intelligence Index V2, based on experiments with more than 10 repeats on certain models across all evaluation datasets included in Intelligence Index V2. Individual evaluation results may have confidence intervals wider than ±1%. We look forward to disclosing further detail from our statistical analysis in the future.

Artificial Analysis Intelligence Index is a text-only, English language evaluation suite. We benchmark models for image inputs, speech inputs and multilingual performance separately to the Intelligence Index evaluation suite.

Artificial Analysis Math Index: Represents the math ability of models. This is calculated as the average of math benchmarks in the Artificial Analysis Intelligence Index: AIME, MATH-500.

Artificial Analysis Coding Index: Represents the coding ability of models. This is calculated as the average of coding benchmarks in the Artificial Analysis Intelligence Index: LiveCodeBench, SciCode.

Artificial Analysis Multilingual Index: Represents the multilingual ability of models. This is calculated as the average of Multilingual MMLU and MGSM across supported languages. We support languages which are common across both evaluations. This includes Spanish, Bengali, German, Japanese, English, Chinese, Swahili and French.

Intelligence Index Evaluation Suite Summary

| Evaluation | Field | Questions | Repeats | Response Type | Scoring | Intelligence Index Weighting | Math Index Weighting | Coding Index Weighting | Multilingual Index Weighting |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU-Pro | Reasoning & Knowledge | 12,032 | 1 | Multiple Choice (10 options) | Regex extraction | 1/6 | — | — | — |
| HLE (Humanity's Last Exam) | Reasoning & Knowledge | 2,684 | 1 | Open Answer | Equality Checker LLM | 1/6 | — | — | — |
| GPQA Diamond | Scientific Reasoning | 198 | 5 | Multiple Choice (4 options) | Regex extraction | 1/6 | — | — | — |
| MATH-500 | Quantitative Reasoning | 500 | 3 | Open Answer | Regex extraction with SymPy-based normalization, plus equality checker LLM as backup | 1/8 | 1/2 | — | — |
| AIME 2024 | Competition Math | 30 | 10 | Numerical Answer | Regex extraction with SymPy-based normalization, plus equality checker LLM as backup | 1/8 | 1/2 | — | — |
| SciCode | Code Generation | 338 subproblems | 3 | Python Code (must pass all unit tests) | Code execution, pass@1, sub-problem scoring with scientist-annotated background prompting | 1/8 | — | 1/2 | — |
| LiveCodeBench | Code Generation | 315 | 3 | Python Code (must pass all unit tests) | Code execution, pass@1 | 1/8 | — | 1/2 | — |
| Multilingual MMLU | Multilingual Reasoning & Knowledge | ~110,000 (~14,000 per language) | 1 | Multiple Choice (4 options) | Regex extraction | — | — | — | 1/2 |
| MGSM | Multilingual Mathematics | ~2,000 (~250 per language) | 1 | Open Answer | Regex extraction | — | — | — | 1/2 |

Intelligence Evaluation Principles

Our evaluation approach is guided by four core principles:

  • Standardized: All models are evaluated under identical conditions with consistent prompting strategies, temperature settings, and evaluation criteria.
  • Unbiased: We employ evaluation techniques that avoid unfairly penalizing models for answers that correctly follow the instructions in our prompts. This includes using clear prompts, robust answer extraction methods, and flexible answer validation to accommodate valid variations in model outputs.
  • Zero-Shot Instruction Prompted: We evaluate using clear instructions without examples or demonstrations, testing models' ability to follow directions without few-shot learning. This approach aligns with modern instruction-tuned and chat-style models.
  • Transparent: We fully disclose our methodology, including prompt templates, evaluation criteria, and limitations.

Intelligence Index Calculation

Artificial Analysis Intelligence Index is calculated as a weighted average across the constituent evaluations, balancing general knowledge, mathematical reasoning, and coding ability. The weighting is distributed as follows (an illustrative sketch of the calculation appears after this list):

  • General Reasoning and Knowledge (50%): Equally weighted between MMLU-Pro, HLE, and GPQA Diamond, representing broad knowledge and reasoning capabilities across academic and scientific domains
  • Mathematical Reasoning (25%): Equally weighted between MATH-500 and AIME 2024, combining general mathematical problem-solving with advanced competition-level mathematics
  • Code Generation (25%): Equally weighted between SciCode and LiveCodeBench, testing Python programming for scientific computing and general competition-style programming
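
For concreteness, the snippet below reproduces this weighting as code. It is an illustrative sketch only: the score dictionary and function name are assumptions, but the weights match the table above (the three reasoning and knowledge evaluations at 1/6 each, the four math and coding evaluations at 1/8 each).

```python
# Illustrative sketch of the Intelligence Index V2 weighting (not production code).
# Scores are assumed to be percentages in the 0-100 range.
WEIGHTS = {
    "MMLU-Pro": 1 / 6,
    "HLE": 1 / 6,
    "GPQA Diamond": 1 / 6,
    "MATH-500": 1 / 8,
    "AIME 2024": 1 / 8,
    "SciCode": 1 / 8,
    "LiveCodeBench": 1 / 8,
}

def intelligence_index(scores: dict[str, float]) -> float:
    """Weighted average of the seven constituent evaluation scores."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 1
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())
```

The Math Index and Coding Index follow the same pattern, averaging their two constituent evaluations with equal weight.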

General Testing Parameters

We test all evals with the following settings (an illustrative sketch of the retry behaviour follows this list):

  • Temperature: 0
  • Maximum output tokens:
    • Non-reasoning models: 4,096 tokens (adjusted downward where models have a smaller context window or a lower maximum output token cap)
    • Reasoning models: Maximum output tokens allowed, as disclosed by model creators (custom setting for each reasoning model)
  • Code evaluation environment:
    • Ubuntu 22.04 LTS
    • Python 3.12
  • Error handling:
    • Automatic retry on API failures (up to 30 attempts)
    • Questions that fail all 30 retries are manually reviewed. Results where persistent API failures have caused issues are not published. Cases where all available APIs for a proprietary model block a certain question may lower scores (this effect is not material)
  • We maintain internal copies of all evaluation datasets. The sources of our selected datasets are listed below.
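
As an illustration of the error-handling policy above, the sketch below retries a failing request up to 30 times. The request_fn placeholder, the backoff schedule and the broad exception handling are assumptions for readability; this is not our production harness.

```python
import time

MAX_ATTEMPTS = 30  # matches the retry policy described above

def call_with_retries(request_fn, *args, **kwargs):
    """Retry a flaky API call up to MAX_ATTEMPTS times.

    `request_fn` stands in for whichever client call issues the completion
    request (temperature=0, maximum output tokens as described above).
    """
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return request_fn(*args, **kwargs)
        except Exception as exc:  # in practice, catch the client's specific error types
            last_error = exc
            time.sleep(min(2 ** attempt, 60))  # capped exponential backoff (assumed)
    raise RuntimeError(f"All {MAX_ATTEMPTS} attempts failed") from last_error
```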

Evaluation Suite Details

MMLU-Pro (Multi-Task Language Understanding Benchmark, Pro version)

  • Description: Comprehensive evaluation of advanced knowledge across domains, adapted from original MMLU
  • Paper: https://arxiv.org/abs/2406.01574
  • Dataset: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro
  • Details:
    • 10 option multiple choice format, adapted from the original version of MMLU (Hendrycks et al., 4 option multiple choice) by Wang et al.
    • 12,032 questions across Math, Physics, Chemistry, Law, Engineering, Economics, Health, Psychology, Business, Biology, Philosophy, and Computer Science
    • Regex-based answer extraction (prompt and regex below)

HLE (Humanity's Last Exam)

  • Description: Recent frontier academic benchmark from the Center for AI Safety (led by Dan Hendrycks)
  • Paper: https://arxiv.org/abs/2501.14249v2
  • Dataset: https://huggingface.co/datasets/cais/hle
  • Implementation:
    • 2,684 challenging questions across mathematics, the humanities and the natural sciences (the full HLE dataset, including questions that require image inputs, totals 3,000 questions; we test only the text-only subset for maximum comparability across models)
    • We note that the HLE authors disclose that their dataset curation process involved adversarial selection of questions based on tests with GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, o1, o1-mini, and o1-preview (the latter two for text-only questions only). We therefore discourage direct comparison of these models with models that were not used in the HLE curation process, as the dataset is potentially biased against the models used in curation.
    • Evaluated with an equality checker LLM prompt, using OpenAI's GPT-4o (Aug 2024) model per the original HLE paper (find prompt below)

GPQA Diamond (Graduate-Level Google-Proof Q&A Benchmark)

  • Description: Scientific knowledge and reasoning benchmark
  • Subset: Diamond subset (198 questions) selected for maximum accuracy and discriminative power
  • Paper: https://arxiv.org/abs/2311.12022
  • Dataset: https://github.com/openai/simple-evals/blob/main/gpqa_eval.py
  • Key Details:
    • 198 questions covering biology, physics and chemistry. We test the GPQA Diamond subset of the full GPQA dataset (448 questions total), which the original authors defined as the highest-quality subset: questions that both experts answered correctly and the majority of non-experts answered incorrectly
    • 4 option multiple choice format
    • Regex-based answer extraction (prompt and regex below)

Mathematical Reasoning:

MATH-500 (Measuring Mathematical Problem Solving)

  • Description:  Mathematical problem-solving across various difficulty levels
  • Paper: https://arxiv.org/abs/2103.03874
  • Dataset: https://github.com/openai/simple-evals/blob/main/math_eval.py
  • Key Implementation Details:
    • MATH-500 is a subset of 500 questions randomly sampled from Hendrycks' 2021 MATH dataset, created by OpenAI in late 2024 after they used 90% of the original 5,000 MATH questions as reinforcement learning training data for their o1-series models
    • We implement a two-stage answer validation mechanism to allow grading with a high degree of precision (minimizing both false negatives and false positives) - see the illustrative sketch after this list.
      • Script-based grading, using OpenAI's PRM800K grading script - https://github.com/openai/prm800k/blob/main/prm800k/grading/grader.py
        • Implements symbolic equality checking via SymPy
        • High-precision validation for exact matches
      • Language model equality checker (runs on all answers not marked correct by script-based grading)
        • We use Llama 3.3 70B as the equality checker (prompt disclosed below)
        • We tested Llama 3.3 70B for agreement with human judgement and assessed it to grade correctly in >99% of cases
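
For illustration, the sketch below wires the two stages together. It assumes answers arrive inside \boxed{}, uses SymPy's sympify/simplify for the symbolic check, and defers to an llm_checker callable for anything the script cannot confirm. It is a simplified stand-in for the PRM800K grading script and our harness, not a copy of either.

```python
import re
import sympy

def extract_boxed(response: str) -> str | None:
    """Pull the contents of the last \\boxed{...} in a response (no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None

def symbolically_equal(candidate: str, reference: str) -> bool:
    """Stage 1: symbolic equality check (simplified stand-in for the PRM800K grader)."""
    try:
        return sympy.simplify(sympy.sympify(candidate) - sympy.sympify(reference)) == 0
    except Exception:
        return False  # unparseable expressions fall through to stage 2

def grade(response: str, reference: str, llm_checker) -> bool:
    """Two-stage grading: script-based check first, LLM equality checker as backup."""
    answer = extract_boxed(response)
    if answer is None:
        return False
    if symbolically_equal(answer, reference):
        return True
    return llm_checker(answer, reference)  # only answers not marked correct reach the LLM
```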

AIME (American Invitational Mathematics Examination) 2024

  • Description:  Advanced mathematical problem-solving dataset from the 2024 American Invitational Mathematics Examination
  • Dataset: https://huggingface.co/datasets/AI-MO/aimo-validation-aime/viewer
  • Key Implementation Details:
    • Strict numerical answer format - all answers are an integer between 1 and 999 (inclusive)
    • Similarly to MATH-500, we implement a two-stage answer validation mechanism to allow grading with a high degree of precision (minimizing both false negatives and false positives).
      • Script-based grading, using OpenAI's PRM800K grading script - https://github.com/openai/prm800k/blob/main/prm800k/grading/grader.py
        • Implements symbolic equality checking via SymPy
        • High-precision validation for exact matches
      • Language model equality checker (runs on all answers not marked correct by script-based grading)
        • We use Llama 3.3 70B as the equality checker (prompt disclosed below)
        • We tested Llama 3.3 70B for agreement with human judgement and assessed it to grade correctly in >99% of cases

Code Generation and Comprehension:

HumanEval

  • Not included in the Intelligence Index V2 weighting (see summary table above); our prompt template is included below for reference

SciCode

  • Description: Python code generation for scientific computing; 338 subproblems scored by code execution (pass@1) with scientist-annotated background prompting (prompt template below)

LiveCodeBench

  • Description: Competition-style Python code generation; 315 questions scored by code execution (pass@1) (prompt template below)

Prompt Templates, Answer Extraction and Evaluation

Multiple Choice Questions (MMLU-Pro & GPQA)

We prompt multi-choice evals with the following instruction prompt (for MMLU Pro, we include A/B/C/D/E/F/G/H/I/J as options). This prompt was independently developed by Artificial Analysis, and carefully validated with various ablation studies. We assess that this prompt is a clearer, and therefore fairer, approach than traditional completion-style multi-choice evaluation methodologies or other instruction prompts we tested.
Answer the following multiple choice question. The last line of your response should be in the following format: 'Answer: A/B/C/D' (e.g. 'Answer: A'). 

{Question}

A) {A}
B) {B}
C) {C}
D) {D}
For Instruct models (non-reasoning models), we append "Think step by step before answering." to the instruction in the first line of the prompt.

Multiple Choice Extraction Regex

We extract the answer from the last line of the response using the following regex:
(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
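
For reference, the snippet below shows one way this regex can be applied to the last non-empty line of a response. The helper name and fallback behaviour are illustrative rather than our production extraction code.

```python
import re

# The extraction regex quoted above; matches e.g. "Answer: C" or "**Answer: C**"
ANSWER_RE = re.compile(
    r"(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])"
)

def extract_choice(response: str) -> str | None:
    """Apply the regex to the last non-empty line of the model response."""
    lines = [line for line in response.strip().splitlines() if line.strip()]
    if not lines:
        return None
    match = ANSWER_RE.search(lines[-1])
    return match.group(1).upper() if match else None

print(extract_choice("Reasoning...\n**Answer: B**"))  # -> "B"
```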

HLE (Humanity's Last Exam)

We prompt HLE with the following text as a system prompt, with a difference between Exact Answer and Multiple Choice questions - these prompts are used with no changes from the original implementation by Hendrycks et al. For models that do not support a system prompt, we prepend the system prompt to the user prompt.
System Prompt for Exact Answer Questions: 

"Your response should be in the following format:
Explanation: {your explanation for your final answer}
Exact Answer: {your succinct, final answer}
Confidence: {your confidence score between 0% and 100% for your answer}"

System Prompt for Multiple Choice Questions: 

"Your response should be in the following format:
Explanation: {your explanation for your answer choice}
Answer: {your chosen answer}
Confidence: {your confidence score between 0% and 100% for your answer}"

HLE Equality Checker Prompt

We use the following prompt on GPT-4o (Aug 2024) to check if the response answer is correct - this prompt is again used with no changes from the original implementation by Hendrycks et al.
Judge whether the following [response] to [question] is correct or not based on the precise and unambiguous [correct_answer] below.

[question]: {question}

[response]: {response}

Your judgement must be in the format and criteria specified below:

extracted_final_answer: The final exact answer extracted from the [response]. Put the extracted answer as 'None' if there is no exact, final answer to extract from the response.

[correct_answer]: {correct_answer}

reasoning: Explain why the extracted_final_answer is correct or incorrect based on [correct_answer], focusing only on if there are meaningful differences between [correct_answer] and the extracted_final_answer. Do not comment on any background to the problem, do not attempt to solve the problem, do not argue for any answer different than [correct_answer], focus only on whether the answers match.

correct: Answer 'yes' if extracted_final_answer matches the [correct_answer] given above, or is within a small margin of error for numerical problems. Answer 'no' otherwise, i.e. if there if there is any inconsistency, ambiguity, non-equivalency, or if the extracted answer is incorrect.


confidence: The extracted confidence score between 0|%| and 100|%| from [response]. Put 100 if there is no confidence score available.
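
Because the judge responds in the labelled format above, the final verdict can be read from its 'correct:' field. The helper below is a hypothetical parser for that field; the reference HLE implementation may obtain these fields by other means (for example, structured outputs).

```python
import re

def parse_judge_verdict(judge_output: str) -> bool:
    """Hypothetical helper: True if the judge's 'correct:' field is 'yes'."""
    match = re.search(r"(?im)^correct:\s*(yes|no)\b", judge_output)
    return match is not None and match.group(1).lower() == "yes"

example = "extracted_final_answer: 42\nreasoning: Matches the correct answer.\ncorrect: yes\nconfidence: 90"
print(parse_judge_verdict(example))  # -> True
```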

Mathematical Questions (MATH-500 & AIME)

We prompt MATH-500 and AIME with the following instruction prompt:
Solve the following math problem step by step. Put your answer inside \\boxed{{}}.

{Question}

Remember to put your answer inside \\boxed{{}}.

Mathematical Equality Checker Prompt

As described above, we supplement our script-based grading with a language model equality checker. We use the following prompt with Llama 3.3 70B to check if two answers are equivalent. This prompt was developed by OpenAI and released in their simple-evals repository.
Look at the following two expressions (answers to a math problem) and judge whether they are equivalent. Only perform trivial simplifications

Examples:

    Expression 1: $2x+3$
    Expression 2: $3+2x$

Yes

    Expression 1: 3/2
    Expression 2: 1.5

Yes

    Expression 1: $x^2+2x+1$
    Expression 2: $y^2+2y+1$

No

    Expression 1: $x^2+2x+1$
    Expression 2: $(x+1)^2$

Yes

    Expression 1: 3245/5
    Expression 2: 649

No
(these are actually equal, don't mark them equivalent if you need to do nontrivial simplifications)

    Expression 1: 2/(-3)
    Expression 2: -2/3

Yes
(trivial simplifications are allowed)

    Expression 1: 72 degrees
    Expression 2: 72

Yes
(give benefit of the doubt to units)

    Expression 1: 64
    Expression 2: 64 square feet

Yes
(give benefit of the doubt to units)

---

YOUR TASK


Respond with only "Yes" or "No" (without quotes). Do not include a rationale.

    Expression 1: %(expression1)s
    Expression 2: %(expression2)s
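
The %(expression1)s and %(expression2)s placeholders indicate that the template is filled using Python %-formatting. The sketch below shows how such a check could be wired up; the llm_complete callable and the argument ordering are assumptions, not our production checker.

```python
EQUALITY_TEMPLATE = "..."  # the full prompt shown above, ending with the two Expression placeholders

def answers_equivalent(expression1: str, expression2: str, llm_complete) -> bool:
    """Fill the template and interpret the checker's Yes/No verdict.

    `llm_complete` stands in for a call to the Llama 3.3 70B equality checker.
    """
    prompt = EQUALITY_TEMPLATE % {"expression1": expression1, "expression2": expression2}
    return llm_complete(prompt).strip().lower().startswith("yes")
```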

Code Generation Tasks

HumanEval

We prompt HumanEval with the following instruction prompt, followed by the function definition and docstring.
Read the following function signature and docstring, and fully implement the function described. Your response should only contain the code for this function.

SciCode

We prompt SciCode with the following prompt, used with no changes from the original implementation of the Scientist Annotated Background prompt by Tian et al.
PROBLEM DESCRIPTION:
You will be provided with the main description of the problem, previous steps, and the next step. Your task will be to generate the disciplinary knowledge necessary for solving the next step and then develop a Python solution focused on this step.

PREVIOUS STEPS DESCRIPTION:
{problem_steps_str}

NEXT STEP - PROBLEM DESCRIPTION AND FUNCTION HEADER:
This part will describe the next step in the problem-solving process. First, provide the necessary scientific background knowledge as a comment at the beginning of your response, starting with 'Background: '. Then, a function header will be provided, and your task is to develop the Python code for this next step based on the provided description and function header.

{next_step_str}

DEPENDENCIES:
Use only the following dependencies in your solution. Do not include these dependencies at the beginning of your code.
{dependencies}

RESPONSE GUIDELINES:
1. Start with the scientific background required for the next step, formatted as a comment.
2. Then write the complete and executable Python program for the next step in a single block.
3. Your response should focus exclusively on implementing the solution for the next step, adhering closely to the specified function header and the context provided by the initial steps.
4. DO NOT include previous function code, example usage or test code in your response.
5. Ensure your response is in the format of ```python``` and includes the necessary background as a comment at the top.

Example:
```python
# Background: [Here, insert the necessary scientific knowledge required for the next step.]

[Insert the Python code here based on the provided function header and dependencies.]
```

LiveCodeBench

We prompt LiveCodeBench with the following prompt, used with no changes from the original implementation of the LiveCodeBench prompt by the original team. We note, however, that we do not apply the custom system prompts that the LiveCodeBench team uses - we use neither their generic system prompt nor their custom system prompts for certain models.

Questions with starter code:

### Question:
{question.question_content}

### Format: You will use the following starter code to write the solution to the problem and enclose your code within delimiters.
```python
{question.starter_code}
```

### Answer: (use the provided format with backticks)


Questions without starter code:

### Question:
{question.question_content}

### Format: Read the inputs from stdin solve the problem and write the answer to stdout (do not directly test on the sample inputs). Enclose your code within delimiters as follows. Ensure that when the python program runs, it reads the inputs, runs the algorithm and writes output to STDOUT.
```python
# YOUR CODE HERE
```

### Answer: (use the provided format with backticks)

Code Extraction Regex

We extract the code from the response using the following regex:
(?<=```python\n)((?:\n|.)+?)(?=\n```)
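
For reference, the snippet below applies this regex in Python to pull the fenced code block out of a model response. The triple-backtick fence is assembled from a variable only so that it displays cleanly on this page; the helper name is illustrative.

```python
import re

FENCE = "`" * 3  # literal triple backtick, assembled here to avoid nested fences on this page
CODE_RE = re.compile(rf"(?<={FENCE}python\n)((?:\n|.)+?)(?=\n{FENCE})")

def extract_code(response: str) -> str | None:
    """Return the body of the first python-fenced code block in the response."""
    match = CODE_RE.search(response)
    return match.group(1) if match else None

sample = f"Here is the solution:\n{FENCE}python\nprint('hello')\n{FENCE}\n"
print(extract_code(sample))  # -> "print('hello')"
```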

Version History

  • V2.0: 11 February 2025 to current
  • V1.0 to V1.3: January 2024 to 10 February 2025