Healthcare & Medical Index

Assesses model performance across the healthcare and medical domain. Capabilities evaluated include domain-specific knowledge (medicine, public health, biomedical sciences), clinical diagnosis and assessment, patient documentation, medication management, and more.

The Artificial Analysis Healthcare & Medical Index combines performance across benchmarks chosen for clinical and healthcare-support work. We map common tasks from O*NET occupational classifications, then select benchmarks that represent this real-world work. Weights are derived from how often capabilities appear across those tasks.

This composite metric provides a single score for tracking model performance across healthcare tasks. All underlying benchmarks are run independently by Artificial Analysis. See our Intelligence Benchmarking Methodology for how evaluations are conducted.

Capability	Weight	Evaluations
Medical & Health Knowledge	35%	AA-Omniscience Health Accuracy
Agentic Knowledge Work	25%	GDPval-AA v2
Non-Hallucination	15%	AA-Omniscience Non-Hallucination
Reasoning	15%	HLE
Agentic Customer Interaction	10%	𝜏³-Banking
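
As a rough illustration of how a weighted composite like this is formed, the Python sketch below applies the weights from the table to a set of capability sub-scores. The sub-score values are hypothetical placeholders, and the 0-100 scale and the function name `composite_index` are assumptions made for the example, not details published on this page.

```python
# Capability weights taken from the table above.
WEIGHTS = {
    "Medical & Health Knowledge": 0.35,
    "Agentic Knowledge Work": 0.25,
    "Non-Hallucination": 0.15,
    "Reasoning": 0.15,
    "Agentic Customer Interaction": 0.10,
}


def composite_index(sub_scores: dict[str, float]) -> float:
    """Weighted average of capability sub-scores (assumed to share a 0-100 scale)."""
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


# Hypothetical sub-scores for one model, for illustration only.
example_scores = {
    "Medical & Health Knowledge": 60.0,
    "Agentic Knowledge Work": 40.0,
    "Non-Hallucination": 50.0,
    "Reasoning": 30.0,
    "Agentic Customer Interaction": 70.0,
}

print(round(composite_index(example_scores), 1))
```

Because the weights sum to 100%, no renormalisation step is needed; a model missing any sub-score raises an error rather than receiving a partial composite.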

Component Benchmarks

AA-Omniscience: Knowledge and Hallucination Benchmark

GDPval-AA v2 Leaderboard

𝜏³-Banking Benchmark Leaderboard

Humanity's Last Exam Benchmark Leaderboard

Related Links

Agentic Index

Coding Index

Finance & Accounting Index

Score

Artificial Analysis Healthcare & Medical Index

Incorporates 4 evaluations: AA-Omniscience, GDPval-AA v2, Humanity's Last Exam, 𝜏³-Banking · Higher is better

Reasoning models are indicated by a lightbulb icon

Artificial Analysis Healthcare & Medical Index: Capability Breakdown

Incorporates 4 evaluations: AA-Omniscience, GDPval-AA v2, Humanity's Last Exam, 𝜏³-Banking · Segmented by contribution

Capability Breakdown

Artificial Analysis Healthcare & Medical Index: Medical & Health Knowledge

Incorporates 1 evaluation: AA-Omniscience · Higher is better

Representative Workflows

Real-world workflows that exercise the capabilities the Healthcare & Medical Index weights most heavily.

Patient diagnosis & clinical assessment (capabilities: Medical & Health Knowledge, Non-Hallucination, Reasoning)

Example: Reassess a returning patient whose symptoms have worsened. Compare the new labs and imaging against the original EHR workup, build a differential, and surface the alternative diagnoses the findings point to.

Treatment administration & procedures (capabilities: Medical & Health Knowledge, Non-Hallucination)

Example: A surgical team encounters unexpected anatomy partway through a laparoscopic procedure. Retrieve comparable case reports and imaging precedents, and quickly surface the findings relevant to the team's immediate decision.

Patient documentation & charting (capabilities: Medical & Health Knowledge, Agentic Knowledge Work)

Example: Turn a clinician's dictated notes from a follow-up visit into a structured SOAP note, pulling the patient's active problems and relevant history from the existing chart, placing each finding in the right section, and flagging the gaps the next provider would need filled.

Medication management & pharmacy coordination (capabilities: Medical & Health Knowledge, Non-Hallucination, Reasoning)

Example: Calculate a child's per-dose amount from their measurements and the prescriber's notes against the available suspension concentration, convert it to the millilitres to measure at each dose, and produce caregiver instructions that keep the total within the safe daily range.

Continuing education & research (capabilities: Medical & Health Knowledge, Agentic Knowledge Work, Reasoning)

Example: Evaluate whether a dermatology team should adopt a newer procedure backed by emerging but limited long-term evidence: summarise the published trials and safety data, compare outcomes against the current standard of care, and outline the open questions the team still needs to resolve.

Patient education & care coordination (capabilities: Medical & Health Knowledge, Agentic Knowledge Work, Agentic Customer Interaction)

Example: Turn a patient's after-visit summary into plain-language, step-by-step home-care instructions in their preferred language, anticipate the questions they are most likely to ask, and confirm the follow-up appointment and how to reach the clinic with concerns.

Release Date

Artificial Analysis Healthcare & Medical Index vs. Release Date

Cost

Artificial Analysis Healthcare & Medical Index: Cost per Task

Average cost per task (USD), broken down by input, cache hit, cache write, reasoning, and answer tokens

Average cost per task in the index. Costs are split by input, cache hit, cache write, reasoning, and answer token pricing where canonical token counts are available.

Artificial Analysis Healthcare & Medical Index: Total Cost

Total cost (USD) to run the index

The cost to run the index, calculated using the model's input and output token pricing and the number of tokens used.
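
A minimal sketch of this calculation in Python, under stated assumptions: prices are quoted in USD per million tokens, output tokens include both reasoning and answer tokens, and cache hit and cache write pricing is ignored. The `Pricing` class, the function name, and all numbers are hypothetical; this page does not publish the exact formula.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    input_per_million: float   # USD per 1M input tokens (assumed unit)
    output_per_million: float  # USD per 1M output tokens (assumed unit)


def total_cost_usd(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Token counts multiplied by per-token prices, summed over input and output."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


# Hypothetical prices and token counts for a full run of the index.
pricing = Pricing(input_per_million=3.0, output_per_million=15.0)
cost = total_cost_usd(input_tokens=2_000_000, output_tokens=500_000, pricing=pricing)
print(f"${cost:.2f}")
```

Dividing the same total by the number of tasks would give an average cost per task; splitting it by token category would additionally require per-category token counts and prices.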

Speed

Artificial Analysis Healthcare & Medical Index: Time per Task

Weighted average decode time (minutes) per task; excludes TTFT and overhead time · Lower is better

The weighted average time (minutes) per index task. This is calculated by dividing output tokens per task by output speed, weighted by the relative weights of each benchmark in the index.
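
The description above can be sketched in Python as follows. The weights mirror the capability table; the per-benchmark token counts, the single model-wide output speed in tokens per second, and the function name are assumptions for illustration, since this page does not specify how speed is measured per benchmark.

```python
def weighted_time_per_task_minutes(
    tokens_per_task: dict[str, float],
    weights: dict[str, float],
    output_speed_tps: float,
) -> float:
    """Weighted average decode time per task, in minutes.

    Decode time for a benchmark is output tokens per task divided by output
    speed; TTFT and overhead are deliberately left out.
    """
    if output_speed_tps <= 0:
        raise ValueError("output speed must be positive")
    total_weight = sum(weights.values())
    weighted_seconds = sum(
        weights[name] * tokens_per_task[name] / output_speed_tps for name in weights
    )
    return weighted_seconds / total_weight / 60.0


# Index weights per evaluation, from the capability table.
weights = {
    "AA-Omniscience Health Accuracy": 0.35,
    "GDPval-AA v2": 0.25,
    "AA-Omniscience Non-Hallucination": 0.15,
    "HLE": 0.15,
    "𝜏³-Banking": 0.10,
}

# Hypothetical average output tokens per task for one model.
tokens_per_task = {
    "AA-Omniscience Health Accuracy": 1_200,
    "GDPval-AA v2": 24_000,
    "AA-Omniscience Non-Hallucination": 1_200,
    "HLE": 12_000,
    "𝜏³-Banking": 6_000,
}

minutes = weighted_time_per_task_minutes(tokens_per_task, weights, output_speed_tps=100.0)
print(f"{minutes:.2f} minutes per task")
```

Dividing by the sum of the weights keeps the result a true weighted average even if a caller passes weights that do not sum exactly to one.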

Output Tokens

Artificial Analysis Healthcare & Medical Index: Output Tokens per Task

Output tokens used to run one task, broken down by reasoning and answer tokens

The average number of answer and reasoning tokens produced per benchmark task in this index.

Frequently Asked Questions

Which AI is best for doctors and clinicians?

Based on the Artificial Analysis Healthcare & Medical Index, the top-performing AI models for healthcare and medical work are currently Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (49), GPT-5.6 Sol (max) (45), and Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (44). Rankings are updated as new models are released.

Is there an AI benchmark for healthcare and medical work?

Yes. The Healthcare & Medical Index from Artificial Analysis is an independent benchmark of how AI models perform on healthcare and medical work. It measures performance on the medical and healthcare domain, including clinical knowledge, clinical reasoning, agentic workflows, and non-hallucination.

What is the Healthcare & Medical Index?

The Healthcare & Medical Index is a composite benchmark from Artificial Analysis that assesses model performance across the healthcare and medical domain. Capabilities evaluated include domain-specific knowledge (medicine, public health, biomedical sciences), clinical diagnosis and assessment, patient documentation, medication management, and more.

How is the Healthcare & Medical Index calculated?

The Healthcare & Medical Index is calculated as a weighted average of its capability sub-scores. The sub-scores and their weights are: Medical & Health Knowledge (35%), Agentic Knowledge Work (25%), Non-Hallucination (15%), Reasoning (15%), and Agentic Customer Interaction (10%).

Which benchmarks are included in the Healthcare & Medical Index?

The Healthcare & Medical Index includes AA-Omniscience Health Accuracy, GDPval-AA v2, AA-Omniscience Non-Hallucination, HLE, and 𝜏³-Banking.

Which AI model has the highest Healthcare & Medical Index score?

Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) currently has the highest Healthcare & Medical Index score, with a score of 49 among models with published results.

How should I interpret the Healthcare & Medical Index?

A higher Healthcare & Medical Index score indicates stronger overall performance across the benchmarks that make up the index. For a specific use case, individual benchmark results may be more informative than the composite score.