Vision Models: LLMs with Image Input Capabilities
Compare multimodal LLMs that support image and text input using the Artificial Analysis Visual Reasoning Index, alongside performance, pricing, and latency across providers, to choose the best image-capable LLM for vision workloads. For further details, see the methodology page.
Summary Analysis
Visual Reasoning vs. Image Input Price
Visual Reasoning: based on the MMMU Pro evaluation of 1.7k questions, this axis represents the model's ability to interpret and reason over images.
Image Input Price: price for 1,000 images at a resolution of 1 megapixel (1024 x 1024) processed by the model.
Visual Reasoning vs. Latency (Single Image & 1,000 Language Tokens Input)
Visual Reasoning: based on the MMMU Pro evaluation of 1.7k questions, this axis represents the model's ability to interpret and reason over images.
Latency: time to first token received, in seconds, after the API request is sent. For reasoning models that share reasoning tokens, this is the first reasoning token. For models that do not support streaming, it represents the time to receive the full completion.
Intelligence
Visual Reasoning Intelligence (MMMU Pro evaluation)
A multimodal reasoning quality evaluation based on 1.7k questions that require interpreting and reasoning over images.
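For context, here is a minimal sketch of the kind of image-plus-text request such an evaluation involves, assuming an OpenAI-compatible Chat Completions API; the model name and image URL are placeholders, not the evaluation's actual setup.

```python
# Minimal sketch: send one image and a text question to a vision-capable
# model via an OpenAI-compatible Chat Completions API. Model name and
# image URL are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # any image-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this diagram show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/figure.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```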
Pricing
Image Input Pricing
Price for 1,000 images at a resolution of 1 megapixel (1024 x 1024) processed by the model.
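As a rough illustration of how a per-image price can relate to per-token rates, the sketch below converts an assumed tokens-per-image count and an assumed per-million-token input price into a 1,000-image price; both inputs are hypothetical, and some providers price images directly rather than via tokens.

```python
# Hypothetical derivation of a 1,000-image price from per-token rates.
# Real models tokenize a 1 MP (1024 x 1024) image into a model-specific
# number of tokens; both values below are assumptions.
tokens_per_image = 1_100        # assumed tokens for one 1 MP image
input_usd_per_m_tokens = 2.50   # assumed USD per million input tokens

usd_per_image = tokens_per_image / 1_000_000 * input_usd_per_m_tokens
print(f"${1_000 * usd_per_image:.2f} per 1,000 images")  # $2.75
```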
Pricing: Language Input, Image Input, and Language Output
Input price: price per token included in the request/message sent to the API, in USD per million tokens.
Output price: price per token generated by the model (received from the API), in USD per million tokens.
Image input price: price for 1,000 images at a resolution of 1 megapixel (1024 x 1024) processed by the model.
Figures represent performance of the model's first-party API (e.g. OpenAI for o1) or the median across providers where a first-party API is not available (e.g. Meta's Llama models).
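As a hypothetical worked example, the sketch below combines the three prices into a single per-request cost; all rates are illustrative, not any model's actual pricing.

```python
# Illustrative per-request cost: 1,000 language input tokens, one 1 MP
# image, and 500 output tokens. All rates are assumptions.
input_usd_per_m = 2.50        # USD per million input tokens (assumed)
output_usd_per_m = 10.00      # USD per million output tokens (assumed)
usd_per_1000_images = 2.75    # USD per 1,000 images at 1 MP (assumed)

input_tokens, output_tokens, images = 1_000, 500, 1

cost = (
    input_tokens / 1_000_000 * input_usd_per_m
    + output_tokens / 1_000_000 * output_usd_per_m
    + images / 1_000 * usd_per_1000_images
)
print(f"${cost:.5f} per request")  # $0.01025
```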
Latency & Speed
Latency (Single Image & 1,000 Language Tokens Input)
Time to first token received, in seconds, after the API request is sent. For reasoning models that share reasoning tokens, this is the first reasoning token. For models that do not support streaming, it represents the time to receive the full completion.
Figures represent performance of the model's first-party API (e.g. OpenAI for o1) or the median across providers where a first-party API is not available (e.g. Meta's Llama models).
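A minimal sketch of how time to first token can be measured over a streaming request, assuming an OpenAI-compatible API; the model name and prompt are placeholders, and a real benchmark would send one image plus roughly 1,000 language tokens as described above.

```python
# Minimal TTFT measurement over a streaming request, assuming an
# OpenAI-compatible API. Model and prompt are placeholders; a real
# benchmark would include an image in the message content.
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "Describe the attached image."}],
    stream=True,
)
for chunk in stream:
    # The first content-bearing chunk marks time to first token; for
    # reasoning models that stream reasoning tokens, the first reasoning
    # token would be the stopping point instead.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"Time to first token: {time.perf_counter() - start:.2f}s")
        break
```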
Latency Variance (Single Image & 1,000 Language Tokens Input)
Time to first token received, in seconds, after the API request is sent. For reasoning models that share reasoning tokens, this is the first reasoning token. For models that do not support streaming, it represents the time to receive the full completion.
Figures represent performance of the model's first-party API (e.g. OpenAI for o1) or the median across providers where a first-party API is not available (e.g. Meta's Llama models).
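A short sketch of summarizing latency variance from repeated measurements; the sample values below are made up, and the percentile summary is one plausible way to express the spread a chart like this shows.

```python
# Summarize spread across repeated TTFT measurements (hypothetical data).
import statistics

ttft_seconds = [0.42, 0.45, 0.39, 0.88, 0.41, 0.47, 1.25, 0.44]

median = statistics.median(ttft_seconds)
p95 = statistics.quantiles(ttft_seconds, n=20)[18]  # 95th percentile
print(f"median: {median:.2f}s, p95: {p95:.2f}s")
```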

Output Speed (Single Image & 1,000 Language Tokens Input)
Tokens per second received while the model is generating tokens (i.e., after the first chunk has been received from the API, for models that support streaming).
Figures represent performance of the model's first-party API (e.g. OpenAI for o1) or the median across providers where a first-party API is not available (e.g. Meta's Llama models).
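A minimal sketch of computing output speed from a streamed response, assuming an OpenAI-compatible API; counting one token per streamed chunk is a simplification, since chunks and tokens do not correspond exactly.

```python
# Approximate output tokens/second from a streamed response, timed from
# the first content chunk onward. Assumes an OpenAI-compatible API and
# one token per chunk (a simplification); model name is a placeholder.
import time
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize streaming APIs."}],
    stream=True,
)

first_chunk_at = None
tokens_after_first = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()  # timing starts here
        else:
            tokens_after_first += 1

if first_chunk_at is not None and tokens_after_first:
    elapsed = time.perf_counter() - first_chunk_at
    print(f"{tokens_after_first / elapsed:.1f} tokens/s")
```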