Vision Models: LLMs with Image Input Capabilities

Compare multimodal LLMs that support image and text input with the Artificial Analysis Visual Reasoning Index. Weigh intelligence, pricing, and latency across providers to choose the best image-capable LLM for vision workloads. For further details, see the methodology page.

Intelligence: MMMU Pro (intelligence benchmark)
Speed: Output tokens per second; higher is better
Price: USD per 1k images at 1MP (1024x1024)

Summary Analysis

Visual Reasoning vs. Image Input Price

Visual Reasoning Intelligence: MMMU Pro evaluation, Image Input Price: USD per 1k images at 1MP (1024x1024)
Claude 4 Sonnet (Reasoning)
Claude 4.5 Haiku (Reasoning)
Gemini 2.5 Flash (Reasoning)
Gemini 2.5 Flash-Lite (Reasoning)
GPT-5.1 (Non-reasoning)
Llama 4 Maverick
Qwen3 VL 30B A3B (Reasoning)
Qwen3 VL 8B (Reasoning)

Based on the MMMU Pro evaluation of 1.7k questions, this represents the model's ability to interpret and reason over images.

Price for 1,000 images at a resolution of 1 Megapixel (1024 x 1024) processed by the model.
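As a quick illustration of this pricing unit, the sketch below converts a "USD per 1k images" rate into a dataset-level cost. The $1.50 rate is hypothetical, not a quoted price for any model:

```python
# Hypothetical figures: convert a "USD per 1k images at 1MP" rate
# into the cost of processing a batch of images.

def image_input_cost(price_per_1k_images: float, num_images: int) -> float:
    """Cost in USD to process `num_images` images at ~1MP each."""
    return price_per_1k_images * num_images / 1000

# e.g. a model priced at a hypothetical $1.50 per 1k images, 250 images:
cost = image_input_cost(1.50, 250)
print(f"${cost:.4f}")  # → $0.3750
```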

Visual Reasoning vs. Latency (Single Image & 1,000 Language Tokens Input)

Visual Reasoning Intelligence: MMMU Pro evaluation, Latency (Time to First Token)
Claude 4 Sonnet (Reasoning)
Claude 4.5 Haiku (Reasoning)
Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
Gemini 2.5 Flash (Reasoning)
Gemini 2.5 Flash-Lite (Reasoning)
Gemini 3 Flash Preview (Reasoning)
Gemini 3.1 Flash-Lite Preview
Gemini 3.1 Pro Preview
GPT-5.1 (Non-reasoning)
GPT-5.3 Codex (xhigh)
GPT-5.4 (xhigh)
Grok 4.1 Fast (Reasoning)
Grok 4.20 Beta 0309 (Reasoning)
Kimi K2.5 (Reasoning)
Llama 4 Maverick
Mistral Large 3
Nova 2.0 Pro Preview (medium)
Qwen3 VL 30B A3B (Reasoning)
Qwen3 VL 8B (Reasoning)
Qwen3.5 397B A17B (Reasoning)

Time to first token received, in seconds, after the API request is sent. For reasoning models that stream reasoning tokens, this is the time to the first reasoning token. For models that do not support streaming, this represents the time to receive the full completion.
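A minimal sketch of how time to first token can be measured against a streaming response. The `fake_stream` generator here is a stand-in for a real streaming API client, and the 50 ms delay simulates network and server latency:

```python
import time
from typing import Iterable, Iterator

def measure_ttft(stream: Iterable[str]) -> tuple[float, list[str]]:
    """Return (seconds to first chunk, all chunks) for a streaming response.

    For models that surface reasoning tokens, the first chunk received
    would be a reasoning token rather than answer text.
    """
    start = time.perf_counter()
    chunks: list[str] = []
    ttft = 0.0
    for chunk in stream:
        if not chunks:
            ttft = time.perf_counter() - start  # first token arrives here
        chunks.append(chunk)
    return ttft, chunks

# Usage with a stand-in generator that simulates time to first token:
def fake_stream() -> Iterator[str]:
    time.sleep(0.05)  # simulated delay before the first chunk
    yield "Hello"
    yield " world"

ttft, chunks = measure_ttft(fake_stream())
```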

Intelligence

Visual Reasoning Intelligence (MMMU Pro evaluation)

Visual Reasoning Intelligence: MMMU Pro evaluation

Multimodal reasoning quality evaluation based on 1.7k questions that require interpreting and reasoning over images.

Pricing

Image Input Pricing

Image Input Price: USD per 1k images at 1MP (1024x1024)

Pricing: Language Input, Image Input and Language Output

Price (USD per M Tokens); Image Input Price: USD per 1k images at 1MP (1024x1024); Lower is better
Language Input Price
Image Input Price
Language Output Price

Price per token included in the request/message sent to the API, represented as USD per million Tokens.

Price per token generated by the model (received from the API), represented as USD per million Tokens.


Figures represent performance of the model's first-party API (e.g. OpenAI for o1) or the median across providers where a first-party API is not available (e.g. Meta's Llama models).
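To see how the three pricing components combine, here is a cost sketch for a single request. All per-unit prices are hypothetical, not quoted from any provider:

```python
# Combine language input, language output, and image input pricing
# into a single per-request cost. All prices below are illustrative.

def request_cost(
    input_tokens: int,
    output_tokens: int,
    num_images: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
    usd_per_1k_images: float,
) -> float:
    """Total USD cost of one request with text in/out plus image input."""
    return (
        input_tokens / 1_000_000 * usd_per_m_input
        + output_tokens / 1_000_000 * usd_per_m_output
        + num_images / 1_000 * usd_per_1k_images
    )

# 1,000 language tokens in, 500 out, one 1MP image, hypothetical prices:
cost = request_cost(1_000, 500, 1,
                    usd_per_m_input=0.30,
                    usd_per_m_output=1.20,
                    usd_per_1k_images=1.50)
print(round(cost, 6))  # → 0.0024
```

Note that for a request like this, the single image dominates the cost, which is why per-image pricing matters for vision workloads.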

Latency & Speed

Latency (Single Image & 1,000 Language Tokens Input)

Seconds to First Token Received; Lower is better

Latency Variance (Single Image & 1,000 Language Tokens Input)

Seconds to First Token Received; Results by percentile; Lower is better
Median shown; the other points represent the min, 25th percentile, 75th percentile, and max, respectively.
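The percentile points shown (min, 25th, median, 75th, max) can be reproduced from raw TTFT samples with the standard library. The sample latencies below are illustrative, not measured values:

```python
import statistics

def ttft_summary(samples: list[float]) -> dict[str, float]:
    """Min, quartiles, median, and max of TTFT samples, in seconds."""
    q1, med, q3 = statistics.quantiles(samples, n=4)  # 25th/50th/75th
    return {
        "min": min(samples),
        "p25": q1,
        "median": med,
        "p75": q3,
        "max": max(samples),
    }

# Illustrative TTFT measurements (seconds) across repeated requests:
samples = [0.42, 0.45, 0.47, 0.51, 0.55, 0.61, 0.78]
summary = ttft_summary(samples)
```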


Output Speed (Single Image & 1,000 Language Tokens Input)

Output Tokens per Second

Tokens per second received while the model is generating tokens (i.e., after the first chunk has been received from the API, for models that support streaming).

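A sketch of this measurement, assuming output speed is computed over the window after the first chunk arrives; exact token-counting conventions (e.g. whether the first token itself is included) may differ:

```python
def output_tokens_per_second(total_tokens: int, total_seconds: float,
                             ttft_seconds: float) -> float:
    """Generation speed measured after the first chunk arrives,
    i.e. tokens emitted over the post-TTFT streaming window."""
    return total_tokens / (total_seconds - ttft_seconds)

# e.g. 600 tokens over a 5.0 s request with 1.0 s to first token:
speed = output_tokens_per_second(600, 5.0, 1.0)
print(speed)  # → 150.0
```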