Vision Models: LLMs with Image Input Capabilities

Compare multimodal LLMs that support image and text input with the Artificial Analysis Visual Reasoning Index, and weigh performance, pricing, and latency across providers to choose the best image-capable LLM for vision workloads. For further details, see the methodology page.

Highlights

Intelligence: MMMU Pro (intelligence benchmark)
Speed: Output tokens per second; higher is better
Price: USD per 1k images at 1MP (1024x1024)

Summary Analysis

Visual Reasoning vs. Image Input Price

[Chart: Visual Reasoning Intelligence (MMMU Pro evaluation) vs. Image Input Price (USD per 1k images at 1MP, 1024x1024); the most attractive quadrant is highlighted. Models shown: Claude 4 Sonnet (Reasoning), Claude 4.5 Sonnet (Reasoning), Claude Opus 4.5 (Reasoning), Gemini 2.5 Flash (Reasoning), Gemini 2.5 Flash-Lite (Reasoning), GPT-5.1 (Non-reasoning), Grok 4, Llama 4 Maverick, Qwen3 VL 30B A3B (Reasoning), Qwen3 VL 8B (Reasoning).]

Based on the MMMU Pro evaluation of 1.7k questions, this represents the model's ability to interpret and reason over images.

Price for 1,000 images at a resolution of 1 Megapixel (1024 x 1024) processed by the model.

Visual Reasoning vs. Latency (Single Image & 1,000 Language Tokens Input)

[Chart: Visual Reasoning Intelligence (MMMU Pro evaluation) vs. Latency (Time to First Token, seconds); the most attractive quadrant is highlighted. Models shown: Claude 4 Sonnet (Reasoning), Claude 4.5 Sonnet (Reasoning), Claude Opus 4.5 (Reasoning), Gemini 2.5 Flash (Reasoning), Gemini 2.5 Flash-Lite (Reasoning), Gemini 3 Flash Preview (Reasoning), Gemini 3 Pro Preview (high), GPT-5.1 (Non-reasoning), GPT-5.2 Codex (xhigh), Grok 4, Grok 4.1 Fast (Reasoning), Kimi K2.5 (Reasoning), Llama 4 Maverick, Mistral Large 3, Nova 2.0 Pro Preview (medium), Qwen3 VL 30B A3B (Reasoning), Qwen3 VL 8B (Reasoning).]


Time to first token received, in seconds, after the API request is sent. For reasoning models that share reasoning tokens, this is the first reasoning token. For models that do not support streaming, this represents the time to receive the completion.

Intelligence

Visual Reasoning Intelligence (MMMU Pro evaluation)


Multimodal reasoning quality evaluation based on 1.7k questions which require interpreting and reasoning over images.

Image Input Pricing

Image Input Price: USD per 1k images at 1MP (1024x1024)

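Most providers meter image input by converting each image into input tokens, so the per-1k-image figure can be derived from a per-token rate. A minimal sketch of that arithmetic, assuming a hypothetical conversion of 1,600 tokens per 1MP image and a hypothetical input-token rate; neither figure is taken from any provider's published pricing:

```python
# Sketch: deriving "USD per 1k images at 1MP" from per-token pricing.
# Both constants are hypothetical placeholders, not real provider rates.
TOKENS_PER_1MP_IMAGE = 1_600       # assumed image-to-token conversion
INPUT_USD_PER_M_TOKENS = 2.50      # assumed USD per million input tokens

price_per_image = TOKENS_PER_1MP_IMAGE * INPUT_USD_PER_M_TOKENS / 1_000_000
print(f"${price_per_image * 1_000:.2f} per 1k images")  # $4.00 per 1k images
```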

Pricing: Language Input, Image Input and Language Output

[Chart: Language Input Price and Language Output Price in USD per million tokens; Image Input Price in USD per 1k images at 1MP (1024x1024); lower is better.]

Price per token included in the request/message sent to the API, represented as USD per million Tokens.

Price per token generated by the model (received from the API), represented as USD per million Tokens.


Figures represent performance of the model's first-party API (e.g. OpenAI for o1) or the median across providers where a first-party API is not available (e.g. Meta's Llama models).
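
To make the three price components concrete, the sketch below costs this page's benchmark workload (a single 1MP image plus 1,000 language input tokens) together with 500 output tokens. All rates are hypothetical placeholders, not any provider's published pricing:

```python
# Sketch: total cost of one request across the three price components.
# All rates are hypothetical placeholders.
LANG_INPUT_USD_PER_M = 2.50    # USD per million language input tokens
LANG_OUTPUT_USD_PER_M = 10.00  # USD per million language output tokens
IMAGE_USD_PER_1K = 4.00        # USD per 1k images at 1MP

def request_cost(input_tokens: int, output_tokens: int, images: int) -> float:
    """Return the USD cost of a single request."""
    return (
        input_tokens * LANG_INPUT_USD_PER_M / 1_000_000
        + output_tokens * LANG_OUTPUT_USD_PER_M / 1_000_000
        + images * IMAGE_USD_PER_1K / 1_000
    )

print(f"${request_cost(1_000, 500, 1):.4f}")  # $0.0115
```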

Latency & Speed

Latency (Single Image & 1,000 Language Tokens Input)

Seconds to First Token Received; Lower is better

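To illustrate how this is measured, here is a minimal sketch that times time to first token over a streaming request using the OpenAI Python SDK. The model name and image URL are placeholders, the real benchmark pads the language input to 1,000 tokens, and other providers' SDKs differ:

```python
# Sketch: measuring time to first token (TTFT) for a single-image request.
# Model name and image URL are placeholders.
import time

from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o",  # placeholder model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/image-1mp.png"}},
        ],
    }],
    stream=True,
)
for chunk in stream:
    # First streamed content counts as the first token; per the definition
    # above, a shared reasoning token would also qualify.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.perf_counter() - start:.2f}s")
        break
```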

Latency Variance (Single Image & 1,000 Language Tokens Input)

Seconds to First Token Received; Results by percentile; Lower is better
Median shown; the other points represent the min, 25th percentile, 75th percentile, and max, respectively.

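The percentile points can be reproduced from repeated TTFT measurements with Python's standard library; a minimal sketch over illustrative sample data:

```python
# Sketch: summarising repeated TTFT measurements into the percentile
# points plotted above (min, 25th, median, 75th, max). Data is illustrative.
import statistics

ttft_s = [0.48, 0.52, 0.61, 0.55, 0.93, 0.50, 0.58, 0.71]

p25, median, p75 = statistics.quantiles(ttft_s, n=4)  # quartile cut points
print(f"min={min(ttft_s):.2f}s p25={p25:.2f}s median={median:.2f}s "
      f"p75={p75:.2f}s max={max(ttft_s):.2f}s")
```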


Output Speed (Single Image & 1,000 Language Tokens Input)

Output Tokens per Second; Higher is better

Tokens per second received while the model is generating tokens (i.e., after the first chunk has been received from the API, for models that support streaming).

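Since output speed excludes time to first token, it can be estimated from the same kind of stream by timing only the generation phase. A minimal sketch that uses the chunk count as a rough proxy for the token count (an approximation, since a chunk may carry more or fewer than one token; the model name is a placeholder):

```python
# Sketch: estimating output speed (tokens/s) from a streaming response.
# Chunk count is used as a rough proxy for token count.
import time

from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o",  # placeholder model
    messages=[{"role": "user", "content": "Summarise the history of optics."}],
    stream=True,
)

first = last = None
chunks = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        last = time.perf_counter()
        first = first or last
        chunks += 1

if chunks > 1:
    # Divide by the time between first and last chunk, excluding TTFT.
    print(f"{(chunks - 1) / (last - first):.1f} tokens/s (approx.)")
```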