Vision Models: LLMs with Image Input Capabilities
Compare multimodal LLMs that accept image and text input using the Artificial Analysis Visual Reasoning Index. Weigh performance, pricing, and latency across providers to choose the best image-capable LLM for vision workloads. For further details, see the methodology page.
Intelligence: MMMU Pro (Intelligence benchmark)
Speed: Output Tokens per Second; Higher is better
Price: USD per 1k images at 1MP (1024x1024)
Summary Analysis
Visual Reasoning vs. Image Input Price
Visual Reasoning Intelligence: MMMU Pro evaluation; Image Input Price: USD per 1k images at 1MP (1024x1024)
Most attractive quadrant
Claude 4.5 Haiku
Claude Opus 4.6 (max)
Claude Sonnet 4.6 (max)
Gemini 3 Flash
Gemini 3.1 Flash-Lite Preview
GPT-5.4 mini (xhigh)
GPT-5.4 nano (xhigh)
Grok 4.20 Beta 0309
Llama 4 Maverick
Mistral Large 3
Mistral Small 4
Nova 2.0 Pro Preview (medium)
Qwen3.5 122B A10B
Qwen3.5 27B
Qwen3.5 35B A3B
Qwen3.5 397B A17B
Reasoning models are indicated by a lightbulb icon.
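As a rough illustration of what a "most attractive quadrant" means on a score-versus-price chart, the sketch below keeps the models that score at or above the median while costing at or below the median. The median cut-offs are an assumption, since the page does not say how the quadrant boundaries are drawn, and the model names and numbers are placeholders rather than values from the charts.

```python
from statistics import median


def most_attractive_quadrant(models):
    """Return names of models with high score and low price.

    `models` maps a model name to a (score, price) pair.
    Assumption: the quadrant is bounded by the median score and
    the median price; the real chart may use other cut-offs.
    """
    score_cut = median(score for score, _ in models.values())
    price_cut = median(price for _, price in models.values())
    return sorted(
        name
        for name, (score, price) in models.items()
        if score >= score_cut and price <= price_cut
    )


# Placeholder data for illustration only (not real benchmark results).
placeholder_models = {
    "model-a": (70.0, 1.0),
    "model-b": (60.0, 4.0),
    "model-c": (80.0, 5.0),
    "model-d": (50.0, 0.5),
}

attractive = most_attractive_quadrant(placeholder_models)
print(attractive)
```

The same function could be reused for the latency chart by passing time to first token in place of price, since lower is better on both axes.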
Visual Reasoning vs. Latency (Single Image & 1,000 Language Tokens Input)
Visual Reasoning Intelligence: MMMU Pro evaluation; Latency: Time to First Token
Most attractive quadrant
Claude 4.5 Haiku
Claude Opus 4.6 (max)
Claude Sonnet 4.6 (max)
Gemini 3 Flash
Gemini 3.1 Flash-Lite Preview
Gemini 3.1 Pro Preview
GLM-4.6V
GPT-5.4 (xhigh)
GPT-5.4 mini (xhigh)
GPT-5.4 nano (xhigh)
Grok 4.20 Beta 0309
Kimi K2.5
Llama 4 Maverick
Mistral Large 3
Mistral Small 4
Nova 2.0 Pro Preview (medium)
Qwen3.5 122B A10B
Qwen3.5 27B
Qwen3.5 2B
Qwen3.5 35B A3B
Qwen3.5 397B A17B
Qwen3.5 4B
Qwen3.5 9B
Intelligence
Visual Reasoning Intelligence (MMMU Pro evaluation)
Pricing
Image Input Pricing
Image Input Price: USD per 1k images at 1MP (1024x1024)
Pricing: Language Input, Image Input and Language Output
Price (USD per M Tokens); Image Input Price: USD per 1k images at 1MP (1024x1024); Lower is better
Language Input Price
Image Input Price
Language Output Price
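To show how the three price components combine, here is a minimal sketch that estimates the cost of a single request from an image price quoted per 1k images and language prices quoted per million tokens. The function name, parameter names, and example prices are hypothetical, and the sketch assumes every image is billed at the 1MP (1024x1024) rate; providers may price other resolutions differently.

```python
def estimate_request_cost_usd(
    num_images: int,
    input_tokens: int,
    output_tokens: int,
    image_price_per_1k: float,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Estimate the USD cost of one request.

    Assumption: each image is billed at the flat 1MP rate, and
    language tokens are billed linearly per million tokens.
    """
    image_cost = num_images / 1_000 * image_price_per_1k
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return image_cost + input_cost + output_cost


# Hypothetical prices for illustration only (not taken from the charts).
cost = estimate_request_cost_usd(
    num_images=1,
    input_tokens=1_000,
    output_tokens=500,
    image_price_per_1k=2.0,
    input_price_per_m=1.0,
    output_price_per_m=4.0,
)
print(f"{cost:.4f} USD")
```

For reasoning models, output tokens can dominate the total, so the output price is worth checking alongside the image price.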
Latency & Speed
Latency (Single Image & 1,000 Language Tokens Input)
Seconds to First Token Received; Lower is better
Latency Variance (Single Image & 1,000 Language Tokens Input)
Seconds to First Token Received; Results by percentile; Lower is better
The marked point is the median; the other points represent the minimum, 25th percentile, 75th percentile, and maximum, respectively.
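The five points in a latency variance chart correspond to a five-number summary of time-to-first-token measurements. The sketch below computes one with Python's standard library. The sample values are placeholders, and the choice of the "inclusive" quantile method is an assumption, since the page does not state how percentiles are calculated.

```python
from statistics import quantiles


def latency_summary(samples):
    """Return min, 25th percentile, median, 75th percentile, and max.

    `samples` is a list of time-to-first-token measurements in seconds.
    Assumption: quartiles use the 'inclusive' interpolation method.
    """
    if len(samples) < 2:
        raise ValueError("need at least two latency samples")
    q1, med, q3 = quantiles(samples, n=4, method="inclusive")
    return {
        "min": min(samples),
        "p25": q1,
        "median": med,
        "p75": q3,
        "max": max(samples),
    }


# Placeholder measurements in seconds (not real benchmark data).
summary = latency_summary([0.4, 0.5, 0.6, 0.9, 1.8])
print(summary)
```

A wide gap between the 75th percentile and the maximum suggests occasional slow responses that a median alone would hide.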
Output Speed (Single Image & 1,000 Language Tokens Input)
Output Tokens per Second
