Vision Models: LLMs with Image Input Capabilities
Compare multimodal LLMs that accept image and text input using the Artificial Analysis Visual Reasoning Index. Evaluate performance, pricing, and latency across providers to choose the best image-capable LLM for your vision workloads. For further details, see the methodology page.
Intelligence
MMMU Pro (Intelligence benchmark)
Speed
Output Tokens per Second; Higher is better
Price
USD per 1k images at 1MP (1024x1024)
Summary Analysis
Visual Reasoning vs. Image Input Price
Visual Reasoning Intelligence: MMMU Pro evaluation; Image Input Price: USD per 1k images at 1MP (1024x1024)
Most attractive quadrant
Claude 4 Sonnet (Reasoning)
Claude 4.5 Haiku (Reasoning)
Gemini 2.5 Flash (Reasoning)
Gemini 2.5 Flash-Lite (Reasoning)
GPT-5.1 (Non-reasoning)
Llama 4 Maverick
Qwen3 VL 30B A3B (Reasoning)
Qwen3 VL 8B (Reasoning)
Reasoning models are indicated by a lightbulb icon.
Visual Reasoning vs. Latency (Single Image & 1,000 Language Tokens Input)
Visual Reasoning Intelligence: MMMU Pro evaluation; Latency: Seconds to First Token Received
Most attractive quadrant
Claude 4 Sonnet (Reasoning)
Claude 4.5 Haiku (Reasoning)
Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
Gemini 2.5 Flash (Reasoning)
Gemini 2.5 Flash-Lite (Reasoning)
Gemini 3 Flash Preview (Reasoning)
Gemini 3.1 Flash-Lite Preview
Gemini 3.1 Pro Preview
GPT-5.1 (Non-reasoning)
GPT-5.3 Codex (xhigh)
GPT-5.4 (xhigh)
Grok 4.1 Fast (Reasoning)
Grok 4.20 Beta 0309 (Reasoning)
Kimi K2.5 (Reasoning)
Llama 4 Maverick
Mistral Large 3
Nova 2.0 Pro Preview (medium)
Qwen3 VL 30B A3B (Reasoning)
Qwen3 VL 8B (Reasoning)
Qwen3.5 397B A17B (Reasoning)
Intelligence
Visual Reasoning Intelligence (MMMU Pro evaluation)
Pricing
Image Input Pricing
Image Input Price: USD per 1k images at 1MP (1024x1024)
Pricing: Language Input, Image Input and Language Output
Price (USD per M Tokens); Image Input Price: USD per 1k images at 1MP (1024x1024); Lower is better
Language Input Price
Image Input Price
Language Output Price
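As a rough illustration of how the three price units above combine, the sketch below estimates the cost of a single request. The function name, parameter names, and example prices are hypothetical and are not values from this page; it also assumes every image is billed at the 1MP (1024x1024) rate, which may not match how a given provider actually bills.

```python
def estimate_request_cost(
    input_tokens: int,
    output_tokens: int,
    images: int,
    input_price_per_m: float,   # USD per 1M language input tokens
    output_price_per_m: float,  # USD per 1M language output tokens
    image_price_per_1k: float,  # USD per 1k images at 1MP (1024x1024)
) -> float:
    """Estimate the USD cost of one request (illustrative only)."""
    language_input_cost = input_tokens / 1_000_000 * input_price_per_m
    image_input_cost = images / 1_000 * image_price_per_1k
    language_output_cost = output_tokens / 1_000_000 * output_price_per_m
    return language_input_cost + image_input_cost + language_output_cost


if __name__ == "__main__":
    # Hypothetical prices chosen only to show the arithmetic:
    # one image plus 1,000 input tokens, producing 500 output tokens.
    cost = estimate_request_cost(
        input_tokens=1_000,
        output_tokens=500,
        images=1,
        input_price_per_m=2.0,
        output_price_per_m=10.0,
        image_price_per_1k=3.0,
    )
    print(f"Estimated cost: ${cost:.4f}")
```

Multiplying the result by the expected request volume gives a comparable per-workload figure across models, provided the same token and image counts are used for each.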
Latency & Speed
Latency (Single Image & 1,000 Language Tokens Input)
Seconds to First Token Received; Lower is better
Latency Variance (Single Image & 1,000 Language Tokens Input)
Seconds to First Token Received; Results by percentile; Lower is better
Median shown; the other points represent the minimum, 25th percentile, 75th percentile, and maximum, respectively.
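For readers who want to produce a comparable five-point summary from their own measurements, here is a minimal sketch. The function name and sample values are hypothetical, and the "inclusive" quantile method is an assumption; the page does not state how its percentiles are computed.

```python
import statistics


def latency_summary(samples: list[float]) -> dict[str, float]:
    """Summarize time-to-first-token samples (in seconds) as
    min, 25th percentile, median, 75th percentile, and max.

    Requires at least two samples.
    """
    # quantiles(n=4) returns the three quartile cut points.
    p25, median, p75 = statistics.quantiles(samples, n=4, method="inclusive")
    return {
        "min": min(samples),
        "p25": p25,
        "median": median,
        "p75": p75,
        "max": max(samples),
    }


if __name__ == "__main__":
    # Hypothetical measurements, not data from this page.
    measured = [0.4, 0.5, 0.6, 0.7, 0.8]
    for name, value in latency_summary(measured).items():
        print(f"{name}: {value:.2f} s")
```

A wide gap between the median and the maximum suggests occasional slow responses that a median-only comparison would hide.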
Output Speed (Single Image & 1,000 Language Tokens Input)
Output Tokens per Second
