Vision Models: LLMs with Image Input Capabilities
Compare multimodal LLMs that support image and text input using the Artificial Analysis Visual Reasoning Index, alongside performance, pricing, and latency across providers, to choose the best image-capable LLM for vision workloads. For further details, see the methodology page.
Summary Analysis
Visual Reasoning vs. Image Input Price
Visual Reasoning: based on the MMMU Pro evaluation of 1.7k questions, this axis represents the model's ability to interpret and reason over images.
Image Input Price: price for 1,000 images at a resolution of 1 megapixel (1024 x 1024) processed by the model.
Visual Reasoning vs. Latency (Single Image & 1,000 Language Tokens Input)
Visual Reasoning: based on the MMMU Pro evaluation of 1.7k questions, this axis represents the model's ability to interpret and reason over images.
Latency: time to first token received, in seconds, after the API request is sent. For reasoning models that share reasoning tokens, this is the first reasoning token. For models that do not support streaming, it represents the time to receive the full completion.
Intelligence
Visual Reasoning Intelligence (MMMU Pro evaluation)
A multimodal reasoning quality evaluation based on 1.7k questions that require interpreting and reasoning over images.
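For context, here is a minimal sketch of the kind of image-plus-text request such an evaluation involves, assuming an OpenAI-compatible Chat Completions API; the model name and image URL are placeholders, not the evaluation's actual setup.

```python
# Minimal sketch: send one image and a text question to a vision-capable
# model via an OpenAI-compatible Chat Completions API. Model name and
# image URL are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # any image-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this diagram show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/figure.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```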
Pricing
Image Input Pricing
Price for 1,000 images at a resolution of 1 megapixel (1024 x 1024) processed by the model.
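As a rough illustration of how a per-image price can relate to per-token rates, the sketch below converts an assumed tokens-per-image count and an assumed per-million-token input price into a 1,000-image price; both inputs are hypothetical, and some providers price images directly rather than via tokens.

```python
# Hypothetical derivation of a 1,000-image price from per-token rates.
# Real models tokenize a 1 MP (1024 x 1024) image into a model-specific
# number of tokens; both values below are assumptions.
tokens_per_image = 1_100        # assumed tokens for one 1 MP image
input_usd_per_m_tokens = 2.50   # assumed USD per million input tokens

usd_per_image = tokens_per_image / 1_000_000 * input_usd_per_m_tokens
print(f"${1_000 * usd_per_image:.2f} per 1,000 images")  # $2.75
```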
Pricing: Language Input, Image Input, and Language Output
Input price: price per token included in the request/message sent to the API, in USD per million tokens.
Output price: price per token generated by the model (received from the API), in USD per million tokens.
Image input price: price for 1,000 images at a resolution of 1 megapixel (1024 x 1024) processed by the model.
Figures represent performance of the model's first-party API (e.g. OpenAI for o1) or the median across providers where a first-party API is not available (e.g. Meta's Llama models).
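As a hypothetical worked example, the sketch below combines the three prices into a single per-request cost; all rates are illustrative, not any model's actual pricing.

```python
# Illustrative per-request cost: 1,000 language input tokens, one 1 MP
# image, and 500 output tokens. All rates are assumptions.
input_usd_per_m = 2.50        # USD per million input tokens (assumed)
output_usd_per_m = 10.00      # USD per million output tokens (assumed)
usd_per_1000_images = 2.75    # USD per 1,000 images at 1 MP (assumed)

input_tokens, output_tokens, images = 1_000, 500, 1

cost = (
    input_tokens / 1_000_000 * input_usd_per_m
    + output_tokens / 1_000_000 * output_usd_per_m
    + images / 1_000 * usd_per_1000_images
)
print(f"${cost:.5f} per request")  # $0.01025
```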
Latency & Speed
Latency (Single Image & 1,000 Language Tokens Input)
Time to first token received, in seconds, after the API request is sent. For reasoning models that share reasoning tokens, this is the first reasoning token. For models that do not support streaming, it represents the time to receive the full completion.
Figures represent performance of the model's first-party API (e.g. OpenAI for o1) or the median across providers where a first-party API is not available (e.g. Meta's Llama models).
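A minimal sketch of how time to first token can be measured over a streaming request, assuming an OpenAI-compatible API; the model name and prompt are placeholders, and a real benchmark would send one image plus roughly 1,000 language tokens as described above.

```python
# Minimal TTFT measurement over a streaming request, assuming an
# OpenAI-compatible API. Model and prompt are placeholders; a real
# benchmark would include an image in the message content.
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "Describe the attached image."}],
    stream=True,
)
for chunk in stream:
    # The first content-bearing chunk marks time to first token; for
    # reasoning models that stream reasoning tokens, the first reasoning
    # token would be the stopping point instead.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"Time to first token: {time.perf_counter() - start:.2f}s")
        break
```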
Latency Variance (Single Image & 1,000 Language Tokens Input)
Time to first token received, in seconds, after the API request is sent. For reasoning models that share reasoning tokens, this is the first reasoning token. For models that do not support streaming, it represents the time to receive the full completion.
Figures represent performance of the model's first-party API (e.g. OpenAI for o1) or the median across providers where a first-party API is not available (e.g. Meta's Llama models).
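A short sketch of summarizing latency variance from repeated measurements; the sample values below are made up, and the percentile summary is one plausible way to express the spread a chart like this shows.

```python
# Summarize spread across repeated TTFT measurements (hypothetical data).
import statistics

ttft_seconds = [0.42, 0.45, 0.39, 0.88, 0.41, 0.47, 1.25, 0.44]

median = statistics.median(ttft_seconds)
p95 = statistics.quantiles(ttft_seconds, n=20)[18]  # 95th percentile
print(f"median: {median:.2f}s, p95: {p95:.2f}s")
```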

Output Speed (Single Image & 1,000 Language Tokens Input)
Tokens per second received while the model is generating tokens (i.e., after the first chunk has been received from the API, for models that support streaming).
Figures represent performance of the model's first-party API (e.g. OpenAI for o1) or the median across providers where a first-party API is not available (e.g. Meta's Llama models).
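A minimal sketch of computing output speed from a streamed response, assuming an OpenAI-compatible API; counting one token per streamed chunk is a simplification, since chunks and tokens do not correspond exactly.

```python
# Approximate output tokens/second from a streamed response, timed from
# the first content chunk onward. Assumes an OpenAI-compatible API and
# one token per chunk (a simplification); model name is a placeholder.
import time
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize streaming APIs."}],
    stream=True,
)

first_chunk_at = None
tokens_after_first = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()  # timing starts here
        else:
            tokens_after_first += 1

if first_chunk_at is not None and tokens_after_first:
    elapsed = time.perf_counter() - first_chunk_at
    print(f"{tokens_after_first / elapsed:.1f} tokens/s")
```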