Image Generation Benchmarking Methodology

Scope & Background

Artificial Analysis benchmarks image generation models delivered via serverless API endpoints. This page describes our image generation benchmarking methodology, including both quality and performance benchmarking. We cover two submodalities:

Text to Image: models that generate an image from a text prompt only.
Image Editing: models that modify a reference image based on a text prompt.

We consider an image generation endpoint to be serverless when customers pay only for their usage rather than a fixed rate for access to the system.

We define a default-settings image, sized 1024x1024, that we generate across all models and providers. For each model, we use identical settings, including the number of inference steps, in every test to ensure comparability. For example, we benchmark Stable Diffusion XL 1.0 with 30 inference steps: the images generated for voting in the Image Arena use the same settings as Generation Time testing and the Price per 1,000 Images calculation.

Key Metrics

We use the following metrics to track quality, performance and price for image generation models.

Quality Elo

Relative Elo score of the models as determined by millions of responses from users in the Artificial Analysis Image Arena.

We use Bradley-Terry Maximum Likelihood Estimation to calculate ratings, presented on an Elo-like scale for readability. This is the same methodology used for our Video Arena.
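
As an illustration of the general technique, the sketch below fits Bradley-Terry strengths to pairwise votes with the standard minorization-maximization update and maps them onto an Elo-like scale. It is not our production implementation: the function names, the 400-point base-10 scale, the centering at 1000, and the requirement that every model has at least one win are assumptions made for the example.

```python
import math
from collections import defaultdict


def bradley_terry_strengths(votes, iterations=500, tol=1e-10):
    """Fit Bradley-Terry strengths from (winner, loser) votes by maximum likelihood.

    Uses the minorization-maximization (MM) update. Assumes every model has at
    least one win and that all models are linked by comparisons.
    """
    models = sorted({m for vote in votes for m in vote})
    wins = defaultdict(int)   # total wins per model
    games = defaultdict(int)  # number of comparisons per unordered pair
    for winner, loser in votes:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1

    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i and frozenset((i, j)) in games
            )
            updated[i] = wins[i] / denom
        # Strengths are only defined up to a constant factor, so normalize.
        total = sum(updated.values())
        updated = {m: s / total for m, s in updated.items()}
        converged = max(abs(updated[m] - strength[m]) for m in models) < tol
        strength = updated
        if converged:
            break
    return strength


def to_elo_scale(strength, scale=400.0, center=1000.0):
    """Map strengths to an Elo-like scale (scale and center are assumed values)."""
    logs = {m: scale * math.log10(s) for m, s in strength.items()}
    offset = center - sum(logs.values()) / len(logs)
    return {m: value + offset for m, value in logs.items()}
```

With these assumed constants, a model that wins three of four head-to-head votes against another ends up 400 × log10(3), roughly 191 points, ahead of it.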

Elo is reported separately for each submodality: Text to Image and Image Editing. Users in the arena only compare outputs within the same submodality.

Price per 1,000 Images

Provider's price (USD) per image generated, multiplied by 1,000.

For providers that price based on inference time, we estimate the price per image from the provider's inference time, measured across ~100 images, and its price per unit of inference time. This methodology has been applied for Replicate, Fal, and Deepinfra.
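
A minimal sketch of the time-based estimate follows. The function name, the per-second billing unit, and the use of the mean over the sample are assumptions made for illustration; the figures passed in would come from measurements and the provider's price list.

```python
from statistics import mean


def price_per_1k_time_based(inference_seconds, usd_per_second):
    """Estimate USD per 1,000 images for a provider billing by inference time.

    `inference_seconds` is a sample of per-image inference times (about 100
    in practice). Averaging with the mean is an assumption of this sketch.
    """
    if not inference_seconds:
        raise ValueError("at least one inference time is required")
    return mean(inference_seconds) * usd_per_second * 1000
```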

For providers that price based on inference steps, we multiply the price per inference step by the number of inference steps used for the model (standardized across providers; see Model Inference Steps below). This methodology has been applied for Fireworks, Amazon Bedrock, and Together.ai.
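
The step-based calculation can be sketched as below. The function name is hypothetical, and the per-step price in the usage note is a made-up figure rather than any provider's actual rate.

```python
def price_per_1k_step_based(usd_per_step, inference_steps):
    """USD per 1,000 images for a provider billing per inference step."""
    return usd_per_step * inference_steps * 1000
```

For example, at a hypothetical 0.0001 USD per step, Stable Diffusion XL 1.0 at its 30 benchmark steps would work out to about 3 USD per 1,000 images.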

For providers that price based on a subscription plan, we assume 80% utilization when the number of generations provided is on a monthly basis and 70% when generations are provided on a daily basis.
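
One way to read the utilization assumption is that only that share of the plan's allowance is actually used, so the plan price is spread over fewer images. The sketch below follows that reading; the function name, the monthly plan price input, and the 30-day month are assumptions made for the example.

```python
def price_per_1k_subscription(monthly_price_usd, generations, period="monthly",
                              days_per_month=30):
    """Estimate USD per 1,000 images for a subscription plan.

    `generations` is the allowance per period. Utilization is 80% for monthly
    allowances and 70% for daily ones; `days_per_month` is an assumed value.
    """
    if period == "monthly":
        images_used = generations * 0.80
    elif period == "daily":
        images_used = generations * days_per_month * 0.70
    else:
        raise ValueError("period must be 'monthly' or 'daily'")
    return monthly_price_usd / images_used * 1000
```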

Generation Time

Median time the provider takes to generate a single image, calculated over the past 14 days of measurements.
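
As a sketch of the windowed median, assuming measurements are stored as (timestamp, seconds) pairs, which is a hypothetical layout chosen for the example:

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def median_generation_time(measurements, now, window_days=14):
    """Median generation time over the trailing window.

    `measurements` is an iterable of (timestamp, seconds) pairs. Returns None
    when no measurement falls inside the window.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [seconds for taken_at, seconds in measurements
              if cutoff <= taken_at <= now]
    return median(recent) if recent else None
```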

Images are generated at a batch size of 1 where possible. Where a provider returns a URL rather than the image itself, Generation Time includes downloading the image from that URL. This reflects the end-user latency of receiving a generated image, since a URL can be returned before the image has finished generating.
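
The timing boundary can be sketched as follows. `generate` and `download` are hypothetical callables standing in for a provider client, and the response shape (a dict with either an "image" or a "url" key) is an assumption made for the example.

```python
import time


def measure_generation_time(generate, download):
    """Time one generation end to end, including any image download.

    `generate()` returns {"image": bytes} or {"url": str};
    `download(url)` returns the image bytes. Both are hypothetical.
    """
    start = time.perf_counter()
    response = generate()
    if "image" in response:
        image = response["image"]
    else:
        # The clock keeps running until the image bytes are actually received.
        image = download(response["url"])
    elapsed = time.perf_counter() - start
    return elapsed, image
```

Stopping the clock only after the bytes arrive keeps providers that return a URL early comparable with providers that return the image in the response.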

Generation Time Testing Methodology

Key technical details:

Benchmarking is conducted 4 times daily at random times.
A unique prompt is used for each generation.
Watermarks and safety checkers are disabled where possible.
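
A schedule of this kind could be drawn as in the sketch below. Drawing the times uniformly at one-second resolution within each day is an assumption; the function name and parameters are hypothetical.

```python
import random
from datetime import datetime, timedelta


def daily_run_times(day_start, runs=4, rng=None):
    """Pick `runs` distinct random times within the 24 hours after `day_start`."""
    rng = rng or random.Random()
    offsets = sorted(rng.sample(range(24 * 60 * 60), runs))
    return [day_start + timedelta(seconds=offset) for offset in offsets]
```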

Model Inference Steps

The dominant model architecture for Text to Image models is called a diffusion model. Diffusion models for image generation work by denoising an image in a number of steps, known as diffusion steps or inference steps.

Many open source Text to Image models support setting the number of inference steps as an input. To allow for fair comparison, for models that accept the number of inference steps as an input we use the default value set by the model creator, or the median of the default number of steps across providers.
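
The selection rule can be sketched as below. Preferring the creator's default when one exists and rounding a fractional median are assumptions of this sketch, and the function name is hypothetical.

```python
from statistics import median


def benchmark_inference_steps(creator_default, provider_defaults):
    """Choose the step count used for a model in every test.

    Returns the model creator's default when known (assumed priority),
    otherwise the median of the providers' defaults, rounded to an integer.
    """
    if creator_default is not None:
        return creator_default
    if not provider_defaults:
        raise ValueError("no default step counts available")
    return round(median(provider_defaults))
```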

We use the same number of inference steps for each model across all tests, including generating images for the Image Arena to collect Elo scores, measuring performance, and calculating pricing.

Only models with an inference steps value are shown below.

Model Name	Inference Steps (default)
FLUX.1 [dev]	28
FLUX.1 [pro]	28
FLUX.1 [schnell]	4
FLUX1.1 [pro]	50
Lumina Image v2	30
Playground v2.5	50
Playground v3 (beta)	50
SDXL Lightning	4
Stable Diffusion 1.5	50
Stable Diffusion 1.6	50
Stable Diffusion 3 Medium	30
Stable Diffusion 3.5 Large	35
Stable Diffusion 3.5 Large Turbo	4
Stable Diffusion 3.5 Medium	40
Stable Diffusion XL 1.0	30

Model & Provider Inclusion Criteria

Our objective is to analyze and compare popular and high-performing Text to Image models and providers to support end-users in choosing which to use. As such, we apply an 'industry significance' and competitive performance test to evaluate the inclusion of new models and providers. We are in the process of refining these criteria and welcome any feedback and suggestions. To suggest models or providers, please contact us via the contact page.

Statement of Independence

Benchmarking is conducted with strict independence and objectivity. No compensation is received from any providers for listing or favorable outcomes on Artificial Analysis.