Text to Image Benchmarking Methodology
Background and Scope of Benchmarking
Artificial Analysis performs benchmarking on Text to Image models delivered via serverless API endpoints. This page describes our Text to Image benchmarking methodology, including both our quality benchmarking and performance benchmarking.
We consider Text to Image endpoints to be serverless when customers only pay for their usage, not a fixed rate for access to the system (see 'Note on Midjourney' below for context on Midjourney).
We define a standard 'default settings' image, generated at 1024x1024 resolution, across all models and providers. For each model, we use images generated with identical settings, including inference steps, for every key metric to ensure comparability. For example, we benchmark Stable Diffusion XL 1.0 with 30 inference steps: images generated for voting in the Image Arena use the same settings as those used for Generation Time testing and Price per 1,000 Images.
Key Metrics
We use the following metrics to track quality, performance and price for Text to Image models.
Quality ELO: Relative ELO score of the models as determined by >100,000 responses from users in the Artificial Analysis Image Arena.
Some models may not be shown because they do not yet have enough votes. We use a linear regression model similar to the one LMSys uses to calculate ELO scores for Chatbot Arena (a sketch of this approach follows).
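As an illustration, below is a minimal sketch of fitting Elo-style ratings from pairwise votes via logistic regression, similar in spirit to LMSys's published approach for Chatbot Arena. The vote data and model names are hypothetical placeholders, and the constants (scale 400, base 10, initial rating 1000) are conventional Elo parameters rather than our exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each vote: (model_a, model_b, 1 if model_a won, else 0).
# Hypothetical placeholder data.
votes = [
    ("model-x", "model-y", 1),
    ("model-y", "model-z", 0),
    ("model-x", "model-z", 1),
    ("model-y", "model-x", 1),
]

models = sorted({m for a, b, _ in votes for m in (a, b)})
idx = {m: i for i, m in enumerate(models)}

SCALE, BASE, INIT_RATING = 400, 10, 1000
X = np.zeros((len(votes), len(models)))
y = np.zeros(len(votes))
for row, (a, b, a_won) in enumerate(votes):
    X[row, idx[a]] = +np.log(BASE)   # feature for the first model in the pair
    X[row, idx[b]] = -np.log(BASE)   # feature for the second model in the pair
    y[row] = a_won

# Coefficients of the fitted Bradley-Terry model map to Elo-style ratings.
lr = LogisticRegression(fit_intercept=False)
lr.fit(X, y)
ratings = SCALE * lr.coef_[0] + INIT_RATING

for m in models:
    print(f"{m}: {ratings[idx[m]]:.0f}")
```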
Price per 1,000 Images: Provider's price (USD) per image generated, multiplied by 1,000.
For providers that price based on inference time, we estimate pricing from measured inference time across ~100 images multiplied by the price per unit of inference time. This methodology has been applied to Replicate, Fal, and Deepinfra.
For providers that price per inference step, we multiply the price per step by the model's number of inference steps (standardized across providers, as described below). This methodology has been applied to Fireworks, Amazon Bedrock, and Together.ai.
For providers that price based on a subscription plan, we assume 80% utilization when the generation quota is provided on a monthly basis and 70% when it is provided on a daily basis (the sketch below illustrates these conversions).
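The following is a minimal sketch of how these pricing schemes convert to a comparable price per 1,000 images. All numbers in the usage examples are hypothetical placeholders, not real provider prices.

```python
def per_image_priced(price_per_image: float) -> float:
    # Direct per-image pricing: multiply by 1,000.
    return price_per_image * 1000

def inference_time_priced(median_seconds_per_image: float,
                          price_per_second: float) -> float:
    # Estimated from measured inference time across ~100 images.
    return median_seconds_per_image * price_per_second * 1000

def inference_step_priced(price_per_step: float, steps: int) -> float:
    # Steps are standardized per model across providers.
    return price_per_step * steps * 1000

def subscription_priced(monthly_cost: float, images_per_month: int,
                        utilization: float = 0.8) -> float:
    # 80% assumed utilization for monthly quotas (0.7 for daily quotas).
    return monthly_cost / (images_per_month * utilization) * 1000

print(inference_time_priced(3.2, 0.0005))   # hypothetical: $1.60 per 1,000
print(inference_step_priced(0.00013, 30))   # hypothetical: $3.90 per 1,000
print(subscription_priced(30.0, 10000))     # hypothetical: $3.75 per 1,000
```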
Generation Time: Median time the provider takes to generate a single image, calculated over the past 14 days of measurements.
Images are generated at a batch size of 1 where possible; generation at batch sizes larger than 1 is currently out of scope (however, please note the Midjourney details below). Generation Time includes downloading the image where the provider returns a URL rather than an image response. This reflects the end-user latency of receiving a generated image, and accounts for the fact that URLs can be returned before image generation is complete.
Generation Time Testing Methodology
Key technical details (illustrated in the measurement sketch after the list):
- Benchmarking is conducted 4 times daily at random times.
- A unique prompt is used for each generation.
- Watermarks and safety checkers are disabled where possible.
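Below is a minimal sketch of a single generation-time measurement under these settings. The endpoint, request fields, and response field names are hypothetical placeholders, not any specific provider's API; the key point is that timing includes the download when a URL is returned.

```python
import time
import requests

def measure_generation_time(endpoint: str, api_key: str, prompt: str) -> float:
    """Return wall-clock seconds to obtain a finished image, download included."""
    start = time.monotonic()
    resp = requests.post(
        endpoint,
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "prompt": prompt,                   # unique prompt per generation
            "width": 1024,                      # default settings image
            "height": 1024,
            "num_images": 1,                    # batch size of 1 where possible
            "disable_safety_checker": True,     # where the provider allows it
        },
        timeout=120,
    )
    resp.raise_for_status()
    image_url = resp.json()["image_url"]        # hypothetical response field
    # Download the image: URLs can be returned before the image is complete,
    # so timing stops only once the bytes have actually arrived.
    image_bytes = requests.get(image_url, timeout=120).content
    assert len(image_bytes) > 0
    return time.monotonic() - start
```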
Model Inference Steps
The dominant model architecture for Text to Image models is called a diffusion model. Diffusion models for image generation work by denoising an image in a number of steps, known as diffusion steps or inference steps.
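Schematically, sampling starts from pure noise and applies one denoising update per inference step. This is conceptual pseudocode, not any specific model's sampler.

```python
import numpy as np

def sample(denoise_fn, num_inference_steps: int, shape=(1024, 1024, 3)):
    x = np.random.randn(*shape)           # start from Gaussian noise
    for step in reversed(range(num_inference_steps)):
        x = denoise_fn(x, step)           # one denoising update per step
    return x                              # final image estimate
```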
Many open source Text to Image models accept the number of inference steps as an input. To allow for fair comparison between models, for these models we use the default value set by the model creator, or the median of the default step counts across providers.
We use the same number of inference steps for each model across all tests - including generating images for the Image Arena to collect ELO scores, measuring performance and calculating pricing.
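As a concrete illustration, below is a minimal sketch of setting inference steps as a generation input using the Hugging Face diffusers library, mirroring the SDXL example above (30 steps at 1024x1024). It assumes a CUDA GPU and is illustrative rather than our benchmarking harness.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load SDXL 1.0 in half precision on a GPU.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    num_inference_steps=30,   # the same step count used across all our tests
    height=1024,
    width=1024,
).images[0]
image.save("sdxl_30_steps.png")
```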
Only models with an inference steps value are shown below.
Model and Provider Inclusion Criteria
Our objective is to analyze and compare popular and high-performing Text to Image models and providers to support end-users in choosing which to use. As such, we apply an 'industry significance' and competitive performance test to evaluate the inclusion of new models and providers. We are in the process of refining these criteria and welcome any feedback and suggestions. To suggest models or providers, please contact us via the contact page.
Statement of Independence
Benchmarking is conducted with strict independence and objectivity. No compensation is received from any providers for listing or favorable outcomes on Artificial Analysis.
Note on Midjourney
Midjourney does not have a developer API and is only available via fixed-price subscription. This means it does not satisfy our definition of serverless endpoint. We have, however, elected to include Midjourney in our Text to Image benchmarking because of its unique importance to the Text to Image market. As of May 2024, Midjourney leads our quality ELO score results.
There are several important caveats to our Midjourney benchmark results:
- As Midjourney does not have a first-party API, we use ImagineAPI to access Midjourney via API for benchmarking. ImagineAPI connects to Midjourney's Discord channel via a user's Discord account and uses websockets to monitor for completion.
- All prompts submitted to Midjourney return four images by default. This means that Midjourney's Generation Time results should be thought of as analogous to running at a batch size of 4, and care should be taken when making direct comparisons between Midjourney and other providers. We choose to display Midjourney Generation Time results because the figure accurately represents the time to get a Midjourney image result.
- As Midjourney does not provide per-image pricing, we have calculated an equivalent price per 1,000 images based on the maximum number of 'fast generation' images that can be generated under Midjourney's subscription pricing. Our pricing calculation treats each prompt as a single image, ignoring the benefit of returning four images for every prompt. We choose to display Midjourney pricing per 1,000 images to give a realistic view of the cost of Midjourney if used to generate images in a similar way to serverless APIs.
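As a hypothetical worked example of this conversion (the subscription cost and fast-generation quota below are placeholder figures, not Midjourney's actual pricing):

```python
monthly_cost_usd = 30.0           # hypothetical subscription price
max_fast_prompts_per_month = 900  # hypothetical 'fast generation' quota

# Each prompt is treated as a single image, ignoring the 4-image grid.
price_per_1000_images = monthly_cost_usd / max_fast_prompts_per_month * 1000
print(f"${price_per_1000_images:.2f} per 1,000 images")  # ~$33.33
```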