Image Generation Benchmarking Methodology

Scope & Background

Artificial Analysis benchmarks image generation models, which are typically delivered via serverless API endpoints. This page describes the methodology used to measure the quality, speed, and price of these models.

For image generation capabilities, we cover two modalities:

  • Text to Image: models that generate an image from a text prompt only.
  • Image Editing: models that modify a reference image based on a text prompt.

Default Generation Settings

We generate images using each model's published defaults. For first-party APIs, this means the API's documented defaults; for open-source models, the defaults from the open-source repository. This covers inference steps, guidance scale, and any default negative prompt behavior. To enable fair comparison across models, we apply the following normalizations:

  • We generate at the highest resolution each model supports, then downscale to 1024×1024 to serve in our Arenas.
  • We generate 1 image per prompt.
  • We use a 1:1 aspect ratio.
  • We use a seed of 42.
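
The normalizations above can be sketched as a small settings object. This is only an illustration: GenerationSettings, pick_highest_square, and settings_for are hypothetical names, and treating "highest resolution" as the largest 1:1 entry in a model's supported list is an assumption, not a documented rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationSettings:
    """Normalized settings layered on top of a model's published defaults."""
    width: int
    height: int
    num_images: int = 1     # one image per prompt
    seed: int = 42          # fixed seed
    arena_size: int = 1024  # images are downscaled to this size for the Arenas


def pick_highest_square(supported):
    """Return the largest 1:1 (width, height) pair a model supports."""
    squares = [(w, h) for w, h in supported if w == h]
    if not squares:
        # Assumption: how models without a 1:1 option are handled is not stated.
        raise ValueError("model exposes no 1:1 resolution")
    return max(squares)


def settings_for(supported):
    """Build the normalized settings for one model."""
    width, height = pick_highest_square(supported)
    return GenerationSettings(width=width, height=height)
```

Inference steps, guidance scale, and negative prompt behavior are deliberately absent from this sketch, since those stay at each model's defaults.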

For Image Editing models, each prompt is paired with a reference image from a curated set.

Key Metrics

We use the following metrics to track quality, performance, and price for image generation models.

Quality Elo

Each model's Elo score reflects its relative quality, derived from user votes in the Image Arena. We compute ratings using Bradley-Terry Maximum Likelihood Estimation and rescale them to an Elo-like range for readability.

Each modality is scored independently, as Arena matchups only pair outputs from the same modality.
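
As a rough illustration of this kind of rating, the sketch below fits Bradley-Terry strengths with a standard minorization-maximization iteration and maps them onto an Elo-like scale. The function names, the 400-point scale, and the 1000-point centre are assumptions chosen for the example; the exact rescaling constants and fitting procedure used for the published scores are not specified here. The sketch also assumes every model has at least one win and one loss, since otherwise no finite maximum likelihood estimate exists.

```python
import math
from collections import defaultdict


def bradley_terry_strengths(votes, iterations=500, tol=1e-10):
    """Fit Bradley-Terry strengths by maximum likelihood (MM iteration).

    votes: iterable of (winner, loser) model names, one pair per vote.
    Returns {model: strength}, normalized so the geometric mean is 1.
    """
    wins = defaultdict(float)   # total wins per model
    games = defaultdict(float)  # number of comparisons per unordered pair
    for winner, loser in votes:
        wins[winner] += 1.0
        wins[loser] += 0.0      # register the loser even if it never wins
        games[frozenset((winner, loser))] += 1.0
    models = sorted(wins)
    if any(wins[m] == 0 for m in models):
        raise ValueError("a model with no wins has no finite estimate")

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            denom = sum(
                games.get(frozenset((i, j)), 0.0) / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = wins[i] / denom
        # Strengths are only defined up to a constant factor, so pin the scale.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        updated = {m: v / math.exp(log_mean) for m, v in updated.items()}
        change = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if change < tol:
            break
    return strength


def to_elo_scale(strength, scale=400.0, base=1000.0):
    """Map strengths to an Elo-like range (scale and base are assumed values)."""
    return {m: base + scale * math.log10(s) for m, s in strength.items()}
```

With these assumed constants, a 400-point gap corresponds to 10:1 odds that the higher-rated model wins a matchup. Running the fit separately on each modality's votes mirrors the independent scoring described above.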

Price per 1,000 Images

Provider's price (USD) per image generated, multiplied by 1,000.

Pricing is taken directly from each provider's published per-image rate.
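
The conversion is a single multiplication. A minimal sketch follows, with a hypothetical function name and an illustrative rate rather than a real provider price; Decimal is used only to avoid floating-point rounding in the example.

```python
from decimal import Decimal


def price_per_1000_images(price_per_image_usd: str) -> Decimal:
    """Convert a published per-image rate (USD) into a price per 1,000 images."""
    return Decimal(price_per_image_usd) * 1000
```

For example, an illustrative rate of 0.04 USD per image gives 40 USD per 1,000 images.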

Generation Time

Median time the provider takes to generate a single image, calculated over the past 14 days.

Images are generated at a batch size of 1. Where a provider returns a URL rather than the image itself, Generation Time includes the time taken to download the image from that URL. This reflects end-user latency, since a URL can be issued before generation completes.
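
One way to time a single image under this definition is sketched below. The request_image and download callables, and the "url" and "image_bytes" response keys, are hypothetical stand-ins for a provider client, not any real API.

```python
import time


def measure_generation_time(request_image, download):
    """Return seconds from sending the request until image bytes are in hand.

    request_image: callable that sends one request and returns a dict with
        either "image_bytes" (inline response) or "url" (hosted image).
    download: callable that fetches the bytes behind a URL.
    """
    start = time.perf_counter()
    response = request_image()  # blocks until the provider responds
    if response.get("url") is not None:
        # URL response: the download is part of the measured time.
        image_bytes = download(response["url"])
    else:
        image_bytes = response["image_bytes"]
    if not image_bytes:
        raise RuntimeError("provider returned no image data")
    return time.perf_counter() - start
```

Stopping the clock only once the bytes are available keeps URL-based and inline responses comparable.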

Generation Time Testing Methodology

Key technical details:

  • Cadence: benchmarks run 4 times per day at random times.
  • Sample size: the headline Generation Time is the median of successful measurements from the trailing 14 days, typically around 56 samples (4 runs per day × 14 days) per host model per window.
  • What is included: each measurement times the full request lifecycle for one image, covering the API request, provider inference, and either the inline response body or the URL download to disk. Where providers return a URL, the URL is often emitted before the image is fully written, so downloading the bytes is counted to reflect real end-user latency.
  • Unique prompts: a new prompt is generated for each call from a curated pool, so providers cannot return cached responses across runs.
  • Resolution: for Generation Time measurements, we generate at 1024×1024. If a model's minimum supported resolution is higher than 1024×1024, we use its smallest supported resolution.
  • API mode: we use the provider's synchronous endpoint when one is available. For async-only providers, we poll the job status every 100 milliseconds so the measured queue wait reflects actual provider time, not poll granularity.
  • Aggregation: median, p05, p25, p75, and p95 are computed from the raw distribution of successful measurements with no outlier trimming.
  • Disabled provider features: watermarks and safety checks are disabled where the API supports it, to remove sources of latency variance unrelated to generation speed.
  • What is not measured: provider cold-start delay, authentication handshakes, retry overhead, and any client warm-up. Each entry in the dataset is a single one-shot measurement.
  • Infrastructure: runs on Google Cloud in us-central1.
  • New endpoints: for newly added endpoints, including those benchmarked ahead of public availability, we may adjust benchmarking cadence to build a distribution of generation times before results are listed. Published results are calculated over the trailing 14 days of measurements, so figures for newly listed endpoints reflect initial runs and may change as further measurements accumulate.
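
Two of the details above, polling async-only providers and aggregating the trailing window, can be sketched as follows. The get_status callable, the "state" values, the timeout, and the (timestamp, seconds, succeeded) tuple layout are assumptions made for illustration. The percentile interpolation (Python's "inclusive" method) is likewise an assumption, since the interpolation rule is not stated.

```python
import statistics
import time
from datetime import datetime, timedelta, timezone


def poll_until_done(get_status, interval_s=0.1, timeout_s=300.0):
    """Poll a job every interval_s seconds until it succeeds or fails."""
    deadline = time.perf_counter() + timeout_s
    while True:
        status = get_status()
        if status["state"] == "succeeded":
            return status
        if status["state"] == "failed":
            raise RuntimeError("generation job failed")
        if time.perf_counter() >= deadline:
            raise TimeoutError("generation job did not finish in time")
        time.sleep(interval_s)


def summarize(measurements, now, window_days=14):
    """Aggregate successful measurements from the trailing window.

    measurements: iterable of (timestamp, seconds, succeeded) tuples.
    No outlier trimming is applied; failed runs are simply excluded.
    """
    cutoff = now - timedelta(days=window_days)
    samples = [secs for ts, secs, ok in measurements if ok and ts >= cutoff]
    if len(samples) < 2:
        raise ValueError("need at least two successful samples")
    # 99 cut points: index k - 1 holds the k-th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "n": len(samples),
        "p05": cuts[4],
        "p25": cuts[24],
        "median": statistics.median(samples),
        "p75": cuts[74],
        "p95": cuts[94],
    }
```

A short polling interval matters because the measured time can overshoot the true completion time by up to one interval; at 100 milliseconds that error stays small relative to typical generation times.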

Model & Provider Inclusion Criteria

Our objective is to analyze and compare popular and high-performing image generation models to help users choose between them. As such, we apply tests for industry significance and competitive performance to evaluate the inclusion of new models and providers. We continuously refine these criteria and welcome feedback or suggestions. To suggest models or providers, please contact us via the contact page.

Statement of Independence

Benchmarking is conducted with strict independence and objectivity. No compensation is received from any providers for listing or for favorable outcomes on Artificial Analysis.