Video Generation Benchmarking Methodology

Scope & Background

Artificial Analysis performs benchmarking on video generation models delivered via serverless API endpoints. This page describes our video generation benchmarking methodology, including both our quality benchmarking and performance benchmarking. We cover two modalities:

  • Text to Video: models that generate video from a text prompt only.
  • Image to Video: models that generate video from a reference image, and where supported an accompanying text prompt.

Where a model generates video with synchronized audio, we additionally track these outputs as separate submodalities — Text to Video with Audio and Image to Video with Audio — each with their own arena ELO rankings, so that silent and audio-enabled outputs are not compared directly.

We consider video generation endpoints to be serverless when customers only pay for their usage, not a fixed rate for access to the system.

We define a concept of default settings that we apply to every video we generate across all models and providers. We use videos generated with identical settings for each model — including resolution, frame rate, duration, aspect ratio, seed, guidance, and inference steps — to ensure comparability. The full default settings are listed in Default Generation Settings (V1.0) below.

Key Metrics

We use the following metrics to track quality, performance and price for video generation models.

Quality ELO

Relative ELO score of the models as determined by responses from users in the Artificial Analysis Video Arena.

We use Bradley-Terry Maximum Likelihood Estimation to calculate ratings, presented on an ELO-like scale for readability. This is the same methodology used for our Image Arena.

ELO is reported separately for each submodality: Text to Video, Image to Video, Text to Video with Audio, and Image to Video with Audio. Users in the arena only compare outputs within the same submodality.
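The Bradley-Terry fit can be sketched with a standard minorization-maximization (MM) iteration. This is an illustrative sketch, not Artificial Analysis's exact implementation; the `wins` structure and scaling constants are assumptions.

```python
import math

def bradley_terry_elo(wins, n_iter=200, scale=400.0, anchor=1000.0):
    """Fit Bradley-Terry strengths by MM iteration, then map to an Elo-like scale.

    `wins[i][j]` is the number of arena votes preferring model i over model j.
    Assumes every model appears as a top-level key of `wins`.
    Illustrative sketch only, not the production implementation.
    """
    models = sorted(wins)
    pi = {m: 1.0 for m in models}  # initial strengths
    for _ in range(n_iter):
        new_pi = {}
        for i in models:
            w_i = sum(wins[i].get(j, 0) for j in models if j != i)  # total wins of i
            denom = sum(
                (wins[i].get(j, 0) + wins[j].get(i, 0)) / (pi[i] + pi[j])
                for j in models if j != i
            )
            new_pi[i] = w_i / denom if denom > 0 else pi[i]
        # normalize so the strengths do not drift between iterations
        mean = sum(new_pi.values()) / len(new_pi)
        pi = {m: v / mean for m, v in new_pi.items()}
    # Elo-like presentation: a `scale`-point gap corresponds to 10x odds,
    # anchored so a model of average normalized strength sits near `anchor`
    return {m: anchor + scale * math.log10(pi[m]) for m in models}
```

With 75 votes for model A over model B and 25 the other way, the fitted odds are 3:1, which the Elo-like mapping renders as a gap of 400·log10(3) ≈ 191 points.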

Price per Minute of Video

Provider's price (USD) to generate one minute of video at the default generation settings.

For providers that price per minute of generated video, we use the published per-minute rate at the default generation settings. Where a provider publishes a per-second rate, we convert it to an equivalent per-minute rate.

For providers that price by inference time, we estimate pricing from the inference time measured across our benchmark runs multiplied by the provider's price per unit of inference time.

For providers that price per generated video at a fixed duration (e.g., per 5-second clip), we divide the price by the video duration and normalize to an equivalent per-minute rate.

For providers that price via a subscription plan, we assume 80% utilization when generations are allotted on a monthly basis and 70% when they are allotted on a daily basis.
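The normalizations above can be sketched as a single conversion function. The `pricing` descriptor and its field names are hypothetical; the branches mirror the four pricing schemes described in this section.

```python
def price_per_minute(pricing, duration_s=5.0):
    """Normalize provider pricing to USD per minute of generated video.

    `pricing` is a hypothetical descriptor dict (field names are assumptions);
    `duration_s` is the default clip duration. Illustrative sketch only.
    """
    kind = pricing["kind"]
    if kind == "per_minute":
        return pricing["usd_per_minute"]
    if kind == "per_second":
        return pricing["usd_per_second"] * 60.0
    if kind == "per_video":
        # fixed price per clip of `duration_s` seconds
        return pricing["usd_per_video"] / duration_s * 60.0
    if kind == "inference_time":
        # estimated from measured inference time and price per compute-second
        usd_per_generated_second = (
            pricing["usd_per_compute_second"]
            * pricing["compute_seconds_per_video"] / duration_s
        )
        return usd_per_generated_second * 60.0
    if kind == "subscription":
        # 80% utilization assumed for monthly quotas, 70% for daily quotas
        utilization = 0.8 if pricing["quota_period"] == "monthly" else 0.7
        usable_videos = pricing["videos_per_period"] * utilization
        return pricing["usd_per_period"] / (usable_videos * duration_s) * 60.0
    raise ValueError(f"unknown pricing kind: {kind}")
```

For example, a $0.25 flat price per 5-second clip and a $0.05 per-second rate both normalize to $3.00 per minute.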

Generation Time

Median time the provider takes to generate a single video at the default generation settings, calculated over the past 14 days of measurements.

Generation Time is measured end-to-end: from the moment the request is submitted to the provider, through any queue time, inference, and video encoding, to the completed video being downloaded from the provider. For asynchronous APIs (submit → poll → download), this includes all three phases.

For Image to Video models, Generation Time also includes any time required to upload the reference image to the provider. For providers that accept a reference-image URL, there is no upload step and Generation Time excludes image upload.
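The end-to-end clock can be sketched against a hypothetical asynchronous provider client. The `provider` object and its `upload_image`/`submit`/`status`/`download` methods are assumptions for illustration, not any real SDK:

```python
import time

def measure_generation_time(provider, prompt, poll_interval_s=5.0,
                            reference_image_bytes=None):
    """Measure end-to-end Generation Time for an asynchronous video API.

    The clock covers any pre-generation image upload, queue wait,
    inference + encoding, and the final download. `provider` is a
    hypothetical client; real provider SDKs differ. Sketch only.
    """
    start = time.monotonic()
    if reference_image_bytes is not None:
        # Image to Video: a required pre-generation upload counts toward the total
        image_ref = provider.upload_image(reference_image_bytes)
        job = provider.submit(prompt, image=image_ref)
    else:
        job = provider.submit(prompt)
    while provider.status(job) != "completed":  # queue wait + generation phase
        time.sleep(poll_interval_s)
    video_bytes = provider.download(job)        # download phase
    return time.monotonic() - start, video_bytes
```

Using `time.monotonic()` rather than wall-clock time keeps the interval measurement immune to system clock adjustments during long generations.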

Default Generation Settings (V1.0)

These settings apply to both Text to Video and Image to Video benchmarks unless otherwise specified.

  • Resolution: 1080p (or closest supported value)
  • Frame rate: 24 FPS (or closest supported value)
  • Duration: 5 seconds (or the equivalent number of frames when FPS cannot be set)
  • Aspect ratio: 16:9 horizontal
  • Seed: 42
  • Prompt enhancement: Off (unless the provider does not allow disabling it)
  • Negative prompt: Empty string (explicitly set to "" where providers default to a non-empty value)
  • "Turbo" / "Accelerated" mode: Off
  • Watermarks: Off
  • Loop: Off
  • Audio: Off (enabled for with-audio submodalities)
  • Video interpolation: Off
  • Safety input/output checks: Off (where configurable)
  • Compression / quality: Provider default
  • CFG / guidance: Model creator's default; where no creator default is published, the default of the first host to market
  • Inference steps: Model creator's default; where no creator default is published, the default of the first host to market

Where a provider does not support the exact default, we use the closest supported value, breaking ties upward (e.g., when 24 FPS is not available, we prefer 30 FPS over 18 FPS since both are equidistant from the default). The Text to Video with Audio and Image to Video with Audio submodalities benchmark models with audio output enabled; all other defaults apply.
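The closest-supported-value rule with upward tie-breaking amounts to a one-line selection (a sketch; the function name is ours):

```python
def closest_supported(default, supported):
    """Pick the supported value closest to the default, breaking ties upward.

    Sorting key is (distance, -value): equal distances fall through to
    -value, so the larger candidate wins the tie.
    """
    return min(supported, key=lambda v: (abs(v - default), -v))
```

With a 24 FPS default and only 18 and 30 FPS supported, both candidates are 6 FPS away, so the upward tie-break selects 30.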

Text to Video

Text to Video models take a text prompt as their only input. Prompts are drawn from a curated and reproducible prompt set.

Image to Video

Image to Video models take a reference image and, where supported, an accompanying text prompt. Reference images used in our benchmarks follow these standards:

  • Format: PNG (lossless) — we do not use JPG/JPEG to avoid the compression artifacts lossy encoding can introduce.
  • Resolution: 1920×1080 (16:9, matching the target video resolution).
  • Delivery: Served from Amazon S3 through a Cloudflare CDN and passed to the provider as a URL.
  • Byte fallback: Where a provider does not accept URLs, or requires a specific size or format, we download the image, apply the required transformations, and upload the bytes directly.
  • Upload time: Where a provider requires us to upload the image to their platform as a pre-generation step, that upload time is counted toward Generation Time.

Generation Time Testing Methodology

Key technical details:

  • Benchmarking is conducted approximately every 3 days.
  • Generation Time is reported as the median (P50) across the trailing 14 days of measurements.
  • A curated prompt set is used across runs, with a fixed seed of 42, so outputs are reproducible and directly comparable.
  • Our testing infrastructure is hosted on Google Cloud in the us-central region.
  • For providers with asynchronous APIs (submit → poll → download), Generation Time includes queue wait, generation, and download.

Model & Provider Inclusion Criteria

Our objective is to analyze and compare popular and high-performing Text to Video and Image to Video models and providers to support end-users in choosing which to use. As such, we apply an 'industry significance' and competitive performance test to evaluate the inclusion of new models and providers. We are in the process of refining these criteria and welcome any feedback and suggestions. To suggest models or providers, please contact us via the contact page.

Statement of Independence

Benchmarking is conducted with strict independence and objectivity. No compensation is received from any providers for listing or favorable outcomes on Artificial Analysis.