Video Generation Benchmarking Methodology
Scope & Background
Artificial Analysis performs benchmarking on video generation models delivered via serverless API endpoints. This page describes our video generation benchmarking methodology, including both our quality benchmarking and performance benchmarking. We cover two modalities:
- Text to Video: models that generate video from a text prompt only.
- Image to Video: models that generate video from a reference image and, where supported, an accompanying text prompt.
Where a model generates video with synchronized audio, we additionally track these outputs as separate submodalities — Text to Video with Audio and Image to Video with Audio — each with their own arena ELO rankings, so that silent and audio-enabled outputs are not compared directly.
We consider video generation endpoints to be serverless when customers only pay for their usage, not a fixed rate for access to the system.
We define a concept of default settings that we apply to every video we generate across all models and providers. We use videos generated with identical settings for each model — including resolution, frame rate, duration, aspect ratio, seed, guidance, and inference steps — to ensure comparability. The full default settings are listed in Default Generation Settings (V1.0) below.
Key Metrics
We use the following metrics to track quality, performance and price for video generation models.
Quality ELO
Relative ELO score of the models as determined by responses from users in the Artificial Analysis Video Arena.
We use Bradley-Terry Maximum Likelihood Estimation to calculate ratings, presented on an ELO-like scale for readability. This is the same methodology used for our Image Arena.
ELO is reported separately for each submodality: Text to Video, Image to Video, Text to Video with Audio, and Image to Video with Audio. Users in the arena only compare outputs within the same submodality.
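The Bradley-Terry fit can be sketched as follows. This is a minimal illustration using the standard minorization-maximization (MM) update on hypothetical pairwise vote counts, not our production implementation; the model names, counts, and 1000-point offset are assumptions for the example.

```python
import math

# Hypothetical arena results: wins[i][j] = number of votes where
# model i beat model j. All names and counts are illustrative only.
models = ["model_a", "model_b", "model_c"]
wins = [
    [0, 30, 45],
    [20, 0, 35],
    [10, 15, 0],
]

def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the classic MM update."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins of model i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(w_i / denom if denom else p[i])
        # Normalize so the geometric mean is 1 (fixes the scale)
        g = math.exp(sum(math.log(x) for x in new_p) / n)
        p = [x / g for x in new_p]
    return p

strengths = fit_bradley_terry(wins)
# Present on an ELO-like scale: 400 * log10(strength), offset to ~1000
elo = {m: 1000 + 400 * math.log10(s) for m, s in zip(models, strengths)}
```

The MM update converges to the maximum-likelihood strengths; the log-scale conversion is only for readability and does not change the ranking.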
Price per Minute of Video
Provider's price (USD) to generate one minute of video at the default generation settings.
For providers which price per minute of generated video, we use their published per-minute rate at the default generation settings. For providers which publish a per-second rate, we convert it to an equivalent per-minute rate.
For providers which price based on inference time, we estimate pricing by multiplying the inference time observed across our benchmark runs by the provider's price per unit of inference time.
For providers which price per generated video at a fixed duration (e.g., per 5-second clip), we divide the price by the video duration and normalize to an equivalent per-minute rate.
For providers which price based on a subscription plan, we assume 80% utilization when generations are provided on a monthly basis and 70% when generations are provided on a daily basis.
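The normalization rules above reduce to simple arithmetic. The sketch below works through each case with hypothetical rates and plan figures (all numbers are made up for the example):

```python
# Hypothetical pricing inputs; every rate and plan figure is illustrative.

# Per-second rate -> per-minute
per_second = 0.05                      # USD per second of generated video
per_minute = per_second * 60           # 3.00 USD / minute

# Fixed-duration clip pricing -> per-minute (e.g. $0.40 per 5-second clip)
clip_price, clip_seconds = 0.40, 5
clip_per_minute = clip_price / clip_seconds * 60

# Inference-time pricing: observed inference time x price per compute-second,
# normalized by the length of video produced
infer_seconds = 90                     # observed inference time per 5 s clip
price_per_compute_second = 0.01
video_minutes = clip_seconds / 60
infer_per_minute = infer_seconds * price_per_compute_second / video_minutes

# Subscription plan with a monthly quota, assumed 80% utilized
monthly_fee, monthly_clips = 100.0, 500   # 5-second clips per month
used_minutes = monthly_clips * 0.80 * clip_seconds / 60
sub_per_minute = monthly_fee / used_minutes
```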
Generation Time
Median time the provider takes to generate a single video at the default generation settings, calculated over the past 14 days of measurements.
Generation Time is measured end-to-end: from the moment the request is submitted to the provider, through any queue time, inference, and video encoding, to the download of the completed video. For asynchronous APIs (submit → poll → download), this includes all three phases.
For Image to Video models, Generation Time also includes any time required to upload the reference image to the provider. For providers that accept a reference-image URL, there is no upload step and Generation Time excludes image upload.
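A simplified measurement loop for an asynchronous API might look like the sketch below. `submit_job`, `poll_status`, and `download_video` are hypothetical stand-ins for a provider SDK, stubbed here so the example runs:

```python
import time

# Stubbed provider calls (hypothetical; a real SDK would do network I/O).
def submit_job(prompt):
    return "job-123"

def poll_status(job_id):
    return "completed"

def download_video(job_id):
    return b"\x00" * 1024  # fake video bytes

def measure_generation_time(prompt, poll_interval=0.01):
    start = time.monotonic()           # clock starts at request submission
    job_id = submit_job(prompt)        # includes any reference-image upload
    while poll_status(job_id) != "completed":
        time.sleep(poll_interval)      # covers queue wait + inference + encoding
    video = download_video(job_id)     # download counts toward the total
    return time.monotonic() - start, video

elapsed, video = measure_generation_time("a red fox running through snow")
```

The key point is that the timer brackets all three phases, so queue delays and download time are reflected in the reported median rather than inference time alone.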
Default Generation Settings (V1.0)
These settings apply to both Text to Video and Image to Video benchmarks unless otherwise specified.
| Setting | Value |
|---|---|
| Resolution | 1080p (or closest supported value) |
| Frame rate | 24 FPS (or closest supported value) |
| Duration | 5 seconds (or equivalent number of frames when FPS cannot be set) |
| Aspect ratio | 16:9 horizontal |
| Seed | 42 |
| Prompt enhancement | Off (unless the provider does not allow disabling it) |
| Negative prompt | Empty string (explicitly set to "" where providers default to a non-empty value) |
| "Turbo" / "Accelerated" mode | Off |
| Watermarks | Off |
| Loop | Off |
| Audio | Off (enabled for with-audio submodalities) |
| Video interpolation | Off |
| Safety input/output checks | Off (where configurable) |
| Compression / quality | Provider default |
| CFG / guidance | Model creator's default; where no creator default is published, the value used by the first host to market |
| Inference steps | Model creator's default; where no creator default is published, the value used by the first host to market |
Where a provider does not support the exact default, we use the closest supported value, breaking ties upward (e.g., when 24 FPS is not available, we prefer 30 FPS over 18 FPS since both are equidistant from the default). The Text to Video with Audio and Image to Video with Audio submodalities benchmark models with audio output enabled; all other defaults apply.
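The closest-value rule with upward tie-breaking can be expressed compactly. This is an illustrative sketch, not our production code:

```python
def closest_supported(default, supported):
    """Pick the supported value nearest the default, breaking ties upward."""
    # Sort key: smallest distance first; for equal distances, the
    # negated value prefers the higher option (upward tie-break).
    return min(supported, key=lambda v: (abs(v - default), -v))

# 18 and 30 are equidistant from 24, so the upward tie-break picks 30.
closest_supported(24, [18, 30, 60])
# An exact match always wins.
closest_supported(24, [24, 25])
```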
Text to Video
Text to Video models take a text prompt as their only input. Prompts are drawn from a curated and reproducible prompt set.
Image to Video
Image to Video models take a reference image and, where supported, an accompanying text prompt. Reference images used in our benchmarks follow these standards:
- Format: PNG (lossless) — we do not use JPG/JPEG to avoid the compression artifacts lossy encoding can introduce.
- Resolution: 1920×1080 (16:9, matching the target video resolution).
- Delivery: Served from Amazon S3 through a Cloudflare CDN and passed to the provider as a URL.
- Byte fallback: Where a provider does not accept URLs, or requires a specific size or format, we download the image, apply the required transformations, and upload the bytes directly.
- Upload time: Where a provider requires us to upload the image to their platform as a pre-generation step, that upload time is counted toward Generation Time.
Generation Time Testing Methodology
Key technical details:
- Benchmarking is conducted approximately every 3 days.
- Generation Time is reported as the median (P50) across the trailing 14 days of measurements.
- A curated prompt set is used across runs, with a fixed seed of 42, so outputs are reproducible and directly comparable.
- Our testing infrastructure is hosted on Google Cloud in the us-central region.
- For providers with asynchronous APIs (submit → poll → download), Generation Time includes queue wait, generation, and download.
Model & Provider Inclusion Criteria
Our objective is to analyze and compare popular and high-performing Text to Video and Image to Video models and providers to support end-users in choosing which to use. As such, we apply an 'industry significance' and competitive performance test to evaluate the inclusion of new models and providers. We are in the process of refining these criteria and welcome any feedback and suggestions. To suggest models or providers, please contact us via the contact page.
Statement of Independence
Benchmarking is conducted with strict independence and objectivity. No compensation is received from any providers for listing or favorable outcomes on Artificial Analysis.