Video Generation Benchmarking Methodology

Scope & Background

Artificial Analysis performs benchmarking on video generation models typically delivered via serverless API endpoints. This page describes the methodology used to measure the quality, speed, and price of video generation models.

For video generation capabilities, we cover six modalities:

  • Text to Video: models that generate video from a text prompt only.
  • Image to Video: models that generate video using a reference image as the first frame and, where supported, an accompanying text prompt.
  • Video Editing: models that generate video from a video input and a text prompt.
  • Text to Video with Audio: Text to Video models that generate synchronized audio alongside the video.
  • Image to Video with Audio: Image to Video models that generate synchronized audio alongside the video.
  • Video Editing with Audio: Video Editing models that generate synchronized audio alongside the video.

Each modality has its own Elo pool in the Video Arena, so silent and audio-enabled outputs are not compared directly, and editing is not compared with generation.

We apply a standard set of default settings to every video we generate, across all models and providers, so that results are comparable. Resolution, frame rate, duration, aspect ratio, and seed are fixed to common values; guidance and inference steps follow each model's first-party API default or open-source repository default. The full default settings are listed in Default Generation Settings below.

Key Metrics

We use the following metrics to track quality, performance and price for video generation models.

Quality Elo

Each model's Elo score reflects its relative quality, derived from user votes in the Artificial Analysis Video Arena. In the arena, users compare two videos generated from the same prompt by different models and select the one they prefer. We aggregate these pairwise votes using Bradley-Terry Maximum Likelihood Estimation and rescale them to an Elo-like range for readability. Ratings are recomputed hourly.
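
As a rough illustration of how pairwise votes can be turned into Elo-like ratings, the sketch below fits Bradley-Terry strengths with a standard iterative (minorization-maximization) update and maps them onto a logarithmic scale. The function name, the 400-point scale, the centering at 1000, and the iteration count are assumptions chosen for illustration; the text above does not specify the constants or the optimizer actually used.

```python
import math
from collections import defaultdict


def bradley_terry_elo(votes, iterations=200, scale=400.0, center=1000.0):
    """Fit Bradley-Terry strengths from (winner, loser) votes.

    Returns {model: rating}. The scale and center are illustrative
    assumptions, not constants taken from the methodology.
    """
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    models = set()
    for winner, loser in votes:
        if winner == loser:
            continue  # a model is never compared with itself
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))
    if not models:
        return {}

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in pair_counts.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            # Floor at a tiny value so a model with zero wins keeps a finite log.
            updated[m] = max(wins[m] / denom, 1e-12) if denom > 0 else strength[m]
        # Normalize so the geometric mean of the strengths is 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}

    return {m: center + scale * math.log10(s) for m, s in strength.items()}


# Example: model A preferred over model B in 3 of 4 votes.
ratings = bradley_terry_elo([("A", "B"), ("A", "B"), ("A", "B"), ("B", "A")])
```

With these assumed constants, a model preferred three times as often as its opponent ends up roughly 191 points higher, since 400 · log10(3) ≈ 190.8.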

Elo is reported separately for each modality (Text to Video, Image to Video, Video Editing, Text to Video with Audio, Image to Video with Audio, Video Editing with Audio), as Arena matchups only pair outputs from the same modality.

Price per Minute of Video

Provider's price (USD) to generate one minute of video at the default generation settings.

Pricing is taken directly from each provider's published per-minute rate.

Generation Time

Median time the provider takes to generate a single video at the default generation settings, calculated over the past 14 days.

Generation Time is measured end-to-end: from the moment the request is submitted to the provider, through any queue time, inference, and video encoding, until the completed video has been downloaded from the provider. For asynchronous APIs (submit → poll → download), this includes all three phases.

For Image to Video models, Generation Time also includes any time required to upload the reference image to the provider. Where a provider accepts a reference-image URL instead, there is no upload step, so no upload time is included.
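
The end-to-end measurement described above can be sketched as a single timer wrapped around the optional upload, submit, poll, and download phases. This is a minimal sketch, not the actual benchmarking harness: the callables `submit`, `get_status`, `download`, and `upload_image`, the status strings, and the timeout value are all hypothetical stand-ins for a provider-specific client.

```python
import time


def measure_generation_time(submit, get_status, download, upload_image=None,
                            poll_interval_s=0.1, timeout_s=1800.0):
    """Return wall-clock seconds from the first call to the finished download.

    All callables are hypothetical provider wrappers:
      upload_image() -> image reference (only when an upload step is required)
      submit(image_ref) -> job id
      get_status(job_id) -> "completed", "failed", or any in-progress value
      download(job_id) -> fetches the finished video
    """
    start = time.perf_counter()

    # Image upload, when required, counts toward Generation Time.
    image_ref = upload_image() if upload_image is not None else None
    job_id = submit(image_ref)

    # Poll until the provider reports a terminal state.
    while True:
        status = get_status(job_id)
        if status == "completed":
            break
        if status == "failed":
            raise RuntimeError(f"generation failed for job {job_id}")
        if time.perf_counter() - start > timeout_s:  # assumed safety limit
            raise TimeoutError(f"job {job_id} exceeded {timeout_s} s")
        time.sleep(poll_interval_s)

    download(job_id)
    return time.perf_counter() - start
```

Using one monotonic clock for the whole sequence means queue wait, inference, encoding, and download are all captured without having to time each phase separately.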

Default Generation Settings

These settings apply to both Text to Video and Image to Video benchmarks unless otherwise specified.

Setting             Value
Resolution          1080p (or closest supported value)
Frame rate          24 FPS (or closest supported value)
Duration            10 seconds (or equivalent number of frames when duration must be specified in frames)
Aspect ratio        16:9 horizontal
Seed                42 (where the model exposes a seed parameter)
Prompt enhancement  On when recommended by the model creator, otherwise off
CFG / guidance      First-party API default, or the open-source repository default
Inference steps     First-party API default, or the open-source repository default

Where a provider does not support the exact default, we use the closest supported value, breaking ties upward. For example, when 24 FPS is not available and the provider offers 18 FPS and 30 FPS, both are 6 FPS away from the default, so we use 30 FPS. The Text to Video with Audio and Image to Video with Audio submodalities benchmark models with audio output enabled; all other defaults apply.
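
The closest-value rule with upward tie-breaking can be expressed compactly. The helper below is an illustrative sketch; the name `closest_supported` is hypothetical, and it assumes the setting can be represented as a single number (for example FPS or vertical resolution).

```python
def closest_supported(target, supported):
    """Pick the supported value nearest to target, preferring the larger on ties."""
    if not supported:
        raise ValueError("provider exposes no supported values")
    # Sort key: smallest distance first, then largest value first.
    return min(supported, key=lambda v: (abs(v - target), -v))


fps = closest_supported(24, [18, 30])            # tie at distance 6, picks 30
height = closest_supported(1080, [480, 720])     # no exact match, picks 720
```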

Text to Video

Text to Video models take a text prompt as their only input. Prompts are drawn from a curated and reproducible prompt set.

Image to Video

Image to Video models take a reference image and, where supported, an accompanying text prompt. Reference images used in our benchmarks follow these standards:

  • Format: PNG (lossless). We do not use JPG/JPEG to avoid the compression artifacts lossy encoding can introduce.
  • Resolution: 1920×1080 (16:9, matching the target video resolution).
  • Byte fallback: Where a provider does not accept URLs, or requires a specific size or format, we download the image, apply the required transformations, and upload the bytes directly.
  • Upload time: Where a provider requires us to upload the image to their platform as a pre-generation step, that upload time is counted toward Generation Time.

Generation Time Testing Methodology

Key technical details:

  • Cadence: Text to Video runs once per day, at a randomized time each day.
  • Sample size: the headline Generation Time is the median of successful measurements from the trailing 14 days. With daily cadence, Text to Video typically accumulates around 14 samples per host per window.
  • Prompt selection: each run draws a single random prompt from a curated database pool and submits it to every active host model.
  • Seed: a fixed seed of 42 is used where the model exposes a seed parameter, so outputs are reproducible across runs.
  • End-to-end timing: Generation Time is measured end-to-end from the first API call to the completed video download. For providers with asynchronous APIs (submit, poll, download), it includes queue wait, inference, and download.
  • API mode: for asynchronous APIs, we poll the job status every 100 milliseconds, so the measured queue wait reflects actual provider time rather than polling granularity.
  • Scope: Generation Time is currently measured for silent Text to Video and Image to Video models only. Video Editing and the three audio modalities (Text to Video with Audio, Image to Video with Audio, Video Editing with Audio) are ranked via the Video Arena (Quality Elo) but are not currently latency-benchmarked.
  • Aggregation: median, p05, p25, p75, and p95 are computed from the raw distribution of successful measurements with no outlier trimming.
  • What is not measured: provider cold-start delay, authentication handshakes, retry overhead, and any client warm-up. Each entry in the dataset is a single one-shot measurement.
  • Infrastructure: runs on Google Cloud Run in us-central1.
  • New endpoints: for newly added endpoints, including those benchmarked ahead of public availability, we may adjust benchmarking cadence to build a distribution of generation times before results are listed. Published results are calculated over the trailing 14 days of measurements, so figures for newly listed endpoints reflect initial runs and may change as further measurements accumulate.
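
The sample-size and aggregation rules above can be illustrated as follows. This is a sketch under stated assumptions: the record layout `(timestamp, seconds, succeeded)` and the function names are hypothetical, and linear interpolation between ranks is one common percentile convention; the text does not state which convention is used.

```python
from datetime import datetime, timedelta, timezone


def percentile(sorted_values, q):
    """Percentile q (0-100) with linear interpolation between ranks (assumed)."""
    if not sorted_values:
        raise ValueError("no measurements in window")
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * frac


def summarize_generation_time(measurements, now, window_days=14):
    """Summarize successful measurements from the trailing window.

    measurements: iterable of (timestamp, seconds, succeeded) tuples.
    Failed runs and runs outside the window are excluded; nothing is trimmed.
    """
    cutoff = now - timedelta(days=window_days)
    values = sorted(
        seconds
        for timestamp, seconds, succeeded in measurements
        if succeeded and cutoff <= timestamp <= now
    )
    quantiles = (("p05", 5), ("p25", 25), ("median", 50), ("p75", 75), ("p95", 95))
    return {name: percentile(values, q) for name, q in quantiles}


# Example: five successful daily runs, one failed run, one run outside the window.
now = datetime(2025, 1, 15, tzinfo=timezone.utc)
runs = [(now - timedelta(days=d), 10.0 * d, True) for d in range(1, 6)]
runs.append((now - timedelta(days=2), 999.0, False))  # failed run, ignored
runs.append((now - timedelta(days=20), 1.0, True))    # too old, ignored
summary = summarize_generation_time(runs, now)
```

Because the median is taken over roughly 14 one-shot samples per window, a single slow run shifts the p95 far more than the headline figure.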

Model & Provider Inclusion Criteria

Our objective is to analyze and compare popular and high-performing video generation models to help users choose between them. As such, we apply tests for industry significance and competitive performance to evaluate the inclusion of new models and providers. We continuously refine these criteria and welcome feedback or suggestions. To suggest models or providers, please contact us via the contact page.

Statement of Independence

Benchmarking is conducted with strict independence and objectivity. No compensation is received from any providers for listing or for favorable outcomes on Artificial Analysis.