Music Generation Benchmarking Methodology

Scope & Background

Artificial Analysis benchmarks music generation models, which are typically delivered via serverless API endpoints. This page describes how we measure model quality through human preference voting in the Artificial Analysis Music Arena.

We cover two music generation modalities:

Instrumental: models generating music without vocals.
Vocals: models generating music with sung or spoken vocal elements.

We apply a standard set of default settings to every track we generate across all models and providers, so outputs are as comparable as possible. These are listed in Default Generation Settings below.

Quality Elo

Each model's Elo score reflects its relative quality, derived from user votes in the Artificial Analysis Music Arena. We compute ratings using Bradley-Terry Maximum Likelihood Estimation and rescale them to an Elo-like range for readability. This is the same methodology used for our Image and Video Arenas. We report a 95% confidence interval alongside each rating; the interval narrows as a model receives more votes.
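As an illustration of the general technique, the sketch below fits Bradley-Terry strengths from pairwise win counts and maps them onto an Elo-like scale. It is a minimal example, not our production code: the function names, the iterative update used to reach the maximum-likelihood estimate, the scale of 400 and the base of 1000 are all assumptions made for the example, and the confidence interval is not computed here.

```python
import math


def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of votes in which model i was preferred over
    model j (the diagonal is expected to be zero).
    Assumption for this sketch: every model has at least one win and at
    least one comparison, so no strength collapses to zero.
    """
    n = len(wins)
    strengths = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denominator = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denominator)
        # Strengths are only defined up to a common factor, so fix the
        # geometric mean at 1 to keep the numbers stable.
        log_mean = sum(math.log(s) for s in updated) / n
        strengths = [s / math.exp(log_mean) for s in updated]
    return strengths


def to_elo_scale(strengths, scale=400.0, base=1000.0):
    """Map strengths to an Elo-like range (scale and base are illustrative)."""
    return [base + scale * math.log10(s) for s in strengths]


# Hypothetical vote counts for three models.
example_wins = [
    [0, 6, 8],
    [4, 0, 7],
    [2, 3, 0],
]
example_elo = to_elo_scale(fit_bradley_terry(example_wins))
```

With this mapping, the rating gap between two models corresponds to the modeled probability that one is preferred over the other, in the same way as a conventional Elo gap.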

In the Arena, users compare only tracks within the same modality. Elo is reported separately for each modality and broken down further by genre.

Arena Voting

Quality Elo comes entirely from blind, pairwise human preference votes. In each comparison, a user is shown two tracks generated from the same prompt and modality, listens to both, and selects the one they prefer.

Listening requirement: a vote can only be cast after the user has listened to at least 10 seconds of each track, so preferences reflect the overall audio output rather than a first impression.
Blind: model names, logos, and any identifying details are hidden until the vote is cast.
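
The two rules above can be sketched as a small gate on each comparison. The `Comparison` class, its field names and the placeholder labels are hypothetical and exist only for this example; only the 10-second threshold comes from the rules above.

```python
from dataclasses import dataclass

MIN_LISTEN_SECONDS = 10.0  # listening requirement stated above


@dataclass
class Comparison:
    listened_a: float  # seconds of track A the user has played so far
    listened_b: float  # seconds of track B the user has played so far
    model_a: str
    model_b: str


def can_vote(comparison):
    """A vote is allowed only once both tracks have been heard long enough."""
    return (
        comparison.listened_a >= MIN_LISTEN_SECONDS
        and comparison.listened_b >= MIN_LISTEN_SECONDS
    )


def visible_labels(comparison, vote_cast):
    """Model identities stay hidden until the vote has been cast."""
    if vote_cast:
        return (comparison.model_a, comparison.model_b)
    return ("Track A", "Track B")
```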

Audio Normalization

Before tracks enter the Arena, we normalize every track to a uniform loudness and delivery format, so votes reflect musical quality rather than differences in volume or file characteristics.

Loudness: each track is adjusted toward a -16 LUFS integrated loudness target (ITU-R BS.1770) with a single static gain adjustment; boosts are capped at a -1 dBTP true-peak ceiling to prevent clipping, so tracks without enough headroom sit slightly below the target. We do not apply compression, limiting, or EQ, so each track's dynamics and tonal balance are preserved.
Delivery format: every track is re-encoded to 320 kbps MP3 at 44.1 kHz, with provider metadata and embedded artwork removed, regardless of the format returned by the provider. The original channel layout (mono or stereo) is preserved.
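
The loudness rule above reduces to a short gain calculation. The sketch below assumes the integrated loudness and true peak of a track have already been measured by a BS.1770-compliant meter; the function names are hypothetical, and treating a track whose peak already exceeds the ceiling as having zero headroom (so a boost never turns into a cut) is an assumption of this example.

```python
TARGET_LUFS = -16.0   # integrated loudness target
CEILING_DBTP = -1.0   # true-peak ceiling applied to boosts


def static_gain_db(measured_lufs, true_peak_dbtp):
    """Return the single static gain, in dB, to apply to a track."""
    gain = TARGET_LUFS - measured_lufs
    if gain > 0:
        # Boosts are limited by the headroom below the true-peak ceiling.
        # Assumption: negative headroom is treated as zero headroom.
        headroom = max(0.0, CEILING_DBTP - true_peak_dbtp)
        gain = min(gain, headroom)
    return gain


def db_to_linear(gain_db):
    """Convert a gain in dB to the linear factor applied to the samples."""
    return 10.0 ** (gain_db / 20.0)


# A quiet track measured at -22 LUFS with a -3 dBTP peak wants +6 dB,
# but only 2 dB of headroom is available, so it lands at -20 LUFS.
example_gain = static_gain_db(-22.0, -3.0)
```

Because the gain is one constant factor for the whole track, the relative levels within the track are unchanged, which is why dynamics and tonal balance are preserved.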

Leaderboards

We publish a Global Leaderboard and a Personal Leaderboard for each modality.

Global Leaderboard: aggregates all eligible votes across users. A model appears once it has received a minimum number of votes.
Personal Leaderboard: a ranking built from a single user's own votes. It unlocks once they have cast a minimum number of votes, and shows a low-confidence warning until they have cast enough for a reliable ranking.

Both leaderboards can additionally be filtered by genre.
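
A rough sketch of the gating described in this section follows. The vote thresholds are deliberately left as parameters because their values are not stated on this page; the function names and state labels are hypothetical.

```python
def global_leaderboard(ratings, vote_counts, min_votes):
    """List (model, rating) pairs for models with enough votes, best first.

    ratings and vote_counts are dicts keyed by model name; min_votes is a
    placeholder for the minimum vote requirement.
    """
    eligible = {
        model: rating
        for model, rating in ratings.items()
        if vote_counts.get(model, 0) >= min_votes
    }
    return sorted(eligible.items(), key=lambda item: item[1], reverse=True)


def personal_leaderboard_state(votes_cast, unlock_at, reliable_at):
    """Classify a user's Personal Leaderboard by how many votes they have cast."""
    if votes_cast < unlock_at:
        return "locked"
    if votes_cast < reliable_at:
        return "low_confidence"  # shown with a low-confidence warning
    return "reliable"
```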

Prompts & Genres

Prompts are drawn from a curated, reproducible set covering a range of musical styles, supplemented by user-submitted prompts (between 25 and 200 characters) that we review before adding to the rotation.

Every prompt is tagged with one genre, such as Pop, Hip-Hop / Rap, Electronic (EDM), Classical / Orchestral, or Jazz & Blues. Using the same Bradley-Terry method, we compute a per-genre Elo ranking alongside the overall ranking for each modality, and display it once the genre has enough votes to be meaningful.
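
To make the per-genre breakdown concrete, the sketch below checks the length of a user-submitted prompt and groups votes by genre so that each genre's tally could be passed to a Bradley-Terry fit. The function names, the vote tuple layout and the `min_votes` parameter are hypothetical, and treating the 25 and 200 character bounds as inclusive is an assumption.

```python
from collections import defaultdict

MIN_PROMPT_CHARS = 25
MAX_PROMPT_CHARS = 200


def is_valid_prompt_length(prompt):
    """Check the character bounds for user-submitted prompts (assumed inclusive)."""
    return MIN_PROMPT_CHARS <= len(prompt) <= MAX_PROMPT_CHARS


def tally_wins_by_genre(votes):
    """Group votes by genre.

    votes is an iterable of (genre, winner, loser) tuples. The result maps
    each genre to a dict of {(winner, loser): count}.
    """
    tallies = defaultdict(lambda: defaultdict(int))
    for genre, winner, loser in votes:
        tallies[genre][(winner, loser)] += 1
    return {genre: dict(pairs) for genre, pairs in tallies.items()}


def genres_ready(tallies, min_votes):
    """Genres with enough votes to show a ranking (threshold is a placeholder)."""
    return [
        genre
        for genre, pairs in tallies.items()
        if sum(pairs.values()) >= min_votes
    ]
```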

Default Generation Settings

These settings apply to both Instrumental and Vocals benchmarks unless otherwise specified.

Setting	Value
Tracks per prompt	1
Duration	Approximately 3 minutes (or the closest supported length; where a model produces a fixed-length output, we use that length)
Audio format	MP3 where configurable, otherwise the provider default
Sample rate	44.1 kHz where configurable, otherwise the provider default
Seed	42, where the provider exposes a seed parameter
Lyrics	Generated by the model from the prompt (Vocals); not applicable for Instrumental
Vocals	Off for Instrumental; on for Vocals

Music models differ widely in which parameters they expose. Where a provider does not support a default, we use the closest supported value or the provider's default. Settings are held constant for each model across all generations used in the Music Arena.
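
The fallback rule can be sketched as follows. This is an illustration only: the parameter names, the representation of a provider's capabilities as a dict, and the use of 180 seconds for "approximately 3 minutes" are assumptions, and only a subset of the settings in the table is shown.

```python
DEFAULTS = {
    "duration_seconds": 180,   # approximately 3 minutes
    "audio_format": "mp3",
    "sample_rate_hz": 44100,
    "seed": 42,
}


def resolve_settings(supported):
    """Pick the request parameters for one provider.

    supported maps a parameter name to the list of values the provider
    accepts, or to None if any value is accepted. Parameters missing from
    the dict are treated as not exposed by the provider.
    """
    resolved = {}
    for name, default in DEFAULTS.items():
        if name not in supported:
            continue  # not configurable: the provider default applies
        options = supported[name]
        if options is None or default in options:
            resolved[name] = default
        elif isinstance(default, (int, float)):
            # Numeric setting: fall back to the closest supported value.
            resolved[name] = min(options, key=lambda value: abs(value - default))
        # Non-numeric setting with no match: leave unset (provider default).
    return resolved


# Hypothetical provider with fixed duration choices and a free-form seed.
example = resolve_settings({"duration_seconds": [30, 60, 240], "seed": None})
```

Once resolved for a model, the same parameters would be reused for every generation so that the model's tracks remain comparable with one another.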

Instrumental

Instrumental models generate music without vocals. A model is benchmarked in the Instrumental modality only if its endpoint can produce instrumental output, for example through an instrumental flag, or through style or negative-prompt controls that suppress vocals. If an endpoint offers no way to separate instrumental from vocal generation, we benchmark the model in the Vocals modality only.

Vocals

A model is benchmarked in this modality only if it can take a prompt and generate the lyrics and vocals itself. We do not supply lyrics, so the benchmark reflects each model's end-to-end songwriting capability.

Model & Provider Inclusion Criteria

Our objective is to analyze and compare popular and high-performing music generation models to help users choose between them. We therefore assess new models and providers for industry significance and competitive performance before including them. We continuously refine these criteria and welcome feedback or suggestions. To suggest models or providers, please contact us via the contact page.

Statement of Independence

We conduct benchmarking with strict independence and objectivity. We receive no compensation from providers for listing or for favorable outcomes on Artificial Analysis.