Text to Speech Benchmarking Methodology

Scope & Background

Artificial Analysis performs benchmarking on Text to Speech models delivered via serverless API endpoints. This page describes our Text to Speech benchmarking methodology, including both our quality benchmarking and performance benchmarking. We consider Text to Speech endpoints to be serverless when customers only pay for usage, not a fixed rate for access.

For both our performance benchmarking and the Speech Arena, our focus is on reflecting the experience of end users of the serverless APIs. We benchmark the time taken to receive the audio file locally. Where the API response is a URL rather than bytes, we include the time taken to download the file in our response time measurement. Our approach is to use the standard implementation of provider APIs as suggested by each provider's documentation. Where the provider's API offers the option, we standardize the audio sample rate to 22.05 kHz.

Key Metrics

We use the following metrics to track quality, performance and price for Text to Speech models.

Quality ELO

Relative ELO score of the models as determined by responses from users in the Artificial Analysis Text to Speech Arena.

Some models may not be shown because they do not yet have enough votes. We calculate scores using a Linear Regression model, similar to how LMSYS calculates ELO scores for Chatbot Arena.
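As a rough illustration of how pairwise arena votes can be turned into relative scores, the sketch below fits a Bradley-Terry model with simple iterative updates and maps the result onto an Elo-like scale. This is an assumption for illustration, not the regression actually used on this page: the function name `fit_ratings`, the 400-point scale, the base of 1000 and the zero-win floor are all hypothetical choices.

```python
import math
from collections import defaultdict


def fit_ratings(votes, iterations=200, scale=400.0, base=1000.0):
    """Fit Bradley-Terry strengths from (winner, loser) votes.

    Returns a dict of Elo-like ratings centred on `base`.
    All parameters are illustrative assumptions.
    """
    models = sorted({m for pair in votes for m in pair})
    wins = defaultdict(float)
    games = defaultdict(float)  # games[(a, b)]: comparisons between a and b
    for winner, loser in votes:
        wins[winner] += 1
        games[(winner, loser)] += 1
        games[(loser, winner)] += 1

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = sum(
                games[(m, o)] / (strength[m] + strength[o])
                for o in models
                if o != m
            )
            # Ad hoc floor (assumption) so a model with zero wins
            # does not collapse to a strength of exactly 0.
            updated[m] = max(wins[m], 0.5) / denom
        # Normalise so the geometric mean of the strengths is 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(models)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}

    return {m: base + scale * math.log10(s) for m, s in strength.items()}


# Hypothetical votes: each tuple is (winner, loser).
votes = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)
ratings = fit_ratings(votes)
```

Only the differences between ratings are meaningful here; the absolute level depends on the chosen base.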

Price per 1M Characters

For providers which do not provide a price per 1M characters, we have estimated pricing based on the following alternative methodologies.

For providers which charge based on inference time, we have estimated pricing based on their inference time using a dataset of ~25 texts of ~500 characters. This methodology has been applied for Replicate & fal.ai.
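The estimate for time-billed providers can be sketched as follows. The function name, the per-second price and the sample values are hypothetical, and pooling all samples before dividing (rather than averaging per-text prices) is an assumption about how the estimate is aggregated.

```python
def inference_price_per_million_chars(samples, price_per_second):
    """Estimate price per 1M characters for a provider billed by inference time.

    samples: list of (num_characters, inference_seconds) pairs, e.g. from
             a dataset of ~25 texts of ~500 characters each.
    price_per_second: the provider's price per second of inference (assumed unit).
    """
    total_chars = sum(chars for chars, _ in samples)
    total_seconds = sum(seconds for _, seconds in samples)
    total_cost = total_seconds * price_per_second
    return total_cost / total_chars * 1_000_000


# Hypothetical measurements: two 500-character texts taking 2 s and 3 s.
example_price = inference_price_per_million_chars(
    [(500, 2.0), (500, 3.0)], price_per_second=0.001
)
```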

For providers which only offer subscription plans, we select the plan priced closest to $300 per month, which is usually representative of 'Scaled' use of the API, and we assume 80% utilization of the characters offered by that plan. For example, if a $300 plan includes 1 million characters, we assume 800,000 characters are used and the price per 1M characters is $300 / (1 × 80%) = $375. This methodology has been applied for ElevenLabs, Cartesia and LMNT.
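The subscription estimate can be expressed as a short calculation. The sketch below is illustrative only: the function names, the dictionary keys and the example plans are hypothetical, while the $300 target and the 80% utilization figure come from the description above.

```python
def select_plan(plans, target_price=300.0):
    """Pick the plan whose monthly price is closest to the target price."""
    return min(plans, key=lambda plan: abs(plan["monthly_price"] - target_price))


def subscription_price_per_million_chars(monthly_price, included_chars, utilization=0.8):
    """Price per 1M characters, assuming only a share of the plan's characters is used."""
    used_chars = included_chars * utilization
    return monthly_price / used_chars * 1_000_000


# Hypothetical plans, not taken from any provider's price list.
plans = [
    {"name": "starter", "monthly_price": 99.0, "included_chars": 200_000},
    {"name": "scale", "monthly_price": 330.0, "included_chars": 2_000_000},
    {"name": "business", "monthly_price": 1320.0, "included_chars": 11_000_000},
]
chosen = select_plan(plans)

# The worked example from the text: a $300 plan with 1 million characters.
worked_example = subscription_price_per_million_chars(300.0, 1_000_000)
```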

Note: when reporting price, we do not include temporary discounts.

Generation Time

Median time the provider takes to generate a single audio clip with ~500 input characters, calculated over the past 14 days of measurements.

Generation Time includes the time taken to download the audio clip from the provider where a URL is provided rather than an audio response. This reflects the end-user latency of receiving a generated audio clip, and accounts for the fact that URLs can be generated before audio generation completes. Audio clips are generated at a batch size of 1 where relevant.
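A minimal sketch of this measurement is shown below. The `synthesize` and `download` callables are hypothetical stand-ins for a provider client and an HTTP download, and treating a string response as a URL is an assumption made for illustration.

```python
import time


def timed_generation(synthesize, download, text):
    """Return (audio_bytes, elapsed_seconds) for one generation request.

    synthesize: callable taking text, returning audio bytes or a URL string.
    download:   callable taking a URL, returning audio bytes.
    The timer covers the download as well, so the result reflects the time
    until the audio is available locally.
    """
    start = time.perf_counter()
    response = synthesize(text)
    if isinstance(response, str):
        # Assumption: a string response is a URL pointing at the audio file.
        audio = download(response)
    else:
        audio = response
    elapsed = time.perf_counter() - start
    return audio, elapsed


# Stand-in callables so the sketch runs without network access.
def fake_synthesize(text):
    return "https://example.com/audio.wav"


def fake_download(url):
    return b"RIFF-fake-audio"


audio, seconds = timed_generation(fake_synthesize, fake_download, "Hello world")
```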

Benchmarking is conducted 4 times daily at random times each day. For each benchmarking evaluation we select a single random voice for each model. A unique prompt of ~500 characters is used for each generation.
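The reported figure is a median over a rolling 14-day window. The sketch below shows one way to compute it; the function name and the (timestamp, seconds) data layout are assumptions, and the sample measurements are made up.

```python
from datetime import datetime, timedelta
from statistics import median


def median_generation_time(measurements, now, window_days=14):
    """Median generation time over the last `window_days` days.

    measurements: list of (timestamp, seconds) pairs.
    Returns None when no measurement falls inside the window.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [seconds for ts, seconds in measurements if cutoff <= ts <= now]
    if not recent:
        return None
    return median(recent)


# Hypothetical measurements; the last one is older than 14 days and is ignored.
now = datetime(2025, 1, 15)
measurements = [
    (datetime(2025, 1, 14), 1.0),
    (datetime(2025, 1, 10), 3.0),
    (datetime(2025, 1, 5), 2.0),
    (datetime(2024, 12, 20), 100.0),
]
result = median_generation_time(measurements, now)
```

Using the median rather than the mean keeps occasional slow responses from dominating the reported figure.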

Model Voices

For each model tested, we test multiple voices to ensure that our comparison between models is representative and fair. Voice characteristics such as accent, gender and style are typically aspects of the voices that each model can generate speech for, not the underlying model. For each model we select 2 voices for each combination of gender (Male, Female) and accent (US, UK), giving 4 combinations and 8 voices in total. Where a gender and accent combination is not available, we exclude this combination from evaluation in the Speech Arena.

Voices are selected for each model based on their prominence in the provider's interface and documentation, excluding voices which are not neutral in nature (e.g. Los Angeles valley and deep southern accents are excluded). Creators of the models may also request we use specific voices where many are available. Where voices are not provided, as is typically the case for open source models, we use voice clips from professional voice actors as source files for generating speech. All voice clips have been licensed for commercial use.
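The selection rule can be sketched as a grouping step. The function name and data layout are hypothetical, and the sketch assumes the input list is already ordered by prominence with non-neutral voices removed. The example input uses the Polly Long-Form voices from the list of voices on this page.

```python
from collections import defaultdict


def select_voices(voices, per_combination=2):
    """Pick up to `per_combination` voices for each (gender, accent) pair.

    voices: list of (name, gender, accent) tuples, assumed to be ordered
            by prominence. Combinations with no voices are left out,
            mirroring their exclusion from evaluation.
    """
    by_combination = defaultdict(list)
    for name, gender, accent in voices:
        by_combination[(gender, accent)].append(name)

    selected = {}
    for gender in ("M", "F"):
        for accent in ("US", "UK"):
            names = by_combination.get((gender, accent), [])
            if names:
                selected[(gender, accent)] = names[:per_combination]
    return selected


polly_long_form = [
    ("Gregory", "M", "US"),
    ("Danielle", "F", "US"),
    ("Ruth", "F", "US"),
]
selection = select_voices(polly_long_form)
```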

Below, we list the voices used for each model.

Model Name | Model Creator | Voices Used (Gender, Accent)
Azure Neural | Microsoft Azure | Andrew Multilingual (M, US), Brian Multilingual (M, US), Ryan (M, UK), Emma Multilingual (F, US), Ava Multilingual (F, US), Sonia (F, UK), Libby (F, UK), Alfie (M, UK)
Chatterbox | Resemble AI | Amy (F, UK), Redd (F, UK), Dave (M, UK), Tom (M, UK), Abbey (F, US), Susan (F, US), Alan (M, US), Michael (M, US)
Chatterbox HD | Resemble AI | Susan (F, US), Redd (F, UK), Abbey (F, US), Alan (M, US), Michael (M, US), Amy (F, UK), Tom (M, UK), Dave (M, UK)
ElevenLabs v3 | ElevenLabs | Jessica (F, US), George (M, UK), Liam (M, US), Lily (F, UK), Eric (M, US), River (F, US), Daniel (M, UK), Alice (F, UK)
Fish Speech 1.5 | Fish Audio | UK Female Redd (F, UK), UK Male Dave (M, UK), US Female Susan (F, US), UK Male Tom (M, UK), US Male Alan (M, US), UK Female Amy (F, UK), US Female Abbey (F, US), US Male Michael (M, US)
Flash v2.5 | ElevenLabs | Jessica (F, US), Eric (M, US), Lily (F, UK), Liam (M, US), Alice (F, UK), Daniel (M, UK), River (F, US), George (M, UK)
GPT-4o mini TTS | OpenAI | Susan (F, US), Redd (F, UK), Abbey (F, US), Alan (M, US), Michael (M, US), Amy (F, UK), Tom (M, UK), Dave (M, UK)
Inworld TTS 1 | Inworld | Ashley (F, US), Mark (M, US), Theodore (M, US), Deborah (F, US), Olivia (F, UK), Ronald (M, UK), Craig (M, UK), Wendy (F, UK)
Journey | Google | en-US-Journey-D (M, US), en-US-Journey-F (F, US)
Kokoro 82M v1.0 | Kokoro | Isabella (F, UK), Fenrir (M, US), Fable (M, UK), George (M, UK), Bella (F, US), Michael (M, US), Emma (F, UK), Aoede (F, US)
LMNT | LMNT | daniel (M, US), terrence (M, US), lily (F, US), chloe (F, US), morgan (F, UK)
Magpie Multilingual | NVIDIA | Leo (M, US), Mia (F, US), Aria (F, US), Jason (M, US)
MetaVoice v1 | MetaVoice | Susan (F, US), Redd (F, UK), Abbey (F, US), Alan (M, US), Michael (M, US), Amy (F, UK), Tom (M, UK), Dave (M, UK)
Multilingual v2 | ElevenLabs | Lily (F, UK), Liam (M, US), Eric (M, US), River (F, US), Jessica (F, US), Daniel (M, UK), George (M, UK), Alice (F, UK)
Murf Speech Gen 2 | Murf AI | Natalie (F, US), Theo (M, UK), Carter (M, US), Phoebe (F, US), Mason (M, UK), Ruby (F, UK), Terrell (M, US), Hazel (F, UK)
Neural2 | Google | en-US-Neural2-I (M, US), en-US-Neural2-A (M, US), en-US-Neural2-H (F, US), en-US-Neural2-C (F, US), en-GB-Neural2-B (M, UK), en-GB-Neural2-D (M, UK), en-GB-Neural2-C (F, UK), en-GB-Neural2-A (F, UK)
Octave 2 | Hume AI | FEMALE MEDITATION GUIDE (F, US), NATURE DOCUMENTARY NARRATOR (M, UK), ALICE BENNETT (F, UK), MALE PROTAGONIST (M, US), LADY ELIZABETH (F, UK), SAD OLD BRITISH MAN (M, UK), DONOVAN SINCLAIR (M, US), SITCOM GIRL (F, US)
Octave TTS | Hume AI | MALE PROTAGONIST (M, US), DONOVAN SINCLAIR (M, US), FEMALE MEDITATION GUIDE (F, US), SITCOM GIRL (F, US), ALICE BENNETT (F, UK), LADY ELIZABETH (F, UK), NATURE DOCUMENTARY NARRATOR (M, UK), SAD OLD BRITISH MAN (M, UK)
OpenAudio S1 | Fish Audio | US Female Susan (F, US), US Male Michael (M, US), US Male Alan (M, US), US Female Abbey (F, US), UK Male Dave (M, UK), UK Female Redd (F, UK), UK Male Tom (M, UK), UK Female Amy (F, UK)
OpenAudio S1 Mini | Fish Audio | Susan (F, US), Redd (F, UK), Abbey (F, US), Alan (M, US), Michael (M, US), Amy (F, UK), Tom (M, UK), Dave (M, UK)
OpenVoice v2 | OpenVoice | Susan (F, US), Redd (F, UK), Abbey (F, US), Alan (M, US), Michael (M, US), Amy (F, UK), Tom (M, UK), Dave (M, UK)
Polly Generative | Amazon | Matthew (M, US), Stephen (M, US), Ruth (F, US), Danielle (F, US)
Polly Long-Form | Amazon | Gregory (M, US), Danielle (F, US), Ruth (F, US)
Polly Neural | Amazon | Emma (F, UK), Joey (M, US), Gregory (M, US), Joanna (F, US), Danielle (F, US), Brian (M, UK), Brian (M, UK), Amy (F, UK)
Polly Standard | Amazon | Joey (M, US), Joanna (F, US), Brian (M, UK), Amy (F, UK)
Qwen3 TTS Flash | Alibaba | Cherry (F, US), Ryan (M, US), Jennifer (F, US), Ethan (M, US)
Simba | Speechify | Patricia (F, US), Robert (M, US), Austin (M, UK), Douglas (M, US), Christina (F, US), Derek (M, UK), Beverly (F, UK), carol (F, UK)
Sonic English (Oct '24) | Cartesia | Nonfiction Man (M, US), Newsman (M, US), Classy British Man (M, UK), Polite Man (M, UK), Helpful Woman (F, US), British Lady (F, UK), Southern Woman (F, US), British Narration Lady (F, UK)
Speech-02-HD | MiniMax | English Powerful Female (F, UK), English Sweet Female (F, US), English Lively Male (M, US), English_Magnetic_Male_2 (M, UK)
Speech-02-Turbo | MiniMax | English Lively Male (M, US), English Sweet Female (F, US), English Magnetic Male (M, UK), English Powerful Female (F, UK)
Standard | Google | en-US-Standard-C (F, US), en-GB-Standard-A (F, UK), en-GB-Standard-D (M, UK), en-GB-Standard-C (F, UK), en-GB-Standard-B (M, UK), en-US-Standard-F (F, US), en-US-Standard-I (M, US), en-US-Standard-A (M, US)
Step TTS Mini | StepFun | Alan (M, US), Susan (F, US), Abbey (F, US), Tom (M, UK), Dave (M, UK), Redd (F, UK), Amy (F, UK), Michael (M, US)
Studio | Google | en-US-Studio-Q (M, US), en-US-Studio-O (F, US), en-GB-Studio-C (F, UK), en-GB-Studio-B (M, UK)
StyleTTS 2 | StyleTTS | Susan (F, US), Redd (F, UK), Abbey (F, US), Alan (M, US), Michael (M, US), Amy (F, UK), Tom (M, UK), Dave (M, UK)
T2A-01-HD | MiniMax | Wise Lady (F, UK), Decent Young Man (M, UK), Wise Scholar (M, UK), Mature Boss (F, US), English Steady Mentor (F, US), Gentle-voiced_man (M, US), English Whimsical Girl (F, US), Female Narrator (F, UK)
T2A-01-Turbo | MiniMax | Anime Character (F, UK), Boss Lady (F, US), English Steady Mentor (M, US), Wise Lady (F, UK), English Whimsical Girl (F, US), English Gentle Voiced Man (M, US), English Wise Scholar (M, UK), Decent Young Man (M, UK)
TTS-1 | OpenAI | echo (M, US), nova (F, US), shimmer (F, US), onyx (M, US), fable (M, UK), alloy (F, US)
TTS-1 HD | OpenAI | nova (F, US), alloy (F, US), onyx (M, US), shimmer (F, US), echo (M, US), fable (M, UK)
Turbo v2.5 | ElevenLabs | River (F, US), Jessica (F, US), Daniel (M, UK), Alice (F, UK), Liam (M, US), Lily (F, UK), George (M, UK), Eric (M, US)
VibeVoice 1.5B | Microsoft Azure | Alice (F, US), Carter (M, US), Frank (M, US), Maya (F, US)
VibeVoice 7B | Microsoft Azure | Alice (F, US), Maya (F, US), Carter (M, US), Frank (M, US)
WaveNet | Google | en-US-Wavenet-I (M, US), en-US-Wavenet-B (M, US), en-US-Wavenet-C (F, US), en-GB-Wavenet-B (M, UK), en-GB-Wavenet-D (M, UK), en-GB-Wavenet-C (F, UK), en-GB-Wavenet-A (F, UK), en-US-Wavenet-F (F, US)
XTTS v2 | Coqui | Susan (F, US), Redd (F, UK), Abbey (F, US), Alan (M, US), Michael (M, US), Amy (F, UK), Tom (M, UK), Dave (M, UK)
Zonos-v0.1 | Zyphra | Michael (M, US), Dave (M, UK), Tom (M, UK), Amy (F, UK), Abbey (F, US), Redd (F, UK), Susan (F, US), Alan (M, US)

Model and Provider Inclusion Criteria

Our objective is to analyze and compare popular and high-performing Text to Speech models and providers to support end-users in choosing which to use. As such, we apply an 'industry significance' and competitive performance test to evaluate the inclusion of new models and providers. We are in the process of refining these criteria and welcome any feedback and suggestions. To suggest models or providers, please contact us via the contact page.

Statement of Independence

Benchmarking is conducted with strict independence and objectivity. No compensation is received from any providers for listing or favorable outcomes on Artificial Analysis.
