Text to Speech AI Model & Provider Leaderboard

Analysis and comparison of Text to Speech generation models & API providers. Artificial Analysis has analyzed text to speech models and hosting providers across quality, generation time, and price. For further details, see our methodology page.

Text to speech models & providers compared: OpenAI TTS (Standard, HD); Google Cloud TTS (Studio, Journey, Neural2, WaveNet, Standard); Amazon Polly (Long-form, Neural, Standard); Microsoft Azure (Neural); MetaVoice v1; XTTS v2; StyleTTS 2; OpenVoice v2; Cartesia (Sonic English, Oct '24); ElevenLabs (Turbo v2.5, Multilingual v2); and LMNT.

Highlights

Quality ELO
Arena ELO: Average ELO rating of the model, Higher is better
Characters per Second
Characters processed per second: # of characters per second of generation time, Higher is better
Price
Price: USD per 1M characters of text, Lower is better

Summary Analysis

Quality vs. Price

Axes: Arena ELO (average ELO rating of the model) vs. Price (USD per 1M characters of text). Marker size represents characters processed per second of generation time. The chart marks the most attractive quadrant.
Quality ELO: Relative ELO score of the models as determined by responses from users in Artificial Analysis' Quality Arena. Some models may not be shown because they do not yet have enough votes. Note that this is intended to represent quality for generalist use cases (conversational AI assistant, customer support system, or reading an email) and may not be representative of all use cases.
Price: Price per 1M characters of text. For detail on how we calculate price for providers which price based on inference time or subscription plans, see our methodology page.

Quality vs. Speed

Axes: Arena ELO (average ELO rating of the model) vs. Characters processed per second of generation time. Marker size represents price (USD per 1M characters of text). The chart marks the most attractive quadrant.
Quality ELO: Relative ELO score of the models as determined by responses from users in Artificial Analysis' Quality Arena. Some models may not be shown because they do not yet have enough votes. Note that this is intended to represent quality for generalist use cases (conversational AI assistant, customer support system, or reading an email) and may not be representative of all use cases.
Characters per Second: Number of characters processed per second of generation time. Higher values indicate faster generation speeds.

Speed vs. Price

Characters per Second: Number of characters processed per second of generation time. Higher values indicate faster generation speeds.
Price: Price per 1M characters of text. For detail on how we calculate price for providers which price based on inference time or subscription plans, see our methodology page.

Quality Arena ELO (Text to Speech Arena)

Arena ELO: Average ELO rating of the model, Higher is better
Quality ELO: Relative ELO score of the models as determined by responses from users in Artificial Analysis' Quality Arena. Some models may not be shown because they do not yet have enough votes. Note that this is intended to represent quality for generalist use cases (conversational AI assistant, customer support system, or reading an email) and may not be representative of all use cases.
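
To make the idea of a relative ELO score concrete, here is a minimal Python sketch of the textbook Elo update applied to one pairwise vote. The page does not state how the arena computes its ratings, so the update rule, the K-factor of 32, the 400-point scale, and the starting rating of 1000 are all assumptions for illustration, not Artificial Analysis' method.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one pairwise vote.

    k is an assumed K-factor; the arena's real parameters are not published here.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two hypothetical models start level; a listener prefers model A.
    print(update_elo(1000.0, 1000.0, a_won=True))  # (1016.0, 984.0)
```

Because each rating only moves relative to the opponent's rating, the scores are meaningful as differences between models rather than as absolute quality values.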

Arena Win Rate

Arena Win Rate: % Win rate in Text to Speech Arena, Higher is better
Win Rate: Proportion of time an audio clip generated by the model was selected as preferred compared to the other audio clip present in Artificial Analysis' Quality Arena.
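
The win-rate definition above can be sketched in a few lines of Python. The vote format (a list of `(winner, loser)` pairs), the model names, and the absence of ties are assumptions made for this example; the arena's actual data model is not described on this page.

```python
from collections import Counter


def win_rates(votes: list[tuple[str, str]]) -> dict[str, float]:
    """Compute per-model win rate from (winner, loser) pairs.

    Win rate = comparisons won / comparisons the model appeared in.
    """
    wins: Counter = Counter()
    appearances: Counter = Counter()
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return {model: wins[model] / appearances[model] for model in appearances}


if __name__ == "__main__":
    # Hypothetical votes between three placeholder models.
    sample_votes = [("A", "B"), ("A", "C"), ("B", "A"), ("C", "B")]
    for model, rate in sorted(win_rates(sample_votes).items()):
        print(f"{model}: {rate:.0%}")
```

Unlike an ELO score, this figure does not account for the strength of the opponents a model happened to face.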

Participate in the Speech Arena to contribute to the crowdsourced quality evaluations.

Characters Per Second

Characters processed per second: # of characters per second of generation time, Higher is better
Characters per Second: Number of characters processed per second of generation time. Higher values indicate faster generation speeds.
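
As a rough illustration of this metric, the Python sketch below divides the length of the input text by the measured generation time. Counting characters with `len(text)` (including spaces and punctuation) is an assumption; the page does not say exactly how characters are counted or how generation time is measured.

```python
def characters_per_second(text: str, generation_seconds: float) -> float:
    """Input characters divided by wall-clock generation time in seconds."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return len(text) / generation_seconds


if __name__ == "__main__":
    # Hypothetical request: a 500-character prompt synthesized in 2.5 seconds.
    prompt = "x" * 500
    print(characters_per_second(prompt, 2.5))  # 200.0
```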

Speed Factor

Speed factor: Output audio seconds generated per second, Higher is better
Speed Factor: Output audio seconds generated per second of generation time. Higher values indicate faster generation speeds. Characters per second is generally preferred as a benchmark of API generation speed because the number of output audio seconds per character varies by model (e.g. a slower speaking voice).
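
The following Python sketch shows why the two metrics can disagree. All numbers are invented for illustration: two hypothetical voices receive the same 300-character text and take the same 2 seconds to generate, but one speaks more slowly and therefore produces more audio.

```python
def speed_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of output audio produced per second of generation time."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


def characters_per_second(num_characters: int, generation_seconds: float) -> float:
    """Input characters processed per second of generation time."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return num_characters / generation_seconds


if __name__ == "__main__":
    num_characters, generation_seconds = 300, 2.0
    # Hypothetical audio durations for the same text.
    for voice, audio_seconds in (("fast voice", 18.0), ("slow voice", 24.0)):
        cps = characters_per_second(num_characters, generation_seconds)
        sf = speed_factor(audio_seconds, generation_seconds)
        print(f"{voice}: {cps} chars/s, speed factor {sf}")
    # fast voice: 150.0 chars/s, speed factor 9.0
    # slow voice: 150.0 chars/s, speed factor 12.0
```

Both voices handle the same amount of text per second, yet the slower speaker scores a higher speed factor simply because its audio is longer.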

Characters Per Second, Variance

Characters processed per second: # of characters per second of generation time, Results by percentile, Higher is better
Median; other points represent the 5th, 25th, 75th, and 95th percentiles, respectively.
Characters per Second: Number of characters processed per second of generation time. Higher values indicate faster generation speeds.
Boxplot: Shows variance of measurements
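
The five points plotted for each model can be derived from raw measurements roughly as follows. This is a sketch using Python's standard `statistics.quantiles`; the interpolation method (`"inclusive"`) and the sample data are assumptions, since the page does not specify how its percentiles are computed.

```python
from statistics import quantiles


def speed_percentiles(samples: list[float]) -> dict[int, float]:
    """Return the 5th, 25th, 50th, 75th and 95th percentiles of samples."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%;
    # index i - 1 holds the (5 * i)th percentile.
    cuts = quantiles(samples, n=20, method="inclusive")
    return {p: cuts[p // 5 - 1] for p in (5, 25, 50, 75, 95)}


if __name__ == "__main__":
    # Hypothetical characters-per-second measurements: 0.0, 1.0, ..., 100.0.
    measurements = [float(x) for x in range(101)]
    print(speed_percentiles(measurements))
    # {5: 5.0, 25: 25.0, 50: 50.0, 75: 75.0, 95: 95.0}
```

A wide gap between the 5th and 95th percentiles indicates that a provider's generation speed is inconsistent from request to request.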

Characters per Second, Over Time

Characters processed per second: # of characters per second of generation time, Higher is better
Characters per Second: Number of characters processed per second of generation time. Higher values indicate faster generation speeds.
Over time measurement: Median measurement per day, based on 4 measurements each day at different times. Labels represent start of week's measurements.
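
The daily aggregation described above could look like the Python sketch below: measurements are grouped by calendar day and reduced to their median. The timestamps and speeds are invented, and grouping by the timestamp's own date (without time zone handling) is an assumption.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import median


def daily_medians(samples: list[tuple[datetime, float]]) -> dict[date, float]:
    """Group (timestamp, chars_per_second) samples by day and take the median."""
    by_day: dict[date, list[float]] = defaultdict(list)
    for timestamp, value in samples:
        by_day[timestamp.date()].append(value)
    return {day: median(values) for day, values in sorted(by_day.items())}


if __name__ == "__main__":
    # Four hypothetical measurements on one day, taken at different times.
    samples = [
        (datetime(2024, 10, 1, 0, 0), 100.0),
        (datetime(2024, 10, 1, 6, 0), 120.0),
        (datetime(2024, 10, 1, 12, 0), 110.0),
        (datetime(2024, 10, 1, 18, 0), 130.0),
    ]
    print(daily_medians(samples))  # {datetime.date(2024, 10, 1): 115.0}
```

With an even number of samples per day, the median is the mean of the two middle values, which keeps a single unusually slow or fast request from dominating that day's point.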

Price

Price: USD per 1M characters of text, Lower is better
Price: Price per 1M characters of text. For detail on how we calculate price for providers which price based on inference time or subscription plans, see our methodology page.
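
To relate a per-1M-character price to a concrete workload, the Python sketch below estimates the cost of synthesizing a given number of characters. The price of 15 USD per 1M characters is a made-up figure, not a quote for any listed provider, and the sketch does not cover the conversion from inference-time or subscription pricing, which is described on the methodology page.

```python
def synthesis_cost_usd(num_characters: int, price_per_million_usd: float) -> float:
    """Estimated cost in USD for num_characters at a per-1M-character price."""
    if num_characters < 0 or price_per_million_usd < 0:
        raise ValueError("inputs must be non-negative")
    return num_characters * price_per_million_usd / 1_000_000


if __name__ == "__main__":
    # Hypothetical: 20,000 characters at an assumed 15 USD per 1M characters.
    print(synthesis_cost_usd(20_000, 15.0))  # 0.3
```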

Streaming Support

Providers covered:
OpenAI
Google
Amazon
Azure
Replicate
ElevenLabs
LMNT
Streaming Support: Indicates whether the provider supports streaming of audio from their API. We plan to add performance benchmarking of streaming support in the future.
Summary of Key Metrics & Further Information

Model (API provider):
HD, OpenAI TTS (OpenAI)
Multilingual v2, ElevenLabs (ElevenLabs)
Turbo v2.5, ElevenLabs (ElevenLabs)
Standard, OpenAI TTS (OpenAI)
Sonic English (Oct '24), Cartesia (Cartesia)
Neural, Microsoft Azure (Microsoft Azure)
Long-form, Amazon Polly (Amazon Bedrock)
Studio, Google Cloud TTS (Google)
Journey, Google Cloud TTS (Google)
LMNT (LMNT)
OpenVoice v2 (Replicate)
XTTS v2 (Replicate)
StyleTTS 2 (Replicate)
WaveNet, Google Cloud TTS (Google)
Neural, Amazon Polly (Amazon Bedrock)
Standard, Google Cloud TTS (Google)
Neural2, Google Cloud TTS (Google)
MetaVoice v1 (Replicate)
Standard, Amazon Polly (Amazon Bedrock)