Text to Speech AI Model & Provider Leaderboard
Analysis and comparison of Text to Speech generation models & API providers. Artificial Analysis has analyzed text to speech models and hosting providers across quality, generation time, and price. For further details, see our methodology page.
Text to speech models & providers compared: TTS-1, TTS-1 HD, Studio, Journey, Neural2, WaveNet, Standard, Polly Long-Form, Polly Neural, Polly Standard, Azure Neural, MetaVoice v1, XTTS v2, StyleTTS 2, OpenVoice v2, Sonic English (Oct '24), 3.0 mini, Turbo v2.5, Multilingual v2, T2A-01-HD, T2A-01-Turbo, Zonos-v0.1, Kokoro 82M v1.0, Polly Generative, Flash v2.5, Dialog, Murf Speech Gen 2, and Step TTS Mini.
Highlights
Quality ELO
Arena ELO: Average ELO rating of the model, Higher is better
Characters per Second
Characters processed per second: # of characters per second of generation time, Higher is better
Price
Price: USD per 1M characters of text, Lower is better
Summary Analysis
Quality vs. Price
Arena ELO: Average ELO rating of the model, Price: USD per 1M characters of text
Chart: scatter plot of Arena ELO against price, with the most attractive quadrant marked. Bubble size represents characters processed per second of generation time.
Entries shown: TTS-1; TTS-1 HD; Studio; WaveNet; Polly Long-Form; Azure Neural; MetaVoice v1 (Replicate); OpenVoice v2 (Replicate); Sonic English (Oct '24) (Cartesia); Multilingual v2; Zonos-v0.1 (Zyphra); Kokoro 82M v1.0 (Replicate); Flash v2.5; Dialog; Step TTS Mini; PlayAI Dialog 1.0 (Groq).
Quality ELO: Relative ELO score of the models as determined by responses from users in Artificial Analysis' Quality Arena. Some models may not be shown due to not yet having enough votes. Note that this is intended to represent quality of generalist use cases (conversational AI assistant, customer support system or reading an email) and may not be representative of all use cases.
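As a rough illustration of what a rating gap means, the sketch below uses the conventional Elo expected-score formula. The 400-point scale and the function name are assumptions; the exact rating procedure used for the Quality Arena is not described here.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of head-to-head wins for A over B under the
    conventional Elo model (assumed here, with the usual 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Ratings taken from the summary table: TTS-1 HD (1152) vs. TTS-1 (1138).
    print(round(expected_score(1152, 1138), 3))
```

Under this assumed model, a gap of a few points corresponds to a near coin-flip preference, while a gap of several hundred points corresponds to a strong preference.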
Price: Price per 1M characters of text. For detail on how we calculate price for providers which price based on inference time or subscription plans, see our methodology page.
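The per-million-character price converts to a job cost with simple arithmetic. A minimal sketch follows; the function name is illustrative, and it assumes flat per-character pricing with no minimums or tiers.

```python
def synthesis_cost_usd(num_characters: int, price_per_million_usd: float) -> float:
    """Cost of synthesizing num_characters at a flat USD price per 1M characters."""
    if num_characters < 0:
        raise ValueError("num_characters must be non-negative")
    return num_characters / 1_000_000 * price_per_million_usd


if __name__ == "__main__":
    # Prices from the summary table: TTS-1 at $15.00 and Kokoro 82M v1.0 at $0.65.
    for model, price in [("TTS-1", 15.00), ("Kokoro 82M v1.0", 0.65)]:
        print(model, round(synthesis_cost_usd(250_000, price), 4))
```

For example, 250,000 characters cost $3.75 at $15.00 per 1M characters and about $0.16 at $0.65 per 1M characters.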
Quality vs. Speed
Arena ELO: Average ELO rating of the model, Characters processed per second: # of characters per second of generation time
Chart: scatter plot of Arena ELO against characters processed per second, with the most attractive quadrant marked. Bubble size represents price in USD per 1M characters of text. The legend lists the same 16 model and provider entries as the Quality vs. Price chart.
Characters per Second: Number of characters processed per second of generation time. Higher values indicate faster generation speeds.
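The metric can be reproduced in outline by timing a synthesis call and dividing the input length by the elapsed time. In this sketch, `synthesize` is a hypothetical blocking function standing in for any provider's API, and the injectable `clock` parameter is an assumption added to make the timing testable.

```python
import time
from typing import Callable


def measure_characters_per_second(
    synthesize: Callable[[str], bytes],
    text: str,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Characters of input text divided by seconds of generation time."""
    start = clock()
    synthesize(text)  # hypothetical call; assumed to block until audio is complete
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("generation time must be positive")
    return len(text) / elapsed


def estimated_generation_seconds(num_characters: int, cps: float) -> float:
    """Invert the metric: expected generation time at a given characters/second."""
    return num_characters / cps
```

At the 550.0 characters per second listed for TTS-1, a 1,100-character passage would take roughly 2 seconds to generate.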
Speed vs. Price
Characters processed per second: # of characters per second of generation time, Price: USD per 1M characters of text
Chart: scatter plot of characters processed per second against price, with the most attractive quadrant marked. The legend lists the same 16 model and provider entries as the Quality vs. Price chart.
Quality
Quality Arena ELO (Text to Speech Arena)
Arena ELO: Average ELO rating of the model, Higher is better
Arena Win Rate
Arena Win Rate: % Win rate in Text to Speech Arena, Higher is better
Win Rate: Proportion of time an audio clip generated by the model was selected as preferred compared to the other audio clip present in Artificial Analysis' Quality Arena.
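The win-rate definition above can be sketched as a tally over pairwise votes. The `(winner, loser)` vote format is an assumption made for illustration; the arena's actual data format is not described here.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def win_rates(votes: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Share of comparisons each model won.

    Each vote is an assumed (winner, loser) pair from one head-to-head
    comparison of two audio clips.
    """
    wins: Counter = Counter()
    appearances: Counter = Counter()
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return {model: wins[model] / appearances[model] for model in appearances}


if __name__ == "__main__":
    sample = [("A", "B"), ("A", "C"), ("B", "A")]  # made-up votes
    print(win_rates(sample))
```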
Participate in the Speech Arena to contribute to the crowdsourced quality evaluations
Speed
Characters Per Second
Characters processed per second: # of characters per second of generation time, Higher is better
Characters Per Second, Variance
Characters processed per second: # of characters per second of generation time, Results by percentile, Higher is better
Median; other points represent the 5th, 25th, 75th, and 95th percentiles, respectively
Boxplot: Shows variance of measurements
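The five percentile points used in the boxplot can be computed from raw speed samples with the standard library. This is a sketch only: the interpolation method (`inclusive`) is an assumption, and the sample values are made up.

```python
import statistics
from typing import Dict, Sequence


def boxplot_percentiles(samples: Sequence[float]) -> Dict[int, float]:
    """5th, 25th, 50th (median), 75th, and 95th percentiles of the samples."""
    # quantiles(n=100) returns 99 cut points; cut point p - 1 is the p-th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in (5, 25, 50, 75, 95)}


if __name__ == "__main__":
    made_up_samples = [310.0, 325.5, 338.2, 341.1, 344.9, 351.0, 360.3, 372.8]
    print(boxplot_percentiles(made_up_samples))
```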

Characters per Second, Over Time
Characters processed per second: # of characters per second of generation time, Higher is better
Over time measurement: Median measurement per day, based on 4 measurements each day at different times. Labels mark the start of each week's measurements.
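The daily aggregation described above can be sketched as grouping timestamped measurements by calendar day and taking the median of each group. The `(timestamp, characters_per_second)` record shape and the sample values are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import median
from typing import Dict, Iterable, List, Tuple


def daily_medians(
    measurements: Iterable[Tuple[datetime, float]],
) -> Dict[date, float]:
    """Median characters-per-second value for each calendar day."""
    by_day: Dict[date, List[float]] = defaultdict(list)
    for timestamp, cps in measurements:
        by_day[timestamp.date()].append(cps)
    return {day: median(values) for day, values in sorted(by_day.items())}


if __name__ == "__main__":
    # Four made-up measurements taken at different times on the same day.
    day = [
        (datetime(2025, 1, 6, 0), 100.0),
        (datetime(2025, 1, 6, 6), 120.0),
        (datetime(2025, 1, 6, 12), 110.0),
        (datetime(2025, 1, 6, 18), 400.0),
    ]
    print(daily_medians(day))
```

With four measurements per day, the median is the mean of the two middle values, so a single outlier such as the 400.0 above does not move the daily figure.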
Price
Price
Price: USD per 1M characters of text, Lower is better
Streaming
Streaming Support
Streaming Support: Indicates whether the provider supports streaming of audio from their API. We plan to add performance benchmarking of streaming support in the future.
Summary of Key Metrics & Further Information
Model | Model Arena ELO | Characters per Second | Price per 1M Characters (USD) |
---|---|---|---|
TTS-1 HD | 1152 | 528.9 | $30.00 |
TTS-1 | 1138 | 550.0 | $15.00 |
Multilingual v2 | 1115 | 82.4 | $206.00 |
Turbo v2.5 | 1111 | 389.1 | $103.00 |
Flash v2.5 | 1108 | 356.1 | $103.00 |
Sonic English (Oct '24) | 1107 | 38.5 | $46.70 |
Kokoro 82M v1.0 | 1090 | 341.1 | $0.65 |
T2A-01-HD | 1081 | 140.0 | $50.00 |
Polly Generative | 1062 | 88.2 | $30.00 |
Azure Neural | 1059 | 289.2 | $15.00 |
Polly Long-Form | 1059 | 345.2 | $100.00 |
T2A-01-Turbo | 1042 | 167.4 | $30.00 |
Studio | 1040 | 276.1 | $160.00 |
Dialog | 1015 | 78.6 | $150.00 |
Dialog | 1015 | 220.6 | $50.00 |
Zonos-v0.1 | 1000 | 25.5 | $20.00 |
3.0 mini | 995 | 78.9 | $150.00 |
OpenVoice v2 | 973 | 6.9 | $8.33 |
Murf Speech Gen 2 | 973 | 142.9 | $100.00 |
Step TTS Mini | 960 | 42.0 | $12.38 |
Journey | 954 | 63.3 | $160.00 |
XTTS v2 | 899 | 36.4 | $40.44 |
StyleTTS 2 | 890 | 3.6 | $2.82 |
Polly Neural | 884 | 885.8 | $16.00 |
WaveNet | 871 | 474.6 | $16.00 |
Standard | 835 | 469.3 | $4.00 |
Neural2 | 832 | 563.2 | $16.00 |
Polly Standard | 797 | 1045.8 | $4.00 |
MetaVoice v1 | 784 | 2.4 | $123.97 |
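One way to shortlist entries from a table like this is a Pareto filter on quality and price: keep an entry only if no other entry is at least as good on both measures and strictly better on one. This is an illustrative technique, not the method behind the charts on this page. The `Entry` type and function names are hypothetical, and only a subset of rows is included to keep the sketch short.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    model: str
    elo: int
    price_usd: float  # USD per 1M characters


def dominates(a: Entry, b: Entry) -> bool:
    """True if a is no worse than b on ELO and price, and better on at least one."""
    no_worse = a.elo >= b.elo and a.price_usd <= b.price_usd
    better = a.elo > b.elo or a.price_usd < b.price_usd
    return no_worse and better


def pareto_frontier(entries: List[Entry]) -> List[Entry]:
    """Entries not dominated by any other entry, in input order."""
    return [e for e in entries if not any(dominates(o, e) for o in entries)]


# Subset of rows copied from the summary table.
ROWS = [
    Entry("TTS-1 HD", 1152, 30.00),
    Entry("TTS-1", 1138, 15.00),
    Entry("Multilingual v2", 1115, 206.00),
    Entry("Kokoro 82M v1.0", 1090, 0.65),
    Entry("Azure Neural", 1059, 15.00),
    Entry("StyleTTS 2", 890, 2.82),
    Entry("Polly Standard", 797, 4.00),
]

if __name__ == "__main__":
    for entry in pareto_frontier(ROWS):
        print(entry.model, entry.elo, entry.price_usd)
```

The filter ignores speed; adding characters per second as a third dimension would follow the same pattern.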