OpenAI TTS: Quality, Generation Time & Price Analysis
Analysis of OpenAI's models and comparison to other audio models across key metrics including quality, generation time, and price.
API providers compared include OpenAI, Google, Amazon Bedrock, Microsoft Azure, Replicate, Cartesia, ElevenLabs, and LMNT.
For further details, see our methodology page.
Highlights
Quality ELO
Arena ELO: Average ELO rating of the model (higher is better)
Characters per Second
Characters processed per second: Number of characters per second of generation time (higher is better)
Price
Price: USD per 1M characters of text (lower is better)
Summary Analysis
Quality vs. Price
Arena ELO: Average ELO rating of the model. Price: USD per 1M characters of text.
Most attractive quadrant
Size represents characters processed per second (number of characters per second of generation time)
Quality ELO: Relative ELO score of the models as determined by responses from users in Artificial Analysis' Quality Arena. Some models may not be shown because they do not yet have enough votes. Note that this is intended to represent quality for generalist use cases (a conversational AI assistant, a customer support system, or reading an email) and may not be representative of all use cases.
Price: Price per 1M characters of text. For details on how we calculate price for providers that price based on inference time or subscription plans, see our methodology page.
Quality vs. Speed
Arena ELO: Average ELO rating of the model. Characters processed per second: Number of characters per second of generation time.
Most attractive quadrant
Size represents price (USD per 1M characters of text)
Characters per Second: Number of characters processed per second of generation time. Higher values indicate faster generation speeds.
Speed vs. Price
Characters processed per second: Number of characters per second of generation time. Price: USD per 1M characters of text.
Most attractive quadrant
Quality
Quality Arena ELO (Text to Speech Arena)
Arena ELO: Average ELO rating of the model (higher is better)
Quality ELO: Relative ELO score of the models as determined by responses from users in Artificial Analysis' Quality Arena. Some models may not be shown because they do not yet have enough votes. Note that this is intended to represent quality for generalist use cases (a conversational AI assistant, a customer support system, or reading an email) and may not be representative of all use cases.
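The page does not state how arena votes are converted into ratings. As a rough illustration only, the sketch below applies the textbook Elo update to a single pairwise vote. The K-factor of 32, the 400-point scale, and the function names are assumptions for illustration, not Artificial Analysis' documented method.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that clip A is preferred over clip B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one pairwise vote.

    k = 32 is a common default and an assumption here, not a value from the source.
    """
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two equally rated models; a user prefers the clip from model A.
    print(update_elo(1000.0, 1000.0, a_won=True))  # (1016.0, 984.0)
```

Because each update depends on the opponent's rating, a win against a higher-rated model moves the score more than a win against a lower-rated one.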
Arena Win Rate
Arena Win Rate: Win rate (%) in the Text to Speech Arena (higher is better)
Win Rate: Proportion of comparisons in which an audio clip generated by the model was selected as preferred over the other audio clip presented in Artificial Analysis' Quality Arena.
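A minimal sketch of the win-rate definition above, assuming each vote is recorded as a tuple of the two models compared and the preferred one. The vote format and the model names are hypothetical.

```python
from collections import Counter


def win_rates(votes):
    """Return {model: win rate in %} from (model_a, model_b, winner) tuples."""
    wins, appearances = Counter(), Counter()
    for model_a, model_b, winner in votes:
        appearances[model_a] += 1
        appearances[model_b] += 1
        wins[winner] += 1
    # Win rate = comparisons won / comparisons the model appeared in.
    return {model: 100.0 * wins[model] / n for model, n in appearances.items()}


if __name__ == "__main__":
    sample_votes = [
        ("model-a", "model-b", "model-a"),
        ("model-a", "model-c", "model-c"),
        ("model-b", "model-c", "model-c"),
        ("model-a", "model-b", "model-a"),
    ]
    for model, rate in sorted(win_rates(sample_votes).items()):
        print(f"{model}: {rate:.1f}%")
```

Unlike an ELO rating, this figure does not account for the strength of the opposing model in each comparison.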
Speed
Characters Per Second
Characters processed per second: Number of characters per second of generation time (higher is better)
Characters per Second: Number of characters processed per second of generation time. Higher values indicate faster generation speeds.
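The metric can be sketched as input length divided by wall-clock generation time. In the example below, `synthesize` is a hypothetical callable standing in for any provider's text-to-speech request, and counting characters with `len()` is an assumption about how input length is measured.

```python
import time
from typing import Callable


def characters_per_second(num_characters: int, generation_seconds: float) -> float:
    """Characters of input text processed per second of generation time."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return num_characters / generation_seconds


def benchmark(synthesize: Callable[[str], bytes], text: str) -> float:
    """Time one call to a (hypothetical) synthesize function and return chars/sec."""
    start = time.perf_counter()
    synthesize(text)  # the returned audio is ignored; only elapsed time matters
    elapsed = time.perf_counter() - start
    return characters_per_second(len(text), elapsed)


if __name__ == "__main__":
    # 500 characters generated in 2.5 seconds.
    print(characters_per_second(500, 2.5))  # 200.0
```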
Speed Factor
Speed factor: Output audio seconds generated per second of generation time (higher is better)
Speed Factor: Output audio seconds generated per second of generation time. Higher values indicate faster generation speeds. Characters per second is generally preferred as a benchmark of API generation speed because the number of output audio seconds per character varies by model (e.g. a slower speaking voice produces more audio for the same text).
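To illustrate why the two metrics can disagree, the sketch below compares two hypothetical voices that take the same time to generate audio for the same text but speak at different rates. All numbers are invented for illustration.

```python
def speed_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of output audio produced per second of generation time."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


if __name__ == "__main__":
    text_length = 300          # characters of input text (hypothetical)
    generation_seconds = 2.0   # identical generation time for both voices

    # The same text read at two speaking rates yields different audio durations.
    fast_voice_audio = 20.0    # seconds of audio
    slow_voice_audio = 25.0    # seconds of audio

    print("chars/sec (both voices):", text_length / generation_seconds)
    print("speed factor, fast voice:", speed_factor(fast_voice_audio, generation_seconds))
    print("speed factor, slow voice:", speed_factor(slow_voice_audio, generation_seconds))
```

Both voices process 150 characters per second, yet the slower-speaking voice reports the higher speed factor (12.5 versus 10.0), because it produces more audio for the same input.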
Characters Per Second, Variance
Characters processed per second: Number of characters per second of generation time, results by percentile (higher is better)
Median; the other points represent the 5th, 25th, 75th, and 95th percentiles, respectively
Boxplot: Shows variance of measurements
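The five points shown per model can be computed from raw measurements roughly as follows. This is a sketch using Python's `statistics.quantiles`. The interpolation method (`"inclusive"`) is an assumption, since the page does not say how percentiles are derived.

```python
from statistics import quantiles


def boxplot_points(measurements):
    """Return the 5th, 25th, 50th, 75th and 95th percentiles of the measurements."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = quantiles(measurements, n=20, method="inclusive")
    return {
        5: cuts[0],
        25: cuts[4],
        50: cuts[9],   # median
        75: cuts[14],
        95: cuts[18],
    }


if __name__ == "__main__":
    # Hypothetical characters-per-second samples: 0, 1, ..., 100.
    samples = [float(v) for v in range(101)]
    print(boxplot_points(samples))
```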
Characters per Second, Over Time
Characters processed per second: Number of characters per second of generation time (higher is better)
Over time measurement: Median measurement per day, based on 4 measurements each day at different times. Labels mark the start of each week's measurements.
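The daily aggregation described above can be sketched as grouping timestamped measurements by calendar day and taking the median of each group. The sample values and the `(timestamp, value)` record format are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def daily_medians(samples):
    """Map each calendar day to the median of that day's measurements.

    `samples` is an iterable of (datetime, characters_per_second) pairs.
    """
    by_day = defaultdict(list)
    for timestamp, value in samples:
        by_day[timestamp.date()].append(value)
    return {day: median(values) for day, values in sorted(by_day.items())}


if __name__ == "__main__":
    # Four measurements per day at different times, as the caption describes.
    samples = [
        (datetime(2024, 1, 1, 0), 100), (datetime(2024, 1, 1, 6), 120),
        (datetime(2024, 1, 1, 12), 110), (datetime(2024, 1, 1, 18), 130),
        (datetime(2024, 1, 2, 0), 90), (datetime(2024, 1, 2, 6), 95),
        (datetime(2024, 1, 2, 12), 105), (datetime(2024, 1, 2, 18), 100),
    ]
    for day, value in daily_medians(samples).items():
        print(day, value)
```

With an even number of measurements per day, the median is the mean of the two middle values, which makes it less sensitive to a single unusually slow or fast request than a daily average would be.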
Price
Price
Price: USD per 1M characters of text (lower is better)
Price: Price per 1M characters of text. For details on how we calculate price for providers that price based on inference time or subscription plans, see our methodology page.
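As a quick worked example of the unit, the sketch below converts a per-1M-character price into the cost of a given job. The price used is a placeholder, not a quoted rate from any provider.

```python
def synthesis_cost_usd(num_characters: int, price_per_million_usd: float) -> float:
    """Cost in USD of synthesizing `num_characters` at a per-1M-character price."""
    if num_characters < 0 or price_per_million_usd < 0:
        raise ValueError("inputs must be non-negative")
    return num_characters / 1_000_000 * price_per_million_usd


if __name__ == "__main__":
    # Placeholder price of 15 USD per 1M characters (hypothetical, for illustration).
    print(synthesis_cost_usd(2_000_000, 15.0))  # 30.0
```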
Streaming
Streaming Support
Provider | Streaming Support |
---|---|
OpenAI | |
Amazon | |
Azure | |
Replicate | |
ElevenLabs | |
LMNT | |
Streaming Support: Indicates whether the provider supports streaming of audio from their API. We plan to add performance benchmarking of streaming support in the future.