Speech to Speech AI Model & Provider Leaderboard
Analysis and comparison of Speech to Speech models & API providers. Artificial Analysis has analyzed speech to speech models and hosting providers across different characteristics including their reasoning quality, generation time and price.
For further details, see the methodology page.
Highlights
Speech Reasoning
Speech Reasoning (Big Bench Audio)
Results from evaluation on the Artificial Analysis Big Bench Audio dataset. See methodology for further details.
About Big Bench Audio
The emergence of native audio-to-audio models offers exciting opportunities to increase voice agent capabilities and simplify workflows. However, it's crucial to evaluate whether this simplification comes at the cost of model performance or introduces other trade-offs.
To help answer this question, we've released Big Bench Audio, a new dataset for benchmarking the performance of native audio models.
Big Bench Audio contains 1,000 audio files representing questions designed to test the intelligence of models. The questions are based on four categories of the Big Bench Hard dataset and were generated using 23 synthetic voices from top-ranked text-to-speech models in the Artificial Analysis Text to Speech Arena.
To evaluate the trade-offs of using native speech to speech models, we test several configurations on Big Bench Audio, as described in the table and displayed in the chart above. To learn more about Big Bench Audio, review the article or download the dataset yourself.
Summary Analysis
Speech Reasoning vs Input Price
Price per hour of audio included in the request/message sent to the API, represented as USD per hour of audio.
Speech Reasoning vs Speed
Number of seconds required to generate the first token of audio output. A lower value represents a faster generation.
Price
Audio Input and Output Prices
Input price: price per hour of audio included in the request/message sent to the API, in USD per hour of audio.
Output price: price per hour of audio generated by the model (received from the API), in USD per hour of audio.
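As a rough illustration of how the two hourly prices combine, the sketch below estimates the cost of a single exchange from the seconds of audio sent and received. The function name `conversation_cost_usd` and the example durations are assumptions made for illustration; the prices used (3.60 and 14.40 USD per hour) are taken from one row of the summary table.

```python
def conversation_cost_usd(
    input_seconds: float,
    output_seconds: float,
    input_price_per_hour: float,
    output_price_per_hour: float,
) -> float:
    """Estimate cost in USD from audio durations and per-hour prices.

    Hypothetical helper: it assumes billing is linear in audio duration,
    which may not match how a given provider actually meters usage.
    """
    input_hours = input_seconds / 3600
    output_hours = output_seconds / 3600
    return input_hours * input_price_per_hour + output_hours * output_price_per_hour


# 10 minutes of audio sent, 5 minutes of audio received.
cost = conversation_cost_usd(600, 300, 3.60, 14.40)
print(f"{cost:.2f}")  # prints 1.80
```

Because output audio is priced higher than input audio for most rows in the table, the output share often dominates the total even when the model speaks for less time than the user.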
Speed
Time to First Audio
Number of seconds required to generate the first token of audio output. A lower value represents a faster generation.
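A minimal sketch of how a time-to-first-audio figure could be measured on the client side, assuming the API returns audio as a stream of byte chunks. The exact measurement procedure is described on the methodology page, not here; `time_to_first_audio` and `fake_audio_stream` are hypothetical names, and the fake stream simply stands in for a real streaming response.

```python
import time
from typing import Iterable, Iterator


def time_to_first_audio(chunks: Iterable[bytes], started_at: float) -> float:
    """Return seconds elapsed between `started_at` and the first non-empty chunk."""
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks, if any
            return time.perf_counter() - started_at
    raise RuntimeError("stream ended without producing any audio")


def fake_audio_stream(delay_s: float) -> Iterator[bytes]:
    """Stand-in for a real streaming response (assumption for this sketch)."""
    time.sleep(delay_s)  # simulated model and network latency
    yield b"\x00\x01"
    yield b"\x02\x03"


start = time.perf_counter()  # taken just before the request would be sent
ttfa = time_to_first_audio(fake_audio_stream(0.05), start)
print(f"time to first audio: {ttfa:.3f} s")
```

`time.perf_counter` is used rather than wall-clock time because it is monotonic and intended for measuring short intervals.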
Summary of key metrics & further information
| Model Name | Footnotes | Speech Reasoning (Big Bench Audio) | Time to first audio (secs) | Price per hour of audio input (USD) | Price per hour of audio output (USD) |
|---|---|---|---|---|---|
| | - | 59% | 0.88 | 0.65 | 0.99 |
| | - | 58% | 4.82 | 0.61 | 1.36 |
| | - | 87% | 1.39 | 0.27 | 1.08 |
| | - | 92% | 0.78 | 3.00 | 3.00 |
| | - | 83% | 0.98 | 1.15 | 4.61 |
| | - | 69% | 1.27 | 0.36 | 1.44 |
| | - | 68% | 1.49 | 1.44 | 5.76 |
| | - | 68% | 0.81 | 0.36 | 1.44 |
| | - | 66% | 1.04 | 3.60 | 14.40 |
| | | 54% | 3.38 | 3.60 | 14.40 |
| | - | 92% | 3.87 | 0.35 | 1.38 |
| | - | 74% | 0.64 | 0.35 | 1.38 |
| | - | 71% | 0.63 | 0.35 | 1.38 |
| | - | 36% | 1.94 | 0.03 | - |
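
One way to read the reasoning-versus-speed trade-off from the table above is to look for rows that no other row beats on both axes at once. The sketch below does this under two assumptions: the percentage column is the Big Bench Audio speech reasoning score, and rows are identified only by their position in the table (1 to 14), since no model names are used here. The names `Row`, `dominates` and `pareto_frontier` are illustrative.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    row: int      # position in the table, 1-based
    score: float  # percentage column, read as the Big Bench Audio score
    ttfa: float   # time to first audio, in seconds


ROWS = [
    Row(1, 59, 0.88), Row(2, 58, 4.82), Row(3, 87, 1.39), Row(4, 92, 0.78),
    Row(5, 83, 0.98), Row(6, 69, 1.27), Row(7, 68, 1.49), Row(8, 68, 0.81),
    Row(9, 66, 1.04), Row(10, 54, 3.38), Row(11, 92, 3.87), Row(12, 74, 0.64),
    Row(13, 71, 0.63), Row(14, 36, 1.94),
]


def dominates(a: Row, b: Row) -> bool:
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    at_least_as_good = a.score >= b.score and a.ttfa <= b.ttfa
    strictly_better = a.score > b.score or a.ttfa < b.ttfa
    return at_least_as_good and strictly_better


def pareto_frontier(rows: List[Row]) -> List[Row]:
    """Keep rows that no other row dominates (higher score, lower latency)."""
    return [r for r in rows if not any(dominates(o, r) for o in rows)]


frontier = pareto_frontier(ROWS)
print([r.row for r in frontier])  # prints [4, 12, 13]
```

The same approach applies to reasoning versus input price by swapping the latency field for the input price column.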