Speech to Speech AI Model & Provider Leaderboard
Analysis and comparison of Speech to Speech models & API providers. Artificial Analysis has analyzed speech to speech models and hosting providers across different characteristics including their reasoning quality, generation time and price.
For further details, see the methodology page.
Speech Reasoning
Speech Reasoning (Big Bench Audio)
Results from evaluation on the Artificial Analysis Big Bench Audio dataset. See methodology for further details.
About Big Bench Audio
The emergence of native audio-to-audio models offers exciting opportunities to increase voice agent capabilities and simplify workflows. However, it's crucial to evaluate whether this simplification comes at the cost of model performance or introduces other trade-offs.
To help answer this question we've released Big Bench Audio, a new dataset for benchmarking the performance of native audio models.
Big Bench Audio contains 1,000 audio files representing questions designed to test the intelligence of models. The questions are based on four categories of the Big Bench Hard dataset and were generated using 23 synthetic voices from top-ranked text to speech models in the Artificial Analysis Text to Speech Arena.
To enable evaluation of the trade-offs associated with using native speech to speech models, we test multiple configurations on Big Bench Audio, as described in the table and displayed in the chart above. To learn more about Big Bench Audio, review the article or download the dataset yourself.
Summary Analysis
Speech Reasoning vs Input Price
Price per hour of audio included in the request/message sent to the API, represented as USD per hour of audio.
Speech Reasoning vs Speed
Number of seconds required to generate the first token of audio output. A lower value represents a faster generation.
Price
Audio Input and Output Prices
Price per hour of audio included in the request/message sent to the API, represented as USD per hour of audio.
Price per hour of audio generated by the model (received from the API), represented as USD per hour of audio.
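Because both prices are quoted in USD per hour of audio, the audio cost of a single session can be estimated from its input and output durations. The following is a minimal sketch under that assumption; the function name `session_cost_usd` and the example durations are hypothetical, and real provider billing may add other charges (for example text tokens or minimum increments) that are not modelled here.

```python
def session_cost_usd(
    input_seconds: float,
    output_seconds: float,
    input_price_per_hour: float,
    output_price_per_hour: float,
) -> float:
    """Estimate the audio cost of one session in USD.

    Assumes simple linear pricing per hour of audio, as the leaderboard
    reports it. Hypothetical helper, not a provider API.
    """
    if input_seconds < 0 or output_seconds < 0:
        raise ValueError("durations must be non-negative")
    input_hours = input_seconds / 3600
    output_hours = output_seconds / 3600
    return input_hours * input_price_per_hour + output_hours * output_price_per_hour


# Example: 6 minutes of user audio and 4 minutes of model audio,
# priced at 0.36 USD/hour input and 1.44 USD/hour output
# (one of the price pairs listed in the summary table).
cost = session_cost_usd(360, 240, 0.36, 1.44)
print(f"Estimated session cost: ${cost:.4f}")
```

At these rates, output audio dominates the bill even though the model speaks for less time than the user, which is why the two prices are reported separately.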
Speed
Time to First Audio
Number of seconds required to generate the first token of audio output. A lower value represents a faster generation.
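One way to approximate this metric yourself is to time how long a streaming response takes to deliver its first non-empty audio chunk. The sketch below is an illustration only and is not necessarily how Artificial Analysis measures it: `time_to_first_audio` and `fake_stream` are hypothetical names, and the fake stream stands in for a real provider's streaming response. In a real measurement the clock should start when the request is sent.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def time_to_first_audio(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], Optional[bytes]]:
    """Return (seconds until the first non-empty chunk, that chunk).

    Returns (None, None) if the stream ends without producing audio.
    """
    start = clock()
    for chunk in chunks:
        if chunk:  # ignore empty keep-alive chunks
            return clock() - start, chunk
    return None, None


def fake_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a provider's streaming audio response."""
    time.sleep(delay_s)  # simulated model and network latency
    yield b"\x00\x01"
    yield b"\x02\x03"


elapsed, first_chunk = time_to_first_audio(fake_stream())
print(f"Time to first audio: {elapsed:.3f}s")
```

Because the generator body only runs once iteration begins, the simulated delay falls inside the timed region, mirroring a request whose latency is observed by the client.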
Summary of key metrics & further information
| Model Name | Footnotes | Speech Reasoning (Big Bench Audio) | Time to first audio (secs) | Price per hour of audio input (USD) | Price per hour of audio output (USD) |
|---|---|---|---|---|---|
| Step-Audio R1.1 (Realtime) | - | 96% | 1.51 | 0.06 | 1.69 |
| | - | 59% | 0.88 | 0.65 | 0.99 |
| | - | 58% | 4.82 | 0.61 | 1.36 |
| Nova 2.0 Sonic | - | 87% | 1.39 | 0.27 | 1.08 |
| Grok Voice Agent | - | 92% | 0.78 | 3.00 | 3.00 |
| | - | 83% | 0.98 | 1.15 | 4.61 |
| | - | 81% | 0.82 | 1.15 | 4.61 |
| | - | 69% | 1.27 | 0.36 | 1.44 |
| GPT-Realtime Mini, Oct 25 | - | 68% | 0.81 | 0.36 | 1.44 |
| | | 54% | 3.38 | 3.60 | 14.40 |
| | - | 92% | 3.87 | 0.35 | 1.38 |
| Gemini 2.5 Flash Native Audio Dialog | - | 71% | 0.63 | 0.35 | 1.38 |
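
The trade-off between speech reasoning and speed in the table above can be summarised by its Pareto frontier: the rows for which no other row is both at least as accurate and at least as fast. The following is a minimal sketch using the percentage and time to first audio values from the table. Rows are labelled by their position in the table rather than by model name, and `Row` and `pareto_frontier` are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    label: str      # position in the summary table
    score_pct: int  # percentage column (higher is better)
    ttfa_s: float   # time to first audio in seconds (lower is better)


ROWS = [
    Row("row 1", 96, 1.51), Row("row 2", 59, 0.88), Row("row 3", 58, 4.82),
    Row("row 4", 87, 1.39), Row("row 5", 92, 0.78), Row("row 6", 83, 0.98),
    Row("row 7", 81, 0.82), Row("row 8", 69, 1.27), Row("row 9", 68, 0.81),
    Row("row 10", 54, 3.38), Row("row 11", 92, 3.87), Row("row 12", 71, 0.63),
]


def pareto_frontier(rows: List[Row]) -> List[Row]:
    """Keep rows that no other row beats on both score and latency."""
    frontier = []
    for r in rows:
        dominated = any(
            o.score_pct >= r.score_pct
            and o.ttfa_s <= r.ttfa_s
            and (o.score_pct > r.score_pct or o.ttfa_s < r.ttfa_s)
            for o in rows
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.ttfa_s)


for r in pareto_frontier(ROWS):
    print(f"{r.label}: {r.score_pct}% at {r.ttfa_s:.2f}s")
```

With the values listed, three rows remain on the frontier: the fastest row, the highest-scoring row, and one row that sits between them on both axes. Every other row is matched or beaten on both metrics by at least one of these.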
Frequently Asked Questions
Common questions about Speech to Speech models
Gemini 2.5 Flash Native Audio Dialog has the lowest Time to First Audio at 0.63s, followed by Grok Voice Agent (0.78s) and GPT-Realtime Mini, Oct 25 (0.81s). Lower values mean faster initial response.
Step-Audio R1.1 (Realtime) is the most affordable at $0.064/hour for input audio, followed by Nova 2.0 Sonic ($0.27/hour) and Gemini 2.5 Flash Native Audio Dialog ($0.3456/hour).
Benchmarks are updated regularly as new models and providers are added. Performance metrics are continuously monitored to reflect current provider capabilities and pricing changes.