Speech to Speech AI Model & Provider Leaderboard
Analysis and comparison of Speech to Speech models and API providers. Artificial Analysis evaluates speech to speech models and hosting providers on reasoning quality, conversational dynamics, generation time, and price.
For further details, see the methodology page.
Speech Reasoning
Speech Reasoning (Big Bench Audio)
Conversational Dynamics
Conversational Dynamics (Full Duplex Bench subset)
Conversational Dynamics by Category
Conversational Dynamics (Full Duplex Bench subset) - Category Breakdown
API Benchmarks
Speech Reasoning (Big Bench Audio) vs Input Price
Input Price
Speed
Summary of key metrics & further information
| Model Name | Footnotes | Speech Reasoning | Conversational Dynamics score | Time to first audio (secs) | Price per hour of audio input (USD) | Price per hour of audio output (USD) |
|---|---|---|---|---|---|---|
|  | - | 97% | - | 1.51 | 0.06 | 1.69 |
|  | - | - | - | 2.80 | - | - |
|  | - | 58% | 72.7% | 4.82 | 0.61 | 1.36 |
|  | - | 57% | - | 0.88 | 0.65 | 0.99 |
|  | - | 39% | 77.8% | - | - | - |
|  | - | 19% | 91.0% | - | - | - |
|  | - | 87% | - | 1.39 | 0.27 | 1.08 |
|  | - | 4% | 61.0% | - | - | - |
|  | - | 93% | 71.6% | 0.78 | 3.00 | 3.00 |
|  | - | 83% | 93.9% | 0.98 | 1.15 | 4.61 |
|  | - | 81% | 95.7% | 0.82 | 1.15 | 4.61 |
|  | - | 69% | - | 1.27 | 0.36 | 1.44 |
|  | - | 62% | 95.7% | 0.81 | 0.36 | 1.44 |
|  |  | 54% | - | 3.38 | 3.60 | 14.40 |
|  | - | - | 89.8% | 1.49 | 1.44 | 5.76 |
|  | - | 96% | - | 2.98 | 0.35 | 1.38 |
|  | - | 91% | - | 3.87 | 0.35 | 1.38 |
|  | - | 71% | - | 0.96 | 0.35 | 1.38 |
|  | - | 69% | - | 0.63 | 0.35 | 1.38 |
|  | - | - | 30.3% | - | - | - |
|  | - | - | 44.0% | - | - | - |
|  | - | 32% | 58.7% | - | - | - |
|  | - | 16% | 62.0% | - | - | - |
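The two price columns are quoted per hour of audio, so a rough per-session estimate can be obtained by scaling each price by the minutes of audio sent and received. The sketch below assumes billing is linear in audio duration and ignores any other charges a provider may apply; the function name `session_cost_usd` and the 6-minute/4-minute split are hypothetical, and the price pairs are copied from two rows of the table above.

```python
def session_cost_usd(input_minutes, output_minutes,
                     input_price_per_hour, output_price_per_hour):
    """Estimate the audio cost of one session from per-hour prices.

    Assumption: cost scales linearly with audio duration.
    """
    input_cost = (input_minutes / 60) * input_price_per_hour
    output_cost = (output_minutes / 60) * output_price_per_hour
    return input_cost + output_cost


# Hypothetical session: 6 minutes of user audio, 4 minutes of model audio.
# (input, output) prices in USD per hour, taken from two rows of the table.
low_priced = session_cost_usd(6, 4, 0.06, 1.69)
high_priced = session_cost_usd(6, 4, 3.60, 14.40)

print(f"lower-priced row:  ${low_priced:.4f}")
print(f"higher-priced row: ${high_priced:.4f}")
```

Because output audio is priced well above input audio in most rows, the output term tends to dominate the estimate even when the model speaks for less time than the user.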
Frequently Asked Questions
Common questions about Speech to Speech models
Step-Audio R1.1 (Realtime) leads with a Speech Reasoning score of 97.0% on the Big Bench Audio dataset across 19 models evaluated. Gemini 3.1 Flash Live Preview - High follows at 95.9% and Grok Voice Agent at 92.9%.
GPT-Realtime-1.5 leads with a Conversational Dynamics score of 95.7% on the Full Duplex Bench dataset across 13 models evaluated. GPT Realtime Mini (Oct 2025) follows at the same rounded score of 95.7%, and GPT Realtime is third at 93.9%.
The top speech to speech models by reasoning quality are: 1. Step-Audio R1.1 (Realtime) (97.0%), 2. Gemini 3.1 Flash Live Preview - High (95.9%), 3. Grok Voice Agent (92.9%), 4. Gemini 2.5 Flash Native Audio Dialog Thinking (90.7%), 5. Nova 2.0 Sonic (86.6%). Scores are based on the Big Bench Audio benchmark.
The top speech to speech models by conversational dynamics are: 1. GPT-Realtime-1.5 (95.7%), 2. GPT Realtime Mini (Oct 2025) (95.7%), 3. GPT Realtime (93.9%), 4. PersonaPlex (91.0%), 5. GPT-4o Realtime (Dec 2024) (89.8%). Scores are based on the Full Duplex Bench benchmark.
Gemini 2.5 Flash Native Audio Dialog has the lowest Time to First Audio at 0.63s, followed by Grok Voice Agent (0.78s) and GPT Realtime Mini (Oct 2025) (0.81s). Lower values mean faster initial response.
Step-Audio R1.1 (Realtime) is the most affordable at $0.064/hour for input audio, followed by Nova 2.0 Sonic ($0.27/hour) and Gemini 3.1 Flash Live Preview - Minimal ($0.3456/hour).
Full Duplex Bench evaluates how well speech to speech models handle real conversational behaviors: knowing when to speak, when to stay silent during pauses, how to respond to interruptions, and how to recognize backchannels like "yeah" or "mm-hmm". Artificial Analysis implements a subset of Full Duplex Bench v1 and v1.5.
Speech Reasoning (Big Bench Audio) measures whether a model can understand and correctly answer reasoning questions delivered as audio. Conversational Dynamics (Full Duplex Bench) measures whether a model can handle natural conversation flow: turn-taking, pauses, and interruptions. A model may excel at one but not the other.
Real-time conversation requires both strong conversational dynamics and low latency. By Conversational Dynamics score, GPT-Realtime-1.5 (95.7%), GPT Realtime Mini (Oct 2025) (95.7%), and GPT Realtime (93.9%) lead the field. By Time to First Audio, Gemini 2.5 Flash Native Audio Dialog (0.63s), Grok Voice Agent (0.78s), and GPT Realtime Mini (Oct 2025) (0.81s) are the fastest. The best choice depends on whether natural conversation flow or response speed is more critical for your use case.
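One way to act on this trade-off is to set a minimum Conversational Dynamics score and a maximum Time to First Audio, then rank whatever survives. The sketch below is illustrative only: the thresholds are arbitrary assumptions, the labels `row A` to `row F` are hypothetical stand-ins, and the value pairs are the summary-table rows that report both metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    label: str           # hypothetical placeholder, not a real model name
    dynamics_pct: float  # Conversational Dynamics score (%)
    ttfa_secs: float     # Time to First Audio (seconds)


# Value pairs copied from summary-table rows that report both metrics.
CANDIDATES = [
    Candidate("row A", 72.7, 4.82),
    Candidate("row B", 71.6, 0.78),
    Candidate("row C", 93.9, 0.98),
    Candidate("row D", 95.7, 0.82),
    Candidate("row E", 95.7, 0.81),
    Candidate("row F", 89.8, 1.49),
]


def shortlist(candidates, min_dynamics_pct, max_ttfa_secs):
    """Keep candidates meeting both thresholds, best dynamics first,
    breaking ties by lower time to first audio."""
    kept = [
        c for c in candidates
        if c.dynamics_pct >= min_dynamics_pct and c.ttfa_secs <= max_ttfa_secs
    ]
    return sorted(kept, key=lambda c: (-c.dynamics_pct, c.ttfa_secs))


# Assumed requirements: at least 90% dynamics, at most 1.0 s to first audio.
picks = shortlist(CANDIDATES, min_dynamics_pct=90.0, max_ttfa_secs=1.0)
for c in picks:
    print(c.label, c.dynamics_pct, c.ttfa_secs)
```

Tightening `max_ttfa_secs` favors response speed, while raising `min_dynamics_pct` favors natural conversation flow, so the two thresholds encode which side of the trade-off matters more for a given use case.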
Benchmarks are updated regularly as new models and providers are added. Performance metrics are continuously monitored to reflect current provider capabilities and pricing changes.