Speech to Speech AI Model & Provider Leaderboard
Analysis and comparison of Speech to Speech models & API providers. Artificial Analysis has analyzed speech to speech models and hosting providers across different characteristics including their reasoning quality, conversational dynamics, generation time and price.
For further details, see the methodology page.
Highlights
Speech Reasoning
Speech Reasoning (Big Bench Audio)
Agentic Performance
Agentic Performance (𝜏-Voice)
Note: The following models are based on 1 trial: GPT-Realtime-2 (Minimal), OpenAI. The following models are based on 2 trials: GPT-Realtime-2 (High), OpenAI.
Agentic Performance (𝜏-Voice) by Domain
Note: The following models are based on 1 trial: GPT-Realtime-2 (Minimal), OpenAI. The following models are based on 2 trials: GPT-Realtime-2 (High), OpenAI.
Conversational Dynamics
Conversational Dynamics (Full Duplex Bench subset)
Conversational Dynamics (Full Duplex Bench subset) - Category Breakdown
API Benchmarks
Speech Reasoning (Big Bench Audio) vs. Input Price
Input Price
Time to First Audio
Summary of Key Metrics & Further Information
| Model Name | Footnotes | Speech Reasoning | Conversational Dynamics | Agentic Performance | Time to first audio (secs) | Price per hour of audio input (USD) | Price per hour of audio output (USD) |
|---|---|---|---|---|---|---|---|
| | - | 98% | - | - | 1.51 | 0.06 | 1.69 |
| | - | 73% | - | - | 1.26 | 0.16 | 0.82 |
| | - | 59% | - | - | 0.79 | 0.16 | 0.82 |
| | - | 59% | 72.7% | - | 4.82 | 0.61 | 1.36 |
| | - | 57% | - | - | 0.88 | 0.65 | 0.99 |
| | - | - | - | - | 2.80 | - | - |
| | - | 39% | 77.8% | - | - | - | - |
| | - | 19% | 91.0% | - | - | - | - |
| | - | 88% | - | - | 1.14 | 0.27 | 1.08 |
| | - | 4% | 61.0% | - | - | - | - |
| | - | 97% | 77.8% | 52.1% | 1.25 | - | - |
| | - | 93% | 71.6% | 27.4% | 0.78 | 3.00 | 3.00 |
| | - | 97% | 95.3% | 38.3% | 2.33 | 1.15 | 4.61 |
| | - | 93% | 95.2% | 37.4% | 1.59 | 1.15 | 4.61 |
| | - | 83% | 93.9% | 30.4% | 0.98 | 1.15 | 4.61 |
| | - | 81% | 95.7% | 38.8% | 0.82 | 1.15 | 4.61 |
| | - | 72% | 96.1% | 30.8% | 1.12 | 1.15 | 4.61 |
| | - | 69% | - | - | 1.27 | 0.36 | 1.44 |
| | - | 64% | 95.7% | 15.1% | 0.81 | 0.36 | 1.44 |
| | | 54% | - | - | 3.38 | 3.60 | 14.40 |
| | - | - | 89.8% | 27.9% | 1.49 | 1.44 | 5.76 |
| | - | 97% | - | 37.7% | 2.98 | 0.35 | 1.38 |
| | - | 91% | - | - | 3.87 | 0.35 | 1.38 |
| | - | 71% | - | 26.2% | 0.96 | 0.35 | 1.38 |
| | - | 69% | - | - | 0.63 | 0.35 | 1.38 |
| | - | - | 30.3% | - | - | - | - |
| | - | - | 44.0% | 22.8% | - | - | - |
| | - | 33% | 58.7% | - | - | - | - |
| | - | 16% | 62.0% | - | - | - | - |
Frequently Asked Questions
Step-Audio R1.1 (Realtime) leads with a speech reasoning score of 97.6% on the Big Bench Audio dataset, the highest of the 25 models evaluated. Grok Voice Think Fast 1.0 follows at 97.1%, then GPT-Realtime-2 (High) at 96.6%.
GPT-Realtime-2 (Minimal) leads with a conversational dynamics score of 96.1% on the Full Duplex Bench dataset, the highest of the 17 models evaluated. GPT-Realtime-1.5 and GPT Realtime Mini (Oct 2025) follow, both at 95.7%.
The top speech to speech models by reasoning quality are: 1. Step-Audio R1.1 (Realtime) (97.6%), 2. Grok Voice Think Fast 1.0 (97.1%), 3. GPT-Realtime-2 (High) (96.6%), 4. Gemini 3.1 Flash Live Preview - High (96.6%), 5. Grok Voice Agent (93.3%). Scores are based on the Big Bench Audio benchmark.
The top speech to speech models by conversational dynamics are: 1. GPT-Realtime-2 (Minimal) (96.1%), 2. GPT-Realtime-1.5 (95.7%), 3. GPT Realtime Mini (Oct 2025) (95.7%), 4. GPT-Realtime-2 (High) (95.3%), 5. GPT-Realtime-2 (Medium) (95.2%). Scores are based on the Full Duplex Bench benchmark.
Gemini 2.5 Flash Native Audio Dialog has the lowest Time to First Audio at 0.63s, followed by Grok Voice Agent (0.78s) and Qwen3.5 Omni Flash Realtime (0.79s). Lower values mean faster initial response.
Step-Audio R1.1 (Realtime) is the most affordable at $0.064/hour for input audio, followed by Qwen3.5 Omni Plus Realtime ($0.162/hour) and Qwen3.5 Omni Flash Realtime ($0.162/hour).
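Because prices are quoted per hour of audio, with separate rates for input and output, the cost of a single conversation depends on how much of each it contains. The following is a minimal sketch of that arithmetic, not a billing tool: the function name `session_cost_usd` and the 6-minute/4-minute split are illustrative assumptions, and the example rates ($0.35 input, $1.38 output) are one price pair from the summary table. It assumes simple linear per-hour billing, which may not match how a given provider actually meters usage.

```python
def session_cost_usd(input_minutes, output_minutes,
                     input_price_per_hour, output_price_per_hour):
    """Estimate the audio cost of one session, assuming linear per-hour pricing."""
    # Convert minutes to hours, then multiply by the hourly rate for each direction.
    input_cost = input_minutes / 60 * input_price_per_hour
    output_cost = output_minutes / 60 * output_price_per_hour
    return input_cost + output_cost


# Hypothetical 10-minute call: 6 minutes of user speech, 4 minutes of model speech.
cost = session_cost_usd(6, 4, input_price_per_hour=0.35, output_price_per_hour=1.38)
print(f"Estimated cost: ${cost:.4f}")
```

Since output audio is often priced several times higher than input audio, the model's share of talk time can dominate the total even when the input rate is low.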
Full Duplex Bench evaluates how well speech to speech models handle real conversational behaviors: knowing when to speak, when to stay silent during pauses, how to respond to interruptions, and how to recognize backchannels like "yeah" or "mm-hmm". Artificial Analysis implements a subset of Full Duplex Bench v1 and v1.5.
Speech reasoning (Big Bench Audio) measures whether a model can understand and correctly answer reasoning questions delivered as audio. Conversational dynamics (Full Duplex Bench) measures whether a model can handle natural conversation flow — turn-taking, pauses, and interruptions. A model may excel at one but not the other.
Real-time conversation requires both strong conversational dynamics and low latency. By conversational dynamics score, GPT-Realtime-2 (Minimal) (96.1%), GPT-Realtime-1.5 (95.7%), and GPT Realtime Mini (Oct 2025) (95.7%) lead the field. By Time to First Audio, Gemini 2.5 Flash Native Audio Dialog (0.63s), Grok Voice Agent (0.78s), and Qwen3.5 Omni Flash Realtime (0.79s) are the fastest. The best choice depends on whether natural conversation flow or response speed is more critical for your use case.
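One way to act on this trade-off is to set a latency budget first and then rank the remaining models by conversational dynamics. The sketch below illustrates that approach under stated assumptions: the `Candidate` type, the `shortlist` function, and every model name and number in `candidates` are hypothetical placeholders, not leaderboard results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    dynamics_pct: float  # conversational dynamics score, in percent
    ttfa_s: float        # Time to First Audio, in seconds


def shortlist(candidates, max_ttfa_s):
    """Keep models within the latency budget, best conversational dynamics first."""
    eligible = [c for c in candidates if c.ttfa_s <= max_ttfa_s]
    # Sort by score descending; break ties in favor of the faster model.
    return sorted(eligible, key=lambda c: (-c.dynamics_pct, c.ttfa_s))


# Placeholder data for illustration only.
candidates = [
    Candidate("model_a", 96.0, 1.10),
    Candidate("model_b", 72.0, 0.80),
    Candidate("model_c", 95.5, 0.85),
    Candidate("model_d", 90.0, 2.50),
]

picked = shortlist(candidates, max_ttfa_s=1.0)
print([c.name for c in picked])
```

Filtering before ranking treats latency as a hard constraint. If response speed is only a preference, a weighted score across both metrics may be a better fit.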
Benchmarks are updated regularly as new models and providers are added. Performance metrics are continuously monitored to reflect current provider capabilities and pricing changes.