Speech to Speech AI Model & Provider Leaderboard
Analysis and comparison of Speech to Speech models & API providers. Artificial Analysis compares speech to speech models and hosting providers on reasoning quality, conversational dynamics, generation time, and price.
For further details, see the methodology page.
Highlights
Speech Reasoning
Speech Reasoning (Big Bench Audio)
Agentic Performance
Agentic Performance (𝜏-Voice)
Note: Results for the following model are based on a single trial: GPT-Realtime-2 (Minimal), OpenAI
Agentic Performance (𝜏-Voice) by Domain
Note: Results for the following model are based on a single trial: GPT-Realtime-2 (Minimal), OpenAI
Conversational Dynamics
Conversational Dynamics (Full Duplex Bench subset)
Conversational Dynamics (Full Duplex Bench subset) - Category Breakdown
API Benchmarks
Speech Reasoning (Big Bench Audio) vs. Cost per Hour of Input Audio
Cost per Hour of Input Audio (Big Bench Audio subset)
Time to First Audio
Summary of Key Metrics & Further Information
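Model | | Speech Reasoning (Big Bench Audio) | Conversational Dynamics (Full Duplex Bench subset) | Agentic Performance (𝜏-Voice) | Time to First Audio (s) | | Cost per Hour of Input Audio ($) | | Footnotes |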
|---|---|---|---|---|---|---|---|---|---|
Fun-Realtime-Audiochat | - | 98% | 97.8% | - | 1.39 | - | - | - | |
Qwen3.5 Omni Plus Realtime | - | 73% | - | - | 1.26 | 0.00 | 0.16 | 0.82 | |
Qwen3.5 Omni Flash Realtime | - | 59% | - | - | 0.79 | 0.00 | 0.16 | 0.82 | |
Qwen3 Omni Flash | - | 59% | 72.7% | - | 4.82 | 1.77 | 0.61 | 1.36 | |
Qwen3 Omni Realtime | - | 57% | - | - | 0.88 | 2.26 | 0.65 | 0.99 | |
Nova 2.0 Sonic (Mar 2026) | - | 88% | - | - | 1.14 | - | 0.27 | 1.08 | |
FLM-Audio | - | 16% | 62.0% | - | - | - | - | - | |
Deepslate Opal | - | 59% | 85.7% | 17.5% | 0.71 | - | - | - | |
Gemini 3.1 Flash Live Preview - High | - | 97% | - | 37.7% | 2.98 | 1.75 | 0.35 | 1.38 | |
Gemini 2.5 Flash Native Audio Dialog Thinking | - | 91% | - | - | 3.87 | - | 0.35 | 1.38 | |
Gemini 3.1 Flash Live Preview - Minimal | - | 71% | - | 26.2% | 0.96 | 1.50 | 0.35 | 1.38 | |
Gemini 2.5 Flash Native Audio Dialog | - | 69% | - | - | 0.63 | 1.42 | 0.35 | 1.38 | |
Gemini 2.5 Flash Native Audio Preview (Sep 2025) | - | - | 30.3% | - | - | - | - | - | |
Gemini 2.5 Flash Native Audio Preview (Dec 2025) | - | - | 44.0% | 22.8% | - | - | - | - | |
Qwen3 Omni Flash, Hathora | - | - | - | - | 2.80 | - | - | - | |
Moshi | - | 4% | 61.0% | - | - | - | - | - | |
Nemotron Voicechat | - | 39% | 77.8% | - | - | - | - | - | |
PersonaPlex | - | 19% | 91.0% | - | - | - | - | - | |
GPT-Realtime-2 (High) | - | 97% | 95.3% | 39.8% | 2.33 | 4.14 | 1.15 | 4.61 | |
GPT-Realtime-2 (Medium) | - | 93% | 95.2% | 37.4% | 1.59 | 3.97 | 1.15 | 4.61 | |
GPT Realtime | - | 83% | 93.9% | 30.4% | 0.98 | 11.08 | 1.15 | 4.61 | |
GPT-Realtime-1.5 | - | 81% | 95.7% | 38.8% | 0.82 | 11.44 | 1.15 | 4.61 | |
GPT-Realtime-2 (Minimal) | - | 72% | 96.1% | 30.8% | 1.12 | 3.07 | 1.15 | 4.61 | |
GPT-4o mini Realtime (Dec 2024) | - | 69% | - | - | 1.27 | 5.75 | 0.36 | 1.44 | |
GPT Realtime Mini (Oct 2025) | - | 64% | 95.7% | 15.1% | 0.81 | 3.04 | 0.36 | 1.44 | |
GPT-4o audio chatcompletions | - | 54% | - | - | 3.38 | 0.00 | 3.60 | 14.40 | |
GPT-4o Realtime (Dec 2024) | - | - | 89.8% | 27.9% | 1.49 | 2.04 | 1.44 | 5.76 | |
Step-Audio R1.1 (Realtime) | - | 98% | - | - | 1.51 | 0.00 | 0.06 | 1.69 | |
Freeze-Omni | - | 33% | 58.7% | - | - | - | - | - | |
Grok Voice Think Fast 1.0 | - | 97% | 77.8% | 52.1% | 1.25 | 3.00 | - | - | |
Grok Voice Agent | - | 93% | 71.6% | 27.4% | 0.78 | 3.00 | 3.00 | 3.00 | |
Frequently Asked Questions
Fun-Realtime-Audiochat leads with a speech reasoning score of 97.6% on the Big Bench Audio dataset across 27 models evaluated. Step-Audio R1.1 (Realtime) follows at 97.6% and Grok Voice Think Fast 1.0 at 97.1%.
Fun-Realtime-Audiochat leads with a conversational dynamics score of 97.8% on the Full Duplex Bench dataset across 19 models evaluated. GPT-Realtime-2 (Minimal) follows at 96.1% and GPT-Realtime-1.5 at 95.7%.
The top speech to speech models by reasoning quality are: 1. Fun-Realtime-Audiochat (97.6%), 2. Step-Audio R1.1 (Realtime) (97.6%), 3. Grok Voice Think Fast 1.0 (97.1%), 4. GPT-Realtime-2 (High) (96.6%), 5. Gemini 3.1 Flash Live Preview - High (96.6%). Scores are based on the Big Bench Audio benchmark.
The top speech to speech models by conversational dynamics are: 1. Fun-Realtime-Audiochat (97.8%), 2. GPT-Realtime-2 (Minimal) (96.1%), 3. GPT-Realtime-1.5 (95.7%), 4. GPT Realtime Mini (Oct 2025) (95.7%), 5. GPT-Realtime-2 (High) (95.3%). Scores are based on the Full Duplex Bench benchmark.
Gemini 2.5 Flash Native Audio Dialog has the lowest Time to First Audio at 0.63s, followed by Deepslate Opal (0.71s) and Grok Voice Agent (0.78s). Lower values mean faster initial response.
Step-Audio R1.1 (Realtime) is the most affordable at $0.064/hour for input audio, followed by Qwen3.5 Omni Plus Realtime ($0.162/hour) and Qwen3.5 Omni Flash Realtime ($0.162/hour).
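As a rough illustration of what these per-hour rates mean at volume, the sketch below estimates input-audio spend for a given number of hours. The prices are the ones quoted above; the helper name `input_audio_cost` and the 500-hour workload are hypothetical, and output audio or any other charges are not included.

```python
# Input-audio prices in USD per hour, as quoted in the answer above.
INPUT_PRICE_PER_HOUR = {
    "Step-Audio R1.1 (Realtime)": 0.064,
    "Qwen3.5 Omni Plus Realtime": 0.162,
    "Qwen3.5 Omni Flash Realtime": 0.162,
}


def input_audio_cost(model: str, hours: float) -> float:
    """Estimated input-audio cost in USD (hypothetical helper, input side only)."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return INPUT_PRICE_PER_HOUR[model] * hours


if __name__ == "__main__":
    # 500 hours is an arbitrary example workload, not a figure from the leaderboard.
    for model in INPUT_PRICE_PER_HOUR:
        print(f"{model}: ${input_audio_cost(model, 500):.2f} for 500 hours")
```

For the hypothetical 500-hour workload this works out to about $32 for Step-Audio R1.1 (Realtime) and about $81 for either Qwen3.5 Omni Realtime model.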
Full Duplex Bench evaluates how well speech to speech models handle real conversational behaviors: knowing when to speak, when to stay silent during pauses, how to respond to interruptions, and how to recognize backchannels like "yeah" or "mm-hmm". Artificial Analysis implements a subset of Full Duplex Bench v1 and v1.5.
Speech reasoning (Big Bench Audio) measures whether a model can understand and correctly answer reasoning questions delivered as audio. Conversational dynamics (Full Duplex Bench) measures whether a model can handle natural conversation flow: turn-taking, pauses, and interruptions. A model may excel at one but not the other.
Real-time conversation requires both strong conversational dynamics and low latency. By conversational dynamics score, Fun-Realtime-Audiochat (97.8%), GPT-Realtime-2 (Minimal) (96.1%), and GPT-Realtime-1.5 (95.7%) lead the field. By Time to First Audio, Gemini 2.5 Flash Native Audio Dialog (0.63s), Deepslate Opal (0.71s), and Grok Voice Agent (0.78s) are the fastest. The best choice depends on whether natural conversation flow or response speed is more critical for your use case.
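One way to weigh the two criteria together is to set a minimum conversational dynamics score and a maximum Time to First Audio, then rank whatever remains by speed. The sketch below does this over the rows of the summary table that report both metrics (Gemini 2.5 Flash Native Audio Dialog is absent because it has no conversational dynamics score there). The `shortlist` function and the thresholds are hypothetical choices for illustration, not part of the Artificial Analysis methodology.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    dynamics_pct: float  # Conversational dynamics (Full Duplex Bench subset), %
    ttfa_s: float        # Time to First Audio, seconds


# Rows from the summary table that report both metrics.
MODELS = [
    Model("Fun-Realtime-Audiochat", 97.8, 1.39),
    Model("Qwen3 Omni Flash", 72.7, 4.82),
    Model("Deepslate Opal", 85.7, 0.71),
    Model("GPT-Realtime-2 (High)", 95.3, 2.33),
    Model("GPT-Realtime-2 (Medium)", 95.2, 1.59),
    Model("GPT Realtime", 93.9, 0.98),
    Model("GPT-Realtime-1.5", 95.7, 0.82),
    Model("GPT-Realtime-2 (Minimal)", 96.1, 1.12),
    Model("GPT Realtime Mini (Oct 2025)", 95.7, 0.81),
    Model("GPT-4o Realtime (Dec 2024)", 89.8, 1.49),
    Model("Grok Voice Think Fast 1.0", 77.8, 1.25),
    Model("Grok Voice Agent", 71.6, 0.78),
]


def shortlist(models, min_dynamics_pct, max_ttfa_s):
    """Keep models that meet both thresholds, fastest first (hypothetical helper)."""
    keep = [
        m for m in models
        if m.dynamics_pct >= min_dynamics_pct and m.ttfa_s <= max_ttfa_s
    ]
    # Sort by latency; break ties in favor of the higher dynamics score.
    return sorted(keep, key=lambda m: (m.ttfa_s, -m.dynamics_pct))


if __name__ == "__main__":
    # Thresholds are illustrative: at least 95% dynamics, at most 1.0 s to first audio.
    for m in shortlist(MODELS, min_dynamics_pct=95.0, max_ttfa_s=1.0):
        print(f"{m.name}: {m.dynamics_pct}% dynamics, {m.ttfa_s}s TTFA")
```

With these example thresholds, only GPT Realtime Mini (Oct 2025) and GPT-Realtime-1.5 pass both filters; loosening either threshold widens the list, which is the trade-off described above.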
Benchmarks are updated regularly as new models and providers are added. Performance metrics are continuously monitored to reflect current provider capabilities and pricing changes.