Speech to Speech AI Model & Provider Leaderboard

Analysis and comparison of speech to speech models and API providers. Artificial Analysis evaluates speech to speech models and hosting providers on reasoning quality, conversational dynamics, agentic performance, time to first audio, and price.

For further details, see the methodology page.

Highlights

Artificial Analysis Speech to Speech Index: weighted average of Speech Reasoning, Conversational Dynamics, and Agentic Performance results · Higher is better
Time to First Audio: time to first audio (seconds) on Big Bench Audio · Lower is better
Cost per Hour of Input Audio: cost to complete a fixed 40-question Big Bench Audio subset based on length of input audio, normalized to an hourly basis · Lower is better

Artificial Analysis Speech to Speech Index

Weighted average of Speech Reasoning, Conversational Dynamics, and Agentic Performance results · Only models with all data available shown · Higher is better

Weighted average score for native audio models across Speech Reasoning (Big Bench Audio), Conversational Dynamics (Full Duplex Bench), and Agentic Performance (𝜏-Voice). Models must have results for all three datasets to receive an index score.
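The index calculation described above can be sketched as follows. This is a minimal illustration, not the published method: the page does not state the weights, so equal weights are an assumption here, and the function name `speech_to_speech_index` is hypothetical. Equal weights do reproduce at least some listed values; for example, GPT-Realtime-2 (High) with 96.6%, 95.3%, and 39.8% averages to 77.2%.

```python
from typing import Optional, Tuple


def speech_to_speech_index(
    speech_reasoning: Optional[float],
    conversational_dynamics: Optional[float],
    agentic_performance: Optional[float],
    # Assumption: equal weights. The source does not publish the weights.
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> Optional[float]:
    """Weighted average of the three component scores (all in percent).

    Returns None when any component is missing, mirroring the rule that a
    model needs results on all three datasets to receive an index score.
    """
    scores = (speech_reasoning, conversational_dynamics, agentic_performance)
    if any(score is None for score in scores):
        return None
    weighted_sum = sum(w * s for w, s in zip(weights, scores))
    return weighted_sum / sum(weights)


# Component scores for GPT-Realtime-2 (High) as reported on this page.
print(round(speech_to_speech_index(96.6, 95.3, 39.8), 1))  # 77.2
# A model with no agentic result receives no index score.
print(speech_to_speech_index(97.6, 97.8, None))  # None
```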

Speech Reasoning

Speech Reasoning (Big Bench Audio)

Speech reasoning: based on the Artificial Analysis Big Bench Audio dataset · Higher is better

Big Bench Audio is an Artificial Analysis benchmark comprising 1,000 audio questions adapted from Big Bench Hard, designed to test the reasoning ability of native audio models. Questions span four categories (250 each): Formal Fallacies (determine whether an argument is deductively valid or invalid), Navigate (determine whether navigation steps return to the starting point), Object Counting (count the number of a specific item class from a list of possessions), and Web of Lies (evaluate the truth value of a Boolean function expressed as a word problem). Models receive an audio input and generate an audio output, which is evaluated as correct or incorrect.
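As a rough sketch of how correct/incorrect grades could be rolled up into a score, the snippet below computes per-category and overall accuracy. The function name and the `(category, is_correct)` input shape are assumptions for illustration; the source does not describe its aggregation code. Because each category holds the same number of questions (250), the overall accuracy and the mean of the per-category accuracies coincide on the full dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

CATEGORIES = ("Formal Fallacies", "Navigate", "Object Counting", "Web of Lies")


def accuracy_by_category(
    results: Iterable[Tuple[str, bool]],
) -> Tuple[Dict[str, float], float]:
    """Return (per-category accuracy, overall accuracy) from graded answers.

    `results` is a hypothetical input format: one (category, is_correct)
    pair per question.
    """
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    if not totals:
        raise ValueError("no graded results supplied")
    per_category = {c: correct[c] / totals[c] for c in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return per_category, overall


# Toy data (not real benchmark results): two questions per category.
toy = [(c, True) for c in CATEGORIES] + [
    ("Formal Fallacies", False),
    ("Navigate", True),
    ("Object Counting", False),
    ("Web of Lies", True),
]
per_category, overall = accuracy_by_category(toy)
print(per_category["Navigate"], overall)  # 1.0 0.75
```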

Agentic Performance

Agentic Performance (𝜏-Voice)

Proportion of replica customer service scenarios resolved while acting as a customer support agent, based on the 𝜏-Voice benchmark · Higher is better · Only full duplex models

𝜏-Voice measures task completion in replicas of real-world customer service scenarios. A voice user simulator calls the tested model with a domain-specific problem (e.g., requesting a flight change, disputing a retail charge, resolving a telecom issue) across domains inherited from 𝜏-bench and 𝜏²-bench. The tested model is prompted as a customer support agent with access to domain-specific tools and a policy document. Each task has a single valid database end-state, and evaluation compares the final state against this ground truth. Charts use average performance across 3 trials where available. For more detail on the original benchmark, see the 𝜏-Voice paper.

Note: Results for the following models are based on 1 trial: GPT-Realtime-2 (Minimal), OpenAI
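The end-state comparison and trial averaging can be illustrated with a small sketch. Plain dictionary equality stands in for the benchmark's actual state comparison, and the function names and data layout are hypothetical; the 𝜏-Voice paper is the authority on the real procedure.

```python
from typing import Any, Dict, List, Tuple

State = Dict[str, Any]


def task_resolved(final_state: State, expected_state: State) -> bool:
    """A task counts as resolved only if the final database state matches
    the single valid end-state. Assumption: states compare as plain dicts."""
    return final_state == expected_state


def resolution_rate(trials: List[List[Tuple[State, State]]]) -> float:
    """Average the per-trial proportion of resolved tasks across trials.

    Each trial is a list of (final_state, expected_state) pairs.
    """
    if not trials or any(not trial for trial in trials):
        raise ValueError("every trial needs at least one task")
    per_trial = [
        sum(task_resolved(final, expected) for final, expected in trial) / len(trial)
        for trial in trials
    ]
    return sum(per_trial) / len(per_trial)


# Toy data (not real results): two tasks, two trials.
expected_a = {"booking_42": {"flight": "B"}}
expected_b = {"order_7": {"refunded": True}}
trial_1 = [(dict(expected_a), expected_a), (dict(expected_b), expected_b)]
trial_2 = [(dict(expected_a), expected_a), ({"order_7": {"refunded": False}}, expected_b)]
print(resolution_rate([trial_1, trial_2]))  # 0.75
```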

Agentic Performance (𝜏-Voice) by Domain

Proportion of replica customer service scenarios resolved by domain, based on the 𝜏-Voice benchmark · Higher is better · Only full duplex models

Conversational Dynamics

Conversational Dynamics (Full Duplex Bench subset)

Weighted average of pause handling, turn-taking, interruption handling, and backchannel handling from Full Duplex Bench v1 and v1.5 · Higher is better

Metrics from Full Duplex Bench v1 include Pause Handling (% of samples in which the model correctly does not interrupt) and Turn Taking (% of samples in which the model correctly takes the turn in the conversation). Metrics from Full Duplex Bench v1.5 include User Interruption Handling (% of samples in which the model correctly addresses the user's interruption) and Backchannel Handling (% of samples in which the model correctly continues when a backchannel such as "yeah" or "alright" is played).

Conversational Dynamics (Full Duplex Bench subset) - Category Breakdown

Individual category scores ordered by conversational dynamics score · Based on subset of Full Duplex Bench v1 and v1.5

API Benchmarks

Artificial Analysis Speech to Speech Index vs. Cost per Hour of Input Audio

Artificial Analysis Speech to Speech Index vs. cost per hour of input audio · Only models with all data available shown · Higher index and lower cost are better

Cost to complete a fixed 40-question Big Bench Audio subset, normalized to an hourly basis using input audio duration. It incorporates audio input, audio output, text input, text output, and separately exposed reasoning/thinking tokens. It does not include cached-token discounts or tool-call costs.
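The hourly normalization described above amounts to dividing the total benchmark cost by the input audio duration and scaling to 3,600 seconds. The sketch below assumes that reading; the function names and the example figures are hypothetical and are not taken from the leaderboard.

```python
def total_run_cost_usd(
    audio_input: float,
    audio_output: float,
    text_input: float,
    text_output: float,
    reasoning: float = 0.0,
) -> float:
    """Sum the cost components named in the description (all in USD).

    Cached-token discounts and tool-call costs are deliberately left out,
    matching what the description says is excluded.
    """
    return audio_input + audio_output + text_input + text_output + reasoning


def cost_per_hour_of_input_audio(total_cost_usd: float, input_audio_seconds: float) -> float:
    """Normalize a run's total cost to one hour of input audio."""
    if input_audio_seconds <= 0:
        raise ValueError("input_audio_seconds must be positive")
    return total_cost_usd * 3600.0 / input_audio_seconds


# Hypothetical run: 10 minutes of input audio costing about $0.05 in total.
run_cost = total_run_cost_usd(0.02, 0.02, 0.005, 0.003, 0.002)
hourly = cost_per_hour_of_input_audio(run_cost, input_audio_seconds=600.0)
print(f"${hourly:.2f} per hour of input audio")
```

Normalizing by input duration rather than total call duration means a model that produces long audio answers or many reasoning tokens shows a higher hourly figure for the same questions.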

Cost per Hour of Input Audio (Big Bench Audio subset)

Cost to complete fixed 40-question Big Bench Audio subset based on length of input audio, normalized to hourly basis · Lower is better

Time to First Audio

Average time to first audio (seconds) on Big Bench Audio dataset · Lower is better

Number of seconds required to generate the first token of audio output. A lower value means a faster initial response.
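One way such a measurement might be taken is to start a clock when the request is issued and stop it at the first non-empty audio chunk. The sketch below assumes a streaming response exposed as a Python iterable of byte chunks; `fake_stream` is a stand-in for a real provider stream, and the exact point at which Artificial Analysis starts its timer is not stated in the source.

```python
import time
from typing import Iterable, Iterator, Optional


def time_to_first_audio(audio_chunks: Iterable[bytes]) -> Optional[float]:
    """Seconds from the start of iteration until the first non-empty chunk.

    Returns None if the stream ends without producing any audio.
    """
    start = time.perf_counter()
    for chunk in audio_chunks:
        if chunk:  # skip empty keep-alive chunks
            return time.perf_counter() - start
    return None


def fake_stream(delay_s: float) -> Iterator[bytes]:
    """Hypothetical stand-in for a provider's audio stream."""
    time.sleep(delay_s)  # simulated model latency before the first audio
    yield b"\x00\x01"
    yield b"\x02\x03"


ttfa = time_to_first_audio(fake_stream(0.05))
print(f"time to first audio: {ttfa:.2f}s")
```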

Summary of Key Metrics & Further Information

| Provider | Model | Speech to Speech Index | Speech Reasoning (Big Bench Audio) | Conversational Dynamics (Full Duplex Bench) | Agentic Performance (𝜏-Voice) | Time to First Audio (s) | Cost per Hour of Input Audio, Big Bench Audio subset (USD) | Input Audio Price (USD/hour) | Output Audio Price (USD/hour) |
|---|---|---|---|---|---|---|---|---|---|
| Alibaba Cloud | Fun-Realtime-Audiochat | - | 98% | 97.8% | - | 1.39 | - | - | - |
| Alibaba Cloud | Qwen3.5 Omni Plus Realtime | - | 73% | - | - | 1.26 | 0.00 | 0.16 | 0.82 |
| Alibaba Cloud | Qwen3.5 Omni Flash Realtime | - | 59% | - | - | 0.79 | 0.00 | 0.16 | 0.82 |
| Alibaba Cloud | Qwen3 Omni Flash | - | 59% | 72.7% | - | 4.82 | 1.77 | 0.61 | 1.36 |
| Alibaba Cloud | Qwen3 Omni Realtime | - | 57% | - | - | 0.88 | 2.26 | 0.65 | 0.99 |
| Amazon Bedrock | Nova 2.0 Sonic (Mar 2026) | - | 88% | - | - | 1.14 | - | 0.27 | 1.08 |
| Cofe AI | FLM-Audio | - | 16% | 62.0% | - | - | - | - | - |
| Deepslate | Deepslate Opal | 62.8% | 85% | 85.7% | 17.5% | 0.44 | 6.48 | - | - |
| Google | Gemini 3.1 Flash - High | 69.5% | 97% | 74.3% | 37.7% | 2.98 | 1.75 | 0.35 | 1.38 |
| Google | Gemini 2.5 Flash Native Audio Dialog Thinking | - | 91% | - | - | 3.87 | - | 0.35 | 1.38 |
| Google | Gemini 3.1 Flash - Minimal | 56.6% | 71% | 72.3% | 26.2% | 0.96 | 1.50 | 0.35 | 1.38 |
| Google | Gemini 2.5 Flash Native Audio Dialog | - | 69% | - | - | 0.63 | 1.42 | 0.35 | 1.38 |
| Google | Gemini 2.5 Flash Native Audio Preview (Dec 2025) | - | - | 44.0% | 22.8% | - | - | - | - |
| Google | Gemini 2.5 Flash Native Audio Preview (Sep 2025) | - | - | 30.3% | - | - | - | - | - |
| Hathora | Qwen3 Omni Flash, Hathora | - | - | - | - | 2.80 | - | - | - |
| Kyutai | Moshi | - | 4% | 61.0% | - | - | - | - | - |
| NVIDIA | Nemotron Voicechat | - | 39% | 77.8% | - | - | - | - | - |
| NVIDIA | PersonaPlex | - | 19% | 91.0% | - | - | - | - | - |
| OpenAI | GPT-Realtime-2 (High) | 77.2% | 97% | 95.3% | 39.8% | 2.33 | 4.14 | 1.15 | 4.61 |
| OpenAI | GPT-Realtime-2 (Medium) | 75.3% | 93% | 95.2% | 37.4% | 1.59 | 3.97 | 1.15 | 4.61 |
| OpenAI | GPT Realtime | 69.2% | 83% | 93.9% | 30.4% | 0.98 | 11.08 | 1.15 | 4.61 |
| OpenAI | GPT-Realtime-1.5 | 72.0% | 81% | 95.7% | 38.8% | 0.82 | 11.44 | 1.15 | 4.61 |
| OpenAI | GPT-Realtime-2 (Minimal) | 66.2% | 72% | 96.1% | 30.8% | 1.12 | 3.07 | 1.15 | 4.61 |
| OpenAI | GPT-4o mini Realtime (Dec 2024) | - | 69% | - | - | 1.27 | 5.75 | 0.36 | 1.44 |
| OpenAI | GPT Realtime Mini (Oct 2025) | 58.1% | 64% | 95.7% | 15.1% | 0.81 | 3.04 | 0.36 | 1.44 |
| OpenAI | GPT-4o audio chatcompletions | - | 54% | - | - | 3.38 | 0.00 | 3.60 | 14.40 |
| OpenAI | GPT-4o Realtime (Dec 2024) | - | - | 89.8% | 27.9% | 1.49 | 2.04 | 1.44 | 5.76 |
| StepFun | Step-Audio R1.1 (Realtime) | - | 98% | - | - | 1.51 | 0.00 | 0.06 | 1.69 |
| VITA | Freeze-Omni | - | 33% | 58.7% | - | - | - | - | - |
| xAI | Grok Voice Think Fast 1.0 | 75.7% | 97% | 77.8% | 52.1% | 1.25 | 3.00 | - | - |
| xAI | Grok Voice Agent | 64.1% | 93% | 71.6% | 27.4% | 0.78 | 3.00 | 3.00 | 3.00 |

Frequently Asked Questions

Fun-Realtime-Audiochat leads with a speech reasoning score of 97.6% on the Big Bench Audio dataset across 27 models evaluated. Step-Audio R1.1 (Realtime) follows at 97.6% and Grok Voice Think Fast 1.0 at 97.1%.

Fun-Realtime-Audiochat leads with a conversational dynamics score of 97.8% on the Full Duplex Bench dataset across 21 models evaluated. GPT-Realtime-2 (Minimal) follows at 96.1% and GPT-Realtime-1.5 at 95.7%.

The top speech to speech models by reasoning quality are: 1. Fun-Realtime-Audiochat (97.6%), 2. Step-Audio R1.1 (Realtime) (97.6%), 3. Grok Voice Think Fast 1.0 (97.1%), 4. GPT-Realtime-2 (High) (96.6%), 5. Gemini 3.1 Flash - High (96.6%). Scores are based on the Big Bench Audio benchmark.

The top speech to speech models by conversational dynamics are: 1. Fun-Realtime-Audiochat (97.8%), 2. GPT-Realtime-2 (Minimal) (96.1%), 3. GPT-Realtime-1.5 (95.7%), 4. GPT Realtime Mini (Oct 2025) (95.7%), 5. GPT-Realtime-2 (High) (95.3%). Scores are based on the Full Duplex Bench benchmark.

Deepslate Opal has the lowest Time to First Audio at 0.44s, followed by Gemini 2.5 Flash Native Audio Dialog (0.63s) and Grok Voice Agent (0.78s). Lower values mean faster initial response.

Step-Audio R1.1 (Realtime) is the most affordable at $0.064/hour for input audio, followed by Qwen3.5 Omni Plus Realtime ($0.162/hour) and Qwen3.5 Omni Flash Realtime ($0.162/hour).

Full Duplex Bench evaluates how well speech to speech models handle real conversational behaviors: knowing when to speak, when to stay silent during pauses, how to respond to interruptions, and how to recognize backchannels like "yeah" or "mm-hmm". Artificial Analysis implements a subset of Full Duplex Bench v1 and v1.5.

Speech reasoning (Big Bench Audio) measures whether a model can understand and correctly answer reasoning questions delivered as audio. Conversational dynamics (Full Duplex Bench) measures whether a model can handle natural conversation flow: turn-taking, pauses, interruptions, and backchannels. A model may excel at one but not the other.

Real-time conversation requires both strong conversational dynamics and low latency. By conversational dynamics score, Fun-Realtime-Audiochat (97.8%), GPT-Realtime-2 (Minimal) (96.1%), and GPT-Realtime-1.5 (95.7%) lead the field. By Time to First Audio, Deepslate Opal (0.44s), Gemini 2.5 Flash Native Audio Dialog (0.63s), and Grok Voice Agent (0.78s) are the fastest. The best choice depends on whether natural conversation flow or response speed is more critical for your use case.

Benchmarks are updated regularly as new models and providers are added. Performance metrics are continuously monitored to reflect current provider capabilities and pricing changes.