
Speech to Speech AI Model & Provider Leaderboard

Analysis and comparison of speech to speech models and API providers. Artificial Analysis evaluates speech to speech models and their hosting providers on reasoning quality, conversational dynamics, generation time, and price.

For further details, see the methodology page.

Speech Reasoning: based on the Artificial Analysis Big Bench Audio dataset; higher is better
Speed: Time to First Audio (seconds) on Big Bench Audio; lower is better
Input Price: price per hour of input audio (USD); lower is better

Speech Reasoning

Speech Reasoning (Big Bench Audio)

Speech Reasoning: Based on the Artificial Analysis Big Bench Audio dataset; Higher is better

Big Bench Audio is an Artificial Analysis benchmark comprising 1,000 audio questions adapted from Big Bench Hard, designed to test the reasoning ability of native audio models. Questions span four categories (250 each): Formal Fallacies (determine whether an argument is deductively valid or invalid), Navigate (determine whether navigation steps return to the starting point), Object Counting (count the number of a specific item class from a list of possessions), and Web of Lies (evaluate the truth value of a Boolean function expressed as a word problem). Models receive an audio input and generate an audio output, which is evaluated as correct or incorrect.
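As a rough illustration of how per-question correct/incorrect judgments could roll up into a single Speech Reasoning percentage, here is a minimal sketch. It assumes the headline score is the plain fraction of correct answers over all questions, which this page does not state explicitly; the function name and the toy outcomes are hypothetical.

```python
from collections import Counter

CATEGORIES = ("Formal Fallacies", "Navigate", "Object Counting", "Web of Lies")


def speech_reasoning_score(results):
    """Aggregate (category, is_correct) pairs into overall and per-category accuracy.

    Assumption: the headline score is simply correct / total.
    """
    totals, correct = Counter(), Counter()
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    per_category = {c: correct[c] / totals[c] for c in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_category


# Toy data only: 250 questions per category with made-up outcomes.
toy = []
for i, cat in enumerate(CATEGORIES):
    n_right = 200 + 10 * i  # 200, 210, 220, 230 correct answers
    toy += [(cat, True)] * n_right + [(cat, False)] * (250 - n_right)

overall, per_category = speech_reasoning_score(toy)
print(f"overall: {overall:.1%}")
for cat in CATEGORIES:
    print(f"{cat}: {per_category[cat]:.1%}")
```

Because each category has the same number of questions, the overall accuracy in this sketch equals the unweighted mean of the four category accuracies.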

Conversational Dynamics

Conversational Dynamics (Full Duplex Bench subset)

Weighted average of pause handling, turn-taking, interruption handling, and backchannel handling from Full Duplex Bench v1 and v1.5; Higher is better

Metrics from Full Duplex Bench v1 include Pause Handling (% of samples where the model correctly does not interrupt) and Turn Taking (% of samples where the model correctly takes the turn in conversation). Metrics from Full Duplex Bench v1.5 include User Interruption Handling (% of samples where the model correctly addresses the user's interruption) and Backchannel Handling (% of samples where the model correctly continues when a backchannel such as "yeah" or "alright" is played).
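The weighted average described above can be sketched as follows. The actual weights are not given on this page, so the equal weights used by default are an assumption, and the function name, metric keys, and example scores are all hypothetical.

```python
def conversational_dynamics(scores, weights=None):
    """Weighted average of category scores (each a percentage from 0 to 100).

    Assumption: equal weights unless told otherwise; the real weighting
    is not stated in the source.
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Made-up category scores, not taken from the leaderboard.
example = {
    "pause_handling": 90.0,
    "turn_taking": 80.0,
    "user_interruption_handling": 70.0,
    "backchannel_handling": 60.0,
}

equal = conversational_dynamics(example)
pause_heavy = conversational_dynamics(
    example,
    weights={
        "pause_handling": 2.0,
        "turn_taking": 1.0,
        "user_interruption_handling": 1.0,
        "backchannel_handling": 1.0,
    },
)
print(equal, pause_heavy)
```

Changing the weights shifts the combined score toward whichever behavior is weighted most, which is why the category breakdown is worth reading alongside the headline number.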

Conversational Dynamics by Category

Conversational Dynamics (Full Duplex Bench subset) - Category Breakdown

Individual category scores, ordered by Conversational Dynamics score and based on a subset of Full Duplex Bench v1 and v1.5. Categories: Pause Handling, Turn Taking, User Interruption Handling, Backchannel Handling.


API Benchmarks

Speech Reasoning (Big Bench Audio) vs Input Price

Big Bench Audio score vs input price (USD per hour); Higher speech reasoning and lower price are better
[Scatter plot with a highlighted "Most attractive quadrant". Models plotted: Gemini 2.5 Flash Native Audio Dialog, Gemini 2.5 Flash Native Audio Dialog Thinking, Gemini 3.1 Flash Live Preview - High, Gemini 3.1 Flash Live Preview - Minimal, GPT Realtime, GPT Realtime Mini (Oct 2025), GPT-4o audio chatcompletions, GPT-4o mini Realtime (Dec 2024), GPT-Realtime-1.5, Grok Voice Agent, Nova 2.0 Sonic, Qwen3 Omni Flash, Qwen3 Omni Realtime, Step-Audio R1.1 (Realtime)]

Price per hour of audio included in the request/message sent to the API, represented as USD per hour of audio.
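One way to reproduce a "most attractive quadrant" from these two axes is sketched below. How the chart draws its quadrant boundaries is not stated on this page, so splitting at the median of each axis is an assumption. The scores and prices are the rounded values from the summary table on this page, and the variable names are hypothetical.

```python
from statistics import median

# (speech reasoning %, input price in USD per hour), rounded table values.
models = {
    "Gemini 2.5 Flash Native Audio Dialog": (69, 0.35),
    "Gemini 2.5 Flash Native Audio Dialog Thinking": (91, 0.35),
    "Gemini 3.1 Flash Live Preview - High": (96, 0.35),
    "Gemini 3.1 Flash Live Preview - Minimal": (71, 0.35),
    "GPT Realtime": (83, 1.15),
    "GPT Realtime Mini (Oct 2025)": (62, 0.36),
    "GPT-4o audio chatcompletions": (54, 3.60),
    "GPT-4o mini Realtime (Dec 2024)": (69, 0.36),
    "GPT-Realtime-1.5": (81, 1.15),
    "Grok Voice Agent": (93, 3.00),
    "Nova 2.0 Sonic": (87, 0.27),
    "Qwen3 Omni Flash": (58, 0.61),
    "Qwen3 Omni Realtime": (57, 0.65),
    "Step-Audio R1.1 (Realtime)": (97, 0.06),
}

# Assumption: the quadrant is bounded by the median of each axis.
score_cut = median(score for score, _ in models.values())
price_cut = median(price for _, price in models.values())

attractive = sorted(
    name
    for name, (score, price) in models.items()
    if score >= score_cut and price <= price_cut
)
for name in attractive:
    print(name)
```

A different boundary (for example a fixed price ceiling that fits a budget) would change which models qualify, so the cutoffs are best treated as parameters.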


Input Price

Price per hour of input audio (USD); Lower is better

Price per hour of audio included in the request/message sent to the API, represented as USD per hour of audio.
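To make the per-hour pricing concrete, the sketch below estimates the cost of a single session from its input and output audio durations. It assumes billing scales linearly with audio duration, which may not match every provider's actual billing (some bill per token); the function name is hypothetical, and the example prices are the GPT Realtime figures listed in the summary table on this page.

```python
def session_cost_usd(input_seconds, output_seconds,
                     input_price_per_hour, output_price_per_hour):
    """Estimate session cost, assuming price scales linearly with audio duration."""
    input_cost = input_price_per_hour * input_seconds / 3600
    output_cost = output_price_per_hour * output_seconds / 3600
    return input_cost + output_cost


# 10 minutes of user audio and 5 minutes of model audio
# at 1.15 USD/hour input and 4.61 USD/hour output.
cost = session_cost_usd(600, 300, 1.15, 4.61)
print(f"${cost:.4f}")
```

Output audio is several times more expensive per hour than input audio for most models in the table, so sessions with long model responses are dominated by the output price.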

Speed

Seconds to generate the first token of audio output on Big Bench Audio dataset; Lower is better

Number of seconds required to generate the first token of audio output. A lower value represents a faster generation.
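A minimal sketch of measuring time to first audio against a streaming response follows. It assumes the clock starts when the request is issued and stops when the first non-empty audio chunk arrives; the exact measurement points used by Artificial Analysis are not described on this page. The `start_stream` callable and `fake_stream` generator are hypothetical stand-ins for a real provider client.

```python
import time


def time_to_first_audio(start_stream):
    """Return (seconds until the first non-empty audio chunk, that chunk).

    start_stream: a zero-argument callable that sends the request and returns
    an iterator of audio byte chunks (hypothetical interface).
    """
    start = time.perf_counter()
    for chunk in start_stream():
        if chunk:  # skip empty keep-alive chunks
            return time.perf_counter() - start, chunk
    raise RuntimeError("stream ended before any audio arrived")


def fake_stream():
    """Stand-in for a real API: waits 50 ms, then yields two audio chunks."""
    time.sleep(0.05)
    yield b"\x00\x01"
    yield b"\x02\x03"


ttfa, first_chunk = time_to_first_audio(fake_stream)
print(f"time to first audio: {ttfa:.3f}s")
```

In practice a single measurement is noisy, so repeating the call and reporting a median would give a steadier number.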

Summary of key metrics & further information

| Provider | Model Name | Footnotes | Speech Reasoning | Conversational Dynamics score | Time to first audio (secs) | Price per hour of audio input (USD) | Price per hour of audio output (USD) |
|---|---|---|---|---|---|---|---|
| StepFun | Step-Audio R1.1 (Realtime) | - | 97% | - | 1.51 | 0.06 | 1.69 |
| Hathora | Qwen3 Omni Flash, Hathora | - | - | - | 2.80 | - | - |
| Alibaba Cloud | Qwen3 Omni Flash | - | 58% | 72.7% | 4.82 | 0.61 | 1.36 |
| Alibaba Cloud | Qwen3 Omni Realtime | - | 57% | - | 0.88 | 0.65 | 0.99 |
| NVIDIA | Nemotron Voicechat | - | 39% | 77.8% | - | - | - |
| NVIDIA | PersonaPlex | - | 19% | 91.0% | - | - | - |
| Amazon Bedrock | Nova 2.0 Sonic | - | 87% | - | 1.39 | 0.27 | 1.08 |
| Kyutai | Moshi | - | 4% | 61.0% | - | - | - |
| xAI | Grok Voice Agent | - | 93% | 71.6% | 0.78 | 3.00 | 3.00 |
| OpenAI | GPT Realtime | - | 83% | 93.9% | 0.98 | 1.15 | 4.61 |
| OpenAI | GPT-Realtime-1.5 | - | 81% | 95.7% | 0.82 | 1.15 | 4.61 |
| OpenAI | GPT-4o mini Realtime (Dec 2024) | - | 69% | - | 1.27 | 0.36 | 1.44 |
| OpenAI | GPT Realtime Mini (Oct 2025) | - | 62% | 95.7% | 0.81 | 0.36 | 1.44 |
| OpenAI | GPT-4o audio chatcompletions | | 54% | - | 3.38 | 3.60 | 14.40 |
| OpenAI | GPT-4o Realtime (Dec 2024) | - | - | 89.8% | 1.49 | 1.44 | 5.76 |
| Google | Gemini 3.1 Flash Live Preview - High | - | 96% | - | 2.98 | 0.35 | 1.38 |
| Google | Gemini 2.5 Flash Native Audio Dialog Thinking | - | 91% | - | 3.87 | 0.35 | 1.38 |
| Google | Gemini 3.1 Flash Live Preview - Minimal | - | 71% | - | 0.96 | 0.35 | 1.38 |
| Google | Gemini 2.5 Flash Native Audio Dialog | - | 69% | - | 0.63 | 0.35 | 1.38 |
| Google | Gemini 2.5 Flash Native Audio Preview (Sep 2025) | - | - | 30.3% | - | - | - |
| Google | Gemini 2.5 Flash Native Audio Preview (Dec 2025) | - | - | 44.0% | - | - | - |
| VITA | Freeze-Omni | - | 32% | 58.7% | - | - | - |
| Cofe AI | FLM-Audio | - | 16% | 62.0% | - | - | - |

Frequently Asked Questions

Common questions about Speech to Speech models

Step-Audio R1.1 (Realtime) leads with a Speech Reasoning score of 97.0% on the Big Bench Audio dataset across 19 models evaluated. Gemini 3.1 Flash Live Preview - High follows at 95.9% and Grok Voice Agent at 92.9%.

GPT-Realtime-1.5 leads with a Conversational Dynamics score of 95.7% on the Full Duplex Bench dataset across 13 models evaluated. GPT Realtime Mini (Oct 2025) follows at the same rounded score of 95.7%, then GPT Realtime at 93.9%.

The top speech to speech models by reasoning quality are: 1. Step-Audio R1.1 (Realtime) (97.0%), 2. Gemini 3.1 Flash Live Preview - High (95.9%), 3. Grok Voice Agent (92.9%), 4. Gemini 2.5 Flash Native Audio Dialog Thinking (90.7%), 5. Nova 2.0 Sonic (86.6%). Scores are based on the Big Bench Audio benchmark.

The top speech to speech models by conversational dynamics are: 1. GPT-Realtime-1.5 (95.7%), 2. GPT Realtime Mini (Oct 2025) (95.7%), 3. GPT Realtime (93.9%), 4. PersonaPlex (91.0%), 5. GPT-4o Realtime (Dec 2024) (89.8%). Scores are based on the Full Duplex Bench benchmark.

Gemini 2.5 Flash Native Audio Dialog has the lowest Time to First Audio at 0.63s, followed by Grok Voice Agent (0.78s) and GPT Realtime Mini (Oct 2025) (0.81s). Lower values mean faster initial response.

Step-Audio R1.1 (Realtime) is the most affordable at $0.064/hour for input audio, followed by Nova 2.0 Sonic ($0.27/hour) and Gemini 3.1 Flash Live Preview - Minimal ($0.3456/hour).

Full Duplex Bench evaluates how well speech to speech models handle real conversational behaviors: knowing when to speak, when to stay silent during pauses, how to respond to interruptions, and how to recognize backchannels like "yeah" or "mm-hmm". Artificial Analysis implements a subset of Full Duplex Bench v1 and v1.5.

Speech Reasoning (Big Bench Audio) measures whether a model can understand and correctly answer reasoning questions delivered as audio. Conversational Dynamics (Full Duplex Bench) measures whether a model can handle natural conversation flow: turn-taking, pauses, and interruptions. A model may excel at one but not the other.

Real-time conversation requires both strong conversational dynamics and low latency. By Conversational Dynamics score, GPT-Realtime-1.5 (95.7%), GPT Realtime Mini (Oct 2025) (95.7%), and GPT Realtime (93.9%) lead the field. By Time to First Audio, Gemini 2.5 Flash Native Audio Dialog (0.63s), Grok Voice Agent (0.78s), and GPT Realtime Mini (Oct 2025) (0.81s) are the fastest. The best choice depends on whether natural conversation flow or response speed is more critical for your use case.
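One hedged way to frame that trade-off is to keep only the models that no other model beats on both axes at once (a Pareto front). The sketch below uses the rounded Conversational Dynamics and Time to First Audio values from the summary table on this page, restricted to models that report both; the function name is hypothetical, and this is an illustrative filter rather than a ranking method used by Artificial Analysis.

```python
# (conversational dynamics %, time to first audio in seconds), rounded table values.
candidates = {
    "Qwen3 Omni Flash": (72.7, 4.82),
    "Grok Voice Agent": (71.6, 0.78),
    "GPT Realtime": (93.9, 0.98),
    "GPT-Realtime-1.5": (95.7, 0.82),
    "GPT Realtime Mini (Oct 2025)": (95.7, 0.81),
    "GPT-4o Realtime (Dec 2024)": (89.8, 1.49),
}


def pareto_front(models):
    """Return models not dominated on (higher score, lower latency)."""
    front = []
    for name, (score, ttfa) in models.items():
        dominated = any(
            o_score >= score and o_ttfa <= ttfa
            and (o_score > score or o_ttfa < ttfa)
            for other, (o_score, o_ttfa) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


front = pareto_front(candidates)
print(front)
```

Models on the front represent distinct trade-offs between conversation quality and latency; which one to pick still depends on the use case.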

Benchmarks are updated regularly as new models and providers are added. Performance metrics are continuously monitored to reflect current provider capabilities and pricing changes.