Speech to Speech AI Model & Provider Leaderboard

Name: AA-Speech to Speech Index
Creator: Artificial Analysis
License: https://artificialanalysis.ai/docs/legal/Terms-of-Use.pdf

Analysis and comparison of speech to speech models and API providers. Artificial Analysis evaluates speech to speech models and hosting providers on reasoning quality, conversational dynamics, generation time, and price.

For further details, see the methodology page.

Highlights

AA-Speech to Speech Index

Weighted average of Speech Reasoning, Conversational Dynamics, and Agentic Performance results · Higher is better

Speed

Time to first audio (seconds) on Big Bench Audio · Lower is better

Cost per Hour of Input Audio

Cost to complete fixed 40-question Big Bench Audio subset based on length of input audio, normalized to hourly basis · Lower is better

Artificial Analysis Speech to Speech Index

Weighted average of Speech Reasoning, Conversational Dynamics, and Agentic Performance results · Only models with all data available shown · Higher is better

Weighted average score for native audio models across Speech Reasoning (Big Bench Audio), Conversational Dynamics (Full Duplex Bench), and Agentic Performance (𝜏-Voice). Models must have results for all three datasets to receive an index score.
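
The index calculation described above can be sketched as follows. This is a minimal illustration, not the Artificial Analysis implementation: the page does not state the weights, so equal weighting is an assumption, and the component names and the function `speech_to_speech_index` are hypothetical.

```python
from typing import Mapping, Optional

# Hypothetical component names. The page does not publish the weights,
# so equal weighting is an assumption made only for this sketch.
COMPONENTS = ("speech_reasoning", "conversational_dynamics", "agentic_performance")


def speech_to_speech_index(
    scores: Mapping[str, Optional[float]],
    weights: Optional[Mapping[str, float]] = None,
) -> Optional[float]:
    """Return the weighted average of the three component scores (0-100).

    Returns None when any component is missing, mirroring the rule that a
    model needs results for all three datasets to receive an index score.
    """
    if any(scores.get(name) is None for name in COMPONENTS):
        return None
    if weights is None:
        weights = {name: 1.0 for name in COMPONENTS}  # assumed equal weights
    total_weight = sum(weights[name] for name in COMPONENTS)
    weighted_sum = sum(scores[name] * weights[name] for name in COMPONENTS)
    return weighted_sum / total_weight
```

Passing the component scores reported on this page for Qwen Audio 3.0 Realtime Plus (99.2, 98.4, 54.6) gives roughly 84.07 under equal weights, which is close to the 84.1% listed in the summary table. That agreement does not confirm the actual weighting.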

Speech Reasoning

Speech Reasoning (Big Bench Audio)

Speech reasoning: based on the Artificial Analysis Big Bench Audio dataset · Higher is better

Big Bench Audio is an Artificial Analysis benchmark comprising 1,000 audio questions adapted from Big Bench Hard, designed to test the reasoning ability of native audio models. Questions span four categories (250 each): Formal Fallacies (determine whether an argument is deductively valid or invalid), Navigate (determine whether navigation steps return to the starting point), Object Counting (count the number of a specific item class from a list of possessions), and Web of Lies (evaluate the truth value of a Boolean function expressed as a word problem). Models receive an audio input and generate an audio output, which is evaluated as correct or incorrect.
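
Because each answer is graded as correct or incorrect, the score reduces to an accuracy over questions. The sketch below shows one way to tally overall and per-category accuracy. The function name and the `(category, is_correct)` record shape are hypothetical, and the grading step itself (deciding whether an audio answer is correct) is outside its scope.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_category(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the fraction of correct answers per category and overall.

    `results` holds one (category, is_correct) pair per graded question.
    """
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for category, is_correct in results:
        for key in (category, "overall"):
            totals[key] += 1
            correct[key] += int(is_correct)
    return {key: correct[key] / totals[key] for key in totals}


# Illustrative grades only, not benchmark data.
example = [
    ("Navigate", True),
    ("Navigate", False),
    ("Web of Lies", True),
    ("Object Counting", True),
]
scores = accuracy_by_category(example)
```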

Agentic Performance

Agentic Performance (𝜏-Voice)

Proportion of replica customer service scenarios resolved while acting as a customer support agent, based on the 𝜏-Voice benchmark · Higher is better · Only full duplex models

𝜏-Voice measures task completion in real-world replica customer service scenarios. A voice user simulator calls the tested model with a specific domain problem (e.g., requesting a flight change, disputing a retail charge, resolving a telecom issue) across domains inherited from 𝜏-bench and 𝜏²-bench. The tested model is prompted as a customer support agent with access to domain-specific tools and a policy document. Each task has a single valid database end-state; evaluation compares final state against this ground truth. Charts use average performance across 3 trials where available. For more detail on the original benchmark see the 𝜏-Voice paper.
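
The end-state check and trial averaging described above can be sketched like this. It is an illustration under stated assumptions, not the benchmark's code: database states are modelled as plain dictionaries, and `task_resolved` and `average_pass_rate` are hypothetical names.

```python
from statistics import mean
from typing import Any, Dict, List

DbState = Dict[str, Any]  # assumed representation of a database end-state


def task_resolved(final_state: DbState, expected_state: DbState) -> bool:
    """A task counts as resolved only if the final state equals the ground truth."""
    return final_state == expected_state


def average_pass_rate(trials: List[List[bool]]) -> float:
    """Average the per-trial share of resolved tasks across all available trials."""
    return mean(mean(1.0 if resolved else 0.0 for resolved in trial) for trial in trials)


# Illustrative data only: one flight-change task, checked against its ground truth.
expected = {"booking_123": {"flight": "XY200", "status": "confirmed"}}
final = {"booking_123": {"flight": "XY200", "status": "confirmed"}}
resolved = task_resolved(final, expected)

# Three trials over the same two tasks.
rate = average_pass_rate([[True, False], [True, True], [False, False]])
```

Strict equality on the end-state means a partially completed task scores the same as a failed one, which is consistent with each task having a single valid end-state.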

Note: Results for GPT-Realtime-2 (Minimal) (OpenAI) are based on 1 trial; results for GPT-Realtime-2.1 High (OpenAI) are based on 2 trials.

Agentic Performance (𝜏-Voice) by Domain

Proportion of replica customer service scenarios resolved by domain, based on the 𝜏-Voice benchmark · Higher is better · Only full duplex models

Note: Results for GPT-Realtime-2 (Minimal) (OpenAI) are based on 1 trial; results for GPT-Realtime-2.1 High (OpenAI) are based on 2 trials.

Conversational Dynamics

Conversational Dynamics (Full Duplex Bench subset)

Weighted average of pause handling, turn-taking, interruption handling, and backchannel handling from Full Duplex Bench v1 and v1.5 · Higher is better

Metrics from Full Duplex Bench v1 include Pause Handling (% of samples where the model correctly does not interrupt) and Turn Taking (% of samples where the model correctly takes the turn in conversation). Metrics from Full Duplex Bench v1.5 include User Interruption Handling (% of samples where the model correctly addresses the user's interruption) and Backchannel Handling (% of samples where the model correctly continues when a backchannel such as "yeah" or "alright" is played).

Conversational Dynamics (Full Duplex Bench subset) - Category Breakdown

Individual category scores ordered by conversational dynamics score · Based on subset of Full Duplex Bench v1 and v1.5

API Benchmarks

Artificial Analysis Speech to Speech Index vs. Cost per Hour of Input Audio

Artificial Analysis Speech to Speech Index vs. cost per hour of input audio · Only models with all data available shown · Higher index and lower cost are better

Cost per Hour of Input Audio (Big Bench Audio subset)

Cost to complete fixed 40-question Big Bench Audio subset based on length of input audio, normalized to hourly basis · Lower is better

Cost to complete a fixed 40-question Big Bench Audio subset, normalized to an hourly basis using input audio duration. It incorporates audio input, audio output, text input, text output, and separately exposed reasoning/thinking tokens. It does not include cached-token discounts or tool-call costs.
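
The normalization described above amounts to dividing the total cost of the run by the hours of input audio it contained. The sketch below illustrates this. All field names and prices are hypothetical, and billing reasoning tokens at their own rate is an assumption, since providers differ in how they charge for them.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts for one benchmark run (hypothetical field names)."""
    audio_input: int
    audio_output: int
    text_input: int
    text_output: int
    reasoning: int


@dataclass
class Prices:
    """Prices in USD per 1M tokens (illustrative values, not real pricing)."""
    audio_input: float
    audio_output: float
    text_input: float
    text_output: float
    reasoning: float


def cost_per_hour_of_input_audio(usage: Usage, prices: Prices, input_audio_seconds: float) -> float:
    """Total run cost divided by the input audio duration in hours.

    Cached-token discounts and tool-call costs are deliberately left out.
    """
    run_cost = (
        usage.audio_input * prices.audio_input
        + usage.audio_output * prices.audio_output
        + usage.text_input * prices.text_input
        + usage.text_output * prices.text_output
        + usage.reasoning * prices.reasoning
    ) / 1_000_000
    return run_cost / (input_audio_seconds / 3600)


usage = Usage(audio_input=100_000, audio_output=50_000, text_input=200_000, text_output=0, reasoning=100_000)
prices = Prices(audio_input=10.0, audio_output=20.0, text_input=1.0, text_output=5.0, reasoning=2.0)
hourly_cost = cost_per_hour_of_input_audio(usage, prices, input_audio_seconds=1800)
```

In this made-up example the run costs $2.40 for half an hour of input audio, so the normalized figure is $4.80 per hour. Because output and reasoning tokens are included, a model that answers at length or reasons heavily can cost more per hour of input audio than its input price alone suggests.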

Time to First Audio

Average time to first audio (seconds) on Big Bench Audio dataset · Lower is better

Number of seconds required to generate the first token of audio output. A lower value represents a faster generation.
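
One way to measure this is to time the gap between starting a request and receiving the first non-empty audio chunk. The sketch below assumes a hypothetical `start_stream` callable that sends the request and returns an iterable of audio byte chunks; it stands in for whatever streaming client a provider offers and is not a real API. Where the clock starts (for example, at request send versus at the end of the input audio) is a methodology choice the sketch does not settle.

```python
import time
from typing import Callable, Iterable, Optional


def time_to_first_audio(start_stream: Callable[[], Iterable[bytes]]) -> Optional[float]:
    """Return seconds from starting the request to the first non-empty audio chunk.

    Returns None if the stream ends without producing any audio.
    """
    started = time.perf_counter()
    for chunk in start_stream():
        if chunk:  # skip empty keep-alive chunks
            return time.perf_counter() - started
    return None


def fake_stream() -> Iterable[bytes]:
    """Stand-in for a provider stream: a short delay, an empty chunk, then audio."""
    time.sleep(0.05)
    yield b""
    yield b"\x00\x01"


latency = time_to_first_audio(fake_stream)
```

Averaging this value over every question in the dataset would give a figure comparable in spirit to the chart above.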

Summary of Key Metrics & Further Information


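Provider	Model	AA-Speech to Speech Index	Speech Reasoning (Big Bench Audio)	Conversational Dynamics (Full Duplex Bench subset)	Agentic Performance (𝜏-Voice)	Time to First Audio (s)	(unlabeled)	Price per Hour of Input Audio (USD)	Price per Hour of Output Audio (USD)
Alibaba Cloud	Qwen Audio 3.0 Realtime Plus, Alibaba Cloud	84.1%	99%	98.4%	54.6%	4.02	4.42	0.03	0.18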
Alibaba Cloud	Qwen3.5 Omni Plus Realtime	-	99%	-	-	2.64	0.00	0.16	0.82
Alibaba Cloud	Qwen Audio 3.0 Realtime Flash, Alibaba Cloud	76.3%	96%	96.9%	35.9%	4.16	4.77	-	-
Alibaba Cloud	Qwen3.5 Omni Flash Realtime	-	59%	-	-	0.79	0.00	0.16	0.82
Alibaba Cloud	Qwen3 Omni Flash	-	59%	72.7%	-	4.82	1.77	0.61	1.36
Alibaba Cloud	Qwen3 Omni Realtime	-	57%	-	-	0.88	2.26	0.65	0.99
Amazon Bedrock	Nova 2.0 Sonic (Mar 2026)	-	88%	-	-	1.14	-	0.27	1.08
Cofe AI	FLM-Audio	-	16%	62.0%	-	-	-	-	-
Deepslate	Deepslate Opal	62.8%	85%	85.7%	17.5%	0.44	6.48	-	-
Google	Gemini 3.1 Flash - High	69.5%	97%	74.3%	37.7%	2.99	1.75	0.35	1.38
Google	Gemini 2.5 Flash Native Audio Dialog Thinking	-	91%	-	-	3.87	-	0.35	1.38
Google	Gemini 3.1 Flash - Minimal	56.6%	71%	72.3%	26.2%	0.96	1.50	0.35	1.38
Google	Gemini 2.5 Flash Native Audio Dialog	-	69%	-	-	0.63	1.42	0.35	1.38
Google	Gemini 2.5 Flash Native Audio Preview (Dec 2025)	-	-	44.0%	22.8%	-	-	-	-
Google	Gemini 2.5 Flash Native Audio Preview (Sep 2025)	-	-	30.3%	-	-	-	-	-
Kyutai	Moshi	-	4%	61.0%	-	-	-	-	-
NVIDIA	Nemotron Voicechat	-	27%	52.9%	-	-	-	-	-
NVIDIA	PersonaPlex	-	19%	91.0%	-	-	-	-	-
OpenAI	GPT-Realtime-2 (High)	77.2%	97%	95.3%	39.8%	1.14	4.14	1.15	4.61
OpenAI	GPT-Realtime-2.1 High, OpenAI	79.1%	96%	95.7%	45.7%	1.21	10.75	-	-
OpenAI	GPT-Realtime-2 (Medium)	75.3%	93%	95.2%	37.4%	1.22	3.97	1.15	4.61
OpenAI	GPT-Realtime-2.1 Minimal, OpenAI	72.5%	87%	92.7%	38.0%	0.97	11.31	-	-
OpenAI	GPT Realtime	69.2%	83%	93.9%	30.4%	0.98	11.08	1.15	4.61
OpenAI	GPT-Realtime-1.5	72.0%	81%	95.7%	38.8%	0.81	11.44	1.15	4.61
OpenAI	GPT-Realtime-2.1 Mini High, OpenAI	65.3%	75%	91.7%	29.4%	4.28	3.45	-	-
OpenAI	GPT-Realtime-2 (Minimal)	66.2%	72%	96.1%	30.8%	1.12	3.07	1.15	4.61
OpenAI	GPT-4o mini Realtime (Dec 2024)	-	69%	-	-	1.27	5.75	0.36	1.44
OpenAI	GPT Realtime Mini (Oct 2025)	58.1%	64%	95.7%	15.1%	0.81	3.04	0.36	1.44
OpenAI	GPT-Realtime-2.1 Mini Minimal, OpenAI	59.0%	63%	91.8%	22.5%	0.85	4.60	-	-
OpenAI	GPT-4o audio chatcompletions	-	54%	-	-	3.38	0.00	3.60	14.40
OpenAI	GPT-4o Realtime (Dec 2024)	-	-	89.8%	27.9%	1.49	2.04	1.44	5.76
SpaceXAI	Grok Voice Think Fast 2.0 High	82.9%	97%	95.1%	56.5%	0.70	4.80	-	-
SpaceXAI	Grok Voice Think Fast 1.0	75.7%	97%	77.8%	52.1%	1.25	3.00	-	-
SpaceXAI	Grok Voice Fast 1.0	64.1%	93%	71.6%	27.4%	0.78	3.00	3.00	3.00
StepFun	Step-Audio R1.1 (Realtime)	-	98%	-	-	1.53	0.00	0.06	1.69
VITA	Freeze-Omni	-	33%	58.7%	-	-	-	-	-

Frequently Asked Questions

Which speech to speech model has the best reasoning quality?

Qwen Audio 3.0 Realtime Plus, Alibaba Cloud leads with a speech reasoning score of 99.2% on the Big Bench Audio dataset across 33 models evaluated. Qwen3.5 Omni Plus Realtime follows at 98.7% and Step-Audio R1.1 (Realtime) at 97.6%.

Which speech to speech model has the best conversational dynamics?

Qwen Audio 3.0 Realtime Plus, Alibaba Cloud leads with a conversational dynamics score of 98.4% on the Full Duplex Bench dataset across 27 models evaluated. Qwen Audio 3.0 Realtime Flash, Alibaba Cloud follows at 96.9% and GPT-Realtime-2 (Minimal) at 96.1%.

What are the top speech to speech models?

The top speech to speech models by reasoning quality are: 1. Qwen Audio 3.0 Realtime Plus, Alibaba Cloud (99.2%), 2. Qwen3.5 Omni Plus Realtime (98.7%), 3. Step-Audio R1.1 (Realtime) (97.6%), 4. Grok Voice Think Fast 2.0 High (97.2%), 5. Grok Voice Think Fast 1.0 (97.1%). Scores are based on the Big Bench Audio benchmark.

What are the top speech to speech models by conversational dynamics?

The top speech to speech models by conversational dynamics are: 1. Qwen Audio 3.0 Realtime Plus, Alibaba Cloud (98.4%), 2. Qwen Audio 3.0 Realtime Flash, Alibaba Cloud (96.9%), 3. GPT-Realtime-2 (Minimal) (96.1%), 4. GPT-Realtime-2.1 High, OpenAI (95.7%), 5. GPT-Realtime-1.5 (95.7%). Scores are based on the Full Duplex Bench benchmark.

Which speech to speech model is the fastest?

Deepslate Opal has the lowest Time to First Audio at 0.44s, followed by Gemini 2.5 Flash Native Audio Dialog (0.63s) and Grok Voice Think Fast 2.0 High (0.70s). Lower values mean faster initial response.

Which speech to speech model is the cheapest?

Qwen Audio 3.0 Realtime Plus, Alibaba Cloud is the most affordable at $0.0331/hour for input audio, followed by Step-Audio R1.1 (Realtime) ($0.064/hour) and Qwen3.5 Omni Flash Realtime ($0.162/hour).

What is Full Duplex Bench (FDB)?

Full Duplex Bench evaluates how well speech to speech models handle real conversational behaviors: knowing when to speak, when to stay silent during pauses, how to respond to interruptions, and how to recognize backchannels like "yeah" or "mm-hmm". Artificial Analysis implements a subset of Full Duplex Bench v1 and v1.5.

What is the difference between speech reasoning and conversational dynamics?

Speech reasoning (Big Bench Audio) measures whether a model can understand and correctly answer reasoning questions delivered as audio. Conversational dynamics (Full Duplex Bench) measures whether a model can handle natural conversation flow: turn-taking, pauses, and interruptions. A model may excel at one but not the other.

Which speech to speech model is best for real-time conversation?

Real-time conversation requires both strong conversational dynamics and low latency. By conversational dynamics score, Qwen Audio 3.0 Realtime Plus, Alibaba Cloud (98.4%), Qwen Audio 3.0 Realtime Flash, Alibaba Cloud (96.9%), and GPT-Realtime-2 (Minimal) (96.1%) lead the field. By Time to First Audio, Deepslate Opal (0.44s), Gemini 2.5 Flash Native Audio Dialog (0.63s), and Grok Voice Think Fast 2.0 High (0.70s) are the fastest. The best choice depends on whether natural conversation flow or response speed is more critical for your use case.

How often are speech to speech benchmarks updated?

Benchmarks are updated regularly as new models and providers are added. Performance metrics are continuously monitored to reflect current provider capabilities and pricing changes.