
Speech to Speech AI Model & Provider Leaderboard

Analysis and comparison of Speech to Speech models and API providers. Artificial Analysis evaluates speech to speech models and hosting providers on reasoning quality, generation time, and price.

For further details, see the methodology page.

Highlights

Speech Reasoning
Speech Reasoning based on the Artificial Analysis Big Bench Audio dataset; Higher is better
Speed
Time to First Audio (seconds); Lower is better
Input Price
Price per Hour of Input Audio (USD); Lower is better

Speech Reasoning

Speech Reasoning (Big Bench Audio): based on the Artificial Analysis Big Bench Audio dataset; higher is better

Results from evaluation on the Artificial Analysis Big Bench Audio dataset. See methodology for further details.

Notes: ChatCompletions results were negatively impacted by failures to generate valid audio. See methodology for analysis.

About Big Bench Audio

The emergence of native audio-to-audio models offers exciting opportunities to increase voice agent capabilities and simplify workflows. However, it's crucial to evaluate whether this simplification comes at the cost of model performance or introduces other trade-offs.


To help answer this question we've released Big Bench Audio, a new dataset for benchmarking the performance of native audio models.


Big Bench Audio contains 1,000 audio files representing questions designed to test the intelligence of models. The questions are based on four categories of the Big Bench Hard dataset and were generated using 23 synthetic voices from top-ranked text to speech models in the Artificial Analysis Text to Speech Arena.


To evaluate the trade-offs of using native speech to speech models, we test several configurations on Big Bench Audio, as described in the table and displayed in the chart above. To learn more about Big Bench Audio, review the article or download the dataset yourself.

Summary Analysis

Speech Reasoning vs Input Price

Speech Reasoning: based on the Artificial Analysis Big Bench Audio dataset; higher is better. Price: USD per hour of input audio; lower is better.

[Scatter plot with the most attractive quadrant highlighted. Models plotted: Gemini 2.5 Flash Live Preview; Gemini 2.5 Flash Native Audio Dialog; Gemini 2.5 Flash Native Audio Dialog (Thinking); GPT Realtime, Aug '25; GPT-4o mini Realtime, Dec '24; GPT-4o Realtime, Dec '24; GPT-4o, ChatCompletions Audio Preview; GPT-4o, Realtime Preview; GPT-Realtime Mini, Oct '25; Grok Voice Agent; Nova 2.0 Sonic; Qwen3 Omni 30B; Qwen3 Omni 30B (Realtime); Step-Audio R1.1 (Realtime)]

Price per hour of audio included in the request/message sent to the API, represented as USD per hour of audio.

Speech Reasoning vs Speed

Speech Reasoning: based on the Artificial Analysis Big Bench Audio dataset; higher is better. Speed: seconds to generate the first token of audio output; lower is better.

[Scatter plot with the most attractive quadrant highlighted. Models plotted: Gemini 2.5 Flash Live Preview; Gemini 2.5 Flash Native Audio Dialog; Gemini 2.5 Flash Native Audio Dialog (Thinking); GPT Realtime, Aug '25; GPT-4o mini Realtime, Dec '24; GPT-4o Realtime, Dec '24; GPT-4o, ChatCompletions Audio Preview; GPT-4o, Realtime Preview; GPT-Realtime Mini, Oct '25; Grok Voice Agent; Nova 2.0 Sonic; Qwen3 Omni 30B; Qwen3 Omni 30B (Realtime); Step-Audio R1.1 (Realtime)]

Number of seconds required to generate the first token of audio output. A lower value represents a faster generation.

Audio Input and Output Prices

USD per hour of audio; Lower is better
Input Price: price per hour of audio included in the request/message sent to the API, in USD per hour of audio.

Output Price: price per hour of audio generated by the model (received from the API), in USD per hour of audio.
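
Because input and output audio are priced separately per hour, the cost of a session depends on how much audio flows in each direction. The following is a minimal sketch of that arithmetic, not an official pricing calculator: the function name `estimate_audio_cost` and its parameters are illustrative assumptions, and the example prices (0.27 and 1.08 USD per hour) are the Nova 2.0 Sonic figures from the summary table on this page.

```python
def estimate_audio_cost(
    input_seconds: float,
    output_seconds: float,
    input_price_per_hour: float,
    output_price_per_hour: float,
) -> float:
    """Estimate the USD cost of a session from per-hour audio prices.

    Hypothetical helper: it assumes billing is strictly proportional to
    audio duration, which may not match a provider's actual billing rules.
    """
    if input_seconds < 0 or output_seconds < 0:
        raise ValueError("audio durations must be non-negative")

    # Convert durations to hours, since prices are quoted per hour of audio.
    input_hours = input_seconds / 3600.0
    output_hours = output_seconds / 3600.0

    return input_hours * input_price_per_hour + output_hours * output_price_per_hour


if __name__ == "__main__":
    # One hour of input audio and one hour of output audio.
    cost = estimate_audio_cost(3600, 3600, 0.27, 1.08)
    print(f"Estimated cost: ${cost:.2f}")  # prints "Estimated cost: $1.35"
```

In most voice agent sessions the two durations differ, so the more expensive output price tends to dominate the total when the model speaks more than the user.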

Time to First Audio

Seconds to generate the first token of audio output; Lower is better

Number of seconds required to generate the first token of audio output. A lower value represents a faster generation.
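
As a rough illustration of this metric, the sketch below times how long a streaming response takes to deliver its first non-empty audio chunk. It is an assumption-laden example rather than the measurement procedure used for this leaderboard (see the methodology page for that): `time_to_first_audio` and `fake_stream` are hypothetical names, and the stream is modeled as a plain Python iterable of `bytes` rather than any specific provider API.

```python
import time
from typing import Callable, Iterable, Optional


def time_to_first_audio(start_stream: Callable[[], Iterable[bytes]]) -> Optional[float]:
    """Return seconds from request start until the first non-empty audio chunk.

    `start_stream` is a hypothetical callable that sends the request and
    returns an iterable of audio chunks. Returns None if no audio arrives.
    """
    start = time.perf_counter()
    for chunk in start_stream():
        if chunk:  # skip empty keep-alive chunks
            return time.perf_counter() - start
    return None


def fake_stream() -> Iterable[bytes]:
    """Stand-in for a real API stream: waits briefly, then yields audio."""
    time.sleep(0.05)      # simulated network and model latency
    yield b""             # an empty chunk that should be ignored
    yield b"\x00\x01"     # first real audio bytes


if __name__ == "__main__":
    ttfa = time_to_first_audio(fake_stream)
    print(f"Time to first audio: {ttfa:.2f} s")
```

Starting the timer before the request is sent means the measurement includes connection setup and network latency, not only model generation time.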


Summary of key metrics & further information

| Provider | Model Name | Footnotes | Speech Reasoning (Big Bench Audio) | Time to first audio (secs) | Price per hour of audio input (USD) | Price per hour of audio output (USD) |
|---|---|---|---|---|---|---|
| StepFun | Step-Audio R1.1 (Realtime) | - | 96% | 1.51 | 0.06 | 1.69 |
| Alibaba Cloud | Qwen3 Omni 30B (Realtime) | - | 59% | 0.88 | 0.65 | 0.99 |
| Alibaba Cloud | Qwen3 Omni 30B | - | 58% | 4.82 | 0.61 | 1.36 |
| Amazon Bedrock | Nova 2.0 Sonic | - | 87% | 1.39 | 0.27 | 1.08 |
| xAI | Grok Voice Agent | - | 92% | 0.78 | 3.00 | 3.00 |
| OpenAI | GPT Realtime, Aug '25 | - | 83% | 0.98 | 1.15 | 4.61 |
| OpenAI | GPT-4o mini Realtime, Dec '24 | - | 69% | 1.27 | 0.36 | 1.44 |
| OpenAI | GPT-4o Realtime, Dec '24 | - | 68% | 1.49 | 1.44 | 5.76 |
| OpenAI | GPT-Realtime Mini, Oct '25 | - | 68% | 0.81 | 0.36 | 1.44 |
| OpenAI | GPT-4o, Realtime Preview | - | 66% | 1.04 | 3.60 | 14.40 |
| OpenAI | GPT-4o, ChatCompletions Audio Preview | | 54% | 3.38 | 3.60 | 14.40 |
| Google | Gemini 2.5 Flash Native Audio Dialog (Thinking) | - | 92% | 3.87 | 0.35 | 1.38 |
| Google | Gemini 2.5 Flash Live Preview | - | 74% | 0.64 | 0.35 | 1.38 |
| Google | Gemini 2.5 Flash Native Audio Dialog | - | 71% | 0.63 | 0.35 | 1.38 |
| Google | Gemini 2.0 Flash Experimental | - | 36% | 1.94 | 0.03 | - |