
Speech to Speech AI Model & Provider Leaderboard

Analysis and comparison of speech to speech models and API providers. Artificial Analysis has evaluated these models and their hosting providers on reasoning quality, generation time, and price.

For further details, see the methodology page.

Highlights

Speech Reasoning: based on the Artificial Analysis Big Bench Audio dataset; higher is better
Speed: Time to First Audio (seconds); lower is better
Input Price: price per hour of input audio (USD); lower is better

Speech Reasoning

Speech Reasoning (Big Bench Audio)

Speech Reasoning: Based on the Artificial Analysis Big Bench Audio dataset; Higher is better

Results from evaluation on the Artificial Analysis Big Bench Audio dataset. See methodology for further details.

Notes: ChatCompletions results were negatively impacted by failures to generate valid audio. See methodology for analysis.

About Big Bench Audio

The emergence of native audio-to-audio models offers exciting opportunities to increase voice agent capabilities and simplify workflows. However, it's crucial to evaluate whether this simplification comes at the cost of model performance or introduces other trade-offs.


To help answer this question we've released Big Bench Audio, a new dataset for benchmarking the performance of native audio models.


Big Bench Audio contains 1,000 audio files representing questions designed to test the intelligence of models. The questions are based on four categories of the Big Bench Hard dataset and were generated using 23 synthetic voices from top-ranked text to speech models in the Artificial Analysis Text to Speech Arena.


To enable evaluation of the trade-offs associated with using native speech to speech models, we test multiple configurations on Big Bench Audio, as described in the table and displayed in the chart above. To learn more about Big Bench Audio, review the article or download the dataset yourself.

Summary Analysis

Speech Reasoning vs Input Price

Chart: Speech Reasoning (Artificial Analysis Big Bench Audio dataset; higher is better) plotted against input price (USD per hour of input audio; lower is better). The most attractive quadrant combines high reasoning with low price.
Models shown: Gemini 2.5 Flash Live Preview; Gemini 2.5 Flash Native Audio Dialog; Gemini 2.5 Flash Native Audio Dialog (Thinking); GPT Realtime, Aug '25; GPT-4o mini Realtime, Dec '24; GPT-4o Realtime, Dec '24; GPT-4o, ChatCompletions Audio Preview; GPT-4o, Realtime Preview; GPT-Realtime Mini, Oct '25; Grok Voice Agent; Nova 2.0 Sonic; Qwen3 Omni 30B; Qwen3 Omni 30B (Realtime).

Price per hour of audio included in the request/message sent to the API, represented as USD per hour of audio.
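
One way to make the reasoning versus price trade-off concrete is to list the models that no other model beats on both axes at once (a Pareto frontier). The sketch below is an illustration, not the method behind the chart's quadrant lines, which this page does not specify. It uses the figures from the summary table on this page, reading the percentage values as Speech Reasoning scores; the names `Model`, `dominates`, and `pareto_frontier` are hypothetical.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    reasoning_pct: float       # Big Bench Audio score, higher is better
    input_usd_per_hour: float  # input audio price, lower is better


# Values copied from the summary table on this page.
MODELS = [
    Model("Qwen3 Omni 30B (Realtime)", 59, 0.65),
    Model("Qwen3 Omni 30B", 58, 0.61),
    Model("Nova 2.0 Sonic", 87, 0.27),
    Model("Grok Voice Agent", 92, 3.00),
    Model("GPT Realtime, Aug '25", 83, 1.15),
    Model("GPT-4o mini Realtime, Dec '24", 69, 0.36),
    Model("GPT-4o Realtime, Dec '24", 68, 1.44),
    Model("GPT-Realtime Mini, Oct '25", 68, 0.36),
    Model("GPT-4o, Realtime Preview", 66, 3.60),
    Model("GPT-4o, ChatCompletions Audio Preview", 54, 3.60),
    Model("Gemini 2.5 Flash Native Audio Dialog (Thinking)", 92, 0.35),
    Model("Gemini 2.5 Flash Live Preview", 74, 0.35),
    Model("Gemini 2.5 Flash Native Audio Dialog", 71, 0.35),
    Model("Gemini 2.0 Flash Experimental", 36, 0.03),
]


def dominates(a: Model, b: Model) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = (a.reasoning_pct >= b.reasoning_pct
                and a.input_usd_per_hour <= b.input_usd_per_hour)
    strictly_better = (a.reasoning_pct > b.reasoning_pct
                       or a.input_usd_per_hour < b.input_usd_per_hour)
    return no_worse and strictly_better


def pareto_frontier(models: List[Model]) -> List[Model]:
    """Models not dominated by any other model, sorted by ascending input price."""
    frontier = [m for m in models if not any(dominates(o, m) for o in models)]
    return sorted(frontier, key=lambda m: m.input_usd_per_hour)


if __name__ == "__main__":
    for m in pareto_frontier(MODELS):
        print(f"{m.name}: {m.reasoning_pct:.0f}% at ${m.input_usd_per_hour:.2f}/hour")
```

With these figures the frontier is Gemini 2.0 Flash Experimental, Nova 2.0 Sonic, and Gemini 2.5 Flash Native Audio Dialog (Thinking). A frontier ignores latency and output price, so it complements the chart rather than replacing it.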

Speech Reasoning vs Speed

Chart: Speech Reasoning (Artificial Analysis Big Bench Audio dataset; higher is better) plotted against speed (seconds to generate the first token of audio output; lower is better). The most attractive quadrant combines high reasoning with low time to first audio.
Models shown: Gemini 2.5 Flash Live Preview; Gemini 2.5 Flash Native Audio Dialog; Gemini 2.5 Flash Native Audio Dialog (Thinking); GPT Realtime, Aug '25; GPT-4o mini Realtime, Dec '24; GPT-4o Realtime, Dec '24; GPT-4o, ChatCompletions Audio Preview; GPT-4o, Realtime Preview; GPT-Realtime Mini, Oct '25; Grok Voice Agent; Nova 2.0 Sonic; Qwen3 Omni 30B; Qwen3 Omni 30B (Realtime).

Number of seconds required to generate the first token of audio output. A lower value represents a faster generation.
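
As a rough illustration of this metric, the sketch below times how long a streaming response takes to deliver its first non-empty audio chunk. It is a minimal sketch under stated assumptions, not the procedure Artificial Analysis uses (see the methodology page for that): it assumes the response can be consumed as an iterable of byte chunks, and `time_to_first_audio` and `fake_stream` are hypothetical names, with `fake_stream` standing in for a real provider call.

```python
import time
from typing import Callable, Iterable, Optional


def time_to_first_audio(
    start_stream: Callable[[], Iterable[bytes]],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Seconds from starting the request until the first non-empty audio chunk.

    Returns None if the stream ends without producing any audio.
    """
    started = clock()
    for chunk in start_stream():
        if chunk:  # skip empty keep-alive chunks
            return clock() - started
    return None


def fake_stream(delay_s: float = 0.05) -> Callable[[], Iterable[bytes]]:
    """Hypothetical stand-in for a provider call: waits, then yields audio bytes."""
    def _start() -> Iterable[bytes]:
        time.sleep(delay_s)
        yield b"\x00\x01"
        yield b"\x02\x03"
    return _start


if __name__ == "__main__":
    elapsed = time_to_first_audio(fake_stream(0.05))
    print(f"time to first audio: {elapsed:.3f} s")
```

Passing the clock as a parameter keeps the function testable without real delays. In practice, network location and connection setup also affect the measured value, so repeated measurements are more informative than a single run.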

Price

Audio Input and Output Prices

USD per hour of audio; lower is better

Input Price: price per hour of audio included in the request/message sent to the API, in USD per hour of audio.

Output Price: price per hour of audio generated by the model (received from the API), in USD per hour of audio.
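
Because both prices are quoted per hour of audio, the audio cost of a session can be estimated from how many minutes of audio are sent and received. The sketch below shows that arithmetic; `session_cost_usd` is a hypothetical helper, and the estimate assumes billing is purely proportional to audio duration, which may not match a provider's actual billing (for example, text tokens or caching are ignored).

```python
def session_cost_usd(
    input_minutes: float,
    output_minutes: float,
    input_usd_per_hour: float,
    output_usd_per_hour: float,
) -> float:
    """Estimated audio cost of one session, assuming duration-proportional pricing."""
    if min(input_minutes, output_minutes, input_usd_per_hour, output_usd_per_hour) < 0:
        raise ValueError("durations and prices must be non-negative")
    input_cost = input_minutes / 60 * input_usd_per_hour
    output_cost = output_minutes / 60 * output_usd_per_hour
    return input_cost + output_cost


if __name__ == "__main__":
    # Prices for GPT Realtime, Aug '25 from the summary table on this page:
    # 1.15 USD per hour of input audio, 4.61 USD per hour of output audio.
    cost = session_cost_usd(30, 30, 1.15, 4.61)
    print(f"estimated cost: ${cost:.2f}")
```

For 30 minutes of input audio and 30 minutes of output audio at those prices, the estimate is 0.575 + 2.305 = 2.88 USD.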

Speed

Time to First Audio

Seconds to generate the first token of audio output; Lower is better

Summary of key metrics & further information

| Provider | Model Name | Footnotes | Speech Reasoning (Big Bench Audio) | Time to first audio (secs) | Price per hour of audio input (USD) | Price per hour of audio output (USD) |
|---|---|---|---|---|---|---|
| Alibaba Cloud | Qwen3 Omni 30B (Realtime) | - | 59% | 0.88 | 0.65 | 0.99 |
| Alibaba Cloud | Qwen3 Omni 30B | - | 58% | 4.82 | 0.61 | 1.36 |
| Amazon Bedrock | Nova 2.0 Sonic | - | 87% | 1.39 | 0.27 | 1.08 |
| xAI | Grok Voice Agent | - | 92% | 0.78 | 3.00 | 3.00 |
| OpenAI | GPT Realtime, Aug '25 | - | 83% | 0.98 | 1.15 | 4.61 |
| OpenAI | GPT-4o mini Realtime, Dec '24 | - | 69% | 1.27 | 0.36 | 1.44 |
| OpenAI | GPT-4o Realtime, Dec '24 | - | 68% | 1.49 | 1.44 | 5.76 |
| OpenAI | GPT-Realtime Mini, Oct '25 | - | 68% | 0.81 | 0.36 | 1.44 |
| OpenAI | GPT-4o, Realtime Preview | - | 66% | 1.04 | 3.60 | 14.40 |
| OpenAI | GPT-4o, ChatCompletions Audio Preview | | 54% | 3.38 | 3.60 | 14.40 |
| Google | Gemini 2.5 Flash Native Audio Dialog (Thinking) | - | 92% | 3.87 | 0.35 | 1.38 |
| Google | Gemini 2.5 Flash Live Preview | - | 74% | 0.64 | 0.35 | 1.38 |
| Google | Gemini 2.5 Flash Native Audio Dialog | - | 71% | 0.63 | 0.35 | 1.38 |
| Google | Gemini 2.0 Flash Experimental | - | 36% | 1.94 | 0.03 | - |
