Speech to Speech Benchmarking Methodology

Overview

Our current Speech to Speech benchmarking evaluates native audio models - models that support native audio input and output - across two quality dimensions: Speech Reasoning and Conversational Dynamics.

Speech Reasoning

Overview

Our Speech Reasoning benchmark evaluates the ability of native audio models to answer reasoning-based questions.

Native audio models are provided with an input audio file and are expected to generate an audio output. The judge model is provided with the candidate answer, the official answer, and the original question as context, and is prompted to label the candidate answer as correct or incorrect.
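The judge step above can be sketched as follows. This is a minimal illustration, assuming the candidate audio answer has already been converted to text; the prompt wording and the `ask_judge` callable are hypothetical, not the exact prompt or client used by Artificial Analysis.

```python
# Illustrative sketch of LLM-judge grading for Speech Reasoning.
JUDGE_PROMPT = """You are grading an answer to a reasoning question.

Question: {question}
Official answer: {official}
Candidate answer: {candidate}

Reply with exactly one word: "correct" or "incorrect"."""


def grade(question: str, official: str, candidate: str, ask_judge) -> bool:
    """Return True if the judge labels the candidate answer correct.

    `ask_judge` is any callable that sends a prompt to a judge LLM and
    returns its text reply (hypothetical interface, supplied by the caller).
    """
    reply = ask_judge(JUDGE_PROMPT.format(
        question=question, official=official, candidate=candidate))
    return reply.strip().lower().startswith("correct")
```

In practice `ask_judge` would wrap a chat-completions call to the judge model; here any callable with that shape works.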

Dataset: Big Bench Audio

The emergence of native audio-to-audio models offers exciting opportunities to increase voice agent capabilities and simplify workflows. However, it's crucial to evaluate whether this simplification comes at the cost of model performance or introduces other trade-offs.

To help answer this question we've released Big Bench Audio, a new dataset for benchmarking the performance of native audio models.

Big Bench Audio contains 1,000 audio files representing questions designed to test the intelligence of models. The questions are based on four categories of the Big Bench Hard dataset (250 questions each) and were generated using 23 synthetic voices from top-ranked text to speech models in the Artificial Analysis Text to Speech Arena.

Question Categories

Formal Fallacies (250 questions): Determine whether an argument presented informally can be logically deduced from the provided context.

Example: First of all, everyone who is a close friend of Glenna is a close friend of Tamara, too. Next, whoever is neither a half-sister of Deborah nor a workmate of Nila is a close friend of Glenna. Hence, whoever is none of this: a half-sister of Deborah or workmate of Nila, is a close friend of Tamara. Is the argument deductively valid or invalid?
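This example can be checked mechanically: for a single person, only four predicates matter, so brute-forcing all truth assignments shows whether the conclusion holds in every model of the premises. This worked check is ours, not part of the benchmark.

```python
from itertools import product

# Predicates per person: half-sister of Deborah (hs), workmate of Nila (wm),
# close friend of Glenna (fg), close friend of Tamara (ft).
# Premise 1: fg -> ft
# Premise 2: (not hs and not wm) -> fg
# Conclusion: (not hs and not wm) -> ft


def implies(a: bool, b: bool) -> bool:
    return (not a) or b


# Valid iff the conclusion is true in every assignment satisfying the premises.
valid = all(
    implies(not hs and not wm, ft)
    for hs, wm, fg, ft in product([False, True], repeat=4)
    if implies(fg, ft) and implies(not hs and not wm, fg)
)
print("deductively valid" if valid else "invalid")  # deductively valid
```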

Navigate (250 questions): Determine whether a series of navigation steps returns the agent to the starting point.

Example: If you follow these instructions, do you return to the starting point? Take 10 steps. Turn around. Take 4 steps. Take 6 steps. Turn around.
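The example can be verified with a one-dimensional simulation (sufficient here, since no left/right turns appear); the step encoding below is ours, for illustration.

```python
def returns_to_start(steps) -> bool:
    """Simulate 1-D navigation. Each step is ('take', n) or ('turn',)."""
    pos, direction = 0, 1
    for step in steps:
        if step[0] == "take":
            pos += direction * step[1]
        else:  # "turn": reverse direction
            direction = -direction
    return pos == 0


instructions = [("take", 10), ("turn",), ("take", 4), ("take", 6), ("turn",)]
print(returns_to_start(instructions))  # True: 10 steps out, then 4 + 6 back
```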

Object Counting (250 questions): Count the number of a specific item class given a collection of possessions.

Example: I have three blackberries, two strawberries, an apple, three oranges, a nectarine, a grape, a peach, a banana, and a plum. How many fruits do I have?
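Worked check of this example: filter possessions to the target class and sum the counts. In this particular question every item is a fruit, so the filter is a no-op, but other questions mix in non-target items.

```python
# Count the fruits in the example's list of possessions.
possessions = {
    "blackberry": 3, "strawberry": 2, "apple": 1, "orange": 3,
    "nectarine": 1, "grape": 1, "peach": 1, "banana": 1, "plum": 1,
}
fruits = {"blackberry", "strawberry", "apple", "orange",
          "nectarine", "grape", "peach", "banana", "plum"}

total = sum(n for item, n in possessions.items() if item in fruits)
print(total)  # 14
```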

Web of Lies (250 questions): Evaluate the truth value of a Boolean function expressed as a natural-language word problem.

Example: Leda tells the truth. Alexis says Leda lies. Sal says Alexis lies. Phoebe says Sal tells the truth. Gwenn says Phoebe tells the truth. Does Gwenn tell the truth?
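The example reduces to propagating truth values down a chain: a person tells the truth exactly when their statement about the previous speaker is true. A worked check:

```python
# Each variable is True if that person tells the truth.
leda = True                      # given: "Leda tells the truth."
alexis = (leda is False)         # Alexis claims Leda lies
sal = (alexis is False)          # Sal claims Alexis lies
phoebe = (sal is True)           # Phoebe claims Sal tells the truth
gwenn = (phoebe is True)         # Gwenn claims Phoebe tells the truth
print("yes" if gwenn else "no")  # yes
```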

To enable evaluation of the trade-offs associated with using native speech to speech models, we test multiple configurations on Big Bench Audio. To learn more about Big Bench Audio, read the article or download the dataset yourself.

Conversational Dynamics

Overview

Our Conversational Dynamics benchmark evaluates the ability of native audio models to handle realistic conversational behaviors - the kinds of interactions that occur naturally in human conversation but are challenging for speech models to manage correctly.

This benchmark is implemented by Artificial Analysis based on a subset of Full Duplex Bench v1 (Lin et al., 2025) and Full Duplex Bench v1.5 (Lin et al., 2025), benchmarks that systematically evaluate key interactive behaviors of full duplex spoken dialogue models.

Metrics

From Full Duplex Bench v1:

  • Pause Handling: Percentage of samples where the model correctly does not interrupt during a user's natural pause. Evaluates whether the model recognizes that the speaker still holds the floor.
  • Turn Taking: Percentage of samples where the model correctly takes the conversational turn when appropriate. Measures the model's ability to detect turn boundaries and respond promptly.

From Full Duplex Bench v1.5:

  • User Interruption Handling: Percentage of samples where the model correctly addresses the user's interruption - responding to questions or changes in topic raised mid-conversation.
  • Backchannel Handling: Percentage of samples where the model correctly continues its response when a backchannel such as "yeah", "alright", or "mm-hmm" is played, rather than treating it as a new turn.
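All four Conversational Dynamics metrics reduce to the same aggregate: the percentage of samples where the model's behavior was judged correct. A minimal sketch, with illustrative boolean labels standing in for the per-sample judgments produced by the Full Duplex Bench evaluation:

```python
def metric_score(sample_labels) -> float:
    """Percentage of samples where the model behaved correctly.

    sample_labels: iterable of bools, True = correct behavior for that sample
    (e.g. stayed silent through a pause, or continued past a backchannel).
    """
    labels = list(sample_labels)
    return 100.0 * sum(labels) / len(labels)


# Illustrative: 3 of 4 pause-handling samples judged correct.
pause_handling = metric_score([True, True, False, True])
print(pause_handling)  # 75.0
```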

Price

  • Price per Hour of Input Audio: Total cost (USD) of one hour of audio included in the request / message sent to the API.
  • Price per Hour of Output Audio: Total cost (USD) of one hour of audio generated by the model (received from the API).
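Providers often quote audio prices per million tokens rather than per hour, so a conversion is needed to compare models on this metric. A sketch of that conversion; both rates below are hypothetical placeholders, not real provider pricing:

```python
def price_per_hour(price_per_1m_tokens_usd: float,
                   audio_tokens_per_second: float) -> float:
    """Convert a per-token audio rate into USD per hour of audio."""
    tokens_per_hour = audio_tokens_per_second * 3600
    return price_per_1m_tokens_usd * tokens_per_hour / 1_000_000


# Hypothetical: $40 per 1M audio tokens, 10 audio tokens per second of audio.
print(round(price_per_hour(40.0, 10), 2))  # 1.44
```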

Speed

  • Time to First Audio (TTFA): Average number of seconds required to generate the first token of audio output, measured across the Big Bench Audio question set. TTFA is a critical indicator of perceived responsiveness in voice agent applications.
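TTFA measurement can be sketched as timing the gap between sending the request and receiving the first audio chunk from a streaming response. The `stream` interface below is a hypothetical stand-in; real client libraries differ.

```python
import time


def time_to_first_audio(stream):
    """Seconds from now until the first non-empty audio chunk arrives.

    `stream` is any iterator of audio chunks from a streaming API response
    (hypothetical interface; start timing when the request is sent).
    """
    start = time.monotonic()
    for chunk in stream:
        if chunk:
            return time.monotonic() - start
    return None  # stream ended without producing audio


# Usage with a stand-in generator that "produces" audio after a short delay:
def fake_stream():
    time.sleep(0.05)
    yield b"\x00\x01"


ttfa = time_to_first_audio(fake_stream())
```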