Speech to Speech Benchmarking Methodology

Overview

Our current Speech to Speech benchmarking evaluates native audio models - models that support native audio input and output - across three quality dimensions: speech reasoning, conversational dynamics and agentic performance.

Speech Reasoning

Overview

Our speech reasoning benchmark evaluates the ability of native audio models to answer reasoning-based questions.

Native audio models are provided with an input audio file and are expected to generate an output audio response. A judge model is then provided with the candidate answer, the official answer, and the original question as context, and is prompted to label the candidate answer as correct or incorrect.
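The grading step described above can be sketched as follows. This is a minimal illustration, not the judge prompt actually used: the prompt wording, the `JudgeInput` type, and both function names are assumptions, and the call to the judge model itself is left out.

```python
from dataclasses import dataclass


@dataclass
class JudgeInput:
    """The three pieces of context the judge model receives."""
    question: str
    official_answer: str
    candidate_answer: str


def build_judge_prompt(item: JudgeInput) -> str:
    # Hypothetical prompt wording; the real judge prompt is not published here.
    return (
        "Label the candidate answer as CORRECT or INCORRECT.\n"
        f"Question: {item.question}\n"
        f"Official answer: {item.official_answer}\n"
        f"Candidate answer: {item.candidate_answer}\n"
        "Reply with a single word: CORRECT or INCORRECT."
    )


def parse_judge_label(reply: str) -> bool:
    """Map the judge's one-word reply to True (correct) or False (incorrect)."""
    label = reply.strip().upper()
    if label not in {"CORRECT", "INCORRECT"}:
        raise ValueError(f"unexpected judge reply: {reply!r}")
    return label == "CORRECT"
```

Restricting the judge to a one-word label keeps parsing trivial; a production grader would likely need to handle more verbose replies.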

Dataset: Big Bench Audio

The emergence of native audio-to-audio models offers exciting opportunities to increase voice agent capabilities and simplify workflows. However, it's crucial to evaluate whether this simplification comes at the cost of model performance or introduces other trade-offs.

To help answer this question we've released Big Bench Audio, a new dataset for benchmarking the performance of native audio models.

Big Bench Audio contains 1,000 audio files representing questions designed to test the intelligence of models. The questions are based on four categories of the Big Bench Hard dataset (250 questions each) and were generated using 23 synthetic voices from top-ranked text to speech models in the Artificial Analysis Text to Speech Arena.

Question Categories

Formal Fallacies (250 questions): Determine whether an argument presented informally can be logically deduced from the provided context.

Example: First of all, everyone who is a close friend of Glenna is a close friend of Tamara, too. Next, whoever is neither a half-sister of Deborah nor a workmate of Nila is a close friend of Glenna. Hence, whoever is none of this: a half-sister of Deborah or workmate of Nila, is a close friend of Tamara. Is the argument deductively valid or invalid?

Navigate (250 questions): Determine whether a series of navigation steps returns the agent to the starting point.

Example: If you follow these instructions, do you return to the starting point? Take 10 steps. Turn around. Take 4 steps. Take 6 steps. Turn around.
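To make the expected answer concrete, the example above can be checked with a small simulator. This is only a sketch for working out the ground truth: the function name is hypothetical, and it handles just the "Take N steps" and "Turn around/left/right" phrasings, not every instruction form in the dataset.

```python
import re


def returns_to_start(instructions: str) -> bool:
    """Follow the steps on a 2D grid and report whether we end at the origin."""
    x, y = 0, 0
    dx, dy = 0, 1  # start facing "north"
    for sentence in instructions.split("."):
        step = sentence.strip().lower()
        if not step:
            continue
        match = re.fullmatch(r"take (\d+) steps?", step)
        if match:
            n = int(match.group(1))
            x, y = x + dx * n, y + dy * n
        elif step == "turn around":
            dx, dy = -dx, -dy
        elif step == "turn left":
            dx, dy = -dy, dx
        elif step == "turn right":
            dx, dy = dy, -dx
        else:
            raise ValueError(f"unsupported instruction: {sentence!r}")
    return (x, y) == (0, 0)


example = "Take 10 steps. Turn around. Take 4 steps. Take 6 steps. Turn around."
print(returns_to_start(example))  # True: 10 steps out, then 4 + 6 steps back
```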

Object Counting (250 questions): Count the number of a specific item class given a collection of possessions.

Example: I have three blackberries, two strawberries, an apple, three oranges, a nectarine, a grape, a peach, a banana, and a plum. How many fruits do I have?
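The arithmetic behind the example above can be written out as a short sketch. The quantities are transcribed by hand from the example sentence; the helper name and the number-word table are illustrative assumptions.

```python
NUMBER_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3}

# (quantity word, item) pairs transcribed from the example question.
possessions = [
    ("three", "blackberries"), ("two", "strawberries"), ("an", "apple"),
    ("three", "oranges"), ("a", "nectarine"), ("a", "grape"),
    ("a", "peach"), ("a", "banana"), ("a", "plum"),
]

FRUITS = {name for _, name in possessions}  # every item in this example is a fruit


def count_items(items, item_class):
    """Sum the quantities of the items that belong to the requested class."""
    return sum(NUMBER_WORDS[qty] for qty, name in items if name in item_class)


print(count_items(possessions, FRUITS))  # 14
```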

Web of Lies (250 questions): Evaluate the truth value of a Boolean function expressed as a natural-language word problem.

Example: Leda tells the truth. Alexis says Leda lies. Sal says Alexis lies. Phoebe says Sal tells the truth. Gwenn says Phoebe tells the truth. Does Gwenn tell the truth?
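The example above is a chain in which each speaker comments on the previous one, so the answer can be derived by propagating truth values along the chain. The sketch below assumes that chain structure; the function name and the boolean encoding of each claim are illustrative choices.

```python
def last_speaker_truthful(first_truthful: bool, claims: list[bool]) -> bool:
    """Propagate truthfulness along a chain of speakers.

    claims[i] is True if that speaker says the previous speaker tells the
    truth, and False if they say the previous speaker lies.
    """
    truthful = first_truthful
    for says_previous_is_truthful in claims:
        # A speaker is truthful exactly when their claim matches reality.
        truthful = says_previous_is_truthful == truthful
    return truthful


# Leda tells the truth; Alexis says Leda lies; Sal says Alexis lies;
# Phoebe says Sal tells the truth; Gwenn says Phoebe tells the truth.
print(last_speaker_truthful(True, [False, False, True, True]))  # True
```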

To evaluate the trade-offs of using native speech to speech models, we test multiple configurations on Big Bench Audio. To learn more about Big Bench Audio, review the article or download the dataset yourself.

Conversational Dynamics

Overview

Our conversational dynamics benchmark evaluates the ability of native audio models to handle realistic conversational behaviors - the kinds of interactions that occur naturally in human conversation but are challenging for speech models to manage correctly.

This benchmark is implemented by Artificial Analysis based on a subset of Full Duplex Bench v1 (Lin et al., 2025) and Full Duplex Bench v1.5 (Lin et al., 2025), benchmarks that systematically evaluate key interactive behaviors of full duplex spoken dialogue models.

Metrics

From Full Duplex Bench v1:

Pause Handling: Percentage of samples where the model correctly does not interrupt during a user's natural pause. Evaluates whether the model recognizes that the speaker still holds the floor.
Turn Taking: Percentage of samples where the model correctly takes the conversational turn when appropriate. Measures the model's ability to detect turn boundaries and respond promptly.

From Full Duplex Bench v1.5:

User Interruption Handling: Percentage of samples where the model correctly addresses the user's interruption - responding to questions or changes in topic raised mid-conversation.
Backchannel Handling: Percentage of samples where the model correctly continues its response when a backchannel such as "yeah", "alright", or "mm-hmm" is played, rather than treating it as a new turn.
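All four metrics above are reported as the percentage of samples handled correctly. A minimal sketch of that aggregation, assuming each sample has already been judged as a pass or fail (the metric keys and function name are hypothetical):

```python
from collections import defaultdict


def percent_correct(samples):
    """Return, per metric, the percentage of samples judged correct.

    `samples` is an iterable of (metric_name, passed) pairs.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for metric, ok in samples:
        totals[metric] += 1
        if ok:
            passed[metric] += 1
    return {metric: 100.0 * passed[metric] / totals[metric] for metric in totals}


results = [
    ("pause_handling", True),
    ("pause_handling", False),
    ("turn_taking", True),
]
print(percent_correct(results))  # {'pause_handling': 50.0, 'turn_taking': 100.0}
```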

Agentic Performance (𝜏-Voice)

Overview

Our agentic performance benchmark evaluates the ability of speech to speech models to complete realistic customer service tasks end-to-end. This benchmark measures multi-turn instruction following, the ability to support a simulated customer through a complete interaction, and successful tool use against simulated customer service systems.

This benchmark is implemented by Artificial Analysis based on 𝜏-Voice (Ray, Dhandhania, Barres & Narasimhan, 2026), a benchmark by Sierra that evaluates full duplex voice agents on grounded customer service tasks across real-world domains.

Metric

Task Completion (pass@1): Proportion of scenarios where the model correctly resolves the customer's issue. Each score is the mean of three independent trials. The tested model is prompted as a customer support agent with access to domain-specific tools and a policy document. Each scenario has a single valid database end-state; evaluation compares the final state against this ground truth.
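The scoring described above can be sketched as follows, under the assumption that a trial passes when the final database state equals the expected end-state exactly, and that the reported score averages the per-trial pass rates. The function name and the dictionary representation of database state are illustrative, not taken from the 𝜏-Voice implementation.

```python
from statistics import mean


def task_completion(final_states, expected_states, num_trials=3):
    """Mean, over trials, of the share of scenarios resolved correctly.

    final_states[scenario_id] holds one final database state per trial;
    expected_states[scenario_id] is the single valid end-state.
    """
    per_trial_rates = []
    for trial in range(num_trials):
        passed = [
            final_states[scenario][trial] == expected
            for scenario, expected in expected_states.items()
        ]
        per_trial_rates.append(sum(passed) / len(passed))
    return mean(per_trial_rates)


expected = {"a": {"seat": "12A"}, "b": {"refund": 20}}
finals = {
    "a": [{"seat": "12A"}, {"seat": "12A"}, {"seat": "14C"}],
    "b": [{"refund": 20}, {"refund": 0}, {"refund": 0}],
}
print(task_completion(finals, expected))  # 0.5 (trial rates 1.0, 0.5, 0.0)
```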

We evaluate across three domains using the base task set:

Airline (50 scenarios): e.g., changing a flight, rebooking under policy constraints
Retail (114 scenarios): e.g., disputing a charge, processing a return
Telecom (114 scenarios): e.g., resolving a billing issue, troubleshooting a service problem

Voice Personas

Customer voices are generated using ElevenLabs, based on prompts adapted from the publicly available Sierra 𝜏-Voice implementation. We use two control personas representing standard English speakers, and five regular personas representing diverse accents and speaker profiles.

Control

Matt Delaney: Middle-aged white man from the American Midwest, calm and respectful.

Prompt: You are a middle-aged white man from the American Midwest. You always behave as if you are speaking out loud in a real-time conversation with a customer service agent. You are calm, clear, and respectful but also human. You sound like someone who's trying to be helpful and polite, even when you're slightly frustrated or in a hurry. You value efficiency but never sound robotic. You sometimes use contractions, informal phrasing, or small filler phrases ("yeah," "okay," "honestly," "no worries") to keep things natural. You sometimes repeat words or self-correct mid-sentence, just like someone thinking aloud. You sometimes ask polite clarifying questions or offer context ("I tried this earlier," "I'm not sure if that helps"). You rarely use formal, business-like or stiff language ("considerable," "retrieve," "representative"). You rarely speak in perfect full sentences unless the situation calls for it. Instead, you speak like a real person having a practical, respectful conversation.

Lisa Brenner: White woman in her late 40s from a suburban area, tense and impatient.

Prompt: You are a white woman in your late 40s from a suburban area. You always speak as if you are talking out loud to a customer service agent who is already wasting your time. You're not openly hostile (yet), but you are tense, impatient, and clearly annoyed. You act like this issue should have been resolved the first time, and the fact that you're following up is unacceptable. You often sound clipped, exasperated, or sarcastically polite. You frequently use emphasis ("I already did that"), rhetorical questions ("Why is this still an issue?"), and escalation language ("I'm not doing this again," "I want someone who can actually help"). You expect fast results and get irritated when things are repeated. You often mention how long you've been waiting or how many times you've called. You sometimes threaten escalation but without yelling. You never sound relaxed. You never use slow, reflective speech. You never thank the agent unless something gets resolved.

Regular

Mildred Kaplan: Elderly white woman in her early 80s, needs help with technology.

Prompt: You are an elderly white woman in your early 80s calling customer service for help with something your grandson or neighbor usually does.

Arjun Roy: Bengali man from Dhaka in his mid-30s, calm and direct, strong Bengali accent.

Prompt: A Bengali man from Dhaka, Bangladesh in his mid-30s calling customer service about a billing issue. His English carries a strong Bengali accent with soft consonants and soft d and r sounds. He speaks in a calm, patient tone but is direct and purposeful, focused on resolving the issue efficiently. His pacing is slow, distracted with a warm yet firm timbre. The speech sounds like it is coming from far away.

Wei Lin: Chinese woman from Sichuan in her late 20s, upbeat and matter-of-fact, strong Sichuan Mandarin accent.

Prompt: A Chinese woman in her late 20s from Sichuan, calling customer service about a credit card billing issue. She speaks English with a thick Sichuan Mandarin accent. She sounds upbeat, matter-of-fact, and distracted. Her tone is firm but polite, with fast pacing and smooth timbre. Ok audio quality.

Mamadou Diallo: Senegalese man in his mid-30s, hurried, strong French accent.

Prompt: A Senegalese man whose first language is French, in his mid-30s, calling customer service about a billing issue. He speaks English with a strong French accent. His tone is hurried, slightly annoyed, and matter-of-fact, as if he's been transferred between agents and just wants the problem fixed.

Priya Patil: Maharashtrian woman in her early 30s, focused and direct, strong Maharashtrian accent.

Prompt: A woman in her early 30s from Maharashtra, India, calling customer support from her mobile phone. She speaks Indian English with a strong Maharashtrian accent with noticeable regional intonation and rhythm. Her tone is slightly annoyed and hurried, matter-of-fact, and focused on getting the issue resolved quickly. Her voice has medium pitch, firm delivery, short sentences, and faint background room tone typical of a phone call.

Price

Price per Hour of Input Audio: Cost (USD) per hour of audio included in the request or message sent to the API.
Price per Hour of Output Audio: Cost (USD) per hour of audio generated by the model (received from the API).
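As a worked illustration of the per-hour normalization, the sketch below divides a total cost by the audio duration in hours. The function name and the example figures are assumptions chosen for the arithmetic, not measured prices.

```python
def price_per_hour(total_cost_usd: float, audio_seconds: float) -> float:
    """Normalize a total audio cost to USD per hour of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return total_cost_usd / (audio_seconds / 3600.0)


# Illustrative figures only: 0.50 USD for 30 minutes of audio.
print(price_per_hour(0.50, 1800))  # 1.0 USD per hour
```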

Speed

Time to First Audio (TTFA): Average number of seconds required to generate the first token of audio output, measured across the Big Bench Audio question set. TTFA is a critical indicator of perceived responsiveness in voice agent applications.
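One way to measure this, sketched under the assumption that the model's response arrives as a stream of audio chunks: start a timer when the request is issued and stop it at the first non-empty chunk. The `start_stream` callables are hypothetical stand-ins for whatever client call opens a provider's stream.

```python
import time
from statistics import mean
from typing import Callable, Iterable


def time_to_first_audio(start_stream: Callable[[], Iterable[bytes]]) -> float:
    """Seconds from issuing the request to receiving the first audio bytes."""
    start = time.perf_counter()
    for chunk in start_stream():
        if chunk:  # ignore empty keep-alive chunks
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")


def average_ttfa(requests: Iterable[Callable[[], Iterable[bytes]]]) -> float:
    """Average TTFA across a set of requests, e.g. one per benchmark question."""
    return mean(time_to_first_audio(request) for request in requests)
```

Using a monotonic clock such as `time.perf_counter` avoids distortions from system clock adjustments during a run.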

Version History

Big Bench Audio (BBA) v1.2

May 2026—Present

Updated the judge model and grader harness to more reliably recognize correct final answers in verbose responses, including cases where the model discusses incorrect alternatives before giving the correct answer.

Big Bench Audio (BBA) v1.1

March 2026—Present

Accuracy measured as the number of correct answers out of 1,000, including questions where the model did not answer.
Claude Sonnet 4.6 used as the judge model.

Big Bench Audio (BBA) v1.0

December 2024—March 2026

Accuracy measured as the share of non-error responses answered correctly, so models were not penalized for questions where they did not answer.
Claude 3.5 Sonnet used as the judge model.
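The difference between the v1.0 and v1.1 accuracy definitions can be shown with a short sketch. The function names and the example counts are illustrative assumptions; only the two denominators come from the definitions above.

```python
def accuracy_v1_0(correct: int, answered: int) -> float:
    """v1.0: share of non-error responses answered correctly."""
    return correct / answered


def accuracy_v1_1(correct: int, total_questions: int = 1000) -> float:
    """v1.1: correct answers out of all questions, unanswered ones included."""
    return correct / total_questions


# Illustrative counts: 700 correct answers, 100 questions left unanswered.
print(round(accuracy_v1_0(700, 900), 4))  # 0.7778
print(accuracy_v1_1(700))                 # 0.7
```

Under v1.1 a model that fails to answer is scored as if it had answered incorrectly, so scores can only stay the same or fall relative to v1.0.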
