Speech Reasoning Benchmarking Overview
Our Speech Reasoning benchmark evaluates the ability of models that support native audio input and output, referred to as "native audio models", to answer reasoning-based questions.
Native audio models are provided with an input audio file and are expected to generate an output audio file that contains the answer to the question included in the input audio file. No additional information is provided to the native audio model.
The output audio file from the native audio model is transcribed, forming a "candidate answer". This candidate answer is then passed to an automatic evaluation system that uses an AI model as a judge. The judge model is given the candidate answer, the official answer, and the original question as context, and is prompted to label the candidate answer as correct or incorrect.
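As an illustration, the judging step can be sketched as follows. The prompt wording, the one-word reply format, and the parsing rule are hypothetical assumptions, not the exact prompt or logic used by the evaluation system.

```python
def build_judge_prompt(question: str, official_answer: str, candidate_answer: str) -> str:
    # Hypothetical prompt template: the real system's wording is not specified.
    return (
        "You are grading an answer to a reasoning question.\n"
        f"Question: {question}\n"
        f"Official answer: {official_answer}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )

def parse_judge_label(judge_response: str) -> bool:
    # Map the judge model's reply to a boolean correctness label.
    return judge_response.strip().upper().startswith("CORRECT")
```

In a real pipeline, the string returned by `build_judge_prompt` would be sent to the judge model, and `parse_judge_label` would interpret its reply.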
For comparison, we also report Speech Reasoning results for three additional configurations: Speech to Text, Text to Speech, and Text to Text. These configurations are defined as follows:
- Speech to Text: An input audio file is provided and the model generates a text answer.
- Text to Speech: A text version of the question is provided and the model generates an output audio file containing the answer.
- Text to Text: A text version of the question is provided and the model generates a text answer.
All configurations are evaluated using the same automatic evaluation system. For configurations that output text directly, the text response serves as the candidate answer and no post-processing is performed. Evaluation is performed on the Artificial Analysis Big Bench Audio dataset. More information about this dataset can be found here.
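The rule above, transcribing audio outputs while passing text outputs through unchanged, could be expressed as a small dispatch step. The configuration names and the `transcribe` stub below are illustrative assumptions; a real pipeline would call an actual transcription model.

```python
# Configurations whose model output is audio and must be transcribed first
# (names are illustrative, not an official identifier scheme).
AUDIO_OUTPUT_CONFIGS = {"speech_to_speech", "text_to_speech"}

def transcribe(audio_bytes: bytes) -> str:
    # Stub standing in for a speech-to-text model; placeholder logic only.
    return audio_bytes.decode("utf-8")

def candidate_answer(config: str, model_output) -> str:
    # Text-output configurations are used as-is, with no post-processing.
    if config in AUDIO_OUTPUT_CONFIGS:
        return transcribe(model_output)
    return model_output
```

Either way, the resulting string is what the judge model receives as the candidate answer.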