Speech Reasoning Benchmarking Overview
Our Speech Reasoning benchmark evaluates the ability of models that support native audio input and output, referred to as "native audio models", to answer reasoning-based questions.
Native audio models are provided with an input audio file and are expected to generate an output audio file that contains the answer to the question included in the input audio file. No additional information is provided to the native audio model.
The native audio model's output audio file is transcribed to form a "candidate answer". This candidate answer is then passed to an automatic evaluation system that uses an AI model as a judge. The judge model receives the candidate answer, the official answer, and the original question as context, and is prompted to label the candidate answer as correct or incorrect.
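The evaluation flow described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual implementation: the `Item` type, the `transcribe` and `judge` callables, the prompt wording, the one-word CORRECT/INCORRECT reply convention, and the aggregation of labels into a fraction-correct score are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Item:
    """One benchmark example (hypothetical structure)."""
    question: str         # original question, as text
    official_answer: str  # reference answer from the dataset
    output_audio: bytes   # audio returned by the native audio model


def build_judge_prompt(question: str, official_answer: str, candidate_answer: str) -> str:
    # Assumed prompt wording; the real judge prompt is not given in the source.
    return (
        "Decide whether the candidate answer is correct.\n"
        f"Question: {question}\n"
        f"Official answer: {official_answer}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )


def parse_verdict(reply: str) -> bool:
    # Assumed convention: the judge replies "CORRECT" or "INCORRECT".
    return reply.strip().upper().startswith("CORRECT")


def evaluate(
    items: Iterable[Item],
    transcribe: Callable[[bytes], str],  # hypothetical speech-to-text function
    judge: Callable[[str], str],         # hypothetical judge-model call
) -> float:
    """Return the fraction of items the judge labels as correct."""
    verdicts = []
    for item in items:
        # Step 1: transcribe the model's output audio into a candidate answer.
        candidate = transcribe(item.output_audio)
        # Step 2: give the judge the candidate, official answer, and question.
        prompt = build_judge_prompt(item.question, item.official_answer, candidate)
        # Step 3: record the judge's correct/incorrect label.
        verdicts.append(parse_verdict(judge(prompt)))
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```

Passing `transcribe` and `judge` in as callables keeps the sketch independent of any particular speech-to-text or judge model, so either can be swapped for a stub when testing the scoring logic.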
Evaluation is performed on the Artificial Analysis Big Bench Audio dataset. More information about this dataset can be found here.