
Speech to Text Benchmarking Methodology

Key Metrics

Word Error Rate (WER)

Word Error Rate (WER) measures transcription accuracy by comparing model output to a reference transcript (verified human transcription).

Formula:

WER = (Substitutions + Insertions + Deletions) ÷ Words in Reference

Example:

  • Reference: the cat sat on the mat
  • Hypothesis: cat is on the big mat
  • Errors: deletion (the), substitution (sat → is), insertion (big)
  • WER = 3 ÷ 6 = 50%

Further detail in the Word Error Rate (WER) Evaluation section.
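
As an illustrative sketch, the same calculation can be reproduced with the jiwer library (the library referenced in the WER Calculation section below), applied to the example reference and hypothesis:

    import jiwer

    reference = "the cat sat on the mat"
    hypothesis = "cat is on the big mat"

    # jiwer counts the substitutions, insertions and deletions between the two
    # word sequences and divides by the number of reference words (6),
    # giving 3 / 6 = 0.5.
    print(jiwer.wer(reference, hypothesis))  # 0.5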

Speed Factor

Indicates how quickly an API transcribes audio compared to the actual length of that audio.

  • A value above 1 means the service transcribes faster than real time (e.g. 2.0 means 10 minutes of audio can be transcribed in 5 minutes).

Formula:

Speed Factor = Audio Duration ÷ API Response Time
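
A minimal sketch of the same ratio, using the hypothetical 10-minute / 5-minute figures from the example above:

    # Hypothetical example: 10 minutes (600 s) of audio transcribed in 300 s.
    audio_duration_s = 600.0      # real-time length of the audio file
    api_response_time_s = 300.0   # wall-clock time the API took to return the transcript

    speed_factor = audio_duration_s / api_response_time_s
    print(speed_factor)  # 2.0 -> twice as fast as real time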

Price per 1,000 Minutes

Represents the cost of transcribing 1,000 minutes of audio (real-time duration).

Note:

  • Some providers (e.g. Replicate, fal) charge by processing time instead of audio duration.
  • For each model offered by these providers, we:
    • Test ~100 varied audio files.
    • Calculate average processing time per audio minute.
    • Multiply that time by the provider's per-minute processing cost.
    • Scale the resulting cost per audio minute to 1,000 minutes (see the sketch below).
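
A minimal sketch of this conversion, with hypothetical figures for the measured processing time and the provider's rate:

    # Hypothetical figures for a provider that bills by processing time.
    avg_processing_s_per_audio_min = 6.0  # average over ~100 varied test files
    price_per_processing_min = 0.01       # provider's rate per minute of processing (USD)

    cost_per_audio_min = (avg_processing_s_per_audio_min / 60.0) * price_per_processing_min
    price_per_1000_audio_min = cost_per_audio_min * 1000
    print(round(price_per_1000_audio_min, 2))  # 1.0 -> $1.00 per 1,000 audio minutes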

Word Error Rate (WER) Evaluation

Artificial Analysis conducts independent WER testing across all model-provider pairs.

Key details:

  • API endpoints tested directly to reflect end-user performance.
  • Current evaluation: ~2 hours of audio across three datasets (details below).
  • Files ≥5s in length. Chunking (splitting of audio to enable processing of longer files) depends on model limits (see the sketch after this list):
    • 10-minute limit models (Whisper L v2, Deepgram, GPT-4o Transcribe, GPT-4o Mini, Voxtral Mini, Gemini 2.5 Flash Lite): chunked into 9-minute segments.
    • Shorter-limit models (Qwen3 ASR Flash, Granite 3.3): chunked into 30-second segments.
  • For models with multiple providers (e.g. Whisper), a median WER is reported if results are within 0.4% variance.
  • When models allow prompts, we provide: "Transcribe the audio verbatim, outputting only spoken words in sequence."
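
As an illustrative sketch of the chunking step (chunk_boundaries is a hypothetical helper; the 540-second default corresponds to the 9-minute segments described above):

    def chunk_boundaries(file_duration_s: float, chunk_s: float = 540.0):
        """Return (start, end) offsets in seconds for each fixed-length chunk."""
        bounds = []
        start = 0.0
        while start < file_duration_s:
            bounds.append((start, min(start + chunk_s, file_duration_s)))
            start += chunk_s
        return bounds

    # A 23-minute earnings call (1,380 s) is split into three segments:
    print(chunk_boundaries(1380.0))  # [(0.0, 540.0), (540.0, 1080.0), (1080.0, 1380.0)]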

Datasets and samples

  1. AMI: Multi-speaker meeting recordings from a single distant microphone, with less structured dialogue in noisy environments

    • Subset: SDM
    • Number of samples: 968
    • Sample duration range: 5-24 seconds
    • Total duration: 120 minutes
  2. Vox Populi: European parliamentary proceedings

    • Subset: English
    • Number of samples: 636
    • Sample duration range: 5-39 seconds
    • Total duration: 120 minutes
  3. Earnings 22: Corporate earnings calls with technical language and overlapping speakers

    • Subset: Full
    • Number of samples: 17
    • Sample duration range: 14-23 minutes
    • Total duration: 118.8 minutes

Normalization Process

Before comparison, both the reference (ground truth) and the model transcription hypothesis are normalized using OpenAI's Whisper normalizer:

  • Lowercase all text; remove bracketed text and filler words ("uh", "um"); normalize whitespace
  • Expand contractions (won't → will not, I'm → I am)
  • Standardize numbers (twenty five → 25, two point five → 2.5)
  • Normalize punctuation/symbols (remove non-semantic marks, standardize currency/percentages)
  • Standardize spelling (e.g. colour → color)

Example:

  • Ground truth: Hello, I'm Dr. Smith. I have twenty-five patients today.
  • Model output: hello i am doctor smith i have 25 patients today
  • After normalization: hello i am doctor smith i have 25 patients today
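
A minimal sketch of this step, assuming the English normalizer shipped with the open-source openai-whisper package, applied to the example above:

    from whisper.normalizers import EnglishTextNormalizer

    normalize = EnglishTextNormalizer()

    ground_truth = "Hello, I'm Dr. Smith. I have twenty-five patients today."
    model_output = "hello i am doctor smith i have 25 patients today"

    # Both sides are normalized before WER is computed, so differences in case,
    # punctuation, contractions and spelled-out numbers do not count as errors.
    print(normalize(ground_truth))  # hello i am doctor smith i have 25 patients today
    print(normalize(model_output))  # hello i am doctor smith i have 25 patients today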

WER Calculation

  • WER is based on Levenshtein distance, which aligns two sequences by finding the optimal number of substitutions, insertions, and deletions required to transform one sequence into the other.
  • In this context, it measures the edits needed to change the model's transcription (hypothesis) back into the verified human transcript (reference).
  • For implementation we use the widely adopted jiwer library, referenced by organizations including OpenAI.
  • Results are aggregated as an audio-duration-weighted average WER so that numerous short clips do not bias results compared to longer files.
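
A minimal sketch of the duration-weighted aggregation, using hypothetical per-file results (each per-file WER would come from jiwer as above):

    # Hypothetical per-file results: (audio duration in seconds, WER).
    results = [
        (20.0, 0.12),    # short AMI clip
        (35.0, 0.08),    # VoxPopuli clip
        (1380.0, 0.15),  # long earnings call
    ]

    # Weight each file's WER by its audio duration so that many short clips
    # do not outweigh a small number of long files.
    total_duration = sum(duration for duration, _ in results)
    weighted_wer = sum(duration * wer for duration, wer in results) / total_duration
    print(round(weighted_wer, 4))  # ~0.1479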

Testing Methodology

  • Benchmarks run four times daily.
  • Metrics reported as median of prior 14 days.
  • When polling for results, we follow the provider's client library where available and otherwise poll every second (see the sketch after this list). For 10-minute files, a 1-second variation has a negligible impact.
  • Only file-based transcription tested (not live-audio streaming).
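
As an illustrative sketch of the one-second polling fallback (get_job_status is a hypothetical stand-in for whichever status call a given provider exposes):

    import time

    def wait_for_transcript(get_job_status, poll_interval_s: float = 1.0):
        """Poll a job-status function once per second until the transcript is ready."""
        while True:
            status, transcript = get_job_status()
            if status == "completed":
                return transcript
            time.sleep(poll_interval_s)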

Model Inclusion Criteria

Our objective is to represent the most popular and best-performing speech to text models to support end-users in choosing which models and providers to use. As such, we apply an 'industry significance' and competitive performance test when evaluating new models and providers for inclusion. We are in the process of refining these criteria and welcome feedback and suggestions. To suggest models, please contact us via the contact page.

Statement of Independence

Benchmarking is conducted with strict independence and objectivity. No compensation is received from any providers for listing or favorable outcomes on Artificial Analysis.
