Speech to Text (ASR): Leaderboard & Comparison

Analysis and comparison of speech-to-text transcription models and API providers. Artificial Analysis evaluates speech-to-text models and hosting providers on word error rate (lower is better), speed, and price. Speed is reported as 'Speed Factor', the number of audio seconds transcribed per second of processing time (higher is better). For further details, see our methodology page.

Speech-to-text models & providers compared (model, provider): Whisper (L, v2), OpenAI; Universal-1, AssemblyAI; Speechmatics Standard; Whisper (L, v2), Azure; Speechmatics Enhanced; Nano, AssemblyAI; Incredibly Fast Whisper, Replicate; Whisper (L, v2), Replicate; Nova-2, Deepgram; Base, Deepgram; Whisper (L, v3), Replicate; WhisperX, Replicate; Whisper (L, v2), Deepgram; Gladia; Whisper (M), Replicate; Whisper (S), Replicate; Whisper (L, v3), fal.ai; Amazon Transcribe; Rev AI; and Chirp, Google.

Highlights

Word error rate
Word error rate: % of words transcribed incorrectly, Lower is better
Speed Factor
Speed factor: Input audio seconds transcribed per second, Higher is better
Price
Price: USD per 1000 minutes of audio, Lower is better

Summary analysis

Word Error Rate vs. Price

Word error rate: % of words transcribed incorrectly, Price: USD per 1000 minutes of audio
Most attractive quadrant
Size represents Speed factor: Input audio seconds transcribed per second
Word Error Rate: Percentage of words incorrect in the transcription.
Price: Cost in USD per 1000 minutes of audio transcribed. Reflects the pricing model of the transcription service or software.
Artificial Analysis' independent evaluation is based on Common Voice v16.1, Mozilla's leading open-source speech-to-text dataset. Further detail is available on the methodology page.
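
As a rough illustration of the word error rate metric, the sketch below uses the conventional definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The lowercasing and whitespace tokenization are assumptions made for this example; the normalization actually applied in the evaluation is described on the methodology page and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level Levenshtein distance."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substituted word out of six reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sat on a mat"), 1))  # 16.7
```

Because insertions count as errors, a WER computed this way can exceed 100% when the hypothesis is much longer than the reference.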

Word Error Rate vs. Speed Factor

Word error rate: % of words transcribed incorrectly, Speed factor: Input audio seconds transcribed per second
Most attractive quadrant
Size represents Price: USD per 1000 minutes of audio
Speed Factor: Audio file seconds transcribed per second of processing time. A higher factor indicates faster transcription.
Artificial Analysis' measurements are based on an audio duration of 10 minutes. Speed Factor may vary for other durations, particularly for very short durations (under 1 minute).
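
A minimal sketch of the Speed Factor calculation follows. The `transcribe` callable and the function names are hypothetical stand-ins for whichever provider client is being timed; they are not part of any specific API, and the figures in the example are made up.

```python
import time
from typing import Callable


def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Audio seconds transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def measure_speed_factor(
    transcribe: Callable[[str], str], audio_path: str, audio_seconds: float
) -> float:
    """Time one call to a (hypothetical) transcribe function and return its Speed Factor."""
    start = time.perf_counter()
    transcribe(audio_path)
    elapsed = time.perf_counter() - start
    return speed_factor(audio_seconds, elapsed)


# A 10-minute (600 s) file transcribed in 12 s of processing time.
print(speed_factor(600, 12))  # 50.0
```

For an API provider, the wall-clock time measured this way also includes network and queueing overhead, which is one reason very short files can show a lower Speed Factor.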

Speed Factor vs. Price

Speed factor: Input audio seconds transcribed per second, Price: USD per 1000 minutes of audio

Word Error Rate

Word error rate: % of words transcribed incorrectly, Lower is better

Speed Factor

Speed factor: Input audio seconds transcribed per second, Higher is better

Speed Factor, Variance

Speed factor: Input audio seconds transcribed per second, Results by percentile, Higher is better
Median, Other points represent 5th, 25th, 75th, 95th Percentiles respectively
Boxplot: Shows variance of measurements
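
The percentile points shown in a boxplot of this kind could be derived from raw Speed Factor measurements roughly as follows. The interpolation method (`"inclusive"`) and the sample values are assumptions for illustration; the source does not state how the percentiles are computed.

```python
from statistics import quantiles


def speed_factor_percentiles(measurements: list[float]) -> dict[str, float]:
    """Return the 5th, 25th, 50th (median), 75th and 95th percentiles."""
    if len(measurements) < 2:
        raise ValueError("need at least two measurements")
    # quantiles(..., n=100) returns 99 cut points; cut point k-1 is the k-th percentile.
    cuts = quantiles(measurements, n=100, method="inclusive")
    return {
        "p5": cuts[4],
        "p25": cuts[24],
        "median": cuts[49],
        "p75": cuts[74],
        "p95": cuts[94],
    }


# Made-up Speed Factor samples for a single model/provider pair.
samples = [41.0, 44.5, 47.2, 48.0, 49.1, 50.3, 52.8, 58.4]
print(speed_factor_percentiles(samples))
```

A wide gap between the 5th and 95th percentiles indicates that transcription speed for that provider varies considerably between requests.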

Speed Factor, Over Time

Over time measurement: Median measurement per day, based on 8 measurements each day at different times. Labels represent start of week's measurements.
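
The daily aggregation described above can be sketched as grouping timestamped measurements by calendar day and taking the median of each group. The data layout (a list of timestamp and value pairs) and the sample values are assumptions for this example.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import median


def daily_medians(samples: list[tuple[datetime, float]]) -> dict[date, float]:
    """Group Speed Factor samples by calendar day and return each day's median."""
    by_day: dict[date, list[float]] = defaultdict(list)
    for timestamp, value in samples:
        by_day[timestamp.date()].append(value)
    return {day: median(values) for day, values in sorted(by_day.items())}


# Eight made-up measurements taken every three hours on one day.
values = [40.0, 42.0, 44.0, 46.0, 48.0, 50.0, 52.0, 54.0]
day_samples = [(datetime(2024, 5, 1, 3 * i), v) for i, v in enumerate(values)]
print(daily_medians(day_samples))  # {datetime.date(2024, 5, 1): 47.0}
```

Using the median rather than the mean keeps a single unusually slow or fast request from dominating that day's point on the chart.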

Price

Price: USD per 1000 minutes of audio, Lower is better
For providers that price on processing time rather than audio duration (including Replicate and fal.ai), we have calculated an indicative per-minute price based on the processing time expected per minute of audio. Further detail is available on the methodology page.
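
One plausible way to derive such an indicative price is sketched below: convert the Speed Factor into processing seconds per audio minute, then multiply by the provider's per-second compute price. This formula and the example rate are assumptions for illustration only; the calculation actually used is documented on the methodology page.

```python
def indicative_price_per_1000_minutes(
    price_per_processing_second: float, speed_factor: float
) -> float:
    """Estimate USD per 1000 audio minutes for a provider billed on processing time."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    # One audio minute (60 s) takes 60 / speed_factor seconds of processing.
    processing_seconds_per_audio_minute = 60.0 / speed_factor
    price_per_audio_minute = price_per_processing_second * processing_seconds_per_audio_minute
    return price_per_audio_minute * 1000


# Hypothetical rate of $0.001 per processing second at a Speed Factor of 50.
print(round(indicative_price_per_1000_minutes(0.001, 50), 2))  # 1.2
```

Under this assumption, the indicative price falls as the Speed Factor rises, so the same model can cost noticeably different amounts on different hardware or hosts.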
Summary of key metrics & further information

Model | Host
Whisper (large-v2) | OpenAI
Whisper (large-v2) | Microsoft Azure
Incredibly Fast Whisper | Replicate
Whisper (large-v2) | Replicate
Whisper (large-v3) | Replicate
WhisperX | Replicate
Whisper (medium) | Replicate
Whisper (small) | Replicate
Whisper | fal.ai
AssemblyAI (Universal-1) | AssemblyAI
Nano | AssemblyAI
Speechmatics Standard | Speechmatics
Speechmatics Enhanced | Speechmatics
Nova-2 | Deepgram
Base | Deepgram
Whisper Large v2 | Deepgram
Gladia | Gladia
Amazon Transcribe | Amazon Bedrock
Rev AI | Rev AI
Cloud Speech-To-Text (Chirp) | Google