Speech to Text (ASR): Leaderboard & Comparison
Analysis and comparison of speech-to-text transcription models and API providers. Artificial Analysis has analyzed speech-to-text models and hosting providers on three characteristics: word error rate (lower is better), speed, and price. Speed is represented by 'Speed Factor', the number of seconds of audio transcribed per second of processing time (higher is better). For further details, see our methodology page.
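
As a rough illustration of the two metrics defined above, the sketch below computes a word error rate and a Speed Factor. It assumes the conventional definition of word error rate (word-level edit distance divided by the number of reference words) and a simple lowercase, whitespace-split normalization. The function names and the normalization are illustrative assumptions, not the procedure described on the methodology page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: text is lowercased and split on whitespace; a real
    evaluation would likely also normalize punctuation and numerals.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # A 10-minute file transcribed in 10 seconds.
    print(speed_factor(600, 10))
```

Under these assumptions, a lower `word_error_rate` and a higher `speed_factor` are better, matching the direction of the metrics on this page.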
Speech-to-text models & providers compared: Whisper (L, v2) (OpenAI); Universal-1 (AssemblyAI); Speechmatics Standard; Whisper (L, v2) (Azure); Speechmatics Enhanced; Nano (AssemblyAI); Incredibly Fast Whisper (Replicate); Whisper (L, v2) (Replicate); Nova-2 (Deepgram); Base (Deepgram); Whisper (L, v3) (Replicate); WhisperX (Replicate); Whisper (L, v2) (Deepgram); Gladia; Whisper (M) (Replicate); Whisper (S) (Replicate); Whisper (L, v3) (fal.ai); Amazon Transcribe; Rev AI; and Chirp (Google).
[Charts: Summary analysis (Word Error Rate vs. Price; Word Error Rate vs. Speed Factor; Speed Factor vs. Price); Quality (Word Error Rate); Speed (Speed Factor; Speed Factor, Variance; Speed Factor, Over Time); Price (Price)]
Host | Model | Whisper version | Word Error Rate (%) | Median Speed Factor | Price (USD per 1000 minutes) |
---|---|---|---|---|---|
OpenAI | Whisper (large-v2) | large-v2 | 9.2 | 35.5 | 6.00 |
Microsoft Azure | Whisper (large-v2) | large-v2 | 9.2 | 37.3 | 6.00 |
Replicate | Incredibly Fast Whisper | large-v3 | 8.2 | 43.6 | 1.49 |
Replicate | Whisper (large-v2) | large-v2 | 9.2 | 2.4 | 3.47 |
Replicate | Whisper (large-v3) | large-v3 | 8.4 | 2.7 | 4.23 |
Replicate | WhisperX | large-v3 | 9.7 | 12.3 | 1.09 |
Replicate | Whisper (medium) | medium | 10.7 | 5.4 | 2.68 |
Replicate | Whisper (small) | small | 15.0 | 18.1 | 1.37 |
fal.ai | Whisper (large-v3) | large-v3 | 8.2 | 109.6 | 1.15 |
AssemblyAI | AssemblyAI (Universal-1) | | 7.2 | 43.0 | 6.17 |
AssemblyAI | Nano | | 10.5 | 54.0 | 2.00 |
Speechmatics | Speechmatics Standard | | 10.5 | 17.7 | 13.33 |
Speechmatics | Speechmatics Enhanced | | 6.5 | 9.2 | 17.33 |
Deepgram | Nova-2 | | 11.5 | 178.5 | 4.30 |
Deepgram | Base | | 22.9 | 207.0 | 12.50 |
Deepgram | Whisper Large v2 | large-v2 | 9.2 | 61.0 | 4.80 |
Gladia | Gladia | whisper-v2 based | 10.7 | 10.0 | 10.20 |
Amazon Bedrock | Amazon Transcribe | | 9.9 | 23.8 | 24.00 |
Rev AI | Rev AI | | 18.1 | 9.0 | 20.00 |
Google | Cloud Speech-To-Text (Chirp) | | 10.8 | 14.1 | 16.00 |
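
To make the table's units concrete, the sketch below turns a price in USD per 1000 minutes and a median Speed Factor into an estimated cost and processing time for a given amount of audio. The three rows are copied from the table above. The function names, the 50-hour workload, and the assumption that audio is processed sequentially at the median Speed Factor are illustrative only; real throughput depends on file length, concurrency, and provider load.

```python
def transcription_cost_usd(audio_minutes: float, price_per_1000_minutes: float) -> float:
    """Cost of transcribing `audio_minutes` at a price quoted per 1000 minutes."""
    return audio_minutes / 1000 * price_per_1000_minutes


def estimated_processing_seconds(audio_minutes: float, speed_factor: float) -> float:
    """Processing time if the audio is handled sequentially at `speed_factor`.

    Speed Factor is seconds of audio transcribed per second of processing,
    so processing time is audio duration divided by the factor.
    """
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_minutes * 60 / speed_factor


# (host, model): (median Speed Factor, price in USD per 1000 minutes),
# values taken from the table above.
ROWS = {
    ("Deepgram", "Nova-2"): (178.5, 4.30),
    ("Speechmatics", "Speechmatics Enhanced"): (9.2, 17.33),
    ("fal.ai", "Whisper (large-v3)"): (109.6, 1.15),
}


def summarize(audio_minutes: float) -> list[tuple[str, str, float, float]]:
    """Return (host, model, cost in USD, processing seconds) for each row."""
    return [
        (
            host,
            model,
            transcription_cost_usd(audio_minutes, price),
            estimated_processing_seconds(audio_minutes, factor),
        )
        for (host, model), (factor, price) in ROWS.items()
    ]


if __name__ == "__main__":
    # Hypothetical workload: 50 hours of audio (3000 minutes).
    for host, model, cost, seconds in summarize(50 * 60):
        print(f"{host} / {model}: ${cost:.2f}, about {seconds / 60:.1f} min of processing")
```

Because the Speed Factor in the table is a median, the processing-time figure is best read as a rough central estimate rather than a guarantee.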