Speech to Text AI Model & Provider Leaderboard
Analysis and comparison of speech-to-text transcription models and API providers. Artificial Analysis has evaluated speech-to-text models and hosting providers on word error rate (lower is better), speed, and price. Speed is reported as 'Speed Factor', the number of seconds of audio transcribed per second of processing time (higher is better). For further details, see our methodology page.
Speech-to-text models & providers compared (model, provider): Whisper (L, v2), OpenAI; Universal-1, AssemblyAI; Speechmatics Standard; Whisper (L, v2), Azure; Azure Speech Service; Speechmatics Enhanced; Nano, AssemblyAI; Wizper (L, v3), fal.ai; Incredibly Fast Whisper, Replicate; Nova-2, Deepgram; Whisper (L, v2), Replicate; Whisper (L, v3), Replicate; Base, Deepgram; WhisperX, Replicate; Whisper (L, v2), Deepgram; Gladia; Whisper (M), Replicate; Whisper (S), Replicate; Whisper (L, v3), Groq; Distil-Whisper, Groq; Whisper (L, v3), Deepinfra; Whisper (L, v3), fal.ai; Distil-Whisper, Deepinfra; Whisper Large v3 Turbo, Groq; Whisper (L, v3), Fireworks; Whisper Large v3 Turbo, Fireworks; Amazon Transcribe; Rev AI; and Chirp, Google.
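The two headline metrics can be made concrete with a short sketch. It assumes the standard definition of word error rate (word-level edit distance divided by the number of reference words) and treats Speed Factor as audio duration divided by processing time. The function names are illustrative, and any text normalization applied by the leaderboard's own methodology is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
# A 60-second clip transcribed in 2 seconds.
factor = speed_factor(60.0, 2.0)
```

Under these definitions a lower `wer` and a higher `factor` are better, matching the direction of the columns in the table.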
Highlights
Summary analysis (charts omitted): Word Error Rate vs. Price; Word Error Rate vs. Speed Factor; Speed Factor vs. Price
Quality (chart omitted): Word Error Rate
Speed (charts omitted): Speed Factor; Speed Factor, Variance; Speed Factor, Over Time
Price (chart omitted): Price
Note: Groq charges for a minimum of 10 seconds per request.
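The effect of a per-request minimum on short clips can be sketched as follows. This is a rough illustration only: `billed_seconds` and `request_cost` are hypothetical helpers, the price is assumed to be prorated per second from a USD-per-1000-minutes rate, and only the 10-second minimum comes from the note above. Any other rounding rules a provider applies are not modeled.

```python
def billed_seconds(audio_seconds: float, minimum_seconds: float = 10.0) -> float:
    """Duration charged for one request, given a per-request minimum."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return max(audio_seconds, minimum_seconds)


def request_cost(
    clip_durations: list[float],
    price_per_1000_minutes: float,
    minimum_seconds: float = 10.0,
) -> float:
    """Total cost in USD when each clip is sent as a separate request."""
    price_per_second = price_per_1000_minutes / (1000 * 60)
    total_billed = sum(billed_seconds(d, minimum_seconds) for d in clip_durations)
    return total_billed * price_per_second


# Two requests: a 3-second clip (billed as 10 seconds) and a 45-second clip.
# The rate of 1.85 USD per 1000 minutes is an example value taken from the
# table's Groq Whisper (large-v3) row.
cost = request_cost([3.0, 45.0], price_per_1000_minutes=1.85)
```

For workloads made up of many very short clips, the billed duration can be several times the actual audio duration, so the effective price per minute is higher than the listed rate.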
Provider | Model | Whisper version | Word Error Rate (%) | Median Speed Factor | Price (USD per 1000 minutes) |
---|---|---|---|---|---|
OpenAI | Whisper (large-v2) | large-v2 | 10.6 | 34.6 | 6.00 |
Microsoft Azure | Whisper (large-v2) | large-v2 | 10.6 | 37.1 | 6.00 |
fal.ai | Wizper (Large v3) | large-v3 | 10.3 | 223.7 | 0.50 |
Replicate | Incredibly Fast Whisper | large-v3 | 10.3 | 44.3 | 1.49 |
Replicate | Whisper (large-v2) | large-v2 | 11.2 | 3.3 | 3.47 |
Replicate | Whisper (large-v3) | large-v3 | 10.3 | 3.0 | 4.23 |
Replicate | WhisperX | large-v3 | 10.9 | 21.2 | 1.09 |
Replicate | Whisper (medium) | medium | 12.8 | 3.9 | 2.68 |
Replicate | Whisper (small) | small | 17.0 | 2.5 | 1.37 |
Groq | Whisper (large-v3) | large-v3 | 10.3 | 163.6 | 1.85 |
Groq | Distil-Whisper | | 13.0 | 201.3 | 0.33 |
Deepinfra | Whisper (large-v3) | large-v3 | 10.3 | 119.1 | 0.45 |
fal.ai | Whisper (large-v3) | large-v3 | 10.3 | 92.2 | 1.15 |
Deepinfra | Distil-Whisper | | 13.0 | 170.6 | 0.18 |
Groq | Whisper (large-v3 Turbo) | v3 Turbo | 12.0 | 187.4 | 0.67 |
Fireworks | Whisper (large-v3) | large-v3 | 0.0 | 187.4 | 0.00 |
Fireworks | Whisper (large-v3 Turbo) | v3 Turbo | 0.0 | 251.4 | 0.00 |
AssemblyAI | AssemblyAI (Universal-1) | | 8.7 | 57.9 | 6.17 |
AssemblyAI | Nano | | 12.7 | 81.3 | 2.00 |
Speechmatics | Speechmatics Standard | | 12.6 | 17.7 | 13.33 |
Speechmatics | Speechmatics Enhanced | | 8.6 | 9.1 | 17.33 |
Microsoft Azure | Azure Speech Service | | 12.6 | 2.0 | 16.67 |
Deepgram | Nova-2 | | 15.1 | 133.6 | 4.30 |
Deepgram | Base | | 26.1 | 167.3 | 12.50 |
Deepgram | Whisper Large v2 | large-v2 | 10.6 | 29.6 | 4.80 |
Gladia | Gladia | whisper-v2-variant | 12.9 | 20.2 | 10.20 |
Amazon Bedrock | Amazon Transcribe | | 11.2 | 23.2 | 24.00 |
Rev AI | Rev AI | | 0.0 | 9.0 | 20.00 |
Google | Cloud Speech-To-Text (Chirp) | | 12.4 | 14.9 | 16.00 |
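To read the price and speed columns together, the sketch below turns a row's values into an estimated cost and processing time for a batch of audio. It is a back-of-the-envelope estimate under stated assumptions: `estimate_job` is a hypothetical helper, the audio is processed sequentially at the median Speed Factor, and per-request minimums, network overhead, and concurrency limits are ignored.

```python
from typing import NamedTuple


class JobEstimate(NamedTuple):
    cost_usd: float
    processing_seconds: float


def estimate_job(
    audio_hours: float,
    median_speed_factor: float,
    price_per_1000_minutes: float,
) -> JobEstimate:
    """Estimate cost and processing time for a batch of audio."""
    if median_speed_factor <= 0:
        raise ValueError("median_speed_factor must be positive")
    audio_minutes = audio_hours * 60
    audio_seconds = audio_minutes * 60
    # Price is quoted per 1000 minutes of audio.
    cost = audio_minutes / 1000 * price_per_1000_minutes
    # Speed Factor is audio seconds per processing second.
    processing = audio_seconds / median_speed_factor
    return JobEstimate(cost_usd=cost, processing_seconds=processing)


# Ten hours of audio, using example values from the fal.ai Wizper (Large v3)
# row: Speed Factor 223.7, price 0.50 USD per 1000 minutes.
estimate = estimate_job(10, median_speed_factor=223.7, price_per_1000_minutes=0.50)
```

Because the Speed Factor in the table is a median, individual requests may run faster or slower than this estimate suggests.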