Speech to Text AI Model & Provider Leaderboard
Analysis and comparison of speech-to-text transcription models and API providers. Artificial Analysis has analyzed speech-to-text models and hosting providers across characteristics including word error rate (lower is better), speed, and price. Speed is represented by 'Speed Factor', the number of audio seconds transcribed per second of processing time (higher is better). For further details, see our methodology page.
Speech-to-text models & providers compared: Whisper (L, v2), OpenAI; Universal-1; Standard; Whisper (L, v2), Azure; Enhanced; Nano; Wizper (L, v3), fal.ai; Incredibly Fast Whisper, Replicate; Nova-2; Whisper (L, v2), Replicate; Whisper (L, v3), Replicate; Base; WhisperX, Replicate; Whisper (L, v2), Deepgram; Whisper (L, v3), Groq; Distil-Whisper, Groq; Whisper (L, v3), fal.ai; Whisper (L, v3), Deepinfra; Whisper (L, v3, Turbo), Groq; Whisper (L, v3), Fireworks; Whisper (L, v3, Turbo), Fireworks; Universal-2; Amazon Transcribe; Fish Speech to Text; Nova-3; Chirp; Chirp 2; Scribe; GPT-4o Transcribe; and GPT-4o Mini Transcribe.
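
The two headline metrics can be sketched in a few lines of Python. Speed Factor follows the definition given above. The word error rate function is an assumption: it uses the common definition (word-level edit distance divided by the number of reference words) with only lowercasing and whitespace splitting, whereas the text normalization actually used for this leaderboard is described on the methodology page and may differ. The function names are illustrative.

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Audio seconds transcribed per second of processing (higher is better)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count (lower is better).

    Assumption: only lowercasing and whitespace tokenization are applied.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(speed_factor(600, 10))  # 60.0
print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```

A 10-minute file transcribed in 10 seconds gives a Speed Factor of 60, and one wrong word out of four gives a word error rate of 25%.
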
Highlights
Summary Analysis
[Charts: Word Error Rate vs. Price; Word Error Rate vs. Speed Factor; Speed Factor vs. Price]
Quality
[Chart: Word Error Rate]
Speed
[Charts: Speed Factor; Speed Factor, Variance; Speed Factor, Over Time]
Price
[Chart: Price]
Note: Groq charges for a minimum of 10 seconds per request.
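
A minimum billed duration matters most for short clips. The sketch below estimates cost under two assumptions that are not stated in the source: each request is billed on its own audio duration, and the only adjustment is rounding short requests up to the minimum. The function name and the price of 1.00 USD per 1000 minutes are hypothetical.

```python
from typing import Iterable


def billed_cost_usd(
    clip_seconds: Iterable[float],
    price_per_1000_minutes: float,
    min_billed_seconds: float = 10.0,
) -> float:
    """Estimate cost when each request is billed for at least min_billed_seconds."""
    billed_seconds = sum(max(seconds, min_billed_seconds) for seconds in clip_seconds)
    billed_minutes = billed_seconds / 60.0
    return billed_minutes / 1000.0 * price_per_1000_minutes


clips = [3.0, 5.0, 120.0]  # one request per clip
hypothetical_price = 1.00  # USD per 1000 minutes, illustrative only

with_minimum = billed_cost_usd(clips, hypothetical_price)
without_minimum = billed_cost_usd(clips, hypothetical_price, min_billed_seconds=0.0)
print(with_minimum, without_minimum)
```

Here the three clips total 128 seconds of audio but are billed as 140 seconds, because the 3-second and 5-second requests are each rounded up to 10 seconds.
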

| Provider | Model | Whisper version | Word Error Rate | Median Speed Factor | Price (USD per 1000 minutes) |
|---|---|---|---|---|---|
| | Whisper Large v2 | large-v2 | 10.6% | 34.6 | 6.00 |
| | Whisper Large v2 | large-v2 | 10.6% | 34.3 | 6.00 |
| | Whisper Large v3 | large-v3 | 10.3% | 242.8 | 0.50 |
| | Incredibly Fast Whisper | large-v3 | 10.3% | 53.8 | 1.49 |
| | Whisper Large v2 | large-v2 | 11.2% | 2.5 | 3.47 |
| | Whisper Large v3 | large-v3 | 10.3% | 2.8 | 4.23 |
| | WhisperX | large-v3 | 10.9% | 33.9 | 1.09 |
| | Whisper Large v3 | large-v3 | 10.3% | 268.2 | 1.85 |
| | Distil-Whisper | | 13.0% | 317.6 | 0.33 |
| | Whisper Large v3 | large-v3 | 10.3% | 88.1 | 1.15 |
| | Whisper Large v3 | large-v3 | 10.3% | 117.9 | 0.45 |
| | Whisper Large v3 Turbo | v3 Turbo | 12.0% | 302.0 | 0.67 |
| | Whisper Large v3 | large-v3 | 11.2% | 216.5 | 1.00 |
| | Whisper Large v3 Turbo | v3 Turbo | 13.7% | 360.2 | 1.00 |
| | Universal-1 | | 8.7% | 83.5 | 6.17 |
| | Nano | | 12.7% | 82.2 | 2.00 |
| | Universal-2 | | 8.6% | 82.4 | 6.17 |
| | Standard | | 12.6% | 17.6 | 13.33 |
| | Enhanced | | 8.6% | 5.4 | 17.33 |
| | Nova-2 | | 15.1% | 137.1 | 4.30 |
| | Base | | 26.1% | 171.1 | 12.50 |
| | Whisper Large v2 | large-v2 | 10.6% | 28.6 | 4.80 |
| | Nova-3 | | 12.8% | 152.8 | 4.30 |
| | Amazon Transcribe | | 11.2% | 16.7 | 24.00 |
| | Fish Speech to Text | | 19.1% | 24.9 | 0.00 |
| | Chirp | | 12.4% | 15.4 | 16.00 |
| | Chirp 2 | | 9.8% | 73.4 | 16.00 |
| | Scribe | | 7.7% | 50.2 | 6.67 |
| | GPT-4o Transcribe | | 8.9% | 36.0 | 6.00 |
| | GPT-4o Mini Transcribe | | 13.2% | 44.9 | 3.00 |
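
One way to read the word error rate versus price trade-off is to keep only entries that no other entry beats on both metrics at once. The sketch below applies that filter (a Pareto frontier, which is this example's choice and not a method described in the source) to six rows copied from the table. The class and function names are illustrative.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    model: str
    wer_percent: float
    speed_factor: float
    price_usd_per_1000_min: float


# Six rows copied from the table above.
ENTRIES = [
    Entry("Scribe", 7.7, 50.2, 6.67),
    Entry("Universal-2", 8.6, 82.4, 6.17),
    Entry("GPT-4o Transcribe", 8.9, 36.0, 6.00),
    Entry("Chirp 2", 9.8, 73.4, 16.00),
    Entry("Amazon Transcribe", 11.2, 16.7, 24.00),
    Entry("Distil-Whisper", 13.0, 317.6, 0.33),
]


def dominates(a: Entry, b: Entry) -> bool:
    """True if a is no worse than b on WER and price, and strictly better on one."""
    no_worse = (
        a.wer_percent <= b.wer_percent
        and a.price_usd_per_1000_min <= b.price_usd_per_1000_min
    )
    strictly_better = (
        a.wer_percent < b.wer_percent
        or a.price_usd_per_1000_min < b.price_usd_per_1000_min
    )
    return no_worse and strictly_better


def pareto_frontier(entries: List[Entry]) -> List[Entry]:
    """Keep entries that no other entry dominates on (WER, price)."""
    return [e for e in entries if not any(dominates(o, e) for o in entries)]


for entry in pareto_frontier(ENTRIES):
    print(entry.model, entry.wer_percent, entry.price_usd_per_1000_min)
```

Among these six rows, Chirp 2 and Amazon Transcribe drop out because Scribe has both a lower word error rate and a lower price. The same filter could use Speed Factor in place of price, with the comparison reversed since higher is better.
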