Whisper: API Provider Benchmarking & Analysis
Highlights
Summary Analysis
Artificial Analysis Word Error Rate (AA-WER) Index
Artificial Analysis Word Error Rate (AA-WER) Index: Measures transcription accuracy across 3 datasets to evaluate models in real-world speech with diverse accents, domain-specific language, and challenging channel & acoustic conditions.
AA-WER is calculated as an audio-duration-weighted average of WER across ~2 hours from three datasets: VoxPopuli, Earnings-22, and AMI-SDM. See methodology for more detail.
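The aggregation described above can be sketched in a few lines. This is an illustrative sketch only: the dataset names are the real ones, but the per-dataset WERs, durations, and the simple word-level edit distance are hypothetical stand-ins for the actual methodology.

```python
# Sketch of WER and a duration-weighted WER index, per the description above.
# WER = word-level edit distance / reference word count; the index weights
# each dataset's WER by its audio duration. Durations/WERs below are made up.

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # edit distances for empty ref prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / len(ref)

def duration_weighted_wer(results):
    """results: list of (wer, audio_duration_seconds) tuples, one per dataset."""
    total = sum(d for _, d in results)
    return sum(w * d for w, d in results) / total

# Hypothetical per-dataset WERs and durations (~2 hours total):
datasets = [(0.12, 40 * 60),  # VoxPopuli
            (0.20, 45 * 60),  # Earnings-22
            (0.25, 35 * 60)]  # AMI-SDM
index = duration_weighted_wer(datasets)
```

Weighting by duration means a long, hard dataset (e.g. far-field meeting audio) moves the index more than a short, clean one.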
Artificial Analysis Word Error Rate (AA-WER) Index by Individual Dataset
Artificial Analysis Word Error Rate Index vs. Price
Artificial Analysis Word Error Rate Index vs. Speed Factor
Speed Factor vs. Price
Speed
Speed Factor
Speed Factor, Variance

Speed Factor, Over Time
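The page does not define Speed Factor explicitly; assuming it is the ratio of audio duration to wall-clock transcription time (so a factor of 60 means an hour of audio is transcribed in one minute), a minimal sketch with hypothetical request timings:

```python
# Speed Factor sketch under the assumption stated above: audio length
# divided by measured processing time. Taking the median across requests
# makes the reported figure robust to occasional slow outliers.
from statistics import median

def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    return audio_seconds / processing_seconds

# Hypothetical requests: (audio length, processing time) in seconds
requests = [(600, 4.1), (600, 3.8), (600, 12.0)]  # one slow outlier
factors = [speed_factor(a, p) for a, p in requests]
median_factor = median(factors)  # unaffected by the 12 s outlier
```

This is why the charts report both a median and a variance: two providers with the same median speed can differ greatly in tail latency.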
Price
Note: Groq charges for a minimum of 10 seconds per request.
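To see what such a per-request minimum means in practice, here is a sketch of effective cost under a minimum billed duration. The 10-second minimum comes from the note above; the price figure is illustrative, and the function name is hypothetical.

```python
# Effective cost with a per-request minimum billed duration, as described
# in the Groq note above. Prices here are per 1,000 minutes of audio.

def billed_cost(audio_seconds: float, usd_per_1000_min: float,
                min_billed_seconds: float = 10.0) -> float:
    billed = max(audio_seconds, min_billed_seconds)  # round up to the minimum
    return (billed / 60.0) * (usd_per_1000_min / 1000.0)

# A 2-second clip is billed as 10 s, so workloads of many short clips pay
# 5x the nominal rate, while long files are unaffected:
short_cost = billed_cost(2.0, usd_per_1000_min=0.67)    # billed as 10 s
long_cost = billed_cost(600.0, usd_per_1000_min=0.67)   # billed as-is
```

For short-utterance workloads (voice commands, IVR turns), the effective price per audio minute can therefore be several times the headline rate.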
| Provider | Model | Whisper version | Word Error Rate (%) | Median Speed Factor | Price (USD per 1000 minutes) |
|---|---|---|---|---|---|
| | Whisper Large v2 | large-v2 | 15.8% | 29.4 | 6.00 |
| | Whisper Large v2 | large-v2 | 27.2% | 34.0 | 6.00 |
| | Whisper Large v3 | large-v3 | 16.8% | 312.5 | 0.50 |
| | Incredibly Fast Whisper | large-v3 | 18.2% | 61.0 | 1.49 |
| | Whisper Large v2 | large-v2 | 15.8% | 2.4 | 3.47 |
| | Whisper Large v3 | large-v3 | 24.6% | 3.1 | 4.23 |
| | WhisperX | large-v3 | 16.3% | 7.7 | 1.09 |
| | Whisper (M) | medium | | | 2.68 |
| | Whisper (S) | small | | | 1.37 |
| | Whisper Large v3 | large-v3 | 16.8% | 212.3 | 1.85 |
| | Distil-Whisper | | | | 0.33 |
| | Whisper Large v3 | large-v3 | 16.8% | 142.0 | 1.15 |
| | Whisper Large v3 | large-v3 | 16.8% | 98.0 | 0.45 |
| | Whisper Large v3 Turbo | v3 Turbo | | 256.6 | 0.67 |
| | Whisper Large v3 | large-v3 | | 416.6 | 1.00 |
| | Whisper Large v3 Turbo | v3 Turbo | 17.8% | 456.8 | 1.00 |
| | Whisper Large v3 | large-v3 | 16.8% | 169.3 | 1.67 |
| | Whisper Large v3 | large-v3 | 24.6% | 140.7 | 1.50 |
| Speechmatics | Standard | | 16.0% | 17.6 | 13.33 |
| Speechmatics | Enhanced | | | 17.6 | 6.70 |
| | Azure AI Speech Service | | 17.2% | 2.0 | 16.67 |
| | Nano | | 16.3% | 85.0 | 2.00 |
| AssemblyAI | Universal | | 14.5% | 85.6 | 6.17 |
| | Slam-1 | | 15.2% | 52.5 | 4.50 |
| | Nova-2 | | 17.3% | 162.9 | 4.30 |
| | Base | | 21.9% | 184.2 | 12.50 |
| | Nova-3 | | 18.3% | 179.1 | 4.30 |
| | Gladia v2 | whisper-v2-variant | 16.7% | 50.4 | 10.20 |
| | Amazon Transcribe | | 14.0% | 17.3 | 24.00 |
| | Fish Speech to Text | | | | 0.00 |
| | Rev AI | | 15.2% | | 20.00 |
| Google | Chirp 2 | | 11.6% | 18.1 | 16.00 |
| | Chirp | | 16.9% | 13.3 | 16.00 |
| Google | Chirp 3 | | 15.0% | 30.2 | 16.00 |
| ElevenLabs | Scribe | | | 40.8 | 6.67 |
| | Gemini 2.0 Flash | | 17.9% | 51.5 | 1.40 |
| | Gemini 2.0 Flash Lite | | 16.6% | 52.9 | 0.19 |
| | Gemini 2.5 Flash Lite | | 16.1% | 101.3 | 0.58 |
| | Gemini 2.5 Flash | | 19.2% | 79.0 | 1.92 |
| | Gemini 2.5 Pro | | 15.0% | 11.7 | 0.00 |
| | GPT-4o Transcribe | | 21.3% | 26.8 | 6.00 |
| | GPT-4o Mini Transcribe | | 20.1% | 38.6 | 3.00 |
| IBM | Granite Speech 3.3 8B | | 15.7% | | 0.00 |
| | Parakeet RNNT 1.1B | | | 6.6 | 1.91 |
| NVIDIA | Parakeet TDT 0.6B V2 | | | 60.1 | 0.00 |
| NVIDIA | Canary Qwen 2.5B | | 13.2% | 145.3 | 0.00 |
| | Voxtral Mini | | 15.8% | 65.2 | 1.00 |
| | Voxtral Small | | 14.7% | 67.8 | 4.00 |
| | Voxtral Small | | 14.7% | 23.5 | 3.00 |
| | Voxtral Mini | | 15.8% | 63.9 | 1.00 |
| | Qwen3 ASR Flash | | 15.0% | | 1.92 |
Speech to Text providers compared: OpenAI, Speechmatics, Microsoft Azure, AssemblyAI, fal.ai, Replicate, Deepgram, Gladia, Groq, Deepinfra, Fireworks, Amazon Bedrock, Fish Audio, Rev AI, Google, ElevenLabs, SambaNova, IBM, Together.ai, Mistral, NVIDIA, and Alibaba Cloud.