NVIDIA: API Provider Benchmarking & Analysis
Highlights
Summary Analysis
Artificial Analysis Word Error Rate (AA-WER) Index
Artificial Analysis Word Error Rate (AA-WER) Index: Measures transcription accuracy across 3 datasets to evaluate models in real-world speech with diverse accents, domain-specific language, and challenging channel & acoustic conditions.
AA-WER is calculated as an audio-duration-weighted average of WER across ~2 hours from three datasets: VoxPopuli, Earnings-22, and AMI-SDM. See methodology for more detail.
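The aggregation described above can be sketched in a few lines. This is a minimal illustration of a duration-weighted WER average over the three named datasets; the hours and WER values below are placeholders, not the benchmark's actual figures.

```python
# Duration-weighted average of per-dataset word error rates (AA-WER sketch).
# Dataset names come from the text; durations and WERs are placeholders.

def aa_wer_index(results):
    """results: list of (dataset_name, audio_hours, wer_percent)."""
    total_hours = sum(hours for _, hours, _ in results)
    weighted = sum(hours * wer for _, hours, wer in results)
    return weighted / total_hours

example = [
    ("VoxPopuli",   0.7, 12.0),  # placeholder values
    ("Earnings-22", 0.7, 18.0),
    ("AMI-SDM",     0.6, 20.0),
]
print(round(aa_wer_index(example), 2))  # -> 16.5
```

Weighting by audio duration means a dataset contributes in proportion to how much of the ~2 hours of benchmark audio it supplies, rather than each dataset counting equally.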
Artificial Analysis Word Error Rate (AA-WER) Index by Individual Dataset
Artificial Analysis Word Error Rate Index vs. Price
Artificial Analysis Word Error Rate Index vs. Speed Factor
Speed Factor vs. Price
Speed
Speed Factor
Speed Factor, Variance

Speed Factor, Over Time
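The charts in this section report a "Speed Factor". Assuming it is defined as seconds of input audio transcribed per second of wall-clock processing time (an interpretation of the chart labels, not stated in this page — see the methodology for the exact definition), it can be computed as:

```python
# Hedged sketch: speed factor as audio duration divided by processing
# time. A factor of 300 means 300 s of audio is transcribed per second.

def speed_factor(audio_seconds, processing_seconds):
    return audio_seconds / processing_seconds

print(speed_factor(600, 2))  # a 600 s clip transcribed in 2 s -> 300.0
```

Under this reading, a factor below 1.0 means transcription is slower than real time, which matters for live-captioning use cases.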
Price
Note: Groq charges for a minimum of 10 seconds per request.
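The minimum-billing note above affects short clips: anything under 10 seconds is billed as if it were 10 seconds long. A small sketch of the per-request cost under that rule (the cost formula is an illustration of the note, not a provider-published calculation; prices use the table's USD-per-1,000-minutes unit):

```python
# Effective per-request cost with a minimum billed duration.
# Short clips are rounded up to the minimum before pricing.

MIN_BILLED_SECONDS = 10.0  # from the note above

def request_cost(audio_seconds, usd_per_1000_min, min_seconds=MIN_BILLED_SECONDS):
    billed_seconds = max(audio_seconds, min_seconds)
    return usd_per_1000_min * (billed_seconds / 60.0) / 1000.0

# A 3 s clip is billed as 10 s, so it costs the same as a 10 s clip:
print(request_cost(3, 1.00) == request_cost(10, 1.00))  # True
print(request_cost(120, 1.00))  # 2 min at $1.00/1,000 min -> 0.002
```

For workloads dominated by very short utterances, this minimum can raise the effective per-minute price well above the listed rate.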
| Provider | Model | Whisper version | Word Error Rate (%) | Median Speed Factor | Price (USD per 1000 minutes) |
|---|---|---|---|---|---|
| | Whisper Large v2 | large-v2 | 15.8% | 30.0 | 6.00 |
| | Whisper Large v2 | large-v2 | 27.2% | 33.9 | 6.00 |
| | Whisper Large v3 | large-v3 | 16.8% | 312.1 | 0.50 |
| | Incredibly Fast Whisper | large-v3 | 18.2% | 63.2 | 1.49 |
| | Whisper Large v2 | large-v2 | 15.8% | 2.5 | 3.47 |
| | Whisper Large v3 | large-v3 | 24.6% | 2.5 | 4.23 |
| | WhisperX | large-v3 | 16.3% | 7.1 | 1.09 |
| | Whisper (M) | medium | | | 2.68 |
| | Whisper (S) | small | | | 1.37 |
| | Whisper Large v3 | large-v3 | 16.8% | 238.3 | 1.85 |
| | Distil-Whisper | | | | 0.33 |
| | Whisper Large v3 | large-v3 | 16.8% | 95.3 | 0.45 |
| | Whisper Large v3 | large-v3 | 16.8% | 141.9 | 1.15 |
| | Whisper Large v3 Turbo | v3 Turbo | | 266.0 | 0.67 |
| | Whisper Large v3 | large-v3 | | 332.9 | 1.00 |
| | Whisper Large v3 Turbo | v3 Turbo | 17.8% | 463.2 | 1.00 |
| | Whisper Large v3 | large-v3 | 16.8% | 155.5 | 1.67 |
| | Whisper Large v3 | large-v3 | 24.6% | 115.4 | 1.50 |
| Speechmatics | Standard | | 16.0% | 17.6 | 13.33 |
| Speechmatics | Enhanced | | | 17.8 | 6.70 |
| Microsoft Azure | Azure AI Speech Service | | 17.2% | 2.0 | 16.67 |
| AssemblyAI | Nano | | 16.3% | 84.9 | 2.00 |
| AssemblyAI | Universal | | 14.5% | 86.2 | 6.17 |
| AssemblyAI | Slam-1 | | 15.2% | 71.8 | 4.50 |
| Deepgram | Nova-2 | | 17.3% | 492.8 | 4.30 |
| Deepgram | Base | | 21.9% | 554.6 | 12.50 |
| Deepgram | Nova-3 | | 18.3% | 462.0 | 4.30 |
| Gladia | Gladia v2 | whisper-v2-variant | 16.7% | 51.8 | 10.20 |
| Amazon | Amazon Transcribe | | 14.0% | 20.2 | 24.00 |
| Fish Audio | Fish Speech to Text | | | | 0.00 |
| Rev AI | Rev AI | | 15.2% | | 20.00 |
| Google | Chirp | | 16.9% | 13.5 | 16.00 |
| Google | Chirp 2 | | 11.6% | 18.0 | 16.00 |
| Google | Chirp 3 | | 15.0% | 28.7 | 16.00 |
| ElevenLabs | Scribe | | | 46.1 | 6.67 |
| Google | Gemini 2.0 Flash | | 17.9% | 54.2 | 1.40 |
| Google | Gemini 2.0 Flash Lite | | 16.6% | 49.0 | 0.19 |
| Google | Gemini 2.5 Flash Lite | | 16.1% | 101.3 | 0.58 |
| Google | Gemini 2.5 Flash | | 19.2% | 69.6 | 1.92 |
| Google | Gemini 2.5 Pro | | 15.0% | 12.7 | 0.00 |
| OpenAI | GPT-4o Transcribe | | 21.3% | 26.7 | 6.00 |
| OpenAI | GPT-4o Mini Transcribe | | 20.1% | 36.1 | 3.00 |
| IBM | Granite Speech 3.3 8B | | 15.7% | | 0.00 |
| | Parakeet RNNT 1.1B | | | 6.7 | 1.91 |
| NVIDIA | Parakeet TDT 0.6B V2 | | | 62.3 | 0.00 |
| NVIDIA | Canary Qwen 2.5B | | 13.2% | 5.1 | 0.00 |
| | Voxtral Mini | | 15.8% | 60.6 | 1.00 |
| | Voxtral Small | | 14.7% | 68.3 | 4.00 |
| | Voxtral Small | | 14.7% | 23.1 | 3.00 |
| | Voxtral Mini | | 15.8% | 61.3 | 1.00 |
| Alibaba Cloud | Qwen3 ASR Flash | | 15.0% | | 1.92 |
| Alibaba Cloud | Qwen3 Omni | | 52.3% | | 0.00 |
Speech to Text providers compared: OpenAI, Speechmatics, Microsoft Azure, AssemblyAI, fal.ai, Replicate, Deepgram, Gladia, Groq, Deepinfra, Fireworks, Amazon Bedrock, Fish Audio, Rev AI, Google, ElevenLabs, SambaNova, IBM, Together.ai, Mistral, NVIDIA, and Alibaba Cloud.