
Speechmatics: API Provider Benchmarking & Analysis
Highlights
Artificial Analysis Word Error Rate (AA-WER) Index by API
Measures transcription accuracy across three datasets to evaluate models on real-world speech with diverse accents, domain-specific language, and challenging channel and acoustic conditions.
AA-WER is calculated as an audio-duration-weighted average of WER across ~2 hours from three datasets: VoxPopuli, Earnings-22, and AMI-SDM. See methodology for more detail.
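The metric described above can be sketched in a few lines: word error rate (WER) is the word-level edit distance between a reference and hypothesis transcript divided by the number of reference words, and AA-WER weights each dataset's WER by its audio duration. This is a minimal illustration of the calculation, not the Artificial Analysis evaluation code; the dataset names match the text, but the minute counts and WER values below are illustrative placeholders.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def aa_wer(results):
    """results: (dataset, audio_minutes, wer) tuples -> duration-weighted average WER."""
    total_minutes = sum(m for _, m, _ in results)
    return sum(m * w for _, m, w in results) / total_minutes

# Illustrative per-dataset results (NOT measured values):
example = [("VoxPopuli", 40, 0.12), ("Earnings-22", 40, 0.18), ("AMI-SDM", 40, 0.21)]
index = aa_wer(example)  # with equal durations this is the plain mean, 0.17
```

With unequal dataset durations the weighting matters: a dataset contributing more audio pulls the index toward its own WER.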
Artificial Analysis Word Error Rate (AA-WER) Index vs Other Metrics
Artificial Analysis Word Error Rate Index vs. Price
Cost in USD per 1000 minutes of audio transcribed. Reflects the pricing model of the transcription service or software.
Speed Factor
Audio file seconds transcribed per second of processing time. Higher factor indicates faster transcription speed.
Artificial Analysis measurements are based on an audio duration of 10 minutes. Speed Factor may vary for other durations, particularly very short ones (under 1 minute).
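The Speed Factor definition above translates directly into code; the values in this sketch are illustrative, not measurements from the table:

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Audio seconds transcribed per second of processing time.
    Higher is faster; a factor of 1.0 means real-time transcription."""
    return audio_seconds / processing_seconds

# A 10-minute (600 s) file, the duration used in these measurements,
# transcribed in 2 s of processing time yields a factor of 300:
sf = speed_factor(600, 2.0)  # 300.0
```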
Price
Price of Transcription
For providers that price on processing time rather than audio duration (including Replicate and fal), we have calculated an indicative per-minute price based on the processing time expected per minute of audio. Further detail is available on the methodology page.
Note: Groq charges for a minimum of 10 seconds per request.
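The conversion from processing-time pricing to an indicative per-1,000-minutes price can be sketched as follows. The function name and the rate used in the example are illustrative assumptions, not any provider's actual pricing:

```python
def indicative_price_per_1000_min(price_per_processing_second: float,
                                  speed_factor: float) -> float:
    """Estimate USD per 1,000 minutes of audio for a provider that bills
    per second of processing time: each minute of audio requires
    60 / speed_factor seconds of billed processing."""
    processing_seconds_per_audio_minute = 60.0 / speed_factor
    return 1000 * processing_seconds_per_audio_minute * price_per_processing_second

# Hypothetical example: $0.001 per processing second at a speed factor of 100
# -> 0.6 s of processing per audio minute -> $0.60 per 1,000 minutes.
price = indicative_price_per_1000_min(0.001, 100.0)  # 0.6
```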
| Model | Whisper Version | Word Error Rate (%) | Median Speed Factor | Price (USD per 1000 minutes) |
|---|---|---|---|---|
| Whisper Large v2 | large-v2 | 15.8 | 30.3 | 6.00 |
| Whisper Large v2 | large-v2 | 27.2 | 34.5 | 6.00 |
| Whisper Large v3 | large-v3 | 16.8 | 279.2 | 0.50 |
| Incredibly Fast Whisper | large-v3 | 18.2 | 66.1 | 1.49 |
| Whisper Large v2 | large-v2 | 15.8 | 2.4 | 3.47 |
| Whisper Large v3 | large-v3 | 24.6 | 3.3 | 4.23 |
| WhisperX | large-v3 | 16.3 | 7.6 | 1.09 |
| Whisper Large v3 | large-v3 | 16.8 | 268.2 | 1.85 |
| Distil-Whisper | | | | 0.33 |
| Whisper Large v3 | large-v3 | 16.8 | 132.7 | 1.15 |
| Whisper Large v3 | large-v3 | 16.8 | 88.0 | 0.45 |
| Whisper Large v3 Turbo | v3 Turbo | | 324.0 | 0.67 |
| Whisper Large v3 | large-v3 | | 356.3 | 1.00 |
| Whisper Large v3 Turbo | v3 Turbo | 17.8 | 413.9 | 1.00 |
| Whisper-Large-v3 | large-v3 | 16.8 | 88.7 | 1.67 |
| Whisper Large v3 | large-v3 | 24.6 | 121.5 | 1.50 |
| Whisper v1 | 1 | | | 0.00 |
| Speechmatics Standard | | 16.0 | 44.6 | 4.00 |
| Speechmatics Enhanced | | 14.4 | 24.4 | 6.70 |
| Azure AI Speech Service | | 17.2 | 2.0 | 16.67 |
| Nova-2 | | 17.3 | 371.9 | 4.30 |
| Base | | 21.9 | 315.9 | 12.50 |
| Nova-3 | | 18.3 | 431.3 | 4.30 |
| Universal, AssemblyAI | | 14.5 | 83.9 | 2.50 |
| Slam-1 | | 15.1 | 57.9 | 4.50 |
| Amazon Transcribe | | 14.0 | 18.4 | 24.00 |
| Fish Speech to Text | | | | 0.00 |
| Rev AI | | 15.2 | | 20.00 |
| Chirp 2, Google | | 11.6 | 17.3 | 16.00 |
| Chirp | | 16.9 | 14.1 | 16.00 |
| Chirp 3, Google | | 15.0 | 32.2 | 16.00 |
| Scribe, ElevenLabs | | | 47.1 | 6.67 |
| Gemini 2.0 Flash | | 17.9 | 51.9 | 1.40 |
| Gemini 2.0 Flash Lite | | 16.6 | 60.0 | 0.19 |
| Gemini 2.5 Flash Lite | | 16.1 | 69.0 | 0.58 |
| Gemini 2.5 Flash | | 19.2 | 89.2 | 1.92 |
| Gemini 2.5 Pro | | 15.0 | 13.8 | 0.00 |
| GPT-4o Transcribe | | 21.3 | 27.4 | 6.00 |
| GPT-4o Mini Transcribe | | 20.1 | 32.1 | 3.00 |
| Granite Speech 3.3 8B, IBM | | 15.7 | | 0.00 |
| Parakeet RNNT 1.1B | | | 6.5 | 1.91 |
| Parakeet TDT 0.6B V2, NVIDIA | | | 60.0 | 0.00 |
| Canary Qwen 2.5B, NVIDIA | | 13.2 | 5.7 | 0.74 |
| Parakeet TDT 0.6B V3, Hathora | | 13.0 | 37.6 | 1.32 |
| Voxtral Mini | | 15.8 | 58.2 | 1.00 |
| Voxtral Small | | 14.7 | 69.6 | 4.00 |
| Voxtral Small | | 14.7 | 14.9 | 3.00 |
| Voxtral Mini | | 15.8 | 55.7 | 1.00 |
| Qwen3 ASR Flash | | 15.0 | | 1.92 |
| Qwen3 Omni | | 52.3 | | 0.00 |
| Qwen3 Omni Captioner | | | | 5.72 |
| Solaria-1, Gladia | | 17.4 | 49.3 | 8.33 |
| Nova 2 Omni | | 15.9 | 39.2 | 1.85 |
| Nova 2 Pro | | 15.8 | 23.2 | 3.10 |
Speech to Text providers compared: OpenAI, Speechmatics, Microsoft Azure, fal.ai, Replicate, Deepgram, Groq, DeepInfra, Fireworks, AssemblyAI, Amazon Bedrock, Fish Audio, Rev AI, Google, ElevenLabs, SambaNova, IBM, Together.ai, Mistral, NVIDIA, Alibaba Cloud, Gladia, and Hathora.