Google Chirp: API Provider Benchmarking & Analysis
Highlights
Artificial Analysis Word Error Rate (AA-WER) Index by API
Measures transcription accuracy across 3 datasets to evaluate models on real-world speech with diverse accents, domain-specific language, and challenging channel & acoustic conditions.
AA-WER is calculated as an audio-duration-weighted average of WER across ~2 hours from three datasets: VoxPopuli, Earnings-22, and AMI-SDM. See methodology for more detail.
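Since the index is simply a duration-weighted mean, it can be reproduced from per-dataset results; a minimal sketch follows (the function name and the per-dataset WER and duration figures are illustrative placeholders, not the published pipeline or measured values):

```python
# Minimal sketch of a duration-weighted WER aggregation in the spirit of the
# description above; all numbers below are placeholders.

def duration_weighted_wer(results):
    """results: iterable of (wer, audio_duration_seconds) pairs, one per dataset."""
    total_duration = sum(duration for _, duration in results)
    return sum(wer * duration for wer, duration in results) / total_duration

results = [
    (0.10, 40 * 60),  # VoxPopuli (placeholder values)
    (0.14, 45 * 60),  # Earnings-22
    (0.25, 35 * 60),  # AMI-SDM
]
print(f"AA-WER: {duration_weighted_wer(results):.1%}")
```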
Artificial Analysis Word Error Rate (AA-WER) Index by Individual Dataset
Artificial Analysis Word Error Rate (AA-WER) Index vs Other Metrics
Artificial Analysis Word Error Rate Index vs. Price
Cost in USD per 1000 minutes of audio transcribed. Reflects the pricing model of the transcription service or software.
Artificial Analysis Word Error Rate Index vs. Speed Factor
Audio file seconds transcribed per second of processing time. Higher factor indicates faster transcription speed.
Artificial Analysis measurements are based on an audio duration of 10 minutes. Speed Factor may vary for other durations, particularly for very short durations (under 1 minute).
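The ratio itself is straightforward to compute; a minimal sketch, assuming a generic transcribe_request() callable stands in for whichever provider-specific API call is being timed:

```python
import time

def measure_speed_factor(transcribe_request, audio_seconds):
    """Speed Factor = seconds of audio transcribed per second of processing time.
    `transcribe_request` is a placeholder for the provider call being measured."""
    start = time.perf_counter()
    transcribe_request()
    processing_seconds = time.perf_counter() - start
    return audio_seconds / processing_seconds

# Example: a 10-minute (600 s) file processed in 12 s gives a Speed Factor of 50.
```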
Speed Factor
Speed Factor Variance

Speed Factor, Over Time
Median measurement per day, based on 8 measurements each day at different times. Labels represent the start of each week's measurements.
Speed Factor vs. Price
Price
Price of Transcription
For providers that price based on processing time rather than audio duration (incl. Replicate, fal), we have calculated an indicative per-minute price based on the processing time expected per minute of audio. Further detail is available on the methodology page.
Note: Groq charges for a minimum of 10 seconds per request.
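As a rough illustration of that conversion, here is a sketch under the assumption that processing time per minute of audio is approximately 60 divided by the Speed Factor (the function and parameter names are hypothetical, not provider API fields):

```python
def indicative_price_per_1000_min(price_per_processing_second, speed_factor):
    """Indicative price per 1,000 minutes of audio for providers billed on
    processing time, assuming processing seconds per audio minute ~= 60 / speed_factor."""
    processing_seconds_per_audio_minute = 60 / speed_factor
    return processing_seconds_per_audio_minute * price_per_processing_second * 1000

# e.g. $0.001 per processing second at a Speed Factor of 60
# -> 1 processing second per audio minute -> $1.00 per 1,000 minutes
print(indicative_price_per_1000_min(0.001, 60))  # 1.0
```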
| Model | Whisper version | Word Error Rate (%) | Median Speed Factor | Price (USD per 1000 minutes) |
|---|---|---|---|---|
| Whisper Large v2 | large-v2 | 15.8% | 27.7 | 6.00 |
| Whisper Large v2 | large-v2 | 27.2% | 33.2 | 6.00 |
| Whisper Large v3 | large-v3 | 16.8% | 305.4 | 0.50 |
| Incredibly Fast Whisper | large-v3 | 18.2% | 63.3 | 1.49 |
| Whisper Large v2 | large-v2 | 15.8% | 2.3 | 3.47 |
| Whisper Large v3 | large-v3 | 24.6% | 2.7 | 4.23 |
| WhisperX | large-v3 | 16.3% | 12.6 | 1.09 |
| Whisper (M) | medium | | | 2.68 |
| Whisper (S) | small | | | 1.37 |
| Whisper Large v3 | large-v3 | 16.8% | 192.4 | 1.85 |
| Distil-Whisper | | | | 0.33 |
| Whisper Large v3 | large-v3 | 16.8% | 100.1 | 0.45 |
| Whisper Large v3 | large-v3 | 16.8% | 132.8 | 1.15 |
| Whisper Large v3 Turbo | v3 Turbo | | 274.3 | 0.67 |
| Whisper Large v3 | large-v3 | | 445.6 | 1.00 |
| Whisper Large v3 Turbo | v3 Turbo | 17.8% | 447.2 | 1.00 |
| Whisper-Large-v3 | large-v3 | 16.8% | 129.0 | 1.67 |
| Whisper Large v3 | large-v3 | 24.6% | 123.8 | 1.50 |
| Speechmatics Standard | | 16.0% | 43.4 | 4.00 |
| Speechmatics Enhanced | | 14.4% | 24.2 | 6.70 |
| Azure AI Speech Service | | 17.2% | 2.0 | 16.67 |
| Nano | | 16.3% | 85.9 | 2.00 |
| Universal 2, AssemblyAI | | 14.5% | 85.8 | 2.50 |
| Slam-1 | | 15.2% | 59.1 | 4.50 |
| Nova-2 | | 17.3% | 593.4 | 4.30 |
| Base | | 21.9% | 484.7 | 12.50 |
| Nova-3 | | 18.3% | 425.2 | 4.30 |
| Gladia v2 | whisper-v2-variant | 16.7% | 49.9 | 10.20 |
| Amazon Transcribe | | 14.0% | 19.3 | 24.00 |
| Fish Speech to Text | | | 25.4 | 0.00 |
| Rev AI | | 15.2% | | 20.00 |
| Chirp | | 16.9% | 13.6 | 16.00 |
| Chirp 2, Google | | 11.6% | 17.4 | 16.00 |
| Chirp 3, Google | | 15.0% | 31.4 | 16.00 |
| Scribe, ElevenLabs | | | 43.3 | 6.67 |
| Gemini 2.0 Flash | | 17.9% | 55.6 | 1.40 |
| Gemini 2.0 Flash Lite | | 16.6% | 62.0 | 0.19 |
| Gemini 2.5 Flash Lite | | 16.1% | 77.4 | 0.58 |
| Gemini 2.5 Flash | | 19.2% | 62.9 | 1.92 |
| Gemini 2.5 Pro | | 15.0% | 10.2 | 0.00 |
| GPT-4o Transcribe | | 21.3% | 28.8 | 6.00 |
| GPT-4o Mini Transcribe | | 20.1% | 30.3 | 3.00 |
| Granite Speech 3.3 8B, IBM | | 15.7% | | 0.00 |
| Parakeet RNNT 1.1B | | | 6.4 | 1.91 |
| Parakeet TDT 0.6B V2, NVIDIA | | | 62.4 | 0.00 |
| Canary Qwen 2.5B, NVIDIA | | 13.2% | 4.6 | 0.00 |
| Voxtral Mini | | 15.8% | 62.3 | 1.00 |
| Voxtral Small | | 14.7% | 68.8 | 4.00 |
| Voxtral Small | | 14.7% | 23.4 | 3.00 |
| Voxtral Mini | | 15.8% | 66.8 | 1.00 |
| Qwen3 ASR Flash | | 15.0% | | 1.92 |
| Qwen3 Omni | | 52.3% | | 0.00 |
| Qwen3 Omni Captioner | | | | 5.72 |
Speech-to-Text providers compared: OpenAI, Speechmatics, Microsoft Azure, AssemblyAI, fal.ai, Replicate, Deepgram, Gladia, Groq, Deepinfra, Fireworks, Amazon Bedrock, Fish Audio, Rev AI, Google, ElevenLabs, SambaNova, IBM, Together.ai, Mistral, NVIDIA, and Alibaba Cloud.