Speech to Text AI Model & Provider Leaderboard
Compare word error rate, speed, and pricing across Speech to Text models and providers. Our comprehensive analysis helps you choose the best Speech to Text model for your specific use case and requirements.
For further details, see our methodology page.
Artificial Analysis Word Error Rate (AA-WER) Index
Measures transcription accuracy across 3 datasets to evaluate models in real-world speech with diverse accents, domain-specific language, and challenging channel & acoustic conditions.
AA-WER is calculated as an audio-duration-weighted average of WER across ~8 hours from three datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), and Earnings22-Cleaned-AA (25%). See methodology for more detail.
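The aggregation described above can be sketched in a few lines. The dataset names and 50/25/25 weights come from this page; the WER function is the standard word-level edit-distance definition, and the per-dataset WER values below are made-up examples, not measured results.

```python
# Sketch of the AA-WER aggregation: word error rate per dataset, then an
# audio-duration-weighted average. Per-dataset WER values are illustrative.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference
    word count, via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# Duration weights per this page; example (not real) per-dataset WER scores.
weights = {"AA-AgentTalk": 0.50,
           "VoxPopuli-Cleaned-AA": 0.25,
           "Earnings22-Cleaned-AA": 0.25}
per_dataset_wer = {"AA-AgentTalk": 0.031,
                   "VoxPopuli-Cleaned-AA": 0.045,
                   "Earnings22-Cleaned-AA": 0.052}
aa_wer = sum(weights[d] * per_dataset_wer[d] for d in weights)
print(f"AA-WER: {aa_wer:.1%}")
```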
AA-WER by Dataset
AA-WER: AA-AgentTalk Dataset
Cleaned Dataset Comparison
VoxPopuli: Cleaned vs Original Subset of Publicly Available Data
API Benchmarks
Artificial Analysis Word Error Rate Index vs. Price
Cost in USD per 1000 minutes of audio transcribed. Reflects the pricing model of the transcription service or software.
Speed Factor
Audio file seconds transcribed per second of processing time. Higher factor indicates faster transcription speed.
Artificial Analysis measurements are based on an audio duration of 10 minutes. Speed Factor may vary for other durations, particularly for very short durations (under 1 minute).
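The definition above reduces to a simple ratio. The figures in this sketch are illustrative, not measured values from the leaderboard.

```python
# Speed factor: seconds of audio transcribed per second of processing time.

def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    return audio_seconds / processing_seconds

# Illustrative: a 10-minute (600 s) file transcribed in 1.55 s of processing.
print(f"{speed_factor(600, 1.55):.1f}x real-time")  # prints "387.1x real-time"
```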
Price of Transcription
Cost in USD per 1000 minutes of audio transcribed. Reflects the pricing model of the transcription service or software.
For providers that price based on processing time rather than audio duration (incl. Replicate and fal), we have calculated an indicative per-minute price from the processing time expected per minute of audio. Further detail is available on the methodology page.
Note: Groq charges for a minimum of 10 seconds per request.
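The conversion from processing-time pricing to an indicative audio-duration price can be sketched as below. The per-second rate and speed factor used here are hypothetical placeholders, not actual Replicate or fal pricing.

```python
# Indicative per-1000-minute price for providers billed on processing time.

def price_per_1000_minutes(rate_usd_per_processing_sec: float,
                           speed_factor: float) -> float:
    # One minute of audio takes (60 / speed_factor) seconds of processing.
    processing_sec_per_audio_min = 60.0 / speed_factor
    return rate_usd_per_processing_sec * processing_sec_per_audio_min * 1000

# Hypothetical: $0.00055 per processing second at a 50x real-time speed factor.
print(f"${price_per_1000_minutes(0.00055, 50):.2f} per 1000 minutes")
```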
| Model | Whisper Version | Word Error Rate (%) | Median Speed Factor | Price (USD per 1000 minutes) |
|---|---|---|---|---|
| Whisper Large v2 | large-v2 | 4.2% | 29.7 | 6.00 |
| Wizper Large v3 | large-v3 | 4.9% | 224.1 | 0.50 |
| Incredibly Fast Whisper | large-v3 | 5.8% | 56.2 | 1.49 |
| Whisper Large v3 | large-v3 | 10.2% | 2.8 | 4.23 |
| Whisper Large v3 | large-v3 | 4.3% | 50.3 | 1.15 |
| Whisper Large v3 Turbo | v3 Turbo | 4.8% | 387.5 | 0.67 |
| Whisper Large v3 | large-v3 | 4.8% | 184.3 | 1.00 |
| Whisper Large v3 Turbo | v3 Turbo | 4.8% | 387.8 | 1.00 |
| Whisper Large v3 | large-v3 | 7.4% | 122.1 | 1.50 |
| Speechmatics Standard | | 5.3% | 45.0 | 4.00 |
| Speechmatics Enhanced | | 4.3% | 24.8 | 6.70 |
| Nova-2 | | 5.6% | 425.8 | 4.30 |
| Base | | 10.9% | 502.5 | 12.50 |
| Nova-3 | | 6.5% | 241.6 | 4.30 |
| Universal, AssemblyAI | | 4.0% | 118.1 | 2.50 |
| Slam-1 | | 4.1% | 79.1 | 4.50 |
| Universal-3 Pro | | 3.3% | 71.1 | 3.50 |
| Amazon Transcribe | | 4.3% | 18.5 | 24.00 |
| Chirp | | 31.3% | 14.0 | 16.00 |
| Chirp 2, Google | | 6.0% | 19.4 | 16.00 |
| Chirp 3, Google | | 4.6% | 23.0 | 16.00 |
| Scribe v1 | | 3.2% | 34.7 | 6.67 |
| Scribe v2 | | 2.3% | 34.0 | 6.67 |
| Gemini 2.0 Flash | | 4.0% | 50.2 | 1.40 |
| Gemini 2.0 Flash Lite | | 4.0% | 49.6 | 0.19 |
| Gemini 2.5 Flash Lite | | 5.3% | 72.1 | 0.58 |
| Gemini 2.5 Flash | | 5.3% | 55.9 | 1.92 |
| Gemini 2.5 Pro | | 3.1% | 12.3 | 4.80 |
| Gemini 3 Pro | | 2.9% | 5.4 | 7.68 |
| Gemini 3 Flash | | 3.1% | 12.3 | 1.92 |
| GPT-4o Transcribe | | 4.1% | 33.0 | 6.00 |
| GPT-4o Mini Transcribe | | 4.6% | 49.8 | 3.00 |
| Parakeet RNNT 1.1B | | 5.0% | 6.2 | 1.91 |
| Parakeet TDT 0.6B V2, NVIDIA | | 6.8% | 58.8 | 0.00 |
| Canary Qwen 2.5B, NVIDIA | | 4.4% | 5.8 | 0.74 |
| Voxtral Mini | | 3.7% | 68.5 | 1.00 |
| Voxtral Small | | 3.0% | 67.3 | 4.00 |
| Voxtral Mini | | 4.0% | 82.4 | 1.00 |
| Solaria-1, Gladia | | 4.2% | 51.3 | 8.33 |
| Nova 2 Omni | | 5.9% | 35.1 | 1.85 |
| Nova 2 Pro | | 5.0% | 23.3 | 3.10 |
Frequently Asked Questions
Common questions about Speech to Text models and providers
Scribe v2, ElevenLabs leads with the lowest AA-WER (Artificial Analysis Word Error Rate) of 2.3% across 43 models evaluated.
The top speech to text models by accuracy (AA-WER) are: 1. Scribe v2, ElevenLabs (2.3%), 2. Gemini 3 Pro, Google (2.9%), 3. Voxtral Small, Mistral (3.0%), 4. Gemini 2.5 Pro, Google (3.1%), 5. Gemini 3 Flash, Google (3.1%). Lower AA-WER indicates better transcription accuracy.
Base is the fastest with a speed factor of 502.5x real-time, followed by Nova-2 (425.8x) and Whisper Large v3 Turbo on Fireworks (387.8x). Higher speed factors mean faster transcription.
Gemini 2.0 Flash Lite is the most affordable at $0.19 per 1,000 minutes, followed by Wizper Large v3 on fal.ai ($0.50) and Gemini 2.5 Flash Lite ($0.58).
Voxtral Small, Mistral is the most accurate open weights model with an AA-WER of 3.0%. There are 12 open weights models out of 43 total evaluated.
The top open weights speech to text models by accuracy are: 1. Voxtral Small, Mistral (AA-WER 3.0%), 2. Voxtral Mini Transcribe 2, Mistral (AA-WER 3.6%), 3. Voxtral Mini, Mistral (AA-WER 3.7%).
The best model depends on your priorities. Use the scatter plots to visualize trade-offs between accuracy (AA-WER), speed, and price. For applications requiring high accuracy, prioritize models with lower AA-WER scores. For real-time applications, focus on speed factor. For cost-sensitive workloads, compare the price charts.
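One way to operationalize the trade-off described above is to normalize each metric across a candidate set and rank models by a weighted score. The three entries below use figures from the table on this page; the weights are a user choice, not a recommendation from the leaderboard.

```python
# Rank speech-to-text candidates by a weighted score over accuracy, speed,
# and price. Lower WER and price are better; higher speed factor is better.

models = [
    # (name, AA-WER %, speed factor, USD per 1000 minutes) -- from the table
    ("Scribe v2", 2.3, 34.0, 6.67),
    ("Gemini 2.0 Flash Lite", 4.0, 49.6, 0.19),
    ("Whisper Large v3 Turbo (Fireworks)", 4.8, 387.8, 1.00),
]

def score(wer, speed, price, w_acc=0.5, w_speed=0.25, w_price=0.25):
    # Normalize each metric to [0, 1] across the candidate set, then combine.
    wers = [m[1] for m in models]
    speeds = [m[2] for m in models]
    prices = [m[3] for m in models]
    norm = lambda v, lo, hi: (v - lo) / (hi - lo) if hi > lo else 0.0
    return (w_acc * (1 - norm(wer, min(wers), max(wers)))        # lower is better
            + w_speed * norm(speed, min(speeds), max(speeds))    # higher is better
            + w_price * (1 - norm(price, min(prices), max(prices))))  # lower is better

for name, wer, speed, price in sorted(models, key=lambda m: -score(*m[1:])):
    print(f"{name}: {score(wer, speed, price):.3f}")
```

With these example weights (50% accuracy, 25% speed, 25% price), the accuracy-leading model still ranks first; shifting weight toward price or speed reorders the list.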
Speech to Text models & providers compared: Whisper Large v2, Standard, Enhanced, Wizper (L, v3), fal.ai, Incredibly Fast Whisper, Replicate, Nova-2, Base, Whisper (L, v3), Replicate, Whisper (L, v3), fal.ai, Whisper (L, v3, Turbo), Groq, Whisper (L, v3), Fireworks, Whisper (L, v3, Turbo), Fireworks, Universal, Amazon Transcribe, Nova-3, Chirp, Chirp 2, Scribe v1, Gemini 2.0 Flash, Gemini 2.0 Flash Lite, GPT-4o Transcribe, GPT-4o Mini Transcribe, Parakeet RNNT 1.1B, Replicate, Whisper Large v3, together.ai, Voxtral Mini, Voxtral Small, Voxtral Mini, Deepinfra, Parakeet TDT 0.6B V2, Canary Qwen 2.5B, Replicate, Slam-1, Gemini 2.5 Flash Lite, Gemini 2.5 Flash, Gemini 2.5 Pro, Chirp 3, Solaria-1, Scribe v2, Nova 2 Omni, Nova 2 Pro, Gemini 3 Pro, Gemini 3 Flash, and Universal-3 Pro.