Speech to Text AI Model & Provider Leaderboard
Compare word error rate, speed, and pricing across Speech to Text models and providers.
For further details, see our methodology page.
Highlights
Artificial Analysis Word Error Rate (AA-WER) Index
AA-WER by Dataset
AA-WER: AA-AgentTalk Dataset
Cleaned Dataset Comparison
VoxPopuli: Cleaned vs Original Subset of Publicly Available Data
API Benchmarks
Artificial Analysis Word Error Rate Index vs. Price
Speed Factor
Price of Transcription
Summary of Key Metrics & Further Information
| Model | Whisper version | Word Error Rate (%) | Median Speed Factor | Price (USD per 1000 minutes) |
|---|---|---|---|---|
| Qwen3.5 Omni Flash | | 13.6% | 106.4 | 0.00 |
| Qwen3.5 Omni Plus | | 3.7% | 145.1 | 0.00 |
| Gemini 3.1 Pro Preview (High) | | 2.9% | 6.9 | 18.15 |
| Gemini 3.1 Pro Preview (Low) | | 3.8% | 6.5 | 7.72 |
| Gemini 3 Flash (High) | | 3.1% | 14.7 | 13.70 |
| Gemini 3 Pro (High) | | 2.9% | 6.6 | 18.40 |
| Gemini 2.5 Flash Lite | | 5.3% | 68.1 | 6.56 |
| Gemini 2.5 Flash | | 5.2% | 70.5 | 6.66 |
| Gemini 2.5 Pro | | 3.0% | 12.7 | 11.39 |
| Gemini 2.0 Flash Lite | | 3.9% | 55.8 | 0.19 |
| Gemini 2.0 Flash | | 3.9% | 55.3 | 1.40 |
| Gemini 3.1 Flash-Lite Preview (Minimal) | | 3.5% | 72.8 | 5.83 |
| Pulse STT | | 4.5% | 178.9 | 5.00 |
| Voxtral Mini Transcribe 2 | | 3.7% | 75.3 | 3.00 |
| Voxtral Mini Transcribe | | 3.7% | 52.1 | 1.00 |
| Voxtral Small | | 2.9% | 65.5 | 4.00 |
| Voxtral Mini | | 4.0% | 72.7 | 1.00 |
| Universal-3 Pro | | 3.3% | 90.9 | 3.50 |
| Universal, AssemblyAI | | 3.9% | 117.1 | 2.50 |
| Soniox V4 | | 4.0% | 37.7 | 1.66 |
| Scribe v2 | | 2.2% | 41.4 | 6.67 |
| Scribe v1 | | 3.1% | 41.4 | 6.67 |
| Nova 2 Pro | | 5.0% | 23.4 | 3.10 |
| Gradium Speech-to-Text | | 8.5% | 2.3 | 13.00 |
| Parakeet TDT 0.6B V3, Togetherai | | 4.6% | 836.2 | 1.50 |
| Canary Qwen 2.5B, NVIDIA | | 4.3% | 5.0 | 0.74 |
| Parakeet TDT 0.6B V2, NVIDIA | | 6.5% | 102.0 | 0.00 |
| Parakeet RNNT 1.1B | | 5.5% | 5.6 | 1.91 |
| Solaria-1, Gladia | | 4.2% | 53.9 | 4.07 |
| GPT-4o Transcribe | | 4.1% | 31.3 | 6.00 |
| GPT-4o Mini Transcribe | | 4.6% | 43.3 | 3.00 |
| Nova-3 | | 5.3% | 269.3 | 4.30 |
| Nova-2 | | 5.4% | 486.9 | 4.30 |
| Base | | 10.8% | 472.4 | 12.50 |
| Whisper Large v3 Turbo | v3 Turbo | 4.8% | 316.6 | 0.67 |
| Whisper Large v3 Turbo | v3 Turbo | 4.8% | 222.1 | 1.00 |
| Wizper Large v3 | large-v3 | 4.8% | 234.3 | 0.50 |
| Incredibly Fast Whisper | large-v3 | 5.8% | 58.0 | 1.49 |
| Whisper Large v3 | large-v3 | 10.2% | 2.6 | 4.23 |
| Whisper Large v3 | large-v3 | 4.2% | 111.1 | 1.15 |
| Whisper Large v3 | large-v3 | 4.7% | 33.2 | 1.00 |
| Whisper Large v3 | large-v3 | 4.7% | 504.9 | 1.50 |
| Whisper Large v2 | large-v2 | 4.2% | 27.1 | 6.00 |
| Amazon Transcribe | | 4.2% | 18.7 | 24.00 |
| Speechmatics Standard | | 5.1% | 44.5 | 4.00 |
| Speechmatics Enhanced | | 4.1% | 23.9 | 6.70 |
| Rev AI | | 6.0% | 12.6 | 20.00 |
| Velma-2 STT Batch English VFast | | 6.0% | 200.9 | 0.00 |
Frequently Asked Questions
Fun-Realtime-ASR-preview has the lowest AA-WER (Artificial Analysis Word Error Rate) of the 50 models evaluated, at 1.8%.
The top five Speech to Text models by accuracy (AA-WER) are: 1. Fun-Realtime-ASR-preview (1.8%), 2. Scribe v2 (ElevenLabs, 2.2%), 3. Voxtral Small (Mistral, 2.9%), 4. Gemini 3 Pro (High) (Google, 2.9%), 5. Gemini 3.1 Pro Preview (High) (2.9%). A lower AA-WER indicates better transcription accuracy.
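For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below shows that textbook calculation only. The function name `word_error_rate` and the lowercase, whitespace-split tokenization are assumptions made for illustration, and the AA-WER index may normalize and aggregate differently (see the methodology page).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Textbook WER: (substitutions + deletions + insertions) / reference words.

    Assumption: lowercase + whitespace tokenization, chosen for illustration only.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
wer = word_error_rate("the cat sat down", "the cat sat town")
print(f"{wer:.1%}")  # prints 25.0%
```

Because the denominator is the reference length, a hypothesis with many inserted words can score above 100%.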
Parakeet TDT 0.6B V3 (Togetherai) is the fastest, with a speed factor of 836.2x real-time, followed by Whisper Large v3 on together.ai (504.9x) and Nova-2 (486.9x). Higher speed factors mean faster transcription.
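Reading the speed factor as seconds of audio transcribed per second of processing (an interpretation of the "x real-time" figures, not a definition taken from the methodology page), the expected processing time for a recording is its duration divided by the speed factor. A minimal sketch under that assumption, with a hypothetical helper name:

```python
def processing_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Estimated transcription time.

    Assumption: speed factor = audio duration / processing time.
    """
    if speed_factor <= 0:
        raise ValueError("speed factor must be positive")
    return audio_seconds / speed_factor


one_hour = 60 * 60  # seconds of audio
# Median speed factors copied from the summary table.
for name, factor in [
    ("Parakeet TDT 0.6B V3", 836.2),
    ("Nova-2", 486.9),
    ("Gradium Speech-to-Text", 2.3),
]:
    seconds = processing_seconds(one_hour, factor)
    print(f"{name}: {seconds:.1f} s per hour of audio")
```

Under this reading, an hour of audio takes roughly 4.3 seconds at 836.2x and about 26 minutes at 2.3x. The table reports median speed factors, so individual requests may be faster or slower.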
Gemini 2.0 Flash Lite is the most affordable at $0.19 per 1,000 minutes, followed by Wizper Large v3 on fal.ai ($0.50) and Whisper Large v3 Turbo on Groq ($0.667).
Voxtral Small (Mistral) is the most accurate open weights model, with an AA-WER of 2.9%. Of the 50 models evaluated, 12 are open weights.
The top open weights Speech to Text models by accuracy are: 1. Voxtral Small (Mistral, AA-WER 2.9%), 2. Voxtral Mini Transcribe (Mistral, AA-WER 3.7%), 3. Voxtral Mini Transcribe 2 (Mistral, AA-WER 3.7%).
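Since prices are quoted in USD per 1,000 minutes of audio, the cost of a workload is its length in minutes, divided by 1,000, times the listed price. The sketch below applies that arithmetic to three prices from the summary table. The function name and the 500-hour workload are illustrative assumptions, and the calculation ignores any volume discounts or minimum charges a provider might apply.

```python
def transcription_cost_usd(audio_minutes: float, price_per_1000_minutes: float) -> float:
    """Cost of transcribing `audio_minutes` at a price quoted per 1,000 minutes."""
    return audio_minutes / 1000 * price_per_1000_minutes


workload_minutes = 500 * 60  # hypothetical workload: 500 hours of audio
# Prices (USD per 1,000 minutes) copied from the summary table.
for name, price in [
    ("Gemini 2.0 Flash Lite", 0.19),
    ("Scribe v2", 6.67),
    ("Amazon Transcribe", 24.00),
]:
    cost = transcription_cost_usd(workload_minutes, price)
    print(f"{name}: ${cost:.2f}")
```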
The best model depends on your priorities. Use the scatter plots to visualize trade-offs between accuracy (AA-WER), speed, and price. For applications requiring high accuracy, prioritize models with lower AA-WER scores. For real-time applications, focus on speed factor. For cost-sensitive workloads, compare the price charts.
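One way to make these trade-offs concrete is to discard dominated options, meaning models that some other model matches or beats on every metric. The sketch below computes that Pareto frontier over a handful of rows from the summary table. The `Entry` class and function names are hypothetical, and rows priced at 0.00 are left out because the page does not say whether that value means free or unpriced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    wer_pct: float       # lower is better
    speed_factor: float  # higher is better
    price_usd: float     # USD per 1,000 minutes, lower is better


def dominates(a: Entry, b: Entry) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    no_worse = (
        a.wer_pct <= b.wer_pct
        and a.speed_factor >= b.speed_factor
        and a.price_usd <= b.price_usd
    )
    strictly_better = (
        a.wer_pct < b.wer_pct
        or a.speed_factor > b.speed_factor
        or a.price_usd < b.price_usd
    )
    return no_worse and strictly_better


def pareto_frontier(entries: list[Entry]) -> list[Entry]:
    """Keep only the entries that no other entry dominates."""
    return [e for e in entries if not any(dominates(o, e) for o in entries)]


# A subset of rows from the summary table (illustrative selection).
sample = [
    Entry("Scribe v2", 2.2, 41.4, 6.67),
    Entry("Scribe v1", 3.1, 41.4, 6.67),
    Entry("Voxtral Small", 2.9, 65.5, 4.00),
    Entry("Gemini 3 Pro (High)", 2.9, 6.6, 18.40),
    Entry("Gemini 2.0 Flash Lite", 3.9, 55.8, 0.19),
    Entry("Nova-2", 5.4, 486.9, 4.30),
    Entry("Amazon Transcribe", 4.2, 18.7, 24.00),
]

for entry in pareto_frontier(sample):
    print(entry.name)
```

Within this subset, the survivors are Scribe v2 (most accurate), Voxtral Small (a balance of all three), Gemini 2.0 Flash Lite (cheapest), and Nova-2 (fastest). Which of them is "best" still depends on how you weight the three metrics.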
Speech to Text models & providers compared: Qwen3.5 Omni Flash, Qwen3.5 Omni Plus, Gemini 3.1 Pro Preview (High), Gemini 3.1 Pro Preview (Low), Pulse STT, Voxtral Mini Transcribe 2, Universal-3 Pro, Soniox V4, Scribe v2, Gemini 3 Flash (High), Nova 2 Pro, Gradium Speech-to-Text, Gemini 3 Pro (High), Parakeet TDT 0.6B V3, Togetherai, Gemini 2.5 Flash Lite, Canary Qwen 2.5B, Replicate, Voxtral Mini Transcribe, Voxtral Small, Voxtral Mini, Deepinfra, Gemini 2.5 Flash, Gemini 2.5 Pro, Parakeet TDT 0.6B V2, Solaria-1, GPT-4o Transcribe, GPT-4o Mini Transcribe, Scribe v1, Gemini 2.0 Flash Lite, Nova-3, Gemini 2.0 Flash, Universal, Whisper Large v3 Turbo (Groq), Whisper Large v3 Turbo (Fireworks), Parakeet RNNT 1.1B, Replicate, Amazon Transcribe, Wizper Large v3 (fal.ai), Incredibly Fast Whisper (Replicate), Whisper Large v3 (Replicate), Whisper Large v3 (Fireworks), Whisper Large v3 (together.ai), Nova-2, Speechmatics Standard, Speechmatics Enhanced, Whisper Large v2, Base, Rev AI, Gemini 3.1 Flash-Lite Preview (Minimal), Velma-2 STT Batch English VFast.