Speech to Text AI Model & Provider Leaderboard

Compare word error rate, speed, and pricing across Speech to Text models and providers.

For further details, see our methodology page.

Highlights

AA-WER v2 · % of words transcribed incorrectly · Lower is better
Input audio seconds transcribed per second · Higher is better
USD per 1000 minutes of audio · Lower is better

Artificial Analysis Word Error Rate Index (Non-streaming)

% of words transcribed incorrectly · Lower is better · AA-WER v2 incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), Earnings22-Cleaned-AA (25%)
Note: For Earnings22, if a model cannot reliably handle full-length audio due to time limits, we split the audio into chunks of ~9 minutes (applies to: GPT-4o Mini Transcribe, OpenAI; GPT-4o Transcribe, OpenAI; Nova 2 Pro, Amazon; Voxtral Mini Transcribe, Mistral). For models with even shorter time limits, we split the audio into chunks of ~30 seconds (applies to: Qwen3 ASR Flash, Alibaba; Parakeet TDT 0.6B V3, NVIDIA; Canary Qwen 2.5B, NVIDIA).
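
As a rough illustration of the chunking described in the note above, the sketch below splits a recording into consecutive spans no longer than a given limit. The page does not say where chunk boundaries are placed, so the fixed-length cut used here, the function name chunk_spans, and the example durations are all assumptions rather than the benchmark's actual procedure.

```python
def chunk_spans(total_seconds: float, max_chunk_seconds: float) -> list[tuple[float, float]]:
    """Return consecutive (start, end) spans covering the audio.

    Assumption: chunks are cut at fixed intervals. A real pipeline might
    instead cut at silences or sentence boundaries.
    """
    if total_seconds < 0 or max_chunk_seconds <= 0:
        raise ValueError("total_seconds must be >= 0 and max_chunk_seconds > 0")
    spans = []
    start = 0.0
    total = float(total_seconds)
    while start < total:
        end = min(start + max_chunk_seconds, total)
        spans.append((start, end))
        start = end
    return spans


# Hypothetical 20-minute call split at the ~9-minute limit (540 seconds).
nine_minute_spans = chunk_spans(20 * 60, 9 * 60)
# The same call split at the ~30-second limit.
thirty_second_spans = chunk_spans(20 * 60, 30)
```

Each span would be transcribed separately and the transcripts joined before scoring; how the joined text is scored is covered by the methodology page, not by this sketch.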

Measures transcription accuracy across 3 datasets to evaluate models in real-world speech with diverse accents, domain-specific language, and challenging channel & acoustic conditions.

AA-WER is calculated as an audio-duration-weighted average of WER across ~8 hours from three datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), and Earnings22-Cleaned-AA (25%). See methodology for more detail.
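
The two steps described above, a per-dataset word error rate and a weighted combination, can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the common word-level edit distance definition of WER, skips the text normalization a real evaluation would apply, and treats the 50/25/25 percentages as fixed weights. The function names are hypothetical and the methodology page remains the authority on the actual calculation.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (edit_distance_in_words, reference_word_count).

    Edit distance counts substitutions, deletions and insertions.
    Assumption: whitespace tokenization with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1], len(ref)


def wer_percent(reference: str, hypothesis: str) -> float:
    errors, n_words = word_errors(reference, hypothesis)
    return 100.0 * errors / n_words


# Weights as stated on this page; treated here as fixed constants.
WEIGHTS = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli-Cleaned-AA": 0.25,
    "Earnings22-Cleaned-AA": 0.25,
}


def aa_wer(per_dataset_wer: dict[str, float]) -> float:
    """Combine per-dataset WER percentages into a single weighted index."""
    return sum(WEIGHTS[name] * per_dataset_wer[name] for name in WEIGHTS)
```

For example, per-dataset WERs of 2.0%, 4.0% and 8.0% (made-up numbers) would combine to 0.5 × 2.0 + 0.25 × 4.0 + 0.25 × 8.0 = 4.0%.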

AA-WER (Non-streaming) by Dataset

AA-WER (Non-streaming): AA-AgentTalk Dataset

% of words transcribed incorrectly on the AA-AgentTalk dataset · Lower is better

Cleaned Dataset Comparison

VoxPopuli: Cleaned vs Original Subset of Publicly Available Data

% WER (word error rate) · Lower is better
Note: The cleaned versions remove transcription errors from the reference text, providing a more accurate ground truth for model evaluation.

API Benchmarks

Artificial Analysis Word Error Rate Index (Non-streaming) vs. Price

% of words transcribed incorrectly · Lower is better · AA-WER v2 incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), Earnings22-Cleaned-AA (25%) · USD per 1000 minutes of audio

Estimated cost in USD to transcribe 1,000 minutes of audio, normalized across providers with different billing models, and including billed reasoning tokens where available. Further detail on the methodology page.
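
To make the normalization concrete, the sketch below converts a time-based list price into USD per 1,000 minutes of audio. It covers only providers that bill per second, minute, or hour of audio; token-based billing (including reasoning tokens) needs the approach described on the methodology page and is not modeled here. The function name and the example price are hypothetical.

```python
# Seconds of audio covered by one billing unit.
SECONDS_PER_UNIT = {"second": 1, "minute": 60, "hour": 3600}


def usd_per_1000_minutes(price_usd: float, per_unit: str) -> float:
    """Convert a price quoted per second/minute/hour of audio to USD per 1,000 minutes."""
    if per_unit not in SECONDS_PER_UNIT:
        raise ValueError(f"unsupported billing unit: {per_unit!r}")
    price_per_second = price_usd / SECONDS_PER_UNIT[per_unit]
    return price_per_second * 60 * 1000


# Hypothetical provider charging $0.36 per hour of audio.
hourly_example = usd_per_1000_minutes(0.36, "hour")
```

A price of $0.36 per hour is $0.006 per minute, which comes to about $6.00 per 1,000 minutes.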

Speed Factor

Input audio seconds transcribed per second · Higher is better

Audio file seconds transcribed per second of processing time. A higher factor indicates faster transcription. Reported Speed Factor values are medians across benchmark trials from the last 7 days; over-time chart points are daily medians. Artificial Analysis measurements are based on an audio duration of 10 minutes. Speed Factor may vary for other durations, particularly very short durations under 1 minute.
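
The definition above reduces to a ratio and a median, sketched below. The trial timings are invented for illustration, and the function names are assumptions; only the 10-minute audio duration and the use of medians come from this page.

```python
from statistics import median


def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of input audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def median_speed_factor(audio_seconds: float, trial_processing_seconds: list[float]) -> float:
    """Median Speed Factor across repeated trials on the same audio file."""
    return median([speed_factor(audio_seconds, t) for t in trial_processing_seconds])


# A 10-minute file (600 s) with three made-up processing times in seconds.
example_median = median_speed_factor(600, [5.0, 6.0, 10.0])
```

With these timings the per-trial factors are 120, 100 and 60, so the median is 100: the file is transcribed 100 times faster than real time in the typical trial.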

Price of Transcription

USD per 1000 minutes of audio

Summary of Key Metrics & Further Information

Model | Provider | AA-WER | Speed Factor | Price (USD per 1,000 minutes)
Qwen3.5 Omni Flash | Alibaba Cloud | 13.5% | 81.6 | 0.00
Qwen3.5 Omni Plus | Alibaba Cloud | 3.5% | 94.9 | 0.00
Nova 2 Pro | Amazon Bedrock | 4.9% | 22.6 | 3.10
Amazon Transcribe | Amazon Bedrock | 4.1% | 19.0 | 24.00
Universal-3 Pro | AssemblyAI | 3.1% | 90.2 | 3.50
Universal, AssemblyAI | AssemblyAI | 3.8% | 112.7 | 2.50
MAI-Transcribe-1 | Microsoft Azure | 2.6% | 55.3 | 6.00
Nova-3 | Deepgram | 5.2% | 445.2 | 4.30
Nova-2 | Deepgram | 5.3% | 491.2 | 4.30
Base | Deepgram | 10.7% | 339.9 | 12.50
Scribe v2 | ElevenLabs | 2.2% | 39.2 | 3.67
Scribe v1 | ElevenLabs | 3.0% | 41.4 | 6.67
Solaria-1, Gladia | Gladia | 4.1% | 60.2 | 4.07
Gemini 3.1 Pro Preview (High) | Google | 2.8% | 6.4 | 18.15
Gemini 3.1 Pro Preview (Low) | Google | 3.6% | 7.1 | 7.72
Gemini 3 Flash (High) | Google | 2.9% | 17.6 | 13.70
Gemini 3 Pro (High) | Google | 2.7% | 9.2 | 18.40
Gemini 2.5 Flash Lite | Google | 5.2% | 67.7 | 6.56
Gemini 2.5 Flash | Google | 5.1% | 66.3 | 6.66
Gemini 2.5 Pro | Google | 2.9% | 13.3 | 11.39
Gemini 2.0 Flash Lite | Google | 3.8% | 55.9 | 0.19
Gemini 3.1 Flash-Lite Preview (Minimal) | Google | 3.4% | 67.5 | 5.83
Gradium Speech-to-Text | Gradium | 8.4% | 2.3 | 13.00
Voxtral Mini Transcribe 2 | Mistral | 3.6% | 71.5 | 3.00
Voxtral Mini Transcribe | Mistral | 3.5% | 54.6 | 2.00
Voxtral Small | Mistral | 2.8% | 66.2 | 4.00
Voxtral Mini | DeepInfra | 3.8% | 69.1 | 1.00
Velma-2 STT Batch English VFast | Modulate | 5.9% | 201.6 | 0.00
Parakeet TDT 0.6B V3, Togetherai | Together.ai | 4.5% | 905.2 | 1.50
Canary Qwen 2.5B, NVIDIA | Replicate | 4.3% | 5.4 | 0.74
Parakeet TDT 0.6B V2, NVIDIA | NVIDIA | 6.4% | 101.9 | 0.00
Parakeet RNNT 1.1B | Replicate | 5.4% | 5.9 | 1.91
GPT-4o Transcribe | OpenAI | 4.0% | 31.8 | 6.00
GPT-4o Mini Transcribe | OpenAI | 4.5% | 44.4 | 3.00
Rev AI | Rev AI | 5.9% | 12.6 | 3.33
Smallest Pulse | Smallest.ai | 4.4% | 140.2 | 5.00
Speechmatics Standard | Speechmatics | 5.1% | 69.9 | 4.00
Speechmatics Enhanced | Speechmatics | 4.0% | 53.8 | 6.70
Whisper Large v3 Turbo | Groq | 4.6% | 197.4 | 0.67
Whisper Large v3 Turbo | Fireworks | 4.7% | 227.2 | 1.00
Wizper Large v3 | fal.ai | 4.7% | 268.0 | 0.50
Incredibly Fast Whisper | Replicate | 5.7% | 55.8 | 1.49
Whisper Large v3 | Replicate | 10.1% | 2.7 | 4.23
Whisper Large v3 | fal.ai | 4.1% | 80.4 | 1.15
Whisper Large v3 | Fireworks | 4.6% | 303.5 | 1.00
Whisper Large v3 | Together.ai | 4.5% | 435.2 | 1.50
Whisper Large v2 | OpenAI | 4.1% | 27.5 | 6.00

Frequently Asked Questions

Fun-Realtime-ASR-preview leads with the lowest AA-WER (Artificial Analysis Word Error Rate) of 1.7% across 49 models evaluated.

The top speech to text models by accuracy (AA-WER) are: 1. Fun-Realtime-ASR-preview (1.7%), 2. Scribe v2, ElevenLabs (2.2%), 3. MAI-Transcribe-1 (2.6%), 4. Gemini 3 Pro (High), Google (2.7%), 5. Voxtral Small, Mistral (2.8%). Lower AA-WER indicates better transcription accuracy.

Parakeet TDT 0.6B V3, Togetherai is the fastest with a speed factor of 905.2x real-time, followed by Nova-2 (491.2x) and Nova-3 (445.2x). Higher speed factors mean faster transcription.

Gemini 2.0 Flash Lite is the most affordable at $0.19 per 1,000 minutes, followed by Wizper (L, v3), fal.ai ($0.50) and Whisper (L, v3, Turbo), Groq ($0.667).

Voxtral Small, Mistral is the most accurate open weights model with an AA-WER of 2.8%. There are 12 open weights models out of 49 total evaluated.

The top open weights speech to text models by accuracy are: 1. Voxtral Small, Mistral (AA-WER 2.8%), 2. Voxtral Mini Transcribe, Mistral (AA-WER 3.5%), 3. Voxtral Mini Transcribe 2, Mistral (AA-WER 3.6%).

The best model depends on your priorities. Use the scatter plots to visualize trade-offs between accuracy (AA-WER), speed, and price. For applications requiring high accuracy, prioritize models with lower AA-WER scores. For real-time applications, focus on speed factor. For cost-sensitive workloads, compare the price charts.
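
One way to act on the advice above is to filter on the constraints that matter and sort by the metric you care about most. The sketch below does this for four entries copied from the summary table on this page; the Entry class, the shortlist function, and the thresholds are hypothetical choices for illustration, not a recommendation method used by the leaderboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    provider: str
    wer_pct: float           # AA-WER, lower is better
    speed_factor: float      # higher is better
    usd_per_1000_min: float  # lower is better


# Four rows taken from the summary table on this page.
ENTRIES = [
    Entry("Scribe v2", "ElevenLabs", 2.2, 39.2, 3.67),
    Entry("Voxtral Small", "Mistral", 2.8, 66.2, 4.00),
    Entry("Nova-3", "Deepgram", 5.2, 445.2, 4.30),
    Entry("Parakeet TDT 0.6B V3", "Together.ai", 4.5, 905.2, 1.50),
]


def shortlist(entries, *, max_wer_pct=None, min_speed_factor=None,
              max_price=None, sort_key="wer_pct"):
    """Keep entries that satisfy every given constraint, best first by sort_key."""
    kept = [
        e for e in entries
        if (max_wer_pct is None or e.wer_pct <= max_wer_pct)
        and (min_speed_factor is None or e.speed_factor >= min_speed_factor)
        and (max_price is None or e.usd_per_1000_min <= max_price)
    ]
    # Speed is the only metric where larger values are better.
    return sorted(kept, key=lambda e: getattr(e, sort_key),
                  reverse=(sort_key == "speed_factor"))


# Accuracy first: at most 3.0% AA-WER, cheapest first.
accurate = shortlist(ENTRIES, max_wer_pct=3.0, sort_key="usd_per_1000_min")
# Speed first: at least 400x real time, fastest first.
fast = shortlist(ENTRIES, min_speed_factor=400, sort_key="speed_factor")
```

With these four rows, the accuracy-first query returns Scribe v2 then Voxtral Small, and the speed-first query returns Parakeet TDT 0.6B V3 then Nova-3.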

Speech to Text models & providers compared: MAI-Transcribe-1, Qwen3.5 Omni Flash, Qwen3.5 Omni Plus, Gemini 3.1 Pro Preview (High), Gemini 3.1 Pro Preview (Low), Pulse STT, Voxtral Mini Transcribe 2, Universal-3 Pro, Scribe v2, Gemini 3 Flash (High), Nova 2 Pro, Gradium Speech-to-Text, Gemini 3 Pro (High), Parakeet TDT 0.6B V3, Togetherai, Gemini 2.5 Flash Lite, Canary Qwen 2.5B, Replicate, Voxtral Mini Transcribe, Voxtral Small, Voxtral Mini, Deepinfra, Gemini 2.5 Flash, Gemini 2.5 Pro, Parakeet TDT 0.6B V2, Solaria-1, GPT-4o Transcribe, GPT-4o Mini Transcribe, Scribe v1, Gemini 2.0 Flash Lite, Nova-3, Universal, Whisper (L, v3, Turbo), Groq, Whisper (L, v3, Turbo), Fireworks, Parakeet RNNT 1.1B, Replicate, Amazon Transcribe, Wizper (L, v3), fal.ai, Incredibly Fast, Replicate, Whisper (L, v3), Replicate, Whisper (L, v3), fal.ai, Whisper (L, v3), Fireworks, Whisper Large v3, together.ai, Nova-2, Standard, Enhanced, Large v2, Base, Rev AI, Gemini 3.1 Flash-Lite Preview (Minimal), Velma-2 STT Batch English VFast.