Speech to Text AI Model & Provider Leaderboard

Compare word error rate, speed, and pricing across Speech to Text models and providers. Our comprehensive analysis helps you choose the best Speech to Text model for your specific use case and requirements.

For further details, see our methodology page.

Highlights

AA-WER v2 · % of words transcribed incorrectly · Lower is better
Input audio seconds transcribed per second · Higher is better
USD per 1000 minutes of audio · Lower is better

Artificial Analysis Word Error Rate (AA-WER) Index

% of words transcribed incorrectly · Lower is better · AA-WER v2 incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), Earnings22-Cleaned-AA (25%)
The AA-AgentTalk dataset is not yet publicly available (coming soon).
Note: For Earnings22, if a model cannot reliably handle full-length audio due to time limits, we chunk to ~9 minutes (relevant to: GPT-4o Mini Transcribe, OpenAI; GPT-4o Transcribe, OpenAI; Nova 2 Pro, Amazon; Voxtral Mini Transcribe, Mistral). For models with even shorter time limits, we chunk to ~30 seconds (relevant to: Canary Qwen 2.5B, NVIDIA; Parakeet TDT 0.6B V3, NVIDIA; Qwen3 ASR Flash, Alibaba).
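
The chunking described in the note can be sketched as follows. This is a minimal illustration, assuming fixed-length, non-overlapping chunks; the function name `chunk_spans` is hypothetical, and the benchmark's actual splitting logic (for example, whether boundaries are aligned to pauses) is not described here.

```python
def chunk_spans(duration_s: float, max_chunk_s: float) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds that cover the whole recording.

    Assumption: chunks are contiguous and non-overlapping; only the last
    chunk may be shorter than max_chunk_s.
    """
    if duration_s < 0 or max_chunk_s <= 0:
        raise ValueError("duration_s must be >= 0 and max_chunk_s must be > 0")
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        spans.append((start, end))
        start = end
    return spans


# ~9-minute chunks for a 20-minute recording (durations are illustrative).
nine_minute_spans = chunk_spans(20 * 60, 9 * 60)
# ~30-second chunks for models with shorter time limits.
thirty_second_spans = chunk_spans(20 * 60, 30)
```

Each span would then be transcribed separately and the partial transcripts joined before scoring.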

Measures transcription accuracy across 3 datasets to evaluate models in real-world speech with diverse accents, domain-specific language, and challenging channel & acoustic conditions.
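
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below assumes lowercasing and whitespace tokenization only; the text normalization actually applied for AA-WER is not specified on this page, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Multiplying the result by 100 gives the percentage form used in the charts.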

AA-WER is calculated as an audio-duration-weighted average of WER across ~8 hours from three datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), and Earnings22-Cleaned-AA (25%). See methodology for more detail.
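
As a rough illustration of the weighting, the sketch below combines per-dataset WER values using the stated 50/25/25 split. It assumes the percentages are applied directly as weights to each dataset's WER; the input values are made up for the example, and `aa_wer_index` is a hypothetical name rather than part of any published tooling.

```python
# Dataset weights as stated above (assumed to be applied directly).
WEIGHTS = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli-Cleaned-AA": 0.25,
    "Earnings22-Cleaned-AA": 0.25,
}


def aa_wer_index(wer_by_dataset: dict[str, float]) -> float:
    """Weighted average of per-dataset WER values (in percent)."""
    missing = WEIGHTS.keys() - wer_by_dataset.keys()
    if missing:
        raise KeyError(f"missing WER for: {sorted(missing)}")
    return sum(WEIGHTS[name] * wer_by_dataset[name] for name in WEIGHTS)


# Illustrative (not measured) per-dataset WER values, in percent.
example = {
    "AA-AgentTalk": 2.0,
    "VoxPopuli-Cleaned-AA": 4.0,
    "Earnings22-Cleaned-AA": 8.0,
}
index = aa_wer_index(example)
```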

AA-WER by Dataset

AA-WER: AA-AgentTalk Dataset

% of words transcribed incorrectly on the AA-AgentTalk dataset · Lower is better
The AA-AgentTalk dataset is not yet publicly available (coming soon).

Cleaned Dataset Comparison

VoxPopuli: Cleaned vs Original Subset of Publicly Available Data

% WER (word error rate) · Lower is better
Note: The cleaned versions remove transcription errors from the reference text, providing a more accurate ground truth for model evaluation.

API Benchmarks

Artificial Analysis Word Error Rate Index vs. Price

% of words transcribed incorrectly · Lower is better · AA-WER v2 incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), Earnings22-Cleaned-AA (25%) · USD per 1000 minutes of audio
Most attractive quadrant: low AA-WER and low price (both metrics are lower-is-better).
Providers shown: Amazon Bedrock, AssemblyAI, Deepgram, ElevenLabs, fal.ai, Fireworks, Gladia, Google, Mistral, OpenAI, Replicate, Rev AI, Smallest.ai, Soniox, Speechmatics, Together.ai

Estimated cost in USD to transcribe 1,000 minutes of audio, normalized across providers with different billing models, and including billed reasoning tokens where available. Further detail on the methodology page.
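
To show what normalizing across billing models can look like, the sketch below converts duration-based list prices into USD per 1,000 minutes. It is an assumption-laden illustration: it covers only per-second, per-minute, and per-hour billing, the prices are invented, and token-based billing (including reasoning tokens) would need a separate conversion that is not attempted here. The name `usd_per_1000_minutes` is hypothetical.

```python
SECONDS_PER_UNIT = {"second": 1, "minute": 60, "hour": 3600}


def usd_per_1000_minutes(price_usd: float, per_unit: str) -> float:
    """Convert a price quoted per second/minute/hour of audio to USD per 1,000 minutes."""
    if per_unit not in SECONDS_PER_UNIT:
        raise ValueError(f"unsupported billing unit: {per_unit!r}")
    price_per_second = price_usd / SECONDS_PER_UNIT[per_unit]
    return price_per_second * 60 * 1000


# Illustrative list prices (not taken from any provider).
per_minute = usd_per_1000_minutes(0.006, "minute")
per_hour = usd_per_1000_minutes(0.36, "hour")
```

Both example prices normalize to roughly the same figure, which is what makes providers with different billing units directly comparable.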

Speed Factor

Input audio seconds transcribed per second · Higher is better

Audio file seconds transcribed per second of processing time. Higher factor indicates faster transcription speed. Reported Speed Factor values are medians across benchmark trials from the last 7 days; over-time chart points are daily medians. Artificial Analysis measurements are based on an audio duration of 10 minutes. Speed Factor may vary for other durations, particularly very short durations under 1 minute.
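
The definition above reduces to a simple ratio, summarized with a median over trials. The sketch below assumes a 10-minute (600-second) test file, as stated, and uses invented processing times; the helper names are hypothetical.

```python
from statistics import median


def speed_factor(audio_s: float, processing_s: float) -> float:
    """Seconds of audio transcribed per second of processing time."""
    if processing_s <= 0:
        raise ValueError("processing_s must be positive")
    return audio_s / processing_s


def median_speed_factor(audio_s: float, processing_times_s: list[float]) -> float:
    """Median speed factor across benchmark trials of the same audio file."""
    return median(speed_factor(audio_s, t) for t in processing_times_s)


# Three illustrative trials on a 10-minute file.
reported = median_speed_factor(600, [6.0, 5.0, 10.0])
```

A median is less sensitive than a mean to a single slow trial, for example one delayed by a cold start.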

Price of Transcription

USD per 1000 minutes of audio

Summary of Key Metrics & Further Information

Model · Provider
Qwen3.5 Omni Flash · Alibaba Cloud
Qwen3.5 Omni Plus · Alibaba Cloud
Gemini 3.1 Pro Preview (High) · Google
Gemini 3.1 Pro Preview (Low) · Google
Gemini 3 Flash (High) · Google
Gemini 3 Pro (High) · Google
Gemini 2.5 Flash Lite · Google
Gemini 2.5 Flash · Google
Gemini 2.5 Pro · Google
Gemini 2.0 Flash Lite · Google
Gemini 2.0 Flash · Google
Gemini 3.1 Flash-Lite Preview (Minimal) · Google
Pulse STT · Smallest.ai
Voxtral Mini Transcribe 2 · Mistral
Voxtral Mini Transcribe · Mistral
Voxtral Small · Mistral
Voxtral Mini · DeepInfra
Universal-3 Pro · AssemblyAI
Universal, AssemblyAI · AssemblyAI
Soniox V4 · Soniox
Scribe v2 · ElevenLabs
Scribe v1 · ElevenLabs
Nova 2 Pro · Amazon Bedrock
Gradium Speech-to-Text · Gradium
Parakeet TDT 0.6B V3, Togetherai · Together.ai
Canary Qwen 2.5B, NVIDIA · Replicate
Parakeet TDT 0.6B V2, NVIDIA · NVIDIA
Parakeet RNNT 1.1B · Replicate
Solaria-1, Gladia · Gladia
GPT-4o Transcribe · OpenAI
GPT-4o Mini Transcribe · OpenAI
Nova-3 · Deepgram
Nova-2 · Deepgram
Base · Deepgram
Whisper Large v3 Turbo · Groq
Whisper Large v3 Turbo · Fireworks
Wizper Large v3 · fal.ai
Incredibly Fast Whisper · Replicate
Whisper Large v3 · Replicate
Whisper Large v3 · fal.ai
Whisper Large v3 · Fireworks
Whisper Large v3 · Together.ai
Whisper Large v2 · OpenAI
Amazon Transcribe · Amazon Bedrock
Speechmatics Standard · Speechmatics
Speechmatics Enhanced · Speechmatics
Rev AI · Rev AI

Frequently Asked Questions

Fun-Realtime-ASR-preview leads with the lowest AA-WER (Artificial Analysis Word Error Rate) of 1.8% across 51 models evaluated.

The top speech to text models by accuracy (AA-WER) are: 1. Fun-Realtime-ASR-preview (1.8%); 2. Scribe v2, ElevenLabs (2.2%); 3. Gemini 3 Pro (High), Google (2.9%); 4. Voxtral Small, Mistral (2.9%); 5. Gemini 3.1 Pro Preview (High) (2.9%). Lower AA-WER indicates better transcription accuracy.

Parakeet TDT 0.6B V3 on Together.ai is the fastest, with a speed factor of 905.1x real-time, followed by Deepgram's Base (621.9x) and Nova-2 (579.5x). Higher speed factors mean faster transcription.

Gemini 2.0 Flash Lite is the most affordable at $0.19 per 1,000 minutes, followed by Wizper Large v3 on fal.ai ($0.50) and Whisper Large v3 Turbo on Groq ($0.667).

Voxtral Small, Mistral is the most accurate open weights model with an AA-WER of 2.9%. There are 11 open weights models out of 51 total evaluated.

The top open weights speech to text models by accuracy are: 1. Voxtral Small, Mistral (AA-WER 2.9%); 2. Parakeet TDT 0.6B V3, NVIDIA (AA-WER 4.2%); 3. Whisper Large v2, OpenAI (AA-WER 4.2%).

The best model depends on your priorities. Use the scatter plots to visualize trade-offs between accuracy (AA-WER), speed, and price. For applications requiring high accuracy, prioritize models with lower AA-WER scores. For real-time applications, focus on speed factor. For cost-sensitive workloads, compare the price charts.
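
One way to narrow the field before weighing priorities is to keep only the options that no other option beats on both accuracy and price. The sketch below is an illustration of that idea, not part of the leaderboard's methodology; the model names and numbers are placeholders, and `pareto_front` is a hypothetical helper.

```python
def pareto_front(models: list[tuple[str, float, float]]) -> list[str]:
    """Return names of models not dominated on (WER, price); lower is better for both.

    A model is dominated if another model is at least as good on both
    metrics and strictly better on at least one.
    """
    front = []
    for name, wer, price in models:
        dominated = any(
            other_wer <= wer
            and other_price <= price
            and (other_wer < wer or other_price < price)
            for _, other_wer, other_price in models
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder entries: (name, AA-WER in percent, USD per 1,000 minutes).
candidates = [
    ("model-a", 2.0, 10.0),
    ("model-b", 3.0, 5.0),
    ("model-c", 4.0, 6.0),  # worse than model-b on both metrics
    ("model-d", 5.0, 1.0),
]
shortlist = pareto_front(candidates)
```

The same filter extends to speed by adding a third metric with the comparison reversed, since a higher speed factor is better.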

Speech to Text models & providers compared: Qwen3.5 Omni Flash; Qwen3.5 Omni Plus; Gemini 3.1 Pro Preview (High); Gemini 3.1 Pro Preview (Low); Pulse STT; Voxtral Mini Transcribe 2; Universal-3 Pro; Soniox V4; Scribe v2; Gemini 3 Flash (High); Nova 2 Pro; Gradium Speech-to-Text; Gemini 3 Pro (High); Parakeet TDT 0.6B V3, Togetherai; Gemini 2.5 Flash Lite; Canary Qwen 2.5B, Replicate; Voxtral Mini Transcribe; Voxtral Small; Voxtral Mini, Deepinfra; Gemini 2.5 Flash; Gemini 2.5 Pro; Parakeet TDT 0.6B V2; Solaria-1; GPT-4o Transcribe; GPT-4o Mini Transcribe; Scribe v1; Gemini 2.0 Flash Lite; Nova-3; Gemini 2.0 Flash; Universal; Whisper (L, v3, Turbo), Groq; Whisper (L, v3, Turbo), Fireworks; Parakeet RNNT 1.1B, Replicate; Amazon Transcribe; Wizper (L, v3), fal.ai; Incredibly Fast Whisper, Replicate; Whisper (L, v3), Replicate; Whisper (L, v3), fal.ai; Whisper (L, v3), Fireworks; Whisper Large v3, together.ai; Nova-2; Standard; Enhanced; Whisper Large v2; Base; Rev AI; Gemini 3.1 Flash-Lite Preview (Minimal).