Speech to Text AI Model & Provider Leaderboard

Compare word error rate, speed, and pricing across Speech to Text models and providers. Our comprehensive analysis helps you choose the best Speech to Text model for your specific use case and requirements.

For further details, see our methodology page.

Highlights

Word Error Rate Index

AA-WER v2 · % of words transcribed incorrectly · Lower is better

Speed Factor

Input audio seconds transcribed per second · Higher is better

Price

USD per 1000 minutes of audio · Lower is better

Artificial Analysis Word Error Rate (AA-WER) Index

% of words transcribed incorrectly · Lower is better · AA-WER v2 incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), Earnings22-Cleaned-AA (25%)

Not yet publicly available (coming soon)

Note: For Earnings22, if a model cannot reliably handle full-length audio due to time limits, we chunk to ~9 minutes (relevant to: GPT-4o Mini Transcribe, OpenAI; GPT-4o Transcribe, OpenAI; Nova 2 Pro, Amazon; Voxtral Mini Transcribe, Mistral). For models with even shorter time limits, we chunk to ~30 seconds (relevant to: Canary Qwen 2.5B, NVIDIA; Parakeet TDT 0.6B V3, NVIDIA; Qwen3 ASR Flash, Alibaba).
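
A minimal sketch of the chunking described in the note, assuming fixed-length, non-overlapping chunks. The source does not say how chunk boundaries are chosen (for example, whether they snap to silences), so `chunk_bounds` and its parameters are hypothetical.

```python
def chunk_bounds(total_seconds, max_chunk_seconds):
    """Split [0, total_seconds) into consecutive chunks of at most max_chunk_seconds.

    Hypothetical helper: fixed-length cuts are an assumption, not the
    documented Artificial Analysis procedure.
    """
    if total_seconds < 0 or max_chunk_seconds <= 0:
        raise ValueError("total_seconds must be >= 0 and max_chunk_seconds > 0")
    bounds = []
    start = 0.0
    while start < total_seconds:
        # The last chunk is shortened so it ends exactly at the end of the audio.
        end = min(start + max_chunk_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds


# A 20-minute recording cut into ~9-minute chunks (540 s each).
nine_minute_chunks = chunk_bounds(1200, 540)
# The same recording cut into ~30-second chunks.
thirty_second_chunks = chunk_bounds(1200, 30)
```

Each chunk would be transcribed separately and the transcripts joined before scoring; how the joined text is scored is left to the methodology page.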

Measures transcription accuracy across 3 datasets to evaluate models in real-world speech with diverse accents, domain-specific language, and challenging channel & acoustic conditions.

AA-WER is calculated as an audio-duration-weighted average of WER across ~8 hours from three datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), and Earnings22-Cleaned-AA (25%). See methodology for more detail.
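
A minimal sketch of the two steps involved, under stated assumptions: WER is computed as word-level edit distance divided by the number of reference words, and the per-dataset scores are combined with the 50/25/25 weights quoted above. Text normalization and the exact audio-duration weighting are not modeled; `word_error_rate` and `aa_wer_index` are hypothetical names.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


# Dataset weights as stated on this page (assumed to be applied directly).
WEIGHTS = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli-Cleaned-AA": 0.25,
    "Earnings22-Cleaned-AA": 0.25,
}


def aa_wer_index(wer_by_dataset, weights=WEIGHTS):
    """Weighted average of per-dataset WER values."""
    return sum(weights[name] * wer_by_dataset[name] for name in weights)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Illustrative per-dataset WER values, not measured results.
example_index = aa_wer_index({
    "AA-AgentTalk": 0.02,
    "VoxPopuli-Cleaned-AA": 0.04,
    "Earnings22-Cleaned-AA": 0.06,
})
```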

AA-WER by Dataset

AA-WER: AA-AgentTalk Dataset

% of words transcribed incorrectly on the AA-AgentTalk dataset · Lower is better

Not yet publicly available (coming soon)

Cleaned Dataset Comparison

VoxPopuli: Cleaned vs Original Subset of Publicly Available Data

% WER (word error rate) · Lower is better

Note: The cleaned versions remove transcription errors from the reference text, providing a more accurate ground truth for model evaluation.

API Benchmarks

Artificial Analysis Word Error Rate Index vs. Price

% of words transcribed incorrectly · Lower is better · AA-WER v2 incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), Earnings22-Cleaned-AA (25%) · USD per 1000 minutes of audio

Chart: AA-WER vs. price by provider, with the most attractive quadrant highlighted. Providers shown: Amazon Bedrock, AssemblyAI, Deepgram, ElevenLabs, fal.ai, Fireworks, Gladia, Google, Mistral, OpenAI, Replicate, Rev AI, Smallest.ai, Soniox, Speechmatics, Together.ai.

Estimated cost in USD to transcribe 1,000 minutes of audio, normalized across providers with different billing models and including billed reasoning tokens where available. Further detail is available on the methodology page.
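
A minimal sketch of the normalization for time-based billing only, assuming a provider quotes a flat rate per second, minute, or hour of audio. Token-based billing (including reasoning tokens) is not modeled, and `usd_per_1000_minutes` is a hypothetical helper with made-up example rates.

```python
SECONDS_PER_UNIT = {"second": 1, "minute": 60, "hour": 3600}


def usd_per_1000_minutes(rate_usd, per_unit):
    """Convert a flat rate in USD per second/minute/hour of audio to USD per 1,000 minutes."""
    if per_unit not in SECONDS_PER_UNIT:
        raise ValueError(f"unsupported billing unit: {per_unit!r}")
    usd_per_minute = rate_usd * 60 / SECONDS_PER_UNIT[per_unit]
    return usd_per_minute * 1000


# Three hypothetical price lists that describe the same underlying cost.
per_minute = usd_per_1000_minutes(0.006, "minute")
per_hour = usd_per_1000_minutes(0.36, "hour")
per_second = usd_per_1000_minutes(0.0001, "second")
```

All three example rates normalize to roughly 6 USD per 1,000 minutes, which is what makes them comparable on a single axis.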

Speed Factor

Input audio seconds transcribed per second · Higher is better

Audio file seconds transcribed per second of processing time. Higher factor indicates faster transcription speed. Reported Speed Factor values are medians across benchmark trials from the last 7 days; over-time chart points are daily medians. Artificial Analysis measurements are based on an audio duration of 10 minutes. Speed Factor may vary for other durations, particularly very short durations under 1 minute.
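
A minimal sketch of the definition above: the speed factor of one trial is audio duration divided by processing time, and the reported value is the median across trials. The trial timings below are illustrative, not measurements, and the function names are hypothetical.

```python
from statistics import median


def speed_factor(audio_seconds, processing_seconds):
    """Audio seconds transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def median_speed_factor(audio_seconds, processing_times):
    """Median speed factor over several benchmark trials of the same audio file."""
    return median(speed_factor(audio_seconds, t) for t in processing_times)


# A 10-minute (600 s) file transcribed in three hypothetical trials.
reported = median_speed_factor(600, [6.0, 12.0, 10.0])  # factors: 100, 50, 60
```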

Price of Transcription

USD per 1000 minutes of audio

Summary of Key Metrics & Further Information

Provider	Model	Whisper version	Word Error Rate (%)	Median Speed Factor	Price (USD per 1000 minutes)
Alibaba Cloud	Qwen3.5 Omni Flash		13.6%	94.3	0.00
Alibaba Cloud	Qwen3.5 Omni Plus		3.7%	100.0	0.00
Google	Gemini 3.1 Pro Preview (High)		2.9%	5.9	18.15
Google	Gemini 3.1 Pro Preview (Low)		3.8%	6.8	7.72
Google	Gemini 3 Flash (High)		3.1%	16.7	13.70
Google	Gemini 3 Pro (High)		2.9%	6.2	18.40
Google	Gemini 2.5 Flash Lite		5.3%	67.7	6.56
Google	Gemini 2.5 Flash		5.2%	69.4	6.66
Google	Gemini 2.5 Pro		3.0%	12.4	11.39
Google	Gemini 2.0 Flash Lite		3.9%	53.1	0.19
Google	Gemini 2.0 Flash		3.9%	52.8	1.40
Google	Gemini 3.1 Flash-Lite Preview (Minimal)		3.5%	75.0	5.83
Smallest.ai	Pulse STT		4.5%	144.6	5.00
Mistral	Voxtral Mini Transcribe 2		3.7%	76.8	3.00
Mistral	Voxtral Mini Transcribe		3.7%	52.8	1.00
Mistral	Voxtral Small		2.9%	64.7	4.00
DeepInfra	Voxtral Mini		4.0%	79.8	1.00
AssemblyAI	Universal-3 Pro		3.3%	100.0	3.50
AssemblyAI	Universal, AssemblyAI		3.9%	98.9	2.50
Soniox	Soniox V4		4.0%	24.4	1.66
ElevenLabs	Scribe v2		2.2%	31.1	6.67
ElevenLabs	Scribe v1		3.1%	38.6	6.67
Amazon Bedrock	Nova 2 Pro		5.0%	24.0	3.10
Gradium	Gradium Speech-to-Text		8.5%	2.3	13.00
Together.ai	Parakeet TDT 0.6B V3, Togetherai		4.6%	905.1	1.50
Replicate	Canary Qwen 2.5B, NVIDIA		4.3%	5.5	0.74
NVIDIA	Parakeet TDT 0.6B V2, NVIDIA		6.5%	93.8	0.00
Replicate	Parakeet RNNT 1.1B		5.5%	5.4	1.91
Gladia	Solaria-1, Gladia		4.2%	53.2	4.07
OpenAI	GPT-4o Transcribe		4.1%	31.2	6.00
OpenAI	GPT-4o Mini Transcribe		4.6%	52.2	3.00
Deepgram	Nova-3		5.3%	116.2	4.30
Deepgram	Nova-2		5.4%	579.5	4.30
Deepgram	Base		10.8%	621.9	12.50
Groq	Whisper Large v3 Turbo	v3 Turbo	4.8%	204.1	0.67
Fireworks	Whisper Large v3 Turbo	v3 Turbo	4.8%	287.3	1.00
fal.ai	Wizper Large v3	large-v3	4.8%	185.8	0.50
Replicate	Incredibly Fast Whisper	large-v3	5.8%	50.9	1.49
Replicate	Whisper Large v3	large-v3	10.2%	2.8	4.23
fal.ai	Whisper Large v3	large-v3	4.2%	53.7	1.15
Fireworks	Whisper Large v3	large-v3	4.7%	128.1	1.00
Together.ai	Whisper Large v3	large-v3	4.7%	308.5	1.50
OpenAI	Whisper Large v2	large-v2	4.2%	28.5	6.00
Amazon Bedrock	Amazon Transcribe		4.2%	18.3	24.00
Speechmatics	Speechmatics Standard		5.1%	71.6	4.00
Speechmatics	Speechmatics Enhanced		4.1%	68.1	6.70
Rev AI	Rev AI		6.0%	12.7	20.00
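
One way to read the table is to look for entries that no other entry beats on both accuracy and price. The sketch below applies that idea to a few rows copied from the table above. `pareto_front` is a hypothetical helper and not part of the Artificial Analysis methodology, and the row selection is arbitrary.

```python
def pareto_front(rows):
    """Return the rows not dominated on (WER, price); lower is better for both.

    Row a dominates row b if a is no worse on both metrics and strictly
    better on at least one.
    """
    front = []
    for name, wer, price in rows:
        dominated = any(
            other_wer <= wer and other_price <= price
            and (other_wer < wer or other_price < price)
            for _, other_wer, other_price in rows
        )
        if not dominated:
            front.append((name, wer, price))
    # Sort from most accurate to cheapest for readability.
    return sorted(front, key=lambda row: row[1])


# (model, WER %, USD per 1000 minutes), copied from the table above.
rows = [
    ("Scribe v2", 2.2, 6.67),
    ("Voxtral Small", 2.9, 4.00),
    ("Voxtral Mini Transcribe", 3.7, 1.00),
    ("Gemini 2.0 Flash Lite", 3.9, 0.19),
    ("GPT-4o Transcribe", 4.1, 6.00),
    ("Nova-3", 5.3, 4.30),
]

front_names = [name for name, _, _ in pareto_front(rows)]
```

Within this subset, GPT-4o Transcribe and Nova-3 drop out because Voxtral Small has both a lower WER and a lower price; adding speed as a third metric could change which entries survive.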

Frequently Asked Questions

Which is the most accurate speech to text model?

Fun-Realtime-ASR-preview leads with the lowest AA-WER (Artificial Analysis Word Error Rate) of 1.8% across 51 models evaluated.

What are the top speech to text models?

The top speech to text models by accuracy (AA-WER) are: 1. Fun-Realtime-ASR-preview (1.8%), 2. Scribe v2, ElevenLabs (2.2%), 3. Gemini 3 Pro (High), Google (2.9%), 4. Voxtral Small, Mistral (2.9%), 5. Gemini 3.1 Pro Preview (High) (2.9%). Lower AA-WER indicates better transcription accuracy.

Which is the fastest speech to text model?

Parakeet TDT 0.6B V3, Togetherai is the fastest with a speed factor of 905.1x real-time, followed by Base (621.9x) and Nova-2 (579.5x). Higher speed factors mean faster transcription.

Which is the cheapest speech to text model?

Gemini 2.0 Flash Lite is the most affordable at $0.19 per 1,000 minutes, followed by Wizper (L, v3), fal.ai ($0.50) and Whisper (L, v3, Turbo), Groq ($0.667).

Which is the best open weights speech to text model?

Voxtral Small, Mistral is the most accurate open weights model with an AA-WER of 2.9%. There are 11 open weights models out of 51 total evaluated.

What are the top open weights speech to text models?

The top open weights speech to text models by accuracy are: 1. Voxtral Small, Mistral (AA-WER 2.9%), 2. Parakeet TDT 0.6B V3, NVIDIA (AA-WER 4.2%), 3. Whisper Large v2, OpenAI (AA-WER 4.2%).

How do I choose the best speech to text model?

The best model depends on your priorities. Use the scatter plots to visualize trade-offs between accuracy (AA-WER), speed, and price. For applications requiring high accuracy, prioritize models with lower AA-WER scores. For real-time applications, focus on speed factor. For cost-sensitive workloads, compare the price charts.

Speech to Text models & providers compared: Qwen3.5 Omni Flash, Qwen3.5 Omni Plus, Gemini 3.1 Pro Preview (High), Gemini 3.1 Pro Preview (Low), Pulse STT, Voxtral Mini Transcribe 2, Universal-3 Pro, Soniox V4, Scribe v2, Gemini 3 Flash (High), Nova 2 Pro, Gradium Speech-to-Text, Gemini 3 Pro (High), Parakeet TDT 0.6B V3, Togetherai, Gemini 2.5 Flash Lite, Canary Qwen 2.5B, Replicate, Voxtral Mini Transcribe, Voxtral Small, Voxtral Mini, Deepinfra, Gemini 2.5 Flash, Gemini 2.5 Pro, Parakeet TDT 0.6B V2, Solaria-1, GPT-4o Transcribe, GPT-4o Mini Transcribe, Scribe v1, Gemini 2.0 Flash Lite, Nova-3, Gemini 2.0 Flash, Universal, Whisper (L, v3, Turbo), Groq, Whisper (L, v3, Turbo), Fireworks, Parakeet RNNT 1.1B, Replicate, Amazon Transcribe, Wizper (L, v3), fal.ai, Incredibly Fast Whisper, Replicate, Whisper (L, v3), Replicate, Whisper (L, v3), fal.ai, Whisper (L, v3), Fireworks, Whisper Large v3, together.ai, Nova-2, Standard, Enhanced, Whisper Large v2, Base, Rev AI, Gemini 3.1 Flash-Lite Preview (Minimal).