Speech to Text AI Model & Provider Leaderboard

Compare word error rate, speed, and pricing across Speech to Text models and providers.

For further details, see our methodology page.

Highlights

WER Index (Non-streaming)

AA-WER v2 · % of words transcribed incorrectly · Lower is better

Speed Factor

Input audio seconds transcribed per second · Higher is better

Price

USD per 1000 minutes of audio · Lower is better

Artificial Analysis Word Error Rate Index (Non-streaming)

% of words transcribed incorrectly · Lower is better · AA-WER v2 incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), Earnings22-Cleaned-AA (25%)

Note: For Earnings22, if a model cannot reliably handle full-length audio due to time limits, we chunk to ~9 minutes (relevant to: GPT-4o Mini Transcribe, OpenAI; Nova 2 Pro, Amazon; GPT-4o Transcribe, OpenAI). For models with even shorter time limits, we chunk to ~30 seconds (relevant to: Canary Qwen 2.5B, NVIDIA; Qwen3 ASR Flash, Alibaba).
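As a rough illustration of the chunking described in the note, the sketch below splits a recording of known duration into consecutive segments no longer than a target length. The function name `chunk_boundaries` and the fixed-length, non-overlapping split are assumptions made for this example; the note does not say how chunk boundaries are actually chosen (for instance, whether they are aligned to pauses in speech).

```python
def chunk_boundaries(duration_s: float, max_chunk_s: float) -> list[tuple[float, float]]:
    """Return consecutive (start, end) times in seconds covering [0, duration_s].

    Hypothetical helper: fixed-length chunks, no overlap, last chunk may be shorter.
    """
    duration_s = float(duration_s)
    max_chunk_s = float(max_chunk_s)
    if duration_s < 0 or max_chunk_s <= 0:
        raise ValueError("duration must be >= 0 and chunk length must be > 0")
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


# A hypothetical 20-minute recording split into ~9-minute chunks.
long_chunks = chunk_boundaries(20 * 60, 9 * 60)

# The same idea with ~30-second chunks on a hypothetical 65-second clip.
short_chunks = chunk_boundaries(65, 30)
```

Per-chunk transcripts would then be concatenated before scoring; how errors at chunk boundaries are handled is not described in the source.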

Measures transcription accuracy across 3 datasets to evaluate models in real-world speech with diverse accents, domain-specific language, and challenging channel & acoustic conditions.

AA-WER is calculated as an audio-duration-weighted average of WER across ~8 hours from three datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), and Earnings22-Cleaned-AA (25%). See methodology for more detail.
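The weighting can be sketched as follows, assuming the stated percentages are applied directly as weights to each dataset's WER. The function name `aa_wer_index` and the per-dataset WER values in the example are hypothetical and only show the arithmetic; the methodology page remains the authority on how the index is actually computed.

```python
# Dataset weights as stated for AA-WER v2.
WEIGHTS = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli-Cleaned-AA": 0.25,
    "Earnings22-Cleaned-AA": 0.25,
}


def aa_wer_index(per_dataset_wer: dict[str, float]) -> float:
    """Weighted average of per-dataset WER values (in percent)."""
    missing = WEIGHTS.keys() - per_dataset_wer.keys()
    if missing:
        raise KeyError(f"missing WER for datasets: {sorted(missing)}")
    return sum(WEIGHTS[name] * per_dataset_wer[name] for name in WEIGHTS)


# Made-up per-dataset WER values, for illustration only.
example_wer = {
    "AA-AgentTalk": 2.0,
    "VoxPopuli-Cleaned-AA": 4.0,
    "Earnings22-Cleaned-AA": 6.0,
}
index = aa_wer_index(example_wer)  # 0.5 * 2.0 + 0.25 * 4.0 + 0.25 * 6.0
```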

AA-WER (Non-streaming) by Dataset

AA-WER (Non-streaming): AA-AgentTalk Dataset

% of words transcribed incorrectly on the AA-AgentTalk dataset · Lower is better

Cleaned Dataset Comparison

VoxPopuli: Cleaned vs Original Subset of Publicly Available Data

% WER (word error rate) · Lower is better

Note: The cleaned versions remove transcription errors from the reference text, providing a more accurate ground truth for model evaluation.
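To see why errors in the reference text matter, the sketch below computes a word error rate with a standard word-level edit distance and scores the same model output against an uncleaned and a cleaned reference. The lowercasing and whitespace tokenization are simplifying assumptions, the sentences are made up, and this is not a description of the text normalization used for AA-WER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing from output)
                curr[j - 1] + 1,     # insertion (extra word in output)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: the model output is correct, but the original reference is not.
hypothesis = "revenue grew eight percent this quarter"
original_ref = "revenue grew eight per cent this quater"
cleaned_ref = "revenue grew eight percent this quarter"

wer_original = word_error_rate(original_ref, hypothesis)
wer_cleaned = word_error_rate(cleaned_ref, hypothesis)
```

In this toy case the model is penalized for three edits against the uncleaned reference and none against the cleaned one, even though its output is identical in both comparisons.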

API Benchmarks

Artificial Analysis Word Error Rate Index (Non-streaming) vs. Price

% of words transcribed incorrectly · Lower is better · AA-WER v2 incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), Earnings22-Cleaned-AA (25%) · USD per 1000 minutes of audio

Most attractive quadrant

Speed Factor

Input audio seconds transcribed per second · Higher is better

Audio file seconds transcribed per second of processing time. A higher factor indicates faster transcription. Reported Speed Factor values are medians across benchmark trials from the last 7 days; points on the over-time chart are daily medians. Artificial Analysis measurements are based on an audio duration of 10 minutes. Speed Factor may vary for other durations, particularly very short durations under 1 minute.
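A minimal sketch of this definition, assuming each trial records the wall-clock processing time for the 10-minute benchmark file. The trial timings below are hypothetical, and `speed_factor` is a name chosen for this example.

```python
from statistics import median


def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of input audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing time must be positive")
    return audio_seconds / processing_seconds


AUDIO_SECONDS = 10 * 60  # benchmark audio duration stated above

# Hypothetical processing times (in seconds) from repeated trials.
trial_times = [5.0, 6.0, 4.0, 8.0, 5.0]

factors = [speed_factor(AUDIO_SECONDS, t) for t in trial_times]
reported = median(factors)  # the median dampens the effect of outlier trials
```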

Price of Transcription

USD per 1000 minutes of audio

Estimated cost in USD to transcribe 1,000 minutes of audio, normalized across providers with different billing models and including billed reasoning tokens where available. See the methodology page for further detail.
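For providers that bill by audio duration, the normalization reduces to a unit conversion, sketched below. The function name and the example rates are hypothetical. Token-based billing (including reasoning tokens) needs model-specific token counts and is not covered by this sketch.

```python
def usd_per_1000_minutes(rate_usd: float, unit_seconds: float) -> float:
    """Convert a price per billing unit of audio into USD per 1,000 minutes.

    rate_usd: price charged for one billing unit.
    unit_seconds: length of audio covered by one billing unit, in seconds.
    """
    if unit_seconds <= 0:
        raise ValueError("billing unit must be a positive duration")
    seconds_in_1000_minutes = 1000 * 60
    return rate_usd * seconds_in_1000_minutes / unit_seconds


# Hypothetical rates expressed in different billing units.
from_per_minute = usd_per_1000_minutes(0.006, 60)    # $0.006 per minute
from_per_hour = usd_per_1000_minutes(0.36, 3600)     # $0.36 per hour
from_per_second = usd_per_1000_minutes(0.0001, 1)    # $0.0001 per second
```

All three hypothetical rates work out to about $6.00 per 1,000 minutes, which is what makes prices quoted in different units directly comparable.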

Summary of Key Metrics & Further Information

Model	Provider	Whisper Version	AA-WER	Speed Factor	Price (USD per 1000 minutes)	Further Details
Qwen3.5 Omni Flash	Alibaba Cloud		13.5%	79.2	0.00	Details
Qwen3.5 Omni Plus	Alibaba Cloud		3.5%	96.6	0.00	Details
Nova 2 Pro	Amazon Bedrock		4.9%	22.7	3.10	Details
Amazon Transcribe	Amazon Bedrock		4.1%	14.3	6.00	Details
Universal-3 Pro	AssemblyAI		3.1%	112.8	3.50	Details
Universal, AssemblyAI	AssemblyAI		3.8%	124.8	2.50	Details
MAI-Transcribe-1.5	Microsoft Azure		2.4%	204.0	6.00	Details
MAI-Transcribe-1	Microsoft Azure		2.6%	67.3	6.00	Details
transcribe-03-2026	Cohere		4.6%	126.3	0.00	Details
Nova-3	Deepgram		5.2%	358.3	4.30	Details
Scribe v2	ElevenLabs		2.2%	44.6	3.67	Details
Solaria-1, Gladia	Gladia		4.1%	81.3	10.17	Details
Solaria-3, Gladia	Gladia		3.2%	61.8	10.16	Details
Gemini 3.1 Pro Preview (High)	Google		2.8%	7.5	18.15	Details
Gemini 3.1 Pro Preview (Low)	Google		3.6%	7.2	7.72	Details
Gemini 3 Flash (High)	Google		2.9%	17.6	13.70	Details
Gemini 2.5 Flash Lite	Google		5.2%	64.0	6.56	Details
Gemini 2.5 Flash	Google		5.1%	75.5	6.66	Details
Gemini 2.5 Pro	Google		2.9%	12.4	11.39	Details
Gemini 3.1 Flash-Lite Preview (Minimal)	Google		3.4%	76.0	5.83	Details
Gradium Speech-to-Text	Gradium		6.8%	2.2	13.00	Details
Grok Speech to Text, SpaceXAI	SpaceXAI		4.0%	123.0	1.67	Details
Voxtral Mini Transcribe 2	Mistral		3.6%	76.8	3.00	Details
Voxtral Small	Mistral		2.8%	66.2	4.00	Details
Voxtral Mini	DeepInfra		3.8%	79.9	1.00	Details
Modulate STT Batch English VFast	Modulate		4.2%	68.6	0.42	Details
Parakeet TDT 0.6B V3, Togetherai	Together AI		4.5%	885.0	1.50	Details
Canary Qwen 2.5B, NVIDIA	Replicate		4.3%	5.7	0.74	Details
Parakeet TDT 0.6B V2, NVIDIA	NVIDIA		6.4%	88.0	0.00	Details
Parakeet RNNT 1.1B	Replicate		5.4%	6.4	1.91	Details
GPT Transcribe, OpenAI	OpenAI		3.3%	33.6	4.50	Details
GPT-4o Transcribe	OpenAI		4.0%	36.7	6.00	Details
GPT-4o Mini Transcribe	OpenAI		4.5%	40.8	3.00	Details
Smallest AI Pulse Pro	Smallest.ai		2.4%	251.8	4.00	Details
Resonant-1	Reson8		3.4%	334.8	3.60	Details
Rev AI	Rev AI		5.9%	12.1	3.33	Details
Smallest AI Pulse	Smallest.ai		4.4%	231.2	5.00	Details
Soniox v5 Async	Soniox		3.8%	20.7	1.66	Details
Soniox V4	Soniox		3.9%	18.8	1.66	Details
Speechmatics Melia	Speechmatics		4.9%	73.7	4.00	Details
Speechmatics Standard	Speechmatics		5.1%	38.6	7.50	Details
Speechmatics Enhanced	Speechmatics		4.0%	32.8	12.50	Details
Whisper Large v3 Turbo	Groq	v3 Turbo	4.6%	106.6	0.67	Details
Wizper Large v3	fal.ai	large-v3	4.7%	234.0	0.50	Details
Incredibly Fast Whisper	Replicate	large-v3	5.7%	57.6	1.49	Details
Whisper Large v3	Replicate	large-v3	10.1%	2.8	4.23	Details
Whisper Large v3	fal.ai	large-v3	4.1%	50.5	1.15	Details
Whisper Large v3	Together AI	large-v3	4.5%	491.3	1.50	Details
Whisper Large v2	OpenAI	large-v2	4.1%	27.6	6.00	Details

Frequently Asked Questions

Which is the most accurate speech to text model?

Fun-Realtime-ASR-preview leads with the lowest AA-WER (Artificial Analysis Word Error Rate) of 1.7% across 55 models evaluated.

What are the top speech to text models?

The top speech to text models by accuracy (AA-WER) are: 1. Fun-Realtime-ASR-preview (1.7%), 2. Scribe v2, ElevenLabs (2.2%), 3. MAI-Transcribe-1.5 (2.4%), 4. Smallest AI Pulse Pro (2.4%), 5. MAI-Transcribe-1 (2.6%). Lower AA-WER indicates better transcription accuracy.

Which is the fastest speech to text model?

Parakeet TDT 0.6B V3, Togetherai is the fastest with a speed factor of 885.0x real-time, followed by Whisper Large v3, together.ai (491.3x) and Nova-3 (358.3x). Higher speed factors mean faster transcription.

Which is the cheapest speech to text model?

Modulate STT Batch English VFast is the most affordable at $0.417 per 1,000 minutes, followed by Wizper (L, v3), fal.ai ($0.50) and Whisper (L, v3, Turbo), Groq ($0.667).

Which is the best open weights speech to text model?

Voxtral Small, Mistral is the most accurate open weights model with an AA-WER of 2.8%. There are 12 open weights models out of 55 total evaluated.

What are the top open weights speech to text models?

The top open weights speech to text models by accuracy are: 1. Voxtral Small, Mistral (AA-WER 2.8%), 2. Inkling (256K), Thinking Machines (AA-WER 3.5%), 3. Voxtral Mini Transcribe 2, Mistral (AA-WER 3.6%).

How do I choose the best speech to text model?

The best model depends on your priorities. Use the scatter plots to visualize trade-offs between accuracy (AA-WER), speed, and price. For applications requiring high accuracy, prioritize models with lower AA-WER scores. For real-time applications, focus on speed factor. For cost-sensitive workloads, compare the price charts.
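One way to apply these priorities is to filter the summary table by hard constraints and then rank what remains by accuracy. The sketch below does this for a handful of rows copied from the summary table above; the `Entry` class, the `shortlist` function, and the thresholds are hypothetical choices for illustration, not a recommended selection procedure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    model: str
    provider: str
    wer_pct: float       # AA-WER in percent, lower is better
    speed_factor: float  # audio seconds per processing second, higher is better
    price_usd: float     # USD per 1000 minutes, lower is better


# A few rows copied from the summary table.
ENTRIES = [
    Entry("Scribe v2", "ElevenLabs", 2.2, 44.6, 3.67),
    Entry("MAI-Transcribe-1.5", "Microsoft Azure", 2.4, 204.0, 6.00),
    Entry("Smallest AI Pulse Pro", "Smallest.ai", 2.4, 251.8, 4.00),
    Entry("Nova-3", "Deepgram", 5.2, 358.3, 4.30),
    Entry("Modulate STT Batch English VFast", "Modulate", 4.2, 68.6, 0.42),
    Entry("Whisper Large v3", "Together AI", 4.5, 491.3, 1.50),
]


def shortlist(
    entries: list[Entry],
    max_wer_pct: Optional[float] = None,
    min_speed_factor: Optional[float] = None,
    max_price_usd: Optional[float] = None,
) -> list[Entry]:
    """Keep entries meeting every given constraint, most accurate first."""
    kept = [
        e for e in entries
        if (max_wer_pct is None or e.wer_pct <= max_wer_pct)
        and (min_speed_factor is None or e.speed_factor >= min_speed_factor)
        and (max_price_usd is None or e.price_usd <= max_price_usd)
    ]
    # Rank by accuracy, breaking ties by price.
    return sorted(kept, key=lambda e: (e.wer_pct, e.price_usd))


# Example: a speed- and cost-sensitive workload.
fast_and_cheap = shortlist(ENTRIES, min_speed_factor=200, max_price_usd=5.0)
fast_and_cheap_names = [e.model for e in fast_and_cheap]
```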

Speech to Text models & providers compared: Melia, Soniox v5 Async, Solaria-3, MAI-Transcribe-1.5, MAI-Transcribe-1, Grok Speech to Text, SpaceXAI, Qwen3.5 Omni Flash, Qwen3.5 Omni Plus, Cohere Transcribe 03-2026, Gemini 3.1 Pro Preview (High), Gemini 3.1 Pro Preview (Low), Smallest AI Pulse, Voxtral Mini Transcribe 2, Universal-3 Pro, Soniox V4, Scribe v2, Smallest AI Pulse Pro, Gemini 3 Flash (High), Nova 2 Pro, Gradium Speech-to-Text, Parakeet TDT 0.6B V3, Togetherai, Gemini 2.5 Flash Lite, Canary Qwen 2.5B, Replicate, Voxtral Small, Voxtral Mini, Deepinfra, Gemini 2.5 Flash, Gemini 2.5 Pro, Parakeet TDT 0.6B V2, Solaria-1, GPT-4o Transcribe, GPT-4o Mini Transcribe, Nova-3, Universal, Whisper (L, v3, Turbo), Groq, Parakeet RNNT 1.1B, Replicate, Amazon Transcribe, Wizper (L, v3), fal.ai, Incredibly Fast, Replicate, Whisper (L, v3), Replicate, Whisper (L, v3), fal.ai, Whisper Large v3, together.ai, Standard, Enhanced, Large v2, Rev AI, GPT Transcribe, Gemini 3.1 Flash-Lite Preview (Minimal), Modulate STT Batch English VFast, Resonant-1.
