Speech to Text AI Model & Provider Leaderboard

Compare word error rate, speed, and pricing across Speech to Text models and providers. Our comprehensive analysis helps you choose the best Speech to Text model for your specific use case and requirements.

For further details, see our methodology page.

Coming soon: Voice Agents Report

Highlights

Word Error Rate Index
% of words transcribed incorrectly; Lower is better
Speed Factor
Input audio seconds transcribed per second; Higher is better
Price
USD per 1000 minutes of audio; Lower is better

Artificial Analysis Word Error Rate (AA-WER) Index by Model

% of words transcribed incorrectly; Lower is better
Note: Models that do not support transcription of audio longer than 10 minutes were evaluated on 9-minute chunks of the test set (applies to GPT-4o Transcribe, GPT-4o Mini Transcribe, Voxtral Mini, Voxtral Mini (Deepinfra), and Gemini 2.5 Flash Lite). For models with even shorter time limits, all files are split into 30-second chunks.

Measures transcription accuracy across 3 datasets to evaluate models in real-world speech with diverse accents, domain-specific language, and challenging channel & acoustic conditions.

AA-WER is calculated as an audio-duration-weighted average of WER across ~2 hours from three datasets: VoxPopuli, Earnings-22, and AMI-SDM. See methodology for more detail.
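The duration-weighted aggregation described above can be sketched as follows. This is an illustrative reconstruction, not Artificial Analysis's actual implementation, and the per-dataset figures below are hypothetical placeholders, not real benchmark results:

```python
# Sketch of a duration-weighted WER aggregate (AA-WER-style).
# Each dataset contributes its WER weighted by its share of total audio time.

def weighted_wer(results):
    """results: list of (wer, audio_seconds) tuples, one per dataset."""
    total_seconds = sum(seconds for _, seconds in results)
    return sum(wer * seconds for wer, seconds in results) / total_seconds

# Hypothetical per-dataset numbers for illustration only:
datasets = [
    (0.08, 2400),  # VoxPopuli: 8% WER over 40 min of audio
    (0.12, 2400),  # Earnings-22: 12% WER over 40 min
    (0.20, 2400),  # AMI-SDM: 20% WER over 40 min
]
print(f"AA-WER-style index: {weighted_wer(datasets):.1%}")  # -> 13.3%
```

With equal durations the weighted average reduces to a simple mean; datasets with more audio would pull the index toward their WER.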

Compared to Whisper, GPT-4o Transcribe smooths transcripts, which results in lower word-for-word accuracy, especially on less structured speech (e.g., meetings, earnings calls).

API Benchmarks

Artificial Analysis Word Error Rate Index vs. Price

% of words transcribed incorrectly, USD per 1000 minutes of audio
Most attractive quadrant: low word error rate and low price. Providers shown: Amazon Bedrock, AssemblyAI, Deepgram, ElevenLabs, Fireworks, Google, Groq, Hathora, Mistral, OpenAI, Replicate.


Cost in USD per 1000 minutes of audio transcribed. Reflects the pricing model of the transcription service or software.

Speed Factor

Input audio seconds transcribed per second; Higher is better

Audio file seconds transcribed per second of processing time. Higher factor indicates faster transcription speed.

Artificial Analysis measurements are based on an audio duration of 10 minutes. Speed Factor may vary for other durations, particularly for very short durations (under 1 minute).
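The Speed Factor definition above is a simple ratio, sketched here for clarity (the numbers in the example are illustrative, not measured results):

```python
# Speed Factor = seconds of input audio transcribed per second of processing time.
def speed_factor(audio_seconds, processing_seconds):
    return audio_seconds / processing_seconds

# A 10-minute (600 s) file transcribed in 15 s yields a factor of 40:
print(speed_factor(600, 15))  # -> 40.0
```

A factor of 40 means the service transcribes 40 seconds of audio for every second of processing; a factor below 1 would be slower than real time.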

Price of Transcription

Cost in USD per 1000 minutes of audio transcribed. Reflects the pricing model of the transcription service or software.

For providers that price based on processing time rather than audio duration (incl. Replicate, fal), we have calculated an indicative per-minute price based on the processing time expected per minute of audio. Further detail is available on the methodology page.
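The conversion from processing-time pricing to an indicative audio-duration price can be sketched as follows. This is an illustrative reconstruction of the idea, not the methodology page's exact formula, and the rate and speed factor below are hypothetical:

```python
# For processing-time-priced providers, the expected processing time per
# minute of audio is 60 / speed_factor seconds, so:
# price per 1000 audio minutes = rate_per_proc_second * (60 / speed_factor) * 1000

def indicative_price_per_1000_min(rate_per_proc_second, speed_factor):
    proc_seconds_per_audio_minute = 60 / speed_factor
    return rate_per_proc_second * proc_seconds_per_audio_minute * 1000

# Hypothetical: $0.001 per processing second at a 40x speed factor
print(indicative_price_per_1000_min(0.001, 40))  # -> 1.5 (USD per 1000 min)
```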

Note: Groq charges for a minimum of 10 seconds per request.

Summary of Key Metrics & Further Information
Model | Provider
Whisper Large v2 | OpenAI
Whisper Large v2 | Microsoft Azure
Wizper Large v3 | fal.ai
Incredibly Fast Whisper | Replicate
Whisper Large v2 | Replicate
Whisper Large v3 | Replicate
WhisperX | Replicate
Whisper Large v3 | Groq
Whisper Large v3 | DeepInfra
Whisper Large v3 | fal.ai
Whisper Large v3 Turbo | Groq
Whisper Large v3 | Fireworks
Whisper Large v3 Turbo | Fireworks
Whisper Large v3 | SambaNova
Whisper Large v3 | Together.ai
Speechmatics Standard | Speechmatics
Speechmatics Enhanced | Speechmatics
Azure Realtime Speech to Text | Microsoft Azure
Nova-2 | Deepgram
Base | Deepgram
Nova-3 | Deepgram
Universal | AssemblyAI
Slam-1 | AssemblyAI
Amazon Transcribe | Amazon Bedrock
Chirp | Google
Chirp 2 | Google
Chirp 3 | Google
Scribe | ElevenLabs
Scribe v2 | ElevenLabs
Gemini 2.0 Flash | Google
Gemini 2.0 Flash Lite | Google
Gemini 2.5 Flash Lite | Google
Gemini 2.5 Flash | Google
Gemini 2.5 Pro | Google
GPT-4o Transcribe | OpenAI
GPT-4o Mini Transcribe | OpenAI
Parakeet RNNT 1.1B | Replicate
Parakeet TDT 0.6B V2 | NVIDIA
Canary Qwen 2.5B | Replicate
Parakeet TDT 0.6B V3 | Hathora
Voxtral Mini | Mistral
Voxtral Small | Mistral
Voxtral Small | DeepInfra
Voxtral Mini | DeepInfra
Solaria-1 | Gladia
Nova 2 Omni | Amazon Bedrock
Nova 2 Pro | Amazon Bedrock
