Speech to Text AI Model & Provider Leaderboard

Compare word error rate, speed, and pricing across Speech to Text models and providers.

For further details, see our methodology page.

Highlights

% of words transcribed incorrectly · Lower is better · AA-WER Streaming incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli (25%), Earnings22 (25%)
Seconds to final transcript after speech end · weighted average of samples in AA-WER Streaming · Lower is better
USD per 1000 minutes of audio · Lower is better

AA-WER Streaming Index vs. Time to Final Transcription

Chart: % of words transcribed incorrectly at Final Transcription after detected End of Speech vs. Seconds to Final Transcription after End of Speech. The most attractive quadrant combines low WER with low latency.
Providers shown: AssemblyAI, Cartesia, Deepgram, ElevenLabs, Google, Inworld, Microsoft Azure, Mistral, NVIDIA, OpenAI, Smallest.ai, Soniox.

Measures transcription accuracy of models where audio is streamed in realtime, chunk by chunk, as opposed to batch transcription, where the full audio file is submitted all at once.
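
For readers unfamiliar with the metric, the sketch below shows the standard word error rate calculation: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The text normalization used by the leaderboard is not described on this page, so the lowercasing and whitespace splitting here are assumptions, and the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    # Assumption: simple lowercasing and whitespace tokenization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Multiplying the result by 100 gives the percentage form used in the charts.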

AA-WER Streaming consists of around 8 hours of audio from three datasets: AA-AgentTalk (50%), VoxPopuli (25%), and Earnings22 (25%). The datasets cover real-world speech with diverse accents, domain-specific language, and challenging acoustic conditions.

AA-WER Streaming Index is a dataset-weighted average consistent with our offline STT benchmark: AA-AgentTalk 50% / VoxPopuli 25% / Earnings22 25%. WER is audio-duration-weighted within each dataset; Time to Final and Time to First Partial are simple averages within each dataset, then dataset-weighted 50% / 25% / 25% overall.
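
The two-level weighting can be sketched as follows. This is an illustration under assumptions, not the benchmark's actual code: it reads "audio-duration-weighted" as weighting each sample's WER by its audio duration, and the `Sample` type and `streaming_index` function are hypothetical names.

```python
from dataclasses import dataclass

# Dataset weights as stated in the text.
DATASET_WEIGHTS = {"AA-AgentTalk": 0.50, "VoxPopuli": 0.25, "Earnings22": 0.25}


@dataclass
class Sample:
    dataset: str
    duration_s: float       # audio length in seconds
    wer: float              # per-sample WER as a fraction
    time_to_final_s: float  # per-sample latency in seconds


def streaming_index(samples: list[Sample]) -> tuple[float, float]:
    """Return (WER index, Time to Final index) across the three datasets."""
    wer_index = 0.0
    latency_index = 0.0
    for dataset, weight in DATASET_WEIGHTS.items():
        group = [s for s in samples if s.dataset == dataset]
        if not group:
            raise ValueError(f"no samples for dataset {dataset}")
        # WER: weighted by audio duration within the dataset (assumed reading).
        total_duration = sum(s.duration_s for s in group)
        dataset_wer = sum(s.wer * s.duration_s for s in group) / total_duration
        # Latency: simple (unweighted) average within the dataset.
        dataset_latency = sum(s.time_to_final_s for s in group) / len(group)
        # Both are then combined with the 50% / 25% / 25% dataset weights.
        wer_index += weight * dataset_wer
        latency_index += weight * dataset_latency
    return wer_index, latency_index
```

Because the dataset weights are fixed, a dataset with more audio does not dominate the index; only the within-dataset WER is sensitive to sample length.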

The Time to Final Transcription timer starts at the SileroVAD-detected end of speech. For models that support forced endpointing, we send the endpoint request at this point and stop the timer on the next final transcript from the model. If the model already emitted its last final before SileroVAD fired and no more finals arrive afterwards, we use that last final.

For models that do not support forced endpointing, we use the first natural final within 2 seconds of speech end. If no final arrives within 2 seconds, we use the first partial after 2 seconds, or the latest partial before 2 seconds if nothing arrives after. WER is computed on the combined previous finals plus the selected final or partial transcript.
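
The selection rules above can be sketched as a small function. This is a hypothetical illustration: the `Event` type, the `select_final` name, and the convention of expressing arrival times as offsets from the detected speech end are assumptions, as is the reading that a "natural final within 2 seconds" means one arriving at or after speech end.

```python
from typing import NamedTuple, Optional, Sequence


class Event(NamedTuple):
    offset_s: float  # arrival time minus the VAD-detected end of speech
    kind: str        # "final" or "partial"
    text: str


WINDOW_S = 2.0  # the 2-second window described in the text


def select_final(events: Sequence[Event], forced_endpointing: bool) -> Optional[Event]:
    """Pick the event that stops the Time to Final timer, or None if there is none."""
    ordered = sorted(events, key=lambda e: e.offset_s)
    finals = [e for e in ordered if e.kind == "final"]
    partials = [e for e in ordered if e.kind == "partial"]

    if forced_endpointing:
        # Next final after the endpoint request sent at speech end.
        after = [e for e in finals if e.offset_s >= 0]
        if after:
            return after[0]
        # Otherwise fall back to the last final emitted before speech end.
        return finals[-1] if finals else None

    # No forced endpointing: first natural final inside the window.
    in_window = [e for e in finals if 0 <= e.offset_s <= WINDOW_S]
    if in_window:
        return in_window[0]
    # Otherwise the first partial after the window...
    late = [e for e in partials if e.offset_s > WINDOW_S]
    if late:
        return late[0]
    # ...or the latest partial before it if nothing arrives afterwards.
    early = [e for e in partials if e.offset_s <= WINDOW_S]
    return early[-1] if early else None
```

WER would then be computed on the text of all earlier finals joined with the text of the selected event.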

AA-WER Streaming Index - Final Transcription

% of words transcribed incorrectly at Final Transcription after detected End of Speech

AA-WER Streaming - Final Transcription: AA-AgentTalk Dataset

% of words transcribed incorrectly at Final Transcription after detected End of Speech on AA-AgentTalk dataset; lower is better

AA-WER Streaming Index vs. Time to First Partial Transcription After Speech End

Chart: % of words transcribed incorrectly at First Partial Transcription after detected End of Speech vs. Seconds to First Partial Transcription After Speech End. The most attractive quadrant combines low WER with low latency.
Providers shown: AssemblyAI, Cartesia, Deepgram, ElevenLabs, Google, Inworld, Microsoft Azure, Mistral, NVIDIA, OpenAI, Smallest.ai, Soniox.

The Time to First Partial Transcription timer starts at the SileroVAD-detected end of speech and stops on the first transcript-bearing event after speech end, whether partial or final. If no transcript arrives after speech end, we use the latest transcript before speech end as the fallback snapshot.
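
A minimal sketch of this timer, assuming events are given as `(arrival_s, text)` pairs on the same clock as the detected speech end. The function name is hypothetical, and returning `None` as the latency in the fallback case is an assumption, since the text does not say what latency is recorded when no transcript arrives after speech end.

```python
from typing import Optional, Sequence, Tuple


def time_to_first_transcript(
    events: Sequence[Tuple[float, str]], speech_end_s: float
) -> Tuple[Optional[float], str]:
    """Return (latency in seconds, transcript snapshot) for one sample."""
    ordered = sorted(events, key=lambda e: e[0])
    # Stop on the first transcript-bearing event at or after speech end,
    # regardless of whether it is a partial or a final.
    for arrival_s, text in ordered:
        if arrival_s >= speech_end_s:
            return arrival_s - speech_end_s, text
    # Fallback snapshot: the latest transcript received before speech end.
    if ordered:
        return None, ordered[-1][1]
    return None, ""
```

WER for this metric is then scored on the returned snapshot rather than on the eventual final transcript.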

AA-WER Streaming Index - First Partial Transcription After Speech End

% of words transcribed incorrectly at First Partial Transcription after detected End of Speech

AA-WER Streaming - First Partial Transcription After Speech End: AA-AgentTalk Dataset

% of words transcribed incorrectly at First Partial Transcription after detected End of Speech on AA-AgentTalk dataset; lower is better

AA-WER Streaming - Final Transcription compared to First Partial Transcription After Speech End

% of words transcribed incorrectly · Lower is better · AA-WER Streaming incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli (25%), Earnings22 (25%)

Latency

Time to Final Transcription

Seconds to Final Transcription after End of Speech

Time to First Partial Transcription After Speech End

Seconds to First Partial Transcription After Speech End

Price

Price of Transcription

USD per 1000 minutes of audio

Estimated cost in USD to transcribe 1,000 minutes of audio, normalized across providers with different billing models and including billed reasoning tokens where available. Further detail is available on the methodology page.
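
As a rough illustration of the normalization, the sketch below converts a time-based list price into USD per 1,000 minutes. It covers only per-second, per-minute, and per-hour billing; token-based billing is not handled, and the function name and the example rates are hypothetical rather than any provider's actual pricing.

```python
SECONDS_PER_UNIT = {"second": 1, "minute": 60, "hour": 3600}


def usd_per_1000_minutes(rate_usd: float, per_unit: str) -> float:
    """Convert a price per second, minute, or hour of audio to USD per 1,000 minutes."""
    if per_unit not in SECONDS_PER_UNIT:
        raise ValueError(f"unsupported billing unit: {per_unit}")
    usd_per_second = rate_usd / SECONDS_PER_UNIT[per_unit]
    # 1,000 minutes of audio is 60,000 seconds.
    return usd_per_second * 60 * 1000


if __name__ == "__main__":
    # Hypothetical rates, for illustration only.
    print(usd_per_1000_minutes(0.36, "hour"))
    print(usd_per_1000_minutes(0.0043, "minute"))
```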

Speech to Text Streaming models compared: AssemblyAI U3 Realtime Pro, Cartesia Ink, ElevenLabs Scribe v2 Realtime, Gladia Solaria 1 Realtime, Deepgram Flux, Speechmatics Realtime Enhanced, Nemotron 3 ASR 80ms, Nemotron 3 ASR 160ms, Nemotron 3 ASR 560ms, Nemotron 3 ASR 1120ms, Soniox Realtime, Inworld STT 1 Realtime, OpenAI GPT Realtime, Chirp 3 Streaming, Voxtral Mini Transcribe Realtime, RevAI Streaming, Deepgram Nova-3 Realtime, Pulse STT Realtime, Gradium STT Realtime, Qwen3 ASR Flash Realtime, Amazon Transcribe Streaming, Azure STT Real-time Transcription, Cartesia Ink-2 (external endpoints), Ink-2 Turn Detection Eager End, Cartesia Ink-2 (semantic endpoints).