AA-WER Streaming Index vs. Time to Final Transcription

% of words transcribed incorrectly at Final Transcription after detected End of Speech vs. Seconds to Final Transcription after End of Speech

AA-WER Streaming measures the transcription accuracy of models when audio is streamed in real time, chunk by chunk, as opposed to batch transcription, where the full audio file is submitted at once.

AA-WER Streaming consists of around 8 hours of audio from three datasets: AA-AgentTalk (50%), VoxPopuli (25%), and Earnings22 (25%). The datasets cover real-world speech with diverse accents, domain-specific language, and challenging acoustic conditions.

AA-WER Streaming Index is a dataset-weighted average consistent with our offline STT benchmark: AA-AgentTalk 50% / VoxPopuli 25% / Earnings22 25%. WER is audio-duration-weighted within each dataset; Time to Final and Time to First Partial are simple averages within each dataset, then dataset-weighted 50% / 25% / 25% overall.
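The two-level weighting described above can be sketched as follows. This is a minimal illustration rather than the benchmark's actual code: the `FileResult` record, its field names, and the choice to weight per-file WER by file duration are assumptions; only the 50% / 25% / 25% dataset weights come from the text.

```python
from dataclasses import dataclass

# Dataset weights as stated in the text.
DATASET_WEIGHTS = {"AA-AgentTalk": 0.50, "VoxPopuli": 0.25, "Earnings22": 0.25}


@dataclass
class FileResult:
    """Hypothetical per-audio-file result; field names are illustrative."""
    dataset: str
    duration_s: float
    wer: float               # word error rate for this file, as a fraction
    time_to_final_s: float   # latency for this file, in seconds


def duration_weighted_wer(results):
    # Assumption: "audio-duration-weighted" means each file's WER is
    # weighted by that file's audio duration.
    total = sum(r.duration_s for r in results)
    return sum(r.wer * r.duration_s for r in results) / total


def simple_mean_latency(results):
    # Latency is a plain (unweighted) average within a dataset.
    return sum(r.time_to_final_s for r in results) / len(results)


def streaming_index(results):
    """Return (WER index, latency index) using the dataset weights."""
    wer_index = 0.0
    latency_index = 0.0
    for name, weight in DATASET_WEIGHTS.items():
        subset = [r for r in results if r.dataset == name]
        wer_index += weight * duration_weighted_wer(subset)
        latency_index += weight * simple_mean_latency(subset)
    return wer_index, latency_index


# Made-up numbers, for illustration only.
example = [
    FileResult("AA-AgentTalk", 60.0, 0.10, 0.4),
    FileResult("AA-AgentTalk", 180.0, 0.02, 0.6),
    FileResult("VoxPopuli", 100.0, 0.08, 0.8),
    FileResult("Earnings22", 100.0, 0.12, 1.0),
]
wer_index, latency_index = streaming_index(example)
```

Note that the longer AA-AgentTalk file dominates that dataset's WER, while both files count equally toward its latency average.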

Time to Final Transcription starts at the SileroVAD-detected end of speech. For models that support forced endpointing, we send the endpoint request at this point and stop the timer on the next final transcript from the model. If the model already emitted its last final before SileroVAD fired and no further finals arrive afterwards, we use that last final.

For models that do not support forced endpointing, we use the first natural final within 2 seconds of speech end. If no final arrives within 2 seconds, we use the first partial after 2 seconds, or the latest partial before 2 seconds if nothing arrives after. WER is computed on the combined previous finals plus the selected final or partial transcript.
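The selection rules above can be sketched as a pair of small functions. This is an illustration under stated assumptions, not the benchmark's implementation: the `Event` type, the convention that timestamps are seconds relative to the detected end of speech (negative means earlier), and the handling of events landing exactly on the 0-second or 2-second boundary are all hypothetical.

```python
from dataclasses import dataclass

FALLBACK_WINDOW_S = 2.0  # the 2-second window from the text


@dataclass
class Event:
    """Hypothetical transcript event from a streaming model."""
    t: float        # seconds relative to the VAD-detected end of speech
    text: str
    is_final: bool


def select_with_forced_endpoint(events):
    # Next final after speech end; otherwise the last final seen before it.
    finals = [e for e in events if e.is_final]
    after = [e for e in finals if e.t >= 0]
    if after:
        return min(after, key=lambda e: e.t)
    return max(finals, key=lambda e: e.t) if finals else None


def select_without_forced_endpoint(events):
    # 1) First natural final within the window after speech end.
    finals = [e for e in events if e.is_final and 0 <= e.t <= FALLBACK_WINDOW_S]
    if finals:
        return min(finals, key=lambda e: e.t)
    partials = [e for e in events if not e.is_final]
    # 2) Otherwise the first partial after the window.
    late = [e for e in partials if e.t > FALLBACK_WINDOW_S]
    if late:
        return min(late, key=lambda e: e.t)
    # 3) Otherwise the latest partial before the window closes.
    early = [e for e in partials if e.t <= FALLBACK_WINDOW_S]
    return max(early, key=lambda e: e.t) if early else None


def transcript_for_wer(events, selected):
    # Combine previous finals with the selected final or partial.
    previous = [e.text for e in sorted(events, key=lambda e: e.t)
                if e.is_final and e.t < selected.t]
    return " ".join(previous + [selected.text])


events = [
    Event(-3.0, "hello there", True),
    Event(0.5, "how are", False),
    Event(2.4, "how are you", False),
]
chosen = select_without_forced_endpoint(events)
scored_text = transcript_for_wer(events, chosen)
```

In this example no final arrives within 2 seconds, so the partial at 2.4 seconds is selected and scored together with the earlier final.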

AA-WER Streaming Index - Final Transcription

% of words transcribed incorrectly at Final Transcription after detected End of Speech

AA-WER Streaming - Final Transcription: AA-AgentTalk Dataset

% of words transcribed incorrectly at Final Transcription after detected End of Speech on AA-AgentTalk dataset; lower is better

AA-WER Streaming Index (First Partial) vs. Time to First Partial Transcription After Speech End

% of words transcribed incorrectly at First Partial Transcription after detected End of Speech vs. Seconds to First Partial Transcription After Speech End

Time to First Partial Transcription starts at the SileroVAD-detected end of speech and stops on the first transcript-bearing event after speech end, whether partial or final. If no transcript arrives after speech end, we use the latest transcript before speech end as the fallback snapshot.
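A minimal sketch of this first-partial rule follows. The `TranscriptEvent` type and the relative-timestamp convention are assumptions for illustration, as is treating an event with empty text as not transcript-bearing. The text does not say what latency is recorded in the fallback case, so the sketch returns only the selected event.

```python
from dataclasses import dataclass


@dataclass
class TranscriptEvent:
    """Hypothetical streaming event; names are illustrative."""
    t: float        # seconds relative to the VAD-detected end of speech
    text: str
    is_final: bool


def select_first_partial(events):
    # Only events that actually carry transcript text are considered.
    bearing = [e for e in events if e.text.strip()]
    # First transcript-bearing event after speech end, partial or final.
    after = [e for e in bearing if e.t >= 0]
    if after:
        return min(after, key=lambda e: e.t)
    # Fallback snapshot: latest transcript seen before speech end.
    before = [e for e in bearing if e.t < 0]
    return max(before, key=lambda e: e.t) if before else None


stream = [
    TranscriptEvent(-0.8, "turn left", False),
    TranscriptEvent(0.1, "", False),                 # no text, so skipped
    TranscriptEvent(0.3, "turn left at", False),
    TranscriptEvent(0.9, "turn left at the light", True),
]
first = select_first_partial(stream)
```

Here the timer would stop at 0.3 seconds, on the first event after speech end that carries text.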

AA-WER Streaming Index at First Partial Transcription After Speech End

% of words transcribed incorrectly at First Partial Transcription after detected End of Speech

AA-WER Streaming at First Partial Transcription After Speech End: AA-AgentTalk Dataset

% of words transcribed incorrectly at First Partial Transcription after detected End of Speech on AA-AgentTalk dataset; lower is better

AA-WER Streaming - Final Transcription compared to First Partial Transcription After Speech End

% of words transcribed incorrectly · Lower is better · AA-WER Streaming incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli (25%), Earnings22 (25%)

Latency

Time to Final Transcription

Seconds to Final Transcription after End of Speech

Time to First Partial Transcription After Speech End

Seconds to First Partial Transcription After Speech End

Price

Price of Transcription

USD per 1000 minutes of audio

Estimated cost in USD to transcribe 1,000 minutes of audio, normalized across providers with different billing models and including billed reasoning tokens where available. Further detail is available on the methodology page.
