Speech to Text Benchmarking Methodology

Artificial Analysis evaluates Speech to Text (STT) models, also known as automatic speech recognition (ASR) models, on accuracy, speed, and price. We benchmark both offline (batch) and streaming transcription. Accuracy (WER), normalization, datasets, and model inclusion criteria are shared across both; the sections below first define these shared foundations, then cover the metrics specific to batch and to streaming evaluation.

Key Metrics

Word Error Rate (WER)

Word Error Rate (WER) measures transcription accuracy by comparing model output to a reference transcript (verified human transcription).

Formula:

WER = (Substitutions + Insertions + Deletions) ÷ Words in Reference

Example:

Reference: the cat sat on the mat
Hypothesis: cat is on the big mat
Errors: deletion (the), substitution (sat → is), insertion (big)
WER = 3 ÷ 6 = 50%
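
As an illustrative sketch (not the benchmark's actual implementation), the example above can be reproduced with a plain word-level edit distance. The function name `word_error_rate` is hypothetical, and the inputs are assumed to be already normalized and whitespace-separated.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level WER: minimum (substitutions + insertions + deletions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "cat is on the big mat")
print(f"{wer:.0%}")  # 50%
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.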

Further detail in the Word Error Rate (WER) Evaluation section.

Speed Factor (batch)

Indicates how quickly a batch transcription API transcribes audio compared to the actual length of that audio.

A value above 1 means the service transcribes faster than real time (e.g. 2.0 means 10 minutes of audio can be transcribed in 5 minutes).

Our batch benchmark includes network latency. Points shown in Speed Factor Over Time are the median of four randomized measurements taken within a 24-hour period. Bars shown for Speed Factor are the median over the last 7 days.

Formula:

Speed Factor = Audio Duration ÷ API Response Time
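
The formula and the median aggregation can be sketched as follows. The function name and the four response times are hypothetical values chosen for illustration; they are not benchmark measurements.

```python
from statistics import median


def speed_factor(audio_seconds: float, response_seconds: float) -> float:
    """Audio duration divided by end-to-end API response time (network included)."""
    if response_seconds <= 0:
        raise ValueError("response time must be positive")
    return audio_seconds / response_seconds


# The example from the text: 10 minutes of audio transcribed in 5 minutes.
print(speed_factor(600.0, 300.0))  # 2.0

# Hypothetical response times (seconds) for four runs on the same 600 s file.
runs = [speed_factor(600.0, t) for t in (240.0, 300.0, 200.0, 400.0)]
daily_point = median(runs)  # one point on the "over time" chart
print(daily_point)  # 2.25
```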

Latency Metrics (streaming)

For streaming models we report two latency metrics:

Time to Final Transcript: time from the SileroVAD-detected end of speech to the first transcript the model marks as Final.
Time to First Partial Transcript: time from the SileroVAD-detected end of speech to the first transcript the model marks as Partial.

See the Streaming Speech to Text section for measurement detail.

Price per 1,000 Minutes

Represents the estimated cost of transcribing 1,000 minutes of audio, normalized across providers with different billing models.

How we calculate this

Audio duration: Uses the provider's published duration-based transcription price.
Processing time: Benchmarks processing time on a varied audio sample and converts processing-time charges into cost per 1,000 audio minutes.
Tokens: Benchmarks billed token usage on a representative audio sample and converts observed spend into cost per 1,000 audio minutes. Where providers expose separate reasoning tokens, those are included explicitly; otherwise, billed output tokens may include both visible output and internal reasoning tokens.
Subscription plan: Uses the lowest-upfront plan that can realistically support 1,000 minutes of transcription in a month under the provider's stated limits.

Common unit

In all cases, prices are converted into the same unit: estimated cost per 1,000 minutes of audio transcribed.
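
A rough sketch of how each billing model could be converted into the common unit. All function names, prices, and plan details below are hypothetical and only illustrate the arithmetic; they do not describe any real provider or the exact procedure used in the benchmark.

```python
TARGET_MINUTES = 1000


def from_duration_price(usd_per_audio_minute: float) -> float:
    """Duration-based billing: scale the published per-minute price."""
    return usd_per_audio_minute * TARGET_MINUTES


def from_observed_spend(sample_spend_usd: float, sample_audio_minutes: float) -> float:
    """Token billing: scale spend observed on a benchmark sample."""
    return sample_spend_usd / sample_audio_minutes * TARGET_MINUTES


def from_processing_time(usd_per_processing_second: float,
                         processing_seconds: float,
                         sample_audio_minutes: float) -> float:
    """Processing-time billing: convert measured compute time into spend first."""
    spend = usd_per_processing_second * processing_seconds
    return from_observed_spend(spend, sample_audio_minutes)


def from_subscription(plans: list[tuple[float, float]]) -> float:
    """Plans are (monthly_price_usd, included_minutes); pick the cheapest that fits."""
    eligible = [price for price, minutes in plans if minutes >= TARGET_MINUTES]
    if not eligible:
        raise ValueError("no plan covers 1,000 minutes")
    return min(eligible)


# Hypothetical inputs for each billing model.
print(f"{from_duration_price(0.006):.2f}")                # 6.00
print(f"{from_observed_spend(0.45, 60.0):.2f}")           # 7.50
print(f"{from_processing_time(0.0005, 900.0, 60.0):.2f}")  # 7.50
print(f"{from_subscription([(10.0, 300.0), (25.0, 1500.0), (60.0, 6000.0)]):.2f}")  # 25.00
```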

Word Error Rate (WER) Evaluation

Artificial Analysis conducts independent WER testing across all model-provider pairs.

Key details:

API endpoints tested directly to reflect end-user performance.
Current evaluation: ~8 hours of audio across AA-AgentTalk, VoxPopuli-Cleaned-AA, and Earnings22-Cleaned-AA (details below). Results within each dataset are calculated as an audio-duration-weighted average. AA-WER is then a weighted average of the three dataset results: 50% AA-AgentTalk, 25% VoxPopuli-Cleaned-AA, and 25% Earnings22-Cleaned-AA.
When models allow prompts, we provide: "Transcribe the audio verbatim, outputting only spoken words in sequence."
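
The 50/25/25 dataset weighting described above amounts to a small weighted sum. In this sketch the function name `aa_wer` and the per-dataset WER values are hypothetical; only the weights come from the methodology.

```python
DATASET_WEIGHTS = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli-Cleaned-AA": 0.25,
    "Earnings22-Cleaned-AA": 0.25,
}


def aa_wer(per_dataset_wer: dict[str, float]) -> float:
    """Combine per-dataset WER values using the fixed dataset weights."""
    missing = DATASET_WEIGHTS.keys() - per_dataset_wer.keys()
    if missing:
        raise KeyError(f"missing dataset results: {sorted(missing)}")
    return sum(weight * per_dataset_wer[name]
               for name, weight in DATASET_WEIGHTS.items())


# Hypothetical per-dataset results for one model.
example = {
    "AA-AgentTalk": 0.04,
    "VoxPopuli-Cleaned-AA": 0.06,
    "Earnings22-Cleaned-AA": 0.10,
}
print(f"AA-WER: {aa_wer(example):.1%}")  # AA-WER: 6.0%
```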

Datasets

AA-AgentTalk (proprietary, held-out): A proprietary evaluation dataset developed by Artificial Analysis, focused on speech relevant to voice agent use cases. See our blog post for more details.
- Number of samples: 469
- Sample duration range: 8–109 seconds
- Total duration: ~250 minutes
VoxPopuli-Cleaned-AA: European parliamentary proceedings.
- Subset: English
- Number of samples: 628
- Sample duration range: 5–38 seconds
- Total duration: ~119 minutes
Earnings22-Cleaned-AA: Corporate earnings calls with technical language and overlapping speakers.
- Number of samples: 6
- Sample duration range: ~14–22 minutes
- Total duration: ~115 minutes

For details on VoxPopuli-Cleaned-AA and Earnings22-Cleaned-AA, see our blog post. The current cleaned transcript release in both repositories is Version 1.0, used in AA-WER v2.

These cleaned transcripts reflect our best effort at verbatim ground truth, informed by manual review and cross-validation across the dataset. Future refinements will be released as subsequent versions. If you spot issues, we'd welcome feedback via our contact page or Discord.

Audio chunking (Earnings22)

Earnings22 contains longer files (~14-22 minutes), so chunking differs by evaluation mode. For batch (offline) transcription, we chunk only when a model cannot handle the longer file natively, starting with chunks of approximately 9 minutes and reducing to approximately 30-second chunks only if needed. For streaming transcription, all Earnings22 files are chunked to approximately 30 seconds. Chunk boundaries are chosen at end-of-speech boundaries where possible to avoid splitting mid-word. AA-AgentTalk and VoxPopuli samples are short enough that chunking does not apply.

For chunked Earnings22 audio, we aggregate results back to the parent call:

WER: Chunk transcripts are concatenated back into their parent call, and WER is computed on the combined transcript against the parent-call ground truth. WER is then audio-duration-weighted across parent calls, consistent with all other datasets.
Streaming latency: Time to Final Transcript and Time to First Partial Transcript are aggregated as a simple average across chunks.
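
The two aggregation rules for chunked audio can be sketched as below. The `ChunkResult` structure, its field names, and the sample values are assumptions made for illustration. The combined transcript would then be scored against the parent-call reference with any word-level WER function.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ChunkResult:
    index: int               # position of the chunk within the parent call
    transcript: str          # model output for this chunk
    time_to_final_ms: int    # streaming latency measured for this chunk


def combine_transcript(chunks: list[ChunkResult]) -> str:
    """Concatenate chunk transcripts in call order into one parent-call transcript."""
    ordered = sorted(chunks, key=lambda c: c.index)
    return " ".join(c.transcript.strip() for c in ordered if c.transcript.strip())


def mean_time_to_final_ms(chunks: list[ChunkResult]) -> float:
    """Simple (unweighted) average of the latency across chunks."""
    return mean(c.time_to_final_ms for c in chunks)


# Hypothetical chunk results, deliberately out of order.
chunks = [
    ChunkResult(1, "thank you for joining", 500),
    ChunkResult(0, "good morning everyone ", 300),
]
print(combine_transcript(chunks))      # good morning everyone thank you for joining
print(mean_time_to_final_ms(chunks))   # 400
```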

Audio Samples

"Fantastic, the blog post about remote work best practices is ready to publish. The interview quotes from our distributed team members add authenticity, and the productivity statistic from Buffer's study support our main points. Schedule it for Tuesday at ten AM and promote it across our social channels. This should drive good engagement with our HR software prospects."

"I'd like to speak with someone about setting up a payment plan for my outstanding balance. My account has been past due for about two months now. The account is A A C, excuse me, A C C nine nine five two seven."

"Also can you adjust the Philips Hue bulbs in the dining room to that warm amber setting turn down the chandelier to about thirty percent brightness?"

Original: "Mr President, I have another complaint about this procedure, which is that it is not secret."

Cleaned: "Thank you Mr President, I have another complaint about this procedure, which is that it's not secret."

Original: "Furthermore the AFET opinion divides eligible countries into candidate, potential candidate, neighbourhood and in exceptional and duly justified circumstances strategically important third counties."

Cleaned: "Furthermore, the opinion of AFET divides eligible countries into candidate, potential candidate, neighbourhood and, in exceptional and duly justified circumstances, strategically important third countries."

"Thank you, Darcy, and welcome everyone to our December quarterly analyst call. December quarterly production showed a considerable improvement on the September quarter with record production throughput and improving grades, improving recoveries and improving cash flow. Unfortunately, delays accessing higher grade parts of the open pit resulted in lower grades than projected in our guidance. On the exploration front, today we announced a 70% increase in our 100% owned Yamarna resources. So they now sit at 0.5 million ounces..."

Normalization Process

Before comparison, both the reference (ground truth) and the model transcription hypothesis are normalized using OpenAI's Whisper normalizer:

Lowercase all text; remove bracketed text and filler words ("uh", "um"); normalize whitespace
Expand contractions (won't vs. will not, I'm vs. I am)
Standardize numbers (twenty five vs. 25, two point five vs. 2.5)
Normalize punctuation/symbols (remove non-semantic marks, standardize currency/percentages)
Standardize spelling (e.g. colour vs. color)

Beyond OpenAI's Whisper normalizer, we apply the following additional normalization rules:

Digit, ID, and phone-number formatting for equivalent compact, spaced, parenthesized, and hyphen-grouped forms (e.g., 1405 553 272 vs. 1405553272, (303) 775-4498 vs. 303 775 4498, CLM-2024-0873 vs. CLM 2024 0873)
Letter-separated names and spelled-letter sequences across separator styles (e.g., S I N G H vs. S-I-N-G-H, A B C vs. A-B-C)
Preservation of leading zeros and normalization of equivalent spoken symbol forms (e.g., 06100052, + vs. plus, engagement underscore rate vs. engagement_rate)
Time formatting for equivalent clock forms (e.g., 7:00pm vs. 7pm, 10:30 AM vs. ten thirty AM)
Additional US / UK English spelling equivalences (e.g., totalled vs. totaled)
Accepted equivalent spellings for ambiguous proper nouns in our dataset (e.g., Mateo vs. Matteo)
Numeric range formatting when endpoints and units match (e.g., 15-20 minutes vs. fifteen to twenty minutes, 6-8% vs. six to eight percent)
Common spacing and hyphenation variants where meaning is unchanged (e.g., hard-coded vs. hardcoded, up front vs. upfront, Sea-Tac vs. SeaTac)
Preservation of meaning-bearing codes and symbols where omission or substitution changes the transcript (e.g., F-150, W-2, $50, 5%)
Repeated spoken large-number phrases are kept separate rather than merged into a new number (e.g., twelve million twelve million becomes 12,000,000 12,000,000, not 24,000,000)
Bracketed transcribed content is kept, while annotation tags such as [music], <unk>, and language English<asr_text> are removed

Example:

Ground truth: Hello, I'm Dr. Smith. I have twenty-five patients today.
Model output: hello i am doctor smith i have 25 patients today
After normalization: hello i am doctor smith i have 25 patients today
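
To make the example concrete, here is a deliberately tiny stand-in for a normalizer. It is not the Whisper normalizer and not the benchmark's custom normalizer: the lookup tables cover only the words in this example, and the name `toy_normalize` is hypothetical.

```python
import string

# Toy lookup tables: just enough to cover the example above.
CONTRACTIONS = {"i'm": "i am", "won't": "will not"}
TITLES = {"dr": "doctor"}
NUMBER_WORDS = {"twenty-five": "25"}
FILLERS = {"uh", "um"}


def toy_normalize(text: str) -> str:
    """Lowercase, trim edge punctuation, drop fillers, apply the lookup tables."""
    words = []
    for raw in text.lower().split():
        # Strip punctuation at the edges only, so "i'm" and "twenty-five" stay intact.
        token = raw.strip(string.punctuation)
        if not token or token in FILLERS:
            continue
        token = CONTRACTIONS.get(token, token)
        token = TITLES.get(token, token)
        token = NUMBER_WORDS.get(token, token)
        words.append(token)
    return " ".join(words)


reference = toy_normalize("Hello, I'm Dr. Smith. I have twenty-five patients today.")
hypothesis = toy_normalize("hello i am doctor smith i have 25 patients today")
print(reference)                # hello i am doctor smith i have 25 patients today
print(reference == hypothesis)  # True
```

After normalization the two strings are identical, so this pair contributes no errors to WER.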

WER Calculation

WER is based on Levenshtein distance, which aligns two sequences by finding the minimum number of substitutions, insertions, and deletions required to transform one sequence into the other.
In this context, it measures the edits needed to change the model's transcription (hypothesis) back into the verified human transcript (reference).
Results within datasets are aggregated as an audio-duration-weighted average WER so that numerous short clips do not bias results compared to longer files.
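
One way to read the audio-duration weighting is as a weighted mean of per-file WER, with each file's duration as its weight. That reading is an assumption of this sketch, as are the function name and the sample values.

```python
def duration_weighted_wer(results: list[tuple[float, float]]) -> float:
    """results holds one (wer, audio_seconds) pair per file."""
    total_seconds = sum(seconds for _, seconds in results)
    if total_seconds <= 0:
        raise ValueError("total audio duration must be positive")
    return sum(wer * seconds for wer, seconds in results) / total_seconds


# Hypothetical dataset: one 20-minute file at 10% WER and three perfect 10-second clips.
results = [(0.10, 1200.0), (0.00, 10.0), (0.00, 10.0), (0.00, 10.0)]

simple_mean = sum(wer for wer, _ in results) / len(results)
print(f"simple mean:       {simple_mean:.3f}")                     # 0.025
print(f"duration-weighted: {duration_weighted_wer(results):.3f}")  # 0.098
```

The simple mean lets the three short clips dominate the result, while the duration-weighted figure stays close to the WER of the long file that makes up almost all of the audio.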

Streaming Speech to Text

The streaming benchmark measures the accuracy of streaming models, particularly for voice-agent use cases, while maintaining comparability with the batch benchmark above. WER, normalization, datasets, weighting, and audio chunking are as defined above. We report final-transcript WER and, separately, partial-transcript WER for the first partial after end-of-speech.

For streaming, latency metrics are sample-weighted within each dataset, while WER remains audio-duration-weighted. For chunked Earnings22 audio, see the chunking aggregation note above.

Streaming mechanics

Audio is streamed in real time in 20ms chunks, or as recommended by the public model config. Streaming STT models return two transcript types:

Partials: not yet confirmed, subject to change.
Finals: confirmed, no longer changing.

Network delay is included in all latency measurements. This reflects the real end-user experience of proprietary models, although it is difficult to standardize across providers.
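
As a sketch of real-time pacing, the snippet below splits raw PCM audio into 20 ms chunks and attaches the offset at which each chunk should be sent. The audio format (16 kHz, 16-bit, mono) and the function names are assumptions for illustration; the methodology does not specify them.

```python
from typing import Iterator


def chunk_size_bytes(sample_rate_hz: int = 16000, sample_width_bytes: int = 2,
                     channels: int = 1, chunk_ms: int = 20) -> int:
    """Bytes of PCM audio covering one chunk of chunk_ms milliseconds."""
    samples_per_chunk = sample_rate_hz * chunk_ms // 1000
    return samples_per_chunk * sample_width_bytes * channels


def paced_chunks(pcm: bytes, size: int, chunk_ms: int = 20) -> Iterator[tuple[float, bytes]]:
    """Yield (send_offset_seconds, chunk) pairs for real-time streaming."""
    for n, start in enumerate(range(0, len(pcm), size)):
        yield n * chunk_ms / 1000, pcm[start:start + size]


size = chunk_size_bytes()
print(size)  # 640

one_second_of_silence = bytes(32000)  # 16000 samples x 2 bytes
schedule = list(paced_chunks(one_second_of_silence, size))
print(len(schedule))  # 50
```

A real sender would wait until the stream start time plus each offset before sending the chunk, rather than sleeping a fixed 20 ms per chunk, so that small delays do not accumulate.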

Forced endpointing

Most models let us trigger a final at a specific time, such as when a separate VAD detects end of speech. This matters because voice-agent developers often run a separate, tuned endpointing model to control how aggressively transcripts are pulled. Models that support forced endpointing are evaluated on the standard path below; models that do not are evaluated on the fallback logic (see the timing logic sections below).

Where a provider exposes endpoint-related configuration, we set it so that only the forced endpoint signal triggers a final transcript, and we use the provider's publicly recommended configuration for this use case.

Timing logic: forced endpointing supported

We stream audio in real time. When SileroVAD detects end of speech, we send the force-endpoint request.
The Time to Final timer starts at the wall-clock moment the audio through the speech-end point has actually been streamed and the force-endpoint signal sent, not the audio-file speech-end timestamp. This avoids contaminating latency with pacing slip on either the test harness or model end.
The timer stops on the next transcript final from the model; WER is computed on the combined finals.
If the model already shared a final before SileroVAD fired, we use the most recent one.
Time to First Partial is measured the same way, stopping on the first partial received after the force-endpoint signal.
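
The timer logic above can be sketched over a recorded list of transcript events. The event structure, field names, and timestamps are hypothetical. The sketch covers only the case where a transcript arrives after the force-endpoint signal, and leaves out the case of a final shared before SileroVAD fired.

```python
from typing import NamedTuple, Optional


class TranscriptEvent(NamedTuple):
    received_ms: int  # wall-clock receive time in milliseconds
    kind: str         # "partial" or "final"
    text: str


def latency_after_signal(events: list[TranscriptEvent], signal_sent_ms: int,
                         kind: str) -> Optional[int]:
    """Milliseconds from the force-endpoint signal to the next event of the given kind."""
    for event in sorted(events, key=lambda e: e.received_ms):
        if event.kind == kind and event.received_ms >= signal_sent_ms:
            return event.received_ms - signal_sent_ms
    return None  # nothing of that kind arrived after the signal


# Hypothetical session: the force-endpoint signal is sent at t = 10,000 ms.
events = [
    TranscriptEvent(9_800, "partial", "schedule it for"),
    TranscriptEvent(10_120, "partial", "schedule it for tuesday"),
    TranscriptEvent(10_350, "final", "schedule it for tuesday at ten am"),
]
print(latency_after_signal(events, 10_000, "partial"))  # 120
print(latency_after_signal(events, 10_000, "final"))    # 350
```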

Timing logic: forced endpointing not supported

Use the first natural final within 2s of SileroVAD end-of-speech.
If no natural final appears within 2s, use the first partial arriving after 2s.
If nothing arrives after 2s, use the latest partial before 2s.
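
The three fallback rules can be expressed as an ordered selection over recorded events. This sketch assumes events are `(received_ms, kind, text)` tuples, that "within 2s" means the window starting at SileroVAD end-of-speech, and that the event list covers the whole session. The function name is hypothetical.

```python
from typing import Optional

WINDOW_MS = 2000
Event = tuple[int, str, str]  # (received_ms, "partial" or "final", text)


def select_fallback(events: list[Event], vad_end_ms: int) -> Optional[Event]:
    """Pick the transcript to score when forced endpointing is unavailable."""
    ordered = sorted(events, key=lambda e: e[0])
    deadline = vad_end_ms + WINDOW_MS
    # Rule 1: first natural final within 2 s of end of speech.
    for event in ordered:
        if event[1] == "final" and vad_end_ms <= event[0] <= deadline:
            return event
    # Rule 2: otherwise, first partial arriving after the 2 s window.
    for event in ordered:
        if event[1] == "partial" and event[0] > deadline:
            return event
    # Rule 3: otherwise, latest partial received before the window closed.
    earlier = [e for e in ordered if e[1] == "partial" and e[0] <= deadline]
    return earlier[-1] if earlier else None


# Hypothetical sessions with end of speech detected at t = 1,000 ms.
print(select_fallback([(900, "partial", "a"), (1500, "final", "a b")], 1000))
print(select_fallback([(900, "partial", "a"), (3500, "partial", "a b")], 1000))
print(select_fallback([(900, "partial", "a"), (2500, "partial", "a b")], 1000))
```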

Reported metrics

Final WER is computed from the combined final transcripts plus the selected final or fallback partial from the logic above. Partial WER is computed from the combined final transcripts plus the selected post-endpoint partial from the logic above. Both are compared against the full ground truth transcript. Because the first post-endpoint partial typically reflects an earlier point in the model's streaming output, partial WER and final WER are reported separately rather than compared directly.

Model Inclusion Criteria

Our objective is to represent the most popular and best-performing speech to text models to support end-users in choosing which models and providers to use. As such, we apply an 'industry significance' and competitive performance test to evaluate inclusion of new models & providers. We are in the process of refining these criteria and welcome any feedback and suggestions. To suggest models, please contact us via the contact page.

Version History

AA-WER v2.2

May 2026—current

Further updated the English text comparison normalizer to treat additional meaning-preserving transcription variants as equivalent, while preserving penalties for meaning-bearing omissions or substitutions:
- Unicode apostrophe variants across equivalent wording (e.g., we’re vs. we're)
- Signed percentage formatting when meaning matches (e.g., -12% vs. negative twelve percent)
- Decimal grouping where there is no currency marker (e.g., 99.99 vs. ninety-nine ninety-nine), while preserving penalties for missing currency symbols (e.g., $71.50 is not equivalent to seventy-one fifty)
- Formatting-only hyphen/dash variants (e.g., 18-year-old vs. 18 year old)
- Technical punctuation in clear code, URL, and email contexts (e.g., UserController.java vs. UserController dot java, /v1/users vs. slash v1 slash users, gmail.com vs. gmail dot com)
- Version, decimal, and fraction variants where the spoken and written forms match (e.g., version 1.24.3 vs. version one point twenty-four dot three, 1.0 vs. one point zero, 1.25 inches vs. one and a quarter inches)
- Unit abbreviations where meaning is unchanged (e.g., mg/dL vs. milligrams per deciliter, GB vs. gigabytes)
Launched streaming Speech to Text evaluation, adding Time to Final Transcript and Time to First Partial Transcript latency metrics.

AA-WER v2.1

April 2026—May 2026

Updated the English text comparison normalizer to treat additional meaning-preserving transcription variants as equivalent, while preserving penalties for meaning-bearing omissions or substitutions:
- ID and phone-number formatting where hyphens or spacing are used only as digit grouping (e.g., 303-775-4498 vs. 303 775 4498, CLM-2024-0873 vs. CLM 2024 0873)
- Letter-separated names and spelled-letter sequences across separator styles (e.g., S I N G H vs. S-I-N-G-H)
- Numeric range formatting when endpoints and units match (e.g., 15-20 minutes vs. fifteen to twenty minutes, 6-8% vs. six to eight percent)
- Common spacing and hyphenation variants where meaning is unchanged (e.g., hard-coded vs. hardcoded, up front vs. upfront, Sea-Tac vs. SeaTac)
- Preservation of meaning-bearing codes and symbols where omission or substitution changes the transcript (e.g., F-150, W-2, $50, 5%)
- Repeated spoken large-number phrases are kept separate rather than merged into a new number (e.g., twelve million twelve million becomes 12,000,000 12,000,000, not 24,000,000)
- Bracketed transcribed content is kept, while annotation tags such as [music], <unk>, and language English<asr_text> are removed

AA-WER v2.0

February 2026–April 2026

Introduced AA-AgentTalk, a new proprietary evaluation dataset (50% weighting) focused on voice agent use cases, serving as a private held-out test set.
Cleaned ground truth transcripts for VoxPopuli and Earnings22, correcting errors in reference transcriptions to ensure fairer evaluation.
Removed AMI SDM from the benchmark due to transcript quality issues.
Used custom normalizer building on Whisper's English normalizer to reduce artificially inflated WER from formatting differences rather than genuine transcription errors.
New weighting: 50% AA-AgentTalk, 25% Earnings22-Cleaned-AA, 25% VoxPopuli-Cleaned-AA.

AA-WER v1.0

September 2025—February 2026

Initial release with three public datasets: AMI SDM, VoxPopuli, Earnings22.
~6 hours of audio across three datasets.