Speech to Text Benchmarking Methodology
Key Metrics
Word Error Rate (WER)
Word Error Rate (WER) measures transcription accuracy by comparing model output to a reference transcript (verified human transcription).
Formula:
WER = (Substitutions + Insertions + Deletions) ÷ Words in Reference
Example:
- Reference: the cat sat on the mat
- Hypothesis: cat is on the big mat
- Errors: deletion (the), substitution (sat → is), insertion (big)
- WER = 3 ÷ 6 = 50%
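As a quick check, the worked example above can be reproduced with the open-source jiwer package (shown purely for illustration; our pipeline also applies the normalization described below before scoring):

```python
# Reproduce the worked WER example with the open-source `jiwer` package.
import jiwer

reference = "the cat sat on the mat"
hypothesis = "cat is on the big mat"

# 1 deletion + 1 substitution + 1 insertion = 3 errors over 6 reference words
print(jiwer.wer(reference, hypothesis))  # 0.5
```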
Further detail in the Word Error Rate (WER) Evaluation section.
Speed Factor
Indicates how quickly an API transcribes audio compared to the actual length of that audio.
- A value above 1 means the service transcribes faster than real time (e.g. 2.0 means 10 minutes of audio can be transcribed in 5 minutes).
Formula:
Speed Factor = Audio Duration ÷ API Response Time
Price per 1,000 Minutes
Represents the cost of transcribing 1,000 minutes of audio (real-time duration).
Note:
- Some providers (e.g. Replicate, fal) charge by processing time instead of audio duration.
- For each model offered by these providers, we (see the sketch after this list):
- Test ~100 varied audio files.
- Calculate average processing time per audio minute.
- Multiply that time by the provider's per-minute processing cost.
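As an illustration of that conversion, the sketch below uses made-up numbers, not real provider rates or measurements:

```python
# Convert processing-time pricing into an effective price per 1,000
# minutes of audio. Both inputs are placeholder values.
avg_processing_seconds_per_audio_minute = 4.0  # hypothetical average over ~100 test files
price_per_processing_minute_usd = 0.10         # hypothetical provider rate

processing_minutes_per_audio_minute = avg_processing_seconds_per_audio_minute / 60
price_per_1000_audio_minutes_usd = (
    1000 * processing_minutes_per_audio_minute * price_per_processing_minute_usd
)
print(f"${price_per_1000_audio_minutes_usd:.2f} per 1,000 minutes")  # $6.67
```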
Word Error Rate (WER) Evaluation
Artificial Analysis conducts independent WER testing across all model-provider pairs.
Key details:
- API endpoints tested directly to reflect end-user performance.
- Current evaluation: ~8 hours of audio across AA-AgentTalk, VoxPopuli-Cleaned-AA, and Earnings22-Cleaned-AA (details below). Results within each dataset are calculated as a time-weighted average; we then combine the datasets at 50% AA-AgentTalk, 25% VoxPopuli, and 25% Earnings22 to produce our AA-WER result.
- Files ≥5s in length. Chunking (splitting of audio to enable processing of longer files) depends on model limits, as sketched in the code after this list:
- 10-minute limit models (Whisper L v2, Deepgram, GPT-4o Transcribe, GPT-4o Mini, Voxtral Mini, Gemini 2.5 Flash Lite): chunked into 9-minute segments.
- Shorter-limit models (Qwen3 ASR Flash, Granite 3.3): chunked into 30-second segments.
- For models with multiple providers (e.g. Whisper), a median WER is reported if results are within 0.4% variance.
- When models allow prompts, we provide: "Transcribe the audio verbatim, outputting only spoken words in sequence."
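As a sketch of the chunking step referenced above, audio can be split into fixed-length segments under the model's limit. pydub is shown here as one possible tool; it is an assumption, not necessarily what our pipeline uses:

```python
from pydub import AudioSegment  # illustrative choice of audio library

def chunk_audio(path: str, chunk_seconds: int) -> list[AudioSegment]:
    """Split audio into fixed-length chunks, e.g. 540s (9 min) or 30s."""
    audio = AudioSegment.from_file(path)
    step_ms = chunk_seconds * 1000  # pydub indexes audio in milliseconds
    return [audio[i:i + step_ms] for i in range(0, len(audio), step_ms)]

nine_minute_chunks = chunk_audio("sample.wav", chunk_seconds=9 * 60)
thirty_second_chunks = chunk_audio("sample.wav", chunk_seconds=30)
```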
Datasets
AA-AgentTalk (proprietary, held-out): An evaluation dataset developed by Artificial Analysis, focused on speech relevant to voice agent use cases. See our blog post for more details.
- Number of samples: 469
- Sample duration range: 8–109 seconds
- Total duration: ~250 minutes
VoxPopuli-Cleaned-AA: European parliamentary proceedings.
- Subset: English
- Number of samples: 628
- Sample duration range: 5–38 seconds
- Total duration: ~119 minutes
Earnings22-Cleaned-AA: Corporate earnings calls with technical language and overlapping speakers.
- Number of samples: 6
- Sample duration range: ~14–22 minutes
- Total duration: ~115 minutes
For details on VoxPopuli-Cleaned-AA and Earnings22-Cleaned-AA, see our blog post. The current cleaned transcript release in both repositories is Version 1.0, used in AA-WER v2.0.
These cleaned transcripts reflect our best effort at verbatim ground truth, informed by manual review and cross-validation across the dataset. Future refinements will be released as subsequent versions. If you spot issues, we'd welcome feedback via our contact page or Discord.
Audio Samples
"Fantastic, the blog post about remote work best practices is ready to publish. The interview quotes from our distributed team members add authenticity, and the productivity statistic from Buffer's study support our main points. Schedule it for Tuesday at ten AM and promote it across our social channels. This should drive good engagement with our HR software prospects."
"I'd like to speak with someone about setting up a payment plan for my outstanding balance. My account has been past due for about two months now. The account is A A C, excuse me, A C C nine nine five two seven."
"Also can you adjust the Philips Hue bulbs in the dining room to that warm amber setting turn down the chandelier to about thirty percent brightness?"
Original: "Mr President, I have another complaint about this procedure, which is that it is not secret."
Cleaned: "Thank you Mr President, I have another complaint about this procedure, which is that it's not secret."
Original: "Furthermore the AFET opinion divides eligible countries into candidate, potential candidate, neighbourhood and in exceptional and duly justified circumstances strategically important third counties."
Cleaned: "Furthermore, the opinion of AFET divides eligible countries into candidate, potential candidate, neighbourhood and, in exceptional and duly justified circumstances, strategically important third countries."
"Thank you, Darcy, and welcome everyone to our December quarterly analyst call. December quarterly production showed a considerable improvement on the September quarter with record production throughput and improving grades, improving recoveries and improving cash flow. Unfortunately, delays accessing higher grade parts of the open pit resulted in lower grades than projected in our guidance. On the exploration front, today we announced a 70% increase in our 100% owned Yamarna resources. So they now sit at 0.5 million ounces..."
Normalization Process
Before comparison, both the reference (ground truth) and the model transcription hypothesis are normalized using OpenAI's Whisper normalizer:
- Lowercase all text; remove bracketed text and filler words ("uh", "um"); normalize whitespace
- Expand contractions (won't → will not, I'm → I am)
- Standardize numbers (twenty five → 25, two point five → 2.5)
- Normalize punctuation/symbols (remove non-semantic marks, standardize currency/percentages)
- Standardize spelling (e.g. colour → color)
Beyond OpenAI's Whisper normalizer, we apply additional normalization:
- Digit splitting to prevent number-grouping mismatches (e.g., 1405 553 272 vs. 1405553272)
- Preservation of leading zeros and normalization of spoken symbols (e.g., "+", "_")
- Stripping of redundant :00 in times (e.g., 7:00pm vs. 7pm)
- Additional US / UK English spelling equivalences (e.g., totalled vs. totaled)
- Accepted equivalent spellings for ambiguous proper nouns in our dataset (e.g., Mateo vs. Matteo)
Example:
- Ground truth: Hello, I'm Dr. Smith. I have twenty-five patients today.
- Model output: hello i am doctor smith i have 25 patients today
- After normalization: hello i am doctor smith i have 25 patients today
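The baseline step can be reproduced with the EnglishTextNormalizer from the open-source openai-whisper package; our additional rules listed above are layered on top and are not shown here:

```python
# Baseline normalization via OpenAI's Whisper English normalizer.
from whisper.normalizers import EnglishTextNormalizer

normalize = EnglishTextNormalizer()

reference = "Hello, I'm Dr. Smith. I have twenty-five patients today."
hypothesis = "hello i am doctor smith i have 25 patients today"

# Both sides normalize to the same string, so this pair scores zero errors.
print(normalize(reference))   # hello i am doctor smith i have 25 patients today
print(normalize(hypothesis))  # hello i am doctor smith i have 25 patients today
```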
WER Calculation
- WER is based on Levenshtein distance, which aligns two sequences by finding the minimum number of substitutions, insertions, and deletions required to transform one sequence into the other.
- In this context, it measures the edits needed to change the model's transcription (hypothesis) back into the verified human transcript (reference).
- Results within each dataset are aggregated as an audio-duration-weighted average WER, so that many short clips do not outweigh longer files.
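A minimal sketch of this computation follows, assuming already-normalized text; the final lines show the 50/25/25 dataset weighting with placeholder WER values:

```python
# Word-level Levenshtein WER, then duration-weighted aggregation.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn hyp[:j] into ref[:i]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

def duration_weighted_wer(results: list[tuple[float, float]]) -> float:
    """results: (wer, audio_duration_seconds) pairs for one dataset."""
    total_duration = sum(duration for _, duration in results)
    return sum(w * duration for w, duration in results) / total_duration

print(wer("the cat sat on the mat", "cat is on the big mat"))  # 0.5

# Final AA-WER from per-dataset results (placeholder values, not real scores):
agenttalk, voxpopuli, earnings22 = 0.05, 0.08, 0.11
aa_wer = 0.5 * agenttalk + 0.25 * voxpopuli + 0.25 * earnings22
```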
Testing Methodology
- Benchmarks run four times daily.
- Metrics reported as median of prior 14 days.
- When polling for results, we follow the provider's client library where available, or otherwise poll every second. For 10-minute files, a 1-second variation has negligible impact.
- Only file-based transcription tested (not live-audio streaming).
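To make the timing concrete, here is a minimal sketch of a single Speed Factor measurement; transcribe_file is a hypothetical stand-in for a provider's API call, not a real client method:

```python
import time

def measure_speed_factor(audio_path: str, audio_duration_s: float) -> float:
    # Speed Factor = audio duration ÷ wall-clock API response time
    start = time.monotonic()
    transcribe_file(audio_path)  # hypothetical provider API call
    elapsed_s = time.monotonic() - start
    return audio_duration_s / elapsed_s
```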
Model Inclusion Criteria
Our objective is to represent the most popular and best-performing speech to text models, to help end-users choose which models and providers to use. As such, we apply an 'industry significance' test and a competitive performance test when evaluating new models and providers for inclusion. We are refining these criteria and welcome feedback and suggestions. To suggest models, please contact us via the contact page.
Version History
AA-WER v2.0
February 2026 – current
- Introduced AA-AgentTalk, a new proprietary evaluation dataset (50% weighting) focused on voice agent use cases, serving as a private held-out test set.
- Cleaned ground truth transcripts for VoxPopuli and Earnings22, correcting errors in reference transcriptions to ensure fairer evaluation.
- Removed AMI SDM from the benchmark due to transcript quality issues.
- Introduced a custom normalizer building on Whisper's English normalizer, reducing WER inflation caused by formatting differences rather than genuine transcription errors.
- New weighting: 50% AA-AgentTalk, 25% Earnings22-Cleaned-AA, 25% VoxPopuli-Cleaned-AA.
AA-WER v1.0
September 2025 – February 2026
- Initial release with three public datasets: AMI SDM, VoxPopuli, Earnings22.
- ~6 hours of audio across three datasets.