February 18, 2026
AA-WER v2.0: Speech to Text Accuracy Benchmark
We're introducing AA-WER v2.0, a major update to our Speech to Text (STT) accuracy benchmark featuring AA-AgentTalk, a new proprietary evaluation dataset focused on speech directed at voice agents, alongside cleaned ground truth transcripts for VoxPopuli and Earnings22.
Accurate STT benchmarking is critical for stakeholders across the speech AI ecosystem because it allows:
- Developers and enterprises to select the right STT model and provider based on reliable accuracy data from speech that better reflects their use case
- AI labs to better support customers by understanding how their models perform on diverse, real-world speech
Why AA-WER v2.0
Voice agents are one of the fastest-growing applications of Speech to Text (STT). From customer service and internal support agents to voice-prompted code generation and AI-enhanced meeting notes, the accuracy of the underlying transcription model directly impacts user experience and downstream task performance.
Yet most public STT benchmarks were developed before this wave. Widely used public evaluation datasets typically cover articles and books read aloud, parliamentary proceedings, earnings calls, and other meeting recordings, but none that we've seen focuses on voice agent interaction. Public test sets also carry a growing risk of data contamination, and we found errors in existing ground truth transcriptions that unfairly penalize models that correctly transcribe the audio.
AA-WER v2.0 addresses these gaps with AA-AgentTalk, a new proprietary held-out dataset centered on voice agent evaluation, and cleaned ground truth transcripts for VoxPopuli and Earnings22.
What's new in AA-WER v2.0
We've made five key changes to strengthen the benchmark:
- New proprietary dataset - AA-AgentTalk (50% weighting): 469 samples (~250 minutes) of speech relevant to voice agent use cases, serving as a held-out test set that mitigates the risk of models overfitting to public benchmarks
- Cleaned transcripts for public datasets: We manually reviewed and corrected errors in the original ground truth for VoxPopuli and Earnings22, releasing VoxPopuli-Cleaned-AA and Earnings22-Cleaned-AA
- Removal of AMI-SDM: Transcript errors were too extensive to correct without making a large number of judgment calls, so we removed this dataset from the benchmark
- Improved text normalizer: A custom normalizer building on Whisper's English normalizer to reduce artificially inflated WER from formatting differences rather than genuine transcription errors
- New weighting: 50% AA-AgentTalk, 25% VoxPopuli-Cleaned-AA, 25% Earnings22-Cleaned-AA
AA-WER v2.0 at a Glance
The updated benchmark evaluates models across ~8 hours of audio from three datasets, weighted to reflect the importance of voice agent use cases while retaining coverage of formal and technical speech:
| Dataset | Type | Samples | Sample Duration Range | Total Duration | Weighting |
|---|---|---|---|---|---|
| AA-AgentTalk | Proprietary, held-out, including emerging use cases such as voice agents | 469 | 8-109 seconds | ~250 min | 50% |
| VoxPopuli-Cleaned-AA | Parliamentary proceedings | 628 | 5-38 seconds | ~119 min | 25% |
| Earnings22-Cleaned-AA | Corporate earnings calls | 6 | ~14-22 minutes | ~115 min | 25% |
Results within each dataset are calculated as a time-weighted average WER, then combined using the weightings above to produce the final AA-WER score.
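As a rough sketch of this scoring scheme (one plausible reading of "time-weighted", not our exact evaluation code), per-sample WERs can be weighted by sample duration within each dataset and then combined using the benchmark weightings:

```python
# Illustrative sketch of AA-WER scoring, assuming per-sample WERs are weighted
# by sample duration within each dataset; not the benchmark's actual code.

def dataset_wer(samples):
    """Time-weighted average WER for one dataset.

    `samples` is a list of dicts with hypothetical keys:
    "duration" (seconds) and "wer" (per-sample word error rate).
    """
    total_duration = sum(s["duration"] for s in samples)
    return sum(s["wer"] * s["duration"] for s in samples) / total_duration

# AA-WER v2.0 weightings from the table above
WEIGHTS = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli-Cleaned-AA": 0.25,
    "Earnings22-Cleaned-AA": 0.25,
}

def aa_wer(per_dataset_wer):
    """Combine per-dataset WER scores into the final AA-WER score."""
    return sum(WEIGHTS[name] * wer for name, wer in per_dataset_wer.items())
```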
Key results
ElevenLabs Scribe v2 leads at 2.3% AA-WER, followed by Google's Gemini 3 Pro at 2.9%, Mistral's Voxtral Small at 3.0%, Google's Gemini 2.5 Pro at 3.1%, and Gemini 3 Flash at 3.2%. ElevenLabs Scribe v2 leads on two of the three component datasets, while Google's Gemini 3 Pro leads on VoxPopuli.
AA-WER v2.0 Overall Results
Performance varies across datasets, revealing distinct model strengths:
- AA-AgentTalk (voice agent speech): ElevenLabs Scribe v2 (1.6%), AssemblyAI Universal-3 Pro (2.3%), Google Gemini 3 Flash (2.5%)
- Earnings22-Cleaned-AA (corporate earnings calls): ElevenLabs Scribe v2 (4.1%), Google Gemini 2.5 Pro (4.4%), NVIDIA Parakeet TDT 0.6B V3 (4.9%)
- VoxPopuli-Cleaned-AA (parliamentary proceedings): Google Gemini 3 Pro (1.7%), ElevenLabs Scribe v2 (1.8%), Mistral Voxtral Small (1.8%)
AA-WER by Individual Dataset - AA-AgentTalk
AA-WER by Individual Dataset - Earnings22-Cleaned-AA
AA-WER by Individual Dataset - VoxPopuli-Cleaned-AA
AA-AgentTalk: A new dataset for voice agent evaluation
AA-AgentTalk is a proprietary evaluation dataset developed by Artificial Analysis, focused on the types of speech directed at voice agents and other growing real-world applications. As a held-out dataset, it mitigates the risk of models being trained or fine-tuned to perform well on public test sets.
The dataset comprises 469 audio samples totaling ~250 minutes, with individual samples ranging from 8 to 109 seconds. Samples are scripted prompts read aloud by participants, covering 6 content categories, 17 accent groups, and 8 speaking styles. Prompts were generated to reflect realistic voice agent interactions, spanning voice agents & call centers (40%), AI agent interaction (20%), industry jargon (15%), meetings (10%), consumer & personal (10%), and media (5%).
| Category | Subcategories |
|---|---|
| Voice Agents & Call Centers (40%) | Customer and Internal Support, Complex Inquiries and Escalations, AI Receptionist, Scheduling & Booking |
| AI Agent Prompting & Interaction (20%) | Coding Agents, Research & Analysis Agents, Writing & Communication Agents, Data & Analysis Agents, Workflow & Task Agents |
| Industry Jargon (15%) | Financial Services, Healthcare, Insurance, Legal, Engineering & Resources, Other |
| Meetings & Collaboration (10%) | Client & Sales Meetings, Financial & Business Meetings, Technical Meetings, Corporate Meetings, Cross-functional Meetings |
| Consumer & Personal (10%) | Personal Assistant, Shopping, Entertainment, Smart Home, Productivity Apps |
| Media (5%) | Interviews/Podcasts, Documentary/YouTube Content, Panel/Talk Shows, News & Commentary, Sports Commentary |
Accent Distribution of Participants
Device & Microphone Breakdown
| Device | Built-in | USB | Wired headphones/earbuds | Bluetooth headset/earbuds | Other |
|---|---|---|---|---|---|
| Laptop | 47.0% | 2.6% | 5.1% | 8.5% | 0.9% |
| Phone | 17.9% | - | - | 0.9% | - |
| Desktop | 2.6% | 8.5% | 4.3% | - | - |
| Tablet | 0.9% | - | 0.9% | 0.9% | - |
The majority of recordings (75%) use a natural speaking voice, with the remainder assigned styles including speaking quickly, quietly, slowly, louder, tired/yawning, alternating volume, and slightly out of breath. Approximately 70% were recorded in quiet indoor environments, with 30% featuring background noise.
Dataset development
We recruited participants to record themselves reading prompts designed to reflect real voice agent interactions. Each participant recorded multiple prompts across different categories and speaking styles using their own devices in their natural environments.
As a first pass, participants reviewed their own transcripts and could edit them to reflect what they actually said rather than what the prompt text contained. We then reviewed approximately 1,300 recordings and narrowed them to 469 whose transcripts accurately matched the audio. Finally, our team manually reviewed every transcript, approving or correcting each one to ensure verbatim ground truth.
Cleaned transcripts for VoxPopuli and Earnings22
Reference transcripts in VoxPopuli and Earnings22 contained inaccuracies - instances where the ground truth didn't match what was actually spoken. Inaccurate ground truth penalizes models that correctly transcribe the audio, so we manually reviewed and corrected the transcripts, releasing VoxPopuli-Cleaned-AA and Earnings22-Cleaned-AA.
What we corrected
We corrected transcripts to reflect verbatim what speakers said. Key corrections included:
- Incorrect words: Misspellings, misheard words, and incorrect contractions in the original references
- Missed words: Restored words omitted from the original references, including genuine repetitions (e.g., "the the" where the speaker actually repeated a word)
- Partial stuttering: Removed incomplete word fragments (e.g., "evac-" in "evac- evacuate") as these are inherently ambiguous in transcription
- Grammar and tense: Where speakers used incorrect grammar (particularly speakers with accents) but the word choice was clear, we kept the words verbatim as spoken rather than correcting them
Elements already normalized by the whisperNormalizer package (e.g., capitalization, punctuation, and filler words) were not modified, since these differences are already handled during WER calculation.
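To illustrate where this normalization sits in the pipeline, here is a minimal sketch assuming the whisper_normalizer package (a PyPI distribution of Whisper's English normalizer) and the jiwer library for WER; it is illustrative only, not our exact evaluation code:

```python
# Minimal sketch: normalize reference and hypothesis before computing WER, so
# capitalization, punctuation, and filler words don't count as errors.
# Assumes the whisper_normalizer and jiwer packages; not our actual pipeline.
from whisper_normalizer.english import EnglishTextNormalizer
import jiwer

normalizer = EnglishTextNormalizer()

reference = "Thank you Mr President, I have another complaint about this procedure."
hypothesis = "thank you mr president I have another complaint about this procedure"

# Both strings are normalized identically, so formatting-only differences vanish.
score = jiwer.wer(normalizer(reference), normalizer(hypothesis))
print(f"WER: {score:.3f}")  # 0.000 once formatting differences are stripped
```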
VoxPopuli-Cleaned-AA: before and after
Original transcript: "Mr President, I have another complaint about this procedure, which is that it is not secret."
Cleaned transcript: "Thank you Mr President, I have another complaint about this procedure, which is that it's not secret."
Original transcript: "Furthermore the AFET opinion divides eligible countries into candidate, potential candidate, neighbourhood and in exceptional and duly justified circumstances strategically important third counties."
Cleaned transcript: "Furthermore, the opinion of AFET divides eligible countries into candidate, potential candidate, neighbourhood and, in exceptional and duly justified circumstances, strategically important third countries."
On average, model WER on VoxPopuli decreased by 3.5 percentage points (p.p.) after cleaning. The largest decreases on VoxPopuli were for Rev AI (5.4 p.p.), Nova-3, Deepgram (5.2 p.p.), and Whisper Large v3, OpenAI (5.1 p.p.). Only one model's WER increased after cleaning: Chirp 2, Google (+1.4 p.p.).
VoxPopuli: Cleaned vs Original Subset of Publicly Available Data
Earnings22-Cleaned-AA: sample
"Thank you, Darcy, and welcome everyone to our December quarterly analyst call. December quarterly production showed a considerable improvement on the September quarter with record production throughput and improving grades, improving recoveries and improving cash flow..."
On average, model WER on Earnings22 decreased by 5.6 p.p. after cleaning, and no model's WER increased. The largest decreases on Earnings22 were for Qwen3 ASR Flash, Alibaba (7.7 p.p.), Parakeet TDT 0.6B V3, NVIDIA (7.4 p.p.), and Gemini 2.5 Pro, Google (6.9 p.p.).
The models with the largest WER decreases across both datasets were Whisper Large v3, OpenAI (5.1 p.p. on VoxPopuli; 6.4 p.p. on Earnings22), Voxtral Small, Mistral (4.6 p.p.; 6.4 p.p.), and Voxtral Mini, Mistral (4.1 p.p.; 6.8 p.p.).
Earnings22: Cleaned vs Original Subset of Publicly Available Data
Public release
We are releasing our cleaned transcript datasets, VoxPopuli-Cleaned-AA and Earnings22-Cleaned-AA, publicly on Hugging Face to support the STT research community.
The current cleaned transcript release in both repositories is Version 1.0, used in AA-WER v2.0.
These cleaned transcripts reflect our best effort at verbatim ground truth, informed by manual review and cross-validation across the dataset. Future refinements will be released as subsequent versions. If you spot issues, we'd welcome feedback via our contact page or Discord.
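For reference, a cleaned split could be loaded with the Hugging Face datasets library along the lines of the sketch below; the repository id and split name shown are placeholders, not the actual published names.

```python
# Hypothetical example of loading one of the cleaned transcript datasets from
# Hugging Face; "artificial-analysis/voxpopuli-cleaned-aa" and split="test"
# are placeholder names, not necessarily the published repository layout.
from datasets import load_dataset

ds = load_dataset("artificial-analysis/voxpopuli-cleaned-aa", split="test")

# Inspect the first example (field names may differ in the released dataset).
print(ds[0])
```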
Improved text normalizer
We developed a custom text normalizer building on the whisperNormalizer package to reduce artificially inflated WER from formatting differences rather than genuine transcription errors. Key improvements include:
- Digit splitting: Prevents number grouping mismatches (e.g., "1405 553 272" vs. "1405553272") from counting as errors
- Leading zeros: Preserves leading zeros in codes and identifiers
- Spoken symbols: Normalizes symbols like "+" and "_" when spoken aloud
- Time formatting: Strips redundant ":00" in times (e.g., "7:00pm" vs. "7pm")
- Spelling equivalences: Adds additional US/UK English spelling equivalences (e.g., "totalled" vs. "totaled")
- Proper noun variants: Accepts equivalent spellings for ambiguous proper nouns in our dataset (e.g., "Mateo" vs. "Matteo")
These normalizations ensure models are evaluated on actual transcription accuracy rather than surface-level formatting choices.
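To make a few of these rules concrete, the sketch below shows how digit splitting, time formatting, and spelling equivalences might be expressed as regex passes layered on top of a base normalizer; the specific patterns and word list are illustrative assumptions, not our implementation.

```python
# Illustrative sketch of a few custom normalization rules layered on a base
# English normalizer; the rules and regexes here are assumptions, not the
# benchmark's actual normalizer.
import re

def extra_normalize(text: str) -> str:
    # Time formatting: strip a redundant ":00" so "7:00pm" matches "7pm".
    text = re.sub(r"(\d+):00(\s*[ap]m)", r"\1\2", text, flags=re.IGNORECASE)

    # Spelling equivalences: map a small sample of UK spellings onto US forms.
    for uk, us in {"totalled": "totaled", "neighbourhood": "neighborhood"}.items():
        text = re.sub(rf"\b{uk}\b", us, text)

    # Digit splitting: remove grouping separators, then split digit runs into
    # single digits so "1405 553 272" and "1405553272" compare equal.
    text = re.sub(r"(?<=\d)[ ,](?=\d)", "", text)
    text = re.sub(r"\d+", lambda m: " ".join(m.group()), text)

    return text

print(extra_normalize("The total was 1405 553 272 at 7:00pm"))
# -> "The total was 1 4 0 5 5 5 3 2 7 2 at 7pm"
```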
As Speech to Text models continue to rapidly improve, we'll keep updating our benchmarks with new models, datasets, and evaluation methods.