February 18, 2026

AA-WER v2.0: Speech to Text Accuracy Benchmark

We're introducing AA-WER v2.0, a major update to our Speech to Text (STT) accuracy benchmark featuring AA-AgentTalk, a new proprietary evaluation dataset focused on speech directed at voice agents, alongside cleaned ground truth transcripts for VoxPopuli and Earnings22.

Accurate STT benchmarking is critical for stakeholders across the speech AI ecosystem because it allows:

  • Developers and enterprises to select the right STT model and provider based on reliable accuracy data and speech that better reflects their use case
  • AI labs to better support customers by understanding how their models perform on diverse, real-world speech

Why AA-WER v2.0

Voice agents are one of the fastest growing applications of Speech to Text (STT). From customer service and internal support agents to voice-prompted code generation and AI-enhanced meeting notes, the accuracy of the underlying transcription model directly impacts user experience and downstream task performance.

Yet most public STT benchmarks were developed before this wave. Widely used public evaluation datasets typically cover articles and books read aloud, parliamentary proceedings, earnings calls, and other meeting recordings, but none that we've seen focuses on voice agent interaction. Public test sets also carry a growing risk of data contamination, and we found errors in existing ground truth transcriptions that unfairly penalize models which correctly transcribe the audio.

AA-WER v2.0 addresses these gaps with AA-AgentTalk, a new proprietary held-out dataset centered on voice agent evaluation, and cleaned ground truth transcripts for VoxPopuli and Earnings22.

What's new in AA-WER v2.0

We've made five key changes to strengthen the benchmark:

  1. New proprietary dataset - AA-AgentTalk (50% weighting): 469 samples (~250 minutes) of speech relevant to voice agent use cases, serving as a held-out test set that mitigates the risk of models overfitting to public benchmarks
  2. Cleaned transcripts for public datasets: We manually reviewed and corrected errors in the original ground truth for VoxPopuli and Earnings22, releasing VoxPopuli-Cleaned-AA and Earnings22-Cleaned-AA
  3. Removal of AMI-SDM: Transcript errors were too extensive to correct without making a large number of judgment calls, so we removed this dataset from the benchmark
  4. Improved text normalizer: A custom normalizer building on Whisper's English normalizer to reduce artificially inflated WER from formatting differences rather than genuine transcription errors
  5. New weighting: 50% AA-AgentTalk, 25% VoxPopuli-Cleaned-AA, 25% Earnings22-Cleaned-AA

AA-WER v2.0 at a Glance

The updated benchmark evaluates models across ~8 hours of audio from three datasets, weighted to reflect the importance of voice agent use cases while retaining coverage of formal and technical speech:

Dataset | Type | Samples | Sample Duration Range | Total Duration | Weighting
AA-AgentTalk | Proprietary, held-out, including emerging use cases such as voice agents | 469 | 8-109 seconds | ~250 min | 50%
VoxPopuli-Cleaned-AA | Parliamentary proceedings | 628 | 5-38 seconds | ~119 min | 25%
Earnings22-Cleaned-AA | Corporate earnings calls | 6 | ~14-22 minutes | ~115 min | 25%

Results within each dataset are calculated as a time-weighted average WER, then combined using the weightings above to produce the final AA-WER score.
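For illustration, the sketch below shows this aggregation in Python. It uses the jiwer library purely to stand in for the underlying WER calculation and is a simplified illustration rather than our evaluation pipeline, which also applies the custom normalizer described later in this post.

```python
# Minimal sketch of the AA-WER v2.0 aggregation: per-dataset WER is a
# time-weighted average over samples, and the final score combines the
# three datasets using the fixed benchmark weightings.
import jiwer  # third-party WER library, used here purely for illustration


def dataset_wer(samples: list[dict]) -> float:
    """Time-weighted WER for one dataset.

    Each sample is {"reference": str, "hypothesis": str, "duration_s": float}.
    """
    total_duration = sum(s["duration_s"] for s in samples)
    return sum(
        jiwer.wer(s["reference"], s["hypothesis"]) * s["duration_s"]
        for s in samples
    ) / total_duration


# Benchmark weightings from AA-WER v2.0
WEIGHTS = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli-Cleaned-AA": 0.25,
    "Earnings22-Cleaned-AA": 0.25,
}


def aa_wer(per_dataset_samples: dict[str, list[dict]]) -> float:
    """Final AA-WER score: weighted combination of per-dataset WERs."""
    return sum(
        WEIGHTS[name] * dataset_wer(samples)
        for name, samples in per_dataset_samples.items()
    )
```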

Key results

ElevenLabs Scribe v2 leads at 2.3% AA-WER, followed by Google's Gemini 3 Pro at 2.9%, Mistral's Voxtral Small at 3.0%, Google's Gemini 2.5 Pro at 3.1%, and Gemini 3 Flash at 3.2%. ElevenLabs Scribe v2 leads on two of the three component datasets, while Google's Gemini 3 Pro leads on VoxPopuli.

AA-WER v2.0 Overall Results

Performance varies across datasets, revealing distinct model strengths:

  • AA-AgentTalk (voice agent speech): ElevenLabs Scribe v2 (1.6%), AssemblyAI Universal-3 Pro (2.3%), Google Gemini 3 Flash (2.5%)
  • Earnings22-Cleaned-AA (corporate earnings calls): ElevenLabs Scribe v2 (4.1%), Google Gemini 2.5 Pro (4.4%), NVIDIA Parakeet TDT 0.6B V3 (4.9%)
  • VoxPopuli-Cleaned-AA (parliamentary proceedings): Google Gemini 3 Pro (1.7%), ElevenLabs Scribe v2 (1.8%), Mistral Voxtral Small (1.8%)

AA-WER by Individual Dataset - AA-AgentTalk

AA-WER by Individual Dataset - Earnings22-Cleaned-AA

AA-WER by Individual Dataset - VoxPopuli-Cleaned-AA

AA-AgentTalk: A new dataset for voice agent evaluation

AA-AgentTalk is a proprietary evaluation dataset developed by Artificial Analysis, focused on the types of speech directed at voice agents and other growing real-world applications. As a held-out dataset, it mitigates the risk of models being trained or fine-tuned to perform well on public test sets.

The dataset comprises 469 audio samples totalling ~250 minutes, with individual samples ranging from 8 to 109 seconds. Samples are scripted prompts read aloud by participants across 6 content categories, 17 accent groups, and 8 speaking styles. Prompts were generated to reflect realistic voice agent interactions, spanning Voice Agents & Call Centers (40%), AI Agent Prompting & Interaction (20%), Industry Jargon (15%), Meetings & Collaboration (10%), Consumer & Personal (10%), and Media (5%).

Category | Subcategories
Voice Agents & Call Centers (40%) | Customer and Internal Support, Complex Inquiries and Escalations, AI Receptionist, Scheduling & Booking
AI Agent Prompting & Interaction (20%) | Coding Agents, Research & Analysis Agents, Writing & Communication Agents, Data & Analysis Agents, Workflow & Task Agents
Industry Jargon (15%) | Financial Services, Healthcare, Insurance, Legal, Engineering & Resources, Other
Meetings & Collaboration (10%) | Client & Sales Meetings, Financial & Business Meetings, Technical Meetings, Corporate Meetings, Cross-functional Meetings
Consumer & Personal (10%) | Personal Assistant, Shopping, Entertainment, Smart Home, Productivity Apps
Media (5%) | Interviews/Podcasts, Documentary/YouTube Content, Panel/Talk Shows, News & Commentary, Sports Commentary

Accent Distribution of Participants

Device & Microphone Breakdown

Device | Built-in | USB | Wired headphones/earbuds | Bluetooth headset/earbuds | Other
Laptop | 47.0% | 2.6% | 5.1% | 8.5% | 0.9%
Phone | 17.9% | - | - | 0.9% | -
Desktop | 2.6% | 8.5% | 4.3% | - | -
Tablet | 0.9% | - | 0.9% | 0.9% | -

The majority of recordings (75%) use a natural speaking voice, with the remainder assigned styles including speaking quickly, quietly, slowly, louder, tired/yawning, alternating volume, and slightly out of breath. Approximately 70% were recorded in quiet indoor environments, with 30% featuring background noise.

Dataset development

We recruited participants to record themselves reading prompts designed to reflect real voice agent interactions. Each participant recorded multiple prompts across different categories and speaking styles using their own devices in their natural environments.

As a first pass, participants reviewed their own transcripts and could edit them to reflect what they actually said rather than what the prompt text contained. We then reviewed approximately 1,300 recordings and narrowed them to 469 whose transcripts accurately matched their audio, with our team manually approving or correcting each transcript to ensure verbatim ground truth.

Cleaned transcripts for VoxPopuli and Earnings22

Reference transcripts in VoxPopuli and Earnings22 contained inaccuracies - instances where the ground truth didn't match what was actually spoken. Inaccurate ground truth penalizes models that correctly transcribe the audio, so we manually reviewed and corrected the transcripts, releasing VoxPopuli-Cleaned-AA and Earnings22-Cleaned-AA.

What we corrected

We corrected transcripts to reflect verbatim what speakers said. Key corrections included:

  • Incorrect words: Misspellings, misheard words, incorrect contractions in the original references
  • Missed words: Restored words spoken in the audio but missing from the reference, including genuine repetitions (e.g., "the the" where the speaker repeated a word)
  • Partial stuttering: Removed incomplete word fragments (e.g., "evac-" in "evac- evacuate") as these are inherently ambiguous in transcription
  • Grammar and tense: When speakers used incorrect grammar (particularly speakers with accents) but the word choice was clear, we kept verbatim words as spoken rather than correcting them

Elements already normalized by the whisperNormalizer package (e.g., capitalization, punctuation, and filler words) were not modified, since these differences are already handled during WER calculation.
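To illustrate how normalization fits into scoring, the minimal sketch below normalizes both the reference and the hypothesis before computing WER. It assumes the whisper_normalizer and jiwer packages (import paths may differ), reuses the VoxPopuli example sentence from the next section, and is not our evaluation code.

```python
# Sketch: both the reference and the hypothesis are normalized before WER is
# computed, so differences in capitalization, punctuation, and similar
# formatting are not counted as transcription errors.
# Assumes the whisper_normalizer package (import path may differ).
import jiwer
from whisper_normalizer.english import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

reference = "Thank you Mr President, I have another complaint about this procedure, which is that it's not secret."
hypothesis = "thank you mr president i have another complaint about this procedure which is that it's not secret"

# Without normalization, surface formatting differences can count as errors.
raw_wer = jiwer.wer(reference, hypothesis)

# With normalization, only genuine word-level differences remain.
normalized_wer = jiwer.wer(normalizer(reference), normalizer(hypothesis))

print(f"raw WER: {raw_wer:.2f}, normalized WER: {normalized_wer:.2f}")
```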

VoxPopuli-Cleaned-AA: before and after

Original transcript: "Mr President, I have another complaint about this procedure, which is that it is not secret."

Cleaned transcript: "Thank you Mr President, I have another complaint about this procedure, which is that it's not secret."

Original transcript: "Furthermore the AFET opinion divides eligible countries into candidate, potential candidate, neighbourhood and in exceptional and duly justified circumstances strategically important third counties."

Cleaned transcript: "Furthermore, the opinion of AFET divides eligible countries into candidate, potential candidate, neighbourhood and, in exceptional and duly justified circumstances, strategically important third countries."

On average, model WER on VoxPopuli fell by 3.5 percentage points (p.p.) after cleaning. The largest decreases were for Rev AI (5.4 p.p.), Deepgram Nova-3 (5.2 p.p.), and OpenAI Whisper Large v3 (5.1 p.p.). Only one model's WER worsened after cleaning: Google Chirp 2 (+1.4 p.p.).

VoxPopuli: Cleaned vs Original Subset of Publicly Available Data

Earnings22-Cleaned-AA: sample

"Thank you, Darcy, and welcome everyone to our December quarterly analyst call. December quarterly production showed a considerable improvement on the September quarter with record production throughput and improving grades, improving recoveries and improving cash flow..."

On average, model WER on Earnings22 fell by 5.6 percentage points (p.p.) after cleaning, and no model's WER increased. The largest decreases were for Alibaba Qwen3 ASR Flash (7.7 p.p.), NVIDIA Parakeet TDT 0.6B V3 (7.4 p.p.), and Google Gemini 2.5 Pro (6.9 p.p.).

The models with the largest WER decreases across both datasets were OpenAI Whisper Large v3 (5.1 p.p. on VoxPopuli; 6.4 p.p. on Earnings22), Mistral Voxtral Small (4.6 p.p.; 6.4 p.p.), and Mistral Voxtral Mini (4.1 p.p.; 6.8 p.p.).

Earnings22: Cleaned vs Original Subset of Publicly Available Data

Public release

We are releasing our cleaned transcript datasets, VoxPopuli-Cleaned-AA and Earnings22-Cleaned-AA, publicly on Hugging Face to support the STT research community.

The current cleaned transcript release in both repositories is Version 1.0, used in AA-WER v2.0.

These cleaned transcripts reflect our best effort at verbatim ground truth, informed by manual review and cross-validation across the dataset. Future refinements will be released as subsequent versions. If you spot issues, we'd welcome feedback via our contact page or Discord.

Improved text normalizer

We developed a custom text normalizer building on the whisperNormalizer package to reduce artificially inflated WER from formatting differences rather than genuine transcription errors. Key improvements include:

  • Digit splitting: Prevents number grouping mismatches (e.g., "1405 553 272" vs. "1405553272") from counting as errors
  • Leading zeros: Preserves leading zeros in codes and identifiers
  • Spoken symbols: Normalizes symbols like "+" and "_" when spoken aloud
  • Time formatting: Strips redundant ":00" in times (e.g., "7:00pm" vs. "7pm")
  • Spelling equivalences: Adds additional US/UK English spelling equivalences (e.g., "totalled" vs. "totaled")
  • Proper noun variants: Accepts equivalent spellings for ambiguous proper nouns in our dataset (e.g., "Mateo" vs. "Matteo")

These normalizations ensure models are evaluated on actual transcription accuracy rather than surface-level formatting choices.
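As a rough sketch of how rules like these can be layered on top of the base normalizer, the example below applies simplified stand-ins for a few of the rules described above. It is illustrative only, not our actual implementation, and assumes the whisper_normalizer package.

```python
# Illustrative sketch of layering custom rules on top of a base normalizer.
# Not our actual implementation: the regexes and word list below are
# simplified stand-ins for the rules described above.
import re

from whisper_normalizer.english import EnglishTextNormalizer


class CustomNormalizer:
    # Example US/UK spelling equivalences mapped to a single canonical form.
    SPELLING_EQUIVALENCES = {"totalled": "totaled", "neighbourhood": "neighborhood"}

    def __init__(self) -> None:
        self.base = EnglishTextNormalizer()

    def __call__(self, text: str) -> str:
        text = self.base(text)
        # Time formatting: drop redundant ":00" so "7:00 pm" scores like "7 pm".
        text = re.sub(r"\b(\d{1,2}):00\b", r"\1", text)
        # Digit splitting: separate adjacent digits so grouping differences
        # ("1405 553 272" vs "1405553272") yield identical token sequences.
        text = re.sub(r"(?<=\d)(?=\d)", " ", text)
        # Spelling equivalences: map listed variants to one canonical form.
        return " ".join(self.SPELLING_EQUIVALENCES.get(w, w) for w in text.split())
```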

As Speech to Text models continue to rapidly improve, we'll keep updating our benchmarks with new models, datasets, and evaluation methods.

Resources