
Speech to Text Benchmarking Methodology

Key Metrics

Word Error Rate (WER)

Word Error Rate (WER) measures transcription accuracy by comparing model output to a reference transcript (verified human transcription).

Formula:

WER = (Substitutions + Insertions + Deletions) ÷ Words in Reference

Example:

  • Reference: the cat sat on the mat
  • Hypothesis: cat is on the big mat
  • Errors: deletion (the), substitution (sat → is), insertion (big)
  • WER = 3 ÷ 6 = 50%

Further detail in the Word Error Rate (WER) Evaluation section.
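
As an illustrative sketch, the same calculation can be reproduced with the jiwer library (the library referenced in the WER Calculation section below), applied to the example reference and hypothesis:

    import jiwer

    reference = "the cat sat on the mat"
    hypothesis = "cat is on the big mat"

    # jiwer counts the substitutions, insertions and deletions between the two
    # word sequences and divides by the number of reference words (6),
    # giving 3 / 6 = 0.5.
    print(jiwer.wer(reference, hypothesis))  # 0.5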

Speed Factor

Indicates how quickly an API transcribes audio compared to the actual length of that audio.

  • A value above 1 means the service transcribes faster than real time (e.g. 2.0 means 10 minutes of audio can be transcribed in 5 minutes).

Formula:

Speed Factor = Audio Duration ÷ API Response Time
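
A minimal sketch of the same ratio, using the hypothetical 10-minute / 5-minute figures from the example above:

    # Hypothetical example: 10 minutes (600 s) of audio transcribed in 300 s.
    audio_duration_s = 600.0      # real-time length of the audio file
    api_response_time_s = 300.0   # wall-clock time the API took to return the transcript

    speed_factor = audio_duration_s / api_response_time_s
    print(speed_factor)  # 2.0 -> twice as fast as real time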

Price per 1,000 Minutes

Represents the cost of transcribing 1,000 minutes of audio (real-time duration).

Note:

  • Some providers (e.g. Replicate, fal) charge by processing time instead of audio duration.
  • For each model offered by these providers, we:
    • Test ~100 varied audio files.
    • Calculate average processing time per audio minute.
    • Multiply that time by the provider's per-minute processing cost.
    • Scale the resulting cost per audio minute to 1,000 minutes (see the sketch below).
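
A minimal sketch of this conversion, with hypothetical figures for the measured processing time and the provider's rate:

    # Hypothetical figures for a provider that bills by processing time.
    avg_processing_s_per_audio_min = 6.0  # average over ~100 varied test files
    price_per_processing_min = 0.01       # provider's rate per minute of processing (USD)

    cost_per_audio_min = (avg_processing_s_per_audio_min / 60.0) * price_per_processing_min
    price_per_1000_audio_min = cost_per_audio_min * 1000
    print(round(price_per_1000_audio_min, 2))  # 1.0 -> $1.00 per 1,000 audio minutes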

Word Error Rate (WER) Evaluation

Artificial Analysis conducts independent WER testing across all model-provider pairs.

Key details:

  • API endpoints tested directly to reflect end-user performance.
  • Current evaluation: ~2 hours of audio across three datasets (details below).
  • Files ≥5s in length. Chunking (splitting of audio to enable processing of longer files) depends on model limits (see the sketch after this list):
    • 10-minute limit models (Whisper L v2, Deepgram, GPT-4o Transcribe, GPT-4o Mini, Voxtral Mini, Gemini 2.5 Flash Lite): chunked into 9-minute segments.
    • Shorter-limit models (Qwen3 ASR Flash, Granite 3.3): chunked into 30-second segments.
  • For models with multiple providers (e.g. Whisper), a median WER is reported if results are within 0.4% variance.
  • When models allow prompts, we provide: "Transcribe the audio verbatim, outputting only spoken words in sequence."
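
As an illustrative sketch of the chunking step (chunk_boundaries is a hypothetical helper; the 540-second default corresponds to the 9-minute segments described above):

    def chunk_boundaries(file_duration_s: float, chunk_s: float = 540.0):
        """Return (start, end) offsets in seconds for each fixed-length chunk."""
        bounds = []
        start = 0.0
        while start < file_duration_s:
            bounds.append((start, min(start + chunk_s, file_duration_s)))
            start += chunk_s
        return bounds

    # A 23-minute earnings call (1,380 s) is split into three segments:
    print(chunk_boundaries(1380.0))  # [(0.0, 540.0), (540.0, 1080.0), (1080.0, 1380.0)]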

Datasets and samples

  1. AMI: Multi-speaker meeting recordings from a single distant microphone, with less structured dialogue in noisy environments

    • Subset: SDM
    • Number of samples: 968
    • Sample duration range: 5-24 seconds
    • Total duration: 120 minutes
  2. Vox Populi: European parliamentary proceedings

    • Subset: English
    • Number of samples: 636
    • Sample duration range: 5-39 seconds
    • Total duration: 120 minutes
  3. Earnings 22: Corporate earnings calls with technical language and overlapping speakers

    • Subset: Full
    • Number of samples: 17
    • Sample duration range: 14-23 minutes
    • Total duration: 118.8 minutes

Normalization Process

Before comparison, both the reference (ground truth) and the model transcription hypothesis are normalized using OpenAI's Whisper normalizer:

  • Lowercase all text; remove bracketed text and filler words ("uh", "um"); normalize whitespace
  • Expand contractions (won't → will not, I'm → I am)
  • Standardize numbers (twenty five → 25, two point five → 2.5)
  • Normalize punctuation/symbols (remove non-semantic marks, standardize currency/percentages)
  • Standardize spelling (e.g. colour → color)

Example:

  • Ground truth: Hello, I'm Dr. Smith. I have twenty-five patients today.
  • Model output: hello i am doctor smith i have 25 patients today
  • After normalization: hello i am doctor smith i have 25 patients today
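
A minimal sketch of this step, assuming the English normalizer shipped with the open-source openai-whisper package, applied to the example above:

    from whisper.normalizers import EnglishTextNormalizer

    normalize = EnglishTextNormalizer()

    ground_truth = "Hello, I'm Dr. Smith. I have twenty-five patients today."
    model_output = "hello i am doctor smith i have 25 patients today"

    # Both sides are normalized before WER is computed, so differences in case,
    # punctuation, contractions and spelled-out numbers do not count as errors.
    print(normalize(ground_truth))  # hello i am doctor smith i have 25 patients today
    print(normalize(model_output))  # hello i am doctor smith i have 25 patients today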

WER Calculation

  • WER is based on Levenshtein distance, which aligns two sequences by finding the optimal number of substitutions, insertions, and deletions required to transform one sequence into the other.
  • In this context, it measures the edits needed to change the model's transcription (hypothesis) back into the verified human transcript (reference).
  • For implementation we use the widely adopted jiwer library, referenced by organizations including OpenAI.
  • Results are aggregated as an audio-duration-weighted average WER so that numerous short clips do not bias results compared to longer files.
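
A minimal sketch of the duration-weighted aggregation, using hypothetical per-file results (each per-file WER would come from jiwer as above):

    # Hypothetical per-file results: (audio duration in seconds, WER).
    results = [
        (20.0, 0.12),    # short AMI clip
        (35.0, 0.08),    # VoxPopuli clip
        (1380.0, 0.15),  # long earnings call
    ]

    # Weight each file's WER by its audio duration so that many short clips
    # do not outweigh a small number of long files.
    total_duration = sum(duration for duration, _ in results)
    weighted_wer = sum(duration * wer for duration, wer in results) / total_duration
    print(round(weighted_wer, 4))  # ~0.1479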

Testing Methodology

  • Benchmarks run four times daily.
  • Metrics reported as median of prior 14 days.
  • When polling for results, we follow the provider's client library where available and otherwise poll every second (see the sketch after this list). For 10-minute files, a 1-second variation has a negligible impact.
  • Only file-based transcription tested (not live-audio streaming).
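
As an illustrative sketch of the one-second polling fallback (get_job_status is a hypothetical stand-in for whichever status call a given provider exposes):

    import time

    def wait_for_transcript(get_job_status, poll_interval_s: float = 1.0):
        """Poll a job-status function once per second until the transcript is ready."""
        while True:
            status, transcript = get_job_status()
            if status == "completed":
                return transcript
            time.sleep(poll_interval_s)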

Model Inclusion Criteria

Our objective is to represent the most popular and best-performing speech to text models to support end-users in choosing which models and providers to use. As such, we apply an 'industry significance' and competitive performance test when evaluating new models and providers for inclusion. We are in the process of refining these criteria and welcome feedback and suggestions. To suggest models, please contact us via the contact page.

Statement of Independence

Benchmarking is conducted with strict independence and objectivity. No compensation is received from any providers for listing or favorable outcomes on Artificial Analysis.
