Grok Speech to Text: API Provider Benchmarking & Analysis
Analysis of Grok Speech to Text API providers across performance metrics including Artificial Analysis Word Error Rate Index, speed, and price.
Highlights
Artificial Analysis Word Error Rate (AA-WER) Index by API
% of words transcribed incorrectly · Lower is better · AA-WER v2 incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), Earnings22-Cleaned-AA (25%)
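To make the weighting concrete, here is a minimal sketch of how the three per-dataset scores might be combined. It assumes the index is a plain weighted arithmetic mean of per-dataset WER percentages, which the page does not state outright; the function name and the example scores are hypothetical.

```python
# Dataset weights as stated for AA-WER v2.
AA_WER_V2_WEIGHTS = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli-Cleaned-AA": 0.25,
    "Earnings22-Cleaned-AA": 0.25,
}


def aa_wer_index(per_dataset_wer):
    """Weighted mean of per-dataset WER values (in percent).

    Assumption: the index is a simple weighted arithmetic mean.
    """
    missing = set(AA_WER_V2_WEIGHTS) - set(per_dataset_wer)
    if missing:
        raise ValueError(f"missing datasets: {sorted(missing)}")
    return sum(
        weight * per_dataset_wer[name]
        for name, weight in AA_WER_V2_WEIGHTS.items()
    )


# Hypothetical per-dataset scores, not taken from the benchmark.
example = {
    "AA-AgentTalk": 4.0,
    "VoxPopuli-Cleaned-AA": 6.0,
    "Earnings22-Cleaned-AA": 2.0,
}
print(aa_wer_index(example))  # 4.0
```

Because AA-AgentTalk carries half the weight, a one-point change on that dataset moves the index twice as much as the same change on either of the other two.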
Note: For Earnings22, if a model cannot reliably handle full-length audio due to time limits, we chunk to ~9 minutes (relevant to: Nova 2 Pro, Amazon; Voxtral Mini Transcribe, Mistral; GPT-4o Transcribe, OpenAI; GPT-4o Mini Transcribe, OpenAI). For models with even shorter time limits, we chunk to ~30 seconds (relevant to: Canary Qwen 2.5B, NVIDIA).
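The chunking described in the note can be sketched as follows. This assumes fixed-length, non-overlapping chunks cut purely by duration; the page does not say whether the real pipeline splits at silences or overlaps chunks, and `chunk_spans` is a hypothetical helper.

```python
def chunk_spans(total_seconds, max_chunk_seconds):
    """Return (start, end) spans in seconds that cover the audio in order.

    Assumption: fixed-length, non-overlapping chunks; the last one may be shorter.
    """
    if total_seconds < 0 or max_chunk_seconds <= 0:
        raise ValueError("durations must be non-negative and chunk size positive")
    total = float(total_seconds)
    spans = []
    start = 0.0
    while start < total:
        end = min(start + max_chunk_seconds, total)
        spans.append((start, end))
        start = end
    return spans


# A hypothetical 20-minute recording split into chunks of at most 9 minutes.
print(chunk_spans(20 * 60, 9 * 60))
# [(0.0, 540.0), (540.0, 1080.0), (1080.0, 1200.0)]
```

The same helper with `max_chunk_seconds=30` would model the shorter limit mentioned for Canary Qwen 2.5B.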
API Benchmarks
Artificial Analysis Word Error Rate Index vs. Price
% of words transcribed incorrectly · Lower is better · AA-WER v2 incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), Earnings22-Cleaned-AA (25%) · USD per 1000 minutes of audio
[Scatter plot: AA-WER Index against price for each benchmarked model, with the most attractive quadrant marked.]
Speed Factor
Input audio seconds transcribed per second · Higher is better
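As a worked illustration of the definition above, the sketch below computes a Speed Factor from a timed request and inverts it to estimate processing time. The function names are hypothetical, and the 72.3 used in the second call is the figure the summary table appears to list for Grok Speech to Text, treated here as an illustrative input only.

```python
def speed_factor(audio_seconds, processing_seconds):
    """Seconds of input audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def estimated_processing_seconds(audio_seconds, factor):
    """Invert the Speed Factor to estimate how long a transcription takes."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


# Hypothetical measurement: 10 minutes of audio returned in 8 seconds.
print(speed_factor(600, 8))  # 75.0

# One hour of audio at a Speed Factor of 72.3.
print(round(estimated_processing_seconds(3600, 72.3), 1))  # 49.8
```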
Price
Price of Transcription
USD per 1000 minutes of audio · Lower is better
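Since prices are quoted per 1000 minutes, the cost of a single job is a simple proportion. The sketch below assumes strictly linear billing with no minimum charge or rounding increment, which individual providers may not follow; `transcription_cost_usd` is a hypothetical helper and the 1.67 rate is used only as an example input.

```python
def transcription_cost_usd(audio_minutes, usd_per_1000_minutes):
    """Cost of transcribing audio at a rate quoted per 1000 minutes.

    Assumption: linear billing, no minimum charge or rounding increment.
    """
    if audio_minutes < 0 or usd_per_1000_minutes < 0:
        raise ValueError("inputs must be non-negative")
    return audio_minutes / 1000 * usd_per_1000_minutes


# A 90-minute recording at 1.67 USD per 1000 minutes.
print(round(transcription_cost_usd(90, 1.67), 4))  # 0.1503
```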
Summary of Key Metrics & Further Information
| Model, Provider | AA-WER Index | Speed Factor | Price (USD per 1000 min) |
|---|---|---|---|
| Qwen3.5 Omni Flash | 13.5% | 81.2 | 0.00 |
| Qwen3.5 Omni Plus | 3.5% | 94.9 | 0.00 |
| Nova 2 Pro | 4.9% | 22.6 | 3.10 |
| Amazon Transcribe | 4.1% | 19.0 | 24.00 |
| Universal-3 Pro | 3.1% | 92.8 | 3.50 |
| Universal, AssemblyAI | 3.8% | 112.4 | 2.50 |
| MAI-Transcribe-1.5 | 2.4% | 271.9 | 6.00 |
| MAI-Transcribe-1 | 2.6% | 55.3 | 6.00 |
| Nova-3 | 5.2% | 445.2 | 4.30 |
| Nova-2 | 5.3% | 500.9 | 4.30 |
| Base | 10.7% | 330.3 | 12.50 |
| Scribe v2 | 2.2% | 38.3 | 3.67 |
| Scribe v1 | 3.0% | 41.1 | 6.67 |
| Solaria-1, Gladia | 4.1% | 61.1 | 4.07 |
| Gemini 3.1 Pro Preview (High) | 2.8% | 6.4 | 18.15 |
| Gemini 3.1 Pro Preview (Low) | 3.6% | 7.1 | 7.72 |
| Gemini 3 Flash (High) | 2.9% | 17.9 | 13.70 |
| Gemini 2.5 Flash Lite | 5.2% | 68.5 | 6.56 |
| Gemini 2.5 Flash | 5.1% | 66.1 | 6.66 |
| Gemini 2.5 Pro | 2.9% | 13.3 | 11.39 |
| Gemini 2.0 Flash Lite | 3.8% | 56.1 | 0.19 |
| Gemini 3.1 Flash-Lite Preview (Minimal) | 3.4% | 68.2 | 5.83 |
| Gradium Speech-to-Text | 8.4% | 2.3 | 13.00 |
| Grok Speech to Text, xAI | 4.0% | 72.3 | 1.67 |
| Voxtral Mini Transcribe 2 | 3.6% | 71.6 | 3.00 |
| Voxtral Mini Transcribe | 3.5% | 54.9 | 2.00 |
| Voxtral Small | 2.8% | 66.4 | 4.00 |
| Voxtral Mini | 3.8% | 69.9 | 1.00 |
| Modulate STT Batch English VFast | 13.0% | 198.5 | 0.42 |
| Parakeet TDT 0.6B V3, Togetherai | 4.5% | 918.0 | 1.50 |
| Canary Qwen 2.5B, NVIDIA | 4.3% | 5.3 | 0.74 |
| Parakeet TDT 0.6B V2, NVIDIA | 6.4% | 103.2 | 0.00 |
| Parakeet RNNT 1.1B | 5.4% | 5.9 | 1.91 |
| GPT-4o Transcribe | 4.0% | 31.8 | 6.00 |
| GPT-4o Mini Transcribe | 4.5% | 45.4 | 3.00 |
| Rev AI | 5.9% | 12.6 | 3.33 |
| Smallest Pulse | 4.4% | 139.8 | 5.00 |
| Speechmatics Standard | 5.1% | 69.6 | 4.00 |
| Speechmatics Enhanced | 4.0% | 52.6 | 6.70 |
| Whisper Large v3 Turbo | 4.6% | 130.6 | 0.67 |
| Whisper Large v3 Turbo | 4.7% | 224.1 | 1.00 |
| Wizper Large v3 | 4.7% | 268.0 | 0.50 |
| Incredibly Fast Whisper | 5.7% | 55.7 | 1.49 |
| Whisper Large v3 | 10.1% | 2.8 | 4.23 |
| Whisper Large v3 | 4.1% | 85.5 | 1.15 |
| Whisper Large v3 | 4.6% | 300.3 | 1.00 |
| Whisper Large v3 | 4.5% | 430.3 | 1.50 |
| Whisper Large v2 | 4.1% | 27.4 | 6.00 |
Speech to Text providers compared: xAI.