Grok Speech to Text: API Provider Benchmarking & Analysis

Creator: xAI
License: Proprietary
Analysis of Grok Speech to Text API providers across performance metrics including Artificial Analysis Word Error Rate Index, speed, and price.

Highlights

AA-WER v2: % of words transcribed incorrectly (lower is better)
Speed Factor: input audio seconds transcribed per second (higher is better)
Price: USD per 1,000 minutes of audio (lower is better)

Artificial Analysis Word Error Rate (AA-WER) Index by API

% of words transcribed incorrectly · Lower is better · AA-WER v2 incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), Earnings22-Cleaned-AA (25%)
Note: For Earnings22, if a model cannot reliably handle full-length audio due to time limits, we chunk to ~9 minutes (relevant to: Nova 2 Pro, Amazon; Voxtral Mini Transcribe, Mistral; GPT-4o Transcribe, OpenAI; GPT-4o Mini Transcribe, OpenAI). For models with even shorter time limits, we chunk to ~30 seconds (relevant to: Canary Qwen 2.5B, NVIDIA).

Measures transcription accuracy across 3 datasets to evaluate models in real-world speech with diverse accents, domain-specific language, and challenging channel & acoustic conditions.

AA-WER is calculated as an audio-duration-weighted average of WER across ~8 hours from three datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), and Earnings22-Cleaned-AA (25%). See methodology for more detail.
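
The calculation described above can be sketched in a few lines. This is a minimal illustration, not the Artificial Analysis implementation: it assumes the 50/25/25 percentages act as fixed weights on per-dataset WER, and it uses only lowercasing and whitespace splitting as text normalization. The names `word_error_rate`, `aa_wer_index` and `DATASET_WEIGHTS` are hypothetical.

```python
# Sketch only: the normalization rules and the exact weighting scheme are assumptions.

DATASET_WEIGHTS = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli-Cleaned-AA": 0.25,
    "Earnings22-Cleaned-AA": 0.25,
}


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, deletions, insertions)
    divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def aa_wer_index(per_dataset_wer: dict[str, float]) -> float:
    """Combine per-dataset WER values using the assumed fixed weights."""
    return sum(
        weight * per_dataset_wer[name] for name, weight in DATASET_WEIGHTS.items()
    )
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words, giving 2/6.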

API Benchmarks

Artificial Analysis Word Error Rate Index vs. Price

Chart: AA-WER v2 (% of words transcribed incorrectly, lower is better) plotted against price (USD per 1,000 minutes of audio), with the most attractive quadrant marked. AA-WER v2 incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), Earnings22-Cleaned-AA (25%).

Estimated cost in USD to transcribe 1,000 minutes of audio, normalized across providers with different billing models, and including billed reasoning tokens where available. Further detail on the methodology page.

Speed Factor

Input audio seconds transcribed per second · Higher is better

Audio file seconds transcribed per second of processing time. Higher factor indicates faster transcription speed. Reported Speed Factor values are medians across benchmark trials from the last 7 days; over-time chart points are daily medians. Artificial Analysis measurements are based on an audio duration of 10 minutes. Speed Factor may vary for other durations, particularly very short durations under 1 minute.
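
As a rough illustration of the definition above, the sketch below computes a Speed Factor per trial and takes the median across trials. The function names and the trial timings are hypothetical; only the ratio and the use of a median come from the description.

```python
from statistics import median


def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of input audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def reported_speed_factor(trials: list[tuple[float, float]]) -> float:
    """Median Speed Factor over (audio_seconds, processing_seconds) trials."""
    return median([speed_factor(audio, processing) for audio, processing in trials])


# Hypothetical timings for three trials on a 10-minute (600 s) audio file.
example_trials = [(600.0, 8.0), (600.0, 8.3), (600.0, 10.0)]
example_result = reported_speed_factor(example_trials)  # median of 75.0, ~72.3, 60.0
```

A median is less sensitive than a mean to a single slow trial, which may be why it is used for the reported value.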

Price

Price of Transcription

USD per 1000 minutes of audio · Lower is better

Estimated cost in USD to transcribe 1,000 minutes of audio, normalized across providers with different billing models, and including billed reasoning tokens where available. Further detail on the methodology page.
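
To make the normalization concrete, here is a hedged sketch that converts three common billing models to USD per 1,000 minutes of audio. It is not the Artificial Analysis methodology: the function names, the token rates and every price used below are hypothetical placeholders.

```python
MINUTES = 1_000  # target unit: USD per 1,000 minutes of audio


def from_per_minute(usd_per_minute: float) -> float:
    """Provider bills per minute of audio."""
    return usd_per_minute * MINUTES


def from_per_hour(usd_per_hour: float) -> float:
    """Provider bills per hour of audio."""
    return usd_per_hour / 60 * MINUTES


def from_tokens(
    audio_tokens_per_second: float,
    usd_per_million_input_tokens: float,
    output_tokens_per_minute: float,
    usd_per_million_output_tokens: float,
) -> float:
    """Provider bills per token. Output tokens here stand in for the
    transcript plus any billed reasoning tokens (an assumption)."""
    input_cost = audio_tokens_per_second * 60 * usd_per_million_input_tokens
    output_cost = output_tokens_per_minute * usd_per_million_output_tokens
    usd_per_minute = (input_cost + output_cost) / 1_000_000
    return usd_per_minute * MINUTES
```

With these placeholder inputs, a hypothetical price of 0.10 USD per hour works out to about 1.67 USD per 1,000 minutes, and 0.006 USD per minute works out to 6.00.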

Summary of Key Metrics & Further Information

Model | Provider | AA-WER (%) | Speed Factor | Price (USD per 1,000 min)
Qwen3.5 Omni Flash | Alibaba Cloud | 13.5% | 81.2 | 0.00
Qwen3.5 Omni Plus | Alibaba Cloud | 3.5% | 94.9 | 0.00
Nova 2 Pro | Amazon Bedrock | 4.9% | 22.6 | 3.10
Amazon Transcribe | Amazon Bedrock | 4.1% | 19.0 | 24.00
Universal-3 Pro | AssemblyAI | 3.1% | 92.8 | 3.50
Universal | AssemblyAI | 3.8% | 112.4 | 2.50
MAI-Transcribe-1.5 | Microsoft Azure | 2.4% | 271.9 | 6.00
MAI-Transcribe-1 | Microsoft Azure | 2.6% | 55.3 | 6.00
Nova-3 | Deepgram | 5.2% | 445.2 | 4.30
Nova-2 | Deepgram | 5.3% | 500.9 | 4.30
Base | Deepgram | 10.7% | 330.3 | 12.50
Scribe v2 | ElevenLabs | 2.2% | 38.3 | 3.67
Scribe v1 | ElevenLabs | 3.0% | 41.1 | 6.67
Solaria-1 | Gladia | 4.1% | 61.1 | 4.07
Gemini 3.1 Pro Preview (High) | Google | 2.8% | 6.4 | 18.15
Gemini 3.1 Pro Preview (Low) | Google | 3.6% | 7.1 | 7.72
Gemini 3 Flash (High) | Google | 2.9% | 17.9 | 13.70
Gemini 2.5 Flash Lite | Google | 5.2% | 68.5 | 6.56
Gemini 2.5 Flash | Google | 5.1% | 66.1 | 6.66
Gemini 2.5 Pro | Google | 2.9% | 13.3 | 11.39
Gemini 2.0 Flash Lite | Google | 3.8% | 56.1 | 0.19
Gemini 3.1 Flash-Lite Preview (Minimal) | Google | 3.4% | 68.2 | 5.83
Gradium Speech-to-Text | Gradium | 8.4% | 2.3 | 13.00
Grok Speech to Text | xAI | 4.0% | 72.3 | 1.67
Voxtral Mini Transcribe 2 | Mistral | 3.6% | 71.6 | 3.00
Voxtral Mini Transcribe | Mistral | 3.5% | 54.9 | 2.00
Voxtral Small | Mistral | 2.8% | 66.4 | 4.00
Voxtral Mini | DeepInfra | 3.8% | 69.9 | 1.00
Modulate STT Batch English VFast | Modulate | 13.0% | 198.5 | 0.42
Parakeet TDT 0.6B V3 | Together.ai | 4.5% | 918.0 | 1.50
Canary Qwen 2.5B, NVIDIA | Replicate | 4.3% | 5.3 | 0.74
Parakeet TDT 0.6B V2 | NVIDIA | 6.4% | 103.2 | 0.00
Parakeet RNNT 1.1B | Replicate | 5.4% | 5.9 | 1.91
GPT-4o Transcribe | OpenAI | 4.0% | 31.8 | 6.00
GPT-4o Mini Transcribe | OpenAI | 4.5% | 45.4 | 3.00
Rev AI | Rev AI | 5.9% | 12.6 | 3.33
Smallest Pulse | Smallest.ai | 4.4% | 139.8 | 5.00
Speechmatics Standard | Speechmatics | 5.1% | 69.6 | 4.00
Speechmatics Enhanced | Speechmatics | 4.0% | 52.6 | 6.70
Whisper Large v3 Turbo | Groq | 4.6% | 130.6 | 0.67
Whisper Large v3 Turbo | Fireworks | 4.7% | 224.1 | 1.00
Wizper Large v3 | fal.ai | 4.7% | 268.0 | 0.50
Incredibly Fast Whisper | Replicate | 5.7% | 55.7 | 1.49
Whisper Large v3 | Replicate | 10.1% | 2.8 | 4.23
Whisper Large v3 | fal.ai | 4.1% | 85.5 | 1.15
Whisper Large v3 | Fireworks | 4.6% | 300.3 | 1.00
Whisper Large v3 | Together.ai | 4.5% | 430.3 | 1.50
Whisper Large v2 | OpenAI | 4.1% | 27.4 | 6.00
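
One way to read the summary figures is to shortlist entries that are both accurate and cheap. The sketch below does this for a handful of rows copied from the summary, treating the three numbers per entry as AA-WER (%), Speed Factor and price in USD per 1,000 minutes (the order used in Highlights). The thresholds, the `Entry` type and the `shortlist` function are hypothetical choices for illustration, not the quadrant boundaries used in the chart.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    provider: str
    wer_pct: float       # AA-WER, % of words transcribed incorrectly
    speed_factor: float  # audio seconds transcribed per second
    price_usd: float     # USD per 1,000 minutes of audio


# A few rows taken from the summary figures.
ENTRIES = [
    Entry("Scribe v2", "ElevenLabs", 2.2, 38.3, 3.67),
    Entry("MAI-Transcribe-1.5", "Microsoft Azure", 2.4, 271.9, 6.00),
    Entry("Grok Speech to Text", "xAI", 4.0, 72.3, 1.67),
    Entry("Amazon Transcribe", "Amazon Bedrock", 4.1, 19.0, 24.00),
    Entry("Parakeet TDT 0.6B V3", "Together.ai", 4.5, 918.0, 1.50),
    Entry("Nova-3", "Deepgram", 5.2, 445.2, 4.30),
]


def shortlist(entries: list[Entry], max_wer_pct: float, max_price_usd: float) -> list[Entry]:
    """Keep entries at or below both thresholds, most accurate first."""
    kept = [
        e for e in entries
        if e.wer_pct <= max_wer_pct and e.price_usd <= max_price_usd
    ]
    return sorted(kept, key=lambda e: (e.wer_pct, e.price_usd))


# Hypothetical thresholds: at most 4.5% WER and 4.00 USD per 1,000 minutes.
picks = shortlist(ENTRIES, max_wer_pct=4.5, max_price_usd=4.00)
```

With these thresholds the shortlist contains Scribe v2, Grok Speech to Text and Parakeet TDT 0.6B V3, in that order. A Speed Factor floor could be added the same way if latency matters.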

Speech to Text providers compared: xAI.