Grok Speech to Text

Grok Speech to Text: API Provider Benchmarking & Analysis

Name: WER Index (Non-streaming)
Creator: Artificial Analysis
License: https://artificialanalysis.ai/docs/legal/Terms-of-Use.pdf

Creator: SpaceXAI

License: Proprietary


Analysis of Grok Speech to Text API providers across performance metrics including Artificial Analysis Word Error Rate Index, speed, and price.

Highlights

WER Index (Non-streaming)

AA-WER v2 · % of words transcribed incorrectly · Lower is better

Speed Factor

Input audio seconds transcribed per second · Higher is better

Price

USD per 1000 minutes of audio · Lower is better

Artificial Analysis Word Error Rate (AA-WER) Index by API

% of words transcribed incorrectly · Lower is better · AA-WER v2 incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), Earnings22-Cleaned-AA (25%)

Note: For Earnings22, if a model cannot reliably handle full-length audio due to time limits, we chunk to ~9 minutes (relevant to: Nova 2 Pro, Amazon; GPT-4o Transcribe, OpenAI; GPT-4o Mini Transcribe, OpenAI). For models with even shorter time limits, we chunk to ~30 seconds (relevant to: Canary Qwen 2.5B, NVIDIA).
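The chunking described in the note above can be illustrated with a minimal sketch. The function name `chunk_boundaries` and the fixed-length strategy are assumptions made for illustration; the source does not say how chunk boundaries are actually chosen (for example, whether they are aligned to pauses in speech).

```python
def chunk_boundaries(total_seconds: float, max_chunk_seconds: float) -> list[tuple[float, float]]:
    """Return (start, end) offsets, in seconds, covering the audio in
    consecutive chunks no longer than max_chunk_seconds.

    Hypothetical helper: fixed-length splitting is an assumption, not the
    benchmark's documented procedure.
    """
    if total_seconds < 0 or max_chunk_seconds <= 0:
        raise ValueError("total_seconds must be >= 0 and max_chunk_seconds must be > 0")
    total = float(total_seconds)
    step = float(max_chunk_seconds)
    boundaries = []
    start = 0.0
    while start < total:
        end = min(start + step, total)  # the last chunk may be shorter
        boundaries.append((start, end))
        start = end
    return boundaries


# Hypothetical 20-minute recording split at a 9-minute limit.
for start, end in chunk_boundaries(20 * 60, 9 * 60):
    print(f"{start:.0f}s to {end:.0f}s")
```

With a 30-second limit the same call would simply pass `30` as the second argument.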

Measures transcription accuracy across 3 datasets to evaluate models in real-world speech with diverse accents, domain-specific language, and challenging channel & acoustic conditions.

AA-WER is calculated as an audio-duration-weighted average of WER across ~8 hours from three datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), and Earnings22-Cleaned-AA (25%). See methodology for more detail.
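As a rough illustration of the calculation described above, the sketch below computes a word error rate from a word-level edit distance and then combines per-dataset values using the stated 50/25/25 weights. The function names, the whitespace tokenization, and the per-dataset WER values in the usage example are assumptions; the actual methodology (including any text normalization and how audio duration enters the weighting) is described on the methodology page, not here.

```python
# Dataset weights as stated in the text above.
WEIGHTS = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli-Cleaned-AA": 0.25,
    "Earnings22-Cleaned-AA": 0.25,
}


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words.

    Assumption: words are split on whitespace with no normalization.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def weighted_wer_index(per_dataset_wer: dict[str, float]) -> float:
    """Combine per-dataset WER values with the fixed weights above (hypothetical helper)."""
    return sum(weight * per_dataset_wer[name] for name, weight in WEIGHTS.items())


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))

# Made-up per-dataset WER values, for illustration only.
example = {"AA-AgentTalk": 0.04, "VoxPopuli-Cleaned-AA": 0.02, "Earnings22-Cleaned-AA": 0.06}
print(f"{weighted_wer_index(example):.2%}")
```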

API Benchmarks

Artificial Analysis Word Error Rate Index vs. Price

% of words transcribed incorrectly · Lower is better · AA-WER v2 incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), Earnings22-Cleaned-AA (25%) · USD per 1000 minutes of audio



Speed Factor

Input audio seconds transcribed per second · Higher is better

Audio file seconds transcribed per second of processing time. A higher factor indicates faster transcription. Reported Speed Factor values are medians across benchmark trials from the last 7 days; points on over-time charts are daily medians. Artificial Analysis measurements are based on an audio duration of 10 minutes. Speed Factor may vary for other durations, particularly very short durations under 1 minute.
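The definition above reduces to a ratio followed by a median. The sketch below assumes hypothetical helper names and made-up trial timings; it is not the benchmark's actual measurement code.

```python
from statistics import median


def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of input audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def median_speed_factor(audio_seconds: float, trial_processing_seconds: list[float]) -> float:
    """Median Speed Factor across several trials on the same audio file."""
    return median(speed_factor(audio_seconds, t) for t in trial_processing_seconds)


# A 10-minute file (600 s) and three made-up processing times in seconds.
print(median_speed_factor(600, [3.0, 4.0, 6.0]))
```

Using the median rather than the mean keeps a single unusually slow or fast trial from dominating the reported value.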

Price

Price of Transcription

USD per 1000 minutes of audio

Estimated cost in USD to transcribe 1,000 minutes of audio, normalized across providers with different billing models and including billed reasoning tokens where available. See the methodology page for further detail.
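To make the normalization concrete, the sketch below converts two common billing styles into USD per 1,000 minutes of audio. The function names, the rates, and the tokens-per-minute figure are all hypothetical; the source does not specify how each provider's billing is converted.

```python
def usd_per_1000_minutes(rate_usd: float, billed_unit_seconds: float) -> float:
    """Convert a price charged per billing unit of audio (e.g. per minute = 60 s,
    per hour = 3600 s) into USD per 1,000 minutes of audio."""
    if billed_unit_seconds <= 0:
        raise ValueError("billed_unit_seconds must be positive")
    units_per_1000_minutes = 1000 * 60 / billed_unit_seconds
    return rate_usd * units_per_1000_minutes


def token_usd_per_1000_minutes(tokens_per_audio_minute: float, usd_per_million_tokens: float) -> float:
    """Estimate the cost for token-billed models.

    Assumption: the total billed tokens (input, output, and any reasoning
    tokens) can be summarized as an average count per minute of audio.
    """
    total_tokens = tokens_per_audio_minute * 1000
    return total_tokens * usd_per_million_tokens / 1_000_000


# Hypothetical rates, for illustration only.
print(round(usd_per_1000_minutes(0.006, 60), 2))         # per-minute billing
print(round(usd_per_1000_minutes(0.36, 3600), 2))        # per-hour billing
print(round(token_usd_per_1000_minutes(2000, 1.50), 2))  # token billing
```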

Summary of Key Metrics & Further Information

Model	Provider	Whisper Version	AA-WER	Speed Factor	Price (USD per 1000 min)	Further Details
Qwen3.5 Omni Flash	Alibaba Cloud		13.5%	79.2	0.00	Details
Qwen3.5 Omni Plus	Alibaba Cloud		3.5%	98.2	0.00	Details
Nova 2 Pro	Amazon Bedrock		4.9%	22.8	3.10	Details
Amazon Transcribe	Amazon Bedrock		4.1%	13.2	6.00	Details
Universal-3 Pro	AssemblyAI		3.1%	113.0	3.50	Details
Universal, AssemblyAI	AssemblyAI		3.8%	123.7	2.50	Details
MAI-Transcribe-1.5	Microsoft Azure		2.4%	195.6	6.00	Details
MAI-Transcribe-1	Microsoft Azure		2.6%	67.3	6.00	Details
transcribe-03-2026	Cohere		4.6%	126.3	0.00	Details
Nova-3	Deepgram		5.2%	419.6	4.30	Details
Scribe v2	ElevenLabs		2.2%	50.7	3.67	Details
Solaria-1, Gladia	Gladia		4.1%	80.4	10.17	Details
Solaria-3, Gladia	Gladia		3.2%	62.1	10.16	Details
Gemini 3.1 Pro Preview (High)	Google		2.8%	7.5	18.15	Details
Gemini 3.1 Pro Preview (Low)	Google		3.6%	7.2	7.72	Details
Gemini 3 Flash (High)	Google		2.9%	17.4	13.70	Details
Gemini 2.5 Flash Lite	Google		5.2%	62.9	6.56	Details
Gemini 2.5 Flash	Google		5.1%	75.5	6.66	Details
Gemini 2.5 Pro	Google		2.9%	13.9	11.39	Details
Gemini 3.1 Flash-Lite Preview (Minimal)	Google		3.4%	75.0	5.83	Details
Gradium Speech-to-Text	Gradium		6.8%	2.3	13.00	Details
Grok Speech to Text, SpaceXAI	SpaceXAI		4.0%	189.3	1.67	Details
Voxtral Mini Transcribe 2	Mistral		3.6%	78.3	3.00	Details
Voxtral Small	Mistral		2.8%	66.7	4.00	Details
Voxtral Mini	DeepInfra		3.8%	79.8	1.00	Details
Modulate STT Batch English VFast	Modulate		4.2%	67.3	0.42	Details
Parakeet TDT 0.6B V3, Togetherai	Together AI		4.5%	893.1	1.50	Details
Canary Qwen 2.5B, NVIDIA	Replicate		4.3%	5.8	0.74	Details
Parakeet TDT 0.6B V2, NVIDIA	NVIDIA		6.4%	89.2	0.00	Details
Parakeet RNNT 1.1B	Replicate		5.4%	6.4	1.91	Details
GPT Transcribe, OpenAI	OpenAI		3.3%	35.0	4.50	Details
GPT-4o Transcribe	OpenAI		4.0%	36.7	6.00	Details
GPT-4o Mini Transcribe	OpenAI		4.5%	40.8	3.00	Details
Smallest AI Pulse Pro	Smallest.ai		2.4%	252.6	4.00	Details
Resonant-1	Reson8		3.4%	334.7	3.60	Details
Rev AI	Rev AI		5.9%	12.8	3.33	Details
Smallest AI Pulse	Smallest.ai		4.4%	231.6	5.00	Details
Soniox v5 Async	Soniox		3.8%	20.1	1.66	Details
Soniox V4	Soniox		3.9%	18.3	1.66	Details
Speechmatics Melia	Speechmatics		4.9%	73.7	4.00	Details
Speechmatics Standard	Speechmatics		5.1%	39.0	7.50	Details
Speechmatics Enhanced	Speechmatics		4.0%	36.6	12.50	Details
Whisper Large v3 Turbo	Groq	v3 Turbo	4.6%	108.2	0.67	Details
Wizper Large v3	fal.ai	large-v3	4.7%	225.3	0.50	Details
Incredibly Fast Whisper	Replicate	large-v3	5.7%	56.6	1.49	Details
Whisper Large v3	Replicate	large-v3	10.1%	2.8	4.23	Details
Whisper Large v3	fal.ai	large-v3	4.1%	50.5	1.15	Details
Whisper Large v3	Together AI	large-v3	4.5%	496.7	1.50	Details
Whisper Large v2	OpenAI	large-v2	4.1%	27.6	6.00	Details

Speech to Text providers compared: SpaceXAI.