Stay connected with us on X, Discord, and LinkedIn to stay up to date with future analysis

Azure Speech Service: API Provider Benchmarking & Analysis

Analysis of Azure Speech Service API providers across performance metrics including Artificial Analysis Word Error Rate Index, speed, and price.

Creator:

Microsoft Azure

License:

Open

Link:

Visit

Word Error Rate Index

AA-WER v2.0; % of words transcribed incorrectly; lower is better.

Speed Factor

Input audio seconds transcribed per second; Higher is better

Price

USD per 1000 minutes of audio; Lower is better

Navigation

AA-WER API Benchmarks Speed Factor Price

Artificial Analysis Word Error Rate (AA-WER) Index by API

% of words transcribed incorrectly; lower is better. AA-WER v2.0 incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), Earnings22-Cleaned-AA (25%)

Note: For Earnings22, if a model cannot reliably handle full-length audio due to time limits, we chunk to ~9 minutes (relevant to: GPT-4o Transcribe, OpenAI).

Measures transcription accuracy across 3 datasets to evaluate models in real-world speech with diverse accents, domain-specific language, and challenging channel & acoustic conditions.

AA-WER is calculated as an audio-duration-weighted average of WER across ~8 hours from three datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), and Earnings22-Cleaned-AA (25%). See methodology for more detail.

API Benchmarks

Artificial Analysis Word Error Rate Index vs. Price

% of words transcribed incorrectly; lower is better. AA-WER v2.0 incorporates 3 datasets: AA-AgentTalk (50%), VoxPopuli-Cleaned-AA (25%), Earnings22-Cleaned-AA (25%), USD per 1000 minutes of audio

Most attractive quadrant

Enhanced

Gemini 3 Flash

Gemini 3 Pro

GPT-4o Transcribe

Scribe v1

Scribe v2

Solaria-1

Universal

Universal-3 Pro

Voxtral Small

Whisper (L, v3), Fireworks

Measures transcription accuracy across 3 datasets to evaluate models in real-world speech with diverse accents, domain-specific language, and challenging channel & acoustic conditions.

Cost in USD per 1000 minutes of audio transcribed. Reflects the pricing model of the transcription service or software.

Speed Factor

Input audio seconds transcribed per second, Higher is better

Audio file seconds transcribed per second of processing time. Higher factor indicates faster transcription speed.

Artificial Analysis measurements are based on a audio duration of 10 minutes. Speed Factor may vary for other durations, particularly for very short durations (under 1 minute).

Price

Price of Transcription

USD per 1000 minutes of audio, Lower is better

Cost in USD per 1000 minutes of audio transcribed. Reflects the pricing model of the transcription service or software.

For providers which do not price based on audio duration and rather on processing time (incl. Replicate, fal), we have calculated an indicative per minute price based on processing time expected per minute of audio.Further detail present on methodology page.

Note: Groq chargers for a minimum of 10s per request.

Summary of Key Metrics & Further Information

Provider	Model	Whisper version	Word Error Rate (%)	Median Speed Factor	Price (USD per 1000 minutes)	Further Details
OpenAI	Whisper Large v2	large-v2	4.2%	29.3	6.00	Details
fal.ai	Wizper Large v3	large-v3	4.9%	249.5	0.50	Details
Replicate	Incredibly Fast Whisper	large-v3	5.8%	56.2	1.49	Details
Replicate	Whisper Large v3	large-v3	10.2%	2.8	4.23	Details
fal.ai	Whisper Large v3	large-v3	4.3%	41.9	1.15	Details
Groq	Whisper Large v3 Turbo	v3 Turbo	4.8%	388.8	0.67	Details
Fireworks	Whisper Large v3	large-v3	4.8%	205.6	1.00	Details
Fireworks	Whisper Large v3 Turbo	v3 Turbo	4.8%	380.8	1.00	Details
Together.ai	Whisper Large v3	large-v3	7.4%	121.2	1.50	Details
Speechmatics	Speechmatics Standard		5.3%	45.0	4.00	Details
Speechmatics	Speechmatics Enhanced		4.3%	24.8	6.70	Details
Deepgram	Nova-2		5.6%	425.8	4.30	Details
Deepgram	Base		10.9%	507.8	12.50	Details
Deepgram	Nova-3		6.5%	241.6	4.30	Details
AssemblyAI	Universal, AssemblyAI		4.0%	118.6	2.50	Details
AssemblyAI	Slam-1		4.1%	79.3	4.50	Details
AssemblyAI	Universal-3 Pro		3.3%	71.1	3.50	Details
Amazon Bedrock	Amazon Transcribe		4.3%	18.1	24.00	Details
Google	Chirp		31.3%	14.0	16.00	Details
Google	Chirp 2, Google		6.0%	19.4	16.00	Details
Google	Chirp 3, Google		4.6%	23.0	16.00	Details
ElevenLabs	Scribe v1		3.2%	34.7	6.67	Details
ElevenLabs	Scribe v2		2.3%	33.8	6.67	Details
Google	Gemini 2.0 Flash		4.0%	50.3	1.40	Details
Google	Gemini 2.0 Flash Lite		4.0%	49.5	0.19	Details
Google	Gemini 2.5 Flash Lite		5.3%	70.7	0.58	Details
Google	Gemini 2.5 Flash		5.3%	55.4	1.92	Details
Google	Gemini 2.5 Pro		3.1%	12.3	4.80	Details
Google	Gemini 3 Pro		2.9%	5.4	7.68	Details
Google	Gemini 3 Flash		3.1%	11.9	1.92	Details
OpenAI	GPT-4o Transcribe		4.1%	33.0	6.00	Details
OpenAI	GPT-4o Mini Transcribe		4.6%	50.2	3.00	Details
Replicate	Parakeet RNNT 1.1B		5.0%	6.2	1.91	Details
NVIDIA	Parakeet TDT 0.6B V2, NVIDIA		6.8%	58.8	0.00	Details
Replicate	Canary Qwen 2.5B, NVIDIA		4.4%	5.7	0.74	Details
Mistral	Voxtral Mini		3.7%	69.0	1.00	Details
Mistral	Voxtral Small		3.0%	67.3	4.00	Details
DeepInfra	Voxtral Mini		4.0%	83.3	1.00	Details
Gladia	Solaria-1, Gladia		4.2%	51.2	8.33	Details
Amazon Bedrock	Nova 2 Omni		5.9%	34.9	1.85	Details
Amazon Bedrock	Nova 2 Pro		5.0%	23.4	3.10	Details

Speech to Text providers compared: OpenAI, Speechmatics, fal.ai, Replicate, Deepgram, Groq, Fireworks, AssemblyAI, Amazon Bedrock, Google, ElevenLabs, Together.ai, Mistral, DeepInfra, NVIDIA, and Gladia.

Azure Speech Service: API Provider Benchmarking & Analysis

Navigation

Artificial Analysis Word Error Rate (AA-WER) Index by API

Artificial Analysis Word Error Rate (AA-WER) Index by API

Artificial Analysis Word Error Rate (AA-WER) Index

API Benchmarks

Artificial Analysis Word Error Rate Index vs. Price

Artificial Analysis Word Error Rate (AA-WER) Index

Price

Speed Factor

Speed Factor

Speed Factor

Speed Factor Detail

Price

Price of Transcription

Price

Price Providers Detail