Azure Speech Service: API Provider Benchmarking & Analysis

Analysis of Azure Speech Service API providers across performance metrics including Artificial Analysis Word Error Rate Index, speed, and price.
Creator: Microsoft Azure
License: Open

Highlights

  • Word Error Rate Index: % of words transcribed incorrectly; lower is better
  • Speed Factor: input audio seconds transcribed per second; higher is better
  • Price: USD per 1000 minutes of audio; lower is better

Artificial Analysis Word Error Rate (AA-WER) Index by API

% of words transcribed incorrectly, Lower is better
Note: Models that do not support transcription of audio longer than 10 minutes were evaluated on 9-minute chunks of the test set (applies to GPT-4o Transcribe; GPT-4o Mini Transcribe; Voxtral Mini; Voxtral Mini, Deepinfra; Gemini 2.5 Flash Lite). For models with even shorter time limits, all files are split into 30-second chunks (applies to Granite Speech 3.3 8B, IBM; Qwen3 ASR Flash).
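
The chunking described in the note can be sketched as follows. The source does not say how chunk boundaries are chosen (for example, whether chunks overlap or are aligned to silence), so fixed, non-overlapping windows are an assumption here, and `chunk_bounds` is a hypothetical helper name.

```python
def chunk_bounds(duration_s: float, chunk_s: float) -> list[tuple[float, float]]:
    """Split [0, duration_s) into consecutive (start, end) windows of at most chunk_s seconds.

    Assumption: fixed-length, non-overlapping windows; the final window may be shorter.
    """
    if duration_s < 0 or chunk_s <= 0:
        raise ValueError("duration_s must be >= 0 and chunk_s must be > 0")
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


NINE_MINUTES_S = 9 * 60   # chunk length for models limited to 10 minutes of audio
THIRTY_SECONDS_S = 30     # chunk length for models with shorter limits

# A hypothetical 25-minute file split for a model with a 10-minute limit.
print(chunk_bounds(25 * 60, NINE_MINUTES_S))
```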

Measures transcription accuracy across 3 datasets to evaluate models in real-world speech with diverse accents, domain-specific language, and challenging channel & acoustic conditions.
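
For orientation, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below follows that convention. The text normalization applied by Artificial Analysis is not described on this page, so lowercasing and whitespace splitting are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Assumption: text is normalized only by lowercasing and splitting on whitespace.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("the quarterly revenue grew", "the quarter revenue"))
```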

AA-WER is calculated as an audio-duration-weighted average of WER across ~2 hours from three datasets: VoxPopuli, Earnings-22, and AMI-SDM. See methodology for more detail.
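
A minimal sketch of an audio-duration-weighted average is shown below. Whether the weighting is applied per file or per dataset is not stated here, and the numbers in the example are made up for illustration; they are not benchmark results.

```python
def duration_weighted_wer(results: list[tuple[float, float]]) -> float:
    """Average WER weighted by audio duration.

    results: (wer, audio_seconds) pairs, one per dataset or file (an assumption).
    """
    total_seconds = sum(seconds for _, seconds in results)
    if total_seconds <= 0:
        raise ValueError("total audio duration must be positive")
    return sum(wer * seconds for wer, seconds in results) / total_seconds


# Hypothetical per-dataset results: (WER, seconds of audio).
example = [
    (0.10, 1800.0),  # e.g. a parliamentary-speech set
    (0.20, 3600.0),  # e.g. an earnings-call set
    (0.30, 1800.0),  # e.g. a meeting-recording set
]
print(duration_weighted_wer(example))
```

Longer datasets pull the index toward their own WER, so a plain mean of the three values can differ from the weighted figure whenever durations are unequal.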

Artificial Analysis Word Error Rate (AA-WER) Index by Individual Dataset

% of words transcribed incorrectly, Lower is better
  • VoxPopuli (Parliamentary proceedings)
  • Earnings 22 Full (Corporate earnings calls)
  • AMI SDM (Meeting transcripts)

Artificial Analysis Word Error Rate (AA-WER) Index vs Other Metrics

Artificial Analysis Word Error Rate Index vs. Price

Axes: % of words transcribed incorrectly vs. USD per 1000 minutes of audio. Marker size represents input audio seconds transcribed per second; the chart highlights the most attractive quadrant.
Models shown: Amazon Transcribe; Azure AI Speech Service; Chirp 2, Google; Nova-3; Qwen3 ASR Flash; Rev AI; Scribe, ElevenLabs; Speechmatics Enhanced; Voxtral Small; Whisper (L, v3, Turbo), Groq; Whisper (L, v3), Fireworks.

Cost in USD per 1000 minutes of audio transcribed. Reflects the pricing model of the transcription service or software.

Artificial Analysis Word Error Rate Index vs. Speed Factor

Axes: % of words transcribed incorrectly vs. input audio seconds transcribed per second. Marker size represents USD per 1000 minutes of audio; the chart highlights the most attractive quadrant.
Models shown: Amazon Transcribe; Azure AI Speech Service; Chirp 2, Google; Nova-3; Parakeet TDT 0.6B V2, NVIDIA; Scribe, ElevenLabs; Speechmatics Enhanced; Voxtral Small; Whisper (L, v3, Turbo), Groq; Whisper (L, v3), Fireworks.

Audio file seconds transcribed per second of processing time. Higher factor indicates faster transcription speed.

Artificial Analysis measurements are based on an audio duration of 10 minutes. Speed Factor may vary for other durations, particularly for very short durations (under 1 minute).
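
The Speed Factor definition reduces to a single ratio. The sketch below uses hypothetical timings for a 10-minute file; how processing time is measured (for example, whether upload and queueing time are included) is not specified on this page and is left to the caller.

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of input audio transcribed per second of processing time."""
    if audio_seconds <= 0 or processing_seconds <= 0:
        raise ValueError("both durations must be positive")
    return audio_seconds / processing_seconds


def expected_processing_seconds(audio_seconds: float, factor: float) -> float:
    """Invert the ratio: processing time implied by a given Speed Factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


TEN_MINUTES_S = 600.0

# Hypothetical: a 10-minute file transcribed in 12 seconds.
print(speed_factor(TEN_MINUTES_S, 12.0))

# Hypothetical: a Speed Factor of 2.0 implies 5 minutes of processing for the same file.
print(expected_processing_seconds(TEN_MINUTES_S, 2.0))
```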

Speed Factor

Input audio seconds transcribed per second, Higher is better

Speed Factor Variance

Input audio seconds transcribed per second, Results by percentile, Higher is better
The central marker is the median; the other markers represent the minimum, 25th percentile, 75th percentile, and maximum.
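
A five-point summary of this kind can be sketched with Python's standard library. The percentile interpolation method used for the chart is not stated, so the `inclusive` method of `statistics.quantiles` is an assumption, and the sample values are made up.

```python
import statistics


def five_point_summary(samples: list[float]) -> dict[str, float]:
    """Min, 25th percentile, median, 75th percentile, and max of Speed Factor samples.

    Assumption: quartiles use the 'inclusive' interpolation method.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    p25, median, p75 = statistics.quantiles(samples, n=4, method="inclusive")
    return {
        "min": min(samples),
        "p25": p25,
        "median": median,
        "p75": p75,
        "max": max(samples),
    }


# Hypothetical Speed Factor measurements.
print(five_point_summary([50.0, 10.0, 30.0, 20.0, 40.0]))
```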

Speed Factor, Over Time

Input audio seconds transcribed per second, Higher is better
Models shown: Azure AI Speech Service; Speechmatics Enhanced; Whisper (L, v3, Turbo), Groq; Whisper (L, v3), Fireworks; Amazon Transcribe; Rev AI; Nova-3; Chirp 2, Google; Scribe, ElevenLabs; Voxtral Small; Parakeet TDT 0.6B V2, NVIDIA; Gemini 2.5 Pro; Qwen3 ASR Flash.

Median measurement per day, based on 8 measurements each day at different times. Labels represent start of week's measurements.
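
The daily aggregation can be sketched as grouping timestamped measurements by calendar day and taking the median of each group. The timestamps and values below are hypothetical, and grouping by the timestamp's own date (rather than a specific time zone) is an assumption.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def daily_medians(measurements: list[tuple[datetime, float]]) -> dict:
    """Map each calendar day to the median of that day's Speed Factor measurements."""
    by_day = defaultdict(list)
    for timestamp, value in measurements:
        by_day[timestamp.date()].append(value)
    return {day: median(values) for day, values in sorted(by_day.items())}


# Hypothetical day with 8 measurements taken three hours apart.
values = [40.0, 42.0, 38.0, 41.0, 39.0, 43.0, 37.0, 44.0]
samples = [(datetime(2025, 1, 6, hour=3 * i), v) for i, v in enumerate(values)]
print(daily_medians(samples))
```

With an even number of measurements per day, the median is the mean of the two middle values, so a single slow or fast run does not move the daily point much.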

Speed Factor vs. Price

Axes: input audio seconds transcribed per second vs. USD per 1000 minutes of audio. The chart highlights the most attractive quadrant.
Models shown: Amazon Transcribe; Azure AI Speech Service; Chirp 2, Google; Nova-3; Scribe, ElevenLabs; Speechmatics Enhanced; Voxtral Small; Whisper (L, v3, Turbo), Groq; Whisper (L, v3), Fireworks.

Price

Price of Transcription

USD per 1000 minutes of audio, Lower is better

For providers that price on processing time rather than audio duration (incl. Replicate, fal), we have calculated an indicative per-minute price based on the processing time expected per minute of audio. Further detail is available on the methodology page.
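
One way such an indicative price could be derived is sketched below. The exact formula is on the methodology page and is not reproduced here, so this conversion (expected processing seconds per audio minute taken as 60 divided by the Speed Factor) is an assumption, and the rate in the example is hypothetical.

```python
def indicative_price_per_1000_min(price_per_processing_second: float,
                                  speed_factor: float) -> float:
    """Convert a processing-time price into USD per 1000 minutes of audio.

    Assumption: processing time per audio minute is 60 / speed_factor seconds.
    """
    if price_per_processing_second < 0 or speed_factor <= 0:
        raise ValueError("price must be >= 0 and speed_factor must be > 0")
    processing_seconds_per_audio_minute = 60.0 / speed_factor
    price_per_audio_minute = price_per_processing_second * processing_seconds_per_audio_minute
    return price_per_audio_minute * 1000


# Hypothetical: USD 0.001 per processing second at a Speed Factor of 60.
print(indicative_price_per_1000_min(0.001, 60.0))
```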

Note: Groq charges for a minimum of 10s per request.
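
The effect of a minimum billed duration on short clips can be sketched as follows. The 10-second minimum comes from the note above; the list price in the example is hypothetical, and rounding rules beyond the minimum are not covered.

```python
def billed_seconds(audio_seconds: float, minimum_seconds: float = 10.0) -> float:
    """Duration charged for one request when a per-request minimum applies."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return max(audio_seconds, minimum_seconds)


def effective_price_per_1000_min(list_price_per_1000_min: float,
                                 audio_seconds: float,
                                 minimum_seconds: float = 10.0) -> float:
    """List price scaled by the ratio of billed duration to actual audio duration."""
    multiplier = billed_seconds(audio_seconds, minimum_seconds) / audio_seconds
    return list_price_per_1000_min * multiplier


# Hypothetical list price of USD 1.00 per 1000 minutes.
print(effective_price_per_1000_min(1.00, audio_seconds=2.0))   # 2 s clip billed as 10 s
print(effective_price_per_1000_min(1.00, audio_seconds=60.0))  # minimum has no effect
```

Under this sketch, requests shorter than the minimum pay proportionally more per minute of audio, which matters mainly for workloads made up of very short clips.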

Summary of Key Metrics & Further Information
| Provider | Model | Whisper version | Word Error Rate (%) | Median Speed Factor | Price (USD per 1000 minutes) |
|---|---|---|---|---|---|
| OpenAI | Whisper Large v2 | large-v2 | 15.8% | 30.3 | 6.00 |
| Microsoft Azure | Whisper Large v2 | large-v2 | 27.2% | 34.0 | 6.00 |
| fal.ai | Whisper Large v3 | large-v3 | 16.8% | 286.7 | 0.50 |
| Replicate | Incredibly Fast Whisper | large-v3 | 18.2% | 63.5 | 1.49 |
| Replicate | Whisper Large v2 | large-v2 | 15.8% | 2.8 | 3.47 |
| Replicate | Whisper Large v3 | large-v3 | 24.6% | 2.8 | 4.23 |
| Replicate | WhisperX | large-v3 | 16.3% | 17.9 | 1.09 |
| Groq | Whisper Large v3 | large-v3 | 16.8% | 320.5 | 1.85 |
| Groq | Distil-Whisper | - | - | - | 0.33 |
| Deepinfra | Whisper Large v3 | large-v3 | 16.8% | 104.2 | 0.45 |
| fal.ai | Whisper Large v3 | large-v3 | 16.8% | 141.9 | 1.15 |
| Groq | Whisper Large v3 Turbo | v3 Turbo | - | 395.8 | 0.67 |
| Fireworks | Whisper Large v3 | large-v3 | - | 425.0 | 1.00 |
| Fireworks | Whisper Large v3 Turbo | v3 Turbo | 17.8% | 494.8 | 1.00 |
| SambaNova | Whisper-Large-v3 | large-v3 | 16.8% | 123.5 | 1.67 |
| Together.ai | Whisper Large v3 | large-v3 | 24.6% | 126.6 | 1.50 |
| OpenAI | Whisper v1 | 1 | - | - | 0.00 |
| Speechmatics | Speechmatics Standard | - | 16.0% | 43.1 | 4.00 |
| Speechmatics | Speechmatics Enhanced | - | 14.4% | 24.1 | 6.70 |
| Microsoft Azure | Azure AI Speech Service | - | 17.2% | 2.0 | 16.67 |
| AssemblyAI | Nano | - | 16.3% | 87.8 | 2.00 |
| AssemblyAI | Universal 2, AssemblyAI | - | 14.5% | 85.9 | 2.50 |
| AssemblyAI | Slam-1 | - | 15.2% | 31.2 | 4.50 |
| Deepgram | Nova-2 | - | 17.3% | 535.9 | 4.30 |
| Deepgram | Base | - | 21.9% | 491.9 | 12.50 |
| Deepgram | Nova-3 | - | 18.3% | 443.4 | 4.30 |
| Amazon Bedrock | Amazon Transcribe | - | 14.0% | 19.3 | 24.00 |
| Fish Audio | Fish Speech to Text | - | - | - | 0.00 |
| Rev AI | Rev AI | - | 15.2% | - | 20.00 |
| Google | Chirp 2, Google | - | 11.6% | 18.0 | 16.00 |
| Google | Chirp | - | 16.9% | 14.3 | 16.00 |
| Google | Chirp 3, Google | - | 15.0% | 32.6 | 16.00 |
| ElevenLabs | Scribe, ElevenLabs | - | - | 46.6 | 6.67 |
| Google | Gemini 2.0 Flash | - | 17.9% | 53.4 | 1.40 |
| Google | Gemini 2.0 Flash Lite | - | 16.6% | 57.3 | 0.19 |
| Google | Gemini 2.5 Flash Lite | - | 16.1% | 93.0 | 0.58 |
| Google | Gemini 2.5 Flash | - | 19.2% | 99.2 | 1.92 |
| Google | Gemini 2.5 Pro | - | 15.0% | 10.4 | 0.00 |
| OpenAI | GPT-4o Transcribe | - | 21.3% | 28.0 | 6.00 |
| OpenAI | GPT-4o Mini Transcribe | - | 20.1% | 34.3 | 3.00 |
| IBM | Granite Speech 3.3 8B, IBM | - | 15.7% | - | 0.00 |
| Replicate | Parakeet RNNT 1.1B | - | - | 6.3 | 1.91 |
| NVIDIA | Parakeet TDT 0.6B V2, NVIDIA | - | - | 63.6 | 0.00 |
| Replicate | Canary Qwen 2.5B, NVIDIA | - | 13.2% | 5.5 | 0.74 |
| Mistral | Voxtral Mini | - | 15.8% | 59.5 | 1.00 |
| Mistral | Voxtral Small | - | 14.7% | 65.2 | 4.00 |
| Deepinfra | Voxtral Small | - | 14.7% | 25.5 | 3.00 |
| Deepinfra | Voxtral Mini | - | 15.8% | 81.0 | 1.00 |
| Alibaba Cloud | Qwen3 ASR Flash | - | 15.0% | - | 1.92 |
| Alibaba Cloud | Qwen3 Omni | - | 52.3% | - | 0.00 |
| Alibaba Cloud | Qwen3 Omni Captioner | - | - | - | 5.72 |
| Gladia | Solaria-1, Gladia | - | 17.4% | 48.2 | 8.33 |

A dash (-) marks a cell with no value shown.

Speech to Text providers compared: OpenAI, Speechmatics, Microsoft Azure, AssemblyAI, fal.ai, Replicate, Deepgram, Groq, Deepinfra, Fireworks, Amazon Bedrock, Fish Audio, Rev AI, Google, ElevenLabs, SambaNova, IBM, Together.ai, Mistral, NVIDIA, Alibaba Cloud, and Gladia.

