Azure Speech Service: API Provider Benchmarking & Analysis
Analysis of Azure Speech Service against other speech-to-text API providers across performance metrics including word error rate, speed, and price.
API providers compared include OpenAI, AssemblyAI, Speechmatics, Microsoft Azure, fal.ai, Replicate, Deepgram, Gladia, Groq, Deepinfra, Fireworks, Amazon Bedrock, Rev AI, and Google.
Highlights
Word error rate
Word error rate: % of words transcribed incorrectly (June '24), Lower is better
Speed Factor
Speed factor: Input audio seconds transcribed per second, Higher is better
Price
Price: USD per 1000 minutes of audio, Lower is better
Summary analysis
Word Error Rate vs. Price
Word error rate: % of words transcribed incorrectly (June '24), Price: USD per 1000 minutes of audio
Most attractive quadrant
Size represents Speed factor: Input audio seconds transcribed per second
Word Error Rate: Percentage of words incorrect in the transcription. Evaluation updated June 2024 to 5,000 test samples.
Artificial Analysis' independent evaluation is based on Common Voice v16.1, Mozilla's leading open-source speech-to-text dataset. Further detail is available on the methodology page.
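As an illustration of the metric, the sketch below computes a word error rate using the common definition of word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. This is an assumption: the page does not state the exact formula or text normalization used in the evaluation, and the function name and lowercase/whitespace normalization are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level Levenshtein distance."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
example = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{example:.1f}%")
```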
Price: Cost in USD per 1000 minutes of audio transcribed. Reflects the pricing model of the transcription service or software.
Word Error Rate vs. Speed Factor
Word error rate: % of words transcribed incorrectly (June '24), Speed factor: Input audio seconds transcribed per second
Most attractive quadrant
Size represents Price: USD per 1000 minutes of audio
Speed Factor: Audio file seconds transcribed per second of processing time. Higher factor indicates faster transcription speed.
Artificial Analysis' measurements are based on an audio duration of 10 minutes. Speed Factor may vary for other durations, particularly for very short durations (under 1 minute).
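The definition above amounts to a simple ratio. The sketch below shows the arithmetic for a 10-minute file; the function names and the 16-second processing time are hypothetical, chosen only to illustrate the calculation.

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Audio seconds transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_seconds_for(audio_seconds: float, factor: float) -> float:
    """Inverse view: expected processing time for a given speed factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


TEN_MINUTES = 600.0  # the benchmark's audio duration, in seconds

# Hypothetical timing: a 10-minute file transcribed in 16 seconds.
print(speed_factor(TEN_MINUTES, 16.0))

# A speed factor of 2.0 implies 300 seconds of processing for the same file.
print(processing_seconds_for(TEN_MINUTES, 2.0))
```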
Speed Factor vs. Price
Speed factor: Input audio seconds transcribed per second, Price: USD per 1000 minutes of audio
Most attractive quadrant
Quality
Word Error Rate
Word error rate: % of words transcribed incorrectly (June '24), Lower is better
Speed
Speed Factor
Speed factor: Input audio seconds transcribed per second, Higher is better
Speed Factor, Variance
Speed factor: Input audio seconds transcribed per second, Results by percentile, Higher is better
Median, Other points represent 5th, 25th, 75th, 95th Percentiles respectively
Boxplot: Shows variance of measurements
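The percentile summary behind a chart like this could be computed roughly as follows. This is a minimal sketch using Python's standard library; the interpolation method (`inclusive`) and the function name are assumptions, since the page does not say how the percentiles are calculated.

```python
from statistics import quantiles


def speed_factor_percentiles(samples: list[float]) -> dict[int, float]:
    """Return the 5th, 25th, 50th (median), 75th and 95th percentiles."""
    if len(samples) < 2:
        raise ValueError("need at least two measurements")
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = quantiles(samples, n=20, method="inclusive")
    # The cut point for percentile p sits at index p // 5 - 1.
    return {p: cuts[p // 5 - 1] for p in (5, 25, 50, 75, 95)}


# Hypothetical speed factor measurements for one provider.
measurements = [31.2, 33.8, 34.1, 34.6, 35.0, 35.4, 36.9, 37.3, 38.0, 41.5]
print(speed_factor_percentiles(measurements))
```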
Speed Factor, Over Time
Speed factor: Input audio seconds transcribed per second, Higher is better
Over time measurement: Median measurement per day, based on 8 measurements each day at different times. Labels represent start of week's measurements.
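The daily aggregation described above can be sketched as grouping timestamped measurements by calendar day and taking the median of each group. The data layout (a list of timestamp and speed factor pairs) and the function name are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def daily_medians(measurements):
    """Map each calendar day to the median speed factor measured on that day."""
    by_day = defaultdict(list)
    for timestamp, factor in measurements:
        by_day[timestamp.date()].append(factor)
    # Sort by day so the result reads as a time series.
    return {day: median(values) for day, values in sorted(by_day.items())}


# Hypothetical day with 8 measurements taken three hours apart.
factors = [30.0, 32.0, 34.0, 36.0, 38.0, 40.0, 42.0, 44.0]
sample = [(datetime(2024, 6, 3, hour), f) for hour, f in zip(range(0, 24, 3), factors)]
print(daily_medians(sample))
```

With an even number of measurements per day, `statistics.median` averages the two middle values.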
Price
Price
Price: USD per 1000 minutes of audio, Lower is better
Price: Cost in USD per 1000 minutes of audio transcribed. Reflects the pricing model of the transcription service or software.
For providers that price on processing time rather than audio duration (incl. Replicate, fal), we have calculated an indicative per-minute price based on the processing time expected per minute of audio. Further detail is available on the methodology page.
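One way such an indicative price could be derived is to convert a per-second compute rate into a cost per minute of audio using the measured speed factor. The sketch below is an assumption about the shape of that calculation, not the published methodology; the function name and the example rate are hypothetical.

```python
def indicative_price_per_1000_minutes(
    price_per_compute_second: float, speed_factor: float
) -> float:
    """Estimate USD per 1000 audio minutes from a per-second compute price."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    # Processing time expected for one minute (60 s) of audio.
    processing_seconds_per_audio_minute = 60.0 / speed_factor
    price_per_audio_minute = price_per_compute_second * processing_seconds_per_audio_minute
    return price_per_audio_minute * 1000.0


# Hypothetical: USD 0.001 per compute second at a speed factor of 60
# means 1 second of compute per audio minute.
print(indicative_price_per_1000_minutes(0.001, 60.0))
```

Because the estimate depends on the speed factor, it shifts whenever the provider's measured throughput changes.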
Note: Groq charges for a minimum of 10s per request.
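The effect of a per-request minimum on short clips can be sketched as below. It assumes the minimum works by billing any request shorter than 10 seconds as if it were 10 seconds long, which is one reading of the note; the function names are hypothetical and the 1.85 figure is the Groq Whisper (large-v3) price listed in the summary table.

```python
def billed_seconds(audio_seconds: float, minimum_seconds: float = 10.0) -> float:
    """Seconds charged for one request under a per-request minimum."""
    return max(audio_seconds, minimum_seconds)


def request_cost(
    audio_seconds: float,
    price_per_1000_minutes: float,
    minimum_seconds: float = 10.0,
) -> float:
    """USD cost of a single request at a given price per 1000 audio minutes."""
    billed_minutes = billed_seconds(audio_seconds, minimum_seconds) / 60.0
    return billed_minutes * price_per_1000_minutes / 1000.0


# Under this assumption, a 3-second clip costs the same as a 10-second clip.
print(request_cost(3.0, 1.85) == request_cost(10.0, 1.85))
```

For workloads made up of many very short clips, the effective price per minute of audio would be higher than the listed figure.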
Summary of key metrics & further information
Provider | Model | Whisper version | Word Error Rate (%) | Median Speed Factor | Price (USD per 1000 minutes) |
---|---|---|---|---|---|
OpenAI | Whisper (large-v2) | large-v2 | 10.6% | 34.6 | 6.00 |
Microsoft Azure | Whisper (large-v2) | large-v2 | 10.6% | 37.1 | 6.00 |
fal.ai | Wizper (Large v3) | large-v3 | 10.3% | 223.7 | 0.50 |
Replicate | Incredibly Fast Whisper | large-v3 | 10.3% | 44.3 | 1.49 |
Replicate | Whisper (large-v2) | large-v2 | 11.2% | 3.3 | 3.47 |
Replicate | Whisper (large-v3) | large-v3 | 10.3% | 3.0 | 4.23 |
Replicate | WhisperX | large-v3 | 10.9% | 21.2 | 1.09 |
Replicate | Whisper (medium) | medium | 12.8% | 3.8 | 2.68 |
Replicate | Whisper (small) | small | 17.0% | 2.6 | 1.37 |
Groq | Whisper (large-v3) | large-v3 | 10.3% | 163.5 | 1.85 |
Groq | Distil-Whisper, Groq | | 13.0% | 201.0 | 0.33 |
Deepinfra | Whisper (large-v3), Deepinfra | large-v3 | 10.3% | 119.2 | 0.45 |
fal.ai | Whisper (large-v3) | large-v3 | 10.3% | 92.3 | 1.15 |
Deepinfra | Distil-Whisper, Deepinfra | | 13.0% | 170.4 | 0.18 |
Groq | Whisper (large-v3 Turbo) | v3 Turbo | 12.0% | 187.5 | 0.67 |
Fireworks | Whisper (large-v3) | large-v3 | 0.0% | 185.5 | 0.00 |
Fireworks | Whisper (large-v3 Turbo) | v3 Turbo | 0.0% | 251.4 | 0.00 |
AssemblyAI | AssemblyAI (Universal-1) | | 8.7% | 57.9 | 6.17 |
AssemblyAI | Nano | | 12.7% | 81.5 | 2.00 |
Speechmatics | Speechmatics Standard | | 12.6% | 17.7 | 13.33 |
Speechmatics | Speechmatics Enhanced | | 8.6% | 9.1 | 17.33 |
Microsoft Azure | Azure Speech Service | | 12.6% | 2.0 | 16.67 |
Deepgram | Nova-2 | | 15.1% | 134.0 | 4.30 |
Deepgram | Base | | 26.1% | 167.3 | 12.50 |
Deepgram | Whisper Large v2 | large-v2 | 10.6% | 29.1 | 4.80 |
Gladia | Gladia | whisper-v2-variant | 12.9% | 20.1 | 10.20 |
Amazon Bedrock | Amazon Transcribe | | 11.2% | 23.2 | 24.00 |
Rev AI | Rev AI | | 0.0% | 9.1 | 20.00 |
Google | Cloud Speech-To-Text (Chirp) | | 12.4% | 14.9 | 16.00 |