Speech to Text AI Model & Provider Leaderboard

Analysis and comparison of Speech to Text transcription models & API providers. Artificial Analysis has analyzed speech to text models and hosting providers across different characteristics including their word error rate (lower is better), speed and price. Speed is represented by 'Speed Factor' which is the number of audio seconds transcribed per second (higher is better). For further details, see our methodology page.

Speech-to-text models & providers compared: Whisper (L, v2), OpenAI; Universal-1; Standard; Whisper (L, v2), Azure; Enhanced; Nano; Wizper (L, v3), fal.ai; Incredibly Fast Whisper, Replicate; Nova-2; Whisper (L, v2), Replicate; Whisper (L, v3), Replicate; Base; WhisperX, Replicate; Whisper (L, v2), Deepgram; Whisper (L, v3), Groq; Distil-Whisper, Groq; Whisper (L, v3), fal.ai; Whisper (L, v3), Deepinfra; Whisper (L, v3, Turbo), Groq; Whisper (L, v3), Fireworks; Whisper (L, v3, Turbo), Fireworks; Universal-2; Amazon Transcribe; Fish Speech to Text; Nova-3; Chirp; Chirp 2; Scribe; GPT-4o Transcribe; and GPT-4o Mini Transcribe.

Highlights

Word error rate
Word error rate: % of words transcribed incorrectly, Lower is better
Speed Factor
Speed factor: Input audio seconds transcribed per second, Higher is better
Price
Price: USD per 1000 minutes of audio, Lower is better

Summary Analysis

Word Error Rate vs. Price

Word error rate: % of words transcribed incorrectly, Price: USD per 1000 minutes of audio
Chart annotation: Most attractive quadrant
Models shown: Whisper (L, v2), OpenAI; Enhanced; Wizper (L, v3), fal.ai; Whisper (L, v3, Turbo), Groq; Whisper (L, v3, Turbo), Fireworks; Universal-2; Amazon Transcribe; Nova-3; Chirp 2; Scribe; GPT-4o Transcribe; GPT-4o Mini Transcribe
Word Error Rate: Percentage of words incorrect in the transcription. Evaluation updated June 2024 to 5,000 test samples.
Price: Cost in USD per 1000 minutes of audio transcribed. Reflects the pricing model of the transcription service or software.
Artificial Analysis' independent evaluation is based on Common Voice v16.1, Mozilla's leading open-source speech to text dataset. Further detail is available on the methodology page.

Word Error Rate vs. Speed Factor

Word error rate: % of words transcribed incorrectly, Speed factor: Input audio seconds transcribed per second
Most attractive quadrant
Speed Factor: Audio file seconds transcribed per second of processing time. Higher factor indicates faster transcription speed.
Artificial Analysis' measurements are based on an audio duration of 10 minutes. Speed Factor may vary for other durations, particularly for very short durations (under 1 minute).

Speed Factor vs. Price

Speed factor: Input audio seconds transcribed per second, Price: USD per 1000 minutes of audio
Most attractive quadrant

Word Error Rate

Word error rate: % of words transcribed incorrectly, Lower is better
Word Error Rate: Percentage of words incorrect in the transcription. Evaluation updated June 2024 to 5,000 test samples.
Artificial Analysis' independent evaluation is based on Common Voice v16.1, Mozilla's leading open-source speech to text dataset. Further detail is available on the methodology page.
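
As a rough illustration of the metric, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the model's output, divided by the number of reference words. The sketch below assumes that conventional definition, plus lowercasing and whitespace tokenization; the function name and the normalization are illustrative assumptions, not the procedure documented on the methodology page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: text is lowercased and split on whitespace. Real evaluations
    usually apply heavier normalization (punctuation, numerals, and so on).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

The function returns a fraction; multiply by 100 to express it as a percentage. Because insertions count as errors, the value can exceed 100% when the output is much longer than the reference.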

Speed Factor

Speed factor: Input audio seconds transcribed per second, Higher is better
Speed Factor: Audio file seconds transcribed per second of processing time. Higher factor indicates faster transcription speed.
Artificial Analysis' measurements are based on an audio duration of 10 minutes. Speed Factor may vary for other durations, particularly for very short durations (under 1 minute).
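
The Speed Factor definition reduces to a single ratio. A minimal sketch, assuming both durations are measured in seconds; the function name and the example timing are hypothetical and are not taken from the leaderboard:

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of input audio transcribed per second of processing time."""
    if audio_seconds <= 0 or processing_seconds <= 0:
        raise ValueError("durations must be positive")
    return audio_seconds / processing_seconds


# Hypothetical example: a 10-minute file (600 s) that takes 12 s to transcribe.
example = speed_factor(audio_seconds=600, processing_seconds=12)  # 50.0
```

A Speed Factor of 50 in this example means the provider transcribes 50 seconds of audio for every second it spends processing.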

Speed Factor, Variance

Speed factor: Input audio seconds transcribed per second, Results by percentile, Higher is better
Median; other points represent the 5th, 25th, 75th and 95th percentiles, respectively
Boxplot: Shows variance of measurements
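
The five points on each boxplot can be derived from a list of repeated Speed Factor measurements. This is a sketch using Python's standard library; the interpolation method (`inclusive`) and the function name are assumptions, since the page does not state how its percentiles are computed.

```python
from statistics import quantiles


def speed_factor_percentiles(samples: list[float]) -> dict[str, float]:
    """Return the 5th, 25th, 50th, 75th and 95th percentiles of the samples."""
    if len(samples) < 2:
        raise ValueError("need at least two measurements")
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%; index k is the
    # (k + 1) * 5th percentile.
    cuts = quantiles(samples, n=20, method="inclusive")
    return {
        "p5": cuts[0],
        "p25": cuts[4],
        "median": cuts[9],
        "p75": cuts[14],
        "p95": cuts[18],
    }
```

A wide gap between the 5th and 95th percentiles indicates that a provider's speed varies considerably from request to request.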

Speed Factor, Over Time

Speed factor: Input audio seconds transcribed per second, Higher is better
Over time measurement: Median measurement per day, based on 8 measurements each day at different times. Labels represent start of week's measurements.
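
The per-day aggregation can be sketched as a group-by on calendar date followed by a median. The input shape (timestamp and Speed Factor pairs) and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import median


def daily_median(measurements: list[tuple[datetime, float]]) -> dict[date, float]:
    """Group (timestamp, speed_factor) pairs by calendar day and take the median."""
    by_day: dict[date, list[float]] = defaultdict(list)
    for timestamp, value in measurements:
        by_day[timestamp.date()].append(value)
    # Sort by day so the result reads as a time series.
    return {day: median(values) for day, values in sorted(by_day.items())}
```

With 8 measurements per day, the median is the mean of the 4th and 5th values in sorted order, which keeps a single unusually slow or fast request from dominating the daily figure.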

Price

Price: USD per 1000 minutes of audio, Lower is better
Price: Cost in USD per 1000 minutes of audio transcribed. Reflects the pricing model of the transcription service or software.
For providers that price on processing time rather than audio duration (incl. Replicate, fal), we have calculated an indicative per-minute price based on the processing time expected per minute of audio. Further detail is available on the methodology page.
Note: Groq charges for a minimum of 10s per request.
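
One plausible reading of the indicative price calculation, together with the minimum-billing note, is sketched below. The function names, parameters and rates are hypothetical, and the exact formula used for the leaderboard is the one on the methodology page, not this sketch.

```python
def indicative_price_per_1000_minutes(
    price_per_processing_second: float,
    processing_seconds_per_audio_minute: float,
) -> float:
    """Convert compute-time pricing into USD per 1000 minutes of audio.

    Assumption: cost scales linearly with the processing time expected
    for each minute of audio.
    """
    price_per_audio_minute = (
        price_per_processing_second * processing_seconds_per_audio_minute
    )
    return price_per_audio_minute * 1000


def billed_audio_seconds(audio_seconds: float, minimum_seconds: float = 10.0) -> float:
    """Apply a per-request minimum billing duration (10 s in the note above)."""
    return max(audio_seconds, minimum_seconds)
```

Under a per-request minimum, a 3-second clip is billed as 10 seconds, so workloads made up of many very short requests can cost more per audio minute than the listed price suggests.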

Summary of key metrics & further information

Model, Provider
Whisper Large v2, OpenAI
Whisper Large v2, Microsoft Azure
Whisper Large v3, fal.ai
Incredibly Fast Whisper, Replicate
Whisper Large v2, Replicate
Whisper Large v3, Replicate
WhisperX, Replicate
Whisper Large v3, Groq
Distil-Whisper, Groq
Whisper Large v3, fal.ai
Whisper Large v3, Deepinfra
Whisper Large v3 Turbo, Groq
Whisper Large v3, Fireworks
Whisper Large v3 Turbo, Fireworks
Universal-1, AssemblyAI
Nano, AssemblyAI
Universal-2, AssemblyAI
Standard, Speechmatics
Enhanced, Speechmatics
Nova-2, Deepgram
Base, Deepgram
Whisper Large v2, Deepgram
Nova-3, Deepgram
Amazon Transcribe, Amazon Bedrock
Fish Speech to Text, Fish Audio
Chirp, Google
Chirp 2, Google
Scribe, ElevenLabs
GPT-4o Transcribe, OpenAI
GPT-4o Mini Transcribe, OpenAI