June 3, 2026
Fun-Realtime-TTS: New Text to Speech model topping Artificial Analysis leaderboard
Alibaba's Fun-Realtime-TTS takes the #1 spot on the Artificial Analysis Speech Arena Leaderboard, surpassing Google's Gemini 3.1 Flash TTS and Inworld's Realtime TTS-2 Research Preview
Competition at the top of the TTS Arena is tighter than ever, with just 24 Elo points separating the top five models. Fun-Realtime-TTS takes the top spot with the highest Elo score on the leaderboard.
Alibaba's previous model, Fun-Realtime-TTS-Preview, reached #7 on the leaderboard; Fun-Realtime-TTS is Alibaba's first #1 model in the Artificial Analysis Speech Arena. It is available via Alibaba Cloud with API access for developers.
Key takeaways:
➤ Quality: Fun-Realtime-TTS has an Elo score of 1,219 (+16/-16) based on 962 arena appearances, placing it ahead of Gemini 3.1 Flash TTS at 1,214, Inworld Realtime TTS-2 Research Preview at 1,209, and Cartesia Sonic 3.5 at 1,203.
➤ Pricing: Fun-Realtime-TTS is priced at $27.59/1M characters, between Gemini 3.1 Flash TTS at $18.3/1M characters and Inworld Realtime TTS 1.5 Max at $35/1M characters, and below Sonic 3.5 at $39/1M characters.
➤ Features: Fun-Realtime-TTS supports real-time speech generation with voice cloning, voice design, multilingual output, and regional accents and dialects.
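
To put the Elo gaps in the Quality takeaway in perspective, here is a minimal Python sketch that converts each gap to an expected head-to-head win rate. It assumes the textbook Elo formula with a scale of 400; this article does not state how Artificial Analysis computes its arena scores, so treat the percentages as illustrative only. The function names are hypothetical, and the scores are the ones quoted above.

```python
# Elo scores quoted in the key takeaways above.
ELO = {
    "Fun-Realtime-TTS": 1219,
    "Gemini 3.1 Flash TTS": 1214,
    "Inworld Realtime TTS-2 Research Preview": 1209,
    "Cartesia Sonic 3.5": 1203,
}
LEADER = "Fun-Realtime-TTS"
LEADER_INTERVAL = 16  # the +16/-16 reported for the leader


def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Textbook Elo expected score of A against B.

    ASSUMPTION: the standard logistic formula with scale 400, which may
    differ from the method the leaderboard actually uses.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def compare_to_leader() -> list[tuple[str, int, float, bool]]:
    """Return (model, gap, leader's expected win rate, inside leader's interval)."""
    leader_score = ELO[LEADER]
    lower_bound = leader_score - LEADER_INTERVAL
    rows = []
    for name, score in ELO.items():
        if name == LEADER:
            continue
        gap = leader_score - score
        win_rate = expected_win_rate(leader_score, score)
        rows.append((name, gap, win_rate, score >= lower_bound))
    return rows


if __name__ == "__main__":
    for name, gap, win_rate, inside in compare_to_leader():
        note = "within the leader's interval" if inside else "outside the leader's interval"
        print(f"{name}: gap {gap}, expected win rate {win_rate:.1%} ({note})")
```

Under these assumptions, the 5-point lead over Gemini 3.1 Flash TTS corresponds to an expected win rate of roughly 50.7%, and all three listed competitors sit at or above the lower end of the leader's +16/-16 interval (1,203).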

Fun-Realtime-TTS achieves the highest Elo score on the Speech Arena Leaderboard while remaining competitively priced at $27.59/1M characters, below several other frontier TTS models including Sonic 3.5 and Inworld Realtime TTS-2 Research Preview.
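
As a rough illustration of what the per-character prices mean in practice, the sketch below computes the cost of a workload at each price quoted in this article. The 250,000-character workload and the function names are hypothetical, and the calculation assumes simple linear per-character billing with no free tier, minimums, or volume discounts.

```python
# Prices in USD per 1M characters, as quoted in this article.
PRICE_PER_MILLION_CHARS = {
    "Gemini 3.1 Flash TTS": 18.30,
    "Fun-Realtime-TTS": 27.59,
    "Inworld Realtime TTS 1.5 Max": 35.00,
    "Sonic 3.5": 39.00,
}


def cost_usd(characters: int, price_per_million: float) -> float:
    """Cost of synthesizing `characters` characters at a per-1M-character price."""
    return characters / 1_000_000 * price_per_million


def cost_table(characters: int) -> list[tuple[str, float]]:
    """Return (model, cost rounded to cents) pairs, cheapest first."""
    rows = [
        (name, round(cost_usd(characters, price), 2))
        for name, price in PRICE_PER_MILLION_CHARS.items()
    ]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    workload_characters = 250_000  # hypothetical workload, not from the article
    for name, cost in cost_table(workload_characters):
        print(f"{name:30s} ${cost:.2f}")
```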

See the top models on the Artificial Analysis Speech leaderboard: https://artificialanalysis.ai/text-to-speech/leaderboard
Vote for models in the Speech Arena: https://artificialanalysis.ai/text-to-speech/arena
Explore sample clips from Fun-Realtime-TTS in the Speech Explorer: https://artificialanalysis.ai/text-to-speech/speech-explorer