March 14, 2026
NVIDIA Nemotron 3 VoiceChat: Leading the Open Weights Frontier of Conversational Dynamics vs. Speech Reasoning
Understanding Speech to Speech model performance is multidimensional - two key and distinct dimensions are raw intelligence and conversational dynamics: how well a model handles the natural rhythms of human conversation such as turn-taking, interruptions.
Amongst full duplex open weights models, NVIDIA’s new Nemotron 3 VoiceChat, V1, leads in balancing these dimensions, setting itself apart from other models on the Conversational Dynamics vs. Speech Reasoning pareto frontier.

Key benchmarking results:
➤ Conversational Dynamics (Full Duplex Bench): Nemotron 3 VoiceChat (V1) scores 77.8%, second among open weights speech to speech models behind NVIDIA's own PersonaPlex (91.0%) and ahead of FLM-Audio (62.0%), Moshi (61.0%) and Freeze-Omni (58.7%)

➤ Speech Reasoning (Big Bench Audio): Nemotron 3 VoiceChat (V1) scores 29.2%, second among open weights speech to speech models behind Freeze-Omni (33.9%) and well ahead of PersonaPlex (12.6%), FLM-Audio (5.3%) and Moshi (1.7%)

➤ Pareto leader: While Freeze-Omni leads on speech reasoning and PersonaPlex leads on conversational dynamics, Nemotron 3 VoiceChat (V1) is the only open weights model that performs amongst the top 3 on both - making it the clear leader on the pareto frontier between these two critical dimensions
➤ Larger than other open weights models but still relatively small compared to LLMs: Nemotron 3 VoiceChat (V1) has 12B parameters, making it one of the larger open weights speech to speech models, while NVIDIA's PersonaPlex is ~7B. While larger compared to other larger open weights speech to speech models the model still is relatively small compared to leading LLMs
➤ Context vs. proprietary models: While this release materially advances open weights performance, open weights speech to speech models still significantly underperform leading proprietary offerings. For comparison, proprietary models on our Big Bench Audio benchmark score substantially higher - Step-Audio R1.1 at 96%, Grok Voice Agent at 92%, Gemini 2.5 Flash (Thinking) at 92%, and Nova 2.0 Sonic at 87%. The gap between open weights and proprietary remains large in this modality.

As the capability and adoption of Speech to Speech models increases, we expect to expand our set of benchmarks to include elements such as tool-calling and multi-turn instruction following.
Resources:
Apply to get access to the model here: https://developer.nvidia.com/nemotron-voicechat-early-access
Experience a live conversation here: https://build.nvidia.com/nvidia/nemotron-voicechat
For more details on leading Speech to Speech models: https://artificialanalysis.ai/models/speech-to-speech
Read the latest

NVIDIA Nemotron 3 Ultra released: fast, intelligent, and open
NVIDIA released Nemotron 3 Ultra today - the most intelligent open weights model from a US lab
June 4, 2026

Fun-Realtime-TTS: New Text to Speech model topping Artificial Analysis leaderboard
Alibaba's Fun-Realtime-TTS takes the #1 spot on the Artificial Analysis Speech Arena Leaderboard, surpassing Google's Gemini 3.1 Flash TTS and Inworld's Realtime TTS-2 Research Preview
June 3, 2026

MAI-Transcribe-1.5: New Speech to Text model leading the accuracy-speed Pareto frontier
Microsoft has released MAI-Transcribe-1.5: an exceptionally fast speech transcription model at a speed factor of ~276x, while still achieving 2.4% on AA-WER (#3), leading the accuracy-speed Pareto frontier
June 2, 2026