ChatGPT's Advanced Voice Mode gets a significant update to make it sound more natural

OpenAI introduced Advanced Voice Mode last year alongside the launch of GPT-4o. The feature uses natively multimodal models, such as GPT-4o, and can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a typical conversation. It can also generate more natural-sounding audio, pick up on non-verbal cues such as how fast you're talking, and respond with emotion.

Earlier this year, OpenAI released a minor update to Advanced Voice Mode that reduced interruptions and improved accents. Today, OpenAI launched a significant upgrade that makes the voice sound even more natural and human-like. Responses now feature subtler intonation, realistic cadence (including pauses and emphasis), and more accurate expressiveness for certain emotions such as empathy and sarcasm.

Wow, new expressive voice in @ChatGPTapp doesn't just talk, it performs. Feels less like an AI and more like a human friend. Nice work @OpenAI team. 🎤🎶🚀 pic.twitter.com/LRkKNs3g3C
— Shaun Ralston (@shaunralston) June 7, 2025

This update also introduces support for translation. ChatGPT users can now ask Advanced Voice Mode to translate between languages: once asked to start, it keeps translating throughout the conversation until told to stop. For many users, this could reduce the need for a dedicated voice translation app.

For now, the updated Advanced Voice Mode is available only to paid ChatGPT users. OpenAI also noted some known limitations with this latest update, outlined below.

This update may occasionally result in minor reductions in audio quality, such as unexpected variations in tone and pitch—especially noticeable with certain voice options. OpenAI expects to improve audio consistency over time.
Rare hallucinations in Voice Mode still persist, sometimes producing unintended sounds resembling ads, gibberish, or background music.

While some minor limitations remain, the steady stream of improvements points to a future where the line between human and AI conversation becomes increasingly blurred.