Amazon has been at the forefront of developing voice-based technologies that power conversational AI applications for over a decade, from building world-class personal AI assistants like Alexa to developing AWS services like Lex, Polly, and Connect. But for voice AI to drive even more real-world value for customers, it must account for the nuance and complexity of human conversation. Words have meaning, but words alone can fall flat without the acoustic context that gives them depth. How something is said is as important as, if not more important than, what is said. Getting this right with AI has been a challenge until now.
Today Amazon announced Amazon Nova Sonic, a new foundation model that unifies speech understanding and speech generation into a single model, to enable more human-like voice conversations in AI applications. Available via a new API in Amazon Bedrock, the model simplifies the development of voice applications, such as customer service call automation and AI agents across a broad range of industries, including travel, education, health care, entertainment, and more.
A speech system that gets tone, style, and pace
Traditional approaches to building voice-enabled applications involve complex orchestration of multiple models: speech recognition to convert speech to text, large language models (LLMs) to understand and generate responses, and text-to-speech to convert text back to audio. This fragmented approach not only increases development complexity but also fails to preserve crucial acoustic context and nuances like tone, prosody, and speaking style that are essential for natural conversations.
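The cascaded pipeline described above can be sketched in a few lines. This is a minimal illustration, not real AWS code: `transcribe`, `generate_reply`, and `synthesize` are hypothetical stand-ins for the three separate models a developer would otherwise have to orchestrate.

```python
def transcribe(audio: bytes) -> str:
    # Stage 1: speech recognition. Acoustic cues (tone, prosody,
    # pauses, hesitations) are discarded here; only the words survive.
    return "I'd like to book a trip to Hawaii"

def generate_reply(text: str) -> str:
    # Stage 2: an LLM sees plain text, with no signal for how it was said.
    return f"Happy to help with that: {text!r}"

def synthesize(text: str) -> bytes:
    # Stage 3: text-to-speech renders the reply in a fixed voice,
    # unaware of the caller's emotional state.
    return text.encode("utf-8")

def cascaded_turn(audio_in: bytes) -> bytes:
    # Three hops, three models, and the acoustic context is lost at the
    # first hand-off -- the limitation a unified model avoids.
    return synthesize(generate_reply(transcribe(audio_in)))

reply_audio = cascaded_turn(b"<caller audio>")
```

Because each stage only passes text to the next, no amount of tuning downstream can recover the tone of voice dropped at stage 1; a unified model like Nova Sonic sidesteps the hand-off entirely.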
Nova Sonic takes a new approach to solve these challenges. Instead of chaining separate models, it unifies understanding and generation in a single model. This unification lets the model adapt its generated voice response to the acoustic context (e.g., tone, style) of the spoken input, resulting in more natural dialogue. Nova Sonic even understands the nuances of human conversation, including the speaker's natural pauses and hesitations, waiting to speak until the appropriate time, and gracefully handling barge-ins.
Example of an AI agent for travel built on Amazon Nova Sonic:
In this conversation, a customer interacts with a virtual travel assistant about a trip to Hawaii. When the customer's tone shifts from excited to worried about costs, the AI's tone becomes more reassuring as it pulls relevant pricing information.
It also generates a text transcript of the user's speech, enabling developers to use that text to call specific tools and APIs when building voice-enabled AI agents, like this example of an AI-powered travel agent that books flights by retrieving up-to-date flight information. These capabilities, along with its lightning-fast inference, make voice applications powered by Nova Sonic more natural and useful.
Example of an enterprise AI assistant built on Amazon Nova Sonic:
In this example, a dashboard AI assistant shows how enterprise customers can benefit from Nova Sonic's ability to ground responses in company data. The assistant pulls reports and shares accurate data in a natural, conversational tone while proactively asking relevant follow-up questions. The fluid dialogue enables multi-turn exchanges without requiring explicit context-setting from the speaker.
With the launch of Nova Sonic, Amazon continues to innovate with state-of-the-art foundation models that deliver real-world value for every Amazon customer.