Last year, Microsoft unveiled super-realistic AI voices designed for conversational applications, such as chatbots, voice assistants, gaming, and more. With the Azure Speech SDK or REST API, developers could integrate these neural text-to-speech (TTS) voices into their applications. In recent months, Microsoft has significantly expanded its offerings, now boasting over 500 neural voices across more than 140 languages and locales.
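For readers curious what the REST route looks like, here is a minimal sketch that builds (but does not send) a synthesis request using only the Python standard library. The endpoint shape and header names reflect Azure's public text-to-speech REST documentation as understood here. The key, the `eastus` region identifier, the output format, and the `en-US-JennyNeural` voice are illustrative assumptions, not details from Microsoft's announcement, so check them against the current documentation.

```python
from urllib.request import Request
from xml.sax.saxutils import escape, quoteattr


def build_tts_request(text, voice, region, key):
    """Build an HTTP POST request for Azure text-to-speech (nothing is sent)."""
    # SSML body: the voice name selects which neural voice reads the text.
    # escape/quoteattr keep user text from breaking the XML.
    ssml = (
        "<speak version='1.0' xml:lang='en-US'>"
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )
    # Assumed regional endpoint pattern for the Speech service.
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,           # placeholder credential
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",  # assumed format
        "User-Agent": "tts-sketch",
    }
    return Request(url, data=ssml.encode("utf-8"), headers=headers, method="POST")


# Illustrative values only; replace with a real key and a voice you have access to.
request = build_tts_request("Hello, world", "en-US-JennyNeural", "eastus", "YOUR_KEY")
```

Passing the resulting object to `urllib.request.urlopen` would perform the call and return audio bytes, assuming the key and region are valid.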
Today, Microsoft introduced an enhanced HD version of its neural text-to-speech service for select voices. These new HD voices improve overall expressiveness through emotion detection that considers the context of the input text. Microsoft says the HD voices use auto-regressive transformer language models to produce speech that matches the timbre of the selected platform voice. The advantages of the new HD voices include:
- Human-like speech generation: The upgraded model accurately interprets input text and understands the underlying sentiment, allowing it to adjust the speaking tone in real time to match the conveyed emotion.
- Conversational: This new model generates spontaneous pauses and emphasis. Microsoft highlights that it can replicate common speech patterns such as pauses and filler words.
- Prosody variations: The HD voice system introduces slight variations in each output, enhancing realism by ensuring that every sentence sounds different from previously generated speech.
Garfield He, Cognitive Services Speech program manager at Microsoft, commented on the HD voice launch:
“With innovative technology that employs acoustic and linguistic features to generate speech characterized by rich, natural variations, it skillfully detects emotional cues within the text and autonomously adjusts the voice’s tone and style. This upgrade delivers a more human-like speech pattern marked by improved intonation, rhythm, and emotion.”
Sample audio content generated with this HD voice model can be found in the video below.
https://www.youtube.com/watch?v=UCYok4I4a24
The new HD voices are currently in preview for developers in three regions: East US, West Europe, and Southeast Asia. The cost for utilizing these HD voices is set at $30 per 1 million characters.
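To put the quoted price in concrete terms, the short sketch below estimates cost from a character count. It assumes billing is strictly proportional to characters at the $30 per 1 million rate stated above; free tiers, minimum charges, and how markup characters are counted are not covered by the article and are ignored here.

```python
from decimal import Decimal

# USD per 1,000,000 characters, as quoted in the article for HD voices.
HD_PRICE_PER_MILLION_CHARS = Decimal("30")


def estimate_hd_cost(char_count):
    """Return the estimated cost in USD for synthesizing char_count characters.

    Assumes simple proportional pricing with no free tier or rounding rules.
    """
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    return Decimal(char_count) * HD_PRICE_PER_MILLION_CHARS / Decimal(1_000_000)
```

Under that assumption, 250,000 characters would come to $7.50.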
Source: Microsoft