Microsoft’s new AI voice model: A potential game changer for deepfakes

Enhancements in Azure AI Speech: Introducing DragonV2.1 Neural TTS Model

Microsoft has unveiled a significant upgrade to its Azure AI Speech capabilities with the launch of the DragonV2.1 Neural text-to-speech (TTS) model. This zero-shot model generates expressive, natural-sounding voices from only a short sample of speech. Microsoft says the update improves pronunciation accuracy and gives users more control over voice characteristics.

Key Features of DragonV2.1

The upgraded DragonV2.1 model supports speech synthesis in more than 100 languages and needs only a brief sample of the user’s voice. It improves on the earlier DragonV1 model, which struggled with pronunciation, particularly of named entities.

DragonV2.1 has a broad range of applications, including:

Customization of voices for chatbots
Dubbing video content in an actor’s original voice across numerous languages

Improved Naturalness and Control

One of the standout features of the new model is more realistic and stable prosody, which leads to a better listening experience. Microsoft reports an average 12.8% reduction in Word Error Rate (WER) compared to its predecessor, DragonV1. Users can fine-tune pronunciation and accent through Speech Synthesis Markup Language (SSML) phoneme tags and custom lexicons.
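
To illustrate the kind of control an SSML phoneme tag offers, here is a minimal Python sketch that builds an SSML document in which one named entity carries an explicit IPA pronunciation. It uses only the standard W3C SSML elements (speak, voice, phoneme). The voice name is a placeholder, the example sentence and the helper build_ssml are hypothetical, and any extra elements Azure may require for personal voice are not covered by this article, so treat it as an assumption-laden sketch and not as Microsoft's documented request format.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Placeholder, not a real voice identifier; substitute the voice name
# given in the Azure AI Speech documentation for your resource.
PLACEHOLDER_VOICE = "your-voice-name"


def build_ssml(before: str, word: str, ipa: str, after: str,
               voice: str = PLACEHOLDER_VOICE, lang: str = "en-US") -> str:
    """Return an SSML string in which `word` carries an explicit IPA pronunciation."""
    # Serialize SSML elements in the default namespace (no prefix).
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(f"{{{SSML_NS}}}speak",
                       {"version": "1.0", f"{{{XML_NS}}}lang": lang})
    voice_el = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice})
    voice_el.text = before
    # The phoneme element overrides how the enclosed text is pronounced.
    phoneme = ET.SubElement(voice_el, f"{{{SSML_NS}}}phoneme",
                            {"alphabet": "ipa", "ph": ipa})
    phoneme.text = word
    phoneme.tail = after
    return ET.tostring(speak, encoding="unicode")


# Hypothetical example: pin the pronunciation of the name "Siobhan".
ssml = build_ssml("Our guest today is ", "Siobhan", "ʃɪˈvɔːn", ".")
print(ssml)
```

The resulting string would then be passed to whatever synthesis call the service exposes. Building the markup with an XML library instead of string concatenation keeps special characters in names and IPA strings correctly escaped.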

Concerns about Deepfakes and Mitigation Strategies

While these advances open up new opportunities, they also raise concerns that the technology could be misused to create deepfakes. To reduce these risks, Microsoft has implemented strict usage policies that require users to obtain explicit consent from the original voice owner and to disclose when content is synthetically generated, and that prohibit any form of impersonation or deception.

Furthermore, Microsoft is adding automatic watermarks to the synthesized speech output. According to the company, the watermark is detected with 99.7% accuracy across various audio manipulation scenarios, which makes misuse of AI-generated voices easier to identify.

Getting Started with Azure AI Speech

Users who want to explore the personal voice feature can try it out in Speech Studio. Businesses that want full access to the API can apply to Microsoft for access and integrate these features into their applications.

Image via Depositphotos.com
