Businesses use Neural TTS in many scenarios, such as voice assistants, video games, online learning, and accessibility tools that read content aloud. Check out these customer stories featuring companies like Vodafone, Vegas and Pearson that are using Neural TTS to transform their business. To better support these diverse use cases and make the voice experience even more natural, a richer selection of voices and a wider variety of speaking styles, especially emotions, have become critical.
Today we are excited to announce the release of 5 new neural voices in American English (en-US) and introduce 10 new speaking styles. The new speaking styles include 8 emotions, in addition to shouting and whispering. Customers can access the new speaking styles with nine en-US voices, including the 5 new ones. With these updates, Azure TTS enables customers to develop apps that better mirror human voices and express emotions. Currently the new voices and styles are in preview.
5 new neural TTS voices in en-US
With the 5 new voices added to the portfolio, Neural TTS now supports 20 voices in American English, offering a richer choice of voice personas that addresses a wider range of user scenarios for more customers.
Check out the table below for the new members of the en-US voice family and hear how they sound. You can also try your own text with these voices on this demo.
With this release, we extend speaking styles to more voices. Azure TTS now enables 8 emotions and, for the first time, shouting and whispering for nine en-US voices: Aria, Davis, Guy, Jane, Jason, Jenny, Nancy, Tony, and Sara.
We have built a number of new emotional styles for both male and female voices. Currently, the emotions enabled for en-US voices are cheerful, sad, angry, excited, friendly, unfriendly, hopeful, and terrified.
Below are samples from Jenny, one of the voices with emotions enabled. Hear how each emotion differs from the others:
Cheerful: Expresses a positive and happy tone.
Sad: Expresses a sorrowful tone.
Angry: Expresses an angry and annoyed tone.
Excited: Expresses an upbeat and hopeful tone. It sounds like something great is happening and the speaker is really happy about it.
Friendly: Expresses a pleasant, inviting, and warm tone. It sounds sincere and caring.
Unfriendly: Expresses a cold and indifferent tone.
Hopeful: Expresses a warm and yearning tone. It sounds like something good will happen to the speaker.
Terrified: Expresses a very scared tone, with a faster pace and a shakier voice. It sounds like the speaker is in an unsteady and frantic state.
Check out more samples of these emotions on en-US voices:
Shouting and whispering
Azure TTS supports shouting and whispering styles for the first time. With the shouting style, the voice sounds as if the speaker is far away or trying to be heard clearly in a noisy place. With the whispering style, the voice sounds as if the speaker is talking in private or telling a secret. These 2 styles let characters speak more vividly with Azure TTS in video games, audiobooks, and films.
Here are shouting and whispering TTS samples from the Jenny voice.
Shouting: Speaks as if from far away or outside, trying to be heard clearly.
Whispering: Speaks very softly, making a quiet and gentle sound.
The technology behind: Style Transfer to build voice styles at scale
To enrich style support and keep style parity across as many TTS voices as possible, we have applied a technology called “Style Transfer” to build speaking styles efficiently. Style Transfer is a method that applies the speaking tone and prosody (i.e., pace, intonation, rhythm) of one speaker (the source speaker) to another speaker (the target speaker). As a result, the target speaker adopts the tone and prosody of the source speaker while keeping their own voice timbre.
Conventionally, to build a voice style for TTS, we need to collect style recordings, for example emotional speech, from the original voice actor. However, sometimes we cannot gather enough emotional data because of voice actor availability or gaps in the voice actor's emotional range.
Style Transfer addresses this challenge effectively (see our Interspeech 2021 paper for details). With as few as 100 recorded utterances, we can learn a speaking style and apply it to a target speaker with good quality (a mean opinion score (MOS) gap of less than 0.2 relative to the source emotion recording) on top of UniTTS v4. This technique has been widely used to expand the styles of these en-US platform voices.
How to use
The new voices and style expansion are in public preview. The 5 new voices (Davis, Jane, Jason, Nancy, and Tony) are available only in three service regions: East US, West Europe, and Southeast Asia. For the existing voices, including Aria, Jenny, Guy, and Sara, the new and expanded speaking styles are accessible in all service regions.
Below is a short SSML snippet that uses the 'mstts:express-as' tag to trigger speaking styles:
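As a minimal sketch of what such a request might look like, the Python below assembles an SSML document that wraps text in 'mstts:express-as' and checks that the result is well-formed XML. The helper name `build_ssml`, the `STYLES` set, and the sample sentence are illustrative assumptions, not part of any SDK. The voice name `en-US-JennyNeural` and the `mstts` namespace URI follow Azure's public SSML documentation as we understand it, and should be verified against the current docs before use.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Style names listed in this post: 8 emotions plus shouting and whispering.
STYLES = {
    "cheerful", "sad", "angry", "excited", "friendly",
    "unfriendly", "hopeful", "terrified", "shouting", "whispering",
}


def build_ssml(text: str, style: str, voice: str = "en-US-JennyNeural") -> str:
    """Return an SSML string that speaks `text` in the given style.

    Hypothetical helper for illustration; the default voice name and the
    mstts namespace URI are assumptions based on Azure's SSML docs.
    """
    if style not in STYLES:
        raise ValueError(f"unknown style: {style!r}")
    return (
        '<speak version="1.0" xml:lang="en-US" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts">'
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:express-as style={quoteattr(style)}>"
        f"{escape(text)}"  # escape &, <, > so the text stays valid XML
        "</mstts:express-as>"
        "</voice>"
        "</speak>"
    )


if __name__ == "__main__":
    ssml = build_ssml("I can't wait to tell you the news!", "excited")
    ET.fromstring(ssml)  # raises ParseError if the SSML is not well-formed
    print(ssml)
```

The resulting string can be submitted as SSML input to the speech service or pasted into a tool that accepts SSML. Validating the style name up front is a design choice for this sketch; it only covers the styles named in this post.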
You can also easily create audio files with these voices and styles using our Audio Content Creation tool, without writing a single line of code.
We are inspired by how Style Transfer allows us to bring new voice styles to our customers. In the future, we expect to apply this process to other languages to improve the global reach and accessibility of TTS. In addition, we are evaluating the potential to bring Style Transfer to Custom Neural Voice (CNV). With this in place, a custom neural voice would be able to feature multiple styles without needing additional recording data. If you are interested in learning more about Style Transfer for CNV, please send us a note at mstts[at]microsoft.com.