Virtual voices: Azure's neural text-to-speech service

Microsoft is using neural networks to deliver more convincing artificial voices.


The days of the keyboard and screen as our sole method of interacting with a computer are long gone. Now we're surrounded by more natural user interfaces, adding touch and speech recognition to our repertoire of interactions. The same goes for how computers respond to us, using haptics and speech synthesis.


Speech is increasingly important, as it provides a hands-free and at-a-distance way of working with devices. It's not necessary to touch them or look at them -- all that's needed are a handful of trigger words and a good speech recognition system. We're perhaps most familiar with digital assistants like Cortana, Alexa, Siri, and Google Assistant, but speech technologies are appearing in assistive systems, in in-car applications, and in other environments where manual operations are difficult, distracting or downright dangerous.

Artificial voices for our code

The other side of the speech recognition story is, of course, speech synthesis. Computers are good at displaying text, but not very good at reading it to us. What's needed is an easy way of taking text content and turning it into recognisable human-quality speech, not the eerie monotone of a sci-fi robot. We're all familiar with the speech synthesis tools in automated telephony systems or in GPS apps that fail basic pronunciation tests, getting names and addresses amusingly wrong.

High-quality speech synthesis isn't easy. If you take the standard approach, mapping text to strings of phonemes, the result is often stilted and prone to mispronunciation. What's more disconcerting is that there's little or no inflection. Even using SSML (Speech Synthesis Markup Language) to add emphasis and inflection doesn't make much difference and only adds to developer workloads, requiring every utterance to be tagged in advance to add the appropriate speech constructions.

Part of the problem is the way that traditional speech synthesis works, with separate models for analysing the text and for predicting the required audio. Because these are separate steps, the result is clearly artificial. What's needed is an approach that brings those steps together in a single speech synthesis engine.

Figure: Microsoft's text-to-speech service uses deep neural networks to improve the way traditional text-to-speech systems match patterns of stress and intonation in spoken language (prosody) and synthesise speech units into a computer voice. (Image: Microsoft)

Using neural networks for more convincing speech

Microsoft Research has been working on this problem for some time, and the resulting neural network-based speech synthesis technique is now available as part of the Speech tools in the Azure Cognitive Services suite. The new Neural text-to-speech service is hosted in Azure Kubernetes Service for scalability and streams generated speech to end users. Instead of passing through multiple separate steps, input text first goes through a neural acoustic generator to determine intonation, and is then rendered using a neural voice model in a neural vocoder.

The underlying voice model is generated via deep learning techniques using a large set of sampled speech as the training data. The original Microsoft Research paper on the subject goes into detail on the training methods used, initially using frame error minimisation before refining the resulting model with sequence error minimisation.

Using the neural TTS engine is easy enough. As with all the Cognitive Services, you start with a subscription key and then use this to create a class that calls the text-to-speech APIs. To use the new service, all you need to do is choose one of the neural voices; the underlying APIs are the same for neural and standard TTS. Speech responses are streamed from the service to your device, so you can either direct them straight to your default audio output or save them as a file to be played back later.
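As a rough illustration of the flow described above, here is a minimal Python sketch that prepares a text-to-speech request and streams the response to a file. The endpoint pattern, header names, output format string, region and voice name are assumptions drawn from Microsoft's public REST documentation rather than from this article, and they may change, so check the current documentation before relying on them. The function names build_tts_request and synthesize_to_file are hypothetical.

```python
import shutil
import urllib.request
from xml.sax.saxutils import escape, quoteattr


def build_tts_request(text, subscription_key, region="westeurope",
                      voice="en-US-JennyNeural"):
    """Prepare (but do not send) a text-to-speech REST request.

    Assumptions: the endpoint pattern, header names, output format,
    region and voice name are placeholders taken from public docs.
    """
    # Choosing a neural voice name is what selects the neural service.
    ssml = (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )
    return {
        "url": f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1",
        "headers": {
            "Ocp-Apim-Subscription-Key": subscription_key,
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
            "User-Agent": "neural-tts-sketch",
        },
        "body": ssml.encode("utf-8"),
    }


def synthesize_to_file(request, path):
    """Send the prepared request and stream the audio response to a file."""
    req = urllib.request.Request(
        request["url"],
        data=request["body"],
        headers=request["headers"],
        method="POST",
    )
    with urllib.request.urlopen(req) as response, open(path, "wb") as out:
        # Copy the streamed response in chunks instead of buffering it all.
        shutil.copyfileobj(response, out)
```

With a valid key, calling synthesize_to_file(build_tts_request("Hello there", "YOUR_KEY"), "hello.wav") would save the utterance for later playback; switching between a neural and a standard voice should only require changing the voice argument.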


Neural voices still support SSML, so you can add your own adjustments to the default voices. That's in addition to the optimisations each voice already has for particular speech types. If you don't want to use SSML, pick a neural voice by characteristic -- a neutral voice or a cheerful voice, for example. SSML can be used to speed up playback or change the pitch of a speech segment without changing the synthesised voice. That way you can allow users to adjust output to suit their working environment, allowing them to choose the voice settings they find appropriate.
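To make the rate and pitch adjustments concrete, the sketch below builds an SSML document that wraps the text in a prosody element. The speak, voice and prosody elements and their rate and pitch attributes come from the W3C SSML specification; the voice name and the helper name build_prosody_ssml are assumptions for illustration, and the exact values a given neural voice accepts should be checked against the service documentation.

```python
from xml.sax.saxutils import escape, quoteattr


def build_prosody_ssml(text, voice="en-US-JennyNeural", lang="en-US",
                       rate="medium", pitch="medium"):
    """Return an SSML string that adjusts speaking rate and pitch.

    The voice name is a placeholder (an assumption); rate and pitch are
    passed through unchanged, e.g. rate="+20%" or pitch="-2st".
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>"
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}"  # escape &, < and > so the SSML stays well-formed
        "</prosody></voice></speak>"
    )


# Example: a user preference for slightly faster, slightly lower speech.
ssml = build_prosody_ssml("Your next meeting starts in ten minutes.",
                          rate="+20%", pitch="-2st")
```

Because only the prosody attributes change, the same synthesised voice is kept while the output is tuned per user, which is the kind of adjustment the paragraph above describes.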

Microsoft has made neural voices available in several regions, although for more language coverage you'll need to step back to using the older, standard speech synthesis models. Neural voices are available in English, German, Italian and Chinese, with five different voices. Most are female, but there's one male English voice.

Adding neural voices to your apps

So where would you use neural voices? The obvious choice is in any application that requires a long set of voice interactions, as traditional speech synthesis can be tiring to listen to for long periods. You also want to use neural voices where you don't want to add to cognitive load -- a risk that's reduced by using a more natural set of voices. Digital personal assistants and in-car systems are a logical first step for these new techniques, but you can use them to quickly create audio versions of existing documents, reducing the costs of audiobooks and helping users with auditory learning styles.

If you want to get started using neural voices in your applications, Microsoft provides a free subscription that gives you 500,000 characters of synthesised text per month. As neural voices require more compute than traditional sample-based methods, they are more expensive to use, but at $16 per million characters once you move out of the free service, it's not going to break the bank -- particularly if you use the option of saving utterances for later use. These can be used to build a library of commonly used speech segments that can be played back as required.
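The arithmetic and the saved-utterance idea can be sketched in a few lines of Python. The figures are the ones quoted above (500,000 free characters per month, $16 per million characters); how the free allowance and the paid rate actually combine on a bill is an assumption here, as are the names monthly_cost_usd and UtteranceCache. The synthesize callable stands in for whatever function calls the service.

```python
import hashlib
from typing import Callable, Dict

FREE_CHARS_PER_MONTH = 500_000   # free allowance quoted in the article
PRICE_PER_MILLION_USD = 16.0     # paid neural rate quoted in the article


def monthly_cost_usd(chars: int) -> float:
    """Estimate monthly cost.

    Assumption: volumes within the free allowance cost nothing; above it,
    every character is billed at the paid rate.
    """
    if chars <= FREE_CHARS_PER_MONTH:
        return 0.0
    return chars / 1_000_000 * PRICE_PER_MILLION_USD


class UtteranceCache:
    """Keep synthesised audio so repeated phrases are only paid for once."""

    def __init__(self, synthesize: Callable[[str, str], bytes]):
        self._synthesize = synthesize      # (voice, text) -> audio bytes
        self._store: Dict[str, bytes] = {}
        self.billed_chars = 0              # characters actually sent out

    def get(self, voice: str, text: str) -> bytes:
        key = hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._synthesize(voice, text)
            self.billed_chars += len(text)
        return self._store[key]
```

Under these assumptions, two million characters a month would come to $32, and any phrase served from the cache adds nothing to the billed character count.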

With speech an increasingly important accessibility tool, it's good to see the state of the art moving beyond stilted, obviously artificial voices. Microsoft's launch of neural voices in its Cognitive Services suite is an important step forward. Now it needs to bring them to more languages and to more regions so we can all get the benefit of these new speech synthesis techniques.
