Remember the famous scene in Stanley Kubrick’s 1968 film 2001: A Space Odyssey, when HAL 9000–the intelligent-turned-malevolent computer–regresses to his “childhood” and sings “Daisy Bell” as he’s decommissioned by astronaut Dave Bowman? Its inspiration was a real-life Bell Labs demonstration of speech synthesis on an IBM 704 mainframe in 1961, witnessed by Arthur C Clarke, who later incorporated it into his 2001 novel and screenplay.
Although Bell Labs’ involvement in the field stretches back to the 1930s with Homer Dudley’s keyboard-and-footpedal-driven Voder speech synthesis device, it’s undoubtedly the classic Kubrick/Clarke movie that cemented the ideas of artificial intelligence (AI) and conversing with computers into the public mind.
Depending on your age, you’ll be familiar with computerised voices from devices like Texas Instruments’ popular 1978 Speak & Spell educational toy, Stephen Hawking’s speech synthesiser (memorably sampled in the Pink Floyd song Keep Talking), the GPS navigation system in your car, or any number of public information and call-handling systems.
More recently, the combination of automatic speech recognition (ASR), natural-language understanding (NLU) and text-to-speech (TTS) has come to mainstream attention in virtual assistants such as Apple’s Siri, Google Now, Microsoft’s Cortana, and Amazon’s Alexa.
How speech and language work
To get a handle on how speech technologies work, we clearly need to know something about the mechanics of human speech and the structure of language.
When we speak, air from the lungs passes through the vocal tract to produce “voiced” or “unvoiced” sounds (depending on whether the vocal cords are vibrating or not) that may then be modulated by the tongue, teeth and lips. At its most “atomic,” speech is a stream of audio segments called “phones,” within which are characteristic resonant frequencies called “formants” that can be used to identify vowels (sounds produced with an open vocal tract, all of which, in English, are voiced). In this spectrogram, which is produced by applying a Fast Fourier Transform (FFT) to the speech waveform, the author is saying: “a, e, i, o, u” and it’s clear that each one has a distinctive signature:
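For the curious, a spectrogram of this kind can be computed in a few lines of numpy: slice the waveform into overlapping windowed frames and apply an FFT to each. This is an illustrative sketch (the function name and frame sizes are our own choices, not from any particular toolkit):

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Magnitude spectrogram: FFT of overlapping, Hamming-windowed frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hamming(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop_len)]
    # One FFT per frame; each column of the result is one time slice
    return np.abs(np.fft.rfft(frames, axis=1)).T

# A pure 440 Hz tone should show a single strong frequency band
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t), sr)
peak_bin = spec.mean(axis=1).argmax()
peak_hz = peak_bin * sr / int(sr * 0.025)   # bin spacing = sr / frame_len
```

Real speech produces the richer, shifting bands seen in the figure, but the underlying computation is the same.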
The basic unit of language is the phoneme, which is defined as the smallest part of a word that, if changed, alters its meaning: /p/ and /b/ are phonemes in English, for example, because “pack” and “back” are words with different meanings. Phonemes can be thought of as the conceptual building blocks of words, whereas phones are the actual sounds that we make. To illustrate the difference, consider the words “pin” and “spin”: in the former, the “p” sound is aspirated, while in the latter it is not. The phoneme /p/ therefore has two “allophones,” usually written as [pʰ] and [p], that sound slightly different but will not change the meaning of a word if substituted (instead, such substitutions merely result in an odd pronunciation).
How computers recognize speech
Automatic speech recognition (ASR) involves digitising the continuous analogue stream of phones, formants and gaps that comprise “utterances,” slicing these sequences into chunks and then applying a battery of statistical–and more recently, machine learning–techniques to the extracted features from those chunks to identify and output the most likely words that are being spoken. The ASR pipeline looks like this:
The front end’s job is signal processing–converting the speech waveform to a digital parametric representation, and also cleaning up the extracted features to maximise the signal-to-noise ratio. This is carried out in chunks called “frames,” usually of 10 millisecond duration, within a sliding “context window” of around five frames. Windowing allows for the fact that the signature of a phone will depend not only on the spectral features of a given frame, but also those preceding and following it.
The most common acoustic features extracted from speech input signals are Mel-Frequency Cepstral Coefficients, or MFCCs, although there are other methods, including Linear Predictive Coding (LPC) and Perceptual Linear Prediction (PLP).
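As a rough illustration of the standard MFCC recipe (triangular filters spaced on the perceptually motivated mel scale, log filterbank energies, then a discrete cosine transform), here is a simplified numpy sketch for a single frame. Production front ends add refinements such as pre-emphasis, liftering and delta features; the filter and coefficient counts below are just common defaults:

```python
import numpy as np

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mfcc(frame_spectrum, sample_rate, n_filters=26, n_coeffs=13):
    """MFCCs for one frame: mel filterbank energies -> log -> DCT."""
    n_bins = len(frame_spectrum)
    # Centre frequencies evenly spaced on the mel scale, mapped to FFT bins
    mel_points = np.linspace(0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_bins - 1) * mel_to_hz(mel_points) /
                    (sample_rate / 2)).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):           # triangular filters
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    log_energies = np.log(fbank @ (frame_spectrum ** 2) + 1e-10)
    # DCT-II decorrelates the energies; keep only the first n_coeffs
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) /
                 (2 * n_filters))
    return dct @ log_energies

# Example: MFCCs of one 25 ms frame of a 440 Hz tone at 16 kHz
sr, frame_len = 16000, 400
frame = np.sin(2 * np.pi * 440 * np.arange(frame_len) / sr) * np.hamming(frame_len)
coeffs = mfcc(np.abs(np.fft.rfft(frame)), sr)
```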
The back end, or decoder, has three components. First, an acoustic model, which is built by compiling statistical representations of phones from a large speech database or “corpus,” selects the most likely strings of phones that are represented by the extracted acoustic features–traditionally via Hidden Markov Models (HMMs). Second, strings of phones are matched to words using a pronunciation dictionary or lexicon. And third, a language model constrains the decoder’s choice of words to those that are most likely to make sense. Statistical language models (conventionally n-grams) are compiled from large corpora of text (typically much larger than those used for acoustic modelling), usually from specific “domains” or subject areas.
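The HMM decoding at the heart of the acoustic model can be illustrated with the classic Viterbi algorithm, which finds the most likely sequence of hidden states given per-frame scores. This is a toy sketch: in a real decoder the states would be sub-phone HMM states and the observation scores would come from the acoustic model, but the dynamic-programming recursion is the same:

```python
import numpy as np

def viterbi(obs_loglik, log_trans, log_init):
    """Most likely hidden-state path given per-frame log-likelihoods.
    obs_loglik: (frames x states), log_trans: (states x states)."""
    T, S = obs_loglik.shape
    score = log_init + obs_loglik[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # score of every i -> j move
        back[t] = cand.argmax(axis=0)          # best predecessor per state
        score = cand.max(axis=0) + obs_loglik[t]
    path = [int(score.argmax())]               # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: two phone-like states; the audio clearly favours
# state 0 for three frames, then state 1 for three frames
obs = np.log(np.array([[0.99, 0.01]] * 3 + [[0.01, 0.99]] * 3))
trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
init = np.log(np.array([0.5, 0.5]))
path = viterbi(obs, trans, init)
```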
There are many challenges that ASR engines need to address. For example, recognition accuracy is affected by the quality of the microphone used, and by the level of background noise. Refinements in signal processing and acoustic modelling help to create more noise-robust speech recognition, which is especially important as ASR use cases move from relatively quiet offices and homes to noisier mobile environments.
People’s accents and speaking styles also vary widely, of course, which is why most ASR systems benefit from the creation of user profiles from supplied training texts, so the decoder can fine-tune its “speaker-independent” acoustic model. People may also use words that are not in the language model or the lexicon, so the software also needs to be able to add “out of vocabulary” words and record their pronunciation.
Accurate speech recognition doesn’t just depend on identifying phones via an acoustic model and then looking them up in a phonetic dictionary, because similar-sounding words and phrases can have entirely different meanings. For example, a purely acoustic model might give equal weight to “it’s fun to recognise speech” and “it’s fun to wreck a nice beach”, even though the former is clearly the more likely utterance. Here’s how similar they look on a spectrogram:
This problem also exists at the word level in the shape of homophones–words that sound the same but differ in meaning (and may differ in spelling)–of which there are many in English, including “morning” and “mourning,” “birth” and “berth,” and “whole” and “hole,” for example.
This is where the decoder’s language model comes in. Its job, as noted above, is to maximise the likelihood that word sequences identified by the acoustic model and the lexicon actually make sense, the conventional modelling approach being the n-gram. An n-gram language model estimates the likelihood of the next word in a sequence given the previous n words, with the probability distribution for all possible combinations estimated from a large corpus of training data. Because of the large numbers involved (a vocabulary of size V has V^n possible n-grams), n usually equals 2 (a bigram model) or 3 (a trigram model).
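A toy example makes the idea concrete: a bigram model with simple add-alpha smoothing, trained on a few invented sentences, will prefer a plausible word sequence over an acoustically similar but nonsensical one. The corpus and smoothing constant here are made up for illustration; real language models are trained on billions of words:

```python
from collections import Counter
import math

corpus = ("it is fun to recognise speech . "
          "people recognise speech every day . "
          "we went to the beach . ").split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_logprob(sentence, vocab_size, alpha=1.0):
    """Log-probability of a word sequence under an add-alpha bigram model."""
    words = sentence.split()
    lp = 0.0
    for prev, word in zip(words, words[1:]):
        lp += math.log((bigrams[(prev, word)] + alpha) /
                       (unigrams[prev] + alpha * vocab_size))
    return lp

V = len(unigrams)
a = bigram_logprob("fun to recognise speech", V)
b = bigram_logprob("fun to wreck a nice beach", V)
# a > b: the model has seen "recognise speech" but never "wreck a"
```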
Error rates and the rise of neural networks
A common measure of the accuracy of a speech recognition system is the Word Error Rate, or WER. This is computed by aligning the ASR output against a reference transcript and calculating WER = (S + D + I) / N, where S is the number of “substitutions” (words that the ASR has recognised wrongly), D is the number of “deletions” (words present in the reference transcript that are absent from the ASR output), I is the number of “insertions” (words present in the ASR output that are absent from the reference transcript) and N is the total number of words in the reference transcript.
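The calculation can be sketched in a few lines of Python, using dynamic programming to find the minimum-cost alignment of hypothesis to reference:

```python
def wer(reference, hypothesis):
    """Word Error Rate via edit distance: (S + D + I) / N."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                        # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                        # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution (or match)
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# The "wreck a nice beach" example: 2 substitutions + 2 insertions
# over a 5-word reference gives a WER of 0.8
print(wer("it's fun to recognise speech", "it's fun to wreck a nice beach"))
```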
In recent years ASR systems have seen big improvements in word error rates thanks, in particular, to more efficient acoustic models that use machine learning in the form of Deep Neural Networks (DNNs) to determine how well HMM states fit the extracted acoustic features rather than statistical techniques such as Gaussian Mixture Models (GMMs), which were the preferred method for many years.
This graph, from a blog by speech specialist Nuance’s research director Nils Lenke, shows how WERs declined sharply from around 2010 as neural networks were successfully incorporated into ASR systems:
The current state of the art for WER in voice search and virtual assistants is now in single figures: Google claimed 8 percent for Google Now at its I/O conference in May 2015, for example, which Apple swiftly countered with 5 percent for Siri at its Worldwide Developers Conference in June.
So how did these improvements come about? Nuance’s Nils Lenke takes up the story:
“We had seen neural networks back in the nineties, and I remember dealing with them back then, but they never took off in either speech recognition or AI because people didn’t really know how to train them on the hardware available back then. When Deep Neural Networks, or DNNs, came back in 2010, it was an interesting turning point because in the last so-many decades we had been trying to squeeze more accuracy out of Hidden Markov Models–and had been quite successful. But HMMs had been squeezed for so long that it was coming to an end, and DNNs opened a completely new universe of possibilities.”
“Even in the few years since they came back, the topologies and the architectures have changed a lot, and many different things have been tried out, like CNNs (Convolutional Neural Networks) and other mechanisms that better capture the nature of speech as being embedded in time: you can look into the history of what happened in the last few milliseconds up to a few seconds ago, and also try to guess the future, because speech is not a static thing–it develops over time. Some of the newer architectures capture this better than others. There are so many things you can try out, like how many hidden layers there are, and how to train them on GPUs [Graphical Processing Units]. The search space in which you look for better solutions is so much bigger now, and there’s so much potential for the future.”
So what were the specific technical advances that allowed neural networks to make their 2010 comeback? Lenke also took up that subject:
“First of all it was really the hardware,” says Lenke, “because it allowed us to train massive networks on massive data. One thing everyone is profiting from is GPUs, which are good at working on large amounts of data in parallel–doing relatively simple things to a lot of data at the same time, which is very different from what CPUs are good for, which is doing fairly complex things on small amounts of data. That’s what we do in speech recognition because it’s mostly driven by statistical methods, so we need to look at large statistical models and large amounts of data when training those models. That’s also what neural networks are about. GPUs help you shorten training times by a factor of five or ten or twenty times and make it realistic to train those models. When we started doing DNNs again, training might take weeks or months–and in an industrial application, for example, you need to do it for many different settings and languages, so you can’t afford training times of several weeks. While you’re trying out what works best you need to run multiple cycles, and you can only do that when they take hours or days to complete. That’s where we are now, using GPUs.”
Another key development was the introduction of “pre-training” for neural networks, to ensure that the system settled on the global optimal solution rather than some suboptimal “local maximum.”
“If you start with a naked neural network, which doesn’t know anything, when it starts learning it can go off in the wrong direction,” explains Lenke. “The trick that the pioneers in the field, like Geoffrey Hinton, Yoshua Bengio and others, came up with was the idea of pre-training–doing a rough pass over the network to teach it a few fundamental things so it would already be headed in the right direction before you started with the real training.
“The network needs to discover many things–if you look at rule-based or linguistic approaches, you know a few things about language, like what the phonemes are, what a vowel looks like and which phonemes are longer than others. But the neural network, starting from scratch, needs to discover all those aspects: when you do a pre-training, that gives it a rough idea of how things are structured. It’s much better to then learn the finer aspects, such as what do individual phonemes look like. That avoids you getting stuck in a state that looks optimal but is actually a local maximum, and a much better solution could have been reached if you had initially headed in a slightly different direction.”
Lenke notes that the term “neural network” can be somewhat misleading. “In the 90s, people thought that neural networks were inspired by how human brains work,” he said. “If you look at how brain cells work, they have an input channel, an output channel, and they do something electrical that looks fairly simple–they sum up what comes in on the input side and then either fire off, setting things in motion with other cells, or not. Of course, the complexity of the brain is still orders of magnitude higher than in neural networks, because what [neural networks] capture is really the aspect of collecting input from a few cells, doing a very simplistic mathematical function, and processing output to the next layer of cells. There’s no ‘magic’ going on there–mathematically speaking, they’re not that different from an HMM.”
Similar reservations apply to IBM’s brain-inspired “neurosynaptic” chip, says Lenke.
“You can look at [the IBM chip] as a replacement for GPUs–hardware that helps you run neural networks. But it’s still not analogue. The thing about the human brain is, it’s analogue. There’s no digital component there. Traditional computing hardware is not very suited to running neural networks, and we try to bridge this by using GPUs. Having dedicated hardware may help, but again, it’s a little dangerous to imply that it’s closer to a real brain. You could say it’s inspired by our very superficial understanding of how the brain works.”
Speech recognition has made great strides in recent years, thanks mainly to neural networks and the highly parallel multi-core GPUs on which they are trained. But ASR is still by no means perfect, and everyone encounters amusing transcription errors from time to time.
For example, the BBC provides subtitles on its live broadcasts for the hard of hearing, using specially trained staff who “respeak” the soundtrack (with light on-the-fly editing but as little delay as possible) into ASR software, which then displays its output as an on-screen caption. You can find plenty of choice gaffes online, including the “Chinese Year of the Horse” (2014) becoming “the year of the whores,” the former UK Labour Party leader Ed Miliband becoming “the Ed Miller Band” and the UK government “making helpful decisions” becoming “making holes for surgeons.”
Natural language understanding
Speech recognition is a complex enough task for a computer, but the next stage towards some sort of conversational interaction is to extract meaning from the sentences that humans utter. This is clearly important in applications such as voice search and virtual assistants–both general-purpose ones such as Siri, Google Now, Cortana, and Alexa, and those aimed at vertical markets such as Nuance’s Dragon Drive in-car assistant and the Florence assistant for physicians.
Adding a Natural Language Understanding (NLU) layer to speech recognition to deliver better voice search, or a virtual assistant such as Siri, requires knowledge of syntax (the rules governing sentence structure in a given language), semantics (the study of meaning at various levels–words, phrases, sentences and so on), how human dialogues are structured, and access to online resources, including large-scale knowledge bases, from which responses can be crafted.
The development of Google’s voice search capabilities provides a useful example of what NLU adds to ASR. Initially Google Voice Search simply allowed you to speak rather than type your search keywords. Then in 2012, the Google Now virtual assistant added semantic search information from Google’s Knowledge Graph project. The Knowledge Graph is a multi-source knowledge base about people, places and things that allows the search system to distinguish between, for example, “Taj Mahal” the Indian mausoleum, “Taj Mahal” the blues musician–and, for that matter, your local “Taj Mahal” curry house.
In November 2015, Google announced that its voice search app could make better use of the Knowledge Graph by analysing the semantics of more complex queries such as this one:
The seeds of a potential successor to the Knowledge Graph have already been sown, in the shape of the Knowledge Vault (KV), which is designed to automatically extract facts from the entire web in order to augment the information collected in conventional knowledge bases.
The Google researchers make an explicit analogy between their still-experimental system and speech recognition:
“The Knowledge Vault is different from previous works on automatic knowledge base construction as it combines noisy extractions from the Web together with prior knowledge, which is derived from existing knowledge bases… This approach is analogous to techniques used in speech recognition, which combine noisy acoustic signals with priors derived from a language model. KV’s prior model can help overcome errors due to the extraction process, as well as errors in the sources themselves.”
One of the best-known natural-language query-processing systems is IBM Watson, which was initially developed with text-only input and output. In 2015, however, IBM announced the addition of speech capabilities (speech-to-text and text-to-speech services) to the Watson Developer Cloud. For an in-depth look at the history of IBM Watson, see Jo Best’s 2013 TechRepublic cover story.
Speech synthesis and text-to-speech
Having taught computers to recognise the words humans speak (ASR), and to some extent understand what those words mean (NLU), how exactly do computers talk back? How is speech synthesis–and, when the input is the written word, text-to-speech (TTS)–actually achieved? The TTS pipeline is essentially the inverse of the one for ASR described earlier:
The first stage is to convert written text into the words that the TTS system will, at the end of the pipeline, speak. This text-analysis process, also known as “normalisation,” involves the conversion of numbers, dates, times and abbreviations to word form.
For example, after normalisation, the sentence:

William Shakespeare wrote Henry IV Part 1 no later than 1597

becomes:

William Shakespeare wrote Henry the Fourth part one no later than fifteen ninety seven
Note that the text-analysis system needs to recognise the roman numerals in “Henry IV” and determine that they should be spoken (in British English at least) as “Henry the Fourth” rather than “Henry Four”, and that “1597” should be spoken as “fifteen ninety seven” rather than “one thousand five hundred and ninety seven.”
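A toy normaliser covering just these two cases (regnal Roman numerals after a name, and four-digit years) might look like the sketch below. Real TTS front ends use far more elaborate rule sets and statistical classifiers; the regular expressions and word tables here are our own simplifications:

```python
import re

UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}
ORDINAL = {1: "First", 2: "Second", 3: "Third", 4: "Fourth",
           5: "Fifth", 6: "Sixth", 7: "Seventh", 8: "Eighth"}

def two_digits(n):
    if n < 20:
        return UNITS[n]
    return TENS[n // 10] + ("" if n % 10 == 0 else " " + UNITS[n % 10])

def spell_year(y):
    """Years are read in pairs: 1597 -> 'fifteen ninety seven'."""
    hi, lo = divmod(y, 100)
    if lo == 0:
        return two_digits(hi) + " hundred"
    if lo < 10:
        return two_digits(hi) + " oh " + two_digits(lo)
    return two_digits(hi) + " " + two_digits(lo)

def roman_to_int(s):
    vals = [ROMAN[c] for c in s]
    return sum(-v if i + 1 < len(vals) and v < vals[i + 1] else v
               for i, v in enumerate(vals))

def normalise(text):
    def regnal(m):
        n = roman_to_int(m.group(2))
        if n not in ORDINAL:
            return m.group(0)        # leave anything we can't verbalise alone
        return m.group(1) + " the " + ORDINAL[n]
    # "Henry IV" -> "Henry the Fourth" (capitalised name + Roman numeral)
    text = re.sub(r"\b([A-Z][a-z]+) ([IVXLC]+)\b", regnal, text)
    # Four-digit years in the range 1000-1999 -> spoken form
    text = re.sub(r"\b1[0-9]{3}\b",
                  lambda m: spell_year(int(m.group(0))), text)
    return text
```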
The next stage is linguistic analysis, which determines how the normalised words should be spoken. For example, the system will examine a word’s context within the sentence in order to distinguish between heteronyms–words with the same spelling but different pronunciation and meaning, such as “bass” (the fish) and “bass” (the musical instrument), or “tear” (the lacrimal gland secretion) and “tear” (to pull apart). Phonetic transcriptions are then applied to the words, usually via a combination of dictionary- and rule-based approaches, and prosodic information is added in order to generate more natural-sounding speech, prosody being the conveyance of meaning via intonation, pauses and other “suprasegmental” speech information.
The final link in the TTS pipeline is synthesis, which results in the generation of a waveform that will deliver the front end’s output as recognisable speech. The main classes of speech synthesis are concatenative, formant, articulatory and HMM-based.
Concatenative synthesis literally “strings together” bits of recorded speech that have been chopped up into a variety of components–phones, diphones, phonemes, syllables, words, phrases and sentences, for example. This can produce natural-sounding speech, depending on the size of the speech database, the quality of the component selection algorithm, and how much signal-processing is applied at the point of concatenation.
Formant synthesis is an acoustic technique based on the source-filter model, much like a music synthesiser. This makes it well suited to switching between voices and languages (unlike concatenative synthesis), but the trade-off is a less natural-sounding, more robotic, voice.
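The source-filter idea can be sketched in a few lines of numpy: an impulse train at the pitch frequency (the “source”, standing in for the vibrating vocal cords) is passed through a cascade of two-pole resonators tuned to formant frequencies (the “filter”, standing in for the vocal tract). The formant values below are classic published measurements for the vowel /a/; everything else is an illustrative simplification:

```python
import numpy as np

def resonator(x, freq, bw, sr):
    """Two-pole resonant filter (one 'formant'), applied sample by sample."""
    r = np.exp(-np.pi * bw / sr)
    c = -r * r
    b = 2 * r * np.cos(2 * np.pi * freq / sr)
    a = 1 - b - c                       # normalise low-frequency gain
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = (a * x[n]
                + b * (y[n - 1] if n > 0 else 0.0)
                + c * (y[n - 2] if n > 1 else 0.0))
    return y

def vowel(f0=120, formants=((730, 90), (1090, 110), (2440, 120)),
          dur=0.3, sr=16000):
    """Crude /a/ vowel: impulse train filtered through formant resonators."""
    n = int(dur * sr)
    source = np.zeros(n)
    source[::sr // f0] = 1.0            # one glottal pulse per pitch period
    out = source
    for freq, bw in formants:           # cascade of formant filters
        out = resonator(out, freq, bw, sr)
    return out / np.max(np.abs(out))    # normalise amplitude

wave = vowel()
```

Changing the formant table changes the vowel, and changing f0 changes the pitch, which is why this approach switches voices so easily; the buzzy impulse-train source is also why it sounds robotic.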
Articulatory synthesis requires a detailed computer model of how acoustic waves are generated and modified in the human vocal tract. Although theoretically the most flexible type of speech synthesis, articulatory models are complex and not widely used. A notable example is the open-source gnuspeech project.
HMM-based synthesis, also known as Statistical Parametric Synthesis, uses Hidden Markov Models trained on a corpus of recorded speech to generate speech parameters that best fit the input from the TTS front end. It’s straightforward to adapt HMM-based synthesisers to different voices and speaking styles, but drawbacks can include buzzy, flat or muffled-sounding speech. Deep neural networks have also been used as the basis for statistical parametric speech synthesis in recent years.
The speech-related technologies we’ve examined–speech recognition, natural language understanding and speech synthesis–come together in virtual assistants such as Google Now, Alexa, Cortana, and Siri.
Here’s a high-level view of how Apple’s Siri is architected:
It was Siri, in particular, that first brought these technologies to mainstream attention, according to Nuance’s Nils Lenke.
“When you look at speech recognition, ten to fifteen years ago there were still systems using closed rule-based grammars, while today every speech-recognition task we do is more-or-less driven by statistical language models,” says Lenke. “You see the same trend in natural-language understanding: rules and knowledge and explicit handling of data in AI still has a position and value–but it only works because it’s embedded into systems that are fully driven by machine learning and statistical methods. That’s what brought natural-language understanding to the main playing field, I would say–especially when Siri came out, when all of a sudden there was much bigger interest in the field.”
So how does Lenke see Siri and its general-purpose virtual assistant brethren developing?
“I think the jury is out on what will happen with them, because obviously they sparked a lot of interest, and this will never go away–it was a great help for the field, and there are many people using them. But the question is, will the mainstream take them up as general-purpose assistants? If you look at Siri and other systems, they start off with a few domains and then you add domain after domain to go more in-depth, so the system knows how to do restaurant bookings or deal with the calendar, for example. But there’s always so many more domains out there, and it’s very hard for the users to know what to expect them to do.”
Judgement may be reserved on general-purpose virtual assistants, but Lenke foresees at least two potentially fruitful developments.
“One is, you now have the Internet of Things, with devices where you don’t have a good alternative to using voice, and assistants will play a big role there. Because there will be so many different devices, many people will try things out, so we [Nuance] try to support them by giving them cloud-based ASR so they can build their own assistants. The Internet of Things will be a very diverse and interesting landscape, with many different types of devices and assistants, and we’ll see how it goes.”
The IoT may be a work in progress, but another arena, the car, is seeing plenty of virtual-assistant action right now.
“It’s a place where it’s very natural for people to talk, because they are used to it with navigation systems,” says Lenke. “But now, with cars being more internet-connected, people want to check their emails and their social media, get alerted if something happens, and send out messages. Also, the technical capabilities are there: you still have your embedded technology on-board, but you also have the ability to add cloud-based ASR and NLU. This means you can do hybrid systems that marry these two, so you’re still able to do things when there’s no connectivity–in a tunnel, for example. This combination of people expecting to do things and the capability to actually do it means that the car will be a very important market for these developments, and we [Nuance] have Dragon Drive systems out with BMW, Ford, Daimler and many other OEMs.”
But what about recent research from the US suggesting that using voice commands to control various in-car “infotainment” systems–including those using Siri, Cortana, and Google Now–can be as distracting as talking on a smartphone?
“Even listening to the car radio distracts you, and navigation systems distract you, there’s no question there,” admits Lenke. “We measure that by standardised tests, and we take it very seriously,” he insists. “We absolutely want to enable people to do things in a safe way and not create additional risks.”
But general-purpose assistants may not be the way to go here, adds Lenke: “I’m not surprised that, if you take systems like Cortana, which weren’t designed with the car in mind, distraction may be higher than ideal. What we find is that a well-designed system distracts you of the order of magnitude of your navigation system or your radio, but significantly less than if you operate a smartphone in the car–and that’s the alternative that needs to go most of all.”
Working with car manufacturers, rather than planting a general-purpose virtual assistant on the dashboard, is more likely to pay dividends in terms of safety, says Lenke: “You can design a better system by using the sensors in the car to decide when you should do something with the driver–you can see what he or she is up to, and basically estimate the cognitive load before you decide what the next move should be. These are things you can do when you work with the car manufacturers.”
Testing the virtual assistants
While writing this article, we’ve done some informal testing of Siri, Google Now, and Cortana, with a range of spoken queries, including some complex searches that test the system’s understanding of semantics and human dialogue, and the quality of its knowledge base.
Generally speaking, despite some impressive responses, there’s absolutely no chance of thinking you’re interacting with a human, or even a particularly advanced AI.
For example, I asked all three assistants the same question: “When is Spectre playing at the Odeon Milton Keynes?” with the follow-up “And how do I get there?”–an example of anaphora, in which “there” refers back to “the Odeon Milton Keynes” in the previous query. Here’s how they responded:
Siri said: “I didn’t find exactly what you were looking for, but here’s Spectre at Cineworld Milton Keynes playing today.” Displayed on-screen were showtimes for the James Bond movie at a different Milton Keynes cinema (the Odeon IMAX cinema in Milton Keynes is a relatively recent development that opened in February 2015, so Siri’s knowledge base is clearly not up to speed here). Apple’s assistant also missed the anaphora link, responding “Where would you like to go?” to the follow-up question.
Google Now did much better, saying: “Spectre is playing at Odeon Milton Keynes Stadium and IMAX” and displaying an on-screen panel with the correct showtimes. The follow-up question was handled elegantly too, via a spoken “Odeon cinema is 19 minutes from your location in light traffic. Here are your directions” and an on-screen panel with a small map, directions and a link to Google Maps.
Microsoft’s virtual assistant was the only one to struggle with basic ASR, returning “teens,” “keens” and “kings” on different occasions instead of “Keynes” (perhaps because, in order to get Cortana to work at all under Windows 10 in the UK at the time of writing, the language had to be set to US English). Cortana had no voice response to the initial query, merely displaying the results of a Bing search on the keywords. No surprise, then, that the follow-up response was similarly unhelpful: “Alright, where should I get directions to?”
There isn’t space in this article to present a detailed comparison of the leading general-purpose virtual assistants, but the above example shows that results can vary from impressive to underwhelming. Different virtual assistants will emerge as the “winner” depending on the domain and the query type, although overall we currently find Google Now to be the most successful.
The AI future
As well as envisaging HAL in 2001: A Space Odyssey, Arthur C Clarke was responsible–among many other things–for three “laws” of prediction, the third of which states that: “Any sufficiently advanced technology is indistinguishable from magic.”
Today’s virtual assistants are firmly grounded in digital signal processing, statistical modelling and machine learning, backed up by large amounts of training data and knowledge base information. Although increasingly capable, they are still a long way from “magical.”
Even so, recent developments such as near-real-time language translation–as seen in Microsoft’s Skype Translator–are mightily impressive. And when this ASR/machine translation/TTS system can be used with some sort of wearable (or even implantable) computer, the Babel fish from Douglas Adams’ The Hitchhiker’s Guide to the Galaxy, or the Universal Translator from Star Trek, will have taken a significant step from 1970s science fiction to 21st century science fact.
So how does Nuance’s research director see these technologies moving forward: new kinds of algorithms, more GPU-style computing resources, even bigger data sets to train models on?
“Deep neural networks are here to stay, obviously, and we’ll have more headroom for growth there,” says Lenke. “With HMMs, we were very good at getting them to be speaker-adapted systems, and we’re currently researching how we can apply the methods we had for HMMs to do a similar job for DNNs. Then if you think of ASR in a clean room with a headset on, many people would see that problem as nearly solved, because accuracy is so good it’s getting close to what humans can do. The challenge is, people don’t do this in clean rooms–in an office with a laptop, for example–anymore; they do it on their smartphones, in the car, at a train station or wherever they are. Then you need context. What is he or she looking at, what is the noise level, what are they doing with the smartphone? Adapting speech recognition to all this is really the challenge ahead.”
And as far as AI generally is concerned, where does Lenke stand on the dire warnings we’ve heard from the likes of Elon Musk, Bill Gates and Stephen Hawking?
“I think the risk that machines will develop and even control the human race is very limited. That’s just science fiction,” he says. “The real promise is in helping humans, who will remain the master and set the goals. I don’t see a risk that we will ever be taken over.”
This seems a reasonable position, given that the human brain, with around 100 billion neurons and 100 trillion synapses, is many times more complex than even the biggest neural network, and that the mechanisms behind properties such as consciousness remain elusive.
Of course, such difficulties will not halt the pursuit of “true” machine intelligence–often called Artificial General Intelligence, or AGI–a roadmap for which was presented in November 2015 by leading AI researchers at Facebook. The bottom line is that AI is developing fast, and we are well on the path to being able to converse with a HAL-like entity.
The image at the top of this article was taken by iStock user slphotography.