Getting to know you: The race to build a better virtual assistant

Speech recognition company Nuance is working on a project that could allow virtual assistants to know what you want before you've asked for it.

In the science fiction movie Her a man falls in love with an advanced artificial intelligence called Samantha.

Today the closest systems we have to Samantha are perhaps Apple's Siri and Google Now, intelligences that satisfy rather more prosaic needs, such as finding a deli so you can grab a sandwich.

But beyond their limited scope, what's missing from today's virtual assistants is an understanding of you and your place in the world.

"One of the fathers of AI, a guy called Hubert Dreyfus, nailed it. He said 'to get human levels of accuracy the system requires human senses'," said Seb Reeve.

Reeve is a director of product management at Nuance - the company widely believed, following various revelations and statements, to provide the speech recognition software behind Apple's Siri, although it still refuses to confirm this publicly.

"The domain of the conversation - which room I'm in, with whom I'm speaking, your body language - all of those senses are important for me to continue this conversation in the way I am now," he said.

"Natural language understanding is not just about saying 'Did you recognise what I said?', 'Are the words coming across OK?', but everything I say has a rooted concept in the everyday world.

"You grew up in the world I grew up in, we have a common context we share. Does a machine understand everything about the world? No. If we don't teach it about the world how can it decode the meaning of the language. So all these concepts need to be taught."

Understanding the world

Nuance is working on how to build systems that understand more about the world and your life - your likes and dislikes, where you are, what you are doing at a moment in time.

At the forefront of this work is Project Wintermute, a service that would bring together speech recognition, natural language understanding and other elements of artificial intelligence to create a virtual assistant that more intuitively understands our wants and needs.

Wintermute is a cloud service that would sit behind apps and services - providing both the speech recognition and the understanding of what has been said. Crucial to that understanding would be its ability to add context to a user's requests, drawing on information about the user and a log of their activity, and its ability to interrogate third-party data.

What sort of experiences would Wintermute enable? John West, principal solutions architect in Nuance's mobile group, gave an example.

"You could start on a mobile device and say 'Tell me how United are doing', and it knows that United to me is Hereford United and not Manchester United. So it will look up the result and see they're losing and come back and say 'I'm sorry they're losing 2-nil at the moment', which would be typical," he said.

"In the car you say 'Put on the Rolling Stones playlist' and it will go play your Rolling Stones playlist from your music provider. Then you move from your car into your house where you start a PC up or your Sonos system and say 'Put on the playlist I was listening to' and immediately it puts on the Rolling Stones playlist from just where you left off.

"Maybe you start watching a film if you're on a laptop but then say 'I've had enough of this, let's go sit on the settee and watch it on the TV'. You say 'Throw on the programme I was watching earlier' and it starts it from the same point you were watching it.

"You then say 'How's the game going?' and it immediately knows you asked about the game earlier and says 'They're back in the game, it's now two all'.

"We have all these components all available now. Bringing them together is what we're talking to customers about."
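As a rough illustration of the idea West describes, the sketch below keeps a per-user profile and activity log and uses them to resolve ambiguous requests. It is not Nuance's design: the `UserContext` class, its methods and the stored keys are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class UserContext:
    """Hypothetical per-user store: standing preferences plus an activity log."""
    preferences: dict = field(default_factory=dict)
    activity: list = field(default_factory=list)

    def resolve(self, kind, mention):
        # Map an ambiguous mention to this user's own meaning, if one is stored;
        # otherwise fall back to the words as spoken.
        return self.preferences.get((kind, mention.lower()), mention)

    def remember(self, kind, value):
        # Log what the user did so later requests can refer back to it.
        self.activity.append((kind, value))

    def last(self, kind):
        # Most recent logged item of the given kind, or None if there is none.
        for logged_kind, value in reversed(self.activity):
            if logged_kind == kind:
                return value
        return None


ctx = UserContext(preferences={("team", "united"): "Hereford United"})

# On the phone: "Tell me how United are doing"
team = ctx.resolve("team", "United")
ctx.remember("team", team)

# In the car: "Put on the Rolling Stones playlist"
ctx.remember("playlist", "Rolling Stones")

# At home: "Put on the playlist I was listening to"
print(ctx.last("playlist"))  # Rolling Stones

# Later: "How's the game going?"
print(ctx.last("team"))  # Hereford United
```

Because the log lives with the user rather than with any one device, the phone, the car and the living room could all consult the same history, which is the cross-device continuity West is describing.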

Initially Wintermute would likely be employed by companies like Samsung, whose products span TVs, mobiles, computers and wearables, and which want to offer assistants that provide a consistent experience across their different devices and, in doing so, give people a reason to stick with their brand.

"The concept is as you change your television you go 'I'll buy the television from that company that's providing me this service so I've got it totally integrated' [with my other devices]. You build this loyalty to a brand across multiple devices," said West.

Because such a service is likely to be seen as a competitive advantage, it is less likely that companies will share profiles between assistants so that, for instance, Apple TV would know you began watching Transformers on Amazon Instant Video on your tablet earlier in the day.

What's interesting about Wintermute's possibilities is that Nuance already has relationships with many of the companies that build our TVs, phones, computers and even cars. Nuance provides voice recognition services for Samsung Smart TVs, in-car dictation systems for Audi and BMW and, albeit not officially, technologies for Apple's Siri.

Nuance's cloud service handles 10 billion speech recognition transactions each year from software running on TVs, mobiles and cars. About 20,000 mobile app developers use the Nuance Cloud, a service that provides speech recognition capabilities for popular smartphone apps such as iTranslate on iOS and the Evi virtual assistant recently bought by Amazon.

One step at a time

Fictional AIs often converse much like a person, switching from topic to topic with ease.

But to build a reliable virtual assistant you don't start by building a general-purpose conversationalist - rather a system that is good at discussing lots of specific topics.

"Part of the trickery is recognising what's important," said Reeve.

Which elements of a sentence matter depends on the context: in a voicemail message it can be the digits in a telephone number or the time that someone says they'll meet up with you, in a doctor's dictation the drug or the dosage and in a movie app genres or the names of directors.

"We give it a lot of data but then we give it hard coded structure to try and interpret that data," said West.

"If it's TV searching we tell it about how a movie director relates to a movie, but then we give it examples of what movie directors look and sound like to train it and that builds up its understanding."

Identifying the elements of an utterance that are crucial to discerning meaning in different situations allows Nuance to train its systems to recognise particular subsets of words. This targeted approach is a far more effective way of improving user satisfaction than simply trying to increase overall speech recognition accuracy.
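To make the point concrete, here is a minimal sketch of pulling out only the elements that matter in a given domain. It uses hand-written regular expressions purely as a stand-in; the systems described above are trained on data, and the domain names, slot names and patterns below are assumptions made for illustration.

```python
import re

# Hypothetical per-domain patterns: each domain cares about different details.
SLOT_PATTERNS = {
    "voicemail": {
        "phone_number": re.compile(r"\b\d[\d ]{8,}\d\b"),
        "time": re.compile(r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b", re.IGNORECASE),
    },
    "dictation": {
        "dosage": re.compile(r"\b\d+\s?(?:mg|ml)\b", re.IGNORECASE),
    },
}


def extract_slots(domain, transcript):
    """Return the first match for each slot that matters in this domain."""
    found = {}
    for name, pattern in SLOT_PATTERNS.get(domain, {}).items():
        match = pattern.search(transcript)
        if match:
            found[name] = match.group(0)
    return found


voicemail = "Call me back on 020 7946 0018, I'll be free at 4pm"
dictation = "Prescribe amoxicillin 500 mg twice daily"

print(extract_slots("voicemail", voicemail))
print(extract_slots("dictation", dictation))
```

Getting the phone number or the dosage right matters far more to the user than getting every filler word right, which is why accuracy on these subsets can count for more than overall accuracy.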

"There's a difference between the reality of perceived accuracy vs real accuracy," said Reeve.

West said that over time virtual assistants will be able to address a much broader range of domains.

"Our Dragon Mobile Assistant has something like 20 different domains and the way you talk to them are very natural."

It's OK to make mistakes

Just as important as accurate recognition is that a system can engage in a natural-sounding dialogue that doesn't make the user work too hard to be understood, said Reeve.

"Good recognition and bad dialogue is still no answer for the user."

Part of making dialogue smarter is building systems that are able to rapidly check information with the user.

"You don't get so bound up on it having to be perfect recognition. If you can make the error correction really good and effortless then people are much more forgiving of the fact that recognition is never going to be perfect," said Reeve.

Combine this error checking with an ability to parse ambiguous phrases and you can save the user the hassle of making their request explicit. West gives the example of someone in the south of England asking 'Find me a hotel for tonight in Newcastle'. The service knows the user's location and that it's already 4pm, so it estimates they're most likely asking for accommodation in Newcastle in Staffordshire, not in Tyne and Wear, an assumption that can then be checked with the user.
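The Newcastle example can be sketched in a few lines. This is an illustration under stated assumptions, not a description of Nuance's service: the coordinates are approximate, the candidates are listed in an assumed order of prominence, and the average travel speed and latest acceptable arrival time are invented for the example.

```python
import math

# Approximate (latitude, longitude), in an assumed order of prominence.
CANDIDATES = [
    ("Newcastle upon Tyne, Tyne and Wear", 54.978, -1.618),
    ("Newcastle-under-Lyme, Staffordshire", 53.011, -2.229),
]

ASSUMED_SPEED_KMH = 80.0     # crude average travel speed (assumption)
LATEST_ARRIVAL_HOUR = 20.0   # latest plausible check-in time (assumption)


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def guess_place(user_lat, user_lon, hour_now):
    """Pick the most prominent candidate reachable in time, else the nearest."""
    def dist(candidate):
        return distance_km(user_lat, user_lon, candidate[1], candidate[2])

    for candidate in CANDIDATES:
        if hour_now + dist(candidate) / ASSUMED_SPEED_KMH <= LATEST_ARRIVAL_HOUR:
            return candidate[0]
    return min(CANDIDATES, key=dist)[0]


def confirmation_prompt(place):
    # The guess is only an assumption, so check it with the user.
    return f"Did you mean {place}?"


# A user in central London (approximate coordinates) asking at 4pm.
print(confirmation_prompt(guess_place(51.507, -0.128, 16)))
```

Asked at 9am from the same spot, the same function would settle on Tyne and Wear instead, since there is time to get there. Either way the guess goes back to the user as a question rather than a decision.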

Again the ability of assistants to intuit what you are asking is going to be largely dependent on how much information you're prepared to share about yourself.

Willingness to share information about where we are and what we're doing will increasingly allow assistants to deliver a targeted service rather than a vanilla one, said West.

Speech-driven systems that can track what's been asked of them can similarly react far more intuitively. Reeve demos a banking app where he first says 'Make a payment for £50 this Thursday' and then 'No, make it Friday'. Because the service behind the app has retained information about what he asked initially, it is able to recognise he is referring to the earlier payment and make the change.
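A toy version of that banking exchange shows why retained state matters. The `Payment` and `BankingDialogue` names are hypothetical, and the keyword matching is a deliberately naive stand-in for real language understanding.

```python
import re
from dataclasses import dataclass, replace
from typing import Optional

DAYS = ("monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday")


@dataclass(frozen=True)
class Payment:
    amount_gbp: int
    day: str


class BankingDialogue:
    """Hypothetical dialogue manager that remembers the payment under discussion."""

    def __init__(self):
        self.pending: Optional[Payment] = None

    def handle(self, utterance):
        text = utterance.lower()
        day = next((d for d in DAYS if d in text), None)
        amount = re.search(r"£(\d+)", text)

        if "payment" in text and amount and day:
            # A complete new request: start tracking it.
            self.pending = Payment(int(amount.group(1)), day)
        elif text.startswith("no") and day and self.pending:
            # A correction: keep the remembered amount, change only the day.
            self.pending = replace(self.pending, day=day)
        return self.pending


dialogue = BankingDialogue()
print(dialogue.handle("Make a payment for £50 this Thursday"))
print(dialogue.handle("No, make it Friday"))
```

Without the stored `pending` payment, 'No, make it Friday' would be meaningless on its own: there would be no amount and nothing for 'it' to refer to.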

Beyond assistants understanding what's being asked of them it's important for them to address users in a way that doesn't get annoying or test their patience.

"We did an analysis of a Lloyds Bank customer service application a few years ago where it said 'Please' 18 times in a single dialogue. Much as we like to think we say 'Please' regularly it doesn't happen that often," said West.

"When you're using a personal assistant you are going to have an ongoing dialogue with it. You want some variation in the way the machine interacts and talks with you and we do that as much as we can. We use almost natural language on the output almost as much as we do on the input, so we keep it fresh."

In conversation our understanding of what's been said stems from far more than just recognising the words being spoken. Cracking that 'everything else' is where Nuance is heading next.

"Speech is really the beginning of that journey but this whole idea of AI and teaching machines about the world and to deal with real interactions, that's the journey, that's where we want to get to," said West.