It’s only a year since Microsoft bought conversational AI company Semantic Machines, whose researchers and developers worked on Dragon NaturallySpeaking, Siri (before and after Apple bought it) and Google’s language and translation products, and combined it with the team behind Cortana.
At the time, co-founder and CEO Dan Roth told us to expect developments relatively soon: “Our approach and the direction that we’re heading is going to have a big impact, and it’s not going to be something that needs years and years to be visible to the outside world.”
The conversational engine that Satya Nadella talked about at Build is based on the work that Semantic Machines has been doing. It makes Cortana a conversational interface for Microsoft 365, and it will also be available to developers through the Bot Framework (and through things like Cognitive Services and the Dynamics 365 support products). It will be able to cope with lengthy, ongoing conversations that include interruptions, and it will understand multiple domains of knowledge, so you won’t have to say explicitly which skill you want to ask about.
“Today, the sorts of experiences that people are able to have with these systems are quite limited,” Roth said. “Usually what people encounter if they’re using a language interface today is that they say something, the system would either get it right or wrong, and either way the session resets. There’s not really the notion of a sustained conversation where there’s a context being built up that can be modified by the user or the system, with the sort of clarifications and disambiguation and corrections you’d want.”
The problem is that existing ‘conversational’ interfaces can only interact on preset topics that they’ve already been taught about, and they have a limited number of actions associated with those topics because every action has to be mapped out in advance.
“These are fairly linear systems where you have a thin machine-learning layer at the top, which you can broadly think of as an intent classifier, and what it does is it takes in some language and decides, from an inventory of predetermined intents, ‘what do we think this person is asking about?’ Are they asking about a song, or are they asking for a weather report or a podcast or about news?” Roth explained.
“Whatever it is, it’s a quantized list of things that some programming team has defined in advance. If the language you use is different from the language the system was trained on, then at that layer, you’re out of luck. Even if it is one of the templates it was trained on, then you’ll only get a predetermined response from the system. As a result, the kinds of experiences you can have are pretty flat. You can ask for things to be turned on or off, you can request certain kinds of information. But you can’t really dive into that information: you can’t ask refining questions and you can’t move from one domain into another.”
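The kind of linear, intent-based system Roth describes can be illustrated with a minimal Python sketch. Everything in it is hypothetical (the intent names, the keyword lists and the canned replies are invented for illustration and are not taken from Cortana or any other product), and a real system would use a trained classifier rather than keyword matching. The sketch only shows the shape of the limitation: a fixed inventory of intents, one predetermined response per intent, and no memory between turns.

```python
from typing import Optional

# Hypothetical, hand-written inventory of intents. A real assistant would use
# a trained classifier here; keyword matching keeps the sketch self-contained.
INTENT_KEYWORDS = {
    "play_song": ("play", "song"),
    "weather_report": ("weather", "forecast"),
    "lights_on": ("turn on the light",),
}

# One predetermined response per intent, defined in advance by programmers.
CANNED_RESPONSES = {
    "play_song": "Playing your music.",
    "weather_report": "Here is today's weather.",
    "lights_on": "Turning the light on.",
}

FALLBACK = "Sorry, I didn't understand that."


def classify(utterance: str) -> Optional[str]:
    """Return the first intent whose keywords appear in the utterance."""
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return None


def handle(utterance: str) -> str:
    """Handle one utterance in isolation; nothing is remembered afterwards."""
    intent = classify(utterance)
    if intent is None:
        return FALLBACK
    return CANNED_RESPONSES[intent]


if __name__ == "__main__":
    for line in (
        "What's the weather in Seattle?",  # matches a predefined intent
        "And tomorrow?",                   # follow-up fails: no context kept
        "Make it bright in here",          # unseen phrasing fails
    ):
        print(f"{line!r} -> {handle(line)!r}")
```

Because each call to `handle` starts from scratch, the follow-up question and the unfamiliar phrasing both fall through to the fallback reply, which is the "flat" experience described above.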
The problem is not so much that the skills for voice agents are coded separately, often by separate teams, although that doesn’t help. It’s the sheer complexity of everything that someone could be asking about. “You very quickly get into this combinatorial explosion, with the numbers of different sorts of contexts and language and what’s called ‘under-specification’, the way that people refer to things in a kind of shorthand in the middle of the conversation,” said Roth. “It’s a vastly complex and combinatorial space, and it just gets way beyond what those kinds of systems can handle. They can’t do the things where a human says ‘I’m not sure what you said to me, can you say it again or say it differently?’, or clarify, or ask ‘did you mean x or y?’ It really doesn’t have the capacity to do that.”
Richer data, richer models
The new conversational engine can handle the complexity using what Roth says are much richer data representations and much richer machine learning models. A big part of that is creating a system that can teach itself, instead of requiring developers to create all the templates and mappings from what people might say to what they want to do. “We have come up with methods for the system to be able to effectively write its own capabilities on the fly, so it isn’t limited to the kinds of experiences that developers have contemplated a priori. It’s a very flexible machine learning model that can generalise, and that’s the key to being able to handle the long tail of combinations, requests and actions each user will have. It’s about moving away from a world where things are programmed and into a world where things are learned, and where the functionality of the system can be learned.”
That doesn’t mean the engine can go off the rails like Microsoft’s Tay bot, because the learning is still in a controlled environment, Roth noted. “It’s not learning in the wild, so it doesn’t have the potential to go and learn things we don’t want it to learn — it’s more of a supervised approach.”
Letting voice agents learn functionality means that the conversational engine can scale to far more areas than an agent that has to be programmed for every domain and interaction. “This holds the potential to have this system be able to handle so many more of the things that people will inherently want it to do, without the programming team having to sit down and actually write code that instantiates that functionality,” said Roth.
Although Roth says that the methods used are novel, he compares the approach to the shift in machine translation from writing rules manually to using machine learning to create translation functions. “It’s really applying tried-and-true machine learning methods to the space of language interfaces,” he explained.
“Language is too complicated, the tail is too long, the range of human expression is too enormous. There’s just no way to have enough rules written down that you ever get a system that is satisfactory. Today, language interfaces to smartphones or smart speakers are still very much living in the equivalent world of rule-based machine translation systems where every piece of functionality is essentially written down by hand by programmers. We’re moving to this full end-to-end machine-learned approach, where instead of trying to predict or anticipate everything that everyone will want to do, you produce the data that you can learn from. You have to produce the data that will map all the richness of human expression to all the complexity of back-end functionality that people want, so eliminating all that software in between and letting a system learn those connections. We’ve figured out how to connect language to grounded, task-oriented, agent-type systems.”
The system can learn the different ways that people express the same command, Roth said. “We handle things like lexical variation in this pipeline: ‘turn on the light’, ‘switch on the lamp’ and ‘make it bright in here’ are all talking about the same thing. The system will learn on its own to cluster values in a way that it understands is the right way. It learns on its own from this pipeline how to draw these important relationships between language and the underlying action sequence that’s needed.”
That can be personalised for different users. “The system interacts with all kinds of back-end APIs and some APIs you can connect would have information about a particular user’s preferences, and so you can actually condition or reduce what’s called ‘deductive bias’ about how to make various choices in the system through that personalisation information,” said Roth.
The same pipeline that handles those lexical variations can also handle multiple languages, meaning Microsoft will be able to support the many languages that enterprise customers need. “There’s no language dependency in the system — it’s fundamentally language agnostic,” Roth said. “From our standpoint, you have language and it can be any language coming in the front, and the system can learn what actions to take in response to this language. It’s all deep learning — it’s not based on keywords or anything like that. To us, other languages can be thought of as extreme forms of paraphrase: the German for something is really not very different than another way to say something in English.”
Not having to figure out the precise sequence of commands and references to skills to get a voice assistant to run the command you want makes voice interfaces more powerful. But they also need to be smarter about connecting up the different things a user says. “It’s not just about being able to handle lexical variation, but also the order in which people want to do things. Some people go in linear fashion, but others want to explore a concept and then come back to something in order to accomplish a task,” Roth said.
If you ask your voice agent to book a restaurant and there isn’t a table at the time you want, the agent might suggest other times and contact your guests to see if that time works, or find you another restaurant. But if you want to go back to the voice agent an hour after you told it to go ahead and make the booking, and change your mind about where to have dinner, the agent needs to be able to know that you’re talking about something it already did for you, find the details, make the changes and know who needs to get the update.
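One way to picture the restaurant scenario is as conversational state that outlives a single request. The Python sketch below is an illustration under stated assumptions, not a description of how Semantic Machines or Microsoft implement it: the `Booking` and `ConversationContext` classes, their fields and the `who_to_notify` helper are all hypothetical names invented here. It shows only the bookkeeping the scenario implies: remember what was done, let a later request revise it, and work out who needs to hear about the change.

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Booking:
    """A completed action the agent took on the user's behalf (hypothetical)."""
    restaurant: str
    time: str
    guests: Tuple[str, ...]


@dataclass
class ConversationContext:
    """Keeps earlier actions so later requests can refer back to them."""
    history: List[Booking] = field(default_factory=list)

    def record(self, booking: Booking) -> None:
        self.history.append(booking)

    def latest(self) -> Optional[Booking]:
        return self.history[-1] if self.history else None

    def revise_latest(self, **changes: object) -> Booking:
        """Apply changes to the most recent booking and keep the old version."""
        current = self.latest()
        if current is None:
            raise LookupError("no earlier booking to revise")
        updated = replace(current, **changes)  # copy with selected fields changed
        self.history.append(updated)
        return updated


def who_to_notify(booking: Booking) -> List[str]:
    """Everyone on the booking needs to be told about a change."""
    return list(booking.guests)


if __name__ == "__main__":
    context = ConversationContext()
    context.record(Booking("Luigi's", "19:00", ("Ana", "Ben")))

    # An hour later: "Actually, let's go to Sakura instead."
    updated = context.revise_latest(restaurant="Sakura")
    print(updated)
    print("Notify:", who_to_notify(updated))
```

The hard part that this sketch leaves out is the one the article is about: working out from free-form language that "let's go somewhere else instead" refers to the earlier booking at all.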
That could extend to working on multiple devices at different times — the way you already can with lots of Office 365 services. In the concept video Microsoft showed at Build, the voice assistant rescheduled and cancelled meetings and brought up the right documents for the people in the first meeting.
“Imagine getting some work done during your commute in your vehicle and getting to your office, opening your laptop and having the system be aware of where you left off and continuing your thoughts and ideas,” Roth said. “Similarly when you get home, having whatever ambient computing devices are in your home be fully aware of the context of where you were and what you’re working on and what you might be interested in, whether it’s that moment or days or weeks later.”