
Executives from NVIDIA, Deepgram, and Sharpen gathered via Zoom on Wednesday to discuss the current state of the voice tech industry, as well as where it’s going. Growth in artificial intelligence (AI) technology and machine learning has had a huge hand in lifting the market, but it’s only the beginning.

Voice tech has seen rapid growth in recent years and isn’t predicted to stop: The market is estimated to be worth nearly $32 billion by 2025, a Grand View Research report found. With smart speakers and home assistants like Amazon Alexa, Apple’s Siri, and Google Assistant making voice tech mainstream, most consumers are familiar with the concept.


However, the technology is more complex than people may think, and it has come a long way.

Panel moderator Jeff Herbst described a concept called the Uncanny Valley. “Basically, what that means is that in order to get something to full computer realism—to see a computer generated object that makes you think it’s real—it has to be really good, getting to 80% or 90%, or maybe even 95%,” said Herbst, NVIDIA vice president of business development and head of Inception GPU Ventures. “I feel like we’re kind of there with voice right now, we’re about to push this over the threshold.”

The technology wasn’t always at this level, though, said panelist Scott Stephenson, CEO and co-founder of Deepgram, a deep learning speech recognition system.

“Now is the time [for voice tech]. It isn’t like the ideas of AI are super new, [it’s] many things coming together all at once: Where you have a ton of data, you have a ton of compute…then you have the talent and the know-how in order to put all this together,” Stephenson said.

“In the ’80s you had like the first real applications coming out, but the compute and data just wasn’t there. And there are still more learnings to be had. But push through the ’90s and early 2000s, things are gaining steam and all of that, but there was just a real step function in 2012 to 2015 where it just became obvious that deep learning was the way to put all of this together,” Stephenson said. “Compute is here now, there’s know-how to do it. It’s sort of lagged and [took] a long time to come to fruition, but now it’s actually happening.”

The main components of speech analytics

When referring to speech analytics, there are three main components, said Jon Cohen, NVIDIA senior director of artificial intelligence software.

“The first is speech recognition. There’s audio that [you want] to turn into a transcript. Once you have the text, then you presumably want to analyze that text, whether it’s determining the intent of a request or translating it to another language, or initiating a search query and looking up information from a database. This would be like a natural language understanding,” Cohen said.

“Then in an interactive system, if you’re formulating a response and speaking it back to the user, you need to synthesize a spoken response,” Cohen noted. “This is called speech synthesis or text to speech. It’s kind of the opposite of speech recognition. And what it produces is the audio of the human speech, which hopefully is natural and emotional and doesn’t sound weird and robotic.”

“In a typical interactive setting you want to do all of these things interactively so that you can have a conversation,” Cohen said. “If we’re all having a conversation and you ask me a question, you don’t wait six seconds for me to respond. If you want to actually build a system that’s actually interactive in a useful way, that feels out of the Uncanny Valley, it actually has to respond very quickly. The computational aspect of all three of these pipelines and chaining them together and getting response back quickly is very important.”
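The chain Cohen describes (recognition, then understanding, then synthesis) can be sketched in a few lines of Python. This is only an illustration, not NVIDIA's or Deepgram's software: the three stage functions are hypothetical placeholders that pass text around instead of running real models, and the canned replies are invented. What the sketch does show is how the stages feed one another and how the delay of each stage adds up to the total response time a user experiences.

```python
import time
from typing import Dict, Tuple

# Hypothetical canned replies, keyed by intent label (invented for this sketch).
REPLIES = {
    "get_weather": "It is sunny today.",
    "unknown": "Sorry, I did not catch that.",
}


def recognize_speech(audio: bytes) -> str:
    """Placeholder speech recognition: treats the audio bytes as UTF-8 text."""
    return audio.decode("utf-8")


def understand(transcript: str) -> str:
    """Placeholder language understanding: maps a transcript to an intent label."""
    if "weather" in transcript.lower():
        return "get_weather"
    return "unknown"


def synthesize(reply: str) -> bytes:
    """Placeholder speech synthesis: returns the reply as bytes instead of audio."""
    return reply.encode("utf-8")


def respond(audio: bytes) -> Tuple[bytes, Dict[str, float]]:
    """Chain the three stages and record how long each one takes, in seconds."""
    timings: Dict[str, float] = {}

    start = time.perf_counter()
    transcript = recognize_speech(audio)
    timings["recognition"] = time.perf_counter() - start

    start = time.perf_counter()
    intent = understand(transcript)
    timings["understanding"] = time.perf_counter() - start

    start = time.perf_counter()
    spoken = synthesize(REPLIES[intent])
    timings["synthesis"] = time.perf_counter() - start

    # The user waits for the sum of all three stages, not for any single one.
    timings["total"] = sum(timings.values())
    return spoken, timings


if __name__ == "__main__":
    spoken_reply, stage_timings = respond(b"What is the weather like?")
    print(spoken_reply.decode("utf-8"))
    print(stage_timings)
```

In a real system each placeholder would be a model call, and the per-stage timings would indicate which stage to speed up first to keep the conversation feeling natural.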

Voice tech use cases for the present and future

Call centers are one of the main areas where voice tech is being used, said Adam Settle, vice president of product at Sharpen, a contact center platform that uses Deepgram.

“A lot of our use cases in the contact center come back to data accuracy. Contact centers are some of the most data rich organizations on the planet, because it’s down to the minute that they have everything planned out, what agent was at call, chat, etc.,” Settle said.

“Being able to rely on a transcription, both on the front and back end of the call, is paramount to something like emotion detection, because they’re trying to use speech for fraud prevention,” Settle said. “They’re looking for coaching opportunities. They’re looking for really accurate transcription.”

An up-and-coming area where voice tech is being used is healthcare, a use that has been bolstered by the coronavirus pandemic.

“The US government released a dataset of COVID-19 research articles. It’s like 20,000 research articles on SARS and COVID-19. As a pharmaceutical researcher, you might want to ask, has this particular drug been tried on SARS or MERS patients? Did they see this adverse outcome? How on earth are you going to ask that question from a trove of 20,000 documents?” Cohen said.

“An automated system can ingest documents, understand them to some level, understand the intent of the question, and then try to match the documents with the information you’re looking for,” Cohen said. “That’s a great problem for natural language understanding. And in fact, it’s a problem a lot of groups, including ours, are working on. You’ll see real progress on things like that.”
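The matching step Cohen mentions can be illustrated with a deliberately crude Python sketch. It is not natural language understanding and not the approach any of the panelists' teams use: it simply ranks documents by how many words they share with the question. The sample "abstracts" and the question are invented placeholders, not entries from the government dataset.

```python
import re
from typing import List, Tuple

# Invented placeholder texts, standing in for research abstracts.
SAMPLE_DOCUMENTS = [
    "Placeholder abstract: drug A was tested on SARS patients and nausea was reported.",
    "Placeholder abstract: a survey of hospital ventilation systems.",
    "Placeholder abstract: drug B was tested on MERS patients.",
]
QUESTION = "Was drug A tested on SARS patients?"


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(question: str, document: str) -> int:
    """Count the distinct question words that also appear in the document."""
    return len(set(tokenize(question)) & set(tokenize(document)))


def rank(question: str, documents: List[str]) -> List[Tuple[int, int]]:
    """Return (document index, score) pairs, best match first; ties keep input order."""
    scored = [(index, overlap_score(question, doc)) for index, doc in enumerate(documents)]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for index, score in rank(QUESTION, SAMPLE_DOCUMENTS):
        print(score, SAMPLE_DOCUMENTS[index])
```

Word overlap breaks down quickly (it cannot tell that "adverse outcome" and "side effect" mean similar things, for example), which is the gap that language-understanding models are meant to close.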

Voice tech is also proving helpful in daily use at doctors’ offices for note-taking, Stephenson said.

“How doctors take notes, they have a recording device and they speak into it. Before, maybe a scribe that works in their office transcribes it, but what is becoming very common now is to have a virtual scribe; have somebody, somewhere else transcribe. That market is huge,” Stephenson said. “And there are many companies that are trying to address this right now, but it has not been automated yet at all. The terminology is difficult. Or, if you think a doctor’s handwriting is bad, wait until they speak into a microphone.”

“Another one is from a contactless interaction standpoint, this is something that was actually already being done pre-COVID,” Stephenson said. “You’re a radiologist and you’re looking at X-rays and trying to figure out what’s wrong, you are also doing an electronic health record for that.”

“It used to be that you actually had to type it, but now there are systems that you talk into. The user experience is horrible, though, everybody hates buying them, but they have to use it,” Stephenson said. “Nevertheless, there’s going to be a revolution there too, and making that smoother, more accurate, and faster.”
