As a chronic mumbler I’ve had some significant challenges getting Amazon’s Alexa to comprehend my commands. I find I have to enunciate very clearly, and I have a very bland northern American accent. It’s even worse for my native Massachusetts friends who have very thick Boston accents.
Clearly, Alexa is a basic consumer-level artificial intelligence (AI) product, but AI usage in business demands a higher standard. Being able to correctly input language that can be consistently understood by AI software is essential for a company’s return on investment in such products.
I spoke about the concept of language and how it interfaces with AI with Ian Firth, VP of products at Speechmatics, a speech-recognition software development company, and Dan Kobran, co-founder of Paperspace, an AI development platform.
Scott Matteson: What is the AI accent gap, and what challenges does it cause?
Ian Firth: Humans can often find it hard to communicate even when they are from the same city or country—even though the language is the same. The variety of accents and dialects in one single language can be huge, and trying to understand them all as a human being is challenging in itself.
When it comes to automatic speech recognition (ASR) technology, the same applies. The engine is required to understand varieties of accents, dialects, and even slang within a single language. To get the value out of what people are saying—as a human or ASR engine—you need to understand what is being said.
Accents and dialects add an extra barrier to the ability to communicate. When it comes to ASR technology, voice needs to be comprehended and actioned in a simple and easy way. The challenge for speech technology is to break down the language barrier and deliver understanding, context, and value to a conversation or speaker.
Scott Matteson: What possible solutions are involved?
Ian Firth: There are two possible solutions when it comes to tackling the challenge with language accents and dialects.
The first is to make a speech recognition engine that is designed to work best for accent-specific language models. For example, this means creating a language pack for Mexican Spanish, Spanish Spanish, and so on. With this approach, you get great accuracy for one specific accent and, academically speaking, you will get highly accurate results in most cases. This approach requires the right model for the right speech, and there are circumstances where this solution doesn’t work.
The second solution is to build an any-context speech recognition engine that understands all Spanish accents regardless of the region, accent, or dialect. This approach does have its own challenges around the technical ability to build an engine in this way and the time it takes to build. However, the results speak for themselves with frictionless and seamless user and customer experiences.
Scott Matteson: How do the solutions work from a technological perspective?
Ian Firth: ASR was a technology and not a product when it was first brought to market. Engineers would ask themselves, “How do we get the best accuracy results from what we have?” So traditionally, ASR engineers only considered the accent-specific solution as a viable way to address this problem and the accent gap. From an engineering perspective, it made sense to constrain the problem to a single-accent model because it was the best way to deliver the best accuracy results for the specific accent or dialect.
This approach also required ASR providers to build specific models for specific markets. For example, a medical company would require a completely different vocabulary to a utilities company, and this raises a huge challenge when it comes to ASR technology. If we look back to the late 1990s, engines required the user to train the ASR to their voice rather than the engine being speaker-independent.
As compute and machine learning (ML) have improved and evolved over the past 10 years, ASR providers have been able to widen the boundaries of what is possible with voice technology. As it became more widely adopted, it was apparent to engineers that you would never know the accent or dialect of the speaker before they used the technology, only the language. So how do you select the correct model? You have to make assumptions and a best guess. As adoption increased and became more globalized, the problem became more apparent.
How did we solve this problem? With an all-encompassing language model, you may not get the best accuracy for a specific speaker, but you are likely to get the best accuracy across the board for that specific language. We set about building an any-context speech recognition engine where we could build accent-agnostic language models. We found a way to build language models that were small enough in footprint, which makes our ASR consumable in the real world.
It cannot just be pure math behind the machine learning; the real-world applicability needs to be understood, along with how the technology adds value to businesses.
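The trade-off Firth describes, where one accent-agnostic model may lose to a specialist on its home accent but win on average when the speaker's accent is unknown, can be sketched in a few lines of Python. This is only an illustration: every accuracy figure and every name below is hypothetical, invented to show the shape of the argument, and none of it comes from Speechmatics or from any measurement.

```python
# Hypothetical sketch: accent-specific vs. accent-agnostic models.
# All accuracy figures are made-up illustrative numbers, not measurements.

ACCENTS = ["mexican", "castilian", "argentinian"]

# ACCENT_SPECIFIC[model][speaker_accent] -> assumed accuracy of that model
# when it is used on a speaker with the given accent.
ACCENT_SPECIFIC = {
    "mexican":     {"mexican": 0.95, "castilian": 0.80, "argentinian": 0.78},
    "castilian":   {"mexican": 0.79, "castilian": 0.95, "argentinian": 0.77},
    "argentinian": {"mexican": 0.78, "castilian": 0.76, "argentinian": 0.95},
}

# One model for the whole language, slightly below each specialist at home.
ACCENT_AGNOSTIC = {"mexican": 0.91, "castilian": 0.91, "argentinian": 0.91}


def mean_accuracy(per_accent):
    """Average accuracy, assuming every accent is equally likely."""
    return sum(per_accent[accent] for accent in ACCENTS) / len(ACCENTS)


def best_single_specific_model():
    """The best guess available when only the language is known."""
    return max(ACCENT_SPECIFIC, key=lambda m: mean_accuracy(ACCENT_SPECIFIC[m]))


if __name__ == "__main__":
    guess = best_single_specific_model()
    print("best accent-specific guess:", guess)
    print("its mean accuracy:", round(mean_accuracy(ACCENT_SPECIFIC[guess]), 3))
    print("accent-agnostic mean accuracy:", round(mean_accuracy(ACCENT_AGNOSTIC), 3))
```

With these assumed numbers, the specialist still wins for speakers of its own accent, but the single accent-agnostic model has the higher average once the accent cannot be known in advance.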
Scott Matteson: What are the benefits for businesses and consumers?
Ian Firth: Ultimately, an accent-agnostic approach is beneficial to everyone. The business reduces its costs by running just one model for one language, and the consumers get the best experience and value because they are understood every time.
This solution also in turn benefits the ASR provider. It is a labor-intensive task to keep language models up to date and improving, so reducing the number of language models means the ASR provider can also deliver customers the best service and technology.
Scott Matteson: Where is the trend headed?
Ian Firth: I still believe an accent-agnostic approach is the right solution to the problem of the accent gap in speech recognition. We can’t expect consumers to adapt their accent or dialect to suit a language model—the ASR provider is responsible for doing that.
At Speechmatics, we have now created Global English and Spanish, and we will continue to roll out global languages. We want to build a global language wherever possible, and as ASR continues to become more accurate, we will continue to make this possible.
It is important to note that from a cost, build, and consumer experience perspective, it isn’t sustainable to continue to build more accent-specific language packs. ASR is growing as an industry and will continue to grow as, increasingly, everyone in the world needs to be supported by speech technology. This has become more apparent and has accelerated this year due to COVID-19 and the rate of adoption for use cases such as captioning, transcription, monitoring, asset management, web conferencing, and contact center analytics.
Scott Matteson: What is natural language understanding?
Dan Kobran: Natural language understanding is a subtopic of AI that basically means reading comprehension. One reason it’s a celebrity subtopic is because there’s not really a difference between solving NLU and solving generalized AI. So, when we’re talking about the dream of NLU we’re really talking about the dream of AI itself: To match and then augment human intelligence.
Scott Matteson: Why is it getting so much hype these days?
Dan Kobran: NLU is not new. We’ve been trying to figure out how to get machines to understand the infinite variety of human language for decades. What’s new is that there are some great new enabling technologies that are showing lots of promise and that we are becoming more aware of NLU applications in our everyday lives. Some of the most common production applications right now include machine translation of text between languages on the internet, question answering by a smart assistant like Siri or Alexa, and sentiment analysis for customer requests on the phone or in chat.
Scott Matteson: From an AI practitioner’s perspective, what makes NLU particularly challenging?
Dan Kobran: Language is difficult! We say things literally, or tacitly, or just barely hint at them, or allegorize them, or leave them in the empty space between sentences, ad infinitum. Language is a representation of thought (albeit perhaps a lossy one) and there’s an awful lot for an ML model to learn. That’s why NLU isn’t solved with some breakthrough single algorithm but rather through generalized AI, because the complexity of language is a proxy for the complexity of intelligence more generally.
Scott Matteson: What is OpenAI’s GPT-3 and how does it work? What are the benefits and requirements?
Dan Kobran: GPT-3 is a pre-trained language model with 175 billion parameters that is especially good at predicting and generating text. In other words, it’s a language model that’s already read A LOT of stuff and can use that knowledge to predict what comes next when given an input. More specifically, it’s a transformer (a certain kind of neural network-based model) that benefits from being able to process data in parallel rather than in sequence. So, it’s easy to work with, easy to train, and already comes with some startling capabilities out of the box.
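To make "predict what comes next when given an input" concrete, here is a deliberately tiny sketch in Python. It is not GPT-3 and not a transformer: it is a simple bigram counter that has "read" a single sentence and always picks the word it has most often seen following the current one. The corpus and the function names are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def predict_next(following, word):
    """Return the most frequently seen next word, or None if unseen."""
    candidates = following.get(word)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


def generate(following, prompt, max_words=5):
    """Extend the prompt one word at a time using the greedy prediction."""
    output = prompt.lower().split()
    if not output:
        return ""
    for _ in range(max_words):
        nxt = predict_next(following, output[-1])
        if nxt is None:
            break
        output.append(nxt)
    return " ".join(output)


# Hypothetical one-sentence "training set".
CORPUS = "the engine hears the speaker and the engine hears the words"
model = train_bigrams(CORPUS)

if __name__ == "__main__":
    print(generate(model, "the", max_words=3))
```

A model like GPT-3 replaces the counting table with a very large neural network and looks at far more context than the previous word, but the generation loop has the same shape: predict the next piece of text, append it, and repeat.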
Scott Matteson: What are some subjective examples of GPT-3 in action?
Dan Kobran: GPT-3 is unfortunately closed source due to a licensing deal between OpenAI and Microsoft.
However, some exciting use cases have already emerged, including automatic email writing, semantic programming (e.g., describing what you want your application to do in lay terms), conversational chatbots, and more.
There are also some extremely exciting possibilities waiting to be realized such as training GPT-3 on medical literature to build a reference or Q&A bot for physicians and health researchers.
Scott Matteson: Do you think GPT-3 is ready for mainstream use?
Dan Kobran: GPT-3 has two orders of magnitude more parameters than GPT-2, so to some extent it’s the same technology only greatly improved.
It’s clear that GPT-3 is ready today for narrow use cases that require universal language models, but beyond that, GPT-3 is not context-aware and is therefore limited in its fundamental abilities and applications.
Scott Matteson: What needs to happen in order for GPT-3, or any NLU framework for that matter, to really work in an enterprise setting today?
Dan Kobran: As Professor Yann LeCun recently pointed out, GPT-3 is not a sentient intelligence. It’s a language model that can produce sentences one word at a time. GPT-3 does not actually understand the world around it or much of anything beyond the patterns it’s found in language.
Yet GPT-3 is an enormous step toward useful AI. It’s already useful for certain text generation applications, but its lack of understanding beyond a shallow depth is the factor limiting its usefulness today.