IBM recently announced that it reached a new industry record in conversational speech recognition, which could have big implications for the future of artificial intelligence (AI).
The IBM team’s system achieved a 5.5% word error rate, down from 6.9% last year. The benchmark was measured on a difficult speech recognition task, in which the machine transcribes recorded conversations between humans discussing day-to-day topics such as buying a car. This corpus of recordings is known as SWITCHBOARD, and has been used for more than two decades to test speech recognition systems, according to a blog post by George Saon, a principal research scientist at IBM.
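For readers unfamiliar with the metric, word error rate is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition. It is not IBM's scoring tool: the function name `word_error_rate` and the sample sentences are made up for illustration, and real benchmark scoring also normalizes text (casing, punctuation, filler words) in ways this sketch ignores.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word was dropped
                curr[j - 1] + 1,     # insertion: extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one dropped word ("a") and one substitution
# ("car" -> "cart") against a 7-word reference.
wer = word_error_rate("i want to buy a new car", "i want to buy new cart")
print(f"{wer:.1%}")  # 2 errors / 7 words
```

Under this definition, a 5.5% word error rate means roughly one error for every 18 words in the reference transcript.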
IBM used deep learning technologies to reach the 5.5% record. Researchers combined Long Short-Term Memory (LSTM) and WaveNet language models with three acoustic models, according to the blog post.
“Within the acoustic models used, the first two were six-layer bidirectional LSTMs. One of these has multiple feature inputs, while the other is trained with speaker-adversarial multi-task learning,” Saon wrote. “The unique thing about the last model is that it not only learns from positive examples but also takes advantage of negative examples – so it gets smarter as it goes and performs better where similar speech patterns are repeated.”
The previous record was set by Microsoft’s Artificial Intelligence and Research group in October 2016, when researchers developed a system that they claimed recognized speech as accurately as a professional human transcriptionist, with a word error rate of 5.9%. However, Saon argued in his post that human parity is actually a 5.1% word error rate, lower than any company has yet achieved.
“We’re not popping the champagne yet,” Saon wrote. “While our breakthrough of 5.5% is a big one, this discovery of human parity at 5.1 percent proved to us we have a way to go before we can claim technology is on par with humans.”
Reaching human-level performance in AI tasks such as speech or object recognition remains a scientific challenge, according to Yoshua Bengio, leader of the University of Montreal’s Montreal Institute for Learning Algorithms (MILA) Lab, as quoted in the blog post. Standard benchmarks do not always reveal the variations and complexities of real data, he added. “For example, different data sets can be more or less sensitive to different aspects of the task, and the results depend crucially on how human performance is evaluated, for example using skilled professional transcribers in the case of speech recognition,” Bengio said.
Saon also noted that finding a standard measurement for human parity is a complex task as well. While many use SWITCHBOARD, another corpus called CallHome offers a different set of linguistic data created from colloquial conversations between family members, on topics that are not pre-arranged. These conversations are more difficult for machines to transcribe than those from SWITCHBOARD. IBM achieved a 10.3% error rate on this measure, but determined that human parity would be 6.8%.
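To put the two benchmarks side by side, the short Python sketch below computes how far IBM's reported error rates sit from its own human-parity estimates, both in absolute percentage points and relative to the human figure. The numbers are those quoted in this article; the helper name `parity_gap` is hypothetical and used only for illustration.

```python
# Error rates (in percent) as reported in the article.
results = {
    "SWITCHBOARD": {"system": 5.5, "human": 5.1},
    "CallHome": {"system": 10.3, "human": 6.8},
}


def parity_gap(system_wer: float, human_wer: float) -> tuple[float, float]:
    """Return (absolute gap in points, relative gap in percent of human WER)."""
    absolute = system_wer - human_wer
    relative = absolute / human_wer * 100
    return round(absolute, 1), round(relative, 1)


for corpus, wer in results.items():
    absolute, relative = parity_gap(wer["system"], wer["human"])
    print(f"{corpus}: {absolute} points above human parity ({relative}% relative)")
```

By this arithmetic, the system is 0.4 points (about 7.8% relative) above the human estimate on SWITCHBOARD, but 3.5 points (about 51.5% relative) above it on CallHome, which is consistent with the point that CallHome is the harder test.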
“The ability to recognize speech as well as humans do is a continuing challenge, since human speech, especially during spontaneous conversation, is extremely complex,” said Julia Hirschberg, a professor and Chair at the Department of Computer Science at Columbia University, in the blog post. “It’s also difficult to define human performance, since humans also vary in their ability to understand the speech of others. When we compare automatic recognition to human performance it’s extremely important to take both these things into account: the performance of the recognizer and the way human performance on the same speech is estimated.”
IBM’s breakthrough could have major implications for the future of AI and the Internet of Things (IoT) in the enterprise, according to Mark Hung, research vice president and lead analyst of Internet of Things at Gartner.
“With the proliferation of conversational AI platforms such as Alexa and Google Assistant, continued reductions in error rate will be imperative to drive greater adoption of speech as the UI for consumer and enterprise applications,” Hung said.
IBM has made major investments in its Watson division, including a new $200 million global headquarters for Watson Internet of Things in Munich, Germany, which opened as part of a $3 billion investment in IoT that IBM pledged in 2014. IBM also recently added diarization to its Watson Speech to Text service, making it possible for the service to distinguish individual speakers in a conversation.
The 3 big takeaways for TechRepublic readers
1. Last week, IBM announced that it achieved a new industry record in speech recognition, with a 5.5% word error rate.
2. Last year, Microsoft claimed to have reached human parity with its speech recognition system’s error rate of 5.9%, but IBM researchers argue that human parity would actually be 5.1%.
3. IBM’s breakthrough could have implications for improving the use of artificial intelligence and the Internet of Things in the enterprise.