Microsoft recently reached a new milestone in its ability to recognize conversational speech, achieving a 5.1% word error rate (WER). The achievement, detailed in a Sunday blog post, bests Microsoft’s previous record of 5.9% and is closer to human parity.
The new WER was measured on Switchboard. According to the blog, “Switchboard is a corpus of recorded telephone conversations that the speech research community has used for more than 20 years to benchmark speech recognition systems.”
In the Switchboard benchmark, speech recognition systems are tasked with transcribing conversations about topics such as politics or sports. While Microsoft’s 5.9% rate was originally touted as human parity, researchers said that the 5.1% number is actually a better representation of human parity.
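For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that standard calculation, not Microsoft's scoring tool; the function name and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from transcript
                curr[j - 1] + 1,     # insertion: extra word in transcript
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("the" -> "a") and one deletion ("last").
reference = "did you watch the game last night"
hypothesis = "did you watch a game night"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

By this measure, a 5.1% WER corresponds to roughly five word errors for every 100 words in the reference transcript.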
Microsoft’s speech recognition capabilities are based on neural networks and other artificial intelligence (AI) technologies. The research team improved its acoustic modeling by adding a CNN-BLSTM (a convolutional neural network combined with a bidirectional long short-term memory network). The team also combined predictions from multiple acoustic models at both the frame/senone level and the word level.
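One common way to combine predictions from several models is to average the class probabilities each model assigns to the same slice of audio. The sketch below illustrates only that general idea; it is not the method described in Microsoft's report, and the function names, the two-class setup, and the numbers are all hypothetical.

```python
def average_posteriors(model_outputs):
    """Average per-frame class probabilities across models.

    model_outputs: one entry per model; each entry is a list of frames,
    and each frame is a list of class probabilities.
    """
    if not model_outputs:
        raise ValueError("need at least one model's output")
    n_frames = len(model_outputs[0])
    if any(len(frames) != n_frames for frames in model_outputs):
        raise ValueError("all models must score the same number of frames")

    combined = []
    for same_frame in zip(*model_outputs):  # one frame as scored by each model
        combined.append(
            [sum(probs) / len(same_frame) for probs in zip(*same_frame)]
        )
    return combined


def best_class_per_frame(posteriors):
    """Index of the highest-probability class in each frame."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in posteriors]


# Hypothetical outputs from two models over two frames and two classes.
model_a = [[0.9, 0.1], [0.4, 0.6]]
model_b = [[0.3, 0.7], [0.2, 0.8]]
combined = average_posteriors([model_a, model_b])
print(best_class_per_frame(combined))
```

In the first frame the two hypothetical models disagree, and the average sides with the more confident one; that kind of tie-breaking is the usual motivation for combining models.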
“Moreover, we strengthened the recognizer’s language model by using the entire history of a dialog session to predict what is likely to come next, effectively allowing the model to adapt to the topic and local context of a conversation,” the post said.
Additional technologies such as the Microsoft Cognitive Toolkit 2.1 (CNTK) and Azure GPUs helped the team explore model architectures and speed up the training and testing of its models.
Despite the new WER record, Microsoft noted in the post that there are still many challenges to address in speech recognition. For starters, systems need to be able to recognize words in noisy environments and from microphones that are far from the speaker. Systems also need to account for accents and styles of speech, and machines still need to be taught to understand the meaning of the words they’re transcribing.
The improvements to Microsoft’s speech recognition tech should go a long way toward improving Cortana, its digital assistant, as well as other tools. For example, the Universal Translator, which works to translate face-to-face conversations in real time, could also benefit.
More information on Microsoft’s speech recognition technology can be found in this technical report.
The 3 big takeaways for TechRepublic readers
- Microsoft recently achieved a 5.1% word error rate for its speech recognition, a new record for its neural network-based technology.
- Microsoft measured the result on Switchboard, a corpus of recorded telephone conversations long used to benchmark speech recognition systems, and researchers said the 5.1% figure is a better representation of human parity.
- Challenges persist in speech recognition, such as understanding accents or speaking styles, and having machines understand the words they’re transcribing.