The tech world has spent years trying to create speech recognition software that listens as well as humans. Now, IBM says it's achieved a 5.5 percent word error rate, down from its previous record of 6.9 percent -- an industry milestone that could eventually lead to improvements in voice assistants like Siri and Alexa.
Microsoft claimed to reach a 5.9 percent word error rate last October using neural language models resembling associative word clouds. At the time, the company believed 5.9 percent was equivalent to human parity. But IBM says it's not popping the champagne yet. "As part of our process in reaching today's milestone, we determined human parity is actually lower than what anyone has yet achieved — at 5.1 percent," George Saon, IBM principal research scientist, wrote in a blog post this week.
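For context on what these percentages mean: word error rate is the word-level edit distance between a system's transcript and a human reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that calculation, not IBM's or Microsoft's actual scoring pipeline (production benchmarks use standardized tools and text normalization on top of this core idea).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table: d[i][j] is the edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word ("whether" for "weather") in a five-word
# reference gives a 20 percent word error rate:
print(word_error_rate("what is the weather today",
                      "what is the whether today"))  # 0.2
```

A corpus-level figure like IBM's 5.5 percent is this same ratio aggregated over every utterance in the test set, so a difference of a few tenths of a percent represents many individual recognition errors.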
IBM reached the 5.5 percent milestone by combining Long Short-Term Memory (LSTM), a type of artificial neural network, and WaveNet language models with three strong acoustic models. The result was measured on the "SWITCHBOARD" corpus, a collection of recorded telephone conversations that has served as a benchmark for speech recognition software for decades. SWITCHBOARD alone is not a definitive measure of human parity, however, which makes such breakthroughs harder to pin down.
"The ability to recognize speech as well as humans do is a continuing challenge, since human speech, especially during spontaneous conversation, is extremely complex," said Julia Hirschberg, professor and Chair of the Department of Computer Science at Columbia University, in a statement to IBM. "It's also difficult to define human performance, since humans also vary in their ability to understand the speech of others."