When humans transcribe a spoken conversation in a single pass, they miss 5.9 percent of what they hear on average. Microsoft announced on Tuesday that, for the first time, it has managed to get a computer to perform that same transcription task just as well as a person. "We've reached human parity," Microsoft's chief speech scientist Xuedong Huang said in a statement.
To achieve the 5.9 percent error rate, which beats the 6.3 percent record set just last month, the Microsoft team leveraged neural language models that resemble associative word clouds: a word like "fast" resides much closer to "quick" than it does to "slow". This allowed the speech recognition engine to generalize between words and better recognize them in context. The team relied on Microsoft's homegrown deep learning Computational Network Toolkit to develop its record-setting algorithm.
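The "word cloud" idea can be made concrete with a toy sketch: words are represented as vectors, and nearby vectors mean related words. The three-dimensional vectors below are invented purely for illustration (real neural language models learn vectors with hundreds of dimensions, and this is not Microsoft's actual model), but they show how cosine similarity captures the idea that "fast" sits closer to "quick" than to "slow".

```python
import math

# Toy 3-dimensional word vectors, hand-picked for illustration only.
# A real model would learn these from large amounts of text.
vectors = {
    "fast":  [0.90, 0.80, 0.10],
    "quick": [0.85, 0.75, 0.15],
    "slow":  [-0.80, -0.70, 0.20],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means very similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "fast" vs. "quick" scores near 1.0; "fast" vs. "slow" is negative,
# so the engine can treat "fast" and "quick" as near-interchangeable in context.
print(cosine_similarity(vectors["fast"], vectors["quick"]))
print(cosine_similarity(vectors["fast"], vectors["slow"]))
```

Because similar words end up near each other, a recognizer that has seen "drive fast" many times can still make sense of the acoustically ambiguous phrase "drive quick", which is the generalization the article describes.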
The team's next goal is to improve the engine's robustness so that it can be used in real-life situations such as on crowded city streets or while driving. They also hope to eventually get it to work with multiple users simultaneously.