Facebook's latest AI can learn speech without human transcriptions

It could help bring automatic translations to more countries.

Speech recognition is an important cog in Big Tech's AI machinery. The tech powers the digital assistants on our phones, in cars and in the smart speakers in our homes. But, despite its ubiquity, speech recognition is still a work in progress. Today, Facebook is heralding a major breakthrough in the way it trains these systems to learn new languages. The company says it has developed a method of building speech recognition tools that don't require transcribed data.

According to Facebook, its new system can free the tech from its reliance on transcribed audio. Producing that data is time-consuming: humans must listen to and transcribe hours of audio, a monotonous process that has to be repeated for each language. Facebook's "unsupervised" system, by contrast, learns purely from speech audio and unpaired text, which gives it a better sense of what human communication sounds like.

Facebook's model essentially relies on a generative adversarial network (GAN), a feedback loop between a "generator" and a "discriminator." The generator turns representations of recorded speech into candidate transcriptions, which at first look like complete gibberish. The discriminator, which is also fed additional text written by humans, learns to tell the generator's output apart from real text, and its verdicts push the generator toward more realistic results. This process is repeated until the generator's output matches real text.

Facebook says its method has allowed it to create speech recognition systems without any annotated data sets. The company has already tested the model, known as Wav2vec-U (the U stands for Unsupervised), on Swahili, Kyrgyz (spoken in the Central Asian republic of Kyrgyzstan) and Crimean Tatar, all of which lack high-quality speech recognition tools due to a dearth of training data.

Facebook's tests showed that the system made 63 percent fewer errors than the next best unsupervised method. The company adds that the tool is as accurate as supervised systems from a few years ago. To accelerate development of the technology, Facebook has shared the code for Wav2vec-U on GitHub.
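For readers wondering what "63 percent fewer errors" means in practice, speech recognition output is typically scored by counting the substitutions, deletions and insertions needed to turn the system's transcript into a reference transcript, then dividing by the length of the reference. The Python sketch below illustrates that general idea, not Facebook's own evaluation code. The function names, the sample sentences and the two error rates at the end are hypothetical values chosen only to show how a 63 percent relative reduction is calculated.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token sequences."""
    # previous[j] = cost of turning the reference prefix seen so far
    # into the first j tokens of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis):
    """Errors per reference token (word or phoneme error rate)."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(baseline_rate, new_rate):
    """Fraction by which new_rate improves on baseline_rate."""
    return (baseline_rate - new_rate) / baseline_rate


# Hypothetical transcripts: one substituted word and one dropped word.
reference = "the cat sat on the mat".split()
hypothesis = "the cat sat in mat".split()
print(edit_distance(reference, hypothesis))  # 2
print(round(error_rate(reference, hypothesis), 3))

# Hypothetical error rates, picked so the reduction comes out near 63 percent.
print(round(relative_reduction(0.30, 0.111), 2))
```

Note that the figure is relative: a system can make 63 percent fewer errors than a rival and still get a meaningful share of words wrong if the rival's error rate was high to begin with.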

The company says the breakthrough could usher in speech recognition systems for more languages and dialects around the world, helping to democratize the tech. Naturally, Facebook stands to benefit from this proliferation: More than 76 percent of its 2.85 billion monthly users are located outside of North America and Europe. And automatic translation is crucial to the company's goal of connecting billions of people through their preferred language.