Roughly four in ten of the world’s 7,000 or so known languages exist without an accompanying written component. These unwritten languages pose a unique problem for modern machine learning translation systems, which typically need to convert speech to written words before translating to the new language and converting the text back to speech. It is a problem that Meta has reportedly addressed with its latest open-source language AI advancement.
The work is part of Meta’s Universal Speech Translator (UST) program, which is working to develop real-time speech-to-speech translation so that Metaverse denizens can more easily interact (read: sexually harass one another). For this project, Meta researchers looked at Hokkien, an unwritten language spoken throughout Asia’s diaspora and one of Taiwan’s official languages.
Machine learning translation systems typically require extensive labeled examples of the language, both written and spoken, to train on, which is precisely what unwritten languages like Hokkien don’t have. To get around that, “we used speech-to-unit translation (S2UT) to convert input speech to a sequence of acoustic units directly in the path previously pioneered by Meta,” CEO Mark Zuckerberg explained in a Wednesday blog post. “Then, we generated waveforms from the units. In addition, UnitY was adopted for a two-pass decoding mechanism where the first-pass decoder generates text in a related language (Mandarin), and the second-pass decoder creates units.”
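For readers who want to picture the data flow, here is a minimal sketch of a two-pass pipeline of the kind the quote describes. It is not Meta’s code: the function name `two_pass_translate` and its three callable parameters are hypothetical stand-ins for trained models, and the sketch assumes the second pass may look at both the input speech and the first-pass text.

```python
from typing import Callable, List, Sequence


def two_pass_translate(
    speech_features: Sequence[float],
    first_pass: Callable[[Sequence[float]], List[str]],
    second_pass: Callable[[Sequence[float], List[str]], List[int]],
    vocoder: Callable[[List[int]], List[float]],
) -> List[float]:
    """Illustrative two-pass speech-to-speech flow (all callables are assumed)."""
    # Pass 1: produce text tokens in a related written language (e.g. Mandarin).
    pivot_tokens = first_pass(speech_features)
    # Pass 2: produce discrete acoustic units for the unwritten target language.
    units = second_pass(speech_features, pivot_tokens)
    # Finally, turn the unit sequence into waveform samples.
    return vocoder(units)
```

The point of the structure is that text only ever appears in the intermediate language; the target side stays as units and audio.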
“We leveraged Mandarin as an intermediate language to build pseudo-labels, where we first translated English (or Hokkien) speech to Mandarin text, and we then translated to Hokkien (or English) and added it to training data,” he continued. Currently, the system allows someone who speaks Hokkien to converse with someone who speaks English, albeit stiltedly. The model can only translate one full sentence at a time, but Zuckerberg is confident that the technique can eventually be applied to more languages and will improve to the point of offering real-time translation.
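The pseudo-labeling step can be sketched roughly as follows. This is an illustration under assumptions, not the published pipeline: `PseudoLabeledPair`, `build_pseudo_labels`, and the two translator callables are hypothetical names, and clips are represented simply as file-name strings.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class PseudoLabeledPair:
    source_clip: str   # identifier of the source speech clip
    pivot_text: str    # intermediate Mandarin text
    target: str        # machine-generated translation used as a pseudo-label


def build_pseudo_labels(
    clips: Iterable[str],
    speech_to_pivot: Callable[[str], str],
    pivot_to_target: Callable[[str], str],
) -> List[PseudoLabeledPair]:
    """Pivot through an intermediate language to create extra training pairs."""
    pairs = []
    for clip in clips:
        pivot = speech_to_pivot(clip)      # speech -> Mandarin text
        if not pivot:
            continue                       # skip clips the first model could not handle
        target = pivot_to_target(pivot)    # Mandarin text -> target language
        pairs.append(PseudoLabeledPair(clip, pivot, target))
    return pairs
```

The resulting pairs are machine-generated rather than human-verified, which is why they are called pseudo-labels; they are added to the training data alongside whatever genuine examples exist.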
In addition to the models and training data that Meta is already open-sourcing from this project, the company is also releasing a first-of-its-kind speech-to-speech translation benchmarking system based on a Hokkien speech corpus called Taiwanese Across Taiwan, as well as “the speech matrix, a large corpus of speech-to-speech translations mined with Meta’s innovative data mining technique called LASER,” Zuckerberg announced. This system will empower researchers to create speech-to-speech translation (S2ST) systems of their own.