Today, we are one step closer to the immortal celebrity future we have long been promised (since April). Meta has unveiled Voicebox, its generative text-to-speech model that promises to do for the spoken word what ChatGPT and Dall-E, respectively, did for text and image generation.
Essentially, it’s a text-to-output generator just like GPT or Dall-E, except that instead of creating prose or pretty pictures, it spits out audio clips. Meta defines the system as “a non-autoregressive flow-matching model trained to infill speech, given audio context and text.” It’s been trained on more than 50,000 hours of unfiltered audio. Specifically, Meta used recorded speech and transcripts from a bunch of public domain audiobooks written in English, French, Spanish, German, Polish, and Portuguese.
That diverse data set allows the system to generate more conversational-sounding speech, regardless of the languages spoken by each party, according to the researchers. “Our results show that speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech.” What’s more, recognition models trained on the computer-generated speech showed just a 1 percent error rate degradation, compared to the 45 to 70 percent drop-off seen with synthetic speech from existing TTS models.
The system was first taught to predict speech segments based on the segments around them as well as the passage’s transcript. “Having learned to infill speech from context, the model can then apply this across speech generation tasks, including generating portions in the middle of an audio recording without having to recreate the entire input,” the Meta researchers explained.
Voicebox is also reportedly capable of actively editing audio clips, eliminating noise from the speech and even replacing misspoken words. “A person could identify which raw segment of the speech is corrupted by noise (like a dog barking), crop it, and instruct the model to regenerate that segment,” the researchers said, much like using image-editing software to clean up photographs.
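For the curious, the infill-and-regenerate idea described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: audio is reduced to a list of numbers standing in for feature frames, all function names are hypothetical, and `interpolate_gap` is a trivial placeholder that draws a straight line across the gap and ignores the transcript. It is not Voicebox and does not use flow matching. What it shows is the shape of the task: hide a segment, hand the model the surrounding context plus the transcript, and splice the result back without touching the rest of the recording.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Context = List[Optional[float]]  # None marks a hidden (masked) frame


def make_infill_example(frames: Sequence[float], start: int, end: int) -> Tuple[Context, List[float]]:
    """Hide frames[start:end] and return (masked context, hidden target)."""
    if not 0 <= start < end <= len(frames):
        raise ValueError("start/end must describe a non-empty span inside frames")
    context: Context = [None if start <= i < end else f for i, f in enumerate(frames)]
    return context, list(frames[start:end])


def regenerate_segment(
    frames: Sequence[float],
    start: int,
    end: int,
    transcript: str,
    infill_model: Callable[[Context, str], List[float]],
) -> List[float]:
    """Replace only frames[start:end] with the model's output; keep the rest as-is."""
    context, _ = make_infill_example(frames, start, end)
    new_segment = infill_model(context, transcript)
    if len(new_segment) != end - start:
        raise ValueError("model must return one frame per masked position")
    return list(frames[:start]) + list(new_segment) + list(frames[end:])


def interpolate_gap(context: Context, transcript: str) -> List[float]:
    """Placeholder 'model' (an assumption, not Voicebox): linear interpolation
    between the frames on either side of the gap. The transcript is ignored."""
    gap = [i for i, f in enumerate(context) if f is None]
    first, last = gap[0], gap[-1]
    left = context[first - 1] if first > 0 else None
    right = context[last + 1] if last + 1 < len(context) else None
    if left is None and right is None:
        raise ValueError("no surrounding context to infill from")
    if left is None:
        left = right
    if right is None:
        right = left
    n = len(gap)
    return [left + (right - left) * (k + 1) / (n + 1) for k in range(n)]


# Frames 2 and 3 are "corrupted by noise" (the dog bark); regenerate only those.
noisy = [0.0, 1.0, 9.0, 9.0, 4.0, 5.0]
cleaned = regenerate_segment(noisy, 2, 4, "hello world", interpolate_gap)
```

During training, the hidden target returned by `make_infill_example` is what a real model would be asked to predict; at editing time, the same masking is applied to whatever segment the user crops out.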
Text-to-speech generators have been around for a minute: they’re how your parents’ TomToms were able to give dodgy driving directions in Morgan Freeman’s voice. Modern iterations like Speechify or ElevenLabs’ Prime Voice AI are far more capable, but they still largely require mountains of source material in order to properly mimic their subject, and then another mountain of different data for every. single. other. subject you want them trained on.
Voicebox doesn’t, thanks to a novel zero-shot text-to-speech training method Meta calls Flow Matching. The benchmark results aren’t even close: Meta’s AI reportedly outperformed the current state of the art in both intelligibility (a 1.9 percent word error rate vs 5.9 percent) and “audio similarity” (a composite score of 0.681 to the state of the art’s 0.580), all while operating as much as 20 times faster than today’s best TTS systems.
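To make the intelligibility figure concrete: word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn a transcript of the audio into the reference text, divided by the number of words in the reference. The Python below is a minimal sketch of that standard definition, not Meta's evaluation code. The function name and the lowercase, whitespace-split normalization are assumptions made for illustration, and in practice the hypothesis transcript would typically come from running a speech recognizer over the generated audio.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between hypothesis and reference,
    divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of six reference words.
score = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

By this measure, a 1.9 percent word error rate works out to roughly two mistaken words per hundred in the reference text.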
But don’t get your celebrity navigators lined up just yet. Neither the Voicebox app nor its source code is being released to the public at this time, Meta confirmed on Friday, citing “the potential risks of misuse” despite the “many exciting use cases for generative speech models.” Instead, the company released a series of audio examples as well as the program’s initial research paper. In the future, the research team hopes the technology will find its way into prosthetics for patients with vocal cord damage, in-game NPCs and digital assistants.