Baidu's Deep Voice can quickly synthesize realistic human speech

The text-to-speech system can also change the emotions the words convey.

By Mariella Moon March 9, 2017 2:45 am EST

Baidu has been quietly working on other projects besides self-driving cars at its AI center in Silicon Valley, and now it has revealed one of them to MIT's Technology Review. Apparently, the Chinese tech titan has created a text-to-speech system called Deep Voice that's faster and more efficient than Google's WaveNet. The company says Deep Voice can be trained to speak in just a few hours with little to no human interaction. And since Baidu can control how it speaks to convey different emotions, it can (quickly) synthesize speech that sounds pretty natural and realistic.

Google's WaveNet can also synthesize realistic human speech, but it's quite computationally demanding and hard to use for real-world applications at this point. Baidu says it solved WaveNet's problem by using deep-learning techniques to convert text to phonemes, the smallest units of speech. It then turns those phonemes into sounds using its speech synthesis network. The system converts the word "hello," for instance, into "(silence HH), (HH, EH), (EH, L), (L, OW), (OW, silence)" before the speech network pronounces it.
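To make the "hello" example concrete, here is a minimal Python sketch of how a phoneme sequence could be turned into the adjacent pairs quoted above. It is an illustration only, not Baidu's code: the function name `phoneme_pairs` and the `boundary` parameter are hypothetical, and the phoneme list for "hello" is simply read off the pairs in the example.

```python
def phoneme_pairs(phonemes, boundary="silence"):
    """Pair each phoneme with the one that follows it.

    A boundary marker is added at both ends so the first and last
    phonemes are paired with silence, as in the article's example.
    """
    padded = [boundary, *phonemes, boundary]
    return list(zip(padded, padded[1:]))


# Phonemes for "hello", taken from the pairs quoted in the article.
for left, right in phoneme_pairs(["HH", "EH", "L", "OW"]):
    print(f"({left}, {right})")
```

Each pair describes a transition from one sound to the next, which is the form the example says is handed to the speech network.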

Both steps rely on deep learning and don't need human input. However, the system doesn't control which phonemes or syllables are stressed and how long they're pronounced. That's where Baidu steps in: it adjusts the stress and duration to change the emotion the speech conveys.

While the company says Deep Voice has solved WaveNet's problem, it still requires a ton of computing power. A computer has to generate words to say in 20 microseconds to mimic human-like interaction. Baidu's researchers explain:

"To perform inference at real-time, we must take great care to never recompute any results, store the entire model in the processor cache (as opposed to main memory), and optimally utilize the available computational units."
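For a sense of scale, the short Python sketch below works through the arithmetic behind a time budget like the 20 microseconds mentioned above. It assumes the figure is a budget per audio sample, which the article does not state, and the 16 kHz and 48 kHz sampling rates are common audio rates chosen for illustration, not numbers from Baidu. The function names are hypothetical.

```python
MICROSECONDS_PER_SECOND = 1_000_000


def per_sample_budget_us(sample_rate_hz):
    """Time available to produce one audio sample in real time, in microseconds."""
    return MICROSECONDS_PER_SECOND / sample_rate_hz


def max_sample_rate_hz(budget_us):
    """Highest sampling rate that a given per-sample time budget can sustain."""
    return MICROSECONDS_PER_SECOND / budget_us


# Illustrative sampling rates (assumptions, not figures from the article).
for rate in (16_000, 48_000):
    print(f"{rate} Hz -> {per_sample_budget_us(rate):.1f} microseconds per sample")

# Under the per-sample reading, a 20-microsecond budget supports this rate.
print(f"20 microseconds per sample -> {max_sample_rate_hz(20):.0f} Hz")
```

Under that reading, a 20-microsecond budget is roughly what 48 kHz audio would demand, which helps explain why the researchers worry about cache use and recomputation.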

Still, the researchers believe real-time speech synthesis is possible. They've already produced rapidly generated samples and collected feedback through Amazon's Mechanical Turk, asking a large number of people to rate the samples' quality. The results indicate that the samples are of excellent quality.
