Google's WaveNet can also synthesize realistic human speech, but it's quite computationally demanding and hard to use for real-world applications at this point. Baidu says it solved WaveNet's problem by using deep-learning techniques to convert text into phonemes, the smallest units of speech. It then turns those phonemes into sounds using its speech synthesis network. The system converts the word "hello," for instance, into "(silence, HH), (HH, EH), (EH, L), (L, OW), (OW, silence)" before the speech network pronounces it.
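The phoneme-pair decomposition above can be sketched in a few lines of Python. Note this is only an illustration of the output format: the lookup table and pairing logic here are assumptions for the example, whereas Baidu's system learns the text-to-phoneme mapping with deep learning rather than using a fixed dictionary.

```python
# Hypothetical lookup table standing in for a learned text-to-phoneme model.
PHONEME_DICT = {
    "hello": ["HH", "EH", "L", "OW"],
}

def to_phoneme_pairs(word):
    """Wrap a word's phonemes in silence markers and emit adjacent pairs."""
    phonemes = ["silence"] + PHONEME_DICT[word] + ["silence"]
    return list(zip(phonemes, phonemes[1:]))

print(to_phoneme_pairs("hello"))
# [('silence', 'HH'), ('HH', 'EH'), ('EH', 'L'), ('L', 'OW'), ('OW', 'silence')]
```

The silence markers at both ends give the speech network explicit word boundaries, which is why "hello" expands to five pairs rather than three.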
Both steps rely on deep learning and require no human input. However, the system doesn't decide on its own which phonemes or syllables are stressed or how long they're held. That's where Baidu steps in: it adjusts those parameters to change the emotion it wants to convey.
While the company says Deep Voice has solved WaveNet's problem, it still requires a ton of computing power: the system has to generate the words it's about to say within a mere 20 microseconds to mimic human-like interaction. Baidu's researchers explain:
"To perform inference at real-time, we must take great care to never recompute any results, store the entire model in the processor cache (as opposed to main memory), and optimally utilize the available computational units."
Still, the researchers believe real-time speech synthesis is possible. They've already generated samples quickly and collected feedback through Amazon's Mechanical Turk, asking a large number of people on the service to rate the samples' quality; the results indicate they sound excellent.