Currently, developers use one of two methods to create speech programs. One, known as concatenative synthesis, stitches together a large collection of words and speech fragments spoken by a single person, which makes sounds and intonations hard to manipulate. The other, known as parametric synthesis, forms words electronically, based on how they're supposed to sound. That makes things easier to tweak, but the results sound much more robotic.
To build a speech program that actually sounds human, the team fed the neural network raw audio waveforms recorded from real human speakers. Waveforms are the visual representations of the shapes sounds take -- those squiggly waves that squirm and dance to the beat in some media player displays. Having learned from them, WaveNet speaks by generating the sound wave itself, one audio sample at a time. (By the way, the AI also has a future in music. The team fed it classical piano pieces, and it came up with some interesting samples on its own.)
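For the curious, here is a rough idea of what building a sound wave piece by piece can look like in code. This is only a toy sketch, not DeepMind's method: WaveNet relies on a trained neural network to predict each new audio sample from the ones before it, while the hypothetical `next_sample` function below uses a fixed formula that can only produce a pure tone. The names and the 16,000-samples-per-second rate are assumptions made for illustration.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed value for this sketch)


def next_sample(prev1, prev2, freq_hz):
    """Toy stand-in for a learned model: predict the next sample from the last two."""
    omega = 2 * math.pi * freq_hz / SAMPLE_RATE
    # Identity: sin((n+1)w) = 2*cos(w)*sin(n*w) - sin((n-1)*w)
    return 2 * math.cos(omega) * prev1 - prev2


def generate_waveform(freq_hz, n_samples):
    """Build a waveform one sample at a time, feeding each output back in."""
    omega = 2 * math.pi * freq_hz / SAMPLE_RATE
    samples = [0.0, math.sin(omega)]  # two seed samples to start the loop
    while len(samples) < n_samples:
        samples.append(next_sample(samples[-1], samples[-2], freq_hz))
    return samples[:n_samples]


# 160 samples is one hundredth of a second of a 440 Hz tone at this rate.
waveform = generate_waveform(freq_hz=440.0, n_samples=160)
```

The point of the sketch is the loop: every new sample is computed from samples that were just generated, which is the general idea behind producing audio directly as a waveform rather than gluing together recorded fragments.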
For instance, if used as a text-to-speech program, it transforms the text you type into a series of phonemes and syllables, which it then voices out. Subjects who took part in blind tests, conducted in English and Mandarin Chinese, thought WaveNet's results sounded more human than those of the other methods. In the AI's announcement post, DeepMind said it can "reduce the gap between the state of the art and human-level performance by over 50 percent" based on those experiments. You don't have to take the team's word for it, though: While we're still far from using a WaveNet-powered app, you can listen to some samples on DeepMind's website.