The voices on Amazon's Alexa, Google Assistant and other AI assistants are far ahead of old-school GPS devices, but they still lack the rhythms, intonation and other qualities that make speech sound, well, human. NVIDIA has unveiled new research and tools that can capture those natural speech qualities by letting you train the AI system with your own voice, the company announced at the Interspeech 2021 conference.
To improve its AI voice synthesis, NVIDIA’s text-to-speech research team developed a model called RAD-TTS, a winning entry at an NAB broadcast convention competition to develop the most realistic avatar. The system allows an individual to train a text-to-speech model with their own voice, including the pacing, tonality, timbre and more.
Another RAD-TTS feature is voice conversion, which lets a user deliver one speaker's words using another person's voice. That interface gives fine, frame-level control over a synthesized voice’s pitch, duration and energy.
Using this technology, NVIDIA's researchers created more conversational-sounding voice narration for its own I Am AI video series using synthesized rather than human voices. The aim was to get the narration to match the tone and style of the videos, something that hasn't been done well in many AI narrated videos to date. The results are still a bit robotic, but better than any AI narration I've ever heard.
"With this interface, our video producer could record himself reading the video script, and then use the AI model to convert his speech into the female narrator’s voice. Using this baseline narration, the producer could then direct the AI like a voice actor — tweaking the synthesized speech to emphasize specific words, and modifying the pacing of the narration to better express the video’s tone," NVIDIA wrote.
NVIDIA is distributing some of this research — optimized to run efficiently on NVIDIA GPUs, of course — to anyone who wants to try it via open source through the NVIDIA NeMo Python toolkit for GPU-accelerated conversational AI, available on the company's NGC hub of containers and other software.
"Several of the models are trained with tens of thousands of hours of audio data on NVIDIA DGX systems. Developers can fine tune any model for their use cases, speeding up training using mixed-precision computing on NVIDIA Tensor Core GPUs," the company wrote.