Translating audio into realistic-looking video of a person speaking is quite a challenge. Often, the resulting video just looks off, a problem known as the uncanny valley, in which human replicas that appear almost but not quite real come off as eerie or creepy. However, researchers at the University of Washington have made some serious headway in overcoming this issue, and they did it using audio and video of Barack Obama.
The researchers used 14 hours of Obama's weekly address videos to train a neural network. Once trained, their system was able to take an audio clip of the former president, generate mouth shapes that synced with the audio and then synthesize a realistic-looking mouth that matched Obama's. That synced mouth was then superimposed and blended onto a video of Obama that came from a different source than the audio. To make the result look more natural, the system corrected for head placement and movement, timing and details like how the jaw looked. The whole process is automated save for one manual step, which requires a person to select two frames in the video where the subject's upper and lower teeth are front-facing and highly visible. The system then uses those images to make the teeth in the resulting video look more realistic.
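To give a rough feel for the "superimposed and blended" step, here is a minimal sketch of pasting a patch into a frame with soft edges so that no hard seam shows. It is not the researchers' method, which isn't detailed here: the linear feathering, the toy grayscale grids and the names `feather_mask` and `blend_patch` are all assumptions made purely for illustration.

```python
# Illustrative only: a feathered alpha blend on tiny grayscale grids.
# Assumption: linear feathering; the actual system's compositing is not described here.

def feather_mask(height, width, border):
    """Build a mask that is 1.0 in the interior and fades linearly toward the edges."""
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            # Distance to the nearest patch edge, counting edge pixels as 1.
            edge_dist = min(x, y, width - 1 - x, height - 1 - y) + 1
            row.append(min(1.0, edge_dist / (border + 1)))
        mask.append(row)
    return mask


def blend_patch(frame, patch, top, left, border=2):
    """Return a copy of `frame` with `patch` alpha-blended in at (top, left)."""
    height, width = len(patch), len(patch[0])
    mask = feather_mask(height, width, border)
    out = [row[:] for row in frame]  # leave the input frame untouched
    for y in range(height):
        for x in range(width):
            alpha = mask[y][x]
            background = frame[top + y][left + x]
            out[top + y][left + x] = alpha * patch[y][x] + (1.0 - alpha) * background
    return out


# Toy example: a dark 8x8 "face" frame and a bright 5x5 "mouth" patch.
frame = [[0.0] * 8 for _ in range(8)]
patch = [[1.0] * 5 for _ in range(5)]
result = blend_patch(frame, patch, top=2, left=1, border=2)
```

In this toy case the center of the patch fully replaces the frame, while pixels near the patch's edge stay mostly background, which is the basic idea behind hiding the boundary between a synthesized region and the original footage.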
The program isn't perfect yet, but in the video below you can see how much its output improves as the training data grows from three minutes to one hour, seven hours and finally 14 hours. Limitations the team has pointed out include occasional mistakes in mouth and facial alignment (sometimes it gave Obama two chins), an inability to match emotion, and trouble with sounds that require a particular placement of the tongue, like "th," which the program doesn't currently handle.
Overall, though, this artificial lip-syncing program produces much more realistic results than earlier attempts have. The work will be published in ACM Transactions on Graphics, and you can see the researchers' process in the video below.