Can machines come up with plausible sound effects for video? Recently, MIT's artificial intelligence (CSAIL) lab created a sort of Turing test that fooled folks into thinking that machine-created letters were written by humans. Using the same principle, the researchers created algorithms that act just like Hollywood "Foley artists," adding sound to silent video. In a psychological test, the algorithms fooled subjects into believing that the computer-generated banging, scratching and rustling was recorded live.
Researchers used a drumstick (chosen for consistency and because it doesn't obscure the video) to hit various objects, including railings, bushes and metal gratings. The algorithm was fed 978 videos with 46,620 actions, helping it recognize patterns in the audiovisual signal. "Training a model to synthesize plausible impact sounds from silent videos, [is] a task that requires implicit knowledge of material properties and physical interactions," according to the paper.
The AI uses deep learning to figure out how sounds relate to video, meaning it finds the patterns on its own without intervention from scientists. Then, when it's shown a new, silent video, "the algorithm looks at the properties of each frame of that video, and matches them to the most similar sounds in the database," says lead author Andrew Owens. As shown in the video (above), it can simulate the differences between someone tapping rocks, leaves or a couch cushion.
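To make the matching step Owens describes more concrete, here is a minimal sketch of a nearest-neighbor lookup. It is not the team's code: the three-number feature vectors, the sound labels and the `nearest_sound` helper are all made up for illustration, and cosine similarity is just one assumed way to measure "most similar." In the real system, the features would come from the deep learning model rather than being typed in by hand.

```python
import numpy as np

def nearest_sound(query_features, db_features, db_sound_ids):
    """Return the id of the database sound whose features best match the query.

    Similarity here is cosine similarity, an assumption made for this sketch.
    """
    # Normalize so the dot product equals the cosine of the angle between vectors
    q = query_features / np.linalg.norm(query_features)
    db = db_features / np.linalg.norm(db_features, axis=1, keepdims=True)
    similarities = db @ q
    return db_sound_ids[int(np.argmax(similarities))]

# Toy database: hypothetical feature vectors, one row per stored sound
db_features = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
db_sound_ids = ["rock_tap", "leaf_rustle", "cushion_thud"]

# Hypothetical features computed from a frame of a new, silent video
query = np.array([0.1, 0.9, 0.2])
print(nearest_sound(query, db_features, db_sound_ids))  # prints "leaf_rustle"
```

The query vector lies closest to the second database row, so the lookup returns that row's sound. A full pipeline would repeat the lookup across the video and stitch the retrieved clips together.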
In an online study, subjects were more than twice as likely to pick the AI version over the live one as the "real" sound, particularly for non-solid materials like leaves and dirt. In addition, the algorithm can reveal details about an object from its sound: 67 percent of the time, it could tell whether a material was hard or soft.
The AI isn't perfect -- it can get faked out by a near-hit, and can't pick up sounds not related to a visual action, like a buzzing computer. However, the researchers believe the work could eventually help robots figure out whether a surface is cement-solid or has some give, like grass. Knowing that, the robots could predict how to step and avoid a (hilarious) accident. If the team can enlarge its database of sounds, the machines could eventually do a Foley artist's job -- with no need for coconuts.