A picture may be worth a thousand words, but sound is just as important as sight to how we experience the world -- that's why a team at Disney Research is working on a computer vision system that can recognize not only what an image shows, but how it sounds, too. In an initial study presented at the European Conference on Computer Vision, the group's system managed to pair appropriate audio with images of doors closing, glasses clinking and vehicles driving down the road.
Audio association might be easy for humans, but teaching a computer to do it is actually pretty challenging. Disney researchers trained the AI to recognize the sound of images by feeding it a collection of videos, each showing an object making a specific sound, but background noise, narration or sounds made by other objects could easily confuse the system. When the system was fed samples with most of the uncorrelated sounds filtered out, however, it did a pretty good job of suggesting the right sound for each image. Still, the system isn't perfect: the team reports that it occasionally had trouble telling an image of a car from one of a tram, causing it to sometimes suggest the wrong sound for a particular vehicle.
Audio image recognition probably isn't useful to most of the population, but the team hopes it can be used to create an automatic Foley processing system for video production -- making it easier for editors to add sound effects during the production process. The technology may also be able to help the visually impaired by creating an image personification system, enabling them to 'hear' objects on a computer screen. Still, Disney Research has a lot of work to do before it gets close to making either of those futures a reality.