The team started with an existing approach in which two neural networks process images and audio spectrograms, learning to match an audio caption with images containing a given object. They modified the image-handling network so that it splits each image into a grid of cells, while the audio network splits the spectrogram into short (one- to two-second) snippets. Given a correctly paired image and caption, the training process scores the AI system on how well the audio segments match objects in the grid cells. Effectively, it's like teaching children what they're looking at by pointing at objects and naming them.
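The grid-cell and snippet matching described above can be sketched as a similarity map between the two networks' outputs. The shapes, the dot-product similarity, and the max-then-average pooling below are illustrative assumptions, not the researchers' exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Image network output: a 14x14 grid of cells, each a 512-dim embedding
# (hypothetical dimensions).
image_cells = rng.standard_normal((14, 14, 512))

# Audio network output: one embedding per short (1-2 second) spectrogram
# snippet; here, 10 snippets.
audio_segments = rng.standard_normal((10, 512))

# "Matchmap": similarity of every audio segment to every image cell.
matchmap = np.einsum('hwc,tc->thw', image_cells, audio_segments)
print(matchmap.shape)  # (10, 14, 14): one grid of scores per snippet

# One plausible pair-level score: the best-matching cell for each
# audio segment, averaged over segments. Training would push this
# score higher for correct image-caption pairs than for mismatched ones.
score = matchmap.max(axis=(1, 2)).mean()
```

Because the score decomposes into per-segment, per-cell similarities, the trained system can localize which cell a spoken word refers to without any segment-level labels.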
There are a number of potential uses, but the researchers are most enamored with the potential for translation. Rather than asking a bilingual annotator to make the connections, you could have speakers of different languages describe the same image, and the system could infer that one description is a translation of the other. That could make speech recognition viable for many more languages than the roughly 100 that have enough transcriptions for the conventional method.