Microsoft has already demonstrated how its computer vision technology can recognize objects even better than humans can. Now it's onto the next frontier: interpreting the elements of a photo and automatically generating captions. That may not sound exciting, but being able to accurately explain an image could be essential for artificial intelligence. It's also yet another sign of the power of neural networks, computer models that try to mimic the way the human brain works. Microsoft's technology starts by identifying everything in an image, then generates sentences describing how those objects interact. For the image above, for example, it came up with "A purple camera with a woman"; "A woman holding a camera in a crowd"; and "A woman holding a cat." Two of those sentences don't make much sense -- it somehow identified a bundle of hair as a cat -- so it eventually settled on "A woman holding a camera in a crowd" as the best way to describe the scene.
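The selection step described above -- generate several candidate captions, then keep the one the system is most confident in -- can be sketched as follows. This is a minimal illustration, not Microsoft's actual implementation; the candidate captions come from the article, but the scores and the `best_caption` helper are invented for the example.

```python
def best_caption(candidates):
    """Return the candidate caption with the highest confidence score.

    `candidates` is a list of (caption, score) pairs, e.g. as produced
    by a captioning model; this helper just picks the top-ranked entry.
    """
    if not candidates:
        raise ValueError("no candidate captions")
    return max(candidates, key=lambda pair: pair[1])[0]

# Candidates from the article (confidence scores are invented):
candidates = [
    ("A purple camera with a woman", 0.41),
    ("A woman holding a camera in a crowd", 0.87),
    ("A woman holding a cat", 0.33),
]
print(best_caption(candidates))  # -> A woman holding a camera in a crowd
```

In the real system the scores would reflect how well each sentence fits the detected objects, which is how the nonsensical "cat" caption gets ranked out.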
"We want to connect vision to language because we want to have artificial intelligence tools," Margaret Mitchell, a researcher at Microsoft Research's natural language processing group, said in a blog post today. The technology could lead to a future version of Microsoft's Cortana virtual assistant that can view the world around you and offer helpful tips on the fly, not unlike the Cortana character in Halo.