Describing an image accurately, and not just like a clueless robot, has long been the goal of AI. In 2016, Google said its artificial intelligence could caption images almost as well as humans, with 94 percent accuracy. Now Microsoft says it’s gone even further: Its researchers have built an AI system that’s even more accurate than humans — so much so that it now sits at the top of the leaderboard for the nocaps image captioning benchmark. Microsoft claims it’s twice as good as the image captioning model it’s been using since 2015.
And while that’s a notable milestone on its own, Microsoft isn’t just keeping this tech to itself. It’s now offering the new captioning model as part of Azure's Cognitive Services, so any developer can bring it into their apps. It’s also available today in Seeing AI, Microsoft's app for blind and visually impaired users that can narrate the world around them. And later this year, the captioning model will also improve your presentations in PowerPoint for the web, Windows and Mac. It’ll also pop up in Word and Outlook on desktop platforms.
"[Image captioning] is one of the hardest problems in AI,” said Eric Boyd, CVP of Azure AI, in an interview with Engadget. “It represents not only understanding the objects in a scene, but how they’re interacting, and how to describe them.” Refining captioning techniques can help every user: It makes it easier to find the images you’re looking for in search engines. And for visually impaired users, it can make navigating the web and software dramatically better.
It’s not unusual to see companies tout their AI research innovations, but it’s far rarer for those discoveries to be quickly deployed to shipping products. Xuedong Huang, CTO of Azure AI cognitive services, pushed to integrate it into Azure quickly because of the potential benefits for users. His team trained the model with images tagged with specific keywords, which helped give it a visual language most AI frameworks don’t have. Typically, these sorts of models are trained with images and full captions, which makes it more difficult for the models to learn how specific objects interact.
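The difference between the two training setups can be sketched in a few lines. This is a hypothetical illustration, not Microsoft's actual pipeline: the data, file names, and `build_visual_vocabulary` helper are all made up to show how keyword-tagged images yield a discrete visual vocabulary, whereas full-caption data bundles objects and relationships into one sentence.

```python
# Hypothetical illustration only -- not Microsoft's actual training code.
# Tag-based pre-training pairs each image with individual keywords,
# while conventional captioning data pairs it with a full sentence.

tagged_data = [
    ("img_001.jpg", ["dog", "frisbee", "park"]),
    ("img_002.jpg", ["woman", "laptop", "coffee"]),
]

caption_data = [
    ("img_001.jpg", "A dog catches a frisbee in the park."),
    ("img_002.jpg", "A woman works on her laptop over coffee."),
]

def build_visual_vocabulary(tagged):
    """Collect the set of keywords seen across all tagged images.

    Each keyword maps cleanly to a visual concept, which is what makes
    this kind of data useful for teaching a model object-level grounding
    before it ever has to produce a full sentence.
    """
    vocab = set()
    for _image, tags in tagged:
        vocab.update(tags)
    return vocab

if __name__ == "__main__":
    vocab = build_visual_vocabulary(tagged_data)
    print(sorted(vocab))
```

The point of the contrast: with the tagged data, each concept arrives pre-isolated; with the caption data, a model must untangle which words refer to which objects and how they relate, which is the harder learning problem the article describes.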
“This visual vocabulary pre-training essentially is the education needed to train the system; we are trying to educate this motor memory,” Huang said in a blog post. That’s what gives this new model a leg up in the nocaps benchmark, which is focused on determining how well AI can caption images it has never seen before.
But while beating a benchmark is significant, the real test for Microsoft’s new model will be how it functions in the real world. According to Boyd, Seeing AI developer Saqib Shaikh, who also pushes for greater accessibility at Microsoft as a blind person himself, describes it as a dramatic improvement over the previous offering. And now that Microsoft has set a new milestone, it’ll be interesting to see how competing models from Google and other researchers respond.