There are essentially three sets of code that Facebook is putting on GitHub today. They're called DeepMask, SharpMask and MultiPathNet: DeepMask figures out if there's an object in the image, SharpMask delineates those objects and MultiPathNet attempts to identify what they are. Combined, they make up a visual-recognition system that Facebook says is able to understand images at the pixel level, a surprisingly complex task for machines.
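The three-stage hand-off can be pictured as a simple pipeline. The sketch below is purely illustrative: the function bodies are invented stubs, and only the stage names and their roles come from Facebook's description of the release.

```python
# Illustrative sketch of the DeepMask -> SharpMask -> MultiPathNet pipeline.
# All model internals here are stand-ins; the real systems are deep
# convolutional networks released by FAIR on GitHub.

def deepmask(image):
    """Stage 1: propose candidate object masks (stub: one fake mask)."""
    return [{"mask": [[1, 1], [1, 0]], "score": 0.9}]

def sharpmask(image, proposals):
    """Stage 2: refine each coarse mask to follow object boundaries (stub)."""
    return [dict(p, refined=True) for p in proposals]

def multipathnet(image, masks):
    """Stage 3: label each segmented object (stub: always 'dog')."""
    return [dict(m, label="dog") for m in masks]

def segment_and_label(image):
    proposals = deepmask(image)          # is there an object here?
    masks = sharpmask(image, proposals)  # delineate it precisely
    return multipathnet(image, masks)    # say what it is

results = segment_and_label(image=None)
print(results[0]["label"])  # prints the stubbed label: dog
```

The design point is the separation of concerns: proposing objects, sharpening their outlines, and naming them are handled by three distinct networks chained together.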
"There's a view that a lot of computer vision has progressed and a lot of things are solved," says Piotr Dollar, a research scientist at Facebook. "The reality is we're just starting to scratch the surface." For example, he says, computer vision can currently tell you if an image has a dog or a person. But a photo is more than just the objects that are in it. Is the person tall or short? Is it a man or a woman? Is the person happy or sad? What is the person doing with the dog? These are questions that machines have a lot of difficulty answering.
In the blog post, he describes a photo of a man next to an old-fashioned camera. He's standing in a grassy field with buildings in the background. But a machine sees none of this; to a machine, it's just a bunch of pixels. It's up to computer-vision systems like the ones developed at FAIR to segment each object out. Considering that real-world objects come in countless shapes and sizes, and that photos vary widely in background and lighting, it's easy to see why visual recognition is so complex.
The answer, Dollar writes, lies in deep convolutional neural networks that are "trained rather than designed." The networks essentially learn from millions of annotated examples over time to identify the objects. "The first stage would be to look at different parts of the image that could be interesting," he says. "The second step is to then say, 'OK, that's a sheep,' or 'that's a dog.'
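The idea of "trained rather than designed" can be shown at toy scale. The example below fits a two-feature perceptron from a handful of labeled examples instead of hand-writing rules; the data and labels are invented for illustration, and a real deep convolutional network does the same kind of learning with millions of annotated images and far more parameters.

```python
# Toy illustration of "trained rather than designed": parameters are
# fit from annotated examples rather than hand-coded. The features and
# labels below are made up purely for demonstration.

examples = [  # (features, label): hypothetical annotated data
    ((1.0, 0.2), 1), ((0.9, 0.1), 1),   # label 1: "dog"
    ((0.1, 0.9), 0), ((0.2, 1.0), 0),   # label 0: "not dog"
]

w = [0.0, 0.0]  # weights, learned from the data
b = 0.0         # bias, learned from the data
for _ in range(20):                      # repeated passes over the data
    for (x1, x2), y in examples:
        pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        err = y - pred                   # update only on mistakes
        w[0] += err * x1
        w[1] += err * x2
        b += err

preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
         for (x1, x2), _ in examples]
print(preds)  # prints [1, 1, 0, 0], matching the labels
```

Nothing in the final weights was written by hand; the classifier's behavior emerges entirely from the labeled examples, which is the property Dollar is pointing at, scaled down to four data points.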
"Our whole goal is to get at all the pixels, to get at all the information in the image," he says. "It's still sort of a first step in the grand scheme of computer vision and having a visual recognition system that's on par with the human visual system. We're starting to move in that direction."
By open-sourcing the project on GitHub, he hopes that the community will start working together to solve any problems with the algorithms. It's a step that Facebook has taken before with other AI projects, like fastText (AI language processing) and Big Sur (the hardware that runs its AI programs). "As a company, we care more about using AI than owning AI," says Larry Zitnick, a research manager at FAIR. "The faster AI moves forward, the better it is for Facebook."
One of the reasons Facebook is so excited about computer vision is that visual content has exploded on the site in the past few years. Photos and videos practically rule News Feed. In a statement, Facebook said that computer vision could be used for anything from searching for images with just a few keywords (think Google Photos) to helping those with vision loss understand what's in a photo.
There are also some interesting augmented reality possibilities. Computer vision could estimate how many calories are in the sandwich shown in a photo, for example, or check whether a runner has proper form. Now imagine if this kind of information were accessible on Facebook. It could bring a whole new level of interaction to the photos and videos you already have. Ads could let you arrange furniture in a room or try on virtual clothes. "It's critical to understand not just what's in the image, but where it is," says Zitnick about what it would take for augmented reality applications to take off.
Dollar brought up Pokémon Go as an example. Right now the cartoon monsters are mostly just floating in the middle of the capture scene. "Imagine if the creature can interact with the environment," he says. "If it could hide behind objects, or jump on top of them."
The next step would be to bring this computer-vision research into the realm of video, which is especially challenging because the objects are always moving. FAIR says that some progress has already been made: it can identify certain items in a video, like cats or food. If this identification could happen in real time, it could become that much easier to surface the Live videos most relevant to your interests.
Still, with so many possibilities, Zitnick says FAIR's focus right now is on the underlying tech. "The fundamental goal here is to create the technologies that enable these different potential applications," he says. Making the code open-source is a start.