MIT machine vision system figures out what it's looking at by itself

Just give it 45 minutes or so.

Robotic vision is already pretty good, assuming it's being used within the narrow bounds of the application for which it was designed. That's fine for machines that perform a specific movement over and over, such as picking an object off an assembly line and placing it into a bin. However, for robots to become useful enough not just to pack boxes in warehouses but to actually help out around our own homes, they'll have to stop being so myopic. And that's where MIT's "DON" system comes in.

DON, or "Dense Object Nets," is a novel form of machine vision developed at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). It generates a "visual roadmap" -- basically, a collection of visual data points arranged as coordinates. The system also stitches each of these individual coordinate sets together into a larger coordinate set, the same way your phone can mesh numerous photos into a single panoramic image. This enables the system to better and more intuitively understand an object's shape and how it works in the context of the environment around it.
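To get a feel for what those per-pixel coordinates buy you: once every pixel is mapped to a small descriptor vector, finding the "same" point on an object in a new view reduces to a nearest-neighbor search in descriptor space. The sketch below is purely illustrative, not MIT's code; the trained network that would produce the descriptors is stubbed out with random arrays.

```python
import numpy as np

def nearest_descriptor_match(query_desc, target_desc_image):
    """Return the (row, col) of the pixel in target_desc_image whose
    descriptor is closest (in Euclidean distance) to query_desc.

    target_desc_image: (H, W, D) array of per-pixel descriptors.
    query_desc: (D,) descriptor of a point picked in a reference view.
    """
    h, w, d = target_desc_image.shape
    flat = target_desc_image.reshape(-1, d)          # (H*W, D)
    dists = np.linalg.norm(flat - query_desc, axis=1)
    idx = int(np.argmin(dists))
    return divmod(idx, w)                            # flat index -> (row, col)

# Stand-ins for a descriptor network's output on two views of an object:
rng = np.random.default_rng(0)
view_a = rng.normal(size=(48, 64, 16))
view_b = rng.normal(size=(48, 64, 16))

# Take the descriptor at one pixel in view A (say, a mug handle)...
query = view_a[10, 20]

# ...and locate the best-matching pixel in view B.
match = nearest_descriptor_match(query, view_b)
```

With descriptors from a real trained network, nearby matches in descriptor space correspond to the same physical point on the object, which is what lets a robot find, say, a handle regardless of the mug's orientation.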

"At its coarsest, highest level, what you'd get from your computer vision system is object detection," PhD student Lucas Manuelli, author of the paper, told Engadget. "The next finest level would be to do pixel labeling. So that would say, okay, all these pixels are a part of a person or part of the road or the sidewalk. Those first two levels are pretty much a lot of what self-driving car systems would use."

"But if you're actually trying to interact with an object in a particular way like grab a shoe in a particular way or grab a mug," he continued, "then just having a bounding box or just all these pixels correspond to the mug, isn't enough. Our system is really about getting into the finer level of details within the object... that kind of information is necessary for doing more advanced manipulation tasks."

That is, the DON system will allow a robot to look at a cup of coffee, properly orient itself to the handle, and realize that the bottom of the mug needs to remain pointing down when the robot picks up the cup to avoid spilling its contents. What's more, the system will allow a robot to pick a specific object out of a pile of similar objects.

"Many approaches to manipulation can't identify specific parts of an object across the many orientations that object may encounter," Manuelli wrote in the study. "For example, existing algorithms would be unable to grasp a mug by its handle, especially if the mug could be in multiple orientations, like upright, or on its side."

The system relies on an RGB-D sensor, a camera that captures both color and depth data. Best of all, the system trains itself. There's no need to feed the DON hundreds or thousands of images of an object in order to teach it. If you want the system to recognize a brown boot, you simply put the robot in a room with a brown boot for a little while. The robot will automatically circle the boot, taking reference photos that it uses to generate the coordinate points, then trains itself based on what it's seen. The entire process takes less than an hour.
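The self-supervision described above boils down, at a high level, to a pixel-level contrastive objective: pixels that the depth camera and robot pose identify as the same physical point should get similar descriptors, and pixels from different points should be pushed apart. Below is a toy version of such a loss; the function name, sampling scheme, and margin value are all assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def pixel_contrastive_loss(desc_a, desc_b, matches, non_matches, margin=0.5):
    """Toy pixelwise contrastive loss over descriptors from two views.

    desc_a, desc_b: (N, D) arrays of sampled pixel descriptors.
    matches: (i, j) index pairs known to be the same physical point.
    non_matches: (i, j) pairs known to be different points.
    """
    # Pull matching pixels' descriptors together (squared distance).
    match_loss = sum(
        float(np.sum((desc_a[i] - desc_b[j]) ** 2)) for i, j in matches
    )
    # Push non-matching pixels apart until they clear the margin.
    non_match_loss = sum(
        max(0.0, margin - float(np.linalg.norm(desc_a[i] - desc_b[j]))) ** 2
        for i, j in non_matches
    )
    return match_loss + non_match_loss

# Two views where descriptors already agree on matches and differ elsewhere:
desc_a = np.array([[1.0, 0.0], [0.0, 1.0]])
desc_b = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = pixel_contrastive_loss(desc_a, desc_b,
                              matches=[(0, 0)], non_matches=[(0, 1)])
```

Because the robot knows its own camera pose and has depth data, it can generate these match/non-match pairs automatically as it circles the object, which is why no hand-labeled images are needed.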

"In factories robots often need complex part feeders to work reliably," Manuelli wrote. "But a system like this that can understand objects' orientations could just take a picture and be able to grasp and adjust the object accordingly."

The technology is still in its infancy, so don't hold your breath for robot maids to empty your dishwasher any time soon. But eventually, Manuelli hopes, these machines, with their improved eyesight and coordination, will become members of your (ware)household.