Deep learning systems can already detect objects in a given scene, including people, but they can't always make sense of what people are doing in that scene. Are they about to get friendly? MIT CSAIL's researchers might help. They've developed a machine learning algorithm that can predict when two people will high-five, hug, kiss or shake hands. The trick is to have multiple neural networks predict different visual representations of people in a scene and merge those guesses into a broader consensus. If the majority foresees a high-five based on arm motions, for example, that's the final call.
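To make that consensus step concrete, here is a minimal sketch of a majority vote over several predictions. It covers only the final merging step described above, not CSAIL's actual model; the `consensus` function, the `ACTIONS` tuple and the sample votes are hypothetical names and data invented for illustration, and each string stands in for the output of one neural network.

```python
from collections import Counter

# Hypothetical label set, taken from the four actions the article lists.
ACTIONS = ("high-five", "hug", "kiss", "handshake")


def consensus(votes):
    """Return the action that the most networks voted for.

    Assumption: each network contributes exactly one label from ACTIONS.
    Ties go to the label that appeared first in the list of votes.
    """
    if not votes:
        raise ValueError("need at least one vote")
    unknown = [v for v in votes if v not in ACTIONS]
    if unknown:
        raise ValueError(f"unknown action labels: {unknown}")
    # most_common(1) yields a one-item list of (label, count) pairs.
    return Counter(votes).most_common(1)[0][0]


# Illustrative votes from three imaginary networks (made-up data).
votes = ["high-five", "hug", "high-five"]
print(consensus(votes))
```

With these sample votes, two of the three imaginary networks agree on a high-five, so that becomes the final call. Strictly speaking this picks the most common answer rather than requiring more than half the votes, which is a simplification on our part.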
The algorithm is also useful for determining when objects are likely to appear. If a character reaches into a microwave, for instance, the algorithm may decide that a coffee cup is likely to come out.
The technology isn't ready for prime time. The algorithm picked the right action only about 43 percent of the time (versus 71 percent for humans), and it only really works from what it sees a few seconds before the action happens. People can frequently tell what will happen further in advance based on subtler cues like dialogue and facial expressions.
If CSAIL can refine its technique, though, there are wide-ranging implications for robotics and other AI-guided tasks. You could make automatons that know how to respond to human interactions, such as drawing the connections between friends or catching someone who's about to fall. Security cameras could deliver alerts based on the actions they see, too, such as warning a company that someone might need an ambulance. In short: AI is about to get better at understanding the implications of what it sees, not just the immediate conditions.