After releasing its Portal video-calling tool to largely positive reviews (especially from its employees) last November, Facebook is finally cracking open the device and giving the rest of us a glimpse at the Portal's inner workings. Engadget sat down with Facebook's Rafa Camargo, Vice President of Hardware, and Matt Uyttendaele, Engineering Director of Mobile Vision, to discuss the device's development and the artificial intelligence that powers Portal.
When Facebook's AI research group (FAIR) began working on the systems that would eventually become the Portal two years ago, the team asked itself, "How do we create an automated camera that will feel natural, will feel engaging and would actually not get in the way," Camargo explained to Engadget. "The key thing for us, is really invoking that we are connected in the two rooms and making you feel like you're there and just hanging out."
In order to create that effect, the Portal team designed the device's Smart Camera to mimic the movements and judgements of human camera operators. That involved collaborating with "award winning film directors, documentary producers, and camera people," Camargo said. Their feedback helped steer development until "we're essentially to the point where literally the camera disappears because it becomes so natural that you just don't notice the camera. You just see the scene and what's happening."
Accomplishing that is harder than it sounds, mind you. As the FAIR team explains, the Portal was originally slated to use a mechanical camera. However, that design suffered from a number of drawbacks, including an increased likelihood of breakdown and the inability to react to events and actions happening off camera. "Smart Camera," the FAIR team wrote, "which was always a key component of our product, became increasingly central to our planned reinvention of the video-calling experience."
Once they settled on what hardware to employ, the Portal team set about creating the AI that would command it. They started with the Mask R-CNN model that FAIR had released in 2017. The Mask model is a body detection system that "detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance," according to Facebook's research blog. "Mask R-CNN is a really elegant solution to finding objects in an image. It can be applied to lots of things but it works really well for this [application] as well," Uyttendaele told Engadget.
However, the model as it existed in 2017 was not suitable for use with the mobile chipsets that the company was using in the Portal. For one thing, Mask R-CNN only operated at 5 fps and was very processor intensive. Camargo said using the existing Mask model would have required additional processing and cooling hardware that would have made the Portal more expensive and less reliable. Mask R-CNN2Go, however, "is very tuned to the compute constrained mobile environment."
In response, Facebook's research teams streamlined the model until it was only a few megabytes in size and dubbed it Mask R-CNN2Go. Despite its reduced footprint, Mask R-CNN2Go runs 400 times faster than its predecessor while maintaining pose-detection accuracy. The new model also improved "low-light performance by applying data augmentation on low-light examples in the training data set and balanced multiple pose-detection approaches," the FAIR team wrote.
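To make the idea of low-light data augmentation concrete, here is a minimal sketch of how a training pipeline might generate darker variants of an existing image. It is not Facebook's code: the function names, the gain and gamma ranges, and the plain-list grayscale image format are all assumptions chosen for illustration. Because the change only affects brightness, any pose labels attached to the original image would still line up with the darkened copy.

```python
import random


def darken(image, gain, gamma):
    """Return a dimmer copy of an 8-bit grayscale image (a list of pixel rows).

    gamma > 1 crushes the shadows and gain < 1 lowers overall brightness.
    Both parameters are illustrative assumptions, not values from the article.
    """
    out = []
    for row in image:
        new_row = []
        for px in row:
            value = ((px / 255.0) ** gamma) * gain
            new_row.append(max(0, min(255, round(value * 255))))
        out.append(new_row)
    return out


def augment_low_light(image, rng):
    """Pick a random darkening strength so the model sees varied lighting."""
    gain = rng.uniform(0.2, 0.6)    # hypothetical range
    gamma = rng.uniform(1.5, 2.5)   # hypothetical range
    return darken(image, gain, gamma)


if __name__ == "__main__":
    rng = random.Random(0)
    bright = [[0, 64, 128], [192, 224, 255]]
    print(augment_low_light(bright, rng))
```

A real pipeline would apply the same idea to full-color frames and would likely add sensor noise as well, since dim scenes are typically noisier and not just darker.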
"The original Mask R-CNN paper threw a lot of capacity at identifying humans in all sorts of different situations -- skiing, on a horse, in an outdoor environment," Uyttendaele continued. However, since stability and efficiency were key goals of the Portal's development (not to mention nobody was going to be video conferencing while astride a steed), there was no call for training the system on much more than humans doing stuff indoors.
But even in an enclosed environment, there are plenty of things around to confuse a poorly trained AI, which is part of the reason the Mask R-CNN2Go model focuses on body, rather than facial, detection.
"We really needed Portal to understand the full body position -- in real time, all the time -- to frame you," Camargo stated. If, for example, you're lying on the couch and are covered by a blanket, the system needs to realize that it might only be seeing your face and that your body position will be horizontal, rather than vertical, "because it would frame you differently, would zoom into you differently."
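As a rough illustration of why body position changes the framing, the sketch below turns a set of detected body keypoints into a crop rectangle. Everything in it is hypothetical: the `frame_subject` function, the 15 percent margin and the 16:9 output shape are assumptions for the example, not details Facebook has shared.

```python
def frame_subject(keypoints, frame_w, frame_h, margin=0.15, aspect=16 / 9):
    """Return a crop (x, y, width, height) that contains all keypoints.

    keypoints: list of (x, y) pixel positions for one detected body.
    The crop is padded by `margin`, widened or heightened to match
    `aspect`, and clamped so it stays inside the camera frame.
    """
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    left, right, top, bottom = min(xs), max(xs), min(ys), max(ys)

    # Pad the tight bounding box; enforce a 1-pixel minimum to avoid zero sizes.
    w = max((right - left) * (1 + 2 * margin), 1.0)
    h = max((bottom - top) * (1 + 2 * margin), 1.0)

    # Grow whichever side is too small for the target aspect ratio.
    if w / h < aspect:
        w = h * aspect
    else:
        h = w / aspect

    # Shrink uniformly if the crop is larger than the frame itself.
    scale = min(1.0, frame_w / w, frame_h / h)
    w, h = w * scale, h * scale

    # Center on the body, then slide the crop back inside the frame.
    cx, cy = (left + right) / 2, (top + bottom) / 2
    x = min(max(cx - w / 2, 0.0), frame_w - w)
    y = min(max(cy - h / 2, 0.0), frame_h - h)
    return x, y, w, h


if __name__ == "__main__":
    standing = [(900, 200), (1000, 200), (950, 1000)]   # tall, narrow body
    lying = [(500, 700), (1300, 800), (900, 900)]       # wide, short body
    print(frame_subject(standing, 1920, 1080))
    print(frame_subject(lying, 1920, 1080))
```

In this toy version, the person lying on the couch ends up with a much tighter crop than the person standing, which is one simple way a horizontal pose could lead to a different zoom level.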
As such, the Smart Camera's AI analyzes every frame of the video call. This allows it to effectively track (or, conversely, ignore) various human-like objects in the scene. That is, if you're calling your grandparents on a Portal, the system can actively track your elderly relatives while ignoring the life-size portrait hanging on the wall behind them, because it "sees" that they're shifting in their chairs and fidgeting while the painting remains unmoved from frame to frame.
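The portrait example boils down to comparing detections across frames. Here is a minimal sketch of that idea, assuming each detected figure has already been matched across frames into a "track" of keypoint positions. The function names and the one-pixel tolerance are hypothetical, and the real system is presumably far more sophisticated.

```python
def is_static(track, tolerance=1.0):
    """Return True if no keypoint moves more than `tolerance` pixels.

    track: list of frames, each a list of (x, y) keypoints for one figure,
    with keypoints in the same order in every frame.
    """
    first = track[0]
    for frame in track[1:]:
        for (x0, y0), (x1, y1) in zip(first, frame):
            if abs(x1 - x0) > tolerance or abs(y1 - y0) > tolerance:
                return False
    return True


def live_subjects(tracks, tolerance=1.0):
    """Keep only the figures that moved, dropping portraits and posters."""
    return [name for name, track in tracks.items()
            if not is_static(track, tolerance)]


if __name__ == "__main__":
    tracks = {
        # A seated person fidgets by a few pixels between frames.
        "grandparent": [[(100, 200)], [(103, 201)], [(99, 204)]],
        # A painting only shows sub-pixel detection jitter.
        "portrait": [[(400, 150)], [(400, 150)], [(400.3, 150.2)]],
    }
    print(live_subjects(tracks))
```

The tolerance exists because even a perfectly still object will not be detected at exactly the same coordinates in every frame, so some small amount of jitter has to be ignored.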
Given the recent spate of hackers infiltrating internet-connected home security cameras and baby monitors, Facebook made sure to bake a degree of privacy and security into the hardware itself. "The whole AI engine that is doing analytics on the camera runs local, so everything stays local," Camargo insisted. "None of that ever leaves your home or wherever you're putting the device. The only media feed that leaves the device is the final result and it's when you're in a call. And it's only going to the people on the other side of the call."
While the Portal and its larger variant, the Portal+, are currently on store shelves, Facebook is not done developing its Mask R-CNN2Go model and Smart Camera tech (or content that runs on it). The company is working on an AR-based feature called Story Time, for example. "When you truly hang out with people you're not just chatting or talking, you actually do activities together," Camargo explained. "So we see a lot of potential actually, to bring AR as a way to actually help people engage and feel deeper and stay more time engaged together through the connection."
This technology could eventually make the leap to other devices as well. "Something that drives us is making sure that these computer driven algorithms can run across our community's set of devices," Uyttendaele concluded. "We focus a lot on lower end phones and as we do that, we make computer vision more and more performant over time. I think, because of that, we will bring a more optimized 2D pose tracker to Portal in the future."