Facebook hopes its new AI moderation tools can further counter hate speech

The company's human moderators remain unconvinced.

Francois Lenoir / Reuters

Facebook has waged a long-fought and sometimes seemingly losing battle against hate speech and misinformation spreading across its platform. On Thursday, the company rolled out the latest additions to its automated anti-trolling arsenal in an effort to further curb bigots and bad actors on the site.

The company’s CTO, Mike Schroepfer, noted that Facebook has taken a number of proactive steps in the last year to combat hate speech and those efforts have already begun to show results. In the first quarter of 2020, the company took action against 9.6 million pieces of content, almost double the 5.7 million in the quarter prior. “Q3 of last year to Q3 of this year, on Facebook, we've actually done over three times as much content takedowns via our automated systems, detecting hate speech,” Schroepfer told an assembly of reporters via Zoom on Wednesday. “There's not a lot in life that improves three x over a year. So I think that's, that's pretty good.”

Instagram also saw a large influx of automated takedowns in the last quarter, effectively doubling the rate of the same period before it. “[We] are now at a similar proactive rate on Instagram, as we are on Facebook,” Schroepfer continued. “So we're seeing about a 95 percent proactive rate on both of those platforms.”

Of course, the baselines for those figures are continually in flux. “COVID misinformation didn't exist in Q4 of 2019, for example,” he said. “And there can be quite a change in a conversation during an election. So what I'd say is you always have to look at all these metrics together, in order to get the biggest picture.”

In addition to Facebook’s existing array of tools including semi-supervised self-learning models and XLM-R, the company unveiled and implemented a pair of new technologies. The first, Schroepfer said, is Linformer, “which is basically an optimization of how these large language models work that allows us to deploy them sort of at the massive scale we need to address all the content we have on Facebook.”

Linformer is a first-of-its-kind Transformer architecture. Transformers are the model of choice for a number of natural language processing (NLP) applications, and unlike the recurrent neural networks that came before them, they can process data in parallel, which makes training models faster. But that parallel processing is resource hungry, requiring quadratically more memory and processing cycles as the input length increases. Linformer is different: its resource needs grow only linearly with input length, allowing it to process longer inputs using fewer resources than conventional Transformers.
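Schroepfer's description lines up with Linformer's core trick: projecting the keys and values down to a fixed length before computing attention, so the attention matrix is n-by-k rather than n-by-n. A minimal NumPy sketch of that contrast (dimensions, projection matrices, and initialization are illustrative, not Facebook's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    # The attention matrix here is n x n: memory grows
    # quadratically with sequence length n.
    scores = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return scores @ V

def linformer_attention(Q, K, V, E, F):
    # E and F project the n keys/values down to a fixed k rows,
    # so the attention matrix is only n x k: memory grows linearly in n.
    scores = softmax(Q @ (E @ K).T / np.sqrt(Q.shape[-1]))
    return scores @ (F @ V)

n, d, k = 512, 64, 32  # sequence length, head dim, projected length (toy values)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E = rng.standard_normal((k, n)) / np.sqrt(n)  # key projection
F = rng.standard_normal((k, n)) / np.sqrt(n)  # value projection
out = linformer_attention(Q, K, V, E, F)
print(out.shape)  # same (n, d) output shape, much smaller intermediate
```

The output shape matches standard attention; only the intermediate score matrix shrinks, which is what makes deployment "at the massive scale" tractable.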

The other new tech is called RIO. “Instead of the traditional model for all of the things I talked about over the last five years,” Schroepfer said. “Take a classifier, build it, train it, test it offline, maybe test it with some online data and then deploy it into production, we have a system that can end-to-end learn.”

Specifically, RIO is an end-to-end optimized reinforcement learning (RL) framework that generates classifiers -- the tests that trigger an enforcement action against a specific piece of content based on the class associated with its datapoint (think: the process that determines whether or not an email is spam) -- using online data.
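The spam analogy above can be made concrete with a toy classifier: score a message on keyword evidence, then emit a class label. Real systems learn such weights from data rather than hand-coding them; the words, weights, and threshold here are made up purely for illustration.

```python
# Hypothetical keyword weights -- in a learned classifier these would
# come from training data, not a hand-written table.
SPAM_WEIGHTS = {"free": 0.4, "winner": 0.5, "click": 0.3}

def classify(message: str, threshold: float = 0.5) -> str:
    # Sum the evidence for each keyword present, then compare
    # against the decision threshold to pick a class.
    score = sum(w for word, w in SPAM_WEIGHTS.items() if word in message.lower())
    return "spam" if score >= threshold else "ham"

print(classify("Click here, WINNER, claim your FREE prize"))  # spam
print(classify("Lunch at noon?"))                             # ham
```

What RIO changes, per Schroepfer, is that the loop from live data back to the classifier is automated end to end instead of the offline build-train-test-deploy cycle.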

“What we typically try to do is set up our classifiers to work at a very high threshold, which means sort of when in doubt, it doesn't take an action,” Schroepfer said. “So we only take an action when the classifier is highly confident, or we're highly confident based on empirical testing, that that classifier is going to be right.”

Those thresholds regularly change depending on the sort of content that is being examined. For example, the threshold for hate speech on a post is quite high because the company prefers not to mistakenly take down non-offending posts. The threshold for spammy ads, on the other hand, is quite low.
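The per-category thresholds Schroepfer describes can be sketched in a few lines. The category names and numeric values below are hypothetical stand-ins, not Facebook's actual settings; the point is only that enforcement fires when classifier confidence clears a bar that varies with the cost of a mistake.

```python
# Illustrative thresholds -- values are invented for this sketch.
THRESHOLDS = {
    "hate_speech": 0.97,  # high bar: avoid taking down non-offending posts
    "spam_ad": 0.60,      # low bar: a wrongly removed spammy ad is a cheap mistake
}

def should_enforce(category: str, confidence: float) -> bool:
    # "When in doubt, it doesn't take an action": act only when the
    # classifier's confidence clears the bar for this content type.
    return confidence >= THRESHOLDS[category]

print(should_enforce("hate_speech", 0.90))  # False: not confident enough
print(should_enforce("spam_ad", 0.90))      # True
```

The same 0.90 confidence triggers action on a spammy ad but not on suspected hate speech, mirroring the asymmetry described above.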

In Schroepfer’s hate speech example, the metric RIO optimizes against is prevalence. “It's actually using some of the prevalence metrics and others that we released as its sort of score and it's trying to take those numbers down,” Schroepfer explained. “It is really optimizing from the end objective all the way backwards, which is a pretty exciting thing.”

“If I take down 1,000 pieces of content that no one was going to see anyway, it doesn't really matter,” Schroepfer stated. “If I catch the one piece of content that was about to go viral before it does, that can have a massive, massive impact. So I think that prevalence is our end goal in terms of the impact that has on users, in terms of how we're making progress on these things.”
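Schroepfer's point can be expressed as a simple prioritization rule: weight each flagged post by the views it is still likely to get, so one near-viral post outranks many posts nobody would have seen. The field names and numbers below are hypothetical, for illustration only.

```python
# Rank flagged posts by a crude harm proxy:
# classifier score x predicted remaining views. All values invented.
def rank_by_expected_harm(posts):
    return sorted(posts,
                  key=lambda p: p["score"] * p["predicted_views"],
                  reverse=True)

queue = rank_by_expected_harm([
    {"id": "a", "score": 0.99, "predicted_views": 10},       # almost no one will see it
    {"id": "b", "score": 0.80, "predicted_views": 500_000},  # about to go viral
])
print([p["id"] for p in queue])  # ['b', 'a']
```

Even with a lower classifier score, the near-viral post "b" jumps the queue, which is the prevalence-first logic Schroepfer describes.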

One immediate application will be automatically identifying the subtly changed clones -- whether that’s the addition of text or a border, or a slight overall blurring or crop -- of already-known violating images. “The challenge here is we have very, very, very high thresholds, because we don't want to accidentally take anything down. You know, adding a single ‘not’ or ‘no’ or ‘this is wrong’ on this post completely changes the meaning of it,” he continued.

Memes continue to be one of the company’s most vexing hate speech and misinformation vectors, due in part to their multi-modal nature. Parsing them requires a great deal of subtle understanding, according to Schroepfer. “You have to understand the text, the image, you may be referring to current events and so you have to encode some of that knowledge. I think from a technology standpoint, it's one of the most challenging areas of hate speech.”

But as RIO continues to generate increasingly accurate classifiers, it will grant Facebook’s moderation teams far more leeway and opportunity to enforce the community guidelines. The advances should also help moderators more easily root out hate groups lurking on the platform. “One of the ways you'd want to identify these groups is if a bunch of the content in it is tripping our violence or hate speech classifiers,” Schroepfer said. “The content classifiers are immensely useful, because they can be input signals into these things.”

Facebook has spent the past half decade developing its automated detection and moderation systems, yet its struggles with moderation continue. Earlier this year, the company settled a case brought by 11,000 traumatized moderators for $52 million. And earlier this week, moderators issued an open letter to Facebook management arguing that the company’s policies were putting their “lives in danger” and that the AI systems designed to alleviate the psychological damage of their jobs are still years away.

“My goal is to continue to push this technology forward,” Schroepfer concluded, “so that hopefully, at some point, [there are] zero people in the world who have to encounter any of this content that violates our community standards.”