Apple's use of 'differential privacy' is necessary but not new

Google has been employing the same statistical technique that promises to keep your data private.

Gabrielle Lurie/AFP/Getty Images

Toward the end of Apple's WWDC keynote in San Francisco this week, senior VP of software engineering Craig Federighi switched gears from stickers and bubble effects to talk about a particular kind of privacy that would enable "crowdsourced learning" while keeping people's information "completely private."

In keeping with the company's newfound image as a proponent of people's privacy, Federighi first pointed out that Apple does not build user profiles. He briefly mentioned end-to-end encryption before alluding to the privacy challenges of big data analysis, which is essentially the key to improving features and product experiences for almost any tech company. The quick buildup led to the announcement of a solution: "differential privacy."

Against the backdrop of a major keynote address, unfamiliar techniques tend to sound new and revolutionary. But differential privacy is a mathematical technique that's been around for a few years within the statistical field. "It's a [robust and rigorous] definition of privacy that allows us to measure privacy loss," Cynthia Dwork, the co-inventor of differential privacy and a scientist at Microsoft Research, told Engadget. "It says that the outcome of any analysis is essentially the same independent of whether any individuals opt into the database or opt out. The same things are learned whether or not you chose to allow your data to be used for the study. The intuition is that if you couldn't be hurt if you didn't participate then you pretty much cannot be hurt if you do participate."
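For readers who want the formal version of what Dwork describes, the standard definition from the differential privacy literature can be written as a single inequality. The notation here is introduced only for illustration and is not used elsewhere in this article: M is the randomized analysis, D and D' are any two databases that differ in one person's record, S is any set of possible outcomes, and epsilon is the privacy-loss parameter.

```latex
\[
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \, \Pr[\, M(D') \in S \,]
\]
```

The smaller epsilon is, the closer the two probabilities must be, which is the sense in which the outcome is "essentially the same" whether or not an individual opts in.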


Apple senior VP of software engineering Craig Federighi during the keynote on Monday. Photo credit: Gabrielle Lurie/AFP/Getty Images

Within the context of Apple, a differentially private algorithm will allow its data analysts to glean trends -- like the most popular emoji and words -- from large datasets without revealing identifiable information about any particular participant. To that end, the company will begin employing the technique, starting with macOS, by adding "mathematical noise to a small sample of the individual's usage pattern," according to an Apple representative. "As more people share the same pattern, general patterns begin to emerge, which can inform and enhance the user experience." This is expected to improve QuickType predictions and emoji and deep-link suggestions.

At least in theory, differential privacy is considered to be one of the most accurate privacy-preserving data techniques within the academic world. According to the defining literature on the subject -- a book co-authored by Dwork and Aaron Roth, a computer science professor at the University of Pennsylvania who was quoted on stage at WWDC -- the premise of differential privacy is a guarantee:

"Differential privacy describes a promise, made by a data holder, or curator, to a data subject:​ 'You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available.' At their best, differentially private database mechanisms can make confidential data widely available for accurate data analysis, without resorting to data clean rooms, data usage agreements, data protection plans, or restricted views. Nonetheless, data utility will eventually be consumed: the Fundamental Law of Information Recovery states that overly accurate answers to too many questions will destroy privacy in a spectacular way. The goal of algorithmic research on differential privacy is to postpone this inevitability as long as possible."

With increased ability to electronically collect and curate incredibly large datasets, the need to find appropriate algorithms that can prevent the destruction of privacy is even stronger. As an ad hoc solution, researchers and companies have turned to anonymization, where the data is stripped of specifics like names and email addresses. But selective scrubbing has not been enough to keep individuals unidentifiable and has left people vulnerable time and again.

In the University of Pennsylvania's introduction to differential privacy, Roth explains that vulnerability with an example: "At one point, it was shown that an attack on Amazon's recommendation algorithm was possible," he says. "If I knew five or six things you bought on Amazon, I could buy those same things, and all of a sudden, we're now the two most similar customers in Amazon's recommendation algorithm. I could then start seeing what else you were buying, as whatever you bought would then be recommended to me."

"In differential privacy nobody actually looks at raw data. There is an interface that sits between the data analyst and the raw data and it ensures that privacy is maintained." -- Cynthia Dwork, the co-inventor of differential privacy

Differential privacy was invented to tackle that precise problem. The algorithm, which potentially protects people from online attacks, is designed to deliberately add noise to the numbers. It's based on a popular surveying technique called "randomized response" where people are asked if they engaged in any illegal activities. Dwork gives an example of a surveyor who calls to find out whether an individual cheated on an exam. But before responding, the person is asked to flip a coin. If it's heads, the response should be honest but the outcome of the coin shouldn't be shared. If the coin comes up tails, the person needs to flip a second coin; if that one is heads, the response should be "yes." If the second is tails, it's "no."
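The two-coin procedure described above can be sketched in a few lines of Python. This is only an illustration of the survey technique, not code from Dwork, Apple or Google; the function name `randomized_response` and its parameters are hypothetical.

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Return the answer a respondent reports under the two-coin protocol."""
    # First coin: heads means the respondent answers honestly.
    if rng.random() < 0.5:
        return truth
    # First coin was tails: a second coin decides the answer.
    # Heads -> "yes" (True), tails -> "no" (False), regardless of the truth.
    return rng.random() < 0.5


if __name__ == "__main__":
    rng = random.Random()
    # A respondent who really did cheat; the surveyor only sees the report.
    print("reported:", randomized_response(True, rng))
```

Under this protocol, someone whose true answer is "yes" reports "yes" three times out of four on average, and someone whose true answer is "no" reports "yes" one time in four, so no single report settles the question either way.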

The research technique doesn't let the surveyor know if the answer was truthful or simply a random outcome based on the coins. "There's a statistical hint," says Dwork. "But you can't tell for sure if the truth was a yes or a no. Statisticians know how to reverse engineer these noisy numbers and pull out the approximate of how many people were cheating." The same applies to datasets, where the yeses or trends can be understood. With more people in a study or a dataset, the proportional errors shrink dramatically. The errors don't disappear entirely, but the technique provides an approximation that's rooted in mathematical evidence.
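For the two-coin survey, the reverse engineering Dwork mentions comes down to simple arithmetic. Half the respondents answer honestly and a quarter are forced to say "yes," so the expected share of "yes" reports is 0.5 times the true rate plus 0.25, and the true rate can be estimated as 2 times (observed share minus 0.25). The sketch below simulates that; the function names, the 30 percent cheating rate and the sample sizes are assumptions chosen purely for illustration.

```python
import random


def estimate_true_rate(reports):
    """De-bias two-coin randomized-response reports.

    Expected share of "yes" reports = 0.5 * true_rate + 0.25,
    so true_rate is estimated as 2 * (observed_share - 0.25).
    """
    observed = sum(reports) / len(reports)
    return 2 * (observed - 0.25)


def simulate(n, true_rate, seed=0):
    """Survey n simulated people and return the estimated true rate."""
    rng = random.Random(seed)
    reports = []
    for _ in range(n):
        truth = rng.random() < true_rate         # did this person cheat?
        if rng.random() < 0.5:                   # first coin heads: honest answer
            reports.append(truth)
        else:                                    # tails: second coin decides
            reports.append(rng.random() < 0.5)
    return estimate_true_rate(reports)


if __name__ == "__main__":
    # Assumed true rate of 30 percent; estimates tighten as the survey grows.
    for n in (100, 10_000, 100_000):
        print(n, round(simulate(n, true_rate=0.30), 3))
```

Running the simulation with larger samples should show the estimate settling closer to the assumed rate, which is the shrinking proportional error described above.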

That kind of statistical validation makes the technique well-suited for technology companies that rely heavily on data analysis. But its adoption has been slow until now. Dwork believes that one of the reasons for the sluggishness is that privacy hasn't always been a priority for people who work with very large sets of personal data. "The risks to privacy were less well understood than they are now," she says. "Also I think people who were used to working with data, like medical surveys, etc. ... were used to looking at raw data. In differential privacy nobody actually looks at raw data. There is an interface that sits between the data analyst and the raw data and it ensures that privacy is maintained. People who have a certain training that taught them how to analyze data didn't necessarily know how to work with this new model."

A quote from Aaron Roth at WWDC this week

The technique, and the required expertise, is still a work in progress. Apple's decision to adopt it as a tool for machine learning and gathering statistics takes it from theory to practice. But Apple isn't the first to have that idea. Google has already been using it for its RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response) project for the last couple of years. It allows the company to find out which websites are most popular with people when they launch the Chrome browser. "What they do, very roughly speaking, is get a report from the individual browser that has already had differential privacy rolled into it," says Dwork. "It gives a statistical hint about where people are going without actually revealing for sure who is going where."

Beyond privacy, the flexibility of the technique makes it desirable. It can be applied everywhere from scientific research to technology companies. But perhaps the biggest selling point of the algorithm is that it's good at its job. "You don't actually want something that is good at predicting what people have bought historically," says Roth in a paper. "You want something that predicts what they are going [to] buy tomorrow."

Despite its computational power, differential privacy has limitations similar to those of other privacy-preserving methods. "Within the [tech landscape], the challenges revolve around the trade-off between what can be done and accepting the fundamental truth that 'overly accurate estimates of too many statistics is non-private'," says Dwork. "I think if you're interested in privacy, sometimes restraint might be the right approach."
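In the differential privacy literature, that restraint is often expressed as a "privacy budget": each analysis is assigned a privacy-loss value (conventionally called epsilon), the losses of successive analyses add up under basic composition, and queries are refused once a chosen total is reached. The sketch below is a minimal illustration of that bookkeeping, not a description of how Apple or Google manage their systems; the class name `PrivacyBudget` and the numbers used are hypothetical.

```python
class PrivacyBudget:
    """Track cumulative privacy loss under basic sequential composition."""

    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon  # the most privacy loss we will allow
        self.spent = 0.0                    # privacy loss used so far

    def spend(self, epsilon: float) -> None:
        """Charge one analysis against the budget, or refuse it."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not cause a refusal.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted; refusing the query")
        self.spent += epsilon

    @property
    def remaining(self) -> float:
        return max(self.total_epsilon - self.spent, 0.0)


if __name__ == "__main__":
    budget = PrivacyBudget(total_epsilon=1.0)  # assumed total, for illustration
    budget.spend(0.4)
    budget.spend(0.4)
    print("remaining:", round(budget.remaining, 2))
    try:
        budget.spend(0.4)  # would exceed the total, so it is refused
    except RuntimeError as err:
        print(err)
```

The design choice here is deliberately blunt: once the total is used up, the only safe answer to a further question is no answer.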