Machine learning models have to memorize sensitive information

BY: GINA MANTICA

You might have encountered a machine learning model while typing an email or text message: the software starts to automatically fill in the rest of a word, phrase, or sentence. Machine learning models for sentence prediction can not only complete sentences but also analyze sentence structure and predict grammatical patterns. However, Hariri Institute Research Fellow Adam Smith, along with BU researchers Gavin Brown and Mark Bun and colleagues at Apple, found that accurate models for problems like these cannot work without memorizing training data that is often sensitive.

Their paper was recently accepted to the 2021 Association for Computing Machinery (ACM) Symposium on Theory of Computing.

Machine learning models extract relevant information from examples, known as training data, and produce a prediction algorithm based on that information. But the most accurate models sometimes memorize what seems like irrelevant information, even at the expense of an individual’s privacy. Smith’s findings suggest that a model’s performance and an individual’s right to privacy can be fundamentally at odds.

Adam Smith, Hariri Institute Research Fellow and Professor of Computer Science, discovered that memorization is necessary for machines to learn.

Through a detailed set of theoretical problems and mathematical proofs, Smith and colleagues showed that any training algorithm working from a limited-size data set must store the complete details of several data points to perform well. Smaller data sets are more likely to contain many outliers: data points that are very different from the rest of the data set. Each such data point is rich in information, but given only one of them, the model cannot tell which aspects of the point are worth remembering and which are not. To make accurate predictions, the model must therefore remember many extraneous details about the data.
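To make that intuition concrete, here is a minimal sketch in Python. It is an invented illustration, not the construction from the paper: the toy dataset and both toy models are assumptions for the example. A nearest-neighbor rule, which stores every training point verbatim, gets a lone outlier’s neighborhood right, while a model that keeps only one summary per class gets it wrong.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny, hypothetical training set: two clusters plus a single outlier.
# Class 0 clusters near (0, 0); class 1 clusters near (4, 4);
# one class-1 outlier sits at (-5, -5), far from its own cluster.
X = np.vstack([rng.normal(0, 0.5, (10, 2)),   # class 0
               rng.normal(4, 0.5, (10, 2)),   # class 1
               [[-5.0, -5.0]]])               # class-1 outlier
y = np.array([0] * 10 + [1] * 10 + [1])

def nearest_neighbor(x):
    """Stores every training point verbatim -- total memorization."""
    return y[np.argmin(np.linalg.norm(X - x, axis=1))]

def centroid(x):
    """Keeps only one summary (the mean) per class -- no memorization."""
    means = {c: X[y == c].mean(axis=0) for c in (0, 1)}
    return min(means, key=lambda c: np.linalg.norm(means[c] - x))

# A test point drawn from the outlier's neighborhood:
test = np.array([-4.8, -5.2])
print("memorizing model:", nearest_neighbor(test))   # predicts 1
print("summarizing model:", centroid(test))          # predicts 0 (wrong)
```

The trade-off mirrors the one described above: the summarizing model is far more compact, but it pays for that compression with errors on atypical points.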

However, if there is a lot of training data, a machine learning model can achieve the same level of accuracy without memorizing as much individual information about data points. This is because larger training sets likely contain many examples of any given kind, allowing the model to home in on and retain only essential features. Ironically, the more information the model is given, the less specific detail it needs to retain.
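Continuing the toy illustration above (again invented, not the paper’s analysis): once the rare region is represented by many examples rather than a single outlier, a model that retains only a few learned prototypes, found here with a bare-bones k-means, classifies that region correctly without storing any individual training point.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical larger training set: the rare region near (-5, -5) is now
# a well-represented sub-cluster of class 1 instead of a single outlier.
X0 = rng.normal(0, 0.5, (100, 2))                # class 0
X1 = np.vstack([rng.normal(4, 0.5, (100, 2)),    # class 1, common mode
                rng.normal(-5, 0.5, (100, 2))])  # class 1, rare mode

def kmeans(pts, k, iters=20):
    """Bare-bones Lloyd's algorithm: summarize pts with k prototypes."""
    centers = pts[:k].copy()                     # deterministic init
    for _ in range(iters):
        # assign each point to its nearest center ...
        d = np.linalg.norm(pts[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # ... then move each center to the mean of its assigned points
        centers = np.array([pts[labels == j].mean(axis=0)
                            if (labels == j).any() else centers[j]
                            for j in range(k)])
    return centers

# Keep two prototypes per class: 4 summary vectors instead of 300 raw points.
protos = {0: kmeans(X0, 2), 1: kmeans(X1, 2)}

def predict(x):
    # classify by the nearest prototype of either class
    return min(protos,
               key=lambda c: np.linalg.norm(protos[c] - x, axis=1).min())

print(predict(np.array([-4.8, -5.2])))  # 1 -- correct, no memorized points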

The researchers’ work suggests that, to make accurate predictions, machine learning models have to memorize all the information about outliers in particular; discarding any of it increases errors. “We showed that this phenomenon is unavoidable,” said Smith. “There is no way of training machine learning models that gets around memorizing specific examples of the training set data.”

As machine learning models are implemented widely, their utility will need to be weighed carefully against their risk to privacy.

