Machine learning systems constantly learn from data about people to optimise themselves. For instance, voice assistants learn and improve from constant interaction and feedback with their users. Text messengers learn from keyboard inputs to predict useful suggestions.
But what if inputs contain very sensitive information or even secrets? Are they also learned, and do they remain in the system? What if learned systems are shared? Can others ask them to reveal my secrets?
Learned systems learn from private data
A learned system can learn to identify a human face by looking at thousands of different faces. It creates its own internal model that generalises the features of a human face. This trained model is then able to recognise faces it has not seen before.
Many learned systems are trained on what the General Data Protection Regulation (GDPR) classifies as personal data. For instance, health records are used to train models that can predict the probability of a patient developing cancer. The data controller must ensure the confidentiality of the health records used to train the model.
Can we learn private data from learned systems?
Popular deep learning methods aim to train models that generalise well while also making highly accurate predictions. But when a model encounters very rare data that cannot be generalised, such as a unique piece of very personal information, it tends to memorise it instead.
Whenever training data contains private information, the question arises whether the trained model has memorised any of it. And is it possible for others to access private data just by interacting with the model?
Learned systems share secrets
Nicholas Carlini et al. have shown that it is possible to extract private data from trained models. They recently published one of the most advanced data recovery techniques, which even made it possible to extract credit card numbers from a model without prior knowledge of the training data.
The researchers developed a method to measure the extent to which individual secrets are exposed by the model. Unfortunately, this tool also helped them to extract these secrets.
What are the consequences for GDPR?
Michael Veale, Reuben Binns and Lilian Edwards investigated whether machine learning models should be considered personal data under GDPR. Models can currently be shared between companies, or even publicly, with few regulatory concerns. With more sophisticated methods that allow recovery of training data from a model, that position becomes complicated and messy.
Guaranteeing the ‘right of access’ would require the model to explain its decisions and how private data was involved. However, the behaviour of complex deep learning models is still very difficult to explain.
To comply with the ‘right to be forgotten’, models would be required to unlearn knowledge obtained from private data. This is a new technical challenge and solutions are still in an experimental stage.
The ‘right to restrict’ is also difficult to implement. A query against a model might be seen as querying the complete training set used to create the model.
If these new insights cause uncertainty about their consequences for GDPR, they will cause even more confusion for people interacting with learned systems.
It is important to find new ways to tackle this problem while focussing on people’s needs. A very promising approach to achieve this is differentially-private learning. It does not require a change in how people currently use learned systems.
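The core idea behind the standard technique for differentially-private learning (DP-SGD, due to Abadi et al.) is simple to sketch: clip each training example's gradient so no single person's data can dominate an update, then add calibrated noise before applying it. The following numpy sketch shows one such update step; the function name, hyperparameter names and values are illustrative assumptions, not a production implementation.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_mult=1.1):
    """One differentially-private SGD step, DP-SGD style:
    1. clip each example's gradient to bound any individual's influence,
    2. add Gaussian noise to the summed gradients,
    3. apply the averaged, noised update to the parameters.
    """
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose norm exceeds clip_norm.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    # Noise scale is proportional to the clipping bound.
    noise = np.random.normal(0.0, noise_mult * clip_norm, size=total.shape)
    noisy_mean = (total + noise) / len(per_example_grads)
    return params - lr * noisy_mean
```

Because every example's contribution is bounded and hidden in noise, a model trained this way cannot memorise any one person's secret in the way described above, while training and inference look unchanged to the people using the system.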
In my next post, IF will show how to train differentially-private models to protect people’s privacy and make sure that their secrets remain secret.
If you’d like to hear more about our work, email us at firstname.lastname@example.org.
Come work at IF! Our open roles are here: https://projectsbyif.workable.com/