In our previous post, we outlined how important mobility data is to ensure equitable access to transport. We also highlighted that data about how individuals move is sensitive. Privacy versus data collection for service improvement is often presented as a choice of one against the other. While that’s true in some cases, privacy preserving techniques can be used to achieve both for many real world problems.
In this post, we’ll introduce the technique of randomised response, one of the simplest privacy preserving techniques. We’ll look at how it can be applied to a real world problem - tracking the number of hired bikes passing through each neighbourhood in a city. We created a Python notebook to accompany this post implementing the techniques we describe.
Start with simulated data when possible
To build our bike counting code, we need some trip data to work with. Real trip data is sensitive, even when stripped of direct identifiers. To minimise the amount of sensitive data our code is exposed to, we should avoid using real trip data unless there's no alternative. We chose to simulate bike trip data for our exploration.
Simulating data, particularly during the development phase of software, allows us to reduce exposure to sensitive data. Simulating data often has advantages beyond privacy. You can improve testing by generating a range of datasets designed to trigger edge cases, or by generating data that is more detailed than the data in a real system. For example, we generate precise bike trip routes, but we could reduce them to a set of noisy, intermittent, simulated GPS reports to evaluate a route inference algorithm against the data we originally generated. Simulated data also aids transparency: it removes a significant barrier to opening systems for others to inspect, as there are no privacy concerns to address.
To simulate our bike trip data, we first need a city for the bikes to travel within. We chose to generate a fake city, using the City Chef library from Arup’s City Modelling Lab. The library allows us to randomly create clusters of homes and workplaces, and connect them with a simple grid-like street pattern. The simplicity of the generated form allows us to quickly explore ideas, while risking that those ideas may fail in more complex real world scenarios. If we needed more realistic data, we could have chosen a more complex simulator. For example, the Shared Streets simulator that generates bike trips in a real world street network, or even MATSim, which can model the entire transportation system of large cities. When generating simulated data, you need to consider the tradeoffs between the precision and depth of the data you’d like to generate, and how complex the simulator needs to be to generate it.
Simulators have been used in the development of many systems that would otherwise need access to sensitive data during development. The DeepMind Health team simulated an active hospital to aid development of their clinical application, Streams. The OpenSAFELY project provides clinical researchers with a way to declare the structure of data they intend to work with in a way that both lets their infrastructure generate simulated data matching those expectations for development, and connect their analysis code to real data when running in a secure research environment.
We used City Chef to generate random home and workplace locations, and a street network between them. We also used it to generate a set of neighbourhoods of roughly equal population. We then picked one thousand random home and workplace locations, and found the shortest path between them through our street network. These paths became our simulated bike trips.
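The trip generation step can be sketched in a few lines. City Chef's own API isn't shown here; as a stand-in, this sketch uses a plain square grid of streets, and replaces a graph shortest-path search with a simple Manhattan walk, which is a shortest path on an unweighted grid. All names and parameters below (`GRID`, `grid_route`) are illustrative, not City Chef's.

```python
import random

random.seed(42)

GRID = 20  # stand-in for a City Chef city: a 20x20 street grid

def grid_route(home, work):
    """A shortest path on an unweighted grid: walk east/west to the
    right column, then north/south to the right row. (The notebook
    uses the City Chef street network and a real shortest-path
    search; this Manhattan walk is an illustrative stand-in.)"""
    x, y = home
    path = [(x, y)]
    while x != work[0]:
        x += 1 if work[0] > x else -1
        path.append((x, y))
    while y != work[1]:
        y += 1 if work[1] > y else -1
        path.append((x, y))
    return path

# Pick one thousand random (home, workplace) pairs and route
# between them; these paths become our simulated bike trips.
nodes = [(x, y) for x in range(GRID) for y in range(GRID)]
trips = [grid_route(*random.sample(nodes, 2)) for _ in range(1000)]
```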
How randomised response works
Randomised response is a simple privacy preserving technique that allows you to make estimates about sensitive characteristics of a dataset, without ever having access to sensitive data points. Let’s say we wanted to estimate how many people in a group illegally rode bikes in pedestrian space to avoid dangerous roads, without ever recording which specific individuals broke the law. Before answering the question, each individual flips a coin. If the coin shows tails, they answer truthfully; if it shows heads, they answer ‘yes’. We know that approximately half of the group will answer ‘yes’ purely because of the coin, independently of how they actually rode, as the coin will show heads for around half the participants. Conversely, ‘no’ answers only come from truthful respondents, and only around half of those who didn’t ride in pedestrian space will give one. The real number of people who didn’t ride in pedestrian space is therefore around twice the number of people who say they didn’t. We can estimate how many people did by subtracting this estimate from the number of participants. Crucially, the most likely explanation for anyone answering ‘yes’ is that the coin showed heads, so we can’t infer anything about the behaviour of any one individual.
It’s important to realise that while we can never be sure whether a ‘yes’ answer is truthful or not, ‘no’ answers always are. To protect privacy, questions need to be phrased such that the ‘yes’ answer is sensitive, and the ‘no’ answer isn’t. If such a phrasing isn’t possible, other techniques would need to be used.
Values calculated through randomised response will only ever be an estimate of the true value. This is due to the error introduced by the coin toss, and the assumption that approximately half the group will say ‘yes’ because of the coin. Instead of asking people to answer ‘yes’ half the time, we could shift that proportion higher or lower, for example asking people to answer ‘yes’ 70% of the time. A higher proportion means fewer people are telling the truth, leading to a less accurate estimate but stronger privacy protection, and vice versa.
One way of looking at randomised response is that it provides privacy by adding noise to data. There’s a balance between the amount of noise added and the level of privacy protected provided. This idea has been rigorously explored, and given a formal mathematical underpinning with Differential Privacy, a more recent and complex set of privacy preserving techniques. Differential Privacy was motivated by the realisation that even seemingly anonymous metrics can leak information about individuals when accurate datasets are combined, even with aggregate, rather than individual level, data. Differential privacy is complicated to implement in practice but provides robust guarantees against reidentification by attackers, irrespective of what data they have access to. While reidentification attacks, such as that against the New York Taxi trip dataset shown in the last post, seem convoluted, machine learning techniques are designed to find surprising correlations in datasets.
While randomised response is a simple technique, it’s the basic building block of RAPPOR, an approach used by the Chrome web browser to detect malware in a differentially private way. In RAPPOR, coin flips are used to decide whether or not a bit should be set in a Bloom Filter.
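To make the Bloom filter idea concrete, here is a toy sketch of a RAPPOR-style report. This is not the full RAPPOR protocol, which layers a permanent and an instantaneous randomisation step with carefully chosen parameters; it only shows the core move of hashing a value into Bloom filter bits and letting coin flips randomise some of them. All names and parameters are illustrative.

```python
import hashlib
import random

random.seed(3)

BITS = 64  # Bloom filter width (illustrative; RAPPOR's parameters differ)

def bloom_bits(value: str, n_hashes: int = 2) -> set:
    """Indices of the Bloom filter bits a value sets."""
    return {
        int.from_bytes(
            hashlib.sha256(f"{i}:{value}".encode()).digest()[:4], "big"
        ) % BITS
        for i in range(n_hashes)
    }

def noisy_report(value: str, p_flip: float = 0.25) -> list:
    """Toy RAPPOR-like report: set the value's Bloom bits, then let a
    biased coin decide, per bit, whether to report a random bit
    instead of the true one."""
    true_bits = bloom_bits(value)
    report = []
    for bit in range(BITS):
        if random.random() < p_flip:
            report.append(random.randint(0, 1))  # coin says: randomise
        else:
            report.append(1 if bit in true_bits else 0)
    return report
```

As with the survey example, any single report is deniable, but aggregating many reports lets the collector estimate how often each value occurs.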
How we applied randomised response to simulated trip data
In our example we’d like to count the number of simulated bike trips passing through each one of our neighbourhoods, without recording where a given bike travelled. To do this, we first determine the neighbourhoods each bike has passed through.
We then add noise to those neighbourhoods through randomised response. For each bike and neighbourhood, we ask the bike “Did you travel through this neighbourhood?”.
The result is noisy data about the neighbourhoods visited by each bike, that doesn’t allow us to infer anything about the actual trip taken by each bike. At this point, we could delete the sensitive data about the path travelled by each bike. In a real system, we might never record this sensitive data at all, instead directly recording the noisy data.
From the noisy data, we can derive estimates for the number of bikes passing through each neighbourhood: the number of bikes bypassing a neighbourhood is around twice the number reporting that they bypassed it, and subtracting this from the total number of bikes gives us the count passing through.
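The whole pipeline, from true visits through noising to per-neighbourhood estimates, can be sketched as follows. The true visit sets here are randomly invented rather than derived from routes, and the neighbourhood count and visit rate are illustrative assumptions; the notebook intersects real simulated routes with neighbourhood boundaries instead.

```python
import random

random.seed(2)

N_NEIGHBOURHOODS = 16
N_BIKES = 1000

# Stand-in for the simulated trips: the set of neighbourhoods each
# bike actually passed through (invented here; in the notebook this
# comes from intersecting routes with neighbourhood boundaries).
true_visits = [
    {n for n in range(N_NEIGHBOURHOODS) if random.random() < 0.3}
    for _ in range(N_BIKES)
]

def noisy_answer(visited: bool) -> bool:
    """Randomised response: heads -> 'yes', tails -> the truth."""
    return True if random.random() < 0.5 else visited

# For each bike and neighbourhood, record only the noisy answer.
# After this point the true routes could be deleted.
noisy_visits = [
    [noisy_answer(n in visits) for n in range(N_NEIGHBOURHOODS)]
    for visits in true_visits
]

# Estimate per-neighbourhood counts: bikes bypassing a neighbourhood
# number roughly twice those that reported bypassing it.
estimates = []
for n in range(N_NEIGHBOURHOODS):
    reported_no = sum(1 for bike in noisy_visits if not bike[n])
    estimates.append(N_BIKES - 2 * reported_no)
```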
While we count the number of bikes passing through each neighbourhood, any geographic partitioning could be used, from street segments to areas in which parking bikes is not permitted.
Implications for system design
Using privacy preserving techniques changes how people can use data. Organisations that hold data previously regarded as too sensitive to share could open datasets to wider audiences, making it possible for people to use data in new ways. This is particularly useful when organisations cannot balance the risk of working with sensitive data against their business models. Once protected, datasets can be used for a range of different analyses. Our neighbourhood counts, for example, could be grouped by date, time, or another attribute, as long as noisy bike level records are retained. Some analyses, however, won’t be possible. Once precise location data is erased, it cannot be recovered if, for example, we later wish to count the number of bikes traversing a particular street, rather than a neighbourhood. Privacy preserving techniques need to be applied to a specific question, rather than just a dataset. This goes against opportunistically collecting data in the hope that it will be useful in the future. The need for privacy preserving techniques to be applied to specific problems is one reason why specificity forms one of our four pillars for responsible data systems.
In our next post, we’ll talk about some of the use cases that could be enabled through the use of privacy preserving techniques on mobility data.