“Unique in the Crowd: The privacy bounds of human mobility” by de Montjoye, Hidalgo, Verleysen & Blondel examines call traces for ~1.5 million mobile phone users, gathered “from April 2006 to June 2007 in a western country.” Each trace recorded the nearest antenna and the time whenever a call or text message was sent or received. Caller information was stripped, but all uses of the same phone were still grouped within one trace, and that grouping is enough to uniquely re-identify the callers.
Four randomly chosen points are enough to uniquely characterize 95% of the users (ε > .95), whereas two randomly chosen points still uniquely characterize more than 50% of the users (ε > .5). This shows that mobility traces are highly unique, and can therefore be re-identified using little outside information.
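To make the uniqueness measure concrete, here is a minimal sketch of how one might estimate it. It is not the authors' code: the function name `uniqueness`, the representation of a trace as a set of (antenna, hour) pairs, and the toy data are all assumptions made for illustration.

```python
import random

def uniqueness(traces, p, trials=200, seed=0):
    """Estimate the share of sampled users pinned down by p random points.

    traces: dict mapping a user id to a set of (antenna, hour) points
    (a hypothetical layout, not the paper's data format).
    """
    rng = random.Random(seed)
    users = [u for u, pts in traces.items() if len(pts) >= p]
    if not users:
        return 0.0
    unique = 0
    for _ in range(trials):
        user = rng.choice(users)
        # Draw p points of this user's trace as the "outside information".
        sample = set(rng.sample(sorted(traces[user]), p))
        # Count every trace (the user's own included) containing all p points.
        matches = sum(1 for pts in traces.values() if sample <= pts)
        unique += (matches == 1)
    return unique / trials

toy = {
    "a": {(1, 8), (2, 9), (3, 18)},
    "b": {(1, 8), (2, 9), (4, 18)},
    "c": {(1, 8), (5, 12), (6, 20)},
}
print(uniqueness(toy, 1), uniqueness(toy, 3))
```

In this toy set every trace is unique once all three of its points are known, while a single point often matches several users.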
Worse, coarsening the trace data by grouping trace points within adjacent antenna regions or consecutive hours is a surprisingly ineffective safeguard. Such coarsening is easily counteracted by examining a few additional trace points.
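A small sketch may help show why coarsening buys so little. The `coarsen` helper, the `region_of` grouping of antennas, and the two traces below are hypothetical, chosen so that three coarse points coincide while a fourth still separates the users.

```python
def coarsen(trace, region_of, hours_per_bin):
    """Map each (antenna, hour) point to a (region, time bin) point."""
    return {(region_of[antenna], hour // hours_per_bin) for antenna, hour in trace}

# Hypothetical grouping of adjacent antennas into larger regions.
region_of = {1: "north", 2: "north", 3: "south", 4: "south"}

alice = {(1, 8), (3, 13), (4, 20), (1, 23)}
bob = {(2, 9), (3, 12), (3, 21), (4, 23)}

coarse_alice = coarsen(alice, region_of, hours_per_bin=2)
coarse_bob = coarsen(bob, region_of, hours_per_bin=2)

# The fine-grained traces share no point; the coarse ones share three.
print(len(alice & bob), len(coarse_alice & coarse_bob))
# One remaining point still tells the two coarse traces apart.
print(coarse_alice - coarse_bob)
```

An observer who knows only the three shared coarse points cannot separate the two users, but one more point does it.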
The power-law dependency of ε means that, on average, each time the spatial or temporal resolution of the traces is divided by two, their uniqueness decreases by a constant factor ~ (2)^(−β). This implies that privacy is increasingly hard to gain by lowering the resolution of a dataset.
The dependence of β on p [the number of trace points] implies that a few additional points might be all that is needed to identify an individual in a dataset with a lower resolution. In fact, given four points, a two-fold decrease in spatial or temporal resolution makes it 9.3% less likely to identify an individual, while given ten points, the same two-fold decrease results in a reduction of only 6.2%.
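As a rough back-of-the-envelope check, one can read the quoted percentages as 1 − 2^(−β) and solve for β. That reading is an assumption, and so is extrapolating the power law across many halvings. The function names are made up for this sketch.

```python
import math

def beta_from_reduction(reduction):
    """Solve 1 - reduction = 2 ** (-beta) for beta."""
    return -math.log2(1.0 - reduction)

def halvings_to_halve_uniqueness(beta):
    """Number n of two-fold coarsenings with 2 ** (-beta * n) == 0.5."""
    return 1.0 / beta

# (number of known points, quoted reduction per two-fold coarsening)
for points, reduction in [(4, 0.093), (10, 0.062)]:
    beta = beta_from_reduction(reduction)
    print(points, round(beta, 3), round(halvings_to_halve_uniqueness(beta), 1))
```

Under these assumptions β comes out at roughly 0.14 for four points and 0.09 for ten. Cutting uniqueness in half would then take about seven successive two-fold coarsenings with four known points, and closer to eleven with ten.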
The association of several times and locations in a supposedly anonymous data set constitutes a unique “fingerprint” that can be mapped back to the original user with little outside data. Consider smartphone users whose public tweets are tagged with their physical locations. Once a sequence of times and locations for the same device is available, a search through Twitter timelines will require only a few matches to uniquely identify its owner. That identification can now be used to establish the owner’s physical location at any other time in the same trace, or in any future trace for the same device – even when the owner does not broadcast his location at those times.
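The matching step could look something like the following sketch. It assumes the anonymous trace and the public posts have already been reduced to the same (location, hour) grid. The name `candidate_owners`, the handles, and the data are invented for illustration, and no real Twitter API is involved.

```python
def candidate_owners(trace, timelines, min_matches=4):
    """Return handles whose public posts overlap the trace in enough points.

    trace: set of (location, hour) points from the anonymous data set.
    timelines: dict mapping a handle to a set of (location, hour) points
    taken from geotagged public posts (hypothetical layout).
    """
    hits = {}
    for handle, posts in timelines.items():
        overlap = len(trace & posts)
        if overlap >= min_matches:
            hits[handle] = overlap
    return hits

anonymous_trace = {("cafe", 8), ("office", 9), ("gym", 18), ("home", 22), ("office", 33)}
timelines = {
    "@alice": {("cafe", 8), ("office", 9), ("gym", 18), ("home", 22)},
    "@bob": {("cafe", 8), ("park", 12)},
}
print(candidate_owners(anonymous_trace, timelines))
```

If exactly one handle comes back, every other point in the trace, such as the untweeted `("office", 33)`, is now attributable to that person.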
Do these findings have any concrete privacy implications? That’s actually hard to say. Phone companies, of course, already know their customers’ names and addresses. Hackers would probably target those systems directly, and state authorities can simply order the release of any data they want. The more important lesson is the surprising uniqueness of behavioral data in general: almost any activity that is recorded can later be used to identify the person behind it.
As an aside on academic publishing, the paper was submitted last October, spent a full four months in review, and after acceptance took almost another two months to appear. That is a half-year delay – at an online venue boasting “rapid review and publication!”