There are very few areas that highlight the fundamental
conflict between security and usability better than data anonymisation. It’s a conflict that’s only going to get
worse as more and more data is collected about us all as individuals. A recent blog post by Bruce Schneier expresses his concerns over the Internet of Things and the vast increase in
personal data available for collection once our cars, fridges, televisions,
smart meters et al are all on the Internet and ‘measuring’ our usage for
reporting back to their vendors or service providers. Unfortunately, it’s the onward step from there – where the vendors and service providers sell that data on to ‘selected partners’ – that the real loss of privacy, and gain of unsolicited opportunities to partake in exclusive new offers, starts to kick in. It wouldn’t be so bad if it were only the
corporate types looking to pimp details of our personal activities and opinions
to all and sundry; unfortunately we also have the UK government looking to
‘maximise the value’ of the data that it holds about us. For example, by releasing individual records of pupil attainment to enable industry to unleash its full vigour and expertise and create lots of wonderful new tools and products to improve
education in the UK. Just don’t ask to see the actual business
case – it’s so obvious that this is the way it would work that it’s simply not
worth asking the question… But don’t
worry, all such data, be it sourced from the Internet of Things or HMG
databases that we have no choice but to populate, will be anonymised before
being released. And that’s when the
problem hits…
First off, let me start by explaining what I mean by
anonymisation, and its counter-process, de-anonymisation. The process of anonymisation is supposed to
render it infeasible for an attacker to identify a real-world identity from a
dataset. Conversely, de-anonymisation is
the process of identifying a real-world identity from anonymised data. For the purposes of this blog I’m not going
to explore the concept of pseudonymisation as it’s not directly relevant – but
feel free to Google the term (irony).
So, how can you safely anonymise data? There are a number of techniques available,
from the highly noddy (removal of obviously personally identifying information
such as names and addresses) through to the more mathematically valid (but
still imperfect) approaches such as k-anonymisation, t-closeness and
l-diversity. Most organisations seem to
fall somewhere in the middle and use techniques such as data substitution, data
perturbation, data aggregation and data suppression. Now, the best way to safely anonymise data
(imho) is to aggregate data and suppress small numbers. This means that you don’t actually release
data on individuals but rather release data on groups and suppress (i.e. strip
out) data that would identify small groups of people (e.g. 5 or less). Good examples of aggregated data include the
school performance tables that provide details of exam pass-rates. However, data aggregates are not good for
analysis of individuals – and that makes them of little use to organisations looking to pitch to the customers with most interest in their products or services. The naïve will
think that simply stripping out name and address and other obviously
identifying information will make their data safe for re-use. This is demonstrably wrong as illustrated by
the cases of AOL, Netflix
and the more recent research on the uniqueness of mobile telephony data. Let’s take this last case as an example of
the issue. The study in question showed
that it can take as few as 4 geographical data points within the mobile telephony dataset to identify an individual.
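To make the aggregate-and-suppress approach concrete, here is a minimal sketch in Python. It assumes a toy table of pupil results; the school name, grades and the threshold of 5 are purely illustrative, not drawn from any real release.

```python
# A minimal sketch of aggregation with small-number suppression.
# The records, column meanings and threshold are invented for illustration.
from collections import Counter

SUPPRESSION_THRESHOLD = 5  # suppress any cell describing 5 or fewer people

def aggregate_and_suppress(records):
    """Aggregate individual (school, grade) records into counts and
    replace small counts with a suppression marker."""
    counts = Counter(records)
    released = {}
    for cell, count in counts.items():
        # Cells covering small groups are stripped out ("suppressed") rather
        # than released, as they risk identifying individuals.
        released[cell] = count if count > SUPPRESSION_THRESHOLD else "suppressed"
    return released

if __name__ == "__main__":
    sample = [("Anytown High", "A")] * 12 + [("Anytown High", "E")] * 2
    print(aggregate_and_suppress(sample))
    # {('Anytown High', 'A'): 12, ('Anytown High', 'E'): 'suppressed'}
```

Note that nothing about an individual pupil is released at all – only group counts survive, and even those vanish when the group is small.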
There are no names in the data, no other obvious means of
identification, just location data. I’ll
give you an example of how that’s a privacy risk. I live near one major city but my current
client is based in another, some 2.5 hours away by train. This means that every day there will be at
least three data points regarding my mobile phone usage – my home address, my
local train station and my client office.
Tie this together with my regular visits to my Taekwondo classes and
that’s pretty much me identified – I’m the only one visiting the location of my
Taekwondo class who also regularly visits my client offices! Of course, in order to resolve my identity
in this way, you need to know my work patterns and that I do Taekwondo –
nothing that a nosy neighbour or friend on Facebook would find difficult to
discover. The privacy risk at this
point is that, once they have identified my unique identifier (either a genuine
unique identifier or simply a unique recurring relationship of elements within
the data) within the dataset, they can then start to expand what they know about me – for example, they may now spot the football ground that I attend or, more worryingly, should I be ill and start visiting the hospital on a regular basis, that would also become apparent to those with whom I have no wish to share that information…
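To illustrate how little the attacker needs, here is a hypothetical sketch of that matching step: an ‘anonymised’ location dataset keyed only by opaque IDs, and the couple of places a nosy neighbour might already know I visit. All the IDs and place names are invented for the example.

```python
# Hypothetical re-identification by matching known locations against an
# "anonymised" trace dataset. IDs and place names are invented.
anonymised_traces = {
    "user_0417": {"home_street", "city_a_station", "client_office",
                  "taekwondo_club", "hospital"},
    "user_0841": {"home_street", "city_a_station", "football_ground"},
    "user_1290": {"taekwondo_club", "supermarket", "city_b_station"},
}

# What a nosy neighbour or Facebook friend might plausibly know about me.
known_locations = {"client_office", "taekwondo_club"}

# Find every pseudonymous ID whose trace contains all of the known locations.
candidates = [uid for uid, places in anonymised_traces.items()
              if known_locations <= places]

if len(candidates) == 1:
    target = candidates[0]
    # Once the match is unique, everything else in the trace is exposed,
    # including places I never intended to share (e.g. the hospital).
    print("Re-identified as", target, "->",
          anonymised_traces[target] - known_locations)
```

The point is that no name or ‘identifier’ is needed: a unique combination of otherwise mundane facts does the job just as well.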
This takes me back to the fundamental premise of this blog –
the conflict between security and usability.
The only way to truly anonymise data is to destroy the relationships
within the data that enable a knowledgeable attacker to identify the individual. There are two problems with this:
i) you will rarely know the exact set of data, and therefore relationships, known to an attacker.
Consider the ‘nosy neighbour’ threat actor; my own neighbours know our
names, address, birthdays, the cars that we drive, the names of our kids and
the school(s) that they go to.
Furthermore, they know our hobbies and the hobbies of our kids, they
know where we grew up and who we work for…
That’s an awful lot of useful information to discriminate between
individuals within a dataset [by the way, we have lovely neighbours and so,
personally, I’m not too worried.]
ii) it’s the relationships that give the data value! If you perturb, substitute or otherwise
mangle the data then you risk losing the relationships in which you are most
interested. For example, if the data
collector is a retail outlet and they mangle their dataset by including random
purchases from other shoppers within my loyalty card history (so as to make it
less obviously me) then they risk starting to target me with offers that I have
absolutely no interest in and driving me towards alternative retailers.
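A rough sketch of that trade-off is below; the product categories and the mixing rate are invented purely for illustration.

```python
# A rough sketch of problem (ii): perturbing a loyalty-card history by mixing
# in random purchases from other shoppers hides the individual, but it also
# dilutes the very relationships the retailer wanted to exploit.
import random
from collections import Counter

random.seed(0)

my_history = ["protein_bars"] * 8 + ["sports_drink"] * 6   # my genuine interests
other_shoppers = ["cat_food", "nappies", "cigarettes", "lager"]

def perturb(history, noise_fraction=0.5):
    """Replace a fraction of the purchases with random purchases drawn from
    other shoppers, so the record is 'less obviously me'."""
    noisy = list(history)
    for i in random.sample(range(len(noisy)), int(len(noisy) * noise_fraction)):
        noisy[i] = random.choice(other_shoppers)
    return noisy

print("Before:", Counter(my_history))
print("After: ", Counter(perturb(my_history)))
# The genuine signal (protein bars, sports drinks) is diluted by purchases I
# never made; offers targeted on the noisy record risk being irrelevant to me.
```

The harder you mangle the record to protect me, the less it tells you about what I actually buy – which was the whole reason for collecting it.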
Now, there is guidance available (e.g. the ICO document
entitled ‘Anonymisation: managing data protection risk code of practice’)
and there are tools in the market to help organisations to anonymise their
data; however, you really need to understand what you are doing to get the most
benefit out of such tools. More
important than the how, though, is the why.
Organisations should ensure that they have a rock solid business case
outlining genuine, evidenced, benefits to both themselves and the data subjects
before releasing anonymised datasets to the public or to their business
partners. A failure to develop such a
strong business case will leave organisations highly exposed should their
anonymised dataset not be as anonymous as they thought – the data subjects may
not be amused to find that their intimate personal details have been made
available to all simply because someone thought it may… possibly… potentially…
be useful. Remember this: once the data has gone, the data has gone; there is no way to put this particular genie back in its bottle.