Wednesday, 22 May 2013

Anonywhat?


 
There are very few areas that highlight the fundamental conflict between security and usability better than data anonymisation. It’s a conflict that’s only going to get worse as more and more data is collected about us all as individuals. A recent blog post by Bruce Schneier expresses his concerns over the Internet of Things and the vast increase in personal data available for collection once our cars, fridges, televisions, smart meters et al are all on the Internet and ‘measuring’ our usage for reporting back to their vendors or service providers. Unfortunately it’s the onward step from there, where the vendors and service providers sell that data on to ‘selected partners’, that the real loss of privacy (and gain of unsolicited opportunities to partake in exclusive new offers) starts to kick in.

It wouldn’t be so bad if it was only the corporate types looking to pimp details of our personal activities and opinions to all and sundry; unfortunately we also have the UK government looking to ‘maximise the value’ of the data that it holds about us. For example, by releasing individual records of pupil attainment to enable industry to unleash its full vigour and expertise and create lots of wonderful new tools and products to improve education in the UK. Just don’t ask to see the actual business case – it’s so obvious that this is the way it would work that it’s simply not worth asking the question…

But don’t worry, all such data, be it sourced from the Internet of Things or from HMG databases that we have no choice but to populate, will be anonymised before being released. And that’s when the problem hits…
 
First off, let me start by explaining what I mean by anonymisation, and its counter-process, de-anonymisation. The process of anonymisation is supposed to render it infeasible for an attacker to identify a real-world identity from a dataset. Conversely, de-anonymisation is the process of identifying a real-world identity from anonymised data. For the purposes of this blog I’m not going to explore the concept of pseudonymisation as it’s not directly relevant – but feel free to Google the term (irony).
 
So, how can you safely anonymise data? There are a number of techniques available, from the highly noddy (removal of obviously personally identifying information such as names and addresses) through to the more mathematically valid (but still imperfect) approaches such as k-anonymisation, l-diversity and t-closeness. Most organisations seem to fall somewhere in the middle and use techniques such as data substitution, data perturbation, data aggregation and data suppression.

Now, the best way to safely anonymise data (imho) is to aggregate the data and suppress small numbers. This means that you don’t actually release data on individuals but rather release data on groups, and suppress (i.e. strip out) data that would identify small groups of people (e.g. 5 or fewer). Good examples of aggregated data include the school performance tables that provide details of exam pass-rates. However, aggregated data is not good for analysis of individuals – and that makes it of little use to those organisations looking to pitch to the customers with most interest in their products or services.

The naïve will think that simply stripping out names, addresses and other obviously identifying information will make their data safe for re-use. This is demonstrably wrong, as illustrated by the cases of AOL, Netflix and the more recent research on the uniqueness of mobile telephony data. Let’s take this last case as an example of the issue. The study in question showed that it can take as few as 4 geographical data points within a mobile telephony dataset to identify an individual. There are no names in the data, no other obvious means of identification, just location data.

I’ll give you an example of how that’s a privacy risk. I live near one major city but my current client is based in another, some 2.5 hours away by train. This means that every day there will be at least three data points regarding my mobile phone usage – my home address, my local train station and my client’s office. Tie this together with my regular visits to my Taekwondo classes and that’s pretty much me identified – I’m the only one visiting the location of my Taekwondo class who also regularly visits my client’s offices! Of course, in order to resolve my identity in this way you need to know my work patterns and that I do Taekwondo – nothing that a nosy neighbour or friend on Facebook would find difficult to discover. The privacy risk at this point is that, once they have identified my unique identifier within the dataset (either a genuine unique identifier or simply a unique recurring relationship of elements within the data), they can then start to increase the information that they know about me – for example, they may now spot the football ground that I attend or, more worryingly, should I be ill and start visiting a hospital on a regular basis, that would also become apparent to people with whom I have no wish to share that information…
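To make both points concrete, here’s a minimal, hypothetical sketch in Python – the IDs, places and threshold are all invented for illustration. It first shows how filtering a ‘name-free’ location dataset on just a couple of known places can isolate a single record, and then shows the aggregate-and-suppress alternative, which avoids releasing individual-level rows at all.

    from collections import Counter

    # Hypothetical 'anonymised' dataset: no names, just a pseudonymous ID and the
    # locations (e.g. cell-tower areas) each ID has been seen at.
    records = {
        "user_0041": {"Home Estate", "Station A", "Client Office", "Taekwondo Club"},
        "user_0042": {"Home Estate", "Station A", "Supermarket"},
        "user_0043": {"Station A", "Client Office", "Gym"},
        "user_0044": {"Home Estate", "Taekwondo Club", "Supermarket"},
    }

    # De-anonymisation: the nosy neighbour only needs to know a couple of the
    # places I visit to single my record out of the 'anonymous' data.
    known_places = {"Client Office", "Taekwondo Club"}
    matches = [uid for uid, places in records.items() if known_places <= places]
    print(matches)  # ['user_0041'] - two data points are enough

    # Safer release: aggregate to counts per location and suppress small groups.
    SUPPRESSION_THRESHOLD = 5  # e.g. drop any group of 5 or fewer people
    location_counts = Counter(place for places in records.values() for place in places)
    released = {place: n for place, n in location_counts.items() if n > SUPPRESSION_THRESHOLD}
    print(released)  # {} - every group in this toy dataset is too small to release

The point of the second half is that nothing about any individual leaves the building; the cost is that individual-level analysis becomes impossible, which is precisely the security/usability trade-off this blog is about.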
 
This takes me back to the fundamental premise of this blog – the conflict between security and usability.  The only way to truly anonymise data is to destroy the relationships within the data that enable a knowledgeable attacker to identify the individual.  There are two problems with this:
 
i) You will rarely know the exact set of data, and therefore relationships, known to an attacker. Consider the ‘nosy neighbour’ threat actor; my own neighbours know our names, address, birthdays, the cars that we drive, the names of our kids and the school(s) that they go to. Furthermore, they know our hobbies and the hobbies of our kids, they know where we grew up and who we work for… That’s an awful lot of useful information to discriminate between individuals within a dataset [by the way, we have lovely neighbours and so, personally, I’m not too worried].
ii) It’s the relationships that give the data value! If you perturb, substitute or otherwise mangle the data then you risk losing the very relationships in which you are most interested. For example, if the data collector is a retail outlet and they mangle their dataset by including random purchases from other shoppers within my loyalty card history (so as to make it less obviously me), then they risk targeting me with offers that I have absolutely no interest in and driving me towards alternative retailers (see the sketch below this list).
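As a quick illustration of point ii), here’s a small, hypothetical sketch – the categories and figures are invented. Mixing random purchases from other shoppers into a loyalty-card history does protect the individual a little, but it also blurs the very preference the retailer wanted to act on.

    import random
    from collections import Counter

    random.seed(1)  # fixed seed so the sketch is repeatable

    # Hypothetical loyalty-card history: my genuine purchases skew heavily towards
    # one category - exactly the relationship the retailer finds valuable.
    my_purchases = ["coffee"] * 12 + ["protein bars"] * 6 + ["bin bags"] * 2
    other_shoppers = ["cat food", "nappies", "lager", "scented candles", "golf tees"]

    def perturb(history, noise_fraction):
        """Replace a fraction of the history with random purchases from other shoppers."""
        noisy = list(history)
        for i in random.sample(range(len(noisy)), int(len(noisy) * noise_fraction)):
            noisy[i] = random.choice(other_shoppers)
        return noisy

    print("original:", Counter(my_purchases).most_common(3))
    for fraction in (0.25, 0.5, 0.75):
        print(f"{int(fraction * 100)}% perturbed:", Counter(perturb(my_purchases, fraction)).most_common(3))

The more noise you add to protect me, the less the data says about me – and what the data says about me is exactly what the retailer (or their ‘selected partners’) was paying for.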
 
Now, there is guidance available (e.g. the ICO document entitled Anonymisation: managing data protection risk code of practice) and there are tools on the market to help organisations anonymise their data; however, you really need to understand what you are doing to get the most benefit out of such tools. More important than the ‘how’, though, is the ‘why’. Organisations should ensure that they have a rock-solid business case outlining genuine, evidenced benefits to both themselves and the data subjects before releasing anonymised datasets to the public or to their business partners. A failure to develop such a strong business case will leave organisations highly exposed should their anonymised dataset not be as anonymous as they thought – the data subjects may not be amused to find that their intimate personal details have been made available to all simply because someone thought it may… possibly… potentially… be useful. Remember this: once the data has gone, the data has gone; there is no way to put this particular genie back in its bottle.