Let's consider the case where we generate a number of reports and order them according to some metric of their information content - specifically, how easy or possible it is to re-identify the original sources.
Consider the system below: we collect from a user their user ID, device ID and location. This is some kind of tracking application - or, for that matter, any kind of application we typically have on our mobile devices, e.g. something for social media, photo sharing etc.
We've taken the necessary precautions for privacy - we'll assume notice and consent are given - in that the user's data is passed into our system over a secure channel. Processing of this data takes place and we generate two reports:
- The first containing specific data about the user
- The second using some anonymous ID associated with certain event data for logging purposes only. This report is very obviously anonymous!
For additional security we'll even restrict access to the former because it contains PII - while the second, being anonymous, doesn't need such protection.
In many cases this is considered sufficient - we have notice and consent, all the necessary access controls and channel security. Protecting the report or file with the sensitive data in it is a given. But the less sensitive data is often forgotten in all of this:
- How is the identifier generated?
- How granular is the time stamp?
- What does the "event" actually contain?
- Who has access?
- How is this all secured?
Is the identifier some compound of data, hashed and salted? For example:

```python
import hashlib

salt = "thesystem"
id = hashlib.sha256((device_id + user_id + salt).encode()).hexdigest()
```
This would at least allow analysis over unique user+device combinations, and the salt, if specific to this log file or system, restricts matching to this log file only - assuming, of course, the salt isn't known outside of it.
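One way to keep identifiers truly specific to a single log file is to derive them with a secret per-file key rather than a fixed, guessable salt - a keyed hash (HMAC) is the standard tool. A minimal sketch, assuming hypothetical key names and a derivation scheme of our own choosing rather than anything in the system above:

```python
import hashlib
import hmac

def pseudonym(device_id: str, user_id: str, logfile_key: bytes) -> str:
    """Derive a pseudonymous ID keyed to one log file.

    Using HMAC with a secret per-logfile key (instead of a shared,
    potentially known salt) means IDs from different log files
    cannot be matched against each other, and an attacker without
    the key cannot brute-force user+device combinations.
    """
    msg = (device_id + "|" + user_id).encode()
    return hmac.new(logfile_key, msg, hashlib.sha256).hexdigest()

key_a = b"key-for-logfile-A"  # hypothetical per-file secrets
key_b = b"key-for-logfile-B"

id_in_a = pseudonym("dev123", "alice", key_a)
id_in_b = pseudonym("dev123", "alice", key_b)
# Same user+device combination, yet the two pseudonyms differ,
# so the log files cannot be joined on this identifier.
```

The same user+device pair still yields a stable ID *within* one log file, so per-file analysis remains possible.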
The timestamp is of less importance, but granularity matters: a very precise timestamp allows events to be sequenced and correlated, while a coarse one prevents this.
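Coarsening a timestamp is straightforward: round it down to a fixed bucket so that events within the same bucket can no longer be ordered relative to each other. A small sketch (the one-hour bucket is an arbitrary choice for illustration):

```python
from datetime import datetime, timezone

def coarsen(ts: datetime, bucket_seconds: int = 3600) -> datetime:
    """Round a timestamp down to a fixed bucket (here: one hour).

    Events falling into the same bucket become indistinguishable
    in time, which limits sequencing and correlation attacks.
    """
    epoch = ts.timestamp()
    floored = epoch - (epoch % bucket_seconds)
    return datetime.fromtimestamp(floored, tz=timezone.utc)

t1 = datetime(2024, 5, 1, 14, 7, 33, tzinfo=timezone.utc)
t2 = datetime(2024, 5, 1, 14, 52, 9, tzinfo=timezone.utc)
# Both coarsen to 14:00:00 - their relative order is lost.
```

The bucket size is a trade-off: large enough to break sequencing, small enough that the log remains useful for its purpose.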
The contents of the event are always interesting - what data is stored there? What needs to be, and how? If this is some debug log then there's probably just as much here as in the report containing the PII. Often it might just be stack traces (with or without parameters) or memory dumps - both of which contain interesting data, even if only a pointer to where a weakness in the system might exist.
Now come the questions of who has access and how this is secured. Given that such a report has interesting content, shouldn't it be as secure as the report containing specific and identifiable user data? If there's some shared common knowledge, could rainbow tables of hashes, etc., be constructed?
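If the salt becomes shared knowledge, re-identification reduces to a dictionary attack: hash every plausible user+device combination and match against the "anonymous" log. A toy sketch, with invented IDs and the salt from the earlier example assumed to have leaked:

```python
import hashlib
import itertools

salt = "thesystem"  # assumed to be shared/leaked knowledge

def log_id(device_id: str, user_id: str) -> str:
    """The identifier scheme used in the log, as an attacker knows it."""
    return hashlib.sha256((device_id + user_id + salt).encode()).hexdigest()

# An "anonymous" identifier observed in the log:
observed = log_id("dev42", "carol")

# The attacker enumerates known users and devices - inside a
# company these are usually small, well-known sets:
users = ["alice", "bob", "carol"]
devices = ["dev41", "dev42", "dev43"]

table = {log_id(d, u): (d, u)
         for d, u in itertools.product(devices, users)}

print(table[observed])  # → ('dev42', 'carol')
```

Nine hashes were enough here; even millions of user+device combinations are trivial to enumerate, which is why the secrecy of the salt carries all the weight.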
Consider this situation:
Two separate systems exist, but a common path between them can be exploited, because access control wasn't considered necessary for such "low grade", non-personal data.
Any common path is the precursor to de-anonymisation of data.
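The exploit itself needs nothing more than a join on the shared identifier. A toy illustration with invented records - two datasets that each look harmless on their own:

```python
# "Anonymous" event log from system A: pseudonym -> event.
events = {
    "ab12": "viewed /admin",
    "cd34": "login failed x5",
}

# System B's report, keyed by the same pseudonym because both
# systems derived it with the same salt - the common path:
profiles = {
    "ab12": {"user": "alice", "location": "Berlin"},
    "cd34": {"user": "bob", "location": "Oslo"},
}

# De-anonymisation is then just a dictionary join:
linked = {pid: (profiles[pid], events[pid])
          for pid in events if pid in profiles}
# linked["ab12"] now ties alice in Berlin to "viewed /admin".
```

Neither dataset needed to contain PII *and* events together; the shared identifier alone re-links them.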
This might seem a rather trivial situation, except that such shared access and common knowledge of things such as salts, keys etc. exist in most companies, large and small. In the latter it is often hard to avoid. Mechanisms such as employee contracts and awareness training do very little to solve this problem, because they aren't designed to address - or even recognise - it.
And here lies the paradox of access control: while we guard reports, files and datasets containing PII, we fail to do the same when working with anonymous data - whatever "anonymous" means.