I took part as one of the speakers in a presentation about analytics today; explaining how data is collected through instrumentation of applications, web pages etc, to an audience who are not familiar with the intricacies of data collection and analytics.
We had a brief discussion about identifiers and what identifiers actually are which was enlightening and hopefully will have prevented a few errors later on. This bears explaining briefly: an identifier is rarely a single field, but should be considered any one of the subsets of the whole record. There are caveats there of course, some fields can't be used as part of some compound identifier, but the point here was to emphasis that you need to examine the whole record not just individual fields in isolation.
The bulk of the talk however introduced from where data comes from. For example if we instrument an application such that a particular action is collected, then we're not just collecting an instance of that action but also whatever contextual data provided by the instrumentation and the data from the traffic or transport layer. This came as a surprise that there is so much information available via the transport/traffic layers:
Furthermore data can be cross-referenced with other data after collection. A canonical example is geolocation over IP addresses to provide information about location. Consider the case where a user switches off the location services on his or her mobile device; location can still be inferred later in the analytics process to a surprisingly high-level of accuracy.
If data is collected over time, then even though we are not collecting specific latitude-longitude coordinates we are collecting data about movements of a single, unique human being; even though no `explicit' location collection seems to be being made. If you find that somewhat disturbing, consider what happens every time you pay with a credit card or use a store card.
Then of course there's the whole anonymisation process where once again we have to take into consideration not just what an identifier is, but the semantics of the data, the granularity etc. Only then can we obtain an anonymous data set. Such a data set can be shared publicly...or maybe not as we saw in a previous posting.
Even when one starts tokenising and suppressing fields, the k-anonymity remains remarkably low, typically with more than 70% of the records remaining unique within that dataset. Arguments about the usefulness of k-anonymity notwithstanding - on the other hand it is one of the few privacy metrics we have,
So, the lesson here is rather simple, you're collected a massive amount more than you really think.