Tuesday, 4 June 2013

Data Collection

Anyone who cares about privacy will tell you that data collection is bad...REALLY BAD, yet without data collection none of the services we use and need would work (there's an economic argument there). Indeed, most of the issues around data collection seem to be very emotive in nature and usually end up in the "data collection is bad/must be minimised" panic.

To understand data collection we must first understand where data is collected from, and that means understanding the relationship between a client and a server via some infrastructure. This is described in an earlier article which talks about primary and secondary data.

Generally the case we mainly worry about in privacy is where the client is a human interacting with an app or browser via some device, eg: a mobile phone.

The end-user (customer or human) will provide some information required to fulfill the functions of the service required, eg: posting a photograph to Flickr with some description and their login details.

So far this seems reasonable - after all, a user is posting a photograph to their account with a description. The service might then perform a number of other tasks:
  • Set the current time and date of upload
  • Extract EXIF information from the photograph, including (but not limited to):
    • date and time of the shot
    • location information
    • camera information
    • free text strings, eg: copyright/ownership information
All of the above comes under the notion of data collection, and all of the above is information provided by the user of the service. One can argue that the user didn't know about the EXIF contents of the photograph, and surprisingly few people actually realise how much information is embedded in a picture.
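As a rough illustration of the upload case above, here is a minimal sketch of the record a service might end up storing. The names UploadRecord and collect_upload are hypothetical, the EXIF data is assumed to have been parsed into a dictionary already, and the tag names are only examples (real names vary by library).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class UploadRecord:
    # Explicitly provided by the user.
    login: str
    description: str
    # Set by the service at upload time.
    uploaded_at: datetime
    # Taken from the photograph's EXIF block; the user may not know it is there.
    exif_fields: dict = field(default_factory=dict)


# Illustrative tag names only (assumption): shot time, location, camera, free text.
EXIF_TAGS_OF_INTEREST = ("DateTimeOriginal", "GPSInfo", "Model", "Copyright")


def collect_upload(login, description, exif, now=None):
    """Build the record a service might store for a single upload."""
    now = now or datetime.now(timezone.utc)
    kept = {tag: exif[tag] for tag in EXIF_TAGS_OF_INTEREST if tag in exif}
    return UploadRecord(login, description, now, kept)


record = collect_upload(
    "alice",
    "Sunset",
    {"Model": "X100", "GPSInfo": (60.17, 24.94), "Orientation": 1},
)
```

The user typed two fields, but the stored record also carries an upload time, a camera model and a location.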

Further operations can be performed such as facial recognition and linking this to persons who have been tagged in other photographs etc. For example, this is a service provided by Facebook.

Additionally, the service may be collecting secondary information about the usage of the service, for example, how much time was spent using a particular page, what the UI/UX flow was and so on.

Then there's the information collected from the infrastructure. This includes information provided by the browser or app, typically in the form of browser identification strings, application identifiers, device identifiers, and from the service infrastructure including such information as source IP address, contents of the API call, time and date of interaction, error codes etc.

When we talk about privacy we typically end up focussing on the primary and secondary information sets. For the former we decide whether that information is necessary for the service to function and provide the facilities the user requires, and for the latter we decide whether this information needs to be explicitly collected or not. Rarely do we even investigate the information collected by the infrastructure, which can be used to recreate both the primary and secondary data sets. Indeed, one of the major targets for any hacker is the infrastructure logs themselves.
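To make the point about infrastructure logs concrete, here is a small sketch that recovers user-level data from a single log line. The log format, the URL and the parameter names are all invented for illustration; real infrastructure logs will look different.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical log line: source IP, timestamp, quoted request, status code.
LOG_LINE = (
    '203.0.113.7 2013-06-04T10:15:00Z '
    '"POST /api/upload?user=alice&description=Sunset+at+the+beach" 200'
)


def recover_from_log(line):
    """Recover primary and infrastructure data from one log line."""
    before, request, after = line.split('"')
    source_ip, timestamp = before.split()
    method, target = request.split()
    query = parse_qs(urlsplit(target).query)
    return {
        "source_ip": source_ip,                   # infrastructure
        "timestamp": timestamp,                   # infrastructure
        "user": query["user"][0],                 # primary, recreated from the log
        "description": query["description"][0],   # primary, recreated from the log
        "status": int(after),
    }


recovered = recover_from_log(LOG_LINE)
```

Nothing here touches the service's own database, yet the user's identity and description come straight back out of the log.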

When we talk about data collection we must consider the following data sets:

Primary, secondary and infrastructure are as described above, with the addition that we split primary into explicit and implicit: explicit denotes the information which the end-user knowingly provides, and implicit denotes the information that might be hidden from them, eg: EXIF. Combined, these produce the Total Information Set, which when linked together by simple association becomes the Total Deduced Information Set.
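One way to picture these sets is as plain Python sets of field names, with the Total Information Set as their union. The field names are illustrative assumptions, and link_by_association is a hypothetical helper showing just one simple reading of "linked by simple association": merging records that share an identifier.

```python
# Illustrative field names (assumptions, not taken from any real service).
explicit_primary = {"login", "description", "photo"}
implicit_primary = {"exif_location", "exif_camera", "exif_shot_time"}
secondary = {"time_on_page", "ui_flow"}
infrastructure = {"source_ip", "user_agent", "device_id", "request_time"}

# The Total Information Set is everything collected, from whatever source.
total_information_set = (
    explicit_primary | implicit_primary | secondary | infrastructure
)


def link_by_association(records, key):
    """Merge records that share the same value for `key`."""
    linked = {}
    for record in records:
        linked.setdefault(record[key], {}).update(record)
    return linked


records = [
    {"login": "alice", "description": "Sunset"},      # explicit primary
    {"login": "alice", "source_ip": "203.0.113.7"},   # infrastructure
]
linked = link_by_association(records, "login")
```

After linking, a single entry for "alice" holds both what she typed and where she connected from, which is the kind of deduction the separate sets do not show on their own.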

Another way of looking at this is in a more Venn diagrammatic manner:

Where the labels are abbreviations of the corresponding labels in the first diagram and colours for emphasis.

It must be noted that some intersections might not properly exist, for example the intersection of EP and Sec, and that other relationships hold only in certain contexts, for example, whether IP is a proper superset of EP or not.
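These relationships can be checked mechanically once the sets are written down. The sketch below uses Python's set operators; the contents of EP, IP and Sec are made-up examples, and describe is a hypothetical helper.

```python
EP = {"login", "description"}        # explicit primary (example contents)
Sec = {"time_on_page", "ui_flow"}    # secondary (example contents)

# One possible context: the implicit data happens to include everything
# the user typed, plus hidden EXIF fields.
IP = {"login", "description", "exif_location", "exif_camera"}


def describe(a, b):
    """Report how set `a` relates to set `b`."""
    return {
        "overlap": a & b,            # the intersection, possibly empty
        "disjoint": a.isdisjoint(b),
        "proper_superset": a > b,    # a contains all of b and more
    }


ep_vs_sec = describe(EP, Sec)
ip_vs_ep = describe(IP, EP)
```

With these example contents EP and Sec do not intersect, while IP is a proper superset of EP; change the contents and the answers change, which is the context dependence noted above.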

Now that we have explicitly modelled data collection, we can start tackling questions such as what minimisation of data collection actually means, and then start looking at what the processing and extraction functions over that data might actually look like.
