To understand data collection, we must first understand where data is collected from. That starts with the relationship between a client and a server, mediated by some infrastructure. This is described in an earlier article which talks about primary and secondary data.
The end user (a customer or other human) provides the information required to fulfill the functions of the service, e.g. posting a photograph to Flickr with a description and their login details.
So far this seems reasonable: after all, a user is posting a photograph to their account with a description. The service might then perform a number of other tasks:
- Set the current time and date of upload
- Extract EXIF information from the photograph, including (but not limited to):
  - date and time of the shot
  - location information
  - camera information
  - free text strings, e.g. copyright/ownership information
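To make the split between what the user supplies and what the service derives more concrete, here is a minimal sketch. It assumes the EXIF data has already been parsed into a dictionary; the `UploadRecord` class, the `build_upload_record` function and the simplified `GPS` key are hypothetical names chosen for illustration, not part of any real service.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass
class UploadRecord:
    # Supplied explicitly by the user.
    description: str
    filename: str
    # Derived by the service without any further user input.
    uploaded_at: datetime
    shot_at: Optional[str] = None
    location: Optional[Tuple[float, float]] = None
    camera: Optional[str] = None
    copyright: Optional[str] = None


def build_upload_record(description, filename, exif, now=None):
    """Combine user-supplied fields with fields the service derives itself.

    `exif` is assumed to be an already-parsed dictionary; "GPS" is a
    simplified, hypothetical key standing in for the real GPS tags.
    """
    return UploadRecord(
        description=description,
        filename=filename,
        uploaded_at=now or datetime.now(timezone.utc),
        shot_at=exif.get("DateTimeOriginal"),
        location=exif.get("GPS"),
        camera=exif.get("Model"),
        copyright=exif.get("Copyright"),
    )


record = build_upload_record(
    "Sunset over the harbour",
    "IMG_0042.jpg",
    {
        "DateTimeOriginal": "2015:06:01 20:14:03",
        "Model": "ExampleCam X1",
        "GPS": (60.17, 24.94),
    },
)

# Fields the user never typed in, but which the service now holds.
derived = [
    name
    for name, value in asdict(record).items()
    if name not in ("description", "filename") and value is not None
]
print(derived)
```

The user entered two fields; the record ends up holding several more that they may not realise were in the photograph at all.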
Further operations can be performed, such as facial recognition and linking the results to people who have been tagged in other photographs. Facebook, for example, provides such a service.
Additionally, the service may be collecting secondary information about how it is used, for example how much time was spent on a particular page, what the UI/UX flow was, and so on.
Then there's the information collected from the infrastructure. This includes information provided by the browser or app, typically browser identification strings, application identifiers and device identifiers, and information from the service infrastructure, such as source IP address, contents of the API call, time and date of interaction, error codes and so on.
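As an illustration of how much a single infrastructure record can carry, the sketch below parses one web server log line. It assumes the Apache-style "combined" log format; the log line, host names and the `parse_log_line` helper are made up for the example, and a real service may log quite different fields.

```python
import re

# Apache-style "combined" log format. Whether a given service logs in this
# format is an assumption made for the sake of the example.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the named fields of a combined-format log line, or None."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# A fabricated log line using documentation-only addresses and names.
line = (
    '203.0.113.7 - - [01/Jun/2015:20:15:10 +0000] '
    '"POST /api/photos?title=Sunset HTTP/1.1" 201 512 '
    '"https://photos.example/upload" "ExampleBrowser/1.0 (ExampleOS)"'
)

entry = parse_log_line(line)
for name, value in entry.items():
    print(f"{name}: {value}")
```

Note that the request path here contains the photo title, so part of what the user explicitly provided has also ended up in the infrastructure log.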
When we talk about privacy we typically end up focussing on the primary and secondary information sets: the former to decide whether that information is necessary for the service to function and provide the facilities the user requires, the latter to decide whether it needs to be explicitly collected at all. Rarely do we even investigate the information collected by the infrastructure, which can be used to recreate both the primary and secondary data sets. Indeed, one of the major targets for any hacker is the infrastructure logs themselves.
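To see how infrastructure logs can recreate secondary data, consider the following rough sketch. It estimates time spent on each page purely from the gaps between consecutive requests. The log entries are fabricated, the `time_on_page` function is a hypothetical helper, and treating one source IP address as one user is a simplifying assumption.

```python
from datetime import datetime

# Fabricated, already-parsed log entries: (source IP, ISO timestamp, path).
entries = [
    ("203.0.113.7", "2015-06-01T20:15:10", "/upload"),
    ("203.0.113.7", "2015-06-01T20:17:40", "/photos/42"),
    ("203.0.113.7", "2015-06-01T20:18:05", "/logout"),
]


def time_on_page(entries):
    """Estimate seconds spent on each page from gaps between requests.

    Assumes one source IP corresponds to one user, which is only a rough
    approximation in practice.
    """
    by_ip = {}
    # ISO timestamps sort correctly as plain strings.
    for ip, timestamp, path in sorted(entries, key=lambda e: (e[0], e[1])):
        by_ip.setdefault(ip, []).append((datetime.fromisoformat(timestamp), path))

    result = []
    for ip, visits in by_ip.items():
        # Pair each request with the next one from the same source.
        for (start, path), (end, _next_path) in zip(visits, visits[1:]):
            result.append((ip, path, (end - start).total_seconds()))
    return result


dwell_times = time_on_page(entries)
for ip, path, seconds in dwell_times:
    print(f"{ip} spent about {seconds:.0f}s on {path}")
```

No analytics code was involved: the page flow and the dwell times fall out of logs that were collected for operational reasons.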
When we talk about data collection we must consider the following data sets:
Another way of looking at this is in a more Venn diagrammatic manner:
It must be noted that some intersections might not properly exist, for example the intersection of EP and Sec, and that some relationships hold only in certain contexts, for example whether IP is a proper superset of EP.
Now that we have explicitly modelled data collection, we can start tackling questions such as what minimisation of data collection actually means, and then look at what the processing and extraction functions over that data might look like.
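Questions like these can be checked mechanically once each data set is written down as a set of fields. The sketch below shows the kind of check involved. It does not attempt to reproduce the definitions of EP, IP and Sec; the three sets and all of their field names are hypothetical placeholders.

```python
# Hypothetical field names. Which of the article's data sets each field
# really belongs to is an assumption made for illustration only.
primary = {"photo", "description", "login"}
secondary = {"time_on_page", "ui_flow"}
infrastructure = {
    "source_ip",
    "timestamp",
    "user_agent",
    # The logged API call carries the user's own input with it.
    "photo",
    "description",
    "login",
}

# An empty intersection: the two sets share no fields.
shared = primary & secondary
print(shared)

# Proper superset: everything in `primary`, plus at least one field more.
infra_covers_primary = infrastructure > primary
print(infra_covers_primary)

# Not every relationship holds: secondary data is not in these logs as-is.
infra_covers_secondary = infrastructure >= secondary
print(infra_covers_secondary)
```

Whether the superset relation holds depends entirely on what is logged, which is why it can be true in one context and false in another.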