Tuesday, 11 June 2013

Privacy, Data Collection and Surveillance



The privacy debate about the collection of data by the NSA continues, with many asking questions about the moral and ethical issues surrounding it. The phrase "the death of privacy" abounds.
This is true, I'm afraid: we have lost our privacy, though not when the NSA started collecting data but when we started communicating using technologies that were readily and easily available - and that probably dates back to the birth of written communication.

Data collection certainly concerns me, but here I want to focus on one of the maxims of privacy: "if you don't use it, don't collect it" and the fact that privacy is much more about the usage of data than its collection (viz. the above maxim).

One can argue that merely by using Google, Facebook and all the rest of the social media services one has already lost one's privacy, but interaction with these services is voluntary - no-one forced you to post those party pictures to the entire world and its dog (complete with EXIF and location information).

We admittedly do have a problem with other more hidden aspects of data collection and processing, for example with infrastructure and derived data.

In the above respects we have not lost privacy but moved the bounds of what we personally and socially call privacy – obviously people are placing emphasis not on the moral and ethical issues but on the economic benefit of using such data-consuming services. In writing this blog I am losing my privacy, but with the economic gain of brand building and knowledge sharing.

Using this data, consumers and users can be profiled and classified, typically for the serving of the perfect advertisement. However, this is not unlike what an "old style shopkeeper" did through personally knowing his customers. The major difference is that today this is done automatically and impersonally by computer. We have lost the link with that corner shopkeeper who knew us and our families personally. Ever tried contacting the customer service departments of practically any company these days?

This also touches on the point that users start or have started to feel that they are not in control of their data.

Most advertising and profiling companies use classification structures that are fairly coarse grained, then further refine those with additional [coarse] grained data such as location and social network. For the most part this is nothing more than could be understood by reflecting on one's own life, place of abode and neighbourhood; it is just reasserting what is already derivable from a person’s postcode.

Much of the data collected by the NSA in the current revelations is somewhat innocuous; primarily this seems to be just telephone record metadata of the kind you see on an itemized bill. But such innocuous data can easily be cross-referenced and fingerprinted.
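
To make the cross-referencing point concrete, here is a minimal sketch. Everything in it is invented for illustration: the phone numbers, the two record sets and the helper names `contact_fingerprint` and `cross_reference`. It simply treats the set of numbers a subscriber calls as a fingerprint and looks for the same fingerprint in a second, supposedly anonymous, data set.

```python
from collections import defaultdict

# Hypothetical itemized-bill style records: (caller, callee, seconds).
# All numbers are invented for illustration.
BILLING_RECORDS = [
    ("555-0100", "555-0111", 120),
    ("555-0100", "555-0122", 45),
    ("555-0100", "555-0133", 300),
    ("555-0200", "555-0111", 60),
    ("555-0200", "555-0144", 30),
]

# A second, separately collected data set in which the caller is
# only known by a pseudonym (again, invented).
PSEUDONYMOUS_RECORDS = [
    ("user-A", "555-0133", 10),
    ("user-A", "555-0111", 15),
    ("user-A", "555-0122", 20),
    ("user-B", "555-0144", 5),
]


def contact_fingerprint(records):
    """Map each caller to the frozenset of numbers they contacted."""
    contacts = defaultdict(set)
    for caller, callee, _seconds in records:
        contacts[caller].add(callee)
    return {caller: frozenset(callees) for caller, callees in contacts.items()}


def cross_reference(known, pseudonymous):
    """Link pseudonyms to known callers whose contact sets match exactly."""
    by_fingerprint = {fingerprint: caller for caller, fingerprint in known.items()}
    return {
        pseudonym: by_fingerprint[fingerprint]
        for pseudonym, fingerprint in pseudonymous.items()
        if fingerprint in by_fingerprint
    }


matches = cross_reference(
    contact_fingerprint(BILLING_RECORDS),
    contact_fingerprint(PSEUDONYMOUS_RECORDS),
)
print(matches)
```

Exact set matching is the crudest possible approach; it is used here only because it shows how little is needed before "anonymous" metadata stops being anonymous.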

The trouble here is that government authorities can have a more insidious effect upon a person's life than a supermarket or credit card provider can. There are indeed safeguards and protections through the rule of law - though as we have seen these can be constructed so that under some circumstances the law allows whatever is necessary to get the job done.

Before we dismiss the above, however, consider two points:

  1. automatic guilt, or, guilty until proven innocent
  2. scope creep

The first derives from the fact that all your actions may be used against you in the future. If you think you have nothing to hide, then consider all the crimes you committed today. Did you drive over the speed limit or run a red light? Have you ever stolen anything?

The second derives from the first: once this information has been collected, it can be used for purposes well beyond its original intent. Worse are the twin possibilities of false positives and false negatives. Consider councils in the UK using CCTV cameras, originally intended to catch terrorists and prevent crime in general, to catch dog owners not cleaning up after their dogs.

From the above the moral and ethical arguments are easily fashioned; the economic arguments are much more difficult and vary depending upon the context and our view of what society should be:

  • Are an individual's freedom, privacy and liberty greater than society's?
  • Is mass surveillance better than letting one "terrorist" commit an act of atrocity?

These questions, however, go right to the heart of the definitions of freedom, liberty, privacy, security, society and our own control over our own data. I don't think any of us even remotely comprehend the repercussions and difficulties of even trying to address, let alone answer, such questions.

But until we start having this debate in an impartial, focused and formal manner with the terms and definitions clearly stated, judging and/or condemning any form of data collection and any form of processing and usage of data is not going to be possible in any meaningful, lasting manner.

In another way we're back to a question posted by a group of mathematicians regarding the esoteric nature of things as we move away from the fundamental building blocks and lose sight of what those building blocks [of society and humanity] actually mean.

Whether the collection of data by the NSA and everyone else is right or wrong I can't answer, but the debate about what privacy actually is, and about our relationship personally and as a society with the concepts of privacy, security and trust, is going to be extremely interesting, with wide repercussions.

Monday, 10 June 2013

Privacy, Continuity and Performance...

Of the "big four" non-functional aspects of a system (security, privacy, continuity and performance), privacy and security are typically viewed together; there is no doubt about their relationship.

The relationship between privacy (more generally information management) and continuity and performance is much more subtle. Certain decisions in these areas have an effect upon the information management aspects. Consider a piece of middleware that, for both continuity and performance reasons:
  1. batches incoming data (for later processing or sending to some other system)
  2. caches authentication data (for "fast(er)" login)
Both require data to be held for a certain period of time and then removed. There are also implications for how that data is stored, in terms of whether it is secured through some means (eg: encrypted file system, database, fields...), and for the internal processing and communication mechanisms.
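
As a minimal sketch of what "held for a certain period and then removed" might look like, the two classes below model the cache and the batch. The names `ExpiringCache` and `Batcher`, the time-to-live parameter and the injectable clock are all assumptions made for illustration; securing the stored data (encryption and so on) is deliberately left out.

```python
import time


class ExpiringCache:
    """Cache authentication data for a fixed time-to-live, then forget it."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so that expiry can be tested
        self._entries = {}   # key -> (value, expiry time)

    def put(self, key, value):
        self._entries[key] = (value, self._clock() + self._ttl)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: actually remove the data rather than just hiding it.
            del self._entries[key]
            return None
        return value

    def purge(self):
        """Remove every expired entry; returns how many were dropped."""
        now = self._clock()
        expired = [key for key, (_, exp) in self._entries.items() if now >= exp]
        for key in expired:
            del self._entries[key]
        return len(expired)

    def __len__(self):
        return len(self._entries)


class Batcher:
    """Hold incoming items until the batch is full, then keep nothing."""

    def __init__(self, max_items):
        self._max_items = max_items
        self._pending = []

    def add(self, item, send):
        self._pending.append(item)
        if len(self._pending) >= self._max_items:
            self.flush(send)

    def flush(self, send):
        # Swap the list out first so no copy of the batch is retained here.
        batch, self._pending = self._pending, []
        if batch:
            send(batch)

    def __len__(self):
        return len(self._pending)
```

The design choice worth noting is that both classes delete on their own initiative: the cache on read or purge, the batcher on flush. How long the TTL is and how large the batch may grow are exactly the continuity and performance decisions that end up determining how much data is held.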

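We are primarily concerned with minimising the amount of data held and avoiding a single point of failure which would allow access to all the data. We have three basic options: a monolithic system, a facade in front of internally decoupled components, and a fully decoupled system.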

The monolithic system potentially has the best performance characteristics, but fares less well on continuity and privacy, both of which suffer from a single point of failure. The facade, while providing a single API, decreases performance but potentially facilitates better continuity by internally decoupling authentication and data handling. The decoupled system places much more responsibility on the client for handling the correct calling sequences, but deals better with privacy by reducing the amount of data available via any one API and component.

However as we decouple the system we increase the amount of inter-component communication and introduce a different set of information management and continuity issues, such as securing these data-flows and the leaky abstraction of network/communication failures.
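
The difference between the facade and the decoupled option can be sketched as follows. This is only an illustration under stated assumptions: the class names `AuthService`, `DataService` and `Facade` are hypothetical, passwords are compared in plain text purely to keep the example short, and the inter-component call is an ordinary function call where a real system would have a network hop to secure.

```python
import secrets


class AuthService:
    """Holds only credentials and session tokens (illustrative)."""

    def __init__(self, credentials):
        self._credentials = dict(credentials)  # plain text only for brevity
        self._sessions = {}                    # token -> user

    def login(self, user, password):
        if self._credentials.get(user) != password:
            raise PermissionError("bad credentials")
        token = secrets.token_hex(8)
        self._sessions[token] = user
        return token

    def user_for(self, token):
        return self._sessions.get(token)


class DataService:
    """Holds only user data; it never sees a password."""

    def __init__(self, validate_token):
        # This callable stands in for the inter-component data flow.
        self._validate_token = validate_token
        self._records = {}

    def _require_user(self, token):
        user = self._validate_token(token)
        if user is None:
            raise PermissionError("invalid token")
        return user

    def store(self, token, record):
        self._records.setdefault(self._require_user(token), []).append(record)

    def fetch(self, token):
        return list(self._records.get(self._require_user(token), []))


class Facade:
    """Single API in front of the two components: one call, two hops."""

    def __init__(self, auth, data):
        self._auth = auth
        self._data = data

    def store(self, user, password, record):
        self._data.store(self._auth.login(user, password), record)

    def fetch(self, user, password):
        return self._data.fetch(self._auth.login(user, password))


auth = AuthService({"alice": "s3cret"})
data = DataService(auth.user_for)

# Decoupled: the client is responsible for the calling sequence.
token = auth.login("alice", "s3cret")
data.store(token, "photo-1")

# Facade: the client makes one call, the facade does the sequencing.
facade = Facade(auth, data)
facade.store("alice", "s3cret", "photo-2")
print(facade.fetch("alice", "s3cret"))
```

Note that in the decoupled use neither component alone exposes both credentials and data, whereas every facade call carries the password alongside the record.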

The point here is not to provide a definitive answer of whether one solution is better than another but to emphasise the subtle interaction between privacy, continuity and performance in differing architectural solutions.

Tuesday, 4 June 2013

Data Collection

Anyone who cares about privacy will tell you that data collection is bad...REALLY BAD, yet without data collection none of the services we use and need would work (there's an economic argument there). Indeed most of the issues around data collection seem to be very emotive in nature and usually end up in the "data collection is bad/must be minimised" panic.

To understand data collection we must first understand from where data is collected, which means understanding the relationship between a client and a server via some infrastructure. This is described in an earlier article which talks about primary and secondary data.

The case we mainly worry about in privacy is where the client is a human interacting with an app or browser via some device, eg: a mobile phone.

The end-user (customer or human) will provide some information required to fulfill the functions of the service required, eg: posting a photograph to Flickr with some description and their login details.

So far this seems reasonable - after all, a user is posting a photograph to their account with a description. The service might then perform a number of other tasks:
  • Set the current time and date of upload
  • Extract EXIF information from the photograph, including (but not limited to):
    • date and time of the shot
    • location information
    • camera information
    • free text strings, eg: copyright/ownership information
All of the above comes under the notion of data collection and all of the above is information provided by the user of the service. One can argue that the user didn't know about the EXIF contents of the photograph and surprisingly few people actually realise how much information is embedded into the picture.
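
The gap between what the user thinks they sent and what the service ends up holding can be sketched like this. The upload is modelled as a plain dictionary with the EXIF block already parsed; a real service would use an image library to read it. The field names, the example values and the function `collect_from_upload` are all invented for illustration.

```python
from datetime import datetime, timezone

# Hypothetical upload: what the client actually transmits.
EXAMPLE_UPLOAD = {
    "login": "alice",
    "description": "Beach party",
    # EXIF block as embedded in the file (values invented).
    "exif": {
        "timestamp": "2013:06:01 23:15:00",
        "location": "60.17N 24.94E",
        "camera": "ExampleCam X1",
        "copyright": "Alice Example",
    },
}


def collect_from_upload(upload, now=None):
    """Return what a service could hold after a single photo upload."""
    now = now or datetime.now(timezone.utc)
    return {
        # What the user knowingly typed in.
        "provided_knowingly": {
            "login": upload["login"],
            "description": upload["description"],
        },
        # What came along inside the photograph.
        "embedded_in_photo": dict(upload.get("exif", {})),
        # What the service adds itself.
        "set_by_service": {"uploaded_at": now.isoformat()},
    }


held = collect_from_upload(EXAMPLE_UPLOAD)
for group, fields in held.items():
    print(group, sorted(fields))
```

In this example the user typed two fields and the service holds seven, which is the point being made about EXIF.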

Further operations can be performed such as facial recognition and linking this to persons who have been tagged in other photographs etc. For example, this is a service provided by Facebook.

Additionally the service may be collecting secondary information about the usage of the service, for example, how much time was spent using a particular page, what the UI/UX flow was and so on.

Then there's the information collected from the infrastructure. This includes information provided by the browser or app, typically in the form of browser identification strings, application identifiers, device identifiers, and from the service infrastructure including such information as source IP address, contents of the API call, time and date of interaction, error codes etc.

When we talk about privacy we typically end up focussing on the primary and secondary information sets: the former to decide whether that information is necessary for the service to function and provide the facilities the user requires, and the latter to decide whether this information needs to be explicitly collected or not. Rarely do we even investigate the information collected by the infrastructure, which can be used to recreate both the primary and secondary data sets. Indeed one of the major targets for any hacker is the infrastructure logs themselves.
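
A small sketch shows how infrastructure logs alone can recreate both sets. The log format, the addresses (taken from the documentation ranges) and the function `reconstruct` are assumptions for illustration; real access logs differ in layout but typically carry the same ingredients.

```python
from datetime import datetime
from urllib.parse import parse_qs, urlsplit

# Hypothetical access log: timestamp, source IP, method, path, status.
LOG_LINES = [
    "2013-06-04T10:00:00 203.0.113.7 GET /photos/upload 200",
    "2013-06-04T10:02:30 203.0.113.7 POST /photos?title=Beach+party 201",
    "2013-06-04T10:03:00 203.0.113.7 GET /photos/42 200",
    "2013-06-04T10:05:00 198.51.100.9 GET /photos/42 200",
]


def reconstruct(lines):
    """Rebuild per-client primary and secondary data from log lines only."""
    events_by_ip = {}
    for line in lines:
        stamp, ip, method, path, _status = line.split()
        event = (datetime.fromisoformat(stamp), method, path)
        events_by_ip.setdefault(ip, []).append(event)

    summary = {}
    for ip, events in events_by_ip.items():
        events.sort()
        summary[ip] = {
            # Primary data leaking through the contents of the API call.
            "submitted": [
                parse_qs(urlsplit(path).query)
                for _, method, path in events
                if method == "POST"
            ],
            # Secondary data: the navigation flow and the time spent.
            "flow": [urlsplit(path).path for _, _, path in events],
            "seconds_active": (events[-1][0] - events[0][0]).total_seconds(),
        }
    return summary


for ip, profile in reconstruct(LOG_LINES).items():
    print(ip, profile)
```

Nothing here was explicitly "collected" by the service, yet the title the user typed, the pages visited and the time spent are all recoverable.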

When we talk about data collection we must consider the following data sets:

Primary, secondary and infrastructure are as described above, with the addition that we split primary into explicit and implicit: explicit denotes information which is explicitly understood by the end-user, while implicit includes information that might be hidden, eg: EXIF. Combined together these produce the Total Information Set, which when linked together by simple association becomes the Total Deduced Information Set.
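
One possible reading of "linked together by simple association" is sketched below: records that share an identifier value are merged into a single profile. The records, the choice of `account` and `device_id` as linking keys, and the function `link_records` are assumptions for illustration and not a definition of the model.

```python
# Hypothetical records, one list per data set (all values invented).
EXPLICIT_PRIMARY = [{"account": "alice", "description": "Beach party"}]
IMPLICIT_PRIMARY = [{"account": "alice", "exif_location": "60.17N 24.94E"}]
SECONDARY = [{"account": "alice", "device_id": "dev-1", "seconds_on_page": 42}]
INFRASTRUCTURE = [
    {"device_id": "dev-1", "ip": "203.0.113.7", "user_agent": "ExampleBrowser/1.0"},
    {"device_id": "dev-2", "ip": "198.51.100.9"},
]

# Total Information Set: everything held, but not yet linked.
TOTAL_INFORMATION_SET = (
    EXPLICIT_PRIMARY + IMPLICIT_PRIMARY + SECONDARY + INFRASTRUCTURE
)


def link_records(records, link_keys):
    """Merge records that share a value for any of the linking keys."""
    profiles = []  # list of (identifier pairs seen, merged record)
    for record in records:
        ids = {(key, record[key]) for key in link_keys if key in record}
        merged = dict(record)
        untouched = []
        for profile_ids, profile in profiles:
            if profile_ids & ids:
                # Shared identifier: fold the existing profile into this one.
                ids |= profile_ids
                merged = {**profile, **merged}
            else:
                untouched.append((profile_ids, profile))
        untouched.append((ids, merged))
        profiles = untouched
    return [profile for _, profile in profiles]


# Total Deduced Information Set: the same data after linking.
deduced = link_records(TOTAL_INFORMATION_SET, link_keys=("account", "device_id"))
for profile in deduced:
    print(profile)
```

Five separate records collapse into two profiles, one of which now ties a photograph's location to an IP address even though no single data set contained both.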

Another way of looking at this is in a more Venn diagrammatic manner:

Here the labels are abbreviations of the corresponding labels in the first diagram, and the colours are for emphasis.

It must be noted that some intersections might not properly exist, for example the intersection of EP and Sec, and others exist only in certain contexts, for example, whether IP is a proper superset of EP or not.
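
Whether a given intersection exists can be checked mechanically once each data set is written down as a set of field names. The sketch below does this for one imagined service: the field names are invented, the label `Inf` for infrastructure is an assumption alongside the EP, IP and Sec labels used above, and a different service would give different answers.

```python
from itertools import combinations

# Field names held in each data set for one hypothetical service.
DATA_SETS = {
    "EP": {"login", "description"},
    "IP": {"exif_location", "exif_timestamp", "exif_camera"},
    "Sec": {"time_on_page", "ui_flow"},
    # The logged API call happens to include the description.
    "Inf": {"source_ip", "user_agent", "request_time", "description"},
}


def pairwise_intersections(named_sets):
    """Return the overlap between every pair of named data sets."""
    return {
        (first, second): named_sets[first] & named_sets[second]
        for first, second in combinations(named_sets, 2)
    }


overlaps = pairwise_intersections(DATA_SETS)
for pair, shared in overlaps.items():
    print(pair, sorted(shared) if shared else "empty")

# Context-dependent question: is IP a proper superset of EP here?
print("IP > EP:", DATA_SETS["IP"] > DATA_SETS["EP"])

# The Total Information Set is the union of all four.
total = set().union(*DATA_SETS.values())
print(len(total), "distinct fields held")
```

In this particular context EP and Sec do not intersect and IP is not a superset of EP, while EP and Inf overlap because the infrastructure logs the request contents.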

Now that we have explicitly modelled data collection, we can start tackling questions such as what minimisation of data collection actually means, and then start looking at what the processing and extraction functions over that data might actually look like.