Friday, 1 November 2013

Measurement and Metrics for Information Privacy

We have already discussed an analogy for understanding and implicitly measuring the information content over a data channel. The idea that information is an "infectious agent" is quite a powerful analogy in the sense that it allows us to better understand the consequences of processing information and of distributing that data, viz:
  • Any body of data containing certain amounts and kinds of sensitive data we can consider to be non-sterile
  • Information which is truly anonymous is sterile
  • Mixing two sets of information produces a single set of new information which is at least as unclean as the dirtiest set of data mixed, and usually more so! (A sketch of this mixing rule follows the list.)
  • The higher the security classification, the dirtier the information
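As a minimal sketch of that mixing rule, assuming sensitivity is scored per category on a 0-10 scale (the category codes and the values here are illustrative assumptions, not fixed definitions):

    # Mixing rule: the result is at least as contaminated as the dirtier
    # input, per category (sensitivity scored on an assumed 0-10 scale).
    def mix(a, b):
        return {cat: max(a.get(cat, 0), b.get(cat, 0))
                for cat in set(a) | set(b)}

    logs = {"LOC": 8, "ID": 7}   # a log store: location and identifiers
    crm  = {"FIN": 5, "ID": 9}   # a CRM extract: financial and identifiers
    print(mix(logs, crm))        # e.g. {'LOC': 8, 'ID': 9, 'FIN': 5}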
Now, using the information classification introduced earlier, we can further refine our understanding and obtain some kind of metric over the information content.

Let us classify information content into seven basic categories: financial, health, location, personal, time, identifiers and content. Just knowing what kinds of data are present, as we have already discussed, gives us a mechanism to pinpoint where more investigation is required.

We can then go further and pick out particular data flows for further investigation, and then map this to some metric of contamination:
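A minimal sketch of what such a mapping might look like, with purely illustrative weights (the category codes match the vector notation used later in this post):

    # Illustrative contamination weights for the seven categories
    # (the numbers are assumptions for discussion, not fixed values).
    CATEGORY_WEIGHTS = {
        "FIN": 5,    # financial
        "HLT": 6,    # health
        "LOC": 7,    # location
        "PER": 6,    # personal
        "TIM": 3,    # time
        "ID": 8,     # identifiers
        "CONT": 4,   # content (opaque payloads may hide other categories)
    }

    def contamination(categories_present):
        # Rough metric: sum the weights of the categories present in a flow.
        return sum(CATEGORY_WEIGHTS[c] for c in categories_present)

    print(contamination({"LOC", "ID"}))  # 15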


For example, transporting location data has a particular set of concerns, enough to make any privacy professional nervous at the least! However, if we examine a particular data flow or store we can evaluate what is really happening: for example, transmitting country-level data is a lot less invasive than transmitting highly accurate latitude and longitude.

Now one might ask why not deal with the accurate data initially? The reasons are that we might not have access to that accurate, field-level data, we might not want to deal with the specifics at a given point in time, specific design decisions might not yet have been made, and so on.

Furthermore, for each of the seven categories we can give some "average" weighting and abstract away from specific details which might just complicate any discussion.

Because we have a measure, we can calculate and compare using it. For example, if we have a data channel carrying a number of identifiers (eg: IP, DeviceID, UserID) we can take the maximum of these as being indicative of the sensitivity of the whole channel for that aspect.

We can compare two channels, or two design decisions; for example, a channel carrying an applicationID is less sensitive (or contaminated) than one carrying device identifiers.
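A minimal sketch of this, assuming illustrative sensitivity scores for each identifier type (the scores are assumptions, not measurements):

    # Treat a channel's identifier sensitivity as the maximum over the
    # identifiers it carries (scores are illustrative assumptions).
    ID_SENSITIVITY = {"IP": 6, "DeviceID": 9, "UserID": 8, "AppID": 4}

    def channel_id_score(identifiers):
        return max(ID_SENSITIVITY[i] for i in identifiers)

    channel_a = ["AppID"]                     # application identifier only
    channel_b = ["IP", "DeviceID", "UserID"]  # device-level identifiers
    print(channel_id_score(channel_a))        # 4
    print(channel_id_score(channel_b))        # 9 - dirtier than channel_a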

We can also construct a vector over the whole channel, composed of the seven dimensions above, to give a further way of comparing and reasoning about the level of contamination or sensitivity:
| (FIN=0, HLT=0, LOC=10, PER=8, TIM=3, ID=7, CONT=0) | < | (FIN=3, HLT=2, LOC=4, PER=4, TIM=2, ID=9, CONT=2) |

for some numerical values given to each category and some weighting over them (here, for instance, a weighting that emphasises identifiers makes the second channel the dirtier). Arriving at these values, and the weighting given to each, will be specific to a given wider context, but there is one measure which can be used to ground all this: information entropy, or how identifying the contents are of a given, unique human being. A great example of this is given at the EFF's Panopticlick pages.
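A minimal sketch of such a comparison, under an assumed weighting that emphasises identifiers (the x3 factor is purely for demonstration):

    # Compare two channels by the weighted magnitude of their category
    # vectors; weighting identifiers heavily makes the second the dirtier.
    WEIGHTS = {"FIN": 1, "HLT": 1, "LOC": 1, "PER": 1, "TIM": 1, "ID": 3, "CONT": 1}
    v1 = {"FIN": 0, "HLT": 0, "LOC": 10, "PER": 8, "TIM": 3, "ID": 7, "CONT": 0}
    v2 = {"FIN": 3, "HLT": 2, "LOC": 4, "PER": 4, "TIM": 2, "ID": 9, "CONT": 2}

    def magnitude(v):
        return sum(WEIGHTS[c] * x for c, x in v.items())

    print(magnitude(v1), magnitude(v2))   # 42 44
    print(magnitude(v1) < magnitude(v2))  # True: v2 is more contaminated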

We've only spoken about a single data flow so far; however, the typical scenario is reasoning over longer flows. For example, we might have our infrastructure set up as below*

In this example we might look at all the instances where AppID and Location are specified together and use a colour coding (sketched after this list) such that:
  • Black: unknown/irrelevant
  • Red: high degree of contamination, both AppID and Location unhashed and accurate respectively
  • Yellow: some degree of contamination, AppID may be hashed (possibly salted) or Location at city level
  • Green: AppID randomised over time, hashed and salted, and Location at country level or coarser
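A minimal sketch of that rating, where the state names for AppID and the Location granularity levels are illustrative assumptions:

    # Map the state of an (AppID, Location) pair to a traffic-light rating.
    def rate(appid_state, location_level):
        if appid_state is None or location_level is None:
            return "black"   # unknown / irrelevant
        if appid_state == "rotating+hashed+salted" and location_level == "country":
            return "green"   # randomised AppID, coarse location
        if appid_state == "raw" and location_level == "precise":
            return "red"     # both unhashed and accurate
        return "yellow"      # some degree of contamination in between

    print(rate("raw", "precise"))                     # red
    print(rate("hashed+salted", "city"))              # yellow
    print(rate("rotating+hashed+salted", "country"))  # green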
Immediately readable from the above flow are our points of concern which need to be investigated, particularly the flows from the analytics processing via the reports storage and on to direct marketing. It is also easy to see that there might be a concern with the flow to business usages: what kinds of queries ensure that the flow here is less contaminated than the actual reports storage itself?

There are a number of points we have not yet discussed, such as that some kinds of data can be transformed into a different type. For example, some kinds of content such as pictures inherently contain device identifiers, locations and so on. Indeed, the weighting for a category such as content might be very much higher than that of identifiers, for example - unless the investigation is made. It also becomes an almost trivial exercise for some to explicitly hide sensitive information inside opaque data such as images and not declare it when a privacy audit is made.
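As a minimal illustration of content carrying other categories, a sketch using the Pillow imaging library (an assumed tooling choice; any EXIF reader would do), where photo.jpg is a hypothetical input file:

    # An apparently opaque image can carry device identifiers (Make, Model),
    # time (DateTime) and location (GPSInfo) in its EXIF metadata.
    from PIL import Image, ExifTags

    exif = Image.open("photo.jpg").getexif()
    for tag_id, value in exif.items():
        name = ExifTags.TAGS.get(tag_id, tag_id)
        print(name, value)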

To summarise, we have a simple mechanism for evaluating information content over a system in quantitative terms, complete with a refinement mechanism that allows us to work at many levels of abstraction depending upon situation and context. Indeed, what we are doing is explicitly formalising and externalising our skills when performing such evaluations, and through analogies such as "infection control" providing a means of easing the education of professionals outside of the privacy area.

*This is not indicative of any system real, living or dead, but just an example configuration.
