Monday, 17 November 2014

A Definition of PII and Personal Data

There's been an interesting discussion on Twitter about the terms PII and "personal data", classification of information and metrics.

Personally I think the terms "PII" and "personal data" are too broadly applied. Their definitions are poor at best; when did you last see a formal definition of these terms? Indeed classifying a data set as PII only comes about from the types of data inside that data set and by measuring the amount of identifiability of that set.

There now exists two problems in that a classification system underneath that of PII isn't well established in normal terminology. Secondly metrics for information content are very much defined in terms of information entropy.

Providing these underlying classifications is critical to better comprehending the data that we are dealing with. For example, consider the following diagram:

Given any set of data, each field can be mapped into one or more of the seven broad categories on the left - If we wanted we could create much more sophisticated ontologies to express this. Within each of these we can specialise more and this is somewhat represented as we move horizontally across the diagram.

Avoiding information entropy as much as possible, we can and have derived some form of metric to at least assess the risk of data being held or processed. A high 'score' means high risk and a high degree of reidentification is possible, while a low score the opposite - though not necessarily meaning that there is no risk. Each of the categories could be further weighted such as using location is twice as risky as financial data.

There could be and are some interesting relationships between the categories, for example, identifiers such as machine addresses (IPs) can be mapped into personal identifiers and locations - depending upon the use case.

I'm not going to go into a full formalisation of the function to calculate this, but a simple function which takes in a data set's fields and produces a value, say in the range 0 to 5 to state the risk of the data set might suffice. A second function to map that value to a set of requirements to handle that risk is the needed.

What about PII?  Well, to really establish this we should go into the contents of the data and the context in which that data exists. Another, rather brutal way, is to draw a boundary line across the above diagram such that things on the right-hand-side are potentially PII and those on the left not. This might then become a useful weighting metric, that if anything appears to the right of this line then the whole data set gets tagged with being potentially PII. I guess you could also become quite clever in using this division line to normalise the risk scoring across the various information classifications.

In summary, we can therefore give the term PII (or personal data) a definition in terms of what a data set contains rather than using it as a catch-all classification. This allows us then to have a proper discussion about risk and requirements.


Ian Oliver. Privacy Engineering: A Data Flow and Ontological Approach. ISBN 978-1497569713

No comments: