Saturday, 27 June 2015

An article on The Semantics of PII


A while back I wrote a short article for the IAPP's Privacy Tech Blog. With permission, I'll reproduce it here for additional reference. Also, a tip of the hat to the administrator of the blog, Jedidah Bracy of the IAPP, for his spell checking, grammar checking and editorial skills!


The Semantics of PII
Privacy Tech | Feb 26, 2015


Last year, Profs. Peter Swire and Annie Antón wrote a compelling piece in Privacy Perspectives about the need for privacy engineers and lawyers to get along. Establishing a common language in which to communicate will be essential to appropriately connect policy with technology.

It’s probably safe to say that the most common terms used in privacy are personally identifiable information (PII) and personal data, depending upon whether you come from a U.S. or European background. I think these terms are more or less self-explanatory.

But what do they really mean?

Take PII, for example. It means a chunk of data that reveals some knowledge about a person who can be unambiguously identified. Sounds more or less about right, doesn’t it? But is a computer's IP address personally identifiable? What if that IP address belongs to a router for a large, multinational corporation? Is it PII then? And what if it belongs to a family using multiple computers, tablets, phones or other devices?

We will soon find ourselves delving into the minutiae of meaning: the what-does-personal-really-mean type of questions. Plus, we must ask what is information, and what does identifiable denote?

There is a whole area of linguistics, philosophy and mathematics—take your pick—that deals with the meaning of things, otherwise known as semantics, or even semiotics if you want the overall field.

Mathematicians took years to fully understand the semantics of even simple statements such as 1+1=2, which looks obvious until you try to explain what 1 is, what 2 is, what + means, what = means and then what it means to say 1+1. The English philosophers Bertrand Russell and Alfred North Whitehead spent years writing Principia Mathematica to answer this question, and after more than 300 pages of dense mathematics, they had an answer. That was, until a young Austrian by the name of Kurt Gödel came along and shook mathematics to its foundations with an equally "trivial" result.

So if it took 300 pages by two of the brightest minds in mathematics to give us a semantics for 1+1=2, how many pages—and years of work—will it take to give "PII" a semantics?

Now here's an interesting point: The definition of PII that is used in contemporary privacy is perfectly well defined in the privacy-legal context. I can go to various legal documents and read a formal definition of what PII or personal data means. But as we move between disciplines—in our case from privacy-legal to privacy-engineering disciplines—these definitions no longer hold, or at the very least, they don't work well.

If we move to the other end of the scale from legal to mathematics, we find concepts such as information entropy, which provides a clear, unambiguous and precise definition of what information is as well as the identifiability of a data set with respect to some population and so on. Information entropy, however, is not an easy concept with which to work. We can state now that the legal definition of PII can be defined in terms of the mathematical definition; it's just that this is obscenely difficult to do.
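To make the entropy idea a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a definition of PII: the function names and the toy postcode data are hypothetical, and the only formulas used are Shannon entropy, H = -sum(p * log2(p)), and the log2(N) bits needed to single out one person among N.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy of the observed values, in bits."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def surprisal_bits(value, values):
    """Bits of information revealed by learning a record has this value."""
    p = Counter(values)[value] / len(values)
    return -math.log2(p)


def bits_to_single_out(population_size):
    """Bits needed to pick exactly one individual out of a population."""
    return math.log2(population_size)


# Hypothetical toy data: a postcode area for each of eight people.
postcodes = ["A", "A", "A", "A", "B", "B", "C", "D"]

print(entropy_bits(postcodes))                # average bits per record
print(surprisal_bits("A", postcodes))         # common value, few bits
print(surprisal_bits("D", postcodes))         # rare value, many bits
print(bits_to_single_out(len(postcodes)))     # bits needed to identify one person
```

In this toy data, learning that someone lives in area D reveals 3 bits, which is exactly the 3 bits needed to single out one of eight people, so that value alone identifies them. Area A reveals only 1 bit and leaves four candidates. Whether a field is "identifiable" therefore depends on the population it is measured against, which is the same question raised by the IP address example.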

Somewhere between these two extremes lies software engineering, the discipline that actually implements privacy law into our systems, in ostensibly mathematical (programming language) terms.

Software engineers, much to the chagrin of privacy lawyers, do not understand legal terms. Well, OK, they do to a point, but you try coding a statement such as "reasonable privacy" into C++ or Java!

Plus, privacy lawyers don't understand all the subtle ramifications of virtual machines, machine language, object orientation, distributed computing, network protocols, XML, RDF—the list goes on!—again, much to the chagrin of software engineers.

Yet, as we stated earlier, there is a relationship between the terms and language that privacy lawyers use and the terms and language that software engineers use. That link provides the translation mechanism that allows both groups not just to talk but to properly communicate with each other.

We can spend as much time as we’d like writing manifestos and principles, designing processes, inventing new job titles such as privacy officer, privacy compliance tsar, grand chief-overseer-of-the-worshipful-court-of-privacy-dudes and so on, but without grounding semantics into terms such as PII and personal data—terms that will allow us to translate between legal-speak and engineer-speak—all of this work will be in vain.
