Wednesday 8 February 2012

A hierarchy of privacy?

I've been wondering about privacy and its hierarchy. Specifically, after reading the article Hierarchy and Emergence by David Corfield on The n-Category Café (July 18, 2008), I wonder: do we have a good hierarchy of privacy?

In that article, Corfield presents a list of hierarchies, e.g.:

sound, vocable, word, utterance, conversation, discourse

and discusses how, while mathematics deals well with the lower levels (e.g. the theory of sound waves), it does not handle the upper levels well. Corfield goes on to discuss whether mathematics can explain symphonies, etc.

So, with that as a starting point, can we do something similar for privacy, and in particular for information privacy and its relationship with data? Would a hierarchy such as:

field, type, object/table/class, database, corpus ....

work or be sensible? I'm not suggesting the above is strictly correct or the only way of defining this, but we need a start and this seems vaguely sensible at the moment.

Certainly, mathematics works well at the lower levels here. How do we deal with privacy at a field level? Typically with encryption, hashing and various other obfusication techniques. We know what a data field is, how to define it, how to manipulate it and what a field means under various semantic definitions.
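As a minimal sketch of field-level treatment, assuming keyed hashing (HMAC-SHA256) as the technique, one field of a record can be pseudonymised while the others are left alone. The record, the key and the function name `pseudonymise` are made up for illustration:

```python
import hashlib
import hmac


def pseudonymise(value: str, key: bytes) -> str:
    """Replace a field value with a keyed SHA-256 digest (hex string)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key and record, for illustration only.
key = b"example-secret-key"
record = {"name": "Alice Example", "city": "Helsinki"}

# Only the "name" field is transformed; "city" is copied unchanged.
masked = {**record, "name": pseudonymise(record["name"], key)}
print(masked)
```

The same input always maps to the same digest, so records can still be joined on the masked field, which is both the point of the technique and its weakness.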

At a type level we know the transformations and semantics similarly well. For an example of information transformation, take a street being mapped to a city: we have ontologies and consistent mappings that work well here.

There's even a nice metric over these: entropy. Typically this manifests itself as a probability distribution rather than a specific value that can be applied, but an ordering relationship obviously exists, and this is isomorphic to the subclass relationship found in most ontologies.
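One way to read the entropy metric, sketched here under the assumption that we measure the Shannon entropy of the values a field takes across a data set, is to compare the field before and after it is generalised from street to city. The street names, city names and the mapping are invented for the example:

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical street-to-city mapping (a tiny stand-in for an ontology).
street_to_city = {
    "Main St": "Springfield",
    "High St": "Springfield",
    "Elm St": "Shelbyville",
    "Oak St": "Shelbyville",
}

streets = ["Main St", "High St", "Elm St", "Oak St"]
cities = [street_to_city[s] for s in streets]

print(shannon_entropy(streets))  # 2.0 bits: four equally likely streets
print(shannon_entropy(cities))   # 1.0 bit: two equally likely cities
```

The more general type carries less information per record, so the ordering by entropy follows the street-to-city direction of the mapping in this toy case.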

At an object/class/table level things get more complicated, as individual fields have relationships with other fields. A trivial example here might be that one obfusicates the date of birth but doesn't obfusicate the star sign and age fields. These fields are related to the date of birth, so although the entropy does increase, the amount of information lost is less than one might perceive by dealing with a field or fields in isolation.
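The date-of-birth example can be made concrete with a rough sketch. It assumes one common convention for star sign boundaries (sources differ by a day or so), an age given in whole years, and the date of this post as the reference date; the function names are illustrative:

```python
from datetime import date, timedelta

# Assumed sign start dates (month, day), in calendar order; conventions vary.
SIGN_STARTS = [
    ("Aquarius", 1, 20), ("Pisces", 2, 19), ("Aries", 3, 21),
    ("Taurus", 4, 20), ("Gemini", 5, 21), ("Cancer", 6, 21),
    ("Leo", 7, 23), ("Virgo", 8, 23), ("Libra", 9, 23),
    ("Scorpio", 10, 23), ("Sagittarius", 11, 22), ("Capricorn", 12, 22),
]


def star_sign(d):
    """Return the sign whose start date is the latest one on or before d."""
    sign = "Capricorn"  # 1 to 19 January belongs to the sign begun in December
    for name, month, day in SIGN_STARTS:
        if (d.month, d.day) >= (month, day):
            sign = name
    return sign


def age_on(birth, today):
    """Age in whole years on the given day."""
    years = today.year - birth.year
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1
    return years


def candidates(age, today, sign=None):
    """All birth dates consistent with the known age (and sign, if given)."""
    day = date(today.year - age - 1, 1, 1)
    end = date(today.year - age, 12, 31)
    found = []
    while day <= end:
        if age_on(day, today) == age and (sign is None or star_sign(day) == sign):
            found.append(day)
        day += timedelta(days=1)
    return found


today = date(2012, 2, 8)
by_age = candidates(30, today)
by_age_and_sign = candidates(30, today, "Aquarius")
print(len(by_age), len(by_age_and_sign))  # 365 30
```

With the date of birth masked, an observer who still sees age and star sign is left with 30 candidate dates rather than 365 in this case, which is the sense in which less is lost than the masked field alone suggests.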

Here I think we already start seeing problems. I don't think that mathematics has a problem per se, but rather that the complexity of the solution starts to emerge and we don't readily know how to deal with it, except maybe in an abstract sense: we can define the properties and structures but not the actual inner workings consistently. Even the simple act of defining what an object is is fraught with difficulty. One can look at the work on RDF Molecules [1] to see how difficult it can be to define the boundary of the concept of an object. I'd actually love to perform some analysis using the expressivity of description logics to provide the various bounds of an object, and then combine this with a power law relationship to give probability bounds, e.g. this chunk of data constitutes 95% of what I would consider to be the core of some object. Aside to self: need to write that down fully.

How does privacy work at the database level, and what do databases look like after processing for privacy? As we have seen with AOL [2] and Netflix, there are interesting issues when a database is "anonymised" and then combined with other databases.
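A toy sketch of the combination problem, assuming two small invented tables that happen to share the fields postcode and birth year, shows how an "anonymised" table can be linked back to names. It is not a reconstruction of either the AOL or the Netflix case, and all data and names are made up:

```python
from collections import defaultdict

# Hypothetical "anonymised" release: names removed, other fields kept.
anonymised = [
    {"user": "u1", "postcode": "00100", "birth_year": 1975, "rating": "film A: 5"},
    {"user": "u2", "postcode": "00200", "birth_year": 1982, "rating": "film B: 1"},
]

# Hypothetical public data set that contains names and the same two fields.
public = [
    {"name": "Alice Example", "postcode": "00100", "birth_year": 1975},
    {"name": "Bob Example", "postcode": "00200", "birth_year": 1982},
    {"name": "Carol Example", "postcode": "00200", "birth_year": 1982},
]


def link(anon_rows, public_rows, keys):
    """For each anonymised user, list the public names sharing the key fields."""
    index = defaultdict(list)
    for row in public_rows:
        index[tuple(row[k] for k in keys)].append(row["name"])
    return {
        row["user"]: index.get(tuple(row[k] for k in keys), [])
        for row in anon_rows
    }


links = link(anonymised, public, ["postcode", "birth_year"])
print(links)
```

Here `u1` matches exactly one name and is re-identified, while `u2` is only narrowed down to two people; neither table reveals this on its own.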

Finally, what happens at the corpus level? Definitions get very weak here and may not be amenable to our traditional, formal mathematical treatment.

In the comments on Corfield's original hierarchy article, a good point is made: mathematics can describe every level in the hierarchy equally well, but this doesn't mean that mathematics currently explains the links between the levels (hint: composition and aggregation are not easy), nor that we have a good model that explains the hierarchy at all. This might imply that our hierarchy is wrong, that we've missed levels, or that we haven't decomposed the concepts well.

The main thing here is that I've written this down, even if it is in very early draft form...maybe this'll help with the writer's block...so, a theory of privacy, anyone?

References

[1] Tim Finin. RDF molecules and lossless decompositions of RDF graphs.
[2] Andrew Orlowski. AOL publishes database of users' intentions. The Register 7 Aug 2006.

2 comments:

Ora said...

Good start, keep going.

BTW, the word is spelled "obfuscate". ;-)

Ian said...

I obfuscated it ;-)