Thursday 27 September 2012

Teaching Privacy

It often surprises me that many of the people advocating privacy don't actually understand the thing that they're trying to keep private: information. Indeed the terms data and information are used interchangeably and there is often little understanding of the actual nature and semantics of said data and information.

I've run courses on data modelling, formal methods, systems design, semantics and now privacy - the latter however always seems to be "a taster of privacy" or "a brief introduction to privacy" and there is rarely the chance to get into specifics about what information is.

This of course has some serious implications and one of the best examples I can find is when we talk about anonymisation. I've seen horrors such as the statements "if you hash this identifier, then it is anonymous" or "if we randomise this data then we can't track" or lately, "if we set this flag to '1' then no-one will track you anymore". In the first case I refer people back to the AOL Data Leak and the dangers of fingerprinting, semantic analysis and simple cross-referencing.
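To make the first of those horrors concrete, here is a minimal sketch of why a hashed identifier is not anonymous. Everything in it is invented for illustration: the 7-digit identifier format, the function names `pseudonymise` and `recover`, and the choice of an unsalted SHA-256. The point is only that when the space of possible identifiers is small, anyone can hash every candidate and look the "anonymous" value up, and that the same input always gives the same hash, so records can still be cross-referenced.

```python
import hashlib


def pseudonymise(identifier: str) -> str:
    """Return the unsalted SHA-256 hex digest of an identifier."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def recover(hashed: str, candidates):
    """Brute-force the original identifier by hashing every candidate."""
    for candidate in candidates:
        if pseudonymise(candidate) == hashed:
            return candidate
    return None


# Hypothetical identifier space: "040" followed by four digits (10,000 values).
candidate_space = [f"040{n:04d}" for n in range(10_000)]

# A "hashed, therefore anonymous" value as it might appear in a released data set.
leaked = pseudonymise("0401234")

# The same identifier hashes to the same value everywhere, so two data sets
# that both contain it can be joined on the hash alone.
assert leaked == pseudonymise("0401234")

print(recover(leaked, candidate_space))
```

Salting or keyed hashing makes the enumeration harder, but the hash is still a stable pseudonym for one person, which is exactly what cross-referencing needs.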

I made a study a while back based on the leak of 16,000 names from various Finnish education organisations (plus maybe other places). It was very interesting to see how many entries in the released list, which contained dates of birth and last names, were already unique, and, even where common Finnish names occurred, how easy it was to trace these back to a unique person. Actually going to the next step and verifying this with that person would, I guess, have been somewhat illegal or, if not, unethical to say the least. Social engineering would have been very easy in many of these cases I'm sure.
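The kind of uniqueness check I mean can be sketched in a few lines. This is not the analysis I ran and the records below are entirely made up; the field names `last_name` and `dob` and the function `unique_fraction` are assumptions for the example. It simply counts what fraction of records are the only ones with a given combination of attributes.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` occurs exactly once."""
    combos = [tuple(record[k] for k in keys) for record in records]
    if not combos:
        return 0.0
    counts = Counter(combos)
    unique = sum(1 for combo in combos if counts[combo] == 1)
    return unique / len(combos)


# Invented sample data, not taken from any real leak.
records = [
    {"last_name": "Virtanen", "dob": "1980-01-01"},
    {"last_name": "Virtanen", "dob": "1980-01-01"},
    {"last_name": "Virtanen", "dob": "1975-06-15"},
    {"last_name": "Korhonen", "dob": "1980-01-01"},
    {"last_name": "Nieminen", "dob": "1990-03-09"},
]

print(unique_fraction(records, ["last_name"]))          # 0.4
print(unique_fraction(records, ["last_name", "dob"]))   # 0.6
```

Even in this toy data, adding one more attribute raises the share of unique records; on a real list every extra column tends to push that fraction further towards 1.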

So given cases like these and the current dearth of educational material I thought it would be nice to try to put together a more comprehensive and deeper set of material. Some universities are already doing this and there also exist industrial qualifications such as those by the IAPP, however at this stage all ideas are welcome.

Now I want to specifically address a technical audience: software engineers and computer scientists, the people who end up building these systems, because that's where I feel much breaks down. There are many reasons for this but I won't apportion blame here - that's not really constructive in the current context.

First of all I want to break things down into 3 logical segments (actually there are 4, but I'll discuss the fourth later):
  • Legal
  • Consumer Advocacy
  • Technical
 and address each area individually.

Legal is relatively straightforward: it requires an understanding of the principles of privacy, of how various jurisdictions view data, information, anonymisation, cross-referencing, children and minors, cross-border data transfer, retention and data collection, and a discussion of certain practices, eg: EU, US, China, India etc. This discussion doesn't have to be heavy but an understanding of what the law states and how it interprets things is critical. Also from here we should get an understanding of how the law affects the engineering side of things: common terminology is a good example.

Consumer advocacy is really the overview material in my opinion - what are the principles of privacy, for example Cavoukian's Privacy by Design (even if I'm not happy with the implementation of these), how do consumers view privacy, what is the reality (say vs do) and also various case studies such as how consumers view Google, Apple, Nokia, Facebook, various Governments, technologies such as NFC, mobile devices, 'Smart Televisions', direct marketing and advertising, store cards etc. Out of this comes an understanding of how privacy is viewed and even an appreciation of why we don't get privacy: anti-privacy if you like.

The technical aspect takes in many technologies; rather than describe them, I'll list them (and this will be non-exhaustive and in no particular order):
  • Basic Security - Web, Encryption, Hashing, Hacking (XSS etc), authentication (OpenID, OAuth etc), differences/commonalities between privacy and security, mapping privacy problems into security problems as a solution
  • Databases - technologies, design, schema development (eg: relational theory), "schema-less" databases, cross-referencing, semantic isolation
  • Semantics - ontologies, classifications, aspect, Semantic Web
  • Data-flow
  • Distributed Systems - networking and infrastructure
  • API design - browsers, apps, web-interfaces, REST
  • Data Collection - primary vs secondary vs infrastructure, logging
  • Policy - policy languages, logic, rules, data filtering
  • Anonymisation - data cleansing
  • Identifiers - tracking, "Do Not Track"
  • User-Interface
  • Metrics for privacy - entropy
  • Information Types and Classification - location, personally identifiable information, identifiers, PCI, health/medical data
As you can see the list is extensive and an understanding of each of these areas is critical to building systems that honour and preserve privacy in its various forms (as described in the consumer advocacy and legal sections). The main point here is to provide software engineers and computer scientists with the tools to implement privacy in a meaningful manner.

Now that we have outlined the three areas we can look at the fourth which binds these together and which I tentatively call "Theory of Privacy".

Obviously something binds these areas together and there does exist a huge body of work on the nature of information and its classifications. I particularly like the approach by Barwise and Seligman in the 1997 book Information Flow: The Logic of Distributed Systems*. I believe we can quite easily get into all sorts of interesting ontology, semantics and even semiotic discussions. Shannon's Information Theory and notions of entropy (eg: Volkenstein's book: Entropy and Information) are fundamental to many things. I think this really is an area that needs to be opened up and addressed seriously, and anything that binds together and provides a common language to unify consumer advocacy, the law and software engineering is critical.
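As a small taste of how entropy might serve as a privacy metric, here is a hedged sketch of Shannon entropy computed over the observed values of a single attribute. The function name `shannon_entropy` and the sample values are my own inventions for illustration; the formula is the standard H = sum of p * log2(1/p) over the observed value frequencies.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    values = list(values)
    if not values:
        return 0.0
    total = len(values)
    counts = Counter(values)
    # H = sum over distinct values of p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Invented examples: an attribute everyone shares reveals nothing,
# while one that differs for every record carries the most bits.
print(shannon_entropy(["FI", "FI", "FI", "FI"]))   # 0.0
print(shannon_entropy(["a", "b", "c", "d"]))       # 2.0
```

Roughly speaking, the more bits an attribute carries, the more it helps to single out one person in the data set, which is why identifiers and fine-grained location score so badly.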

Finally, no outline of a course would be complete without some preliminary requirements and a book list. For the former an understanding of computer systems and basic computer security is a must (there is no privacy without security), as are a grounding in software engineering techniques and a dose of computer science. For the books, my first draft list would include:
  • Barwise, Seligman. Information Flow
  • O'Hara, Shadbolt. The Spy in the Coffee Machine: The End of Privacy as We Know It
  • Solove. Understanding Privacy
  • Nissenbaum. Privacy in Context: Technology, Policy, and the Integrity of Social Life
  • Solove. The Future of Reputation: Gossip, Rumor, and Privacy on the Internet

*somebody should make a movie of this.
