Friday, 25 January 2013

On The Naivety of Privacy

Recent events regarding privacy and the internet have left me wondering if we are being somewhat naïve. We are starting to see a slew of new laws, strategies and technologies for protecting our privacy in what is effectively a public space. The end-user, however, is not really getting the benefit of this as far as I can tell - indeed, if anyone is, it is what some have called the emerging privacy-industrial complex [1].

It is utterly naïve to believe that laws, strategies, intentions, grand speeches, certifications, automatic filtering, classification iconography and so on make for better end-user privacy. The more we do this the more confused we become, and simultaneously we lose sight of what we're really trying to achieve. Spare a thought for the poor end-users.

There is a great deal about how the internet, computers and information systems work that is misunderstood or simply not known by privacy advocates. I fear that in many cases some don't want to understand because it takes them outside their comfort zone, or the semantic gap between the engineers and the legal/advocacy side is so great that bridging it is extraordinarily difficult for both parties.

I worry about our lack of formality and discipline, possibly in equal quantities. We – the privacy community – lack these aspects to really understand and accept the fundamentals of our area and how to apply this to the information systems we are trying to protect. In some cases we are actively fighting against the need to scientifically ground our chosen area.

We must take a moment to think and understand what problem we are really trying to solve. The more philosophical works by Solove and Nissenbaum address the overall concept of privacy. I'm not sure that the implications of these are really understood. Part of the problem is that general theories of information [2] are very abstract and abstruse when compared with the legal views of the above authors, and we've done very little to tie these areas together to produce the scientific formalisation of privacy we need.

As an example, the Privacy by Design (PbD) manifesto is being waved around by many as the commandments of privacy, as if following these magically solves everything. It does not; it only leads to "technical debt" and greater problems in the future. Often we find the engineers, R&D teams and the scientists excluded from, and outside of, this discussion.

I think we're missing the point of what privacy really is, and we certainly have little idea at this time how to effectively build information systems with inherent privacy [3] as a property of those systems. I have one initial conclusion:


We have no common definitions, common language, common semantics nor mappings between our individual worlds: legal, advocacy and engineering. Worse, in each of these worlds terminology and semantics are not always so well internally defined.

When a software engineer says "data collection", "log" or "architecture", these do not mean the same to a lawyer or a consumer advocate. Indeed, I don't think these terms semantically map even remotely cleanly - if at all - between these groups.

A set of PowerPoint slides containing a strategy, a vision, a manifesto, good intentions, a project plan or a classification scheme means very little; without some form of semantics these are wasted, token efforts that only add to the complexity and confusion of a rapidly changing field.

We desperately need a way of communicating amongst ourselves through which all of the internal factions within the privacy community can effectively understand each other's point of view. Only then might we have a chance of realistically and effectively addressing the moving target of privacy issues facing end-users and the businesses that rely so much on the interchange and analysis of information.

The problem with formally (or rigorously) defining anything is that it has the nasty tendency to expose holes and weaknesses in our thinking. Such holes and weaknesses are rarely appreciated, especially when they challenge an established school of thought or a political or dogmatic balance [4].

The privacy community is constantly developing new laws and legal arguments, new sets of guidelines, manifestos and doom scenarios, while the engineers are trying to address these often inconsistent and complex ideas through technical means. From the engineering perspective we are not only internally exposing flaws in database design, information system architecture and user experience, but also exposing the mismatch between engineering, legal, the world of the consumer advocate and, ultimately, a company's information strategy.

An information strategy needs to address everything from how the engineers develop software to how you want your company to be perceived by the consumer. How many information strategies actually address the role that information plays in a modern, global consumer ecosystem where the central concept is the collection and processing of user information? Of those, how many address the engineering and scientific levels of information?

We must take a serious retrenchment [5] step and look back at what we have created. Then we need to ruthlessly carve away anything that does not either solve the communication issue within the privacy community or immediately serve the end-user. To re-emphasise the latter point: this explicitly means what the end-user values, not what we as a privacy community might perceive to be valued by the end-user.

We must fully appreciate the close link between privacy and information, and that privacy is one of the most crosscutting of disciplines. Privacy is going to expose every single flaw in the way we collect, manage, process and use information from the user experience, as well as the application and services eco-system, and even the manner in which we conduct our system and software engineering processes and information governance. The need to get privacy right is critical not just for the existence of privacy as a technical discipline in its own right (alongside security, architecture, etc) but also for the consumer and the business.

The emphasis must be placed on the deep technical knowledge of experts in information systems - these must be the drivers and unifiers between the engineers, the lawyers, the advocates and ultimately the users. Without this deep, holistic, scientific and mathematical foundation we will not be able to sufficiently or consistently address or govern any issues that arise in the construction of our information systems, at any level of abstraction.

If the work we do in privacy does not have a scientific, consistent and formal underpinning [6] that brings together the engineers, lawyers and advocates, then privacy is a waste of time at best and deeply destructive to our information systems at worst.

Without this we are a disjointed community caring for ourselves and not the business or the consumer; privacy becomes just a bureaucratic exercise to fulfil the notion of performing a process, with metrics rendered as meaningful as random numbers.

* * *


Via Twitter I came across a talk given by Jean Bezivin entitled "Should we Resurrect Software Engineering?", presented at the Choose Forum in December 2012. Many of the things he presented are analogous to what is happening in privacy. He makes the point a number of times that we have never addressed the missing underlying theory of software engineering, nor how to really unify the various communities, fields and techniques in this area. Two points I particularly liked: first, that we concentrated on the solution (MDE) but never really thought about the problem; second, his use of Albert Camus' quote

<< Mal nommer les choses, c'est ajouter au malheur du monde >>
[To misname things is to add misery to the world]
A subtle hint to getting the fundamentals right: terminology and semantics!


[1] #pii2012: The Emergent Privacy-Industrial Complex

[2] Jon Barwise, Jerry Seligman (1997) Information Flow: The Logic of Distributed Systems. Cambridge University Press.

[3] I like the idea of privacy being an inherent construct in system design in much the same way that inherent safety emerged from chemical/industrial plant design

[4] A blog article discussing “mathematical catastrophes” – two that come to mind are Russell and Frege, and Russell and Gödel. Both related, but the latter’s challenge to the mathematical school of thought was dramatic to say the least.

[5] A formal retrenchment step in that we not just start again but actively record what we’re backtracking on. Poppleton constructed a theory of retrenchment for software design using formal methods; the same principles apply here.

[6] If you’re still in doubt just remember that whatever decisions are made with respect to privacy, there’s a programmer writing formal, mathematical statements encoding this. Lessig’s Code is Law principle.

Wednesday, 23 January 2013

Kübler-Ross and Getting Ideas Accepted

When discussing new or challenging ideas - indeed, anything that challenges or even questions the existing schools of thought (or business process!) - there is often much "push-back", with responses such as "that'll never work", "impossible" and so on... sometimes even when confronted with evidence and a demonstration.

Dealing with this is often soul-destroying from the innovator's perspective, and getting past it is 90% of the challenge of getting new ideas and viewpoints accepted. So having a mechanism for understanding the responses would be useful. I think the Kübler-Ross model might be helpful here for examining people's responses.

The model itself was developed for psychologists to understand the process of grief. While the model has sparked some controversy, this does not detract from the basic principle of the model. The model consists of five sequential stages:
  1. Denial - "we're fine", "everything works"
  2. Anger - "NO!"
  3. Bargaining - "Ok, so how do you fix this?"
  4. Depression - "Why bother...?", "Too difficult"
  5. Acceptance - "Let's do this!!!"
When applied to challenging ideas, the person rejecting those ideas has to proceed through the above stages - and the challenger has to also acknowledge this and work within this.

Let's say we have a process and metrics for some purpose - the process is complicated and dogmatic, the metrics measure some completion rate but not effort or compliance. A challenge to this might be met with the following responses:
  1. Denial - the process works! No-one has complained! We have metrics! We're CMM Level 3!
  2. Anger - Why are you complaining? We don't need to change!
  3. Bargaining - OK, we'll consider your ideas and changes but we're not promising anything. Can you come up with a project plan, budget, strategy, PowerPoints etc...?
  4. Depression - OK, there are problems, but we can't deal with them. It's too late and complex to change. Let's create a project plan, strategy and vision. How can we ever capture those metrics?
  5. Acceptance - You're right, let's run with this
Actually the last stage - acceptance - probably works very well in a more agile environment, but agility requires a deep, holistic and ultimately theoretically grounded approach to the subject at hand. Do not underestimate the need for management support either; conversely, as a manager, giving real support is similarly critical.

This model must be used in an introspective and reflective manner to ensure that you as the originator and presenter of the idea do not fall into the trap of stages 1 and 2 yourself. Understanding your reactions in the above terms is very enlightening regarding your own behaviour.

If you do reach stage 3 in the discussions, then this is the time when you need to be absolutely sure of how your idea works, what its flaws are and how it integrates with and improves upon what came previously. At this stage you have the chance to get everyone on board; after this, however, it is extremely difficult to win people over to your idea.

Stage 4 is depression all round: you will probably have accepted many changes to your idea and let go of some cherished notions. Worse, you've probably challenged the existing school and dogma to such a degree that you are going to get a lot of "push-back" on the ideas. In some respects this is where ideas die, either "naturally" or through "suicide", to use some dark terminology. To get through this stage you need to be the supporter of everyone. Indeed, emphasising the previous school of thought as the catalyst for the newer ideas is critical to getting through this; after all, wasn't it the previous systems that sparked the need for change in the first place?

Stage 5 requires real leadership of the innovation and the building of a team to carry it forward. Like it or not, teamwork - and ensuring that everyone, even the detractors, has a voice - is critical. Sometimes your challenge might free some of the original detractors from their earlier beliefs; this can come as quite a relief to these people and offer them badly needed new challenges and purpose.

There are many more things one could write on this, and there are many books and theories elsewhere on how to manage innovation and invention. The idea here was to relate some experiences to the Kübler-Ross model and to understand things in that context - personally, I've found it a very useful tool.

Monday, 7 January 2013

Tutorial: Tracking

Tracking and anonymisation are two critical aspects of information privacy and often quite misunderstood. We established what is meant by Personally Identifiable Information (PII) earlier but now I wish to progress a little further and discuss identifiers, tracking and anonymisation of data sets.
  • A data set is a collection of records containing information. 
  • A record is usually made up of a number of individually named fields, though it could be a more complex structure such as a graph or a tree. 
  • Each field contains some data: anything from something as simple as a binary value to a name, a number, a time-stamp, a picture or a video.
  • These fields are usually typed, for example: string, boolean, integer, VARCHAR, blob, media, something from the dc: namespace etc. However, a field containing, say, a string could be further interpreted as a telephone number or a name. Some typing systems make such distinctions explicit - e.g. a field storing a string to be interpreted as a telephone number - while in others this is left to the interpretation of the reader.
  • Some fields in the data set's records are used to identify either aspects of that record, to correlate records together or to link to some external data. These fields we term identifiers.
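These definitions can be sketched as a minimal, hypothetical data model; the field names and types below are invented purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class Record:
    """One record in a data set: a number of individually named, typed fields."""
    key: int       # identifier: distinguishes individual records
    user_id: str   # identifier: correlates records belonging to one entity
    artist: str    # a plain data field (a string interpreted as a band name)

# A data set is simply a collection of such records.
data_set = [
    Record(1, "Alice", "Queen"),
    Record(2, "Bob", "Rush"),
]

# The fields used to identify or correlate records are the identifiers.
identifiers = ("key", "user_id")
```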
Tracking is the ability to correlate information, often in conjunction with some criterion such as a temporal, device or user-identity dimension. The correlation is made according to one or more fields which act as identifiers - for example, user ID fields or IP addresses. The point is that we have a consistent identifier (or key) over the sets of data that we wish to relate or consider together. For example, consider the following data set collected from some music service:

Key  UserID  Artist
1    Alice   Queen
2    Alice   Queen
3    Bob     Rush
4    Bob     Rush
5    Eve     Spice Girls
6    Alice   Queen
7    Bob     Genesis
8    Eve     Metallica

From this log we might want to track user behaviour to understand what music a particular user of our system likes listening to: we can see that Alice likes Queen, Bob is a fan of progressive rock and Eve has varied taste in music. This is possible because we have a consistent identifier (the UserID field): any two entries with the same value refer to the same entity - the user. Furthermore, the Key field allows us to distinguish between two entries containing the same information, which enables us to count individual entries: Alice played three songs, Bob three and Eve two. Additionally, the Key field in this example may also have a temporal dimension, such that we can infer the order in which songs were played.
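A minimal sketch of this kind of tracking over the data set above (Python used purely for illustration):

```python
from collections import Counter, defaultdict

# The music-service log from the table above: (Key, UserID, Artist)
log = [
    (1, "Alice", "Queen"),
    (2, "Alice", "Queen"),
    (3, "Bob", "Rush"),
    (4, "Bob", "Rush"),
    (5, "Eve", "Spice Girls"),
    (6, "Alice", "Queen"),
    (7, "Bob", "Genesis"),
    (8, "Eve", "Metallica"),
]

# Correlate records over the consistent identifier: the UserID field.
plays_per_user = Counter(user for _, user, _ in log)

# Collate which artists each user listened to.
artists_per_user = defaultdict(set)
for _, user, artist in log:
    artists_per_user[user].add(artist)

# Alice played three songs, Bob three and Eve two; Alice listens only to Queen.
```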

The only required property of the identifier is that it be consistent over the records we wish to track. So if we change the above identifiers to their SHA-256 representations ("Alice" becomes 3bc51062973c458d5a6f2d8d64a023246354ad7e064b1e4e009ec8a0699a3043 ) we do not compromise our ability to track the behaviour of a user over that data set:

Key  UserID (SHA-256 hashed)  Artist
1    3bc5106...0699a3043      Queen
2    3bc5106...0699a3043      Queen
6    3bc5106...0699a3043      Queen

We can still make the same analyses: 3bc...3043 likes Queen and played songs from that band three times. We have, however, obfuscated the user identifier - assuming that the user identifier had any meaning in the first place.
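A sketch of this pseudonymisation step (Python used purely for illustration): hashing the identifier changes its form but not its consistency, so the same tracking analyses still hold.

```python
import hashlib
from collections import Counter

# The Alice entries from the log above: (Key, UserID, Artist)
log = [
    (1, "Alice", "Queen"),
    (2, "Alice", "Queen"),
    (6, "Alice", "Queen"),
]

def pseudonymise(user_id: str) -> str:
    """Replace an identifier with its SHA-256 hex digest."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()

# The hashed log: the identifier's form changes, its consistency does not.
hashed_log = [(key, pseudonymise(user), artist) for key, user, artist in log]

# Tracking still works over the obfuscated identifier.
plays = Counter(user for _, user, _ in hashed_log)
assert plays[pseudonymise("Alice")] == 3
```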

This latter point is important because it depends upon how we interpret the identifier. For example, 3bc5106...99a3043 has no "meaning" other than being something we track over. The string "Alice" may have a meaning - yet "Alice" as five ASCII or Unicode characters is just as meaningful as our hashed value above. However, "Alice" itself, according to the typing information and its usage in the data set, is the identifier of a user in some system. Furthermore, under some interpretations "Alice" is a female name, and this particular interpretation of the identifier's meaning might have additional impact.

In the above case we stated nothing about whether the strings "Alice", "Bob" and "Eve" actually were people's names, nor whether they were linkable to real and unique people. We never really stated the semantics of the UserID field - quite deliberately.

An example we can use to demonstrate this is the common practice of email-as-identifier. You can use your email address instead of (or as) a user name in Facebook, G+ and other services. A string such as "zarquon.123@somesite.zyx" can be interpreted in a number of (potentially) simultaneous ways:

  • a string of 24 ASCII characters
  • an email address of a person/company/entity
  • a user ID for some service
  • a unique identifier linkable to a real person
and so on...

From a tracking point of view, "zarquon.123@somesite.zyx" has just as much meaning as "Alice" or "3bc5106...0699a3043" etc.

We are now moving, however, into the interpretation of the contents of a field and into semantics beyond that of an identifier - which is independent of the identifier's actual form. This leads us to the notion of linkability, which we shall discuss later.

Tracking, as we have hinted, can be made more sophisticated through the addition of other identifiers such as IP addresses, device identifiers and so on, but this just makes the partitioning of the data set more complex and expands the possible internal cross-correlations; it doesn't change the basic principle of tracking.
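This can be sketched as tracking over a composite key; the device and IP fields below are hypothetical additions to the earlier log, invented purely for illustration:

```python
from collections import Counter

# Hypothetical extended log: (Key, UserID, DeviceID, IP, Artist)
log = [
    (1, "Alice", "dev-1", "10.0.0.5", "Queen"),
    (2, "Alice", "dev-2", "10.0.0.5", "Queen"),
    (3, "Bob",   "dev-3", "10.0.0.7", "Rush"),
]

# Correlating over a composite identifier (UserID, IP) partitions the data
# more finely, but the principle is unchanged: a consistent key.
by_user_ip = Counter((user, ip) for _, user, _, ip, _ in log)

# The same records can be cross-correlated over the device identifier alone.
by_device = Counter(device for _, _, device, _, _ in log)
```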

Some identifiers are more useful than others, and much of this depends upon how linkable an identifier is to a person or device. For example, device identifiers such as the IMEI are particularly useful; email addresses link to persons; IP addresses sometimes link to a single computer or device, sometimes to several, and can also be mapped to locations through the process of geolocation.

So that very briefly introduces what tracking is: simply the ability to correlate and collate sets of data together. The next step is to perform specific analyses over that data and to map those results back to the business and customer.

Really teaching privacy

I'm often called to give tutorials on privacy - usually as part of the audit of some system. It is clear to me that our friends on the legal and consumer advocacy side of things have a monopoly on educating our developers. However, there is quite a gulf between these communities and the architects, designers, developers and programmers who have to build systems conforming to privacy requirements.

So, I'm going to write a small set of tutorials focusing on various aspects of privacy from a more technical perspective.

I've written before on what we need to cover, at least from an academic perspective; now, after much work at the coal face with R&D teams, it is time to actually get some of this written down.

Watch this space...

Friday, 4 January 2013

Nouvelle cuisine meets big data?

After the post Christmas/holidays eating binge and the inevitable New Year's resolution of dieting, it seems like this article that appears in the Visual Business Intelligence blog by Stephen Few is more than a little apt.

An analogy is made between the fast food and slow food movements (hence the 'apt' earlier), and the article presents the argument that taking a much more measured, "slower" approach to data collection is the way forward.

As the slow food movement emphasizes the preparation, cooking, eating and enjoyment of food (as opposed to the idea of fast food), slow data emphasizes the same when collecting, storing, processing and analysing data. Given the rush to collect as much data as possible, often with scant regard to its content, identification and value, I can see the appeal of this. Regarding this latter point, an article in The Register - Craptastic analysis turns 2.8 zettabytes of Big Data into 2.8 ZB of FAIL - explains this dramatically.

If we take a moment to really think about what we are gathering, why we are gathering it and what we intend to do with the data - other than attempting to construct the perfect advertisement - then we will see that we actually need very little data. This is almost the antithesis of the 'capture and collect everything everywhere just in case we might need it later' approach.

And then, with that carefully chosen, semantically well-defined data, we are able to process, analyse and enjoy the results of our analysis. The enjoyment of data here is what is understood from the data, and that this understanding is more relevant and useful to our business and to our customers.

Taking a slow approach to digesting our data has a number of other side-effects: the amount of data we store will be smaller, the analytical infrastructure required will be smaller, and the value to the consumer (and the business) will be significantly greater, as we will not have to sift through billions of uninteresting and irrelevant data-points. The effects upon areas such as privacy should be self-explanatory... indeed, isn't this the true goal of the privacy advocates?

A slow data approach might just solve many of the issues we see with data: semantics, isolation, privacy, data storage, analytics, just to mention a few.

Indeed as the slow food movement has as its objectives to enjoy food, so slow data might just be the way through which we appreciate the information, knowledge and wisdom in our big data.

Are we in effect embodying the ideals of nouvelle cuisine as applied to data? A rejection of excessive complication, reduced processing to preserve the natural information content, the freshest and best possible ingredients, smaller data sets, modern processing techniques, and innovation and invention as drivers of the data collection (rather than things that might happen if we collect data) - the analogies between slow food, nouvelle cuisine and slow data are abundant.

Food for thought...almost literally.