Sunday, 24 November 2013

Privacy, Evidence Trails and a Change in Terminology?

One of the main tenets of personal [information] privacy is that other parties should neither collect nor analyse your data. The trouble is that this argument is often made in isolation: it assumes that the acts we perform by computer exist in a place where we can hide. For example, what someone does behind closed doors usually remains private. But if that act is performed in a public place, say, in the middle of the street, then by default whatever is done is not private - even if we hoped no-one saw.

Anything and everything we do on the internet is in public by default. When we perform things in public, then other people may or will see, find out and perform their own analysis to form a profile of you.

Many privacy enhancing technologies are akin to standing in the middle of a busy street and shouting "don't look!". Even if everyone looks away, more often than not there is a whole raft of other evidence to show what you've been doing.

Admittedly, most of the time nobody really cares or is actually looking in the first place. Though, as was found out recently (and this really isn't a surprise), some, such as the NSA and GCHQ, are continually watching. Even the advertisers don't really care that much; their main interest is trying to categorise you to ship a generic advertisement - and advertisers are often really easy to game...

If we really do want privacy on the internet then rather than concentrating on how to be private (or pretending that we are), we need to concentrate on how to reduce the evidence trail that we leave. Such evidence is in the form of web logs, search queries, location traces from your navigator, tweets, Facebook postings etc.

Once we have understood what crumbs of evidence are being left, we can start exploring all the side avenues where data flows (leaks) and the points where data can be extracted surreptitiously. We can also examine what data we do want released, or have no choice about.

At this moment, I don't really see a good debate about this, at least not at a technical level though there are some great tools such as Ghostery that assist in this. Certainly there is little discussion at a fundamental level which would really help us define what privacy really is.

I personally tend to take the view at the moment that privacy might even be the wrong term, or at best, a somewhat misleading term.

On the internet every detail of what we do is potentially public and can be used for good as well as evil (whatever those terms actually mean). Our job as privacy professionals is to make that journey as safe as possible, hence the use of the term "information safety" to better describe what we do.

Friday, 22 November 2013

Semiotic Analysis of a Privacy Policy

Pretty much every service has a privacy policy attached to it; these policies state the data collection, usage, purpose and expectations that the customer has to agree with before using that service. But at another level they also attempt to signify (that's going to be a key word) that the consumer can trust the company providing the service at some level. Ok, so there's been huge amounts of press about certain social media and search service providers "abusing" this trust and so on, but we still use the services provided by those companies.

So this gets me thinking: when a privacy policy is written, could we analyse that text to understand better the motives and expectations from both the customer and the service provider perspective? Effectively, can we make a semiotic analysis of a privacy policy?

What would we gain by this? It is imperative that any texts of this nature portray the right image to the consumer, thus this can be used in the drafting of such a text to ensure that this right image is correctly portrayed. For example, the oft seen statement:

"Your privacy is important to us"

is a sign in the semiotic sense, and in this case probably an 'icon' in its near universal usage. Signs are a relationship between the 'object' and the 'interpretant', respectively the subject matter at hand and the clarified meaning.

Peirce's Semiotic Triangle
The object may be that we (the writer of the statement) are trying to state a matter of fact about how trustworthy we are, or at least we want to emphasise that we can be trusted.

The interpretant of this, if we are the customer, can of course vary from total trust to utter cynicism. I guess of late the latter interpretation tends to be true. Understanding the variation in interpretants is a clear method for understanding what is being conveyed by the policy itself and whether the right impression is being given to the consumer.

At a very granular level the whole policy itself is a sign, as are the very existence of that policy and its structure: is it long and full of legalese, or short and simple? Then there's the content (as described above) which may or may not depend upon the size of the in the World's Worst Privacy Policy.



Found this paper while researching for this: Philippe Codognet's THE SEMIOTICS OF THE WEB, it starts with a quote:
I am not sure that the web weaved by Persephone in this Orphic tale, cited in exergue of Michel Serres’ La communication , is what we are currently used to call the World Wide Web. Our computer web on the internet is nevertheless akin Persephone’s in its aims : representing and covering the entire universe. Our learned ignorance is conceiving an infinite virtual world whose center is everywhere and circumference nowhere ...
Must admit, I find that very, very nice. Best I've got is getting quotes about existential crises and cosmological structures in a paper about the Semantic Web with Ora Lassila.

Tuesday, 19 November 2013

Losing Situational Awareness in Software Engineering

The cause of many aircraft accidents is attributed to loss of situational awareness. The reasons for such a situation are generally due to high workload, confusing or incorrect data and misinterpretation of data by the pilot.

The common manifestation of this is mode confusion where the pilot's expectation for a given situation (eg: approach) does not match the actual situation. This is often seen (though not exclusively) in highly automated environments, leading to the oft repeated anecdote:

A novice pilot will exclaim, "Hey! What's it doing now?", whereas an experienced pilot will exclaim, "Hey! It's doing that again!"

Aside: while this seems to be applied to Airbus’ FBW models, the above can be traced back to the American Airlines Children of the Magenta tutorial and specifically refers to the decidedly non-FBW Boeing 757/767 and Airbus A300 models, and not the modern computerized machines… [1]

This obviously isn't restricted to aircraft but applies to many other environments; for example, consider the situation in a car when cruise control is deactivated due to an external situation not perceived by the driver. Is the car slowing due to road conditions, braking etc? Now consider the case where you are already expecting the car to slow down, so that the deactivation has the same effect as the one you expected; this type of loss of situational awareness was seen in the Turkish 737 accident at Schiphol.

In an auditing environment where we obtain a view horizontally across a company we too suffer from loss of situational awareness. In agile, fast environments with many simultaneous projects, we often fail to see the interactions between those projects.

Yet, understanding these links in non-functional areas such as security and privacy is absolutely critical to a clear and consistent application of policy and decisions.

A complicating factor we have seen is that projects change names, and components of projects and data-sets are reused both dynamically and statically, leading to identity confusion. Systems reside locally, in the cloud and elsewhere in the Universe; terminology is stretched to illogical extremes, big data and agile being two examples of this. Simplicity is considered a weakness and complexity a sign of the hero developer and manager.

In systems with safety-critical properties heroes are a bad thing.

In today's so-called agile, uber-innovative, risk-taking, fail fast, fail often, continuous deployment and development environments we are missing the very basic and I guess old fashioned exercise of communicating rigorously and simply what we're doing and reusing material that already exists.

Fail often, fail fast, keep making the same mistakes and grow that technical debt.

We need to build platforms around shared data, not functionally overlapping, vertical components with the siloed data mentality. This requires formal communication and an emphasis on quality, not on the quantity of rehashing buzzwords from the current zeitgeist.

In major construction projects there is always a coordinator whose job it is to ensure that not only do individual experts communicate (eg: the plumbers, the electricians etc) but that their work is complementary and that one group does not repeat the work, or the base, that another team has put in place.

If software engineers built a house, one team would construct foundations by tearing down the walls that inconveniently stood atop already built foundations, while another would build windows and doors while digging up the foundation as it is being constructed. A further team charged with installing the plumbing and electrics would first endeavor to invent copper, water and electricity...and all together losing awareness of the overall situation.

OK, I'm being hard on my fellow software engineers, but it is critical that we concentrate more on communication, common goals, less "competition" [2] and ensuring that our fellow software engineers are aware of each other's tasks.

As a final example, we will see (and have seen!) situations in big data where analytics is severely compromised because we failed to communicate and decide upon common data models and common catalogs of where and what data exists.

So, like the pilot faced with a situation where he recognises he's losing his situational awareness, drop back down the automation and resort to flying by the old-fashioned methods and raw data.

The next question is, how do you recognise that you're losing situational awareness?

Wednesday, 13 November 2013

What does "unclassified" mean?

Previously we presented a simple security classification system consisting of four levels: secret, confidential, public and unclassified; with the last of these being the most restrictive.

While sounding counter-intuitive, a classification of "unclassified" simply means that no decision has been made. And if no decision on the actual classification has been made then it is possible that in the future that classification might be decided to be "secret", which implies that "unclassified" needs to be at least as strong as the highest explicit classification.

Aside: Fans of Gödel might be wondering how we can talk about classifying something as being unclassified, which in turn is higher in some sense than the highest classification. Simply, "unclassified" is a label which is attached to everything before the formal process of attaching one of the other labels is made.

Let's start with an example: if Alice writes a document then this document by default is unclassified. Typically that document's handling falls under the responsibility of Alice alone.

If Alice gives that document to Bob, then Bob must handle that document according to Alice's specific instructions.

By implication Alice has chosen a particular security classification. There are two choices:

  1. Either an explicit classification is given, eg: secret, confidential, public
  2. Or, no classification is given and Alice remains the authority for instructions on how to handle that document
In the latter case Alice's instructions may be tighter than the highest, explicit classification, which implies that unclassified is more restricting than, say, secret.

If Bob passes the document to Eve (or to the whole company by a reply-all) then we have a data breach. The document never implicitly becomes public through this means; though over time the document might become public knowledge but still remain officially secret. For example, if an employee of a company leaks future product specifications to the media, even though they are now effectively public, the employee (and others) who handled the data would still fall under whatever repercussions leaking secret or confidential data implies.

Still this is awkward to reconcile, so we need more structure here to understand what unclassified and the other classifications mean.

We must therefore appeal to a notion of authority: all documents must have an owner - this is basic document handling. That owner

  1. Either assigns an explicit security classification, and all handlers of that document refer to the standard handling procedures for that security classification: referring to the security classifications standard as the authority
  2. Or, keeps the document as being unclassified and makes themselves the authority for rules on how to handle that document
The latter also comes with the implication that the owner of the document here is also responsible for ensuring that whatever handling rules are implied, these are consistent with the contents of the document. For example, if the document contains sensitive data then in our example, Alice is responsible for ensuring that the rules that come from her authority are at least as strict as the highest implied security classification.

In summary, if a document or data-set is unclassified then the owner of that document is the authority deciding on what the handling rules are, and by default the rules must be at least* as strict as the highest explicit security category.

*In our classification we have the relationship:

Public < Confidential < Secret

with the statement above saying:

Public < Confidential < Secret <= Unclassified

As a final point, if Alice decides that her rules are weaker than say, confidential, but stronger than public, then it makes sense to take the next highest level as the explicit classification, ie: confidential. This way we establish the policy that all documents must eventually be explicitly classified.
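The ordering above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names Classification, minimum_handling_level and round_up are invented here, "unclassified" is modelled as None (no decision made yet), and the idea of expressing the owner's rules as a number on the same scale is a simplification for the example.

```python
from enum import IntEnum
from typing import Optional


class Classification(IntEnum):
    """Explicit classifications, ordered Public < Confidential < Secret."""
    PUBLIC = 1
    CONFIDENTIAL = 2
    SECRET = 3


def minimum_handling_level(label: Optional[Classification]) -> Classification:
    """Return the explicit level that the handling rules must at least match.

    None stands for "unclassified": no decision has been made, so the
    owner's rules must be at least as strict as the highest explicit level.
    """
    if label is None:
        return max(Classification)
    return label


def round_up(strictness: float) -> Optional[Classification]:
    """Map the owner's rules (hypothetical numeric strictness) to the next
    highest explicit level, or None if they are stricter than every level
    and the document therefore stays under the owner's authority."""
    for level in Classification:  # iterates PUBLIC, CONFIDENTIAL, SECRET
        if strictness <= level:
            return level
    return None


# Alice's rules sit between public (1) and confidential (2).
print(round_up(1.5).name)
```

Modelling "unclassified" as the absence of a label, rather than as a fourth enum member, keeps the point that it is not a decision but the lack of one.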

Wednesday, 6 November 2013

Classifying Data Flow Processes

In previous posts (here, here, here and especially here) we presented various aspects for classifying data nominally being sent over data flows or channels between processes. Now we turn to the processes themselves and tie the classification of channels together.

Consider the data flow shown here (from the earlier article on measurement, from where we take the colour scheme of the flows) where we show the movement of, say, location information:

Notice how data flows via a process called "Anonymisation" (whatever that word actually means!) to the advertiser. During the anonymisation the location data is cleaned to a degree that business or our privacy policy allows - such boundaries are very context dependent.

This gives the first type of process, one that reduces the information content.

The other type of processes we see are those that filter data and those that combine or cross-reference data.

Filtering is where data is extracted from an opaque or more complex set of information into something simpler. For example if we feed some content, eg: a picture, then we might have processes that filter out the location data from said pictures.

Cross-referencing means that two or more channels are combined to produce a third channel containing information from the first two. A good example of this is geo-locating IP addresses which takes in as one source an IP address and another a geographical location look-up table. Consider the example below:

which combines data from secondary and primary sources to produce reports for billing, business etc.

In the above case we probably wish to investigate the whole set of processes that are actually taking place and decompose the process considerably.

When combined with the classifications on the channels, particularly the information and security classes, we can do some substantial reasoning. For example, if there is a mismatch in information content and/or security classifications then we have problems; similarly if some of these are transported over insecure media.
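As a rough sketch of that kind of reasoning, the Python below flags two things for investigation: an output channel classified lower than its most sensitive input, and classified data travelling over insecure media. The names (Channel, ProcessKind, check_process) and the rule that only an anonymising process may legitimately lower a classification are assumptions made for this example, not part of the scheme described here.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum
from typing import List


class Security(IntEnum):
    PUBLIC = 1
    CONFIDENTIAL = 2
    SECRET = 3


class ProcessKind(Enum):
    ANONYMISING = "anonymising"
    FILTERING = "filtering"
    CROSS_REFERENCING = "cross-referencing"


@dataclass(frozen=True)
class Channel:
    name: str
    security: Security
    secure_transport: bool


def check_process(kind: ProcessKind, inputs: List[Channel], output: Channel) -> List[str]:
    """Return findings that need investigation for one process."""
    findings = []
    highest_input = max(ch.security for ch in inputs)
    # Assumption: only anonymisation is expected to lower a classification.
    if kind is not ProcessKind.ANONYMISING and output.security < highest_input:
        findings.append(
            f"{output.name}: classified {output.security.name}, "
            f"below input level {highest_input.name}"
        )
    # Anything above PUBLIC should not travel over insecure media.
    for ch in [*inputs, output]:
        if ch.security > Security.PUBLIC and not ch.secure_transport:
            findings.append(f"{ch.name}: {ch.security.name} data over insecure media")
    return findings


# Geo-locating IP addresses: an IP channel cross-referenced with a look-up table.
ip_addresses = Channel("ip_addresses", Security.CONFIDENTIAL, secure_transport=True)
geo_table = Channel("geo_lookup_table", Security.PUBLIC, secure_transport=True)
located_ips = Channel("located_ips", Security.PUBLIC, secure_transport=False)

findings = check_process(
    ProcessKind.CROSS_REFERENCING, [ip_addresses, geo_table], located_ips
)
for finding in findings:
    print(finding)
```

A finding here is a prompt to look closer, not a verdict; a filtering process, for instance, may have a perfectly good reason to produce a less sensitive output.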

To summarise, in earlier articles we explained how data itself may be classified, and here how processes may be classified according to a simple scheme:

  • Anonymising
  • Filtering
  • Cross-Referencing

In a later article I'll go more into the decomposition and refinement of channels and processes.

Friday, 1 November 2013

Measurement and Metrics for Information Privacy

We have already discussed an analogy for understanding and implicitly measuring the information content over a data channel. The idea that information is an "infectious agent" is quite a powerful analogy in the sense that it allows us to understand better the consequences of information processing and the distribution of that data, viz:
  • Any body of data which contains certain amounts and kinds of sensitive data we can consider to be non-sterile
  • Information which is truly anonymous is sterile
  • Mixing two sets of information produces a single set of new information which is at least as unclean as the dirtiest set of data mixed, and usually more so!
  • The higher the security classification the dirtier the information
Now, using the information classification introduced earlier, we can further refine our understanding and get some kind of metric over the information content.

Let us classify information content into seven basic categories: financial, health, location, personal, time, identifiers and content. Just knowing what kinds of data are present, as we have already discussed, gives us a mechanism to pinpoint where more investigation is required.

We can then go further and pick out particular data flows for further investigation and then map this to some metric of contamination:

For example, transporting location data has a particular set of concerns, enough to make any privacy professional nervous at the least! However if we examine a particular data flow or store we can evaluate what is really happening, for example, transmitting country level data is a lot less invasive than highly accurate latitude and longitude.

Now one might ask why not deal with the accurate data initially? The reasons are that we might not have access to that accurate, field-level data, we might not want to deal with the specifics at a given point in time, specific design decisions might not have been made etc.

Furthermore, for each of the seven categories we can give some "average" weighting and abstract away from specific details which might just complicate any discussion.

Because we have a measure, we can calculate and compare over that measurement. For example, if we have a data channel carrying a number of identifiers (eg: IP, DeviceID, UserID) we can take the maximum of these as being indicative of the sensitivity of the whole channel for that aspect.

We can compare two channels, or two design decisions, for example, a channel carrying an applicationID is less sensitive (or contaminated) than one carrying device identifiers.
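A minimal sketch of this calculation in Python follows. The weights are hypothetical numbers chosen only so that an applicationID scores lower than a device identifier; real values would depend on the context, and the name identifier_sensitivity is invented for the example.

```python
from typing import Iterable

# Hypothetical weights: higher means more sensitive (more "contaminated").
IDENTIFIER_WEIGHTS = {
    "ApplicationID": 3,
    "IP": 6,
    "UserID": 8,
    "DeviceID": 9,
}


def identifier_sensitivity(identifiers: Iterable[str]) -> int:
    """Sensitivity of a channel for the identifier aspect: the maximum
    weight of any identifier it carries, or 0 if it carries none."""
    return max((IDENTIFIER_WEIGHTS[name] for name in identifiers), default=0)


channel_a = ["IP", "DeviceID", "UserID"]
channel_b = ["ApplicationID"]

print(identifier_sensitivity(channel_a))
print(identifier_sensitivity(channel_b) < identifier_sensitivity(channel_a))
```

Taking the maximum rather than a sum means adding a weak identifier to a channel that already carries a strong one does not change the score, which matches the idea that the most sensitive item is indicative of the whole channel.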

We can also construct a vector over the whole channel composed of the seven dimensions above to give a further way of comparing and reasoning about the level of contamination or sensitivity:
  (FIN=0, HLT=0, LOC=10, PER=8, TIM=3, ID=7, CONT=0)
  (FIN=3, HLT=2, LOC=4, PER=4, TIM=2, ID=9, CONT=2)
for some numerical values given to each category. Arriving at these values will be specific to a given wider context and the weighting given to each, but there is one measure which can be used to ground all this: information entropy, or how identifying the contents are of a given, unique human being. A great example of this is given at the EFF's Panopticlick pages.
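To make the comparison concrete, here is a small Python sketch using the two example vectors. The functions mix and no_dirtier_than are assumptions for illustration: mix takes the per-category maximum, which is only the lower bound implied by the rule that mixed data is at least as unclean as the dirtiest input, and no_dirtier_than is a simple category-by-category comparison.

```python
from typing import Dict

CATEGORIES = ("FIN", "HLT", "LOC", "PER", "TIM", "ID", "CONT")

Vector = Dict[str, int]

channel_1: Vector = dict(FIN=0, HLT=0, LOC=10, PER=8, TIM=3, ID=7, CONT=0)
channel_2: Vector = dict(FIN=3, HLT=2, LOC=4, PER=4, TIM=2, ID=9, CONT=2)


def mix(x: Vector, y: Vector) -> Vector:
    """Lower bound on the contamination of two mixed channels:
    each category is at least as dirty as the dirtier input."""
    return {c: max(x[c], y[c]) for c in CATEGORIES}


def no_dirtier_than(x: Vector, y: Vector) -> bool:
    """True if x is no more contaminated than y in every category."""
    return all(x[c] <= y[c] for c in CATEGORIES)


combined = mix(channel_1, channel_2)
print(combined)
# Neither channel dominates the other, but both are covered by the mix.
print(no_dirtier_than(channel_1, channel_2), no_dirtier_than(channel_2, channel_1))
print(no_dirtier_than(channel_1, combined), no_dirtier_than(channel_2, combined))
```

Note that the two example channels are not comparable under this ordering: the first is worse on location, the second on identifiers. That is itself useful information, since it says the design decision between them cannot be settled by the numbers alone.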

We've only spoken about a single data flow at the moment, however the typical scenario is for reasoning over longer flows, for example, we might have our infrastructure set up as below*

In this example we might look at all the instances where AppID and Location are specified together and use a colour coding such that:
  • Black: unknown/irrelevant
  • Red: high degree of contamination, both AppID and Location unhashed and accurate respectively
  • Yellow: some degree of contamination, AppID may be hashed(+salt?) or Location at city level or better
  • Green: AppID randomised over time, hashed, salted and Location at country level or better
Immediately readable from the above flow are our points of concern which need to be investigated, particularly the flows from the analytics processing via the reports storage and to direct marketing. It is also easy to see that there might be a concern with the flow to business usages: what kinds of queries ensure that the flow here is less contaminated than the actual reports storage itself?
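The colour coding can be written down as a small function, sketched here in Python. The enumerations AppId and Location and the function colour are invented for the example, and the reading of the rules is one interpretation: green needs both the strongest AppID treatment and country-level location, red needs both the raw AppID and accurate location, and every other known combination is yellow.

```python
from enum import Enum
from typing import Optional


class AppId(Enum):
    RAW = "unhashed"
    HASHED = "hashed, possibly salted"
    ROTATING = "randomised over time, hashed and salted"


class Location(Enum):
    EXACT = "accurate latitude and longitude"
    CITY = "city level"
    COUNTRY = "country level"


def colour(app_id: Optional[AppId], location: Optional[Location]) -> str:
    """Colour for a flow carrying AppID and Location together.

    None means the treatment is unknown or the item is not relevant.
    """
    if app_id is None or location is None:
        return "black"
    if app_id is AppId.RAW and location is Location.EXACT:
        return "red"
    if app_id is AppId.ROTATING and location is Location.COUNTRY:
        return "green"
    return "yellow"


print(colour(AppId.RAW, Location.EXACT))
print(colour(AppId.HASHED, Location.EXACT))
print(colour(AppId.ROTATING, Location.COUNTRY))
```

Writing the rules out this way forces the ambiguous cases into the open, for example what colour a raw AppID with city-level location should get; here it is yellow, but that is a decision to make explicitly for a real audit.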

There are a number of points we have not yet discussed, such as that some kinds of data can be transformed into a different type. For example some kinds of content such as pictures inherently contain device identifiers, locations etc. Indeed the weighting for some categories such as content might be very much higher than that of identifiers for example - unless the investigation is made. Indeed it does become almost a trivial exercise for some to explicitly hide sensitive information inside opaque data such as images and not declare them when a privacy audit is made.

To summarise, we've a simple mechanism for evaluating information content over a system in quantitative terms, complete with a refinement mechanism that allows us to work at many levels of abstraction depending upon situation and context. Indeed, what we are doing is explicitly formalising and externalising our skills when performing such evaluations, and through analogies such as "infection control" providing means for easing the education of professionals outside of the privacy area.

*This is not indicative of any system real, living or dead, but just an example configuration.