Saturday, 19 July 2014

A Privacy Engineer's Bookshelf

There's a huge amount of material already in existence about privacy, software engineering and related fields. So what, at a minimum, should every privacy engineer have on his or her bookshelf? Here are my suggestions (I might be biased in some cases), which I think everyone working in privacy should know about.
The reasoning behind the above is that entering the privacy engineering field one needs a good cross-section and balance: from understanding the legal and ethical foundations of privacy (Nissenbaum, Solove), through the software engineering process (Dennedy et al), to the actual task of modelling and analysing the system (Oliver). Schneier's book is included to provide a good perspective on the major protection technology of encryption.

Of course this does not preclude other material nor a thorough search through the privacy literature, conferences and academic publications.

To be a privacy engineer really does mean engineering - specifically, system and software engineering skills.


Monday, 14 July 2014

Final Proof...

And so it starts...the final proof before publication...

Privacy Engineering - a Data Flow and Ontological Approach

Friday, 11 July 2014

Privacy Engineering Book Update

Well, it has been a while since I posted last, and that's primarily because I've been making the push to finalise the book. I think it was Hemingway who said that a book is never truly finished but reaches a state where it can be abandoned...

Well, this is very much the same. I'm happy with the draft; it contains the story I want to tell so far, in as much detail as I can put into it at the moment. That isn't to say many chapters couldn't have gone much further, but there have to be compromises in content. If the book provides enough to tie these areas of ontology, data flow, requirements etc together, and gets the reader to a state where they can see that structure and use the references to move deeper into the subject, then it will have been a success.

I'll write more when I finally send the book for official publication next week.

But, what a journey...just like a PhD but without all the fun of being a student again :-)

www.facebook.com/privacyengineering
www.privacyengineeringbook.net


Monday, 9 June 2014

Word Clouds

Just a bit of fun, but also quite nice to see an overall idea of what I write about on this blog. Generated by Wordle:


So it seems I'm seriously interested in engineering, privacy (privacy engineering too!), data, technologies, analytics and so on.

Friday, 6 June 2014

Privacy Engineering - The Book ... real soon now, I promise

Final push to complete the draft. Many, many thanks to all who have provided numerous comments... We're probably looking at late July, after the editorial process and the production of the first proof copy.


Ian Oliver (2014) Privacy Engineering - A Dataflow and Ontological Approach. ISBN 978-1497569713

Official Website: www.privacyengineeringbook.net
Facebook:  facebook.com/privacyengineering


Tuesday, 3 June 2014

Privacy SIG @ EIT Labs

Yesterday I was fortunate enough to be given the chance to speak at the founding of a Privacy Special Interest Group (facilitated by EIT ICT Labs) on the subject of privacy engineering and some of the technologies and areas that will make up the future of privacy engineering technologies.

The presentation is below (via SlideShare):


The PrivacySIG group's charter is simply:
The Privacy Special Interest Group is a non profit organisation consisting of companies which are developing or involved in the next generation of visitor analytics. We work hard to ensure we can build a future where everybody can benefit from the new technologies available. The Privacy Special Interest Group has developed and maintains a common "Code of Conduct" which is an agreement between all members to follow common rules to ensure and improve the privacy of individuals. We also work on educating our customers, the media and the general public about the possibilities and limitations of the new technology. We also maintain a common opt-out list to make it easy for anyone who wishes to opt-out in one step, this list is used by all our members. Any company who agrees to follow the code of conducts is qualified to join.
This is certainly a worthwhile initiative and one that really has taken the need for an engineering approach to privacy as part of its ethos.

Wednesday, 28 May 2014

How much data?!?!

I took part as one of the speakers in a presentation about analytics today, explaining how data is collected through the instrumentation of applications, web pages etc, to an audience not familiar with the intricacies of data collection and analytics.

We had a brief discussion about identifiers and what identifiers actually are, which was enlightening and hopefully will have prevented a few errors later on. This bears explaining briefly: an identifier is rarely a single field, but should be considered to be any subset of fields of the whole record. There are caveats of course - some fields can't be used as part of a compound identifier - but the point here was to emphasise that you need to examine the whole record, not just individual fields in isolation.
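To make the point concrete, here's a small Python sketch (toy data, hypothetical field names) that searches a record set for the field subsets that act as identifiers. Note that no single field is identifying here, yet the record as a whole is:

```python
from itertools import combinations

# Toy records with hypothetical fields: no single field identifies anyone.
records = [
    {"zip": "00100", "age": 34, "sex": "F"},
    {"zip": "00100", "age": 34, "sex": "M"},
    {"zip": "00200", "age": 34, "sex": "F"},
    {"zip": "00100", "age": 51, "sex": "F"},
]

def identifying_subsets(records, fields):
    """Return the field subsets whose value combinations are unique per record."""
    found = []
    for r in range(1, len(fields) + 1):
        for subset in combinations(fields, r):
            values = [tuple(rec[f] for f in subset) for rec in records]
            if len(set(values)) == len(values):
                found.append(subset)
    return found

# Only the full record acts as an identifier in this toy set.
print(identifying_subsets(records, ["zip", "age", "sex"]))
# → [('zip', 'age', 'sex')]
```

This is exactly why fields must be examined in combination: each column on its own looks harmless.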

The bulk of the talk however introduced where data comes from. For example, if we instrument an application such that a particular action is collected, then we're not just collecting an instance of that action but also whatever contextual data is provided by the instrumentation, plus the data from the traffic or transport layer. It came as a surprise to many that there is so much information available via the transport/traffic layers:

Said meta-data includes location, device/application/session identifiers, browser and environment details and so on, and so on...

Furthermore, data can be cross-referenced with other data after collection. A canonical example is geolocation over IP addresses to provide information about location. Consider the case where a user switches off the location services on his or her mobile device; location can still be inferred later in the analytics process to a surprisingly high level of accuracy.
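As a toy illustration of the principle (the prefix table and addresses below are invented, using RFC 5737 test ranges; real geolocation services work from far finer-grained databases), location can be recovered from nothing more than the transport layer's IP address:

```python
# Toy IP-prefix → location table; real IP-geolocation services work on the
# same principle with far larger and finer-grained databases.
GEO_TABLE = {
    "192.0.2.": "Helsinki, FI",
    "198.51.100.": "Berlin, DE",
}

def geolocate(ip: str) -> str:
    """Longest-prefix match of an IP address against the location table."""
    for prefix, place in sorted(GEO_TABLE.items(),
                                key=lambda kv: len(kv[0]), reverse=True):
        if ip.startswith(prefix):
            return place
    return "unknown"

# Location services are off, but the transport layer still supplies an address:
print(geolocate("192.0.2.44"))  # → Helsinki, FI
```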

If data is collected over time then, even though we are not collecting specific latitude-longitude coordinates, we are collecting data about the movements of a single, unique human being - even though no 'explicit' location collection seems to be being made. If you find that somewhat disturbing, consider what happens every time you pay with a credit card or use a store card.

Then of course there's the whole anonymisation process where once again we have to take into consideration not just what an identifier is, but the semantics of the data, the granularity etc. Only then can we obtain an anonymous data set. Such a data set can be shared publicly...or maybe not as we saw in a previous posting.  

Even when one starts tokenising and suppressing fields, the k-anonymity remains remarkably low, typically with more than 70% of the records remaining unique within that dataset. This is notwithstanding arguments about the usefulness of k-anonymity - on the other hand, it is one of the few privacy metrics we have.
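A rough sketch of how such a figure is computed - here over a toy five-record dataset (invented fields and values) where the k-anonymity across the quasi-identifiers is 1 and 60% of records are unique:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return (k, fraction_unique): k is the size of the smallest equivalence
    class over the quasi-identifier columns; a record in a class of size 1
    is unique within the dataset."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    k = min(classes.values())
    unique = sum(c for c in classes.values() if c == 1)
    return k, unique / len(records)

# Names tokenised away, yet most rows are still unique on the remaining fields.
records = [
    {"zip": "00100", "age": 34}, {"zip": "00100", "age": 34},
    {"zip": "00210", "age": 41}, {"zip": "00330", "age": 29},
    {"zip": "00440", "age": 58},
]
print(k_anonymity(records, ["zip", "age"]))  # → (1, 0.6)
```

Generalising fields (e.g. truncating the zip code or bucketing ages) is what raises k; suppressing a name column alone does very little.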

So the lesson here is rather simple: you're collecting massively more than you really think.

The next surprise was how tricky or "interesting" this becomes when developing a privacy policy that contains all the necessary details about data collection, meta-data collection and traffic data collection; and then the uses to which that data is put, whether the collection is primary or secondary, and so on.

Friday, 23 May 2014

Surgical privacy: Information Handling in an Infectious Environment

What has privacy engineering, data flow modelling and analysis got to do with how infectious materials and the sterile field are handled in medical situations? Are there things we can learn by drawing an analogy between these seemingly different fields?


We've discussed this subject earlier and a few links can be found here. Indeed privacy engineering has a lot to learn from analogous environments such as aviation, medicine, anaesthesia, chemical engineering and so on; the commonality here is that those environments understood they had to take a whole systems approach rather than relying upon a top-down driven approach or relying upon embedding the semantics of the area in one selected discipline.

Tuesday, 20 May 2014

Foundations of Privacy - Yet Another Idea

While I was talking with a colleague about yesterday's post on "a" foundation for privacy, or privacy engineering, he complained that the model wasn't complete. Of course the structuring is just one possible manifestation, and others can be put together to take other views into consideration, or to provide a semantics of privacy in differing domains. For example, complete with semantic gaps, we might have a model which presents privacy law and policies in terms of economic theory, which in turn is grounded in mathematics:


Then place the two models side-by-side and "carve" along the various tools, structures, theories etc that each uses and note the commonalities and differences, and then try to reconcile those.


The real challenge here is to decompose each of those areas into the theories, tools etc that are required to properly express each level. Then, for each of the areas listed earlier - e.g. type theory, programming, data flow, entropy etc - map each of these together. For example, a privacy policy might talk about anonymity, and in turn the anonymity of a data set can be given a semantics in terms of entropy.

Actually, this is where the real details are embedded; the levels as we have depicted them are vague, fuzzy classifications for the convenience of grouping these together.

Monday, 19 May 2014

Foundations of Privacy - Another Idea

This got triggered by a post on LinkedIn about what a degree in privacy might contain. I've certainly thought about this before, at least in terms of software engineering, and even have a whole course that could be taken over a semester ready to go.

Aside: CMU has the "World's First Privacy Engineering Course": a Master of Science in Information Technology—Privacy Engineering (MSIT-PE) degree. So close - yet a major university here in Finland turned down the chance to create something similar a few years back...

That aside, I've been wondering about how to present the various levels of things we need to consider to properly define privacy and put it on strong foundations. In the guise of information theory we already have this, though admittedly Shannon's seminal work from the 1940s is maybe a little too deep. On the other hand, understanding concepts such as channels and entropy is fundamental, so maybe they should be there along with privacy law - now that would make some course!
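For instance, entropy - one of those fundamental building blocks - takes only a few lines of code. The sketch below computes the Shannon entropy, in bits, of a column of values (toy data):

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Two equally likely values carry one bit; eight distinct IDs carry three.
print(shannon_entropy(list("AABB")))    # → 1.0
print(shannon_entropy(list(range(8))))  # → 3.0
```

Applied to the columns of a dataset, this gives a first quantitative handle on how identifying each field is - exactly the kind of bridge between "privacy law" and "mathematics" such a course would need.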

Even just sketching out areas to present and what might be contained therein...how about this, even if a linear map from morality to mathematics is too constraining?



There are missing bits - we still have a semantic gap between the "legal world" and the "engineering world"; parts that I'm hoping things such as the many conferences, academic works and books such as the excellent Privacy Engineer's Manifesto and Privacy Engineering will play a role in defining. Maybe the semantic gap goes away once we start looking at this...is there even a semantic gap?

However, imagine for a moment starting anywhere in this stack and working up and down and keeping everything linked together in the context of privacy and information security. Imagine seeing the link between EU privacy laws and type theory, or between the construction of policies and entropy, the algebra of HIPAA, a side course in homotopy type theory and privacy...maybe with that last one I'm getting carried away, but, this is exactly what we need to have in place.

Each layer provides the semantics to the layer above - what do our morals and ethics mean in terms of formalised laws, what do laws mean in terms of policies, what do policies mean in terms of software engineering structures, and so on down to the core mathematics and algebras of information.

Privacy and privacy engineering in particular almost has everything: law, algebra, morals, ethics, semantics, policy, software, entropy, information, data, BigData, Semantic Web etc etc etc. Furthermore, we have links to areas such as security, cryptography, economic theory etc!

Aren't these the very things any practitioner of privacy (engineering) should know, or at least have knowledge of? Imagine if lawyers understood information theory and semantics, and, software engineers understood law? 

OK, so there might be various ways of putting this stack together, competing theories of privacy etc, but that would be the real beauty here - a complete theory of privacy from the core mathematics through physics, computation, type theory, software engineering, policies, law and even ethics and morals.

But again, no more naivety, no more terminological or ontological confusions, policies and laws being traceable right down to the computation structures and code. Quite a tall order, but such a course bringing all these together really would be wonderful...

And wouldn't that be something!

An Access Control Paradox

The canonical case for data flow and privacy is collecting some data from a set of identifiable individuals and generating insights (formerly called reports) about them. In order to protect privacy we will apply the necessary security and access controls, and anonymisation of log files as necessary.

Let's consider the case where we generate a number of reports, and order them according to some metric of their information content - specifically, how easy or possible it is to re-identify the original sources.

Consider the system below: we collect from a user their user ID, device ID and location - this is some kind of tracking application or, for that matter, any kind of application we typically have on our mobile devices, e.g. something for social media, photo sharing etc...




We've taken the necessary precautions for privacy - we'll assume notice and consent are given - in that the user's data is passed into our system using a secure channel. Processing of this data takes place and we generate two reports:
  1. The first containing specific data about the user
  2. The second using some anonymous ID associated with certain event data for logging purposes only. This report is very obviously anonymous!
For additional security purposes we'll even restrict access to the former because it contains PII - while the second, being anonymous, doesn't need such protection.

In many cases this is considered sufficient - we have the notice and consent and all the necessary access controls and channel security. Protecting the report or file with the sensitive data in it is a given. But the less sensitive data is often forgotten in all of this:
  • How is the identifier generated?
  • How granular is the time stamp?
  • What does the "event" actually contain?
  • Who has access?
  • How is this all secured?
Is the identifier some compound of data, hashed and salted? For example:

     salt = "thesystem"
     id = sha256(deviceId + userid + salt)

This would at least allow analysis over unique user+device combinations; and the salt, if specific to this logfile or system, restricts matching to this log file only - assuming, of course, that the salt isn't known outside of here.

The timestamp is of less importance, but if of sufficiently coarse granularity it would prevent the sequencing of events.

The contents of the event are always interesting - what data is stored there? What needs to be, and how? If this is some debug log then there's probably just as much here as in the report containing the PII. Often it might just be stack traces (with or without parameters) or memory dumps - both of which contain interesting data, even if just a pointer to where a weakness in the system might exist.

Now come the questions of who has access and how this is secured. Given that such a report has interesting content, shouldn't it be as secure as the report containing specific and identifiable user data? If there's some shared common knowledge, could rainbow tables of hashes etc be constructed?

Consider this situation:



Here two separate systems exist, but there is a common path between them which can be exploited, because access control wasn't considered necessary for such "low grade", non-personal data.

Any common path is the precursor to de-anonymisation of data.
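A toy illustration of that precursor: two "anonymous" logs whose identifiers were derived in the same way (the IDs, events and field names below are invented) can simply be joined on the common field:

```python
# Two "anonymous" logs from separate systems. Because both derived their IDs
# the same way (shared salt or common knowledge), the ID becomes a join key.
ad_log = [
    {"id": "h1", "event": "clicked_ad"},
    {"id": "h2", "event": "clicked_ad"},
]
location_log = [
    {"id": "h1", "cell": "helsinki-03"},
    {"id": "h3", "cell": "espoo-11"},
]

# Joining on the common field links behaviour to location across systems.
linked = [{**a, **b} for a in ad_log for b in location_log
          if a["id"] == b["id"]]
print(linked)  # → [{'id': 'h1', 'event': 'clicked_ad', 'cell': 'helsinki-03'}]
```

Neither log is personal data on its own; the privacy failure appears only at the join.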

This might seem a rather trivial situation, except that such shared access and common knowledge of things such as salts, keys etc exists in most companies, large and small - and in the latter it is often hard to avoid. Mechanisms such as employee contracts and awareness training actually do very little to solve this problem, as they aren't designed to address, or even understand, it.

And here lies the paradox of access control: while we guard reports, files, datasets containing PII, we fail to address the same when working with anonymous data - whatever anonymous means.

Monday, 12 May 2014

Privacy and Big Data in Medicine

A short article by me on the subject of privacy in medicine has just been published in the web magazine Britain's Nurses. It was quite an experience writing for a very different audience than software engineers, but extremely interesting to note the similarities between the domains.

When it comes to privacy, one of the seemingly infinite problems we face is how to develop the techniques, tools and technologies in our respective domains. Here again we have the choice of reinventing the wheel or looking to different domains and use their knowledge and experiences. This latter route is the much preferred but rarely taken.

So for the moment, I'll take the chance to look back on previous articles that draw lessons from other domains:
Domains such as medicine, civil engineering and especially aviation have been through this process; and as the value of information rises - that is, as the economic effects of a data breach or of a loss of consumer confidence reach levels where companies will figuratively crash - so grows the need to take in these lessons and treat information handling like any other element of a safety-critical system.

Finally the article I mentioned: Privacy in Digital Health, 12 May 2014, Britain's Nurses

Thursday, 8 May 2014

Checklists and Design by Contract

One of the problems I am having with checklists is that they are often, or nearly always, confused with processes: a "this is the list of steps we have to do, and then you tick them off and all is well" mentality. This is probably why in some cases checklists have been renamed "aides-mémoire" [1], and why their use and implementation is so misunderstood.

In the case of aviation or surgical checklists, these do not signify whether it is "safe" to take off or start whatever procedure, but act as a reminder to the practitioner and supporting team that they have reached a place where they need to check on their status and progress. The decision to go or no-go is not the remit of the checklist. For example, once a checklist is complete a pilot is free to choose whether to take off or not, irrespective of the answers given to the items on the checklist (cf: [2]).

This got me thinking that there are some similarities to design-by-contract, and that this could possibly be used to explain checklists better. For example, consider the function to take off (written in pseudo-Eiffel fragments [3]):

     take-off
         -- get the throttle position, brake status etc and spool-up engines
     do
        ....
     end
...

can be called at any time - there is no restriction - and this is how it was, until an aircraft crash in the 1930s triggered the development of checklists in aviation. So now we have:

     take-off
         -- get the throttle position, brake status etc and spool-up engines
     require
          checklist_complete = True
     do
        ....
     end
...

and in more modern aircraft this is supplemented by features to specifically check on the aircraft status

     take-off
         -- get the throttle position, brake status etc and spool-up engines
     require
          checklist_complete = True
     do
        if flaps < 10 then
            soundAlarm
        end
        ....
     end
...

or even:

     take-off
         -- get the throttle position, brake status etc and spool-up engines
     require
          checklist_complete = True
          flaps > 10
          mode = GroundMode
     do       
        ....
     ensure
          mode = FlightMode
     end
...

What you actually see are specific checks from the checklist being incorporated into the basic protection mechanisms of the aircraft's functionality. This is analogous to what we might see in a process; for example, below we can see functionality that encodes a project approval checklist into some approval function:

    approveProject
      require
         securityReview.status = Completed
         privacyReview.status = Completed
         continuityReview.status = Completed
         performanceReview.status = Completed
         architecturalReview.status = Completed
      do
         ...

Now we have said nothing about how the particular reviews were actually made, or whether the quality of their results was sufficient. This brings us to the next question: the qualitative part of a checklist, and deciding what to expose. Here we have three options:

  1. completion
  2. warnings
  3. show stopping preconditions

The first is as explained above; the second and third offer us a choice about how we expose and act upon the information gained through the checklist. Consider a privacy or information content review of a system: we would hope that certain aspects are strictly required, while others are just warnings:

    approveProject
      require
         ...
         privacyReview.status = Completed
         privacyReview.pciData = False
         privacyReview.healthData = False
         ...
      do
         if privacyReview.dataFlowModelComplete = False then warn("Incomplete DFDs!") end
         ...

And we can get even more complex and expose more of the checklist contents as necessary.

The main point here is that if we draw an analogy with programming, some aspects of checklists can be more easily explained. Firstly the basic checklist maxim is:

All the items on a checklist MUST be checked.

Then we should be in a position to make a decision based on the following "procedure":

  1. Are all individual items in their respective parameter boundaries?
  2. Are all the parameters taken as a whole indicating that we are in a state that is considered to be within our definition of "safe" to proceed to the next state?
  3. Final question: Go or No-Go based on what we know from the two questions above?
Of course, we have glossed over some of the practical implementation and cultural aspects, such as teamwork, decision making and cross-referencing, but what we have described is some of the philosophy and implementation of checklists in a context more familiar to some: programming.
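As a sketch of how the require idea above might look in a language without native contracts - here Python, with an illustrative decorator standing in for Eiffel's require clause (the review names and statuses are invented):

```python
class ChecklistError(Exception):
    """Raised when a precondition of the approval checklist fails."""

def require(*checks):
    """A minimal stand-in for Eiffel's `require` clause: each check is a
    (description, predicate) pair evaluated against the call's argument."""
    def decorator(fn):
        def wrapper(project):
            for description, predicate in checks:
                if not predicate(project):
                    raise ChecklistError(f"No-go: {description}")
            return fn(project)
        return wrapper
    return decorator

@require(
    ("security review completed", lambda p: p["security_review"] == "Completed"),
    ("privacy review completed",  lambda p: p["privacy_review"] == "Completed"),
)
def approve_project(project):
    # The checklist says nothing about review quality - only that we checked.
    return "approved"

print(approve_project({"security_review": "Completed",
                       "privacy_review": "Completed"}))  # → approved
```

As with the aircraft example, the contract only enforces that every item was checked; the qualitative go/no-go judgement remains with the people running the process.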


References

[1] Great Ormond Street Hospital did this according to one BBC (I think) documentary.
[2] Spanair Flight 5022 

Tuesday, 15 April 2014

PbD, The Privacy Engineer's Manifesto and Privacy Engineering

I've had quite a bit of time to re-review the relationship between the foundational principles of PbD, the excellent book The Privacy Engineer's Manifesto and my Privacy Engineering book. To me this is how it looks, and finally I think we're starting to see a proper balance between them.

The seven foundational principles of Privacy by Design are well known throughout the privacy community, and together they stand as an ideal focus for the development of privacy over our information systems, much as the Agile Manifesto did for software development processes.
  1.  Proactive not Reactive; Preventative not Remedial
  2.  Privacy as the Default Setting
  3.  Privacy Embedded into Design
  4.  Full Functionality – Positive-Sum, not Zero-Sum
  5.  End-to-End Security – Full Lifecycle Protection
  6.  Visibility and Transparency – Keep it Open
  7.  Respect for User Privacy – Keep it User-Centric
As time has shown, misunderstanding and incorrectly applying the principles of the Agile Manifesto has led to severe development problems and technical debt.


One only needs to look at the modern application of the term agile to understand that its original meaning has in many cases been lost; such is the danger facing the principles of Privacy by Design, and even now statements such as 'We Follow PbD Principles' abound without any underpinning or engineering understanding of those principles in either code or process.

To move forward we must precisely understand how these principles can be integrated not just into policies, but into engineering requirements, design requirements, test cases, software development processes, analysis tools, development tools and even the very psyche of software engineering. Efforts such as the Privacy Engineer's Manifesto take the first step in addressing these aspects and their relationship to PbD.

However, working from a purely top-down perspective does not solve all problems; one needs to work simultaneously bottom-up from basic engineering and deeper theoretical perspectives, and ensure that both directions of thought complement and balance each other and produce a consistent whole. We take the bottom-up approach here and do not attempt to define precise processes, but rather present ontologies, structures and tools which can be adapted as local development practices require and dictate.