Sunday, 30 September 2012

Grothendieck Biography

Mathematics is full of "characters": Grigori Perelman, Pierre de Fermat, Évariste Galois, Paul Erdős and Andrew Wiles**, to name just a few, each with their own unique, wondrous story of dedication to their mathematical work and life.

Perhaps none exemplifies the mathematician more than Alexander Grothendieck, who since 1991 has lived as a recluse in Andorra. Because his body of work and his contributions to mathematics, particularly category theory and topology, have become almost legendary, a great mystery surrounds him.

In order to understand Grothendieck, and possibly the mind of the mathematician in general, a series of biographies of Grothendieck is being written by Leila Schneps. The current draft and extracts can be found on her pages about this work.

I'll quote a paragraph from Chapter 1 of Volume II that gives a flavour of Grothendieck's work and approach to mathematics:

Taken altogether, Grothendieck’s body of work is perceived as an immense tour de force, an accomplishment of gigantic scope, and also extremely difficult both as research and for the reader, due to the effort necessary to come to a familiar understanding of the highly abstract objects or points of view that he systematically adopts as generalizations of the classical ones. All agree that the thousands of pages of his writings and those of his school, and the dozens and hundreds of new results and new proofs of old results stand as a testimony to the formidable nature of the task.

This is truly a work at a scale of detail a magnitude beyond, say, Simon Singh's fascinating documentation of Wiles' work and proof of Fermat's Last Theorem. Suffice to say, I look forward to reading it. Maybe Simon Singh should make a documentary about Grothendieck?

** I credit Andrew Wiles with inspiring me to study for my PhD back in 1995

Saturday, 29 September 2012

Solar Flare Video

Found this via Phil Plait's amazing Bad Astronomy blog*:

On August 31st the Sun produced an immense solar flare: click here for a picture from NASA with the Earth to scale, or better still just go straight to the 1900x1200 version. I've made a crop of the picture below just to give a teaser:**



Now NASA and the Goddard Space Flight Center have released a video of the event:


Make it full-screen and switch to 1080p, sit back and be impressed....


* You really should read this blog every day
** Using NASA Imagery and Linking to NASA Web Sites

Thursday, 27 September 2012

Teaching Privacy

It often surprises me that many of the people advocating privacy don't actually understand the things they're trying to keep private, specifically information. Indeed, the terms data and information are used interchangeably, and there is often little understanding of the actual nature and semantics of said data and information.

I've run courses on data modelling, formal methods, systems design, semantics and now privacy - the latter, however, always seems to be "a taster of privacy" or a "brief introduction to privacy", and there is rarely the chance to get into specifics about what information is.

This of course has some serious implications, and one of the best examples I can find is when we talk about anonymisation. I've seen horrors such as the statements "if you hash this identifier, then it is anonymous", "if we randomise this data then we can't track", or lately, "if we set this flag to '1' then no-one will track you anymore". In the first case I refer people back to the AOL Data Leak and the dangers of fingerprinting, semantic analysis and simple cross-referencing.
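To see why a bare hash is not anonymisation, here is a minimal Python sketch; the usernames and the attacker's candidate list are invented for illustration:

```python
import hashlib

# A hypothetical "anonymised" log: identifiers replaced by unsalted SHA-256.
hashed_log = [hashlib.sha256(u.encode()).hexdigest()
              for u in ["joeBloggs", "jane123", "arbunkleJr"]]

# An attacker with a list of candidate identifiers (a leaked user list, a
# phone book, a dictionary of common usernames) simply hashes each candidate
# and looks for a match - the "anonymisation" falls apart immediately.
candidates = ["alice", "joeBloggs", "bob", "jane123"]
rainbow = {hashlib.sha256(c.encode()).hexdigest(): c for c in candidates}

recovered = [rainbow.get(h) for h in hashed_log]
print(recovered)  # ['joeBloggs', 'jane123', None]
```

The hash function is consistent by design, so any party who can guess or obtain the input space can invert it by enumeration.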

I made a study a while back based on the leak of 16,000 names from various Finnish education organisations (plus maybe other places). It was very interesting to see how many entries in the released list, which contained dates of birth and last names, were already unique, and, even in the cases of common Finnish names, how easy it was to trace these back to a unique person. Actually going to the next step and verifying this with that person would, I guess, have been somewhat illegal or, if not, unethical to say the least. Social engineering would have been very easy in many of these cases, I'm sure.
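The uniqueness arithmetic behind this is easy to demonstrate; here is a toy Python sketch, with invented (date of birth, surname) pairs standing in for the leaked list:

```python
from collections import Counter

# Toy stand-in for the leaked list: (date of birth, surname) pairs.
records = [
    ("1975-03-12", "Virtanen"),
    ("1975-03-12", "Korhonen"),
    ("1982-07-01", "Virtanen"),
    ("1982-07-01", "Virtanen"),   # two people happen to share this pair
    ("1990-11-23", "Nieminen"),
]

# Count how often each (DOB, surname) combination occurs; any combination
# occurring exactly once already pins down a single individual.
counts = Counter(records)
unique = [r for r in records if counts[r] == 1]
print(f"{len(unique)} of {len(records)} records are already unique")
```

Even with a deliberately common surname, adding one extra attribute (date of birth) makes most records unique - which is exactly what the real leak showed.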

So given cases like these and the current dearth of educational material, I thought it would be nice to try to put together a more comprehensive and deeper set of material. Some universities are already doing this, and there also exist industrial qualifications such as those by the IAPP; however, at this stage all ideas are welcome.

Now I want to specifically address a technical audience: software engineers and computer scientists - the people who end up building these systems - because that's where I feel much breaks down, for many reasons, but I won't apportion blame here; that's not really constructive in the current context.

First of all I want to break things down into three logical segments (actually there are four, but I'll discuss the fourth later):
  • Legal
  • Consumer Advocacy
  • Technical
and address each area individually.

Legal is relatively straightforward: an understanding of the principles of privacy; how various jurisdictions view data, information, anonymisation, cross-referencing, children and minors, cross-border data transfer, retention and data collection; and a discussion of practices in certain places, eg: the EU, US, China, India etc. This discussion doesn't have to be heavy, but an understanding of what the law states and how it interprets things is critical. From here we should also get an understanding of how the law affects the engineering side of things, common terminology being a good example.

Consumer advocacy is really the overview material, in my opinion: what are the principles of privacy - for example Cavoukian's Privacy by Design (even if I'm not happy with the implementation of these) - how do consumers view privacy, what is the reality (say vs do), and also various case studies such as how consumers view Google, Apple, Nokia, Facebook, various governments, technologies such as NFC, mobile devices, 'Smart Televisions', direct marketing and advertising, store cards etc. Out of this comes an understanding of how privacy is viewed and even an appreciation of why we don't get privacy: anti-privacy if you like.

The technical aspect takes in many technologies; rather than describe them, I'll list them (non-exhaustively and in no particular order):
  • Basic Security - Web, Encryption, Hashing, Hacking (XSS etc), authentication (OpenID, OAuth etc), differences/commonalities between privacy and security, mapping privacy problems into security problems as a solution
  • Databases - technologies, design, schema development (eg: relational theory), "schema-less" databases, cross-referencing, semantic isolation
  • Semantics - ontologies, classifications, aspect, Semantic Web
  • Data-flow
  • Distributed Systems - networking and infrastructure
  • API design - browsers, apps, web-interfaces, REST
  • Data Collection - primary vs secondary vs infrastructure, logging
  • Policy - policy languages, logic, rules, data filtering
  • Anonymisation - data cleansing
  • Identifiers - tracking, "Do Not Track"
  • User-Interface
  • Metrics for privacy - entropy
  • Information Types and Classification - location, personally identifiable information, identifiers, PCI, health/medical data
As you can see, the list is extensive, and an understanding of each of these areas is critical to building systems that honour and preserve privacy in its various forms (as described in the consumer advocacy and legal sections). The main point here is to provide software engineers and computer scientists with the tools to implement privacy in a meaningful manner.

Now that we have outlined the three areas we can look at the fourth which binds these together and which I tentatively call "Theory of Privacy".

Obviously something binds these areas together, and there does exist a huge body of work on the nature of information and its classifications. I particularly like the approach by Barwise and Seligman in the 1997 book Information Flow: The Logic of Distributed Systems*. I believe we can quite easily get into all sorts of interesting ontology, semantics and even semiotics discussions. Shannon's Information Theory and notions of entropy (eg: Volkenstein's book Entropy and Information) are fundamental to many things. I think this really is an area that needs to be opened up and addressed seriously, and anything that binds together and provides a common language to unify consumer advocacy, the law and software engineering is critical.

Finally, no outline of a course would be complete without some preliminary requirements and a book list. For the former, an understanding of computer systems and basic computer security is a must (there is no privacy without security), as is a grounding in software engineering techniques and a dose of computer science. For the books, my first draft list would include:
  • Barwise, Seligman. Information Flow
  • O'Hara, Shadbolt. The Spy in the Coffee Machine: The End of Privacy as We Know It
  • Solove. Understanding Privacy
  • Nissenbaum. Privacy in Context: Technology, Policy, and the Integrity of Social Life
  • Solove. The Future of Reputation: Gossip, Rumour, and Privacy on the Internet

*somebody should make a movie of this.

Monday, 10 September 2012

Explaining Primary and Secondary Data

One of the confusing aspects of privacy is the notion of whether something is primary or secondary data. These terms emerge from the context of data gathering and processing and are roughly defined thus:
  • Primary data is data gathered for the purpose of providing a service, or data about the user gathered directly
  • Secondary data is data gathered from the provision of that service, ie: not provided by the user of that service, or data about the application gathered directly
Admittedly these are pretty poor definitions, and possibly overly broad given all the contexts in which these terms must be applied. In our case we wish to concentrate more on services (and/or applications) that we might find on a mobile device or internet service.

First we need to look more at the architectural context in which data is being gathered. At the highest level of abstraction applications run within some given infrastructure:


Aside: I'll use the term application exclusively here, though the term service or even application and service can be substituted.

Expanding this out more we can visualise the communication channels between the "client-side" and "server-side" of the application. We can further subdivide the infrastructure more, but let's leave it as an undivided whole.

In the above we see a single data flow between the client and server via the infrastructure (cf: OSI 7-layer model, and also Tanenbaum). It is this data-flow that we must dissect and examine to understand the primary and secondary classifications.

However, the situation is complicated because we can additionally collect information via the infrastructure: this data describes the behaviour of the infrastructure itself (in the context of the application). For example, this data is collected via log files such as those found in /var/log on Unix/Linux systems, or the logs from some application hosting environment, eg: Tomcat. In this latter case we have indirect data gathering, and whether it falls under primary or secondary as defined above is unclear, though it can be thought of as secondary, if both our previous definitions of primary and secondary can be coerced into a broader "primary" category. (If you're confused, think of the lawyers...)

Let's run through an example: an application which collects your location and friends' names over time. As you're walking along, when you meet a friend you type their name into the app* and it records the time and location and stores this in the cloud (formerly known as a "centralised" database). Later you can view who you met, where and at what time in all sorts of interesting ways, such as on a map. You can even share this to Facebook, Twitter or one of the many other social networking sites (are there others?).


Aside: Made with the excellent Balsamiq UI mock-up software.

The application stores the following data:

{ userId, friendsName, time, gpscoordinates }

where userId is some identifier that you use to login to the application and later retrieve your data. At some point in time we might have the following data in our database:

joeBloggs, Jane, 2012-09-10, 12:15, 60°10′19″N 24°56′29″E
joeBloggs, Jack, 2012-09-10, 12:18, 60°10′24″N 24°56′32″E
jane123, Funny Joe, 2012-09-10, 12:18, 60°10′20″N 24°56′21″E

This set of data is primary - it is required for the functioning of the application and is all directly provided by the user.

By sending this data we cannot avoid using whatever infrastructure is in place. Let's say there's some nice RESTful interface somewhere (hey, you could code this in Opa!), and by accessing that interface the service gathers information about the transaction, which might be stored in a log file and look something like this:

192.178.212.143, joeBloggs, post, 2012-09-10, 12:14:34.2342
64.172.211.10, arbunkleJr, post, 2012-09-10, 12:16:35.1234
192.178.212.143, joeBloggs, get, 2012-09-10, 12:16:37.0012
126.14.15.16, janeDoe, post, 2012-09-10, 12:17:22.0506

This data is indirectly gathered and contains information that is relevant to the running of infrastructure.
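Even this minimal log is enough to group requests per user and tie an identifier to an IP address; a quick Python sketch using the example lines above:

```python
# Each log line links identifier, IP address, action and timestamp.
log_lines = [
    "192.178.212.143, joeBloggs, post, 2012-09-10, 12:14:34.2342",
    "64.172.211.10, arbunkleJr, post, 2012-09-10, 12:16:35.1234",
    "192.178.212.143, joeBloggs, get, 2012-09-10, 12:16:37.0012",
]

# Group the entries per user identifier.
by_user: dict[str, list[tuple[str, str, str]]] = {}
for line in log_lines:
    ip, user, verb, date, time = [field.strip() for field in line.split(",")]
    by_user.setdefault(user, []).append((ip, verb, time))

# Two requests from the same IP already tie joeBloggs to one device/network.
print(by_user["joeBloggs"])
```

Nothing here was "collected about the user" deliberately, yet the grouping alone yields a per-user activity trace.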

The two sets of data above are generally covered by the terms and conditions of using that application or service. These T&Cs might include a privacy policy explicitly, or have a separate privacy policy additionally to cover disclosure and use of the information. Typical uses would cover requests by the authorities, monitoring for misuse, monitoring of infrastructure etc. The consents might also include use of data for marketing and other purposes, for which you will (or should) have an opt-out. The scope of any marketing request can vary but might include possibilities of identification and maybe some forms of anonymisation.

Note, if the service provides a method for you to share via Facebook or Twitter then this is an act you make and the provider of the service is not really responsible for you disclosing your own information publicly.

So that should explain a little about what is directly gathered, primary information and indirectly gathered information. Let's now continue to the meaning of secondary data.

When the application is started, closed or used, we can gather information about this. This kind of data is called secondary because it is directly related neither to the primary purpose of the application nor to the functioning of the infrastructure. Consent to collect such information needs to be asked for, and good privacy practice suggests that collection should be disabled by default. Some applications or services might anonymise the data in the opt-out situation (!). Secondary data collection is often presented as an offer to help with improving the quality of the application or service. The amount of information gathered varies dramatically, but generally application start, stop and abnormal exit (crashes) are gathered, as well as major changes in the functionality, eg: moving between pages or different features. In the extreme we might even obtain a click-by-click data stream including x,y-coördinates, device characteristics and even locations from a GPS.

Let's say our app gathers the following:

192.178.212.143, joeBloggs, appStart, 2012-09-10, 12:14:22.0001, WP7, Onkia Luna 700, 2Gb RAM, 75%free, 14 processes running, started from main screen, Elisa network 3G, OS version 1.2.3.4, App version 1.1
192.178.212.143, joeBloggs, dataEntryScreen, 2012-09-10, 12:14:25.0001
192.178.212.143, joeBloggs, gotGPSlocation, 2012-09-10, 12:14:26.0001, 50m accuracy, 3G positioning on, 60°10′19″N 24°56′29″E
192.178.212.143, joeBloggs, dataPosted, 2012-09-10, 12:14:33.2342, 3G data transfer, 2498 bytes
192.178.212.143, joeBloggs, mapViewScreen, 2012-09-10, 12:15:33.2342
192.178.212.143, joeBloggs, dataRequestAsList, 2012-09-10, 12:16:23.1001

What we can learn from this is how the application is behaving on the device and how the user is actually using that application. From the above we can find out what the status of the device was, the operating system version, type of device, whether the app started correctly in that configuration, from where the user started the app, which screen the app started up in, the accuracy and method of GPS positioning and so on.

So far there is nothing sinister about this, some data is required for the operation of the application and stored "in the cloud" for convenience, some data is collected by the infrastructure as part of its necessary operations and some data we voluntarily give up to help the poor application writers improve their products. And we (the user) consented to all of this.

From a privacy perspective these are all valid uses of data.

Now the problems start in three cases:
  • exporting to 3rd parties
  • cross-referencing
  • "anonymisation"
The above data is fantastic for marketing - a trace of your location over time plus some idea of your social network (even if we can't directly identify who "Jane" and "Jack" are.... yet!) provides great information for targeted advertising. If you're wondering, the above coördinates are for Helsinki Central Railway Station... plenty of shops and services around there that would like your attention and custom.

How the data is exported to the 3rd party, and at what level of granularity, is critical for trust in the service. Abstracting the GPS coordinates by mapping them to a city area or broader, plus removing personally identifiable information (in this case the userID... hashing may not be enough!), helps. The amount of data minimisation here is critical, especially if we want to reduce the amount of tracking that 3rd parties can do. In the above example, probably just sending the location and retrieving an advertisement back is enough, especially if it is handled server-side so that even the client device address is hidden.
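As an illustration of coordinate coarsening before export (the rounding granularity here is an assumption for illustration, not a recommendation):

```python
# Sketch of coordinate coarsening before export to a 3rd party.
# Rounding to 1 decimal degree of latitude is roughly an 11 km grid:
# enough to pick a city district for advertising, far too coarse to
# follow an individual around.

def coarsen(lat: float, lon: float, decimals: int = 1) -> tuple[float, float]:
    """Snap a GPS fix to a coarse grid; everything finer is discarded."""
    return (round(lat, decimals), round(lon, decimals))

# Helsinki Central Railway Station, from the example above.
exported = coarsen(60.1719, 24.9414)
print(exported)  # (60.2, 24.9)
```

The design point is that the precise fix never leaves the server at all - minimisation at the source, not after the fact.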

Cross-referencing is the really interesting case here. Given the above data-sets, can we deduce "Joe's" friends? Taking the primary data entries:

joeBloggs, Jane, 2012-09-10, 12:15, 60°10′19″N 24°56′29″E
jane123, Funny Joe, 2012-09-10, 12:18, 60°10′20″N 24°56′21″E

and cross-referencing these with the secondary data:

192.178.212.143, joeBloggs, dataEntryScreen, 2012-09-10, 12:14:25.0001
192.178.212.143, joeBloggs, gotGPSlocation, 2012-09-10, 12:14:26.0001, 50m accuracy, 3G positioning on, 60°10′19″N 24°56′29″E
192.178.212.143, joeBloggs, dataPosted, 2012-09-10, 12:14:33.2342, 3G data transfer, 2498 bytes

we can deduce that user joeBloggs was in the vicinity of user jane123 at 12h15-12h18. Furthermore looking at the primary data:

joeBloggs, Jane, 2012-09-10, 12:15, 60°10′19″N 24°56′29″E
jane123, Funny Joe, 2012-09-10, 12:18, 60°10′20″N 24°56′21″E

we can see that joeBloggs mentioned a "Jane" and jane123 mentioned a "Funny Joe" at those times. Now we might be very wrong in the next assumption, but I think it is reasonably safe to say that, even when we only have the string of characters "Jane" as an identifier, we can make a very reasoned guess that Jane is jane123. Actually, even the 4 (ASCII) characters that just happen to spell "Jane" aren't required, though they do help the semantic matching.
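That guess can be mechanised; here is a sketch of the cross-reference in Python, using the example records (the 5-minute window and the name-in-identifier heuristic are my own illustrative choices):

```python
from datetime import datetime

# Primary records from the example: (user, name mentioned, timestamp).
primary = [
    ("joeBloggs", "Jane",      "2012-09-10 12:15"),
    ("jane123",   "Funny Joe", "2012-09-10 12:18"),
]

def ts(s: str) -> datetime:
    return datetime.strptime(s, "%Y-%m-%d %H:%M")

# Candidate matches: two different users active within 5 minutes of each
# other (the GPS rows already put them ~100 m apart), where the name one
# user typed looks like part of the other user's identifier.
matches = [(u1, n1, u2)
           for u1, n1, t1 in primary
           for u2, n2, t2 in primary
           if u1 != u2
           and abs((ts(t1) - ts(t2)).total_seconds()) <= 300
           and n1.split()[0].lower() in u2.lower()]

print(matches)  # [('joeBloggs', 'Jane', 'jane123')]
```

Two records and a dozen lines of code suffice; at the scale of a real database this joins millions of identities.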

This kind of matching and cross-referencing is exactly what happened in the AOL Search Data Leak incident. Which neatly takes me to anonymisation: just because some identifier is obscured doesn't mean that the information doesn't exist.

We often see this with the hashing of identifiers. For example, our app designer has been reading about privacy by design and has obscured the identifiers in the secondary data using a suitably random, salted hash of sufficient length to be unbreakable for the next few universes - and we've salted the IP address too!

00974ca1582cc3fc23164f93a78c647059d4c3bb170592d1385a1f777f18491f, d4c53838904f4405893b9ea134c747a2b2e7a2e9341084387285ba5999ad894f, appStart, 2012-09-10, 12:14:22.0001, WP7, Onkia Luna 700, 2Gb RAM, 75%free, 14 processes running, started from main screen, Elisa network 3G, OS version 1.2.3.4, App version 1.1
00974ca1582cc3fc23164f93a78c647059d4c3bb170592d1385a1f777f18491f, d4c53838904f4405893b9ea134c747a2b2e7a2e9341084387285ba5999ad894f, dataEntryScreen, 2012-09-10, 12:14:25.0001
00974ca1582cc3fc23164f93a78c647059d4c3bb170592d1385a1f777f18491f, d4c53838904f4405893b9ea134c747a2b2e7a2e9341084387285ba5999ad894f, gotGPSlocation, 2012-09-10, 12:14:26.0001, 50m accuracy, 3G positioning on, 60°10′19″N 24°56′29″E
00974ca1582cc3fc23164f93a78c647059d4c3bb170592d1385a1f777f18491f, d4c53838904f4405893b9ea134c747a2b2e7a2e9341084387285ba5999ad894f, dataPosted, 2012-09-10, 12:14:33.2342, 3G data transfer, 2498 bytes
00974ca1582cc3fc23164f93a78c647059d4c3bb170592d1385a1f777f18491f, d4c53838904f4405893b9ea134c747a2b2e7a2e9341084387285ba5999ad894f, mapViewScreen, 2012-09-10, 12:15:33.2342
00974ca1582cc3fc23164f93a78c647059d4c3bb170592d1385a1f777f18491f, d4c53838904f4405893b9ea134c747a2b2e7a2e9341084387285ba5999ad894f, dataRequestAsList, 2012-09-10, 12:16:23.1001

Aside: Here's a handy on-line hash calculator from tools4noobs.
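A minimal sketch of why a single fixed salt doesn't stop tracking: the pseudonym is stable across every log entry (the identifiers and salt below are invented):

```python
import hashlib

SALT = b"fixed-deployment-salt"   # one salt for the whole log, as above

def pseudonym(identifier: str) -> str:
    """Salted SHA-256: irreversible without the salt, but deterministic."""
    return hashlib.sha256(SALT + identifier.encode()).hexdigest()

events = ["appStart", "dataEntryScreen", "gotGPSlocation", "dataPosted"]
log = [(pseudonym("joeBloggs"), event) for event in events]

# Every event carries the identical pseudonym: the name is hidden, yet the
# user can still be followed across the entire log.
pseudonyms = {p for p, _ in log}
print(len(pseudonyms))  # 1
```

Hiding *who* someone is does not hide *that it is the same someone* - and linkability is what tracking needs.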

We still have a consistent hash for the IP address and user identifier, so we can continue to track, albeit without being able to recover who made the request and from where it came. Note however the content of the first entry:

WP7, Onkia Luna 700, 2Gb RAM, 75%free, 14 processes running, started from main screen, Elisa network 3G, OS version 1.2.3.4, App version 1.1

How many Onkia Luna 700 2Gb owners running v1.2.3.4 of WP7 with version 1.1 of our application are there? Take a look at Panopticlick's browser testing to see how unique you are based on web-browser characteristics.
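The fingerprinting arithmetic here is simple surprisal addition; a rough, Panopticlick-style estimate in Python, with prevalence figures invented purely for illustration:

```python
import math

# Each attribute shared by a fraction p of users contributes -log2(p) bits
# of identifying information; independent attributes simply add up.
# (The fractions below are assumptions, not measured values.)
attributes = {
    "device model (Onkia Luna 700)": 1 / 2000,
    "OS version 1.2.3.4":            1 / 50,
    "app version 1.1":               1 / 10,
    "network operator":              1 / 30,
}

bits = sum(-math.log2(p) for p in attributes.values())
print(f"~{bits:.1f} bits of identifying information")
```

Four unremarkable attributes and you are already one in tens of millions; a consistent pseudonym on top of that is barely needed.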

And then there are timestamps... going back to cross-referencing against our primary data and infrastructure log files, we can be pretty sure that we can reconstruct who that user is.

We could add additional randomness by regenerating identifiers (or, in some cases, the hash salt) for every session; this way we could only track over a particular period of usage.
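A sketch of that per-session rotation (hypothetical identifiers; the standard `secrets` module provides the per-session salt):

```python
import hashlib
import secrets

def session_pseudonym(identifier: str, session_salt: bytes) -> str:
    """Pseudonym that is stable within one session only."""
    return hashlib.sha256(session_salt + identifier.encode()).hexdigest()

# A fresh random salt is generated for each session: events inside one
# session still link together, but two sessions by the same user do not.
salt_monday = secrets.token_bytes(16)
salt_tuesday = secrets.token_bytes(16)

same_session = (session_pseudonym("joeBloggs", salt_monday)
                == session_pseudonym("joeBloggs", salt_monday))
cross_session = (session_pseudonym("joeBloggs", salt_monday)
                 == session_pseudonym("joeBloggs", salt_tuesday))
print(same_session, cross_session)  # True False
```

Of course, the fingerprinting arithmetic above still applies within each session, so rotation bounds tracking rather than eliminating it.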

So, in conclusion, we have presented what is meant by primary and secondary data, stated the difference between directly gathered and indirectly gathered data, and explained some of the issues relating to the usage of this data.

Now there are some additional cases, such as certain kinds of reporting - for example, music player usage and DRM cases - which don't always fall easily into the above categories unless we define some sub-categories to handle them. Maybe more on that later.


*apps are considered harmful