Thursday, 28 June 2012

Raspberry Pi

My Raspberry Pi arrived, complete with T-shirt:

Now back to the days when computing was fun! Next on the shopping list: camera, a few motors, maybe a radio transmitter....

I was going to make a series of unboxing pictures, but there was little packaging and excitement meant that even that didn't last long.... :-)

Monday, 18 June 2012

Are we getting privacy wrong?

There seem to be a few articles around at the moment that discuss privacy in terms of consumers selling their data, instead of it simply being collected by the service provider for free (allegedly, in some cases, as a kind of payment for a presumably free service). In particular the article that appeared in InfoWorld: The Next Consumerization Revolution by Galen Gruman and the presentation by Helen Nissenbaum a month or so ago (I wrote about this in the blog: Privacy, Dataflow and Nissenbaum) typify a trend towards the idea of a "data-mart" for consumer information.

First let's look at the main channels of communication between a user via an application (or [sic.] 'app') and a service. I tend to denote eight particular areas where data is either being transmitted or held as shown in the data-flow diagram below:

  1. The data transmitted to use a service, for example, if you are using a contacts book then your contact details (which might be private) will have to flow over this channel.
  2. The storage of your information; using the contacts book example, this is where your contacts are stored.
  3. Log data recorded because you used a service, for example Apache logs.  This will include identifiers such as machine addresses and browser identification strings as well as actions performed and resources accessed.
  4. Local data including primary and cached data, cookies etc.
  5. So called behavioural data which includes data on how you are using an application, device context etc. This is typically referred to as secondary data collection under EU Law.
  6. Storage of behavioural data
  7. Communication of data to support services such as login/authentication/federation services
  8. Log data recorded in the same manner as point 3 above but specific to the support service (point 7).
The above is described in terms of channels of information and a specific implementation might include numerous other channels each of the natures presented above. Each channel additionally supports multiple conversations and data stores are 'logical' in nature not physical (that's a refinement of the above).

I'm neglecting protecting the channels themselves - that's a different discussion and more security related but privacy of the channel itself is an important issue as was demonstrated by the BT-Phorm incident.

Now while many talk in terms of a simple data-mart, the reality is more complicated: even assuming that we have a trusted data market provider, we still have to answer the question of which data we are actually trying to protect.

In the above data-flow there are really only three possible candidates, as denoted by the data stores 2, 3 and 6:
  • Log files: The kinds of data stored here really only have value in terms of finding out how the service is being used from a system administration perspective. Anonymisation can be applied, though depending upon how this is done, profiling may still be possible given enough information.
  • Primary data: This, along with behavioural data, is the key marketable data asset. In the case of a contacts book a user's social network or acquaintances can be constructed along with a multitude of other personal data. Much of this is often given up in order to get a better service, for example, LinkedIn requests permission to read your Gmail contacts in order to construct your set of professional contacts. In effect there is a kind of data-mart already existing here, however you are selling additional information in that your email contacts probably cover more of your contacts than some other sources would. To a point this data is not so valuable to the service provider but rather to the advertisers and marketers who construct a profile of you.
  • Behavioural data: This is probably the most valuable for the actual service provider in that it gives the basis for profiling usage of a service and segmentation from there. Depending upon what is collected here, much of this may already be the same as that collected via logging (into storage 3). What makes this data asset valuable is that it might contain contextual information such as the status of a device at the time of collection.
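
As a rough sketch only, the eight areas listed earlier can be written down as a small table so that the candidate stores can be picked out mechanically. The `kind` and `location` labels are my own assumptions for illustration (the original diagram does not name them), as is the decision to treat areas 4 and 8 as stores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Area:
    number: int        # numbering follows the list of eight areas
    kind: str          # "channel" or "store" (illustrative label)
    location: str      # "device", "service" or "support" (illustrative label)
    description: str


AREAS = [
    Area(1, "channel", "service", "data transmitted to use the service"),
    Area(2, "store", "service", "primary data held by the service"),
    Area(3, "store", "service", "log data recorded by the service"),
    Area(4, "store", "device", "local data, caches, cookies"),
    Area(5, "channel", "service", "behavioural (secondary) data collection"),
    Area(6, "store", "service", "stored behavioural data"),
    Area(7, "channel", "support", "communication with support services"),
    Area(8, "store", "support", "log data recorded by the support service"),
]


def candidates(areas=AREAS):
    """Stores held by the service itself: the data a data-mart would trade in."""
    return [a for a in areas if a.kind == "store" and a.location == "service"]


for area in candidates():
    print(area.number, area.description)
```

Under those assumptions the selection yields stores 2, 3 and 6, matching the three bullets above.
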
To return to our question - "are we getting privacy wrong?" - if we ask a consumer rights or privacy activist then yes in that we're collecting data - something somewhat unavoidable if we wish to use a globally available, internet-based service. Of course we can restrict some data-flows, eg: flow 5, but that's about it. Going down this route the only solution eventually is not to use a service or to provide by default some Tor-like routing and anonymisation. If we ask the question in terms of what could be provided to the consumer then ostensibly we are getting privacy wrong.

The truth of the current focus of business and the ways in which consumers want to work is that data collection is unavoidable. Given the above points it becomes clear that preventing data collection is similarly unachievable, and current mechanisms for selective and coarse-grained data collection are not delivering on the privacy ideal.

A move towards the "data-mart" is probably going to be the best solution, though it does complicate the consumers' interactions with required services. Furthermore, while the business model makes sense, though not necessarily financially at this time, the infrastructure in terms of physical provision of such data-mart services and the necessary developments in anonymisation and identification of data are lacking.

Another aspect that worries me is that while this protects the consumer and directly monetises their information, it still does not sufficiently prevent further usage after release - something that is a problem now given the interlinking between current information holders. A data-mart approach would only deal with the first level in any case and lead to some interesting issues when the expectation of privacy is compromised outside of the data-mart level.

Finally, what form does this "data-mart" take? Is it a centralised information proxy over a user's data stores and flows, or is it a case-by-case contract with service providers? Are there going to be gradations in what is monetised - more data = more service?

Saturday, 16 June 2012

Cohomology for the layman (?)

Noticed this article over at n-Category Cafe: Cohomology in Everyday Life, where the author David Corfield asks for examples from everyday life to explain cohomology to a layman.

Given the definition of cohomology (according to Wikipedia*):
In mathematics, specifically in algebraic topology, cohomology is a general term for a sequence of abelian groups defined from a co-chain complex. That is, cohomology is defined as the abstract study of cochains, cocycles, and coboundaries. Cohomology can be viewed as a method of assigning algebraic invariants to a topological space that has a more refined algebraic structure than does homology. Cohomology arises from the algebraic dualization of the construction of homology. In less abstract language, cochains in the fundamental sense should assign 'quantities' to the chains of homology theory.
then he's got quite a challenge, I think. However, of his current list of examples, carrying in arithmetic is a good one and something that is familiar to everyone. If you're up to it then the mentioned paper by Daniel Isaksen: A Cohomological Viewpoint on Elementary School Arithmetic [1] is great reading.
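
To make the carrying example a little more concrete, here is a small sketch (my own, not taken from the paper) that checks the one property of carrying that gives it its cohomological flavour: when three digits are added, the total carry is the same whichever pair is added first. This is what is usually called the 2-cocycle condition for the carry function; the function names below are made up for illustration.

```python
from itertools import product

BASE = 10


def carry(a: int, b: int) -> int:
    """Carry produced when adding two base-10 digits (0 or 1)."""
    return (a + b) // BASE


def cocycle_condition_holds() -> bool:
    """Check carry(a,b) + carry(a+b, c) == carry(b,c) + carry(a, b+c) for all digits."""
    digits = range(BASE)
    for a, b, c in product(digits, repeat=3):
        left = carry(a, b) + carry((a + b) % BASE, c)    # computing (a + b) + c
        right = carry(b, c) + carry(a, (b + c) % BASE)   # computing a + (b + c)
        if left != right:
            return False
    return True


print(cocycle_condition_holds())
```

Both sides count the number of tens in a + b + c, which is why the check succeeds for every triple of digits.
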

Anyway, I look forward to what he finally presents - especially if he's going to present something about entropy which would give a nice link with information theory and, as we're talking about presenting to the layman, information privacy.


[1] Daniel C. Isaksen. (2002) A Cohomological Viewpoint on Elementary School Arithmetic. The American Mathematical Monthly, Vol. 109, No. 9. (Nov., 2002), pp. 796-805

*usual disclaimers apply of course.

Monday, 11 June 2012

Information engineers wanted

Continuing on the theme of semantic isolation and data siloing* somewhat, I was listening to the Diane Rehm Show on US talk radio station NPR via Finland's YLE Mondo. Today's programme "New Voter ID Laws and the 2012 Elections" was about the USA's law regarding voter eligibility and the problems of keeping track of who is and who isn't allowed to vote in certain elections (2012 presidential election being of particular concern).

One part of the programme concentrated on the difficulty of cross-referencing between voting lists in states, counties and various government bodies such as the driver and vehicle licensing (DMV). One of the major problems is actually identifying and subsequently cross-referencing people by ID numbers (plural!), by name (difficult due to misspellings, usually involving punctuation) or by address and location. Making this even more interesting is the temporal aspect: over time people move, and records in differing data-sets end up overlapping or being temporally disjoint.
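
As a minimal sketch of the name-matching part of the problem (and only that part), the following normalises names before comparing them, so that differences in punctuation, accents, spacing and case do not prevent a match. The function names and the sample names are hypothetical, and a real voter-list match would also need the ID, address and temporal checks described above.

```python
import re
import unicodedata


def normalise_name(name: str) -> str:
    """Reduce a name to a comparison key: no accents, punctuation, spaces or case."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks left over from decomposing accented characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep only ASCII letters and digits.
    return re.sub(r"[^a-z0-9]", "", stripped.lower())


def cross_reference(list_a, list_b):
    """Map each name in list_a to the names in list_b sharing its key."""
    index = {}
    for name in list_b:
        index.setdefault(normalise_name(name), []).append(name)
    return {name: index.get(normalise_name(name), []) for name in list_a}


# Hypothetical records from two lists that spell the same names differently.
matches = cross_reference(["O'Brien, Mary", "José Díaz"],
                          ["OBrien Mary", "Jose Diaz", "Smith John"])
print(matches)
```

A key this crude will also produce false matches between different people with similar names, which is exactly where the manual skill of the "election geeks" comes in.
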

One of the current solutions is to use what were termed "election geeks" - people with highly detailed knowledge of voter lists and how to match records in different formats from different states and agencies together. That is, a group of people who are highly skilled in performing the manual task of deisolating the semantics of each data set (of voters) and matching these together.

One of the presenters remarked that while a technological solution was necessary we need more so-called election geeks. Putting it another way, we need more highly skilled engineers specialising in information, information theory, semantics and ultimately semiotics. What a great idea!

*this links to part 1, parts: 2 and 2 and a half.

Sunday, 10 June 2012

Semantic Isolation (Pt. 2½)

This is part 2.5; I'm reserving part 3 for the deeper semantic work and I needed to write some notes after spending a week working on unifying a set of data-sets for our analytics teams - extremely interesting and surprisingly challenging (in a good way!). This also has links with privacy: understanding what can and cannot be linked, and the semantics of those linkages, is critical to enabling consumer privacy and compliance.

The unification of identifiers was presented in part 2 (link to part 1 here too) as a way of establishing links between two disparate data sets and breaking the data siloing that occurs in information systems.

We start with the premise established earlier: The structure of the identifiers is considered a compound key to the records in that data set and understanding this structure is key to breaking the data siloing (see Apps Considered Harmful).

To give semantics (ostensibly a denotational semantics) to those identifiers we map these into "real world" structures which represent exactly what those identifiers should refer to. One of the discoveries here has been that it is often assumed that, for example, a person identifier always refers to a real-world person, or that a device identifier (eg: IMEI, IMSI) refers to an actual device and that there is a one-to-one correspondence between devices and people.

Note: this isn't necessarily a good model of the real-world; the question is whether this model suffices for the context and purpose to which it is being applied.

However common identifiers such as user IDs, IMEI, IMSI do not refer to persons and devices directly but often through artifacts such as SIM cards and a person's persona. Adding to this complexity is that the users and owners of devices change over time, and that we now have mobile devices which support multiple SIM cards. At any point in time we might construct a model of the real-world artifacts thus:

Typically analytics over these data sets - which is the driver for unification, to enable information consistency and quality over cross-referencing of multiple data sets - takes a particular period in time, say 1 day to 3 months, so that we can dispense with dealing with certain changes. The longer the analytical period the lower the data quality and there's an interesting set of research that can be made there on measuring the quality loss.

So the main findings so far are:
  • Each user identifier (user ID, email address) used for uniquely identifying users is assumed to correspond to a separate person.
  • It is generally assumed that equipment identifiers and addresses (IMEI, IMSI, IP address) identify unique pieces of equipment.
  • The relationship between a device and a person/user is assumed to be 1-to-1.
We have no notion of persona, in fact I've never seen any system with a good notion of persona. Given two identifiers such as email addresses used as usernames, two email addresses used by the same person are assumed to represent two persons. The typical use for this is to allow a user to differentiate between two uses such as one for social purposes and one for business purposes. The complication of linking arises because of the strongly directional nature of the person to persona relationship - in the discrete terms of UML and ER modelling, this is a simple directional relationship of the form:

Aside: the notion of persona is best explained by Jung as 'a kind of mask, designed on the one hand to make a definite impression upon others, and on the other to conceal the true nature of the individual' which is found in Two Essays on Analytical Psychology. Probably we take a much more crude view on persona but the principle is the same. There are other views of this such as those explained here: True self and false self.

Just to complicate the above we probably cannot assume that a set of personas always refers to the same person.
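
A minimal sketch of the person and persona relationship, under my own assumption about its direction: a person holds a set of personas, and a persona carries no link back to a person. All class names, function names and email addresses here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    identifier: str          # e.g. an email address used as a username


@dataclass(frozen=True)
class Person:
    label: str               # hypothetical handle for a real-world person
    personas: frozenset      # the Persona objects this person presents


def naive_person_count(identifiers):
    """The assumption criticised above: one identifier equals one person."""
    return len(set(identifiers))


def persons_behind(identifier, persons):
    """Resolve an identifier to persons; only possible from the person side."""
    return [p for p in persons if Persona(identifier) in p.personas]


alice = Person("alice", frozenset({Persona("alice@work.example"),
                                   Persona("alice@home.example")}))
shared = Persona("family@home.example")
bob = Person("bob", frozenset({shared}))
carol = Person("carol", frozenset({shared}))
people = [alice, bob, carol]

print(naive_person_count(["alice@work.example", "alice@home.example"]))
print([p.label for p in persons_behind("family@home.example", people)])
```

Two identifiers collapse to one person in the first case, while one identifier expands to two persons in the second, so neither direction of the usual one-to-one assumption holds.
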

The situation with device identifiers and devices is remarkably similar and this is really being highlighted by the emergence of factors such as multiple-SIM, multiple device ownership and misguided collection of identifiers resulting in complicated and error-prone cross-referencing between the plurality of these from data sets of varying quality and information content.

Aside: we haven't dealt with information hiding through encryption and hashing yet and here we either rely upon knowing decryption keys, pattern or semantic matching or weak hashes and trivial salting.

For the moment we have the conclusion that the mappings between identifiers in data sets often do not match expectations of what it is assumed they actually identify, and that assumed relationships between real-world artifacts are made implicitly without respect for the actual usage and meaning of those artifacts.

Just to finish (this might become part 2.75): the data transformation processes (filtering, abstraction etc) over a data-set or extracted portion thereof often implicitly refine the relationships in the identifier structures themselves and also at the semantic or real-world level.

Tuesday, 5 June 2012

Semantic Isolation (Pt.2)

In part 1 we identified problems with the definition of data siloing resulting from the proliferation of databases specific to individual services, apps etc.

These siloed databases have massive amounts of overlapping content but generally cannot be semantically matched and used together. This is preventing consolidation and unification of the data, which then leads not only to more proliferation and irreconcilable duplication of data but also actively prevents more expansive applications and analytics of that data from being created.

To start solving this data siloing problem we investigate, initially, two aspects to enable de-isolation of data:
  • linkability of identifiers
  • interoperability of semantics
In this posting I'll concentrate on identifiers and discuss semantic interoperability in a later part.

Identifiers really mean the primary keys used over the plurality of siloed data sets. These come in a number of forms: user IDs, email (typically as username), device IDs, session IDs etc as well as encrypted, hashed and processed versions of these. Additionally structured or compound variants add to the complexity.

As a privacy person I could also mention so called anonymous data sets where identifiers can be inferred from other properties such as consistent locations over the data - something I tend to explain as the "2 location problem" where no personal identifiers are stored but a number of locations, eg: start and end points in navigation routes could be used as inferred identifiers.
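
A rough sketch of the "2 location problem", assuming the data set is nothing more than a list of start and end coordinates with no personal identifiers at all. The function name, the thresholds and the coordinates are made up for illustration.

```python
from collections import Counter


def inferred_identifiers(routes, min_count=3, precision=3):
    """Return (start, end) pairs that recur often enough to act as an identifier.

    routes: iterable of ((lat, lon), (lat, lon)) start/end pairs.
    precision: decimal places kept, so nearby points fall into the same cell.
    """
    def cell(point):
        lat, lon = point
        return (round(lat, precision), round(lon, precision))

    counts = Counter((cell(start), cell(end)) for start, end in routes)
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical "anonymous" navigation routes: no user IDs, only end points.
routes = [
    ((60.1699, 24.9384), (60.2052, 24.6561)),
    ((60.1701, 24.9381), (60.2049, 24.6558)),
    ((60.1698, 24.9383), (60.2051, 24.6562)),
    ((61.4978, 23.7610), (60.4518, 22.2666)),
]
recurring = inferred_identifiers(routes)
print(recurring)
```

A pair of locations that recurs like this (home and work, say) behaves as an identifier even though nothing in the data set was ever labelled as one.
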

Aside: Making this more interesting is profiling based upon a deeper, more semantic investigation on the contents irrespective of the key or identifier present in the database, cf: AOL search logs. We do not discuss this here at this time.

The main problems regarding linkability are:
  • the semantics of the identifier
  • the structure of the identifier
  • the representation of the identifier
The semantics of an identifier relate to which concepts that identifier represents. Taking unique user identifiers, for example usernames, we need to understand how these relate to a person. It is often taken for granted that a user identifier equals a unique person. Similarly, device identifiers and addresses such as IP addresses are often equated with a single machine.

For example, we might have a structured or compound identifier containing a user ID which is matched against one device ID which is further composed of individual session IDs. We might also form a view of the real-world as shown in green. The red dashed lines show how we relate our identifier concepts with real-world concepts.

Note how what was a seemingly simple mapping can now be complicated by other factors such as whether the device ID identifier refers to something the identified user owns or uses. There is also an interesting mismatch between the multiplicities in the identifier structure and the real-world. We can argue that the above is a poor model of the real-world, but it serves the purpose to focus discussion on what we want the identifiers to actually identify.

Identifying the real-world concepts and then understanding how the identifiers’ semantics are grounded in these gives us our first clue into what can and cannot be successfully unified. This process has to be repeated for each individual data set or asset being considered and assurance sought that the semantics or real-world mappings do coincide sufficiently such that we can be sure that the pairs of identifiers are really referring to the same concepts, ie: they are both identifying the same things at the same level of granularity.

The structure of the identifier refers to whether that identifier acts as a compound key. Often seen is a mix of, say, user identifiers, device identifiers and session identifiers. While we might have identified a mix of many-to-many relationships between the real-world concepts, at this level we start to see some kind of invariants over that structure. Ideally this should refine the space of configurations of identifiers to real-world concepts.
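
One way to look for such invariants is simply to measure the multiplicities that actually occur in the compound keys. The sketch below assumes, purely for illustration, that each key is a (user ID, device ID, session ID) tuple; the function name and sample keys are hypothetical.

```python
from collections import defaultdict


def multiplicities(keys):
    """Measure the observed multiplicities in (user_id, device_id, session_id) keys."""
    devices_per_user = defaultdict(set)
    users_per_device = defaultdict(set)
    devices_per_session = defaultdict(set)
    for user_id, device_id, session_id in keys:
        devices_per_user[user_id].add(device_id)
        users_per_device[device_id].add(user_id)
        devices_per_session[session_id].add(device_id)

    def largest(groups):
        return max((len(members) for members in groups.values()), default=0)

    return {
        "max_devices_per_user": largest(devices_per_user),
        "max_users_per_device": largest(users_per_device),
        "max_devices_per_session": largest(devices_per_session),
    }


keys = [
    ("u1", "d1", "s1"),
    ("u1", "d1", "s2"),
    ("u2", "d1", "s3"),   # a second user on the same device
    ("u1", "d2", "s4"),   # the same user on a second device
]
observed = multiplicities(keys)
print(observed)
```

With this sample the user to device relationship turns out to be many-to-many, while each session stays on a single device; the latter is the kind of invariant that narrows down what the identifiers can be referring to.
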

Additionally we also have to look at the temporal aspect of the identifiers: does there exist a strong compositional structure versus a looser aggregate structure over time?

Note that we actually encounter the structure when working out the semantics, we present it second however to emphasise concentration on the semantics of the identifiers not their internal construction.

The representation of an identifier can cause some problems and we particularly refer here to obfuscated identifiers that have been transformed using hashing or encryption. Encrypted identifiers can be reversed to reveal their original forms by anyone holding the decryption key, whereas cryptographic hashing is one-way. The latter should always be used with a suitable salt to add randomness to the hash. Doing this may turn an identifier into a kind of session identifier rather than one that identifies a real-world person or device – this depends greatly upon how often the salt is regenerated.
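
The effect of salting on linkability can be sketched as follows. This is an illustration under my own assumptions, not a recommendation for a particular scheme: SHA-256 over salt plus identifier is chosen only for brevity, and the device identifier and salts are made up.

```python
import hashlib
import secrets


def plain_hash(identifier: str) -> str:
    """Unsalted hash: identical in every data set, so still a linkable key."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def salted_hash(identifier: str, salt: bytes) -> str:
    """Salted hash: only comparable with values produced using the same salt."""
    return hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()


device_id = "device-0001"   # hypothetical identifier

# Two data sets hashing without a salt can be joined on the hash directly.
print(plain_hash(device_id) == plain_hash(device_id))

# A fixed salt per data set gives a stable pseudonym inside that data set only.
salt_a, salt_b = b"salt-for-data-set-a", b"salt-for-data-set-b"
print(salted_hash(device_id, salt_a) == salted_hash(device_id, salt_b))

# A salt regenerated per session makes the value behave like a session identifier.
session_1 = salted_hash(device_id, secrets.token_bytes(16))
session_2 = salted_hash(device_id, secrets.token_bytes(16))
print(session_1 == session_2)
```

How often the salt is regenerated therefore decides what the hashed value identifies: a device across data sets, a device within one data set, or just a session.
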

When dealing with hashed identifiers we will find partial matches, typically when working with session identifiers. This leads to various questions about anonymity, especially when we can match the contents of the partial identifiers to "accidentally" reveal more of the structure. At worst we can limit the isolation of a data silo, at least to some internal level, for example, device or session only rather than a specific real-world person.

One might argue that we have addressed our concerns in some kind of reverse order; starting with semantics. However the key to understanding any information system is to understand what real-world concepts that information system is actually modeling and working “backwards” gives us the framework in which to perform our analysis of the linkability of identifiers.

Once identifiers in two data sets or assets have been linked based upon the correspondence of their representation, syntax and semantics then we have the initial unification.

Saturday, 2 June 2012

More on "Apps considered harmful"

Just a heads up that the abstract to Ora's keynote at CIDOC 2012 in Helsinki is up on the CIDOC pages at the following address:

It is entitled: Love Thy Data (or: Apps Considered Harmful) and I've seen the slide-set...something really worth waiting for and when it becomes publicly available I'll post the link here as well.