Sunday, 10 June 2012

Semantic Isolation (Pt. 2½)

This is part 2.5; I'm reserving part 3 for the deeper semantic work, and I needed to write some notes after spending a week unifying a set of data sets for our analytics teams - extremely interesting and surprisingly challenging (in a good way!). This also links to privacy: understanding what can and cannot be linked, and the semantics of those linkages, is critical to enabling consumer privacy and compliance.

The unification of identifiers was presented in part 2 (see also part 1) as a way of establishing links between two disparate data sets and breaking the data siloing that occurs in information systems.

We start with the premise established earlier: The structure of the identifiers is considered a compound key to the records in that data set and understanding this structure is key to breaking the data siloing (see Apps Considered Harmful).

To give semantics (ostensibly a denotational semantics) to those identifiers we map them into "real world" structures which represent exactly what those identifiers should refer to. One of the discoveries here is that it has been assumed that, for example, a person identifier always refers to a real-world person, or that a device identifier (e.g. IMEI, IMSI) refers to an actual device, and that there is a one-to-one correspondence between devices and people.

Note: this isn't necessarily a good model of the real world; the question is whether this model suffices for the context and purpose to which it is being applied.

However, common identifiers such as user IDs, IMEI and IMSI do not refer to persons and devices directly but often through artifacts such as SIM cards and a person's persona. Adding to this complexity, the users and owners of devices change over time, and we now have mobile devices which support multiple SIM cards. At any point in time we might construct a model of the real-world artifacts thus:
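A minimal sketch of one such model in Python. The class and field names (Person, Persona, SimCard, Device, Usage) are hypothetical and chosen only for illustration; the point is that an identifier reaches a person only through intermediate artifacts, and only for a given period:

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Person:
    person_id: str  # hypothetical handle for a real-world person


@dataclass(frozen=True)
class Persona:
    user_id: str  # e.g. an email address used as a username


@dataclass(frozen=True)
class SimCard:
    imsi: str  # the IMSI identifies the SIM, not the handset


@dataclass(frozen=True)
class Device:
    imei: str  # the IMEI identifies the handset


@dataclass(frozen=True)
class Usage:
    """One person, acting through one persona, using one SIM in one device."""
    person: Person
    persona: Persona
    sim: SimCard
    device: Device
    start: date
    end: Optional[date] = None  # None means the usage is still current

    def active_on(self, day: date) -> bool:
        return self.start <= day and (self.end is None or day <= self.end)


def persons_for_imei(usages: Iterable[Usage], imei: str, day: date) -> Set[Person]:
    """All persons associated with a device on a given day (may be 0, 1 or many)."""
    return {u.person for u in usages if u.device.imei == imei and u.active_on(day)}
```

Because the result is a set, a device that changes owner or carries two SIMs simply yields a different or larger set, rather than breaking a 1-to-1 assumption silently.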

Typically analytics over these data sets - which is the driver for unification, to enable information consistency and quality when cross-referencing multiple data sets - takes a particular period in time, say 1 day to 3 months, so that we can dispense with dealing with certain changes. The longer the analytical period the lower the data quality, and there's an interesting set of research that could be done on measuring the quality loss.
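As a rough illustration of why the length of the period matters, the sketch below counts devices that were seen with more than one SIM inside an analysis window. The observation format, a tuple of (imei, imsi, day), and the function name are assumptions made for this example, not the layout of any real data set:

```python
from collections import defaultdict
from datetime import date
from typing import Iterable, Set, Tuple

Observation = Tuple[str, str, date]  # (imei, imsi, day), a hypothetical record layout


def ambiguous_imeis(observations: Iterable[Observation], start: date, end: date) -> Set[str]:
    """Return the IMEIs seen with more than one IMSI between start and end (inclusive)."""
    imsis_by_imei = defaultdict(set)
    for imei, imsi, day in observations:
        if start <= day <= end:
            imsis_by_imei[imei].add(imsi)
    return {imei for imei, imsis in imsis_by_imei.items() if len(imsis) > 1}
```

Running this over the same observations with a one-day window and then a three-month window gives a crude measure of how many device-to-SIM links stop being unambiguous as the period grows.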

So the main findings so far are:
  • Each user identifier (user ID, email address) used for uniquely identifying users is assumed to be a separate person.
  • It is generally assumed that equipment identifiers and addresses (IMEI, IMSI, IP address) identify unique pieces of equipment.
  • The relationship between a device and a person/user is assumed to be 1-to-1.
We have no notion of persona; in fact I've never seen any system with a good notion of persona. Given identifiers such as email addresses used as usernames, two email addresses used by the same person are assumed to represent two persons. The typical reason for having two is to allow a user to differentiate between two uses, such as one for social purposes and one for business purposes. The complication in linking arises because of the strongly directional nature of the person-to-persona relationship - in the discrete terms of UML and ER modelling, this is a simple directional relationship of the form:
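One way to sketch that directionality in Python, with hypothetical names throughout: a Person holds its personas, while a Persona carries no reference back to its Person. Counting distinct usernames therefore counts personas, not persons.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Persona:
    email: str    # used as a username
    purpose: str  # e.g. "social" or "business"


@dataclass(frozen=True)
class Person:
    person_id: str
    personas: Tuple[Persona, ...]  # navigable from person to persona only


def naive_person_count(emails: Iterable[str]) -> int:
    """What a data set does when it treats each username as a separate person."""
    return len(set(emails))


def owner_of(persona: Persona, persons: Iterable[Person]) -> Optional[Person]:
    """Resolving a persona needs the person side; the persona alone reveals nothing."""
    return next((p for p in persons if persona in p.personas), None)
```

A data set that holds only the email addresses has just the Persona side, so it has nothing to pass as the persons argument to owner_of, and linking two personas to one person is not possible from the identifiers alone.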

Aside: the notion of persona is best explained by Jung as 'a kind of mask, designed on the one hand to make a definite impression upon others, and on the other to conceal the true nature of the individual', which is found in Two Essays on Analytical Psychology. We probably take a much cruder view of persona, but the principle is the same. There are other views of this, such as those explained here: True self and false self.

Just to complicate the above, we probably cannot assume that a set of personas always refers to the same person.

The situation with device identifiers and devices is remarkably similar. It is being highlighted by the emergence of factors such as multiple-SIM devices, multiple device ownership and the misguided collection of identifiers, which result in complicated and error-prone cross-referencing between the plurality of these identifiers across data sets of varying quality and information content.

Aside: we haven't dealt with information hiding through encryption and hashing yet; there we rely either upon knowing decryption keys, upon pattern or semantic matching, or upon weak hashes and trivial salting.
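To illustrate why weak hashing hides little, here is a minimal sketch assuming identifiers were hashed with plain SHA-256 and either no salt or a known, shared salt. The choice of hash function, the salt handling and the function names are all assumptions for this example:

```python
import hashlib
from typing import Dict, Iterable


def hash_id(identifier: str, salt: str = "") -> str:
    """Hash an identifier the way a data set with trivial salting might (assumed scheme)."""
    return hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()


def link_hashed(hashed_ids: Iterable[str], plain_ids: Iterable[str], salt: str = "") -> Dict[str, str]:
    """Map hashed identifiers back to plain ones by hashing every known candidate."""
    lookup = {hash_id(p, salt): p for p in plain_ids}
    return {h: lookup[h] for h in hashed_ids if h in lookup}
```

If the salt is unknown the lookup finds nothing, which is the difference between trivial salting and salting that actually prevents cross-referencing.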

For the moment we have the conclusion that the identifiers in data sets often do not map to what it is assumed they identify, and that relationships between real-world artifacts are assumed implicitly, without respect for the actual usage and meaning of those artifacts.

Just to finish (this might become part 2.75): the data transformation processes - filtering, abstraction etc. - applied to a data set or an extracted portion thereof often implicitly refine the relationships in the identifier structures themselves and also at the semantic or real-world level.
