Monday, 7 January 2013

Tutorial: Tracking

Tracking and anonymisation are two critical aspects of information privacy and often quite misunderstood. We established what is meant by Personally Identifiable Information (PII) earlier but now I wish to progress a little further and discuss identifiers, tracking and anonymisation of data sets.
  • A data set is a collection of records containing information. 
  • A record is usually made up of a number of individually named fields, though it could be a more complex structure such as a graph or tree.
  • Each field contains some data, ranging from something as simple as a binary value to a name, a number, a time-stamp, a picture or a video etc.
  • These fields are usually typed, for example: string, boolean, integer, VARCHAR, blob, media, something from the dc: namespace etc. However, a field containing, say, a string could be further interpreted as a telephone number or a name. Some typing systems make such distinctions explicit (a string field that is to be interpreted as a telephone number, for instance); in others the interpretation is left to the reader.
  • Some fields in the data set's records are used to identify aspects of that record, to correlate records together or to link to some external data. These fields we term identifiers.
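As a rough illustration of these terms, here is a minimal Python sketch. The record type PlayRecord, its field names and the IDENTIFIER_FIELDS set are all hypothetical choices made for this example; which fields count as identifiers is a modelling decision, not something the language decides for us.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PlayRecord:
    """One record: a fixed set of named, typed fields (hypothetical)."""
    key: int      # identifies this record within the data set
    user_id: str  # correlates records that belong to the same user
    artist: str   # plain data, not used as an identifier here


# Assumption: we decide which fields act as identifiers.
IDENTIFIER_FIELDS = {"key", "user_id"}

# A data set is simply a collection of such records.
data_set = [PlayRecord(1, "Alice", "Queen"), PlayRecord(2, "Bob", "Rush")]

for f in fields(PlayRecord):
    role = "identifier" if f.name in IDENTIFIER_FIELDS else "data"
    print(f"{f.name}: {role}")
```

Note that the type of user_id is only "string"; whether that string is a name, an email address or something else is an interpretation layered on top.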
Tracking is the ability to correlate information; often made in conjunction with some criteria such as a temporal, device or user identity dimension. The correlation is made according to one or more fields which act as identifiers, for example, user ID fields or IP addresses. The point is that we have a consistent identifier (or key) over the sets of data that we wish to relate or consider together. For example, given the following data set collected from some music service:

Key  UserID  Artist
1    Alice   Queen
2    Alice   Queen
3    Bob     Rush
4    Bob     Rush
5    Eve     Spice Girls
6    Alice   Queen
7    Bob     Genesis
8    Eve     Metallica

From this log we might want to track user behaviour to understand what music a particular user of our system likes listening to: we can see that Alice likes Queen, Bob is a fan of progressive rock and Eve has varied taste in music. This is possible because we have a consistent identifier (the UserID field), which tells us that any two entries carrying the same value refer to the same entity: the user. Furthermore, the Key field allows us to distinguish between two entries containing otherwise identical information, which enables us to count individual entries: Alice played three songs, Bob three and Eve two. Additionally, the Key field in this example may also have a temporal dimension such that we can infer the order in which songs were played.
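The analysis just described can be sketched in a few lines of Python. This is only an illustration: the tuple layout and the variable names are my own, and it assumes that Key order reflects play order.

```python
from collections import Counter, defaultdict

# The example log as (Key, UserID, Artist) tuples.
log = [
    (1, "Alice", "Queen"),
    (2, "Alice", "Queen"),
    (3, "Bob", "Rush"),
    (4, "Bob", "Rush"),
    (5, "Eve", "Spice Girls"),
    (6, "Alice", "Queen"),
    (7, "Bob", "Genesis"),
    (8, "Eve", "Metallica"),
]

# Count entries per user: the Key field keeps duplicate rows distinct.
plays_per_user = Counter(user for _key, user, _artist in log)

# Correlate on UserID to see what each user listened to, in Key order.
artists_per_user = defaultdict(Counter)
for _key, user, artist in sorted(log):
    artists_per_user[user][artist] += 1

for user, count in plays_per_user.items():
    print(user, count, dict(artists_per_user[user]))
```

Everything here hinges on UserID being consistent across rows; nothing depends on what the values "Alice", "Bob" or "Eve" actually mean.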

The only required property of the identifier is that it be consistent over the records we wish to track. So if we replace the above identifiers with their SHA-256 hashes ("Alice" becomes 3bc51062973c458d5a6f2d8d64a023246354ad7e064b1e4e009ec8a0699a3043) we do not compromise our ability to track the behaviour of a user over that data set:

Key  UserID (SHA-256 hashed)  Artist
1    3bc5106...0699a3043      Queen
2    3bc5106...0699a3043      Queen
6    3bc5106...0699a3043      Queen

We can still make the same analyses: 3bc...3043 likes Queen and played songs from that band three times. We have, however, obfuscated the user identifier, assuming that the user identifier had any meaning in the first place.
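To make the point concrete, here is a small Python sketch that hashes the UserID field and repeats the per-user count. The function name pseudonymise is hypothetical, and I am assuming the identifier is encoded as UTF-8 before hashing, since the encoding is not otherwise specified.

```python
import hashlib
from collections import Counter


def pseudonymise(user_id: str) -> str:
    """Return the SHA-256 hex digest of a user ID (UTF-8 assumed)."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


# The example log as (Key, UserID, Artist) tuples.
log = [
    (1, "Alice", "Queen"), (2, "Alice", "Queen"), (3, "Bob", "Rush"),
    (4, "Bob", "Rush"), (5, "Eve", "Spice Girls"), (6, "Alice", "Queen"),
    (7, "Bob", "Genesis"), (8, "Eve", "Metallica"),
]

# Replace each UserID with its hash; Key and Artist are left untouched.
hashed_log = [(key, pseudonymise(user), artist) for key, user, artist in log]

plays_plain = Counter(user for _key, user, _artist in log)
plays_hashed = Counter(user for _key, user, _artist in hashed_log)

# The same input always hashes to the same output, so the groups survive.
for user, count in plays_plain.items():
    print(user, "->", pseudonymise(user)[:7] + "...", plays_hashed[pseudonymise(user)], "plays")
    assert plays_hashed[pseudonymise(user)] == count
```

Because the hash is deterministic, it is exactly as consistent an identifier as the original string, which is why the tracking is unaffected.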

This latter point is important as it depends upon how we interpret the identifier. For example, 3bc5106...0699a3043 has no "meaning" other than being something we use to track over. The string "Alice" may have a meaning, but taken simply as 5 ASCII or Unicode characters it is no more meaningful than our hashed value above. However, according to the typing information and its usage in the data set, "Alice" is the identifier of a user in some system. Furthermore, under some interpretations "Alice" is a female name, and this particular interpretation of the identifier's meaning might have additional impact.

In the above case we stated nothing about whether the strings "Alice", "Bob" and "Eve" actually were people's names, nor whether these were linkable to real and unique people. Quite deliberately, we never really stated the semantics of the UserID field.

An example that demonstrates this is the common practice of email-as-identifier. You can use your email address instead of (or as) a user name in Facebook, G+ and other services. The following string can be interpreted in a number of (potentially) simultaneous ways:

zarquon.123@somesite.zyx
  • a string of 24 ASCII characters
  • an email address of a person/company/entity
  • a user ID for some service
  • a unique identifier linkable to a real person
and so on...

From a tracking point of view, "zarquon.123@somesite.zyx" has just as much meaning as "Alice" or "3bc5106...0699a3043" etc.

We are, however, now moving into the interpretation of a field's contents: semantics that go beyond its role as an identifier, a role which is independent of the actual form the identifier takes. This leads us to the notion of linkability which we shall discuss later.

Tracking, as we have hinted, can be made more sophisticated through the addition of other identifiers such as IP addresses, device identifiers etc. This makes the partitioning of the data set more complex and expands the possible internal cross-correlations, but it doesn't change the basic principle of tracking.
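A sketch of what this partitioning looks like with more than one identifier follows. The events, the device labels and the partition helper are all invented for illustration; the only point is that the same grouping step works for any combination of identifier fields.

```python
from collections import defaultdict

# Hypothetical events carrying two identifiers: a user ID and a device ID.
events = [
    {"user": "Alice", "device": "phone-1", "artist": "Queen"},
    {"user": "Alice", "device": "laptop-1", "artist": "Queen"},
    {"user": "Bob", "device": "laptop-1", "artist": "Rush"},
    {"user": "Bob", "device": "phone-2", "artist": "Genesis"},
]


def partition(records, *id_fields):
    """Group records by the tuple of values in the chosen identifier fields."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[name] for name in id_fields)].append(record)
    return dict(groups)


by_user = partition(events, "user")                   # one group per user
by_device = partition(events, "device")               # one group per device
by_user_device = partition(events, "user", "device")  # finer-grained groups

print(len(by_user), len(by_device), len(by_user_device))
```

Grouping by device alone also shows the kind of cross-correlation mentioned above: in this made-up data, laptop-1 appears under both Alice and Bob.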

Some identifiers are more useful than others, and much of this depends upon how linkable an identifier is to a person or device. For example, device identifiers such as the IMEI are particularly useful; email addresses link to persons; IP addresses link sometimes to a single computer or device and sometimes to several, and can also be mapped to locations through the process of geolocation.

So that very briefly introduces what tracking is: simply the ability to correlate and collate sets of data together. The next step is to perform specific analyses over that data and to map those results back to the business and customer.

2 comments:

Privacy Maverick said...

Any progress on the tutorials?

Ian said...

Working on it, anyone like to suggest specific technical subjects? There are previous postings on primary and secondary data classification, generic classification structures to start with.