Tuesday 15 May 2012

Dimensional Analysis of Information


I’ve been looking at dimensional analysis as a technique to use for analyzing information flows, specifically for privacy.

After developing various taxonomies for information classification and some of the superstructure behind these (see here), one of the problems we have seen is evaluating information content from ontologies and data schemata: deciding whether the field “name” in ontology X has the same (or similar) semantics as a similarly named field in ontology Y.

Inspired by the technique of dimensional analysis, one idea is to consider each ontology as you would a system of units of measurement, e.g. imperial units versus metric units. What dimensional analysis does is abstract away from the units of measurement into a small set of base or fundamental aspects. Typically these are length, mass and time, denoted [L], [M] and [T].

For example, acceleration has dimensions [L][T]⁻², which in a particular system of measurement might be expressed as metres per second per second or furlongs per day per aeon (just to mix things up).

When working with information systems, and especially in the case of privacy where we need to classify information, we can construct a set of “dimensions”. The choice of these is somewhat arbitrary, though they should at least have some aspect of orthogonality (I said this was inspired by dimensional analysis!).

The dimensions I chose were: Personal, Financial, Health, Time, Location, Identity and Content.

Aside: To save on space and hint at the dimensional analysis inspiration, I'll use the first letter of the dimension name inside square brackets [ and ]...

We can have huge debates (and we did) about whether these are truly orthogonal and what happens when data elements or types are mapped to more than one 'dimension' - I don't think it matters too much at the moment, so let's put some of those difficulties aside.

Actually each of these is a top-level class in a taxonomy of information classification. For example, the dimension [P] breaks down into Demographics ([P_D]) and Contact ([P_C]); other classes follow similarly, as shown in the diagram below:

Figure: Information Type Taxonomy

Given a data schema we can map this schema into its dimensions in much the same way as is done with physical quantities, for example the schema:

UserID x DeviceID x CellID x Timestamp x Age

This would be mapped to [I]³[T][P], meaning three units of identifiers, one of time and one of personal information. As I stated earlier, we actually have a hierarchy of dimensions, so we might break this down to [I_P][I]²[T][P_D], where [I_P] is a personal identifier and [P_D] is demographics, each being a sub-classification of [I] and [P] respectively.
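A minimal sketch of this mapping in Python, assuming a hand-written table from field names to (sub)dimensions. The table, the function names and the `^` notation for powers are my own illustration, not part of any tool:

```python
from collections import Counter

# Hypothetical field-to-dimension table for the example schema.
# Labels with an underscore are sub-classes, e.g. I_P = personal identifier.
FIELD_DIMENSION = {
    "UserID": "I_P",
    "DeviceID": "I",
    "CellID": "I",
    "Timestamp": "T",
    "Age": "P_D",
}

def dimensions(schema, top_level=False):
    """Count the fields per dimension, optionally collapsing sub-classes."""
    labels = [FIELD_DIMENSION[field] for field in schema]
    if top_level:
        labels = [label.split("_")[0] for label in labels]
    return Counter(labels)

def render(dims):
    """Render a dimension count as a formula, in first-seen order."""
    return "".join(
        f"[{dim}]" + (f"^{power}" if power != 1 else "")
        for dim, power in dims.items()
    )

schema = ["UserID", "DeviceID", "CellID", "Timestamp", "Age"]
print(render(dimensions(schema, top_level=True)))  # [I]^3[T][P]
print(render(dimensions(schema)))                  # [I_P][I]^2[T][P_D]
```

Collapsing a sub-class to its top-level class is just a matter of dropping everything after the underscore, which only works because of the naming convention assumed here.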

The kinds of analysis that can be made include the quick identification of critical information content issues. In the above case we have a mix of identities, which allows for potentially very precise identification; we have time involved, which might allow profiling or tracking; and we have an element of personal (demographic) information.
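As a rough illustration, checks of this kind could be written as simple rules over the dimension counts. The rules and thresholds below are assumptions made up for the example, not a definitive rule set:

```python
from collections import Counter

def privacy_flags(dims):
    """Return warnings for a dimension count; the rules are illustrative."""
    flags = []
    if dims["I"] >= 2:
        flags.append("multiple identifiers: precise identification possible")
    if dims["I"] >= 1 and dims["T"] >= 1:
        flags.append("identifier with time: profiling or tracking possible")
    if dims["P"] >= 1:
        flags.append("personal information present")
    return flags

# Three identifiers, one time and one personal field, as in the example schema.
example = Counter({"I": 3, "T": 1, "P": 1})
for flag in privacy_flags(example):
    print(flag)
```

A `Counter` returns 0 for dimensions that are absent, so the rules need no special handling for schemas that lack a given dimension.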

Furthermore, one identifier can even be mapped to a location: CellIDs can easily be transformed into GPS coordinates and, over time, fairly easily triangulated, especially in an area densely populated by mobile base stations. So the above example could also be mapped as [I]²[L][T][P], and a given data schema being expressible in more than one dimensional form does raise some interesting concerns.
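One way to make the "more than one dimensional form" point concrete is to let each field list every dimension it could be read as, and then enumerate the combinations. This is only a sketch; the candidate table is hypothetical and `^` marks a power:

```python
from collections import Counter
from itertools import product

# Each field lists the dimensions it could be read as. Treating CellID as
# either identity or location follows the text; the table is illustrative.
CANDIDATES = {
    "UserID": ["I"],
    "DeviceID": ["I"],
    "CellID": ["I", "L"],
    "Timestamp": ["T"],
    "Age": ["P"],
}

def render(dims):
    """Render a dimension count as a formula, in first-seen order."""
    return "".join(
        f"[{dim}]" + (f"^{power}" if power != 1 else "")
        for dim, power in dims.items()
    )

def dimensional_forms(schema):
    """List every dimensional form the schema can take."""
    choices = [CANDIDATES[field] for field in schema]
    return [render(Counter(combo)) for combo in product(*choices)]

schema = ["UserID", "DeviceID", "CellID", "Timestamp", "Age"]
print(dimensional_forms(schema))  # ['[I]^3[T][P]', '[I]^2[L][T][P]']
```

The number of forms grows multiplicatively with the number of ambiguous fields, which is perhaps one reason the ambiguity deserves attention.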

If we have some functions that process data, say a function that anonymises identities (we can have the discussion what anonymisation means later – please don’t mention hashing functions!) then application of this might result in our original dimensions [I]³[T][P] being mapped via that anonymisation function to [I]²[T][P] – an improvement in terms of moving towards anonymity maybe.
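In code, such a function can be modelled purely at the level of dimensions, without saying anything about how the anonymisation itself is done. A minimal sketch, where `anonymise` is a hypothetical name and "removing one unit of [I]" is the only behaviour assumed:

```python
from collections import Counter

def anonymise(dims, removed=1):
    """Model anonymisation as dropping `removed` units of identity [I]."""
    result = Counter(dims)  # copy, so the input is left untouched
    result["I"] = max(result["I"] - removed, 0)
    if result["I"] == 0:
        del result["I"]
    return result

before = Counter({"I": 3, "T": 1, "P": 1})
after = anonymise(before)
print(dict(before), "->", dict(after))
```

Treating processing steps as functions from dimensions to dimensions means they can be chained along a data flow and the result inspected at each stage.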

And so on... Now, whether this really is dimensional analysis is another thing altogether. I largely doubt it in its current form, and I've certainly made no major effort to look into properties of dimensional analysis such as commensurability or other mathematical properties. I'm also wondering if that other favorite of mine - entropy - can be put in here somewhere, as a coefficient to the dimension possibly? I think that might be taking things too far and is ultimately confusing concepts.

I've had some successes in terms of applying this to data-flow modelling of information flows, and a couple of interesting results when we've discussed things such as legal consents, the application of a consent to data and the processing of that data. For example, take the humble IP address or CellID from the above example... the dimension of these is [I] (actually I have some subclass of Identity which deals with machine addresses), however both can be mapped to [L] fairly easily. Things such as expressing in a consent that we don't identify the source of information could mean mapping such things to other 'types' in other dimensions, and could actually end up not preserving privacy or even accidentally revealing more semantically interesting content...
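To illustrate the consent problem, here is a small sketch that checks a schema against a set of dimensions a consent rules out, taking into account what each field can be turned into as well as what it is labelled as. Both tables and the `violations` function are hypothetical; only the IP address and CellID cases come from the discussion above:

```python
# Dimensions a field is labelled with, and dimensions it can be mapped to.
# Both tables are made up for this example.
LABELLED = {"IPAddress": {"I"}, "CellID": {"I"}, "Timestamp": {"T"}}
DERIVABLE = {"IPAddress": {"L"}, "CellID": {"L"}}

def violations(schema, forbidden):
    """Fields whose labelled or derivable dimensions touch a forbidden one."""
    found = {}
    for field in schema:
        reachable = LABELLED[field] | DERIVABLE.get(field, set())
        hit = reachable & forbidden
        if hit:
            found[field] = hit
    return found

# A consent that rules out location: the labels alone look fine,
# but both address fields can still be mapped to [L].
print(violations(["IPAddress", "CellID", "Timestamp"], forbidden={"L"}))
```

A check on the labels alone would pass this schema, which is exactly the kind of accidental leak that the dimensional view makes visible.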
