I’ve been looking at dimensional analysis
as a technique to use for analyzing information flows, specifically for privacy.
After developing various taxonomies for information classification and some of the superstructure behind them (see here), one of the problems we have seen is evaluating information content from ontologies and data schemata: deciding whether the field “name” in ontology X has the same (or similar) semantics as a similarly named field in ontology Y.
Inspired by the technique of dimensional
analysis, one idea is to treat each ontology as you would a system of units
of measurement, e.g. imperial units versus metric units. Dimensional
analysis abstracts away from the units of measurement to a
small set of base or fundamental aspects. Typically these are length, mass and
time, denoted [L], [M] and [T].
For example, acceleration has dimensions
[L][T]^-2, which in a particular system of
measurement might be expressed as metres per second per second or furlongs
per day per aeon (just to mix things up).
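To make the abstraction concrete, here is a minimal sketch in Python that represents a dimension as a mapping from base symbol to exponent. The names combine, per and UNIT_DIMENSION are invented for this illustration and do not come from any library.

```python
from collections import Counter

def combine(*terms):
    """Multiply dimensions together by summing their exponents."""
    total = Counter()
    for dims in terms:
        for base, exp in dims.items():
            total[base] += exp
    # Drop any base whose exponents cancelled out to zero.
    return {base: exp for base, exp in total.items() if exp != 0}

def per(dims):
    """Invert a dimension: 'per second' turns [T] into [T]^-1."""
    return {base: -exp for base, exp in dims.items()}

# Illustrative unit table: each unit is reduced to its base dimension.
UNIT_DIMENSION = {
    "metre": {"L": 1},
    "furlong": {"L": 1},
    "second": {"T": 1},
    "day": {"T": 1},
    "aeon": {"T": 1},
}

U = UNIT_DIMENSION
metres_per_second_squared = combine(U["metre"], per(U["second"]), per(U["second"]))
furlongs_per_day_per_aeon = combine(U["furlong"], per(U["day"]), per(U["aeon"]))

# Both unit expressions reduce to the same dimension, [L][T]^-2.
print(metres_per_second_squared == furlongs_per_day_per_aeon)
```

The units differ wildly, but once reduced to base dimensions the two expressions are indistinguishable, which is the property borrowed for ontologies in what follows.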
When working with information systems, and
especially in the case of privacy where we need to classify information, we can
construct a set of “dimensions”. The choice of these is somewhat arbitrary, although
they should have some degree of orthogonality (I said this was inspired
by dimensional analysis!).
The dimensions I chose were: Personal,
Financial, Health, Time, Location, Identity and Content.
Aside: To save on space and to hint at the
dimensional analysis inspiration, I'll use the first letter of the
dimension name inside square brackets [ and ]...
We can have huge debates (and we did) about whether these are truly orthogonal and what happens when data elements or types are mapped to more than one 'dimension' - I don't think it matters too much at the moment, so let's put some of those difficulties aside.
Actually, each of these
is a top-level class in a taxonomy of information classification. For example,
the dimension [P] breaks down into Demographics ([P_D]) and Contact ([P_C]); other classes follow similarly, as shown in the diagram
below:
Figure: Information Type Taxonomy
Given a data schema, we can map it
into its dimensions in much the same way as is done with physical
quantities. For example, the schema:
UserID x DeviceID x CellID x Timestamp x Age
This would be mapped to [I]^3[T][P], meaning three units of identifiers, one of time
and one of personal information. As I stated earlier, we actually have
a hierarchy of dimensions, so we might break this down to [I_P][I]^2[T][P_D], where [I_P] is a personal identifier and [P_D] is
demographics, each being a subclass of [I] and [P] respectively.
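The mapping from schema to dimensions can be sketched in a few lines of Python. The FIELD_CLASS table below is a hypothetical assignment of each field in the example schema to a taxonomy class (it is an assumption for this illustration, not the full taxonomy), and exponents are written with ^.

```python
from collections import Counter

# Hypothetical field-to-class assignment for the example schema only.
# Subclasses are written as TOP_SUB, e.g. I_P = personal identifier.
FIELD_CLASS = {
    "UserID": "I_P",
    "DeviceID": "I",
    "CellID": "I",
    "Timestamp": "T",
    "Age": "P_D",
}

def top_level(cls):
    """Collapse a subclass such as 'P_D' to its top-level dimension 'P'."""
    return cls.split("_")[0]

def dimensions(schema, detailed=False):
    """Count how many fields of the schema fall into each class."""
    classes = [FIELD_CLASS[field] for field in schema]
    if not detailed:
        classes = [top_level(c) for c in classes]
    return Counter(classes)

def render(dims):
    """Format the counts in bracket notation, omitting exponents of 1."""
    return "".join(
        f"[{cls}]" + (f"^{n}" if n != 1 else "") for cls, n in dims.items()
    )

schema = ["UserID", "DeviceID", "CellID", "Timestamp", "Age"]
print(render(dimensions(schema)))                 # [I]^3[T][P]
print(render(dimensions(schema, detailed=True)))  # [I_P][I]^2[T][P_D]
```

Collapsing subclasses to their top-level dimension is just a coarser view of the same schema, so both forms can be computed from one table.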
The kinds of analysis this enables
include the quick identification of critical information content issues. In
the above case we have a mix of identities, which allows for potentially very precise
identification; we have time involved, which might allow profiling or tracking;
and we have an element of personal (demographic) information.
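A check of this kind could be sketched as a handful of rules over the dimension counts. The rules and thresholds in flag_issues below are assumptions made up for this illustration; they simply restate the three observations above and are not a complete or authoritative rule set.

```python
def flag_issues(dims):
    """Return a list of potential privacy issues for a dimension count.

    `dims` maps a top-level dimension letter to its number of units,
    e.g. {"I": 3, "T": 1, "P": 1}. The rules are illustrative only.
    """
    issues = []
    if dims.get("I", 0) >= 2:
        issues.append("multiple identifiers: precise identification possible")
    if dims.get("I", 0) >= 1 and dims.get("T", 0) >= 1:
        issues.append("identifier with time: profiling or tracking possible")
    if dims.get("P", 0) >= 1:
        issues.append("personal information present")
    return issues

issues = flag_issues({"I": 3, "T": 1, "P": 1})
for issue in issues:
    print(issue)
```

For the example schema all three rules fire, which matches the reading given in the text.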
Furthermore, there is even the chance that
one identifier can be mapped to a location: CellIDs can easily be transformed
into GPS coordinates and, over time, fairly easily be triangulated, especially
in an area densely populated by mobile base stations. So the above
example could equally be mapped as [I]^2[L][T][P], and the fact that a given data schema can be
expressed in more than one dimensional form does raise some interesting
concerns.
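One way to surface these alternative forms is to let a field carry more than one candidate dimension and enumerate every combination. This is a sketch under that assumption; the CANDIDATES table is hypothetical and only covers the example schema. The Timestamp field is still part of the schema, so [T] stays in every form.

```python
from collections import Counter
from itertools import product

# Hypothetical candidates: CellID is an identifier that can also be
# resolved to a location, so it is listed under both [I] and [L].
CANDIDATES = {
    "UserID": ["I"],
    "DeviceID": ["I"],
    "CellID": ["I", "L"],
    "Timestamp": ["T"],
    "Age": ["P"],
}

def all_forms(schema):
    """Enumerate the distinct dimensional forms a schema can take."""
    forms = []
    for choice in product(*(CANDIDATES[field] for field in schema)):
        form = dict(Counter(choice))
        if form not in forms:  # dict equality ignores key order
            forms.append(form)
    return forms

schema = ["UserID", "DeviceID", "CellID", "Timestamp", "Age"]
forms = all_forms(schema)
for form in forms:
    print(form)
```

A schema with more than one form is a signal that a consent or policy written against one reading may not hold under another.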
If we have some functions that process
data, say a function that anonymises identities (we can have the discussion
about what anonymisation means later – please don’t mention hashing functions!), then applying
it might map our original dimensions [I]^3[T][P] to
[I]^2[T][P]: an improvement in terms of moving towards anonymity, maybe.
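At the level of dimensions, such a processing function is just a transformation from one dimension count to another. The sketch below models only that effect; the name anonymise and its parameters are invented for illustration, and it says nothing about how the underlying data would actually be anonymised.

```python
def anonymise(dims, cls="I", units=1):
    """Return a copy of `dims` with `units` fewer units of class `cls`.

    Models only the dimensional effect of a processing step,
    e.g. {"I": 3, "T": 1, "P": 1} -> {"I": 2, "T": 1, "P": 1}.
    """
    remaining = dims.get(cls, 0) - units
    if remaining < 0:
        raise ValueError(f"cannot remove {units} unit(s) of [{cls}] from {dims}")
    out = dict(dims)
    if remaining == 0:
        out.pop(cls, None)  # a class with no units left disappears entirely
    else:
        out[cls] = remaining
    return out

before = {"I": 3, "T": 1, "P": 1}
after = anonymise(before)
print(before, "->", after)
```

Chaining such functions along a data flow gives the dimensions of the data at each step, which is where comparisons between steps become possible.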
And so on... Whether this really is dimensional analysis is another thing altogether. I doubt it, at least in the current form, and I've certainly made no major effort to look into properties of dimensional analysis such as commensurability or other mathematical properties. I'm also wondering whether that other favorite of mine, entropy, can be put in here somewhere, possibly as a coefficient to the dimension. I think that might be taking things too far and would ultimately confuse concepts.
I've had some success applying this to data-flow modelling, and a couple of interesting results when we've discussed things such as legal consents, the application of a consent to data, and the processing of that data. For example, take the humble IP address or CellID from the above example: the dimension of these is [I] (actually I have a subclass of Identity which deals with machine addresses), but both can be mapped to [L] fairly easily. Stating in a consent that we don't identify the source of information could therefore mean mapping such things to other 'types' in other dimensions, and end up not preserving privacy or even accidentally revealing more semantically interesting content...