As discussed in part 1, we identified the problems of data siloing that result from the proliferation of databases specific to individual services, apps etc.
These siloed databases have massive amounts of overlapping content but generally cannot be semantically matched and used together. This prevents consolidation and unification of the data, which not only leads to further proliferation and irreconcilable duplication of data but also actively prevents more expansive applications and analytics from being built on that data.
To start solving this data siloing problem we investigate, initially, two aspects to enable de-isolation of data:
- linkability of identifiers
- interoperability of semantics
In this posting I'll concentrate on identifiers and discuss semantic interoperability in a later part.
Identifiers here really mean the primary keys used across the many siloed data sets. These come in a number of forms: user IDs, email addresses (typically used as usernames), device IDs, session IDs etc, as well as encrypted, hashed and otherwise processed versions of these. Structured or compound variants add further to the complexity.
As a privacy person I could also mention so-called anonymous data sets, where identifiers can be inferred from other properties such as locations that recur consistently over the data. I tend to explain this as the "2 location problem": no personal identifiers are stored, but a small number of locations, eg: the start and end points of navigation routes, can be used as inferred identifiers.
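To make the "2 location problem" concrete, here is a minimal sketch. The route records, the coordinate rounding and the function name `inferred_identifiers` are all assumptions made up for illustration; the point is only that a recurring (start, end) pair behaves like a key even though no identifier is stored.

```python
from collections import defaultdict

# Hypothetical "anonymous" route log: no user or device ID is stored,
# only the (rounded) start and end coordinates of each route.
routes = [
    {"start": (60.17, 24.94), "end": (60.21, 24.66)},
    {"start": (60.17, 24.94), "end": (60.21, 24.66)},
    {"start": (60.45, 22.27), "end": (60.17, 24.94)},
    {"start": (60.17, 24.94), "end": (60.21, 24.66)},
]


def inferred_identifiers(records, min_count=2):
    """Group records by (start, end) pair and keep the pairs that recur."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[(record["start"], record["end"])].append(index)
    # A pair seen at least min_count times acts as an inferred identifier.
    return {pair: rows for pair, rows in groups.items() if len(rows) >= min_count}


recurring = inferred_identifiers(routes)
for pair, rows in recurring.items():
    print(pair, "links records", rows)
```

Here three of the four records share the same start and end points, so they can be linked to each other without any explicit identifier being present.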
Aside: Making this more interesting is profiling based upon a deeper, more semantic investigation on the contents irrespective of the key or identifier present in the database, cf: AOL search logs. We do not discuss this here at this time.
The main problems regarding linkability are:
- the semantics of the identifier
- the structure of the identifier
- the representation of the identifier
The semantics of an identifier relate to which concepts that identifier represents. Taking unique user identifiers such as usernames as an example, we need to understand how these relate to a person. It is often taken for granted that a user identifier equals a unique person. Similarly, device identifiers and addresses such as IP addresses are often equated with a single machine.
For example, we might have a structured or compound identifier containing a user ID which is matched against one device ID which is further composed of individual session IDs. We might also form a view of the real-world as shown in green. The red dashed lines show how we relate our identifier concepts with real-world concepts.
Note how what was a seemingly simple mapping can now be complicated by other factors, such as whether the device ID refers to something the identified user owns or merely uses. There is also an interesting mismatch between the multiplicities in the identifier structure and those in the real-world. We can argue that the above is a poor model of the real-world, but it serves the purpose of focusing discussion on what we want the identifiers to actually identify.
Identifying the real-world concepts and then understanding how the identifiers’ semantics are grounded in these gives us our first clue into what can and cannot be successfully unified. This process has to be repeated for each individual data set or asset being considered and assurance sought that the semantics or real-world mappings do coincide sufficiently such that we can be sure that the pairs of identifiers are really referring to the same concepts, ie: they are both identifying the same things at the same level of granularity.
The structure of the identifier refers to whether that identifier acts as a compound key. A typical example is a mix of, say, user identifiers, device identifiers and session identifiers. While we might have identified a mix of many-to-many relationships between the real-world concepts, at this level we start to see some kind of invariants over that structure. Ideally these should narrow the space of possible mappings from identifiers to real-world concepts.
Additionally we also have to look at the temporal aspect of the identifiers: does there exist a strong compositional structure versus a looser aggregate structure over time?
Note that we actually encounter the structure while working out the semantics. We present it second, however, to emphasise that the focus should be on the semantics of the identifiers and not on their internal construction.
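As a sketch of what an invariant over a compound identifier might look like, consider the following. The `CompoundId` type, the sample values and the specific invariant (a session belongs to exactly one device) are assumptions chosen for illustration, not a description of any particular system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CompoundId:
    """Hypothetical compound key: user, device and session components."""
    user_id: str
    device_id: str
    session_id: str


def sessions_with_multiple_devices(ids):
    """Return the sessions that break the assumed one-device-per-session invariant."""
    devices_by_session = defaultdict(set)
    for cid in ids:
        devices_by_session[cid.session_id].add(cid.device_id)
    return {s: d for s, d in devices_by_session.items() if len(d) > 1}


observed = [
    CompoundId("alice", "phone-1", "s-001"),
    CompoundId("alice", "phone-1", "s-002"),
    CompoundId("bob", "phone-1", "s-003"),     # one device shared by two users
    CompoundId("alice", "tablet-7", "s-002"),  # same session seen on a second device
]

violations = sessions_with_multiple_devices(observed)
print(violations)
```

Note that the shared device is not a violation here: user-to-device stays many-to-many, while the session-to-device relationship is the one we assumed to be invariant.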
The representation of an identifier can cause some problems, and here we refer in particular to obfuscated identifiers that have been transformed using hashing or encryption. Encrypted identifiers can be reversed to reveal their original forms by anyone holding the key, whereas cryptographic hashing is one-way. The latter should always be used with a suitable salt to add randomness to the hash. Doing this may turn an identifier into a kind of session identifier rather than one that identifies a real-world person or device; whether it does depends greatly on how often the salt is regenerated.
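The effect of salt regeneration can be sketched as follows. This is a minimal illustration only: the function name `pseudonymise`, the use of SHA-256 and the example email address are assumptions, and a production system may well prefer a keyed construction such as HMAC.

```python
import hashlib
import secrets


def pseudonymise(identifier: str, salt: bytes) -> str:
    """Salted one-way hash of an identifier (illustrative sketch only)."""
    return hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()


# With a long-lived salt the hashed value is stable, so it still works as a
# linkable identifier for the same person or device.
fixed_salt = secrets.token_bytes(16)
first = pseudonymise("alice@example.com", fixed_salt)
second = pseudonymise("alice@example.com", fixed_salt)

# With a freshly generated salt the same input hashes to a different value,
# so the result behaves more like a session identifier.
regenerated = pseudonymise("alice@example.com", secrets.token_bytes(16))

print(first == second)       # same salt, same pseudonym
print(first == regenerated)  # new salt, different pseudonym
```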
When dealing with hashed identifiers we will find partial matches, typically when working with session identifiers. This leads to various questions about anonymity, especially when we can match the contents of the partial identifiers to "accidentally" reveal more of the structure. At the very least we can reduce the isolation of a data silo to some intermediate level, for example by linking at the device or session level rather than at the level of a specific real-world person.
One might argue that we have addressed our concerns in some kind of reverse order; starting with semantics. However the key to understanding any information system is to understand what real-world concepts that information system is actually modeling and working “backwards” gives us the framework in which to perform our analysis of the linkability of identifiers.
Once the identifiers in two data sets or assets have been linked based upon the correspondence of their representation, syntax and semantics, we have the initial unification.
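Putting the three checks together, a minimal sketch of this initial unification might look like the following. The `IdentifierSpec` type, its field values and the `unify` function are hypothetical names introduced for illustration; real correspondence checks would be far richer than simple equality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentifierSpec:
    """Hypothetical description of how one data set uses its identifier."""
    concept: str          # semantics: the real-world concept identified
    structure: tuple      # syntax: the fields that make up the key
    representation: str   # eg: "plain" or "sha256/salt-A"


def linkable(a: IdentifierSpec, b: IdentifierSpec) -> bool:
    """Identifiers are linkable only if all three aspects correspond."""
    return (a.concept == b.concept
            and a.structure == b.structure
            and a.representation == b.representation)


def unify(spec_a, rows_a, spec_b, rows_b):
    """Merge rows sharing a key, or return None if the specs do not correspond."""
    if not linkable(spec_a, spec_b):
        return None
    shared = rows_a.keys() & rows_b.keys()
    return {key: {**rows_a[key], **rows_b[key]} for key in shared}


accounts = IdentifierSpec("person", ("user_id",), "plain")
navigation = IdentifierSpec("person", ("user_id",), "plain")
telemetry = IdentifierSpec("device", ("device_id",), "plain")

rows_a = {"u1": {"email_opt_in": True}, "u2": {"email_opt_in": False}}
rows_b = {"u1": {"last_route": "home-work"}, "u3": {"last_route": "work-gym"}}

merged = unify(accounts, rows_a, navigation, rows_b)
refused = unify(accounts, rows_a, telemetry, rows_b)
print(merged)
print(refused)
```

The second call is refused even though the key values could be compared as strings: the two identifiers refer to different real-world concepts, so a match on value alone would be meaningless.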