As discussed in part 1, we identified the problem
of data siloing, which results from the proliferation of databases specific to individual services, apps etc.
These siloed databases have massive amounts of overlapping content but generally cannot be semantically matched and used together. This prevents consolidation and unification of the data, which not only leads to more proliferation and irreconcilable duplication of data but also actively prevents more expansive applications and analytics of that data from being created.
To start solving this data siloing problem we investigate,
initially, two aspects to enable de-isolation of data:
- linkability of identifiers
- interoperability of semantics
In this posting I'll concentrate on
identifiers and discuss semantic interoperability in a later part.
Identifiers really mean the primary keys
used over the plurality of siloed data sets. These come in a number of forms:
user IDs, email (typically as username), device IDs, session IDs etc as well as
encrypted, hashed and processed versions of these. Additionally structured or compound
variants add to the complexity.
As a privacy person I could also mention so-called anonymous data sets, where identifiers can be inferred from other properties such as consistent locations over the data. I tend to explain this as the "2 location problem": no personal identifiers are stored, but a number of locations, eg: start and end points in navigation routes, could be used as inferred identifiers.
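To make the "2 location problem" concrete, here is a minimal sketch in Python. The route records, the coordinates and the function names are all hypothetical, and rounding coordinates to three decimal places is just one assumed way of deciding that two points are "the same place".

```python
from collections import defaultdict

# Hypothetical "anonymous" route records: no user ID, only start and end points.
routes = [
    {"start": (60.1699, 24.9384), "end": (60.2052, 24.6561)},
    {"start": (60.1701, 24.9383), "end": (60.2053, 24.6559)},
    {"start": (60.4518, 22.2666), "end": (60.1699, 24.9384)},
    {"start": (60.1698, 24.9382), "end": (60.2051, 24.6562)},
]


def location_pair_key(route, decimals=3):
    """Round both endpoints so that nearby points collapse to the same key."""
    start = tuple(round(c, decimals) for c in route["start"])
    end = tuple(round(c, decimals) for c in route["end"])
    return (start, end)


def infer_identifiers(routes, min_repeats=2):
    """Group routes by rounded (start, end) pair.

    A pair that recurs at least `min_repeats` times behaves like an
    identifier even though none was ever stored.
    """
    groups = defaultdict(list)
    for index, route in enumerate(routes):
        groups[location_pair_key(route)].append(index)
    return {key: idxs for key, idxs in groups.items() if len(idxs) >= min_repeats}


inferred = infer_identifiers(routes)
```

Here three of the four routes share the same rounded start and end points, so that pair of locations acts as an inferred identifier for whoever makes that journey.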
Aside: Making this more interesting is
profiling based upon a deeper, more semantic investigation on the contents
irrespective of the key or identifier present in the database, cf: AOL search logs. We do not discuss this here at this time.
The main problems regarding linkability
are:
- the semantics of the identifier
- the structure of the identifier
- the representation of the identifier
The semantics
of an identifier relate to which concepts that identifier represents. Taking unique
user identifiers, for example usernames, we need to understand how these
relate to a person. It is often taken for granted that a user identifier
equals a unique person. Similarly, device identifiers and addresses such as
IP addresses are often equated with a single machine.
For example, we might have a structured or compound identifier containing a user ID which is matched against one device ID which is further composed of individual session IDs. We might also form a view of the real-world as shown in green. The red dashed lines show how we relate our identifier concepts with real-world concepts.
Note how what was a seemingly simple mapping may now be complicated by other factors, such as whether the device ID refers to something the identified user owns or merely uses. There is also an interesting mismatch between the multiplicities in the identifier structure and the real world. We can argue that the above is a poor model of the real world, but it serves the purpose of focusing discussion on what we want the identifiers to actually identify.
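The mismatch between identifier structure and real-world concepts can be sketched as a small data model. Everything here is an assumption made for illustration: the class names, the "owns"/"uses" relation and the sample people and machines are hypothetical, and in practice the real-world side is exactly what we do not know for certain.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CompoundIdentifier:
    """Identifier structure: one user ID, one device ID, many session IDs."""
    user_id: str
    device_id: str
    session_ids: Tuple[str, ...]


@dataclass(frozen=True)
class Grounding:
    """Hypothetical link from an identifier to the real-world concepts behind it."""
    identifier: CompoundIdentifier
    person: str
    machine: str
    relation: str  # "owns" or "uses"


groundings = [
    Grounding(CompoundIdentifier("u-100", "d-1", ("s-1", "s-2")), "Alice", "family tablet", "owns"),
    Grounding(CompoundIdentifier("u-100", "d-1", ("s-3",)), "Bob", "family tablet", "uses"),
    Grounding(CompoundIdentifier("u-200", "d-2", ("s-4",)), "Alice", "work laptop", "uses"),
]


def people_behind(groundings, attribute):
    """Map each value of an identifier attribute to the set of people behind it."""
    people = defaultdict(set)
    for g in groundings:
        people[getattr(g.identifier, attribute)].add(g.person)
    return dict(people)


people_per_user_id = people_behind(groundings, "user_id")
people_per_device_id = people_behind(groundings, "device_id")
```

In this made-up data the user ID "u-100" is shared by two people and one person appears under two user IDs, so "user identifier equals unique person" fails in both directions.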
Identifying the real-world concepts and
then understanding how the identifiers’ semantics are grounded in these gives
us our first clue into what can and cannot be successfully unified. This
process has to be repeated for each individual data set or asset being
considered and assurance sought that the semantics or real-world mappings do coincide sufficiently such that we can be sure that the pairs of identifiers are really referring to the same concepts, ie: they are both identifying the same things at the same level of granularity.
The structure
of the identifier refers to whether that identifier acts as a compound key.
Often seen is a mix of, say, user identifiers, device identifiers and
session identifiers. While we might have identified a mix of many-to-many
relationships between the real-world concepts, at this level we start to see
some kind of invariants over that structure. Ideally this should refine the
space of configurations of identifiers to real-world concepts.
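One way to look for such invariants is to test them against the compound keys actually observed. The sketch below assumes keys of the form (user ID, device ID, session ID) and two candidate invariants of my own choosing: every session belongs to exactly one device, and every device to exactly one user. The records and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical observed compound keys: (user_id, device_id, session_id).
records = [
    ("u1", "d1", "s1"),
    ("u1", "d1", "s2"),
    ("u2", "d2", "s3"),
    ("u3", "d2", "s4"),
]


def invariant_violations(records, child_index, parent_index):
    """Return each child value that is seen with more than one parent value."""
    parents = defaultdict(set)
    for record in records:
        parents[record[child_index]].add(record[parent_index])
    return {child: seen for child, seen in parents.items() if len(seen) > 1}


# Candidate invariant: each session ID belongs to exactly one device ID.
session_violations = invariant_violations(records, child_index=2, parent_index=1)

# Candidate invariant: each device ID belongs to exactly one user ID.
device_violations = invariant_violations(records, child_index=1, parent_index=0)
```

With this data the session-to-device invariant holds, while device "d2" is seen with two user IDs, which tells us the device level cannot be treated as identifying a single user.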
Additionally we also have to look at the
temporal aspect of the identifiers: does there exist a strong compositional
structure versus a looser aggregate structure over time?
Note that we actually encounter the structure when working out the semantics; we present it second, however, to emphasise concentration on the semantics of the identifiers rather than their internal construction.
The representation
of an identifier can cause some problems, and we particularly refer here to obfuscated
identifiers that have been transformed using hashing or encryption. Encrypted
identifiers can always be reversed to their original forms by anyone holding the key, whereas cryptographic
hashing is one-way. The latter should always be used with a suitable salt to
add randomness to the hash. Doing this may turn an identifier into a kind of
session identifier rather than one that identifies a real-world person or
device; this depends greatly upon how often the salt, and hence the hash, is regenerated.
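The effect of salt regeneration can be shown with a short sketch. I use HMAC-SHA256 keyed with the salt from Python's standard library; that is my choice for illustration rather than a prescribed scheme, and the email address and variable names are hypothetical.

```python
import hashlib
import hmac
import secrets


def salted_hash(identifier: str, salt: bytes) -> str:
    """One-way, salted hash of an identifier (HMAC-SHA256 keyed with the salt)."""
    return hmac.new(salt, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


email = "alice@example.com"  # hypothetical identifier

# Case 1: one long-lived salt. The hash is stable, so it still works as a
# linkable identifier for the same person across sessions.
stable_salt = secrets.token_bytes(16)
stable_a = salted_hash(email, stable_salt)
stable_b = salted_hash(email, stable_salt)

# Case 2: the salt is regenerated for every session. The same email now
# produces unrelated hashes, so each one is effectively a session identifier.
session_a = salted_hash(email, secrets.token_bytes(16))
session_b = salted_hash(email, secrets.token_bytes(16))
```

With the stable salt the two hashes are equal; with a fresh salt per session they differ, so records from different sessions can no longer be joined on that key alone.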
When dealing with hashed identifiers we will find partial matches, typically when working with session identifiers. This leads to various questions about anonymity, especially when we can match the contents of the partial identifiers to "accidentally" reveal more of the structure. At worst we can reduce the isolation of a data silo only to some internal level, for example device or session, rather than to a specific real-world person.
One might argue that we have addressed
our concerns in some kind of reverse order; starting with semantics. However
the key to understanding any information system is to understand what real-world
concepts that information system is actually modeling and working “backwards”
gives us the framework in which to perform our analysis of the linkability of
identifiers.
Once identifiers in two data sets or assets have been
linked based upon the correspondence of their representation, syntax and
semantics then we have the initial unification.
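As a rough sketch of that initial unification, assume two hypothetical data sets that both identify account holders by email: one stores the raw address, the other a salted SHA-256 hash made with a salt both parties know. The data, the salt, the concept labels and the function names are all invented for illustration.

```python
import hashlib

SHARED_SALT = b"hypothetical-shared-salt"  # assumed to be known to both data owners


def normalise(email: str) -> str:
    """Bring raw identifiers into one representation before hashing."""
    return email.strip().lower()


def obfuscate(email: str) -> str:
    """Salted one-way hash, as assumed to be used by the second data set."""
    return hashlib.sha256(SHARED_SALT + normalise(email).encode("utf-8")).hexdigest()


# Data set A is keyed by raw email, data set B by the hashed email.
set_a = {"Alice@Example.com": {"plan": "premium"}, "bob@example.com": {"plan": "free"}}
set_b = {
    obfuscate("alice@example.com"): {"last_route": "home-work"},
    obfuscate("carol@example.com"): {"last_route": "home-gym"},
}

# Semantics check: both keys must be grounded in the same real-world concept.
concept_a = "account holder"
concept_b = "account holder"


def link(set_a, set_b, concept_a, concept_b):
    """Join the two data sets on the hashed key, refusing mismatched semantics."""
    if concept_a != concept_b:
        raise ValueError("identifiers do not refer to the same real-world concept")
    linked = {}
    for email, row_a in set_a.items():
        key = obfuscate(email)
        if key in set_b:
            linked[key] = {**row_a, **set_b[key]}
    return linked


linked = link(set_a, set_b, concept_a, concept_b)
```

Only Alice appears in both sets, so the result holds a single unified record combining her plan and her last route.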