Wednesday, 30 May 2012

Semantic Isolation (Pt.1)

Ora Lassila and I have been working on a definition of the term "data silo" in order to clarify our ideas of semantic isolation as applied to databases, data assets, and the interoperability and integration of the information they contain.

The term "silo" when applied to databases, data and information is interesting in that it occurs in a number of statements, such as, "my app data is siloed" or "we need to break the data silos" and so on.

The term is, however, used inconsistently and in many cases misleadingly; quite simply, it is used to cover a large number of overlapping scenarios.

Understanding the scope and meaning of this term in its various contexts is central to understanding the interoperability problem in a practical sense. In a workshop today I heard the term used in a large number of ways and also applied to the notion of interoperability. The term "silo" has been used to mean (at least!):
  • The data is siloed because it exists in its own database infrastructure
  • The data is siloed because it is only accessible via some access control
  • The data is siloed because it is in its own representation format
  • The data is siloed because it is not understandable/translatable (semantics)
We can present these as some kind of "lock-in" or "siloing continuum", where the usages on the left relate more to physical aspects and those on the right relate more to semantics in the information sense:

We can obviously create a more granular continuum (indeed that's what a continuum should allow), but the point here is at least to present some kind of ordering over the differing uses of the term. The ordering runs from physical deployment and implementation through to abstract semantics.

Now it seems that when people talk about "breaking the [data] silos" they are actually referring to enabling interoperability of the data between differing services, and often this is addressed at the physical database or access control level. Occasionally the discussion gets mixed and the syntax and representation of data are addressed as well.

Interoperability of information starts at the semantic level and works in reverse (right to left) through the above continuum; the physical, logical, access control and syntax levels should not prevent sharing and common understanding of data. For example, if one tackles interoperability of information by standardising on syntax or representation (e.g. JSON vs XML) then the result will be two sets of data that can't be merged because they don't have the same meaning; similarly, at the other end of the continuum, centralising databases (physically or logically) doesn't result in interoperability - maybe easier system management, but never interoperability of information.
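To make the syntax point concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: the two services, the field names, and the "meaning" descriptions are assumptions, not taken from any real system. It only shows that two records can agree completely on representation while still disagreeing on what a field means.

```python
import json

# Hypothetical records from two services. Both use the same JSON structure.
service_a = json.loads('{"id": "42", "balance": 100}')
service_b = json.loads('{"id": "42", "balance": 100}')

# Hypothetical, informal statements of what each field means in each service.
meaning_a = {"id": "customer number", "balance": "amount of money in EUR"}
meaning_b = {"id": "customer number", "balance": "loyalty points"}


def same_syntax(a, b):
    """True when both records have the same fields with the same value types."""
    return a.keys() == b.keys() and all(type(a[k]) is type(b[k]) for k in a)


def same_meaning(m_a, m_b):
    """True when every field is declared to mean the same thing in both services."""
    return m_a == m_b


syntactically_compatible = same_syntax(service_a, service_b)
semantically_compatible = same_meaning(meaning_a, meaning_b)

print("same syntax: ", syntactically_compatible)
print("same meaning:", semantically_compatible)
```

Agreeing on JSON gets the first check to pass, but merging the two "balance" values would still be meaningless, which is why the syntax level cannot be where interoperability starts.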

I had an extremely interesting discussion about financial systems: interoperability between these is extremely high, even at the application (local usage) level, simply because the underlying semantics of any financial system are unified. The notions of profit, loss, debit and credit, the translations between the meanings of things such as dollars, yen, euros and pounds, and the mathematics of financial values are formally defined and unambiguously understood, even if the mechanics of financial and economic systems aren't; but that's a different aspect altogether.

Another important point here is that the link between financial concepts and real-world concepts and objects is relatively easily definable. Indeed, probably all real-world concepts and objects have their semantics defined in terms of financial transactions and concepts. Thus siloing of data in the financial world can probably only occur at the access control level.

The requirement for breaking the silos is easily understood as the ability to cross-reference two different data sets and be sure (within certain bounds) that the meaning of the information contained therein is compatible. We want to perform things such as "1 + one equals 2" and be sure that the concept of "one" is the same as "1", and that the definition of "+" matches the concept of "+" applied to things such as "1", "2" etc. as well as things such as "one", "two" etc. In this case the common semantics of "1" and "one" has been defined...fortunately.
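The "1 + one" case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup table standing in for the "common semantics" and the function names are hypothetical, and a real data set would need something far richer than a dictionary.

```python
# Hypothetical common semantics: both representations are translated onto
# one shared concept (here, a Python integer).
COMMON_SEMANTICS = {"1": 1, "2": 2, "one": 1, "two": 2}


def meaning(token):
    """Translate a token into the shared concept, or fail if none is defined."""
    if token not in COMMON_SEMANTICS:
        raise KeyError(f"no common semantics defined for {token!r}")
    return COMMON_SEMANTICS[token]


def add(left, right):
    """Apply "+" to the shared concepts rather than to the representations."""
    return meaning(left) + meaning(right)


result = add("1", "one")
print(result)
```

The addition only works because both "1" and "one" have a translation into the same concept; a token with no agreed translation fails outright rather than being silently merged.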

It is vitally important to understand that if we can unify data sets through translations via common semantics then the siloing of data breaks and we get data liberation, or what some call data democratisation. Unification of semantics, however, is fraught with difficulties [1], but it is the key prerequisite to integration and interoperability and ultimately to a more expansive usage of that information.


[1] Ian Oliver, Ora Lassila (2011). Integration "In The Large". Position paper accepted at the W3C Workshop on Data and Services Integration, October 20-21 2011, Bedford, MA, USA.


Ora said...

This idea of "working from right to left" is important. People easily miss that one.

said...

Excellent post, and I agree with Ora that the idea of starting from semantics is critical.

One thing missing from the post is the role that unique identifiers play in unifying semantics between systems. Yes, financial systems are readily integrated, but only when a particular system of entity identification has been agreed upon, and/or equivalence between legal entity identifiers has been established.

Ian Oliver said...

Identifiers are coming in part 2; which was supposed to be part 1 until I realised I had to define the term silo first...the identifiers stuff is *really* fun...!