Sunday 21 July 2013

Big Metadata

One of the issues I see with auditing systems for privacy compliance is actually understanding what data they are holding. Often the project teams themselves don't understand their databases and log files sufficiently. Worse, misinterpretation of the NoSQL and BigData approaches has left us in a situation where schemata can be forgotten - or at least defined only implicitly at run-time. The dogma is that relational databases have failed and that all this "non-agile", predefined, waterfall, defined schemata stuff is a major part of the failure.

Losing all this information about types and semantics is a huge problem because we can no longer be fully sure of the consistency and integrity of the data, or of the relationships of that data to other objects and structures.

We are also then losing the opportunity to add additional information in the form of aspects to the data, for example security classifications, broad usage classifications, and so on. This leads to embedding much of the information about the data statically into the algorithms that operate over that data, which in turn hides the meaning of the data away from the data itself.
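As a rough illustration of what keeping aspects with the data (rather than in the algorithms) might look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Security` and `Usage` classifications, the `LocationEvent` record and the `classified` helper are all hypothetical names, and a real system would take its classification scheme from its own privacy policy.

```python
from dataclasses import dataclass, field, fields
from enum import Enum


class Security(Enum):
    # Hypothetical security classification levels.
    PUBLIC = "public"
    CONFIDENTIAL = "confidential"
    SECRET = "secret"


class Usage(Enum):
    # Hypothetical broad usage classifications.
    OPERATIONAL = "operational"
    ANALYTICS = "analytics"
    MARKETING = "marketing"


def classified(security, usage):
    """Declare a required field whose classification travels with the schema."""
    return field(metadata={"security": security, "usage": frozenset(usage)})


@dataclass
class LocationEvent:
    # The aspects are attached to the schema, not buried in the code that
    # processes the records.
    device_id: str = classified(Security.CONFIDENTIAL, {Usage.OPERATIONAL})
    latitude: float = classified(Security.SECRET, {Usage.OPERATIONAL, Usage.ANALYTICS})
    longitude: float = classified(Security.SECRET, {Usage.OPERATIONAL, Usage.ANALYTICS})
    app_version: str = classified(
        Security.PUBLIC, {Usage.OPERATIONAL, Usage.ANALYTICS, Usage.MARKETING}
    )


def fields_allowed_for(record_type, usage):
    """List the field names of a record type that are permitted for a given usage."""
    return [f.name for f in fields(record_type) if usage in f.metadata["usage"]]


def fields_at_level(record_type, security):
    """List the field names of a record type carrying a given security level."""
    return [f.name for f in fields(record_type) if f.metadata["security"] is security]
```

An auditor (or a build-time check) can then ask the schema directly, for example `fields_allowed_for(LocationEvent, Usage.MARKETING)`, instead of reading through the processing code to work out what each field means.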

I think the article entitled "Big Data success needs Big Metadata" by Simon James Gratton of CapGemini sums it up quite well: forgetting about the meaning of data will seriously compromise our ability to understand, use and integrate the data in the first place!

To achieve this, good old-fashioned data classification and cataloguing are required. Ironically, this is exactly the stuff that database developers used to do before the onset of the trend towards making everything schema-free.

Together with suitably defined aspects and ontologies that describe information (meta-information?), in much the same way as the OMG's MOF but with additional structure and semantics, we already have much, if not all, of the required infrastructure.

Then the process side of things needs to ensure that development of systems, services, analytics and applications integrates with this - in that whatever data they store (locally or in the cloud) gets recorded (and updated!). That's probably the hardest part!
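To make the recording step a little more concrete, here is a minimal sketch of a catalogue that systems register their data holdings with. It is only an assumption about how such a process could be supported: `CatalogueEntry`, `DataCatalogue`, the dataset names and the classification labels are all hypothetical, and a real catalogue would need persistence, ownership and review workflows on top.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogueEntry:
    dataset: str     # name of the database, table or log file
    owner: str       # team responsible for the data
    location: str    # where it is stored, e.g. "local" or "cloud"
    fields: dict     # field name -> classification label
    updated_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class DataCatalogue:
    """In-memory catalogue of what data each system stores."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        """Record a dataset, replacing any earlier entry with the same name."""
        self._entries[entry.dataset] = entry

    def unregistered(self, datasets_in_use):
        """Return the datasets a system uses that nobody has recorded."""
        return sorted(set(datasets_in_use) - set(self._entries))

    def holding(self, classification):
        """Return the datasets with at least one field of the given classification."""
        return sorted(
            name
            for name, entry in self._entries.items()
            if classification in entry.fields.values()
        )


catalogue = DataCatalogue()
catalogue.register(
    CatalogueEntry(
        dataset="location_events",
        owner="maps-team",
        location="cloud",
        fields={"device_id": "confidential", "latitude": "secret"},
    )
)
```

A check such as `catalogue.unregistered(["location_events", "crash_logs"])` could then run as part of a release process and flag `crash_logs` as data that is stored but never catalogued, which is where the process discipline comes in.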
