Thursday, 19 September 2013

Where are all the Information Scientists?

What happened to all the scientists and engineers with a real, deep understanding of what information is? We're now apparently firmly in the era of Big Data, and yet most Big Data implementations seem to stop at building large Hadoop clusters and running modified versions of the canonical word-counting program over Apache log files.
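For the avoidance of doubt, the canonical job I mean is roughly the following; a minimal single-process sketch in Python (a real Hadoop deployment merely distributes the same loop across a cluster, and the log handling here is deliberately naive):

    import sys
    from collections import Counter

    # Count whitespace-separated tokens in Apache access-log lines from stdin.
    counts = Counter()
    for line in sys.stdin:
        counts.update(line.split())

    # Emit "token<TAB>count", most frequent first.
    for token, n in counts.most_common():
        print(token + "\t" + str(n))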

I had the pleasure of working with a group of people who were blessed with PhDs in mathematics, physics and statistics and yet they complained bitterly about the lack of good, interesting and relevant data in order to properly ply their skills and generate valuable insights.

I know of one person who is utilising some quite cool natural language processing over massive data sets to extract all kinds of context, and yet no-one seems to be interested in what this is achieving.

The current Gartner Hype Cycle for Big Data has us on the edge of the precipice, about to slide into the Trough of Disillusionment, and once again, when the negative press starts, everyone will be surprised. Worse, adopters of Big Data and analytics will not question why it went wrong but will instead blame Big Data for not delivering.

One of the reasons why Big Data is heading for this "failure" is that we do not understand, or more accurately do not want to understand, our data. We do not want to invest in understanding the subtle and complex interactions between the various aspects of our data: its classification structures, its processing and so on.

Indeed, I find much of the current trend towards NoSQL to be an excuse for ignoring the very basics of understanding data. Apparently SQL is bad: I'm told that SQL means static table structures, fixed schemata and inflexible query languages, and that all that relational theory about data organisation is obsolete and irrelevant. Multi-terabyte "traditional" databases are uncool, but cloudified Linux clusters in AWS running Hadoop and canonical word-counting programs over web log files are COOL? At one level we've just made OS/360 and COBOL more complicated without doing much more than was possible 30 years ago.
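And to underline the point: that entire "canonical" job is a one-line aggregate in the unfashionable relational world. A sketch using SQLite, with a table invented purely for illustration:

    import sqlite3

    # A throwaway in-memory database standing in for a "traditional" relational
    # store; the table and its contents are invented for illustration.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE access_log (token TEXT)")
    db.executemany("INSERT INTO access_log VALUES (?)",
                   [("GET",), ("GET",), ("POST",)])

    # The whole word count as one declarative query -- 1970s relational theory.
    for token, n in db.execute(
            "SELECT token, COUNT(*) AS n FROM access_log "
            "GROUP BY token ORDER BY n DESC"):
        print(token, n)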

Have we really forgotten how to manage and construct our data so that it becomes information? Or, as I also worry, have we lost the ability to understand what our data could tell us, blinded by the hype behind a technology? Not that it hasn't happened before, of course.

Much of our emphasis is on building products and getting cool apps to users; sometimes we get data out of these apps, either by primary or secondary means. Surprisingly little thought is given to what data we collect and how we could really understand it, and even less to the interactions between different datasets and what those could tell us.

Rather than understanding the context in which our information exists we've become obsessed with generating abstract counts of things which translate easily into bar charts on a PowerPoint slide. We're failing at extracting the deeper context.

It's fine to learn that an application is being used by one million users this week and two million the next, but does this really tell you how people are using that app, who is using it, why, where and for what? Even more importantly, we never consider the inverse question: why aren't people using my app?

Consider air miles or store cards: when was the last time an airline or supermarket contacted you to ask why you are, or are not, flying or shopping with them? I can guarantee that the airline or supermarket has counts of how many items were bought or flights flown... hardly Big Data and deep analytics, is it?

To solve this we really do need the return of the information scientist: someone who understands data, information and knowledge, who understands taxonomies, ontologies and their mathematical underpinnings. He or she also needs to know how to implement these, and how best to organise a database or data structure, whether it be physically implemented as a triple store, a relational database, a flat log file and so on.
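For the unfamiliar, a triple store is less exotic than it sounds. A toy sketch using the Python rdflib library, with a vocabulary invented purely for illustration:

    from rdflib import Graph, Literal, Namespace, RDF

    # An invented namespace, not a real ontology -- illustration only.
    EX = Namespace("http://example.org/retail#")

    g = Graph()
    g.add((EX.alice, RDF.type, EX.Customer))
    g.add((EX.alice, EX.bought, EX.milk))
    g.add((EX.milk, EX.category, Literal("dairy")))

    # SPARQL over the triples: which product categories has each
    # customer bought from?
    q = """
        PREFIX ex: <http://example.org/retail#>
        SELECT ?customer ?cat
        WHERE { ?customer ex:bought ?item . ?item ex:category ?cat . }
    """
    for customer, cat in g.query(q):
        print(customer, cat)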

For many, this then rears the spectre of data governance and huge data-management processes that slow the business down: you're not being agile then, are you? But Big Process is often the result of not understanding, or not wanting to understand, how to do things. If you have a number of apps and services, wouldn't it be better to harmonise the information collection and ensure that the schemata and underlying semantics are consistent across all your collected data? Spending less time on data quality and consistency, because quality and consistency are already inherent in your data, is surely a much better use of your resources.
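As a sketch of what harmonised collection might look like (the field names here are invented for illustration): every app emits the same minimal record with the same semantics, so consistency is designed in rather than patched up downstream.

    import json
    from dataclasses import dataclass, asdict
    from datetime import datetime, timezone

    # An illustrative shared event schema. The fields are invented, but the
    # point is that every app and service emits the same structure with the
    # same semantics, so analyses never have to reconcile divergent logs.
    @dataclass(frozen=True)
    class Event:
        app: str          # which app or service produced the event
        user_id: str      # one identifier scheme across all apps
        action: str       # drawn from a single agreed vocabulary
        occurred_at: str  # always UTC, always ISO 8601

    def record(app, user_id, action):
        evt = Event(app, user_id, action,
                    datetime.now(timezone.utc).isoformat())
        return json.dumps(asdict(evt))

    print(record("loyalty-card", "cust-42", "purchase"))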

So, if you're having arguments over which NoSQL database to use, over whether triple stores and graph databases have the performance you require, or over whether the Semantic Web is just an academic construct, then you're certainly not doing Big Data, and the knowledge you gain from the data you are capturing will only support your business in the most ephemeral of ways.
