Wednesday 30 May 2012

Semantic Isolation (Pt.1)

Together with Ora Lassila I have been discussing and refining the definition of the term "data silo" in order to clarify our ideas of semantic isolation as applied to databases, data assets, and the interoperability and integration of the information contained within.

The term "silo" when applied to databases, data and information is interesting in that it occurs in a number of statements, such as, "my app data is siloed" or "we need to break the data silos" and so on.

The meaning of the term, however, varies with usage and is thus inconsistent and often misleading; quite simply, it is used to cover a large number of overlapping scenarios.

Understanding the scope and meaning of this term in its various contexts is central to understanding the interoperability problem in a practical sense. In a workshop today I heard the term used in a large number of ways and also applied to the notion of interoperability. The term "silo" has been used to mean (at least!):
  • The data is siloed because it exists in its own database infrastructure
  • The data is siloed because it is only accessible via some access control
  • The data is siloed because it is in its own representation format
  • The data is siloed because it is not understandable/translatable (semantics)
We can present these as some kind of "lock-in" or "siloing continuum", where the usages on the left relate more to physical aspects and those on the right to semantics in the information sense:

[Figure: the siloing continuum, ordered from physical aspects on the left to semantic aspects on the right]

We obviously can create a more granular continuum (indeed that's what a continuum should allow) but the point here is at least to present some kind of ordering over the differing uses of the term. The ordering runs from physical deployment and implementation through to abstract semantics.

Now it seems that when people talk about "breaking the [data] silos" they are actually referring to enabling interoperability of the data between differing services; and often this is addressed at the physical database or access control level. Occasionally the discussion gets mixed up and the syntax and representation of data are addressed.

Interoperability of information starts at the semantic level and works in reverse (right to left) through the above continuum; physical, logical, access control and syntax should not prevent sharing and common understanding of data. For example, if one tackles interoperability of information by standardising on syntax or representation (eg: JSON vs XML) then the result will be two sets of data that can't be merged because they don't have the same meaning; similarly at the other end of the continuum, centralising databases (physically or logically) doesn't result in interoperability - maybe easier system management, but never interoperability of information.
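To make this concrete, here is a tiny sketch (in Python; the services, field names and values are invented for illustration) of two services that agree on syntax but not on meaning:

    import json

    # Two services standardised on the same syntax (JSON), yet the data
    # remains semantically siloed: "temp" is degrees Celsius for service A
    # and Fahrenheit for service B (an invented example).
    a = json.loads('{"temp": 21}')   # service A: Celsius
    b = json.loads('{"temp": 70}')   # service B: Fahrenheit

    # The merge "works" syntactically, but the result is meaningless
    # because the values were never translated to a common semantics:
    print((a["temp"] + b["temp"]) / 2)   # 45.5 ... degrees of what?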

I had an extremely interesting discussion about financial systems: interoperability between these is extremely high, even at the application (local usage) level, simply because the underlying semantics of any financial system are unified. The notions of profit, loss, debit and credit, the translations between the meanings of things such as dollars, yen, euros and pounds, and the mathematics of financial values are formally defined and unambiguously understood; even if the mechanics of financial and economic systems aren't, but that's a different aspect altogether.

An important point here is that the link between financial concepts and real-world concepts and objects is relatively easily definable. Indeed, probably all real-world concepts and objects can have their semantics defined in terms of financial transactions and concepts. Thus siloing of data in the financial world can probably only occur at the access control level.

The requirement for breaking the silos is easily understood: the ability to cross-reference two different data-sets and be sure (within certain bounds) that the meaning of the information contained therein is compatible. We want to perform things such as "1 + one equals 2" and be sure that the concept of "one" is the same as "1", and that the definition of "+" applies equally to things such as "1", "2" etc as well as "one", "two" etc. In this case the common semantics of "1" and "one" have, fortunately, already been defined.
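A minimal sketch of such a cross-reference, assuming both data-sets have first been mapped onto a common semantics (the mapping table below is of course invented; producing it is the hard part):

    # Map surface tokens from two data-sets onto shared concepts.
    COMMON_SEMANTICS = {
        "1": 1, "2": 2,        # data-set A: numerals
        "one": 1, "two": 2,    # data-set B: number words
    }

    def plus(x, y):
        # "+" is defined over the shared concepts, not the surface tokens.
        return COMMON_SEMANTICS[x] + COMMON_SEMANTICS[y]

    assert plus("1", "one") == 2   # "1 + one equals 2" now holds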

It is vitally important to understand that if we can unify data sets through translations via common semantics then the siloing of data breaks and we get data liberation, or what some call data democratisation. Unification of semantics however is fraught with difficulties [1], but it is the key prerequisite to integration and interoperability and ultimately a more expansive usage of that information.


References:

[1] Ian Oliver, Ora Lassila (2011). Integration "In The Large". Position paper accepted at the W3C Workshop on Data and Services Integration, October 20-21 2011, Bedford, MA, USA.


Sunday 27 May 2012

Writing

A friend of mine - David Cord - has written on his blog about writing, or more importantly, how to write and keep writing. Great advice from a famous author. I'll highlight a few paragraphs of his wisdom on the matter:

I write during the day, during working hours, and the best stuff always comes in the morning. I write almost every day, even weekends and holidays. Before I start, I have to prepare myself. I check my email, check the stock market, read the news, and get everything out of the way that I know could draw my attention before I start.

Next, I remove all possible distractions. The internet gets turned off, and the phone gets put on silent and placed in another room. It takes a long time for me to get properly focused, but one little distraction immediately throws me off track. One text message could destroy fifteen or thirty minutes of writing time. Sometimes, if I feel weak, I will even close the curtains so I’m not tempted to look outside.

I was thinking about this very thing recently and decided that the work office is a bad place to write, open offices even more so (actually, does any work get done in an open office environment?)...the best place? Somewhere with some noise, comfort and coffee/tea or a chai latte (yep, Starbucks). It seems I'm just like Ernest Hemingway in this respect, if not in the writing - I just need to find that perfect cafe.

David doesn't talk about tools - meaning software tools - for writing (and beating writer's block). The problem with things such as Microsoft Word is that they distract you with things such as formatting the text. I find simple tools without complications such as spelling and grammar checkers work best: vi (or for some, emacs). Using a simple text editor you can just focus on the text itself, though with the problem that you need to leave the editing environment to deal with graphics; here I tend just to leave a note in the text and draw the diagram or mathematics using pen (ink pen) in my engineering notebook.

Saturday 26 May 2012

No more interest in The Cloud?

Forbes has an article on the level of interest in cloud computing, stating that the concept has moved, according to the Gartner Hype Curve, into the so-called 'trough of disillusionment'.

"Lately I’ve been hearing some rumblings during my various discussions around cloud computing. Some in the industry have been quietly saying the end is near for the much over hyped term.  I wouldn’t go as far as to say the cloud is dead just yet, but there is a growing sense that  interest in cloud computing, at least from the point of view of a buzz word, has peaked." 
Reuven Cohen, Contributor, Forbes, 24 May 2012

Something that has worried me a little about how "cloud" has been approached is that it appears in most cases to be overwhelmingly oriented towards either data storage or a mechanism for off-loading or virtualising IT infrastructure - in the latter case, not running your own computers but still just running your computers, even though they are virtual machines.

At least from the desktop or home user's perspective, cloud is little more than data storage, eg: Flickr, Youtube, and at best a simple sharing mechanism.

My personal feeling here is that we're really missing the killer application (not the killer 'app') that would truly start to liberate computing from being locked to a single computing element (eg: my laptop), or some super client-server thing, into a totally virtualised computing entity with processes and data (or even information!) being truly mobile according to computational need.

One could argue that Google is currently providing something close to this, but I still think that the granularity needs to decrease and the degree of information integration to increase to approach what I'm looking for.

Maybe there isn't a killer application...perhaps, for example, the Semantic Web could become the killer platform on top of the cloud? In this sense cloud becomes the killer enabler.

Either way, 'cloud' was always going to be a disappointment; or am I just being cynical under a barrage of buzzwords?


Thursday 24 May 2012

Apps considered harmful

OK, first of all this is joint work with my great colleague Ora Lassila. I've just come off the phone with him discussing our World Domination Plan(TM), or how to unleash the power of the Semantic Web and revolutionize computing...anyway, Ora is in Helsinki next month giving a keynote speech at a conference on museums and we were discussing the title based on various thoughts of ours regarding "apps" and previous work of ours on cloud-based, semantic information systems, eg: sTuples [1] and Sedvice/M3 [2] (also see here).

One of the things we vehemently agreed upon was that "apps must die", for the reason that "apps" are a new way of isolating data (both physically and semantically!) and preventing interactivity; whereas Sedvice/M3 was specifically designed to break down semantic isolation and ultimately reduce "apps" to nothing more than domain-specific user-interface manifestations of information - and ideally, in some cases, novel user-interaction mechanisms.

After tonight's discussion, I made a quick search on the "apps considered harmful" meme and found an article "Apps Considered Harmful: Part 1" at Elderlore's blog. I really like this quote:

"So,what can we do? In part two of this article,I will talk about moving beyond the fragmented experiences of App World into something more interesting,where data and programs are explicitly represented as objects that you are empowered and invited to interconnect. Consider something like GNU Emacs:the separate “applications”in the Emacs Lisp space are delivered as bundles of functions and variables and hooks,not as single-screen apps that cannot interact or share data. You get things done in Emacs by customizing variables and hooking things together;potentially,any function in any Elisp application can call any other function or access any other data in the system,regardless of which package implements the functionality,and most of the coolness of emacs is in being able to coordinate things this way when solving problems,and to save these coordinations as resuable elisp code."

and also

"I want to smash the App World into pieces;that is to break apps into smaller,reusable pieces or “blocks”which can be visually clicked together like MIT Scratch,but on an industrial scale. Apple already banned this type of software from the iPad;perhaps they know something they aren’t telling us?"

So do we! So do we! Even the EU has started talking about data liberation!

Much of the reasoning behind promoting the "Apple style of app" (remember Nokia's WidSets, made years earlier - they were actually coming quite close to our ideal!) is based around locking in the developer and especially the user. Indeed pretty much every piece of functionality in most "apps" found on all smartphone operating systems is for the consumption of data and the recording, in specific detail, of the user's behaviour associated with that particular consumption interaction. Ironically all that functionality is present in the generic browser (eg: Opera, Firefox etc), usually with much more control over your individual privacy too.

Privacy however is a side issue here for the moment; our focus is actually on what an "app" is in the information consumption context above. Indeed, as we demonstrated in M3, it is little more than a query over an information space - in the case of M3 and the RDF/OWL graph view of information, a navigation between chunks of information. Indeed this is the basis of Facebook's social graph.

The power here doesn't just come from the ability to easily navigate over information but also from the ease with which reasoning is built on top; from simple subclassing inferences to complex, context-specific, non-monotonic rules.

Through this reasoning and suitable ontological structures it becomes almost trivial to add functionality into the graph, eg: new messaging types, and apps designed to query for the super-types of those automatically pick up the new objects. This is hard to do in an "app", which is more like the traditional applications of old: fixed and unexpandable - who thought of Twitter and Facebook contacts when writing contact book apps all those years ago?
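As a small sketch of this idea (in Python using rdflib, with an invented example namespace; this is not how M3 itself was implemented), the "app" below is nothing more than a fixed query for the super-type Message, and a messaging type added later is picked up without any change to the query:

    from rdflib import Graph, Namespace, RDF, RDFS

    EX = Namespace("http://example.org/")
    g = Graph()

    # Original ontology: Email is a kind of Message.
    g.add((EX.Email, RDFS.subClassOf, EX.Message))
    g.add((EX.msg1, RDF.type, EX.Email))

    # Later, someone adds a new messaging type to the graph...
    g.add((EX.Tweet, RDFS.subClassOf, EX.Message))
    g.add((EX.msg2, RDF.type, EX.Tweet))

    # ...and the same "app" query finds both, via the class hierarchy.
    q = "SELECT ?m WHERE { ?m rdf:type/rdfs:subClassOf* ex:Message }"
    for row in g.query(q, initNs={"ex": EX, "rdf": RDF, "rdfs": RDFS}):
        print(row.m)   # prints msg1 and msg2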

The point we're heading towards is that what "apps" should have been is just queries over information, with reasoning at the information store level; that is, pushing the logic that gives semantics, interoperability and expandability down into the lower levels of the information stack.

One can argue that "apps" in their current manifestations are optimised for the devices they run upon, eg: an app for iOS delivers the content in the form best for that device. However we would go further (and we made an amazing demo with M3 of this back in 2009), where we encoded how the information should be displayed in various manifestations, eg: list, tile etc - there was an ontology for this, and then let the device pick the best rendering for that informtaion based upon its local context. What made this interesting was the reasoning mechanism searched the information itself, the contained or linked chunks of information as well as searching the type hierarchy for display hints.

We did the same with actions too...and we even started on serialising computations and passing these around...true cloud computing...quite a way from the humble, locked-in current definition of "app"...


References

[1] Deepali Khushraj, Ora Lassila, Tim Finin (2004). sTuples: Semantic Tuple Spaces. In Proceedings of the First Annual International Conference on Mobile and Ubiquitous Systems (MobiQuitous 2004), IEEE, pp. 268-277.
[2] Ian Oliver, Jukka Honkola (2008). Personal Semantic Web Through a Space-Based Computing Environment. In Middleware for the Semantic Web, Second IEEE International Conference on Semantic Computing.

Tuesday 22 May 2012

Fountain pens

Good to see a move back to real writing with real instruments of writing. There really is nothing better than putting pen to paper, and a real ink pen at that.

Why are fountain pen sales rising?
22 May 2012 Last updated at 17:08 GMT
By Steven Brocklehurst

You might expect that email and the ballpoint pen had killed the fountain pen. But sales are rising, so is the fountain pen a curious example of an old-fashioned object surviving the winds of change?

...

But for others, a fat Montblanc or a silver-plated Parker is a treasured item. Prominently displayed, they are associated with long, sinuous lines of cursive script.

Sales figures are on the up. Parker, which has manufactured fountain pens since 1888, claims a worldwide "resurgence" in the past five years, and rival Lamy says turnover increased by more than 5% in 2011.

...continued...

I've used a fountain pen exclusively in my engineering notebook for years - nothing focuses the mind better than the permanence of ink in such a volume.

Saturday 19 May 2012

Odd topology...

Saw this on a wall in Brighton, got me wondering...

[Photo: seen on a wall in Brighton]

...what topological space are they using or did they divide by zero somewhere?

Tuesday 15 May 2012

Dimensional Analysis of Information


I’ve been looking at dimensional analysis as a technique to use for analyzing information flows, specifically for privacy.

After developing various taxonomies for information classification and some of the superstructure behind these (see here), one of the problems we have seen is trying to evaluate information content from ontologies and data schemata, and deciding whether the field "name" in ontology X has the same (or similar) semantics as a similarly named field in ontology Y.

Inspired by the technique of dimensional analysis, one idea is to consider each ontology as you would a system of units of measurement, eg: imperial units versus metric units. What dimensional analysis does is abstract away from the units of measurement into a small set of base or fundamental aspects. Typically these are length, mass and time, denoted [L], [M] and [T].

For example, acceleration has dimensions [L][T]⁻², which using a system of measurements might be expressed as metres per second per second, or furlongs per day per aeon (just to mix things up).

When working with information systems and especially in the case of privacy where we need to classify information we can construct a set of “dimensions”. Choice of these is somewhat arbitrary – or at least they should have some aspect of orthogonality (I said this was inspired by dimensional analysis!).

The dimensions I chose were: Personal, Financial, Health, Time, Location, Identity and Content

Aside: to save on space and hint at the dimensional analysis inspiration, I'll use the first letter of the dimension name inside square brackets [ and ]...

We can have huge debates (and we did) about whether these are truly orthogonal and what happens when data elements or types are mapped to more than one 'dimension' - I don't think it matters too much at the moment, so let's put some of those difficulties aside.

Actually each of these is a top-level class in a taxonomy of information classification. For example, the dimension [P] breaks down into Demographics ([P_D]) and Contact ([P_C]); other classes follow similarly – as shown in the diagram below:

[Diagram: Information Type Taxonomy]
Given a data schema we can map the schema into its dimensions in much the same way as is done with physical quantities; for example the schema:

UserID x DeviceID x CellID x Timestamp x Age

would be mapped to [I]³[T][P], meaning three units of identifiers, one of time and one of personal information. Actually, as I stated earlier, we have a hierarchy of dimensions, so we might break this down further to [I_P][I]²[T][P_D], where [I_P] is a personal identifier and [P_D] is demographics, each being a sub-classification of [I] and [P] respectively.
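A small sketch of this mapping in Python (the field-to-dimension table is an illustrative assumption), treating a signature as a multiset over the dimensions:

    from collections import Counter

    # Assign each schema field a dimension from the taxonomy
    # (this table is invented for illustration).
    DIMENSIONS = {
        "UserID":    "I_P",   # personal identifier, a subclass of [I]
        "DeviceID":  "I",
        "CellID":    "I",
        "Timestamp": "T",
        "Age":       "P_D",   # demographics, a subclass of [P]
    }

    def signature(schema):
        # The dimensional signature of a schema is a multiset of dimensions.
        return Counter(DIMENSIONS[field] for field in schema)

    print(signature(["UserID", "DeviceID", "CellID", "Timestamp", "Age"]))
    # Counter({'I': 2, 'I_P': 1, 'T': 1, 'P_D': 1})  ie: [I_P][I]²[T][P_D]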

The kinds of analysis that can be made include the quick identification of critical information content issues. In the above case we have a mix of identities which allows for potentially very precise identification, we have time involved which might allow profiling or tracking, and we have an element of personal (demographic) information.

Furthermore we even have the chance that one identifier can be mapped to a location: CellIDs can easily be transformed into GPS coordinates and over time fairly easily triangulated, especially in an area densely populated by mobile base stations. Actually the above example could equally be mapped as [I]²[L][T][P], and indeed a given data-schema being expressible in more than one dimensional form does raise some interesting concerns.

If we have some functions that process data, say a function that anonymises identities (we can have the discussion about what anonymisation means later – please don't mention hashing functions!), then applying it might result in our original dimensions [I]³[T][P] being mapped, via that anonymisation function, to [I]²[T][P] – an improvement in terms of moving towards anonymity, maybe.
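Continuing the sketch above, such a function can be modelled as a transformation on signatures (purely illustrative - real anonymisation is of course much harder than deleting one unit of identity):

    from collections import Counter

    def anonymise_identity(sig):
        # Model anonymisation as removing one unit of identity,
        # preferring the most specific identifier class first.
        out = sig.copy()
        for dim in ("I_P", "I"):
            if out[dim] > 0:
                out[dim] -= 1
                break
        return +out   # unary '+' drops dimensions with a zero count

    sig = Counter({"I": 3, "T": 1, "P": 1})   # [I]³[T][P]
    print(anonymise_identity(sig))
    # Counter({'I': 2, 'T': 1, 'P': 1})       # ie: [I]²[T][P]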

And so on....now whether this really is dimensional analysis is another thing altogether; I doubt it, at least in the current form, and certainly I've made no major effort towards properties of dimensional analysis such as commensurability or other mathematical properties. I'm also wondering if that other favorite of mine - entropy - can be put in here somewhere, as a coefficient to the dimension possibly? I think that might be taking things too far and would ultimately confuse concepts.

I've had some successes in applying this to data-flow modelling of information flows, and a couple of interesting results when we've discussed things such as legal consents, the application of a consent to data, and the processing of that data. For example, take the humble IP address or the CellID from the above example....the dimension of these is [I] (actually I have a subclass of Identity which deals with machine addresses), however both can be mapped to [L] fairly easily. Expressing in a consent that we don't identify the source of information could mean mapping such things to other 'types' in other dimensions, and could actually end up not preserving privacy, or even accidentally revealing more semantically interesting content...

Monday 14 May 2012

Data-Flows and Measurement of Expectation of Privacy


I've been in a workshop all day about privacy with a mixed audience of legal, marketing and technical people; and it's quite interesting to see that we are starting to converge on the idea that privacy is more about information itself, the flow of information and the usage of that information within the context of those flows, rather than the usual discussion about how to prevent the collection of data.

There is relatively little wrong - given the correct context - with data collection, and indeed in many cases it is inevitable, eg: web server or service access logs. The usage of these logs for system monitoring is the typical scenario, and a necessary function of actually running those infrastructures. The main point here is really aimed at secondary data collection or behavioural data collection scenarios.

So that aside for a moment, we've come to the obvious conclusion that security is a necessary base for privacy, which in turn is a necessary base for trust. We've also discussed the notion of rights and what rights a consumer has over their data, or more correctly, their information.

Which all brings me back to the fact that most of the discussions touch on the need for an understanding of the flow and measure of information. How do we measure, what do we measure, how much information is there, is there too much information, etc?

Putting this in the context of information management, ontologies/taxonomies of information and data-flow, we have the beginnings of a rather elegant framework for understanding the flow of information from this perspective. This sounds close to Nissenbaum's hypothesis on privacy and expectations, which is very nice - it is something I've written on before, and I guess some of the things here are a development of the thoughts there...

For me this means that some ideas I’ve had of information classification, dimensional analysis and measures (metrics even) are starting to coalesce nicely...quite exciting.

In a panel session a discussion was held on the rights and relationships of privacy to the consumer, with emphasis on the expectation of privacy in various scenarios: placing data in the cloud, driving on a public highway, and, in relation to the latter, the case regarding the US government's placement of GPS trackers on people's cars without their knowledge.

We can construct a data-flow model of this:

[Diagram: data-flows Person -> Cloud, Person -> Highway, Highway -> Government]
Over each flow we can measure the amount, sensitivity and type of information - I have no idea what this "number", or even the structure of that "number", might look like, though I do believe that it is certainly measurable, ie: we can take two values and compare them.

A person then assigns or has an expectation of privacy in various situations; if the data-flow exceeds that expectation then there is a privacy issue. So, using some “arbitrary” values for the measures, we might have expectations ‘E’ for each flow:

  • E(Person->Cloud) is 7
  • E(Person->Highway) is 3
  • E(Highway->Government) is 2

The higher the number, the greater the amount of information a user is willing to tolerate being communicated over that data-flow.

Then at some point in time ‘t’, the actual measures ‘M’ of information might be something like:

  • M_t1(Person->Cloud) = 5
  • M_t1(Person->Highway) = 2
  • M_t1(Highway->Government) = 4

If for some data-flow ‘d’, at a point in time ‘t’, M_t(d) > E(d), then we have a problem: the amount of information being transmitted is greater than the expectation of the user.

Aside: yes, I know using integers to denote amount is fairly naïve, but I’m just trying to get a point across more than anything – I think the structure we’d be working with is some horrible multi-dimensional, tensor/spinor monster….
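In the same naive integer spirit, a toy sketch of the comparison, using the illustrative values above:

    # Expectations E and measures M over each data-flow, using the
    # naive integer values from the text (illustrative only).
    E = {
        ("Person", "Cloud"):       7,
        ("Person", "Highway"):     3,
        ("Highway", "Government"): 2,
    }
    M_t1 = {
        ("Person", "Cloud"):       5,
        ("Person", "Highway"):     2,
        ("Highway", "Government"): 4,
    }

    def violations(E, M):
        # A flow is a problem when the measured information exceeds
        # the user's expectation: M_t(d) > E(d).
        return [d for d, m in M.items() if m > E[d]]

    print(violations(E, M_t1))   # [('Highway', 'Government')]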

While the current laws tend to focus on the fact that anything 'in public' is 'public', Solove, Nissenbaum, Acquisti and others have noted that what happens in public is not necessarily public in that sense. As shown in the data-flow above, a person's expectation of privacy towards some cloudified service environment, eg: Google, Nokia etc, is very different to their expectation of privacy when driving their car on public roads. Similarly the information flow between public roads and the government, eg: traffic cameras etc, has certain expectations of privacy attached to it.

What happens when we have information flowing over more than one individual flow - for example, what is the user's expectation of privacy when information about their driving on a public road flows onwards to the government? The case with GPS trackers has shown that there are expectation limits that differ from the expectations within the individual flows, for example:

  • E(Person->Highway->Government) = 1

What this eventually leads to is that as data-flows get longer and involve more participants the expectation of privacy increases, but in reality, beyond one or two steps the visibility of the data-flow to the user diminishes - for example, to where does Google or Facebook send or sell their data? Also, could this composed value be calculated from each of the individual flows? I can imagine that we might even see some kind of power law operating over this too…

Many other questions arise: how do we measure information content, at least in terms of the above channels? What is an information channel? To conclude for the moment, it does appear that we can relatively easily define how these measures might behave over a data-flow; the question that remains - and this is the really interesting question - is how to actually construct the measure itself.

Sunday 13 May 2012

Kusunda

The BBC has a report on the last speaker of the Kusunda language of Nepal. It is always sad to lose not just a language but a way of thinking. Also very interesting to see a link to the languages of the Andaman Islands and the potential link with the Sentinelese people and language.

Nepal's mystery language on the verge of extinction

Gyani Maiya Sen, a 75-year-old woman from western Nepal, can perhaps be forgiven for feeling that the weight of the world rests on her shoulders.

She is the only person still alive in Nepal who fluently speaks the Kusunda language. The unknown origins and mysterious sentence structures of Kusunda have long baffled linguists.

Thursday 3 May 2012

Google Wardriving, Wifi, Privacy and an Engineer at fault?

Information Week has an article about Google collecting Wifi identifiers and snooping on unencrypted traffic over people's networks:
Blame the Street View data collection practices on a "more is more" engineering mindset. And rethink your notions about privacy for unencrypted Wi-Fi data.
Mathew J. Schwartz | May 01, 2012 10:50 AM

During a two-year period, Google captured oodles of Wi-Fi data worldwide as part of its Street View program. But why?
Blame the engineering ethos that's prevalent at high-technology companies like Google. You know the "more is more" mindset: more bells and whistles equals greater goodness.
But an unfiltered engineering mindset would help explain the apparent thinking behind the Street View wardriving program: "Well, if this Wi-Fi data is flying around and no one is encrypting it, what reasonable expectation do they have that it won't be sniffed and stored?"

This whole episode is starting to look like it is just about finding a suitable scapegoat, whereas in reality the failures were multiple and in this instance happened to line up, cf: the Swiss-cheese analogy used in aircraft accident investigations.

I strongly doubt a "rouge engineer" was responsible - it takes many engineers to build and deploy these systems (look at the average size of an R&D team in most larger companies and the process/personnel infrastructure surrounding those). Now, admittedly it is possible that this code was added by a single engineer but do Google really have such lax software engineering practices?

R&D teams are under pressure to collect as much information as possible from applications and services, or, have the ability to collect and collect just in case. Mechanisms to collect WiFi data and snoop on the contents are easily available and part of the usual networking infrastructure tool kits - they need to be otherwise the features you rely upon in your operating systems etc wouldn't work.

According to reports the engineers asked for help from legal, in which case we have two points of failure: the engineers might have been asking the wrong questions, and legal didn't understand or didn't respond. The failure might also have been at the product/programme management level, which might have blocked these requests or overridden them. Or it might just be that engineering and legal acted correctly within the context of working at Google.

The statement in the Information Week article that "an unfiltered engineering mindset would help explain the apparent thinking" is extremely misleading - the failures were multiple and most likely don't just involve engineering at all; the real blame most likely lies elsewhere: product, or programme management?

Also, even if Google have taken the so-called Privacy-by-Design [1] ideas on board, their communication and implementation might be very different from what is intended. As a similar example, Agile Methods come with a similar set of statements of principle, and their implementation in many cases is grossly misunderstood (at best).

While statements of principle like PbD or even the Agile Manifesto [2] are fine, their implementation, both technically and within the culture of the company as a whole, is neglected at best and utterly misunderstood at worst. Herein lie the real problems...and unfortunately therein lie too many hard questions: changing a culture whilst maintaining maturity of method and process is hard.

I was also wondering about the title of the article, "Engineering Trumped Privacy", and thought it both misleading and somewhat offensive (to engineering); however the meaning of the word "trumped" can be taken in a slightly different way, to mean "engineering exposed flaws in the overall implementation and understanding of privacy".

Finally, I do think the statement "never attribute to malice what can be explained by stupidity" applies here. At various levels in Google I don't think there was any intention of doing anything "bad"; but for a company that tells everyone else not to be evil, malice does seem a logical choice, fuelled by whatever conspiracy theory you choose.

Just like the pilot is "always" to blame in an aircraft crash, I wonder how Marius Milner is feeling today...


References

[1] Ann Cavoukian (2009). Privacy by Design: The 7 Foundational Principles. Information and Privacy Commissioner of Ontario.
[2] Kent Beck et al. (2001). Manifesto for Agile Software Development. http://agilemanifesto.org

Helsinki Times interview on Privacy

I was interviewed by Helsinki Times for a short article on privacy; here's the first paragraph and link:

OnLine Privacy
3 May 2012, David Cord
Helsinki Times talks to Ian Oliver at Nokia Location & Commerce about staying safe online.
FOR over two years, Ian Oliver has been the principal architect of Privacy and Policy at Nokia Location & Commerce. His main job is on the technical side of privacy, as well as working with the company’s legal and consumer advocacy teams. He took the time to share some of his personal views about online privacy.
 

Wednesday 2 May 2012

International Journal on Advances in Intelligent Systems

I'm on the editorial board of the IARIA International Journal on Advances in Intelligent Systems.

The journal is dedicated to specific topics related to automation, static and mobile agents, decision systems, special computational paradigms, advances in computer-human interaction, human-oriented modeling, and human-centric service and applications.

Special issues can focus on particular aspects related to autonomic components and systems, advanced correlation algorithms, applications of artificial intelligence, adaptive and interactive interfaces, ubiquitous services, anticipative systems, unmanned systems, robotics, processing of distributed geospatial data, or context-oriented information retrieval and processing.

Editor-in-Chief: Freimut Bodendorf, University of Erlangen-Nuernberg, Germany
ISSN: 1942-2679