There seem to be a few articles around at the moment that discuss privacy in terms of consumers selling their data, rather than having it simply collected by the service provider for free (allegedly, in some cases, as a kind of payment for a presumably free service). In particular, the InfoWorld article The Next Consumerization Revolution by Galen Gruman and the presentation by Helen Nissenbaum a month or so ago (I wrote about this in the blog: Privacy, Dataflow and Nissenbaum) typify a trend towards the idea of a "data-mart" for consumer information.
First, let's look at the main channels of communication between a user, via an application (or 'app'), and a service. I tend to denote eight particular areas where data is either being transmitted or held, as shown in the data-flow diagram below:
1. The data transmitted to use a service. For example, if you are using a contacts book then your contact details (which might be private) will have to flow over this channel.
2. The storage of your information. Using the contacts book example, this is where your contacts are stored.
3. Log data recorded because you used a service, for example Apache logs. This will include identifiers such as machine addresses and browser identification strings, as well as actions performed and resources accessed.
4. Local data, including primary and cached data, cookies etc.
5. So-called behavioural data, which includes data on how you are using an application, device context etc. This is typically referred to as secondary data collection under EU law.
6. Storage of behavioural data.
7. Communication of data to support services such as login/authentication/federation services.
8. Log data recorded in the same manner as point 3 above, but specific to the support service (point 7).
I'm neglecting the protection of the channels themselves. That's a different discussion and more security related, but privacy of the channel itself is an important issue, as was demonstrated by the BT-Phorm incident.
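The eight points above can be sketched as a small data model, which makes it easier to ask which elements hold data at rest. This is only an illustrative sketch: the names `FlowElement`, `Kind` and `held_by_service`, and the classification of each point as a channel or a store, are my assumptions and not taken from the diagram.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    CHANNEL = "channel"  # data being transmitted
    STORE = "store"      # data being held


@dataclass(frozen=True)
class FlowElement:
    number: int             # point number in the list above
    kind: Kind              # assumed classification, not from the source
    held_by_service: bool   # assumption: held by the main service provider
    description: str


# Hypothetical encoding of the eight points.
DATA_FLOW = [
    FlowElement(1, Kind.CHANNEL, False, "primary data sent to use the service"),
    FlowElement(2, Kind.STORE, True, "storage of primary data"),
    FlowElement(3, Kind.STORE, True, "service log data"),
    FlowElement(4, Kind.STORE, False, "local data on the device (cache, cookies)"),
    FlowElement(5, Kind.CHANNEL, False, "behavioural (secondary) data collection"),
    FlowElement(6, Kind.STORE, True, "storage of behavioural data"),
    FlowElement(7, Kind.CHANNEL, False, "communication with support services"),
    FlowElement(8, Kind.STORE, False, "support service log data"),
]

# Stores held by the service provider itself.
provider_stores = [
    e.number for e in DATA_FLOW
    if e.kind is Kind.STORE and e.held_by_service
]
print(provider_stores)
```

Under these assumptions the provider-side stores come out as points 2, 3 and 6; local data (point 4) stays on the device and the support service's logs (point 8) sit with a different party.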
Now, while many talk in terms of a simple data-mart, the reality is more complicated: even assuming that we have a trusted data market provider, we still have to answer the question of which data we are actually trying to protect.
In the above data-flow there are really only three possible candidates, as denoted by the data stores 2, 3 and 6:
- Log files: The kinds of data stored here really only have value in terms of finding out how the service is being used from a system administration perspective. The data can be anonymised, though depending upon how this is done, profiling may still be possible given enough information.
- Primary data: This, along with behavioural data, is the key marketable data asset. In the case of a contacts book, a user's social network or acquaintances can be constructed, along with a multitude of other personal data. Much of this is often given up in order to get a better service; for example, LinkedIn requests permission to read your Gmail contacts in order to construct your set of professional contacts. In effect a kind of data-mart already exists here; however, you are selling additional information, in that the people in your email contacts are probably in closer contact with you than those found from some other sources. To a point, this data is not so valuable to the service provider as it is to the advertisers and marketers who use it to construct a profile of you.
- Behavioural data: This is probably the most valuable for the actual service provider, in that it gives the basis for profiling usage of a service and for segmentation from there. Depending upon what is collected here, much of this may already be the same as that collected via logging (into storage 3). What makes this data asset valuable is that it might contain contextual information such as the status of a device at the time of collection.
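To illustrate the point about log files, here is a minimal sketch of why simple anonymisation does not necessarily prevent profiling. It assumes a naive scheme in which each machine address is replaced by a salted hash; the log entries, the salt and the `pseudonymise` helper are all hypothetical, and real log anonymisation would need considerably more care.

```python
import hashlib
from collections import defaultdict

SALT = b"example-salt"  # hypothetical fixed salt, for illustration only


def pseudonymise(address: str) -> str:
    """Replace a machine address with a short salted hash."""
    digest = hashlib.sha256(SALT + address.encode("utf-8")).hexdigest()
    return digest[:12]


# Made-up log entries: (machine address, resource accessed).
log = [
    ("192.0.2.10", "/contacts/list"),
    ("192.0.2.10", "/contacts/add"),
    ("198.51.100.7", "/contacts/list"),
    ("192.0.2.10", "/settings/privacy"),
]

# The "anonymised" log no longer contains any raw address...
anonymised = [(pseudonymise(address), path) for address, path in log]

# ...but the same pseudonym recurs, so usage can still be grouped per user.
profiles = defaultdict(list)
for pseudonym, path in anonymised:
    profiles[pseudonym].append(path)

for pseudonym, paths in profiles.items():
    print(pseudonym, paths)
```

The addresses are gone, yet each pseudonym still carries a complete trail of actions. Given enough entries, or a second data set to join against, that trail may be enough to build a profile.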
The truth of the current focus of business, and of the ways in which consumers want to work, is that data collection is unavoidable. Given the above points, it becomes clear that preventing data collection is not a realistic option, and that current mechanisms for selective and coarse-grained data collection are not delivering on the privacy ideal.
A move towards the "data-mart" is probably going to be the best solution, though it does complicate consumers' interactions with the services they require. Furthermore, while the business model makes sense (though not necessarily financially at this time), the infrastructure, in terms of the physical provision of such data-mart services and the necessary developments in anonymisation and identification of data, is lacking.
Another aspect that worries me is that while this protects the consumer and directly monetises their information, it still does not sufficiently prevent further usage after release, something that is a problem now given the interlinking between current information holders. A data-mart approach would only deal with the first level in any case, and would lead to some interesting issues when the expectation of privacy is compromised beyond the data-mart level.
Finally, what form does this "data-mart" take? Is it a centralised information proxy over a user's data stores and flows, or is it a case-by-case contract with service providers? Are there going to be gradations in what is monetised: more data = more service?