Sunday, 20 October 2013

Data Aspects and Rules

In the previous post we introduced how to annotate data flows in order to understand better what data was being transported and how. In this post I will introduce further classifications expressed as aspects.

We already have transport and information class as a start; the further classifications we will introduce are:
  • Purpose
  • Usage
  • Provenance
Purpose is relatively straightforward, and consists of two classes: Primary and Secondary. These are defined in this previous posting.

Usage is remarkably hard to define and the categories tend to be quite context specific, though patterns do emerge. The base set of categories I tend to use are:
  • system provisioning - the data is being used to facilitate the running and management of the system providing the service, e.g. logging, system administration etc.
  • service provisioning - the data is being used to facilitate the service itself; that is, the data is necessary for the basic functionality of that service (primary data).
  • advertising - the data is being used for advertising (targeted or otherwise), by the service provider or a third party
  • marketing - the data is being used for direct marketing back to the source of the data
  • profiling - the data is being used to construct a profile of the user/consumer/customer. It might be useful in some cases to denote a subtype of this - CRM - to explicitly differentiate between "marketing" and "internal business" profiling.
Some of the above often occur together; for example, data for service provisioning is often also used for advertising and/or marketing.

Provenance denotes the source of the information and can typically be read from the data-flow model itself. A proposed standard for provenance exists, as defined by the W3C Provenance Working Group. It is, however, useful for completeness to denote whether data has been collected from the consumer, generated through analytics over a set of data, taken from a library source, etc.
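To make the aspects concrete, here is a minimal sketch of one annotated flow as a record. The class and field names (`Provenance`, `FlowAnnotation`) and the example values are illustrative assumptions of mine, not a fixed schema, and the provenance categories are just the three mentioned above.

```python
from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    # The three sources mentioned in the text; a real model may need more.
    CONSUMER = "collected from the consumer"
    ANALYTICS = "generated through analytics"
    LIBRARY = "library source"


@dataclass
class FlowAnnotation:
    """Aspects attached to a single data flow (hypothetical structure)."""
    transport: str
    info_classes: list[str]
    purposes: set[str]
    usages: set[str]
    provenance: Provenance


# Illustrative values only: a flow carrying a picture with location and time.
flow = FlowAnnotation(
    transport="https",
    info_classes=["Picture", "Location", "Time"],
    purposes={"Primary"},
    usages={"service provisioning", "advertising"},
    provenance=Provenance.CONSUMER,
)
```

Keeping provenance as a single value per flow is a simplification; a flow that mixes collected and derived data would need it per information class.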

We could enhance our earlier model thus:

As you can see, this starts to become quite cumbersome, and the granularity is quite coarse. Even so, from the above we can already start to see some privacy issues arise.

The above granularity is, however, perfectly fine for a first model, but to continue we need to refine the model somewhat to better explain what is really happening. We can construct rules of the form:
  • "Info Class" for "Purpose" purpose used for "Usage"
For example, taken from the above model:
  • Picture for Primary purpose used for Service Provisioning
  • Location for Primary purpose used for Service Provisioning
  • Time for Primary purpose used for Service Provisioning
  • Device Address for Secondary purpose used for System Provisioning
  • Location for Primary purpose used for Advertising
  • Location for Primary purpose used for Profiling
  • ...
and so on until we have exhausted all the combinations we have, wish for or require in our system. Note that some data comes from knowledge of our transport mechanism, in this case a device address (probably IP) from the use of HTTP/S.
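The rule form above maps naturally onto a small data structure. The following is a sketch under my own naming assumptions (`Purpose`, `Usage`, `Rule`, `usages_of` are hypothetical), using the example rules listed above:

```python
from dataclasses import dataclass
from enum import Enum


class Purpose(Enum):
    PRIMARY = "Primary"
    SECONDARY = "Secondary"


class Usage(Enum):
    SYSTEM_PROVISIONING = "System Provisioning"
    SERVICE_PROVISIONING = "Service Provisioning"
    ADVERTISING = "Advertising"
    MARKETING = "Marketing"
    PROFILING = "Profiling"


@dataclass(frozen=True)
class Rule:
    info_class: str
    purpose: Purpose
    usage: Usage

    def __str__(self) -> str:
        # Renders the rule in the "<Info Class> for <Purpose> purpose used for <Usage>" form.
        return f"{self.info_class} for {self.purpose.value} purpose used for {self.usage.value}"


# The example rules from the list above.
RULES = [
    Rule("Picture", Purpose.PRIMARY, Usage.SERVICE_PROVISIONING),
    Rule("Location", Purpose.PRIMARY, Usage.SERVICE_PROVISIONING),
    Rule("Time", Purpose.PRIMARY, Usage.SERVICE_PROVISIONING),
    Rule("Device Address", Purpose.SECONDARY, Usage.SYSTEM_PROVISIONING),
    Rule("Location", Purpose.PRIMARY, Usage.ADVERTISING),
    Rule("Location", Purpose.PRIMARY, Usage.PROFILING),
]


def usages_of(info_class: str, rules: list[Rule]) -> set[Usage]:
    """Collect every usage that a given information class is put to."""
    return {rule.usage for rule in rules if rule.info_class == info_class}
```

Calling `usages_of("Location", RULES)` gathers all three usages of location in one place, which is the kind of question the rules are meant to make easy to ask.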

These rules now give us a fine-grained understanding of which data is being used for what. In the above case, the flow to a social media provider, we might wish to query whether there are issues arising from the supply of location, especially as we might surmise that it is being used for profiling and advertising.

For each rule identified we are required to ask whether the source of that data in that particular data flow agrees to and understands where the flow goes, what data is transported and for what purposes; and then finally whether this is "correct" in terms of what we are ultimately promising to the consumer and according to law.
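The first part of that question, whether the source has agreed to each rule, can be sketched as a simple set comparison. Everything here is an assumption for illustration: the `Rule` tuple, the `unagreed_rules` helper, and in particular the contents of `agreed`, which in practice would come from whatever consent or policy record the system holds. The check says nothing about whether a rule is correct in law.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    info_class: str
    purpose: str
    usage: str


def unagreed_rules(flow_rules: list[Rule], agreed: set[Rule]) -> list[Rule]:
    """Return the rules in a flow that the data source has not agreed to, in order."""
    return [rule for rule in flow_rules if rule not in agreed]


# Rules identified for the flow (taken from the examples in this post).
flow_rules = [
    Rule("Picture", "Primary", "Service Provisioning"),
    Rule("Location", "Primary", "Service Provisioning"),
    Rule("Location", "Primary", "Advertising"),
    Rule("Location", "Primary", "Profiling"),
]

# Hypothetical: what the consumer has understood and agreed to.
agreed = {
    Rule("Picture", "Primary", "Service Provisioning"),
    Rule("Location", "Primary", "Service Provisioning"),
}

flagged = unagreed_rules(flow_rules, agreed)
for rule in flagged:
    print(f"Needs review: {rule.info_class} used for {rule.usage}")
```

With these illustrative inputs, the two location rules for advertising and profiling are flagged for review.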

In later articles we will explore this analysis more formally and start also investigating security requirements, country requirements and higher level policy requirements such as safe harbour, PCI, SOX etc.
