Sunday, 13 October 2013

Classifying Information and Data Flows

In the previous articles on data flow patterns and basic analysis of a data flow model we introduced a number of classifications and annotations to our model. Here we will explain two of these briefly:
  1. Data Flow Annotations
  2. Information Classification
Let's examine this particular data flow from our earlier example:

The first thing to notice is the data-flow annotation in angled brackets (mimicking UML's stereotype notation) denoting the protocol or implementation used. It is fairly easy to come up with a comprehensive list of these; for example, a useful minimal set might be:
  • internal - meaning some API call over a system bus of some kind
  • http - using the HTTP protocol, eg: a REST call or similar 
  • https - using the HTTPS protocol
  • email - using email
and if necessary these can be combined to denote multiple protocols or possible future design decisions. Here I've written http/s as a shorthand.
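As a minimal sketch, such annotations can be captured as a small enumeration against which we can later check security properties. The Python names below are illustrative only and not part of the model itself:

```python
from enum import Enum

class Transport(Enum):
    """Minimal set of data-flow annotations (names are illustrative)."""
    INTERNAL = "internal"   # some API call over a system bus
    HTTP = "http"           # the HTTP protocol, eg: a REST call
    HTTPS = "https"         # the HTTPS protocol
    EMAIL = "email"         # email transport

    @property
    def encrypted(self) -> bool:
        # In this minimal set only HTTPS guarantees an encrypted channel
        return self is Transport.HTTPS

# The 'http/s' shorthand: a flow annotated with more than one possible protocol
photo_upload = {Transport.HTTP, Transport.HTTPS}
insecure_options = [t for t in photo_upload if not t.encrypted]
```

Keeping the annotation machine-readable like this means the later requirement checks can be automated rather than done by inspection.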

Knowing this sets the bounds on the security of the connection, what logging might be taking place at the receiving end and also what kinds of data might be provided by the infrastructure, eg: IP addresses.

* * *

The second classification system we use is to denote what kinds of information are being carried over each data-flow. Again a simple classification structure can be constructed, for example, a minimal set might be:
  • Personal - information such as home addresses, names, email, demographic data
  • Identifier - user identifiers, device identifiers, app IDs, session identifiers, IP or MAC addresses
  • Time - time points
  • Location - location information of any granularity, typically lat, long as supplied by GPS
  • Content - 'opaque' data such as text, pictures etc
Other classes such as Financial and Health might also be relevant in some systems.

Each of the above should be subclassed as necessary to represent specific kinds of data, for example, we have used the class Picture. The Personal and Identifier categories are quite rich in this respect.
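The subclassing above maps naturally onto an ordinary class hierarchy. The following is a sketch under the classification just described; the specific subclass names (Picture, HashedIdentifier) are examples of the kind of refinement we mean:

```python
class Information: pass

class Personal(Information): pass     # home addresses, names, email, demographics
class Identifier(Information): pass   # user, device, app, session ids; IP/MAC
class Time(Information): pass         # time points
class Location(Information): pass     # lat/long at any granularity
class Content(Information): pass      # 'opaque' data such as text, pictures

# Subclasses representing specific kinds of data
class Picture(Content): pass
class HashedIdentifier(Identifier): pass  # an 'anonymous' id is still an Identifier
```

Because HashedIdentifier sits under Identifier, any rule written against the Identifier class automatically covers hashed or 'anonymous' identifiers, which is exactly the edge-case argument the high-level categories are meant to avoid.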

Using high-level categories such as these affords us simplicity and avoids arguments about certain kinds of edge cases as might be seen with some kinds of identifiers. For example, using a hashed or so-called 'anonymous' identifier is still something within the Identifier class, just as much as an IMEI or IP address is. 

Note that we do not explicitly define what PII (personally identifiable information) is, but leave this as something to be inferred from the combination of information being carried both over and by the data flow in question.
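Such an inference can be expressed as a predicate over the set of classes a flow carries. The rule below is a hypothetical example of this style of inference, not a definition of PII:

```python
def may_carry_pii(carried: set) -> bool:
    """Infer possible PII from the combination of information classes
    carried over a flow. Hypothetical rule: an Identifier combined
    with Personal or Location data may single out a unique person."""
    return "Identifier" in carried and bool(carried & {"Personal", "Location"})
```

Note that neither class triggers the rule on its own; it is the combination carried over one flow that does.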

* * *

Now that we have recorded both the information content and the transport mechanisms, we can reason about constraints, risks and threats on our system: for example, is an unencrypted transport such as HTTP suitable for carrying, in this case, location, time and the picture content, or would a secure connection be better? Then there is also the question of whether encrypting the contents themselves and sending them over plain HTTP would suffice.

We might have the specific requirements:
  • Data-flows containing Location must be over a secured connection
  • Secured connections use either encrypted content or a secure protocol such as HTTPS or SFTP.
and translate these into requirements on our above system such as
  • The flow to any social media system must be over HTTPS
Some requirements and constraints might be very general, for example
  • Information of the Identifier class must be sent over a secured connection
Where the actual identifier is a short-lived, randomly generated number with very little 'identifiability' (to a unique person), the above constraint might be too strong. Each retrenchment such as this can then be specifically evaluated for the additional risk it introduces.
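Requirements of this shape reduce to simple checks over the annotated model. The sketch below applies the two example rules to a flow; the field names ('carries', 'transport', 'content_encrypted') are illustrative, not a fixed schema:

```python
def violations(flow: dict) -> list:
    """Check one flow against the example requirements above:
    Location (and, under the stronger rule, Identifier) data must
    travel over a secured connection, where 'secured' means either
    a secure protocol or encrypted content."""
    secured = (flow["transport"] in {"https", "sftp"}
               or flow.get("content_encrypted", False))
    problems = []
    if "Location" in flow["carries"] and not secured:
        problems.append("Location requires a secured connection")
    if "Identifier" in flow["carries"] and not secured:
        problems.append("Identifier requires a secured connection")
    return problems

# The photo-sharing flow from the example: location, time and picture over http
upload = {"carries": {"Location", "Time", "Picture"}, "transport": "http"}
```

A retrenchment, such as exempting short-lived identifiers, would then show up as a deliberate, reviewable weakening of one of these checks rather than a silent omission.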

* * *

What we have shown here is that by simple annotation of the data flow model according to a number of categories we can reason about what information the system is sending, to whom and how. This is the bare minimum for a reasonable privacy evaluation of any system.

Indeed, even with the two categories above we can already construct a reasonably sophisticated and rigorous mapping and reasoning against our requirements and general system constraints. We can even, as we briefly touched upon, start some deeper analysis of the specific risks introduced through retrenchments to these rules.

* * * 

The order in which things are classified is not necessarily important - we leave that to the development processes already in place. Having a model provides us with unambiguous information about the decisions made over various parts of the system; acting on the inferences drawn from these is the critical lesson.

We have other classifications still to discuss, such as security classifications (secret, confidential, public etc), provenance, usage, purpose, authentication mechanisms - these will be presented in forthcoming articles in more detail.

Constructing these classification systems might appear to be hard work; certainly it takes some effort to implement them and ensure that they are actively employed, but security standards such as ISO27000 do require this.


Silvester Norman said...

Is it true that data can be stolen while it is being transferred through the communication channel?


Ian Oliver said...

Yes, data channels can be compromised. Generally you wouldn't see this in the model of the system unless you were explicitly modelling this aspect.

However, understanding where the data channels are, what they carry and how they carry it helps in understanding where the weak points are. For example, in the examples here, if we were using http then this would be a good indication that the channel should not carry sensitive information. Similarly, any channel identified as using mechanisms outside of your control - for example, over the public internet, or a transport such as email - would be an immediate candidate for further analysis to establish whether these were acceptable risks or not.