Monday, 7 October 2013

Anatomy of an Application's Dataflows

To evaluate privacy in the context of an application we must understand how the information flows between the user, the application, the external services the application uses and any underlying infrastructure or operating system services.

We can construct a simple pattern* to describe this:

The User is the primary actor in all of this, and so becomes the starting point for the collection of data. That data then flows via the application itself, in and out of the operating system, and towards whatever back-end services the application requires, whether provided specifically for the application or by some 3rd party.

Note that in the above we define a trust boundary (red, dashed line) around the application - this denotes the area inside of which the user has control over their data and confidence that the data remains "safe".

Each data-flow can be, or must be, controllable by the user through some consent mechanism: this might be presentation of a consent text with opt-in/out or a simple "accept this or don't continue installing the application"-type consent.

We then consider the six data-flows and their protection mechanisms:

Data Flow "U"  (User -> Application)
  • This ultimately is the user's decision over what information to provide to the application, and even whether the user installs or runs the application in the first place. If anything, the concern here is ensuring that the information collected is relevant and necessary to the application's experience.
  • Understanding the totality of data collected including that from additional sources and internal cross-referencing is critical to understanding this data-flow in its fullest context.
Data Flow "P" (Application -> Back-end Services)
  • This is the primary flow of data - that is the data which the application requires to function. 
  • The data here will likely be an extension of the data supplied by the user; for example, if the user uploads a picture, then the application may extend this with location data, timestamps etc.
  • The control here is typically embedded in the consent that the user agrees to when using the application for the first time. These consents, however, are often extended over other data flows too, which makes it harder for the user to properly control this data flow.
  • For some applications this data flow has to exist for the application to function.
Data Flow "S" (Application -> Back-end Services)
  • This is the secondary flow of data, that is data about the application's operations.
  • The control over this flow is typically embedded in the same first-time usage consent as data flow "P", but the option to opt-in/out has to be given specifically for this data collection, along with the usage of this data.
  • The implementation of this control may be application specific or centralised/federated over the underlying platform.
  • The data collected here is not just from the application itself but may also include some data collected for primary purposes as well as any extended data collected from the infrastructure.
Data Flow "3" (Application -> 3rd Parties)
  • Primarily we mean additional support functions, eg: federated login, library services such as maps and so on.
  • This data flow needs to be specifically analysed in the context in which it is being used but would generally fall under the same consents and constraints as data flow "P".
Data Flows "O_in" and "O_out" (Application <-> O/S, Infrastructure)
  • The underlying platform, frameworks and/or operating system provide many services, such as obtaining a mobile device's current location or other probe status, local storage etc.
  • The user needs to be informed about the usage of these services, and that usage needs to be controlled in both directions, especially when contextual data from the application is supplied over data flow "O_in", eg: storage of data that might become generally available to other applications on the device.
  • Collection of data over "O_out" may not be possible to control, but minimisation is always required because data collected over "O_out" may be forwarded in some form over the data flows "P", "S" and "3".
  • Usually the underlying libraries and functionality of the platform are provided in the application's description before installation, eg: this application uses location services; though rarely is it ever explained why.
Any data-flow which crosses the trust boundary (red, dashed line) must be controllable from the user's perspective so that the user has a choice of what data leaves their control. Depending upon the platform and type of application this boundary may be wholly or partially inside the actual application process itself - care must be taken to ensure that this boundary is as wide as possible so that the user does have trust in how that application handles their data.
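
As a rough illustration only, the six data-flows can be written down as a small catalogue and the ones that cross the trust boundary picked out. This is a minimal sketch, not part of the pattern itself: the names DataFlow, FLOWS and flows_needing_control are hypothetical, the directions of "O_in" and "O_out" are read from the bullet points above, and the crosses_boundary values are assumptions (in particular that "U" stays inside the boundary), since the text does not state them flow by flow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFlow:
    """One data-flow from the pattern (hypothetical representation)."""
    name: str
    source: str
    destination: str
    # Assumption: whether the flow crosses the trust boundary is not
    # spelled out per flow in the text; these values are illustrative.
    crosses_boundary: bool


# The six data-flows named in the pattern.
FLOWS = [
    DataFlow("U", "User", "Application", False),  # assumed to stay inside
    DataFlow("P", "Application", "Back-end Services", True),
    DataFlow("S", "Application", "Back-end Services", True),
    DataFlow("3", "Application", "3rd Parties", True),
    DataFlow("O_in", "Application", "O/S, Infrastructure", True),
    DataFlow("O_out", "O/S, Infrastructure", "Application", True),
]


def flows_needing_control(flows):
    """Return the names of flows that cross the trust boundary."""
    return [flow.name for flow in flows if flow.crosses_boundary]


if __name__ == "__main__":
    for name in flows_needing_control(FLOWS):
        print(f'Data flow "{name}" needs a user-facing control')
```

Keeping the boundary decision as explicit data makes it easy to revisit when a platform places the boundary differently.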

The implementation of the control points on each of the data flows, as has been noted, may be application specific or centralised across all applications. How the control is presented is primarily a user-interface matter; which controls are offered, and at what granularity, is a user-experience matter.

The general pattern here is that for each data-flow that crosses the trust boundary, a control point must be provided in some form. At no point should the user have to run the application, or be in a state where information has to be sent over those data-flows, without the control point having been explicitly set.
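
One possible way to express "explicitly set" in code is to give each control point three states rather than two, so that an unanswered consent question is never mistaken for a yes or a no. The following is a sketch under that assumption; the names Consent, ControlPoints and ControlPointNotSet are hypothetical and do not come from any particular platform or library.

```python
from enum import Enum


class Consent(Enum):
    UNSET = "unset"      # the user has not been asked, or has not answered
    GRANTED = "granted"
    DENIED = "denied"


class ControlPointNotSet(Exception):
    """Raised when a flow is used before its control point is set."""


class ControlPoints:
    """Tracks one control point per data-flow (illustrative only)."""

    def __init__(self, flow_names):
        # Every control point starts UNSET: there is no default consent.
        self._state = {name: Consent.UNSET for name in flow_names}

    def set(self, flow, granted):
        if flow not in self._state:
            raise KeyError(flow)
        self._state[flow] = Consent.GRANTED if granted else Consent.DENIED

    def unset(self):
        """Names of flows whose control point still needs a decision."""
        return [n for n, s in self._state.items() if s is Consent.UNSET]

    def may_send(self, flow):
        """True if data may be sent; refuses to guess when unset."""
        state = self._state[flow]
        if state is Consent.UNSET:
            raise ControlPointNotSet(f'control point for "{flow}" not set')
        return state is Consent.GRANTED


if __name__ == "__main__":
    points = ControlPoints(["P", "S", "3"])
    points.set("P", True)
    points.set("S", False)
    print(points.unset())
```

An application built this way would check unset() before first use and call may_send() before every transmission, so a missing decision surfaces as an error rather than as silent data collection.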

So this constitutes the pattern for application interaction and data-flow; specific cases may have more or less specific data-flows as necessary.

Additional Material:

* There's a very good collection of patterns here at Privacy Patterns, though I've rarely seen patterns targeted towards the software engineer and in the GOF style, which is something we really do need in privacy! Certainly the patterns described at Privacy Patterns can be applied internally to the data-flow pattern given here - then we start approaching what we really do need in privacy engineering!


Antti Vähä-Sipilä said...

Microsoft advocates STRIDE in the context of data flow analysis. As I've been using it a lot, three years ago I tried to extend that into privacy thinking (see http://www.vähä-sipilä.fi/avs/blog/Computing/Security/en.strideplusfour.html). So this is a very data flow centric idea too, closely related.

Nowadays I call the "plus four" TRIM, for the considerations

- Transfer (over the boundary) - is the boundary a meaningful privacy boundary, and do any specific requirements apply

- Retention (is it well defined and understood; applies to data stores)

- Informed consent (is the data emission under user's informed control; applies to data flow sources)

- Minimisation (is the data emission technically the minimum required; applies to data flow sources)

Ian Oliver said...

Hi Antti! Long time since we talked! I used Stride and the MS SDL tool a long while back when I started this; very very useful but huge amounts of pushback from some - PowerPoint is *so* much better at modelling and analysis you know ;-)

Always found it amazing how quickly an answer can be reached even with the construction of a simple model and appropriate classifications and analysis.