Wednesday, 25 September 2013

Data Handling

Organisations hold databases containing personally identifiable information which, if leaked, could cause harm to all parties. Here we start to move beyond "security and privacy" as technological functions towards "trust", particularly trust that the people who have access to the data handle it safely. The maxim that needs to be applied when handling data:

Data handling is a safety-critical exercise.

I'd therefore like to briefly describe a number of points on this subject to lay out the areas that need focus. These guides should be adapted to the local context and applied regardless of data content, however benign the data might appear [1]:

Collection
  • data collection should be minimised and should avoid superfluous, unnecessary data points. In particular, data pertaining to race, ethnicity, health, sexual orientation and politics is to be avoided except in certain well-defined cases such as personnel records or particular kinds of ethnographic research.
  • be aware of the possibilities of cross-referencing the data and of the accuracy with which certain data points can be collected, eg: street name vs GPS coördinates
  • the data subject must be informed about the collection of any personal details
  • be clear about the purposes and usages for which this data is being collected, and stick to those purposes and usages
Storage and Transmission
  • the storage of the data must have the relevant protections in place, such as encryption at file-system, database and/or field level as required (a sketch of field-level encryption follows this list)
  • be aware of the location of the storage and whether this is in-house or in-cloud. Certain jurisdictions take a dim view of data leaving their countries for example
  • data in transit should travel over a secure medium and/or be encrypted
  • emailing or moving data on physical media is as much a transmission of data as downloading it via a web browser and the like
  • be very, very careful about using email, instant messaging and other common messaging systems. Some systems support security classification and encryption; be sure to know how and when to use those mechanisms. Even internal company email is not secure!
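
As a concrete illustration of field-level encryption, here is a minimal sketch in Python assuming the third-party cryptography package; the record and its field names are hypothetical, and a real deployment would load keys from a proper key store:

    # Minimal sketch of field-level encryption using the "cryptography"
    # package (pip install cryptography). Generating the key in place
    # is for illustration only; use a key store in practice.
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()          # in practice: load from a key store
    fernet = Fernet(key)

    record = {"name": "Jane Doe", "diagnosis": "..."}   # hypothetical fields

    # Encrypt only the sensitive field before it reaches the database.
    record["diagnosis"] = fernet.encrypt(record["diagnosis"].encode())

    # Decryption is possible only for holders of the key.
    plaintext = fernet.decrypt(record["diagnosis"]).decode()

Fernet provides authenticated symmetric encryption, so tampering with the stored field is detected at decryption time.
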
Deletion
  • data deletion should always be performed to appropriate standards such as those detailed in DoD 5220.22-M and AFSSI-5020 (see the sketch after this list)
  • if data is deleted, ensure that certificates of destruction are obtained for the data and any hardware, including backups
  • virtual machines, "cloud" and hosted services are always problematic - only use those which have detailed and sufficient data-deletion procedures
  • don't forget about email archives, backup devices, "forgotten" memory sticks, cameras etc...
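
For illustration, a multi-pass overwrite in the spirit of DoD 5220.22-M might look like the sketch below. Note this is a sketch and not a sanctioned wipe tool: on SSDs, journaling file systems and cloud storage, overwriting in place gives no guarantee of destruction, which is exactly why certificates of destruction matter.

    # Illustrative multi-pass overwrite (zeros, ones, random, unlink)
    # in the spirit of DoD 5220.22-M. NOT a guarantee of destruction
    # on SSDs, journaling file systems or cloud storage.
    import os

    def overwrite_and_delete(path: str) -> None:
        size = os.path.getsize(path)
        with open(path, "r+b") as f:
            for pattern in (b"\x00", b"\xff", None):    # None = random pass
                data = os.urandom(size) if pattern is None else pattern * size
                f.seek(0)
                f.write(data)
                f.flush()
                os.fsync(f.fileno())    # push each pass to the device
        os.remove(path)
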
Security Classifications and Markings
  • all data must be assessed for its sensitivity and marked appropriately
  • a simple Secret-Confidential-Public-Unclassified system suffices, with appropriately defined requirements for each level (a sketch of such a scheme follows this list)
  • further classification such as content marking may also be employed
  • ensure that everyone knows what these classifications mean
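
To make the levels machine-enforceable, a simple ordered enumeration suffices. The sketch below uses the four levels named above; everything else (function name, marking format) is an assumption:

    # Sketch: the classification levels as an ordered enum, plus a
    # trivial marking function. The ordering lets code compare levels.
    from enum import IntEnum

    class Classification(IntEnum):
        UNCLASSIFIED = 0
        PUBLIC = 1
        CONFIDENTIAL = 2
        SECRET = 3

    def mark(title: str, level: Classification) -> str:
        """Produce a human-readable marking for a document."""
        return f"[{level.name}] {title}"

    print(mark("Q3 customer report", Classification.CONFIDENTIAL))
    # -> [CONFIDENTIAL] Q3 customer report
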
Access Control
  • only persons who have a need to view and hold the data are given access
  • access control here is primarily a back-office procedure which may have a technical implementation
  • auditing and logging of accesses, grants of access and revocations should be performed (a sketch follows this list)
  • a revocation of access must be made as soon as it becomes necessary, eg: someone leaving the company
  • ensure in the above case that any data still held by that person, eg: backups, memory sticks, laptop contents etc, is also dealt with
  • are your procedures documented sufficiently and in compliance with the necessary standards, eg: ISO 9000? Even if you don't need to be certified you can still apply good practice!
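
A back-office procedure still benefits from a technical backbone. One minimal sketch of an append-only audit trail, recording grants, revocations and accesses as JSON lines (all field names are assumptions):

    # Sketch: append-only audit log of access-control events.
    import json
    from datetime import datetime, timezone

    def audit(event: str, who: str, dataset: str,
              path: str = "audit.log") -> None:
        entry = {
            "time": datetime.now(timezone.utc).isoformat(),
            "event": event,      # "grant" | "revoke" | "access"
            "who": who,
            "dataset": dataset,
        }
        with open(path, "a") as log:     # append-only by convention
            log.write(json.dumps(entry) + "\n")

    audit("grant", "j.smith", "payroll-2013")
    audit("revoke", "j.smith", "payroll-2013")   # eg: on leaving the company
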
Extraction
  • when data is extracted from a database, ensure that the appropriate security handling, markings and requirements are preserved on the extract. If the database is Secret then so are all extracts from that data
  • extracts should be sanitised - stripped of unnecessary fields - and anonymised/obfuscated as necessary; these processed extracts need to be reclassified! A sketch of this step follows the list.
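
Here is a sketch of that sanitise-and-anonymise step, assuming hypothetical field names and a salted hash for pseudonymisation. A salted hash is a weak form of anonymisation - a keyed hash or a tokenisation service would be stronger - and, as the list says, the output still needs reclassifying:

    # Sketch: sanitise an extract by dropping unnecessary fields and
    # pseudonymising the identifier with a salted hash.
    import hashlib

    SALT = b"replace-with-a-secret-salt"    # must be kept out of the extract
    KEEP = {"age_band", "region"}           # hypothetical fields to retain

    def sanitise(row: dict) -> dict:
        out = {k: v for k, v in row.items() if k in KEEP}
        out["subject_id"] = hashlib.sha256(SALT + row["name"].encode()).hexdigest()
        return out

    print(sanitise({"name": "Jane Doe", "age_band": "30-39",
                    "region": "Uusimaa", "salary": 52000}))
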
Handling inside a "safe" environment
  • a safe environment is one where the above security controls can be enforced, eg: a company's intranet, and even here the movement of data by physical means (memory sticks etc) or electronic means (email) must be minimised
  • internal email systems are not necessarily secure
  • clean desk and locked office policies may be required in some instances
  • can an audit trail of access be established?
Handling outside a "safe" environment
  • outside of the safe environment, eg: in public, on the train or at a conference, precautions must be taken to ensure the data remains unavailable to unauthorised parties, through mechanisms such as full-disk encryption or file encryption
  • use of removable media such as memory sticks should be discouraged
  • printing of data similarly so, especially when it comes to the disposal of printed media
  • the data and the handler should not be parted unless it is absolutely necessary, for example, packing the laptop into the checked baggage on an aircraft; and even this situation might not be safe in all cases
Breach Handling 
  • a procedure for handling loss of data needs to be defined
  • such a procedure needs to establish: the handler, the data set, the media on which the data was stored, the contents, the amount, the time of loss and the location. For example, were the company's financial statements for the coming year lost on an unencrypted memory stick on a train late at night, or was an encrypted laptop stolen? (a sketch of such a record follows this list)
  • the necessity to inform authorities (eg: police, data breach announcement) must be established
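
To make the "establish" step concrete, the facts above could be captured in a structured record such as the following sketch (field names are assumptions drawn from the list above):

    # Sketch: a structured breach record mirroring the facts the
    # procedure must establish. Purely illustrative.
    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class BreachRecord:
        handler: str         # who held the data
        dataset: str         # which data set
        media: str           # eg: "unencrypted memory stick"
        contents: str        # what the data contained
        amount: str          # how much was lost
        lost_at: datetime    # time of loss
        location: str        # where it was lost
        encrypted: bool      # drives severity and notification duties

    report = BreachRecord("a.jones", "financial statements, coming year",
                          "unencrypted memory stick", "draft statements",
                          "1 file", datetime(2013, 9, 24, 23, 40),
                          "train", encrypted=False)
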

Note the analogies between handling data and the procedures for handling chemicals in an industrial context or patients in a medical context. Of course, the above guides are just guides, and incomplete, as we cannot foresee every situation and context - local customisation of the implementation of these is always required.


Notes:

[1] The UK's Health and Safety Executive COSHH guidelines on handling water - these might surprise you for such an "inert" substance, for example: Potable Water and Legionella Control and Hot and Cold Water Systems.

Sunday, 22 September 2013

Learning Languages

I spent part of this morning watching children's TV with my son; in particular we watched Unna Junna, a children's programme broadcast on the Finnish YLE network in the Sami language, conveniently subtitled in Finnish.

Aside from the discussion about what the presenter was saying and its translation into Finnish, at least for the words I recognised or could guess, this turned into an early-morning exercise in comparative linguistics for me.

For almost as long as I can remember, linguistics and language have excited me, and this morning was just one of those exciting linguistic experiences so beloved of polyglots. After studying Finnish for many years I find myself getting excited about reading road signs in Estonian, children's TV in Sami, etc, and mapping these to my knowledge of Finnish. I find this pretty cool.

Today I also came across a video of a presentation by Anthony Lauder on "PolyNots", given at a Polyglot conference in Budapest in 2013. This video is worth watching just for Anthony's presentation skills alone. From this video (if the YouTube deities are smiling on you) you'll get links to a host of other videos on multilingualism and language.

Something I've noted is that many polyglots and people who are generally interested in languages seem to admit that they were never any good at languages in school. For me, I certainly remember being rote-schooled in French and German, much to my disappointment and much to the detriment of the learning process; enjoyment was quite literally a foreign concept. To this day I recall German lessons in one school as a pedagogical nightmare: crammed into a small room with over 40 other teenagers all hell-bent on not learning, with a teacher whose attitude to teaching was less than exemplary.

During those times I took solace in buying dictionaries and books on languages. I still have a dog-eared copy of Russian Made Simple [1] which I studied intently.

Having no support in the form of another Russian speaker, nor, as it turned out, any help with certain linguistic concepts, made things `difficult' to say the least. I remember one incident with a school teacher when I asked what the dative case was - sadly the answer wasn't an explanation of how indirect objects work but rather a full-scale dressing-down over my poor performance in French and German lessons - I wonder why?

So if I were to learn another language again it obviously couldn't be on the terms of the UK secondary education system. University was quite a different matter, but I spent most of my time studying computer science and mathematics. These, however, provided me with an interesting set of tools for natural language learning.

Aside: I wrote my bachelor's degree dissertation on machine translation - coincidence?

The first tool is the observation that all languages follow some general patterns - at least, most Indo-European and Finno-Ugric languages do. There will be numbers, there will be pronouns, there will be nouns of various kinds, there will be verbs and tenses, possibly even adjectives and adverbs too. All sentences have a mix of subject, verb and object in various orders. So at that level there isn't too much difference, eh?

Actually, for the most part you can make a one-to-one mapping from your mother tongue to any other language and get by. For example:

English:  I, He, She, We, You, Red, Blue, House, Dog,...
Welsh: Fi, Fe, Hi, Ni, Chi, Coch, Glas, Ty, Ci,...
Finnish: Minä, Hän, Hän, Me, Te, Punainen, Sininen, Talo, Koira,...
Russian: Я, он, она, мы, вы, красный, синий, дом, собака, ...
Estonian: Mina, Ta, Ta, Me, Sa, Punane, Sinine, Maja, Koer, ...

Note the similarities between Estonian and Finnish: learn one and you almost get the other for free! The only thing that makes Russian a little more difficult is the script.
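
In programming terms this one-to-one mapping is literally a lookup table; a toy sketch in Python, using a few of the words above:

    # Toy sketch: word-for-word mapping as a lookup table.
    english_to_finnish = {
        "I": "minä", "he": "hän", "she": "hän",
        "red": "punainen", "blue": "sininen",
        "house": "talo", "dog": "koira",
    }

    def get_by(words):
        """Translate word by word - good enough to 'get by'."""
        return " ".join(english_to_finnish.get(w, f"<{w}?>") for w in words)

    print(get_by(["red", "house"]))   # -> punainen talo
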

Once you can get by, you can build vocabulary, attune yourself to the subtleties of expressing yourself in the new language and, most importantly, gain confidence.

Actually, let's emphasise that last point: GAIN CONFIDENCE. The difference between a child learning a language and an adult is that children have infinite confidence: they don't care about not understanding or making mistakes, and they play with the language.

Learn a small core set of words: I, you, he, she, it, we, they, red, blue, green, one, two, three, come, go, buy, want, please, thank you, hello, big, small, "help! I'm trying to learn!" and so on.

Learn words that you find interesting: if you like Formula 1 then learn the words relevant there: race, win, crash, speed, overtake etc (actually these could be very useful in any conversation with motorsport-obsessed Finns).

Don't worry about perfect or sometimes even vaguely correct grammar.

The more you use a language, and the more you TRY to use a language, the better you will become and the more accepting of your mistakes you will be. Your grammar and style will improve too.

You WILL MAKE MISTAKES... if you analyse two native speakers' speech against what the books tell you, you will notice immediately that they make huge numbers of "mistakes" in grammar, phrasing etc. Remember what we said about how children learn languages.

Read stuff that you find interesting: many learners' books - and I remember one newspaper for Finnish learners - are so simplified and grammatically correct that they are totally uninteresting and demoralising to read. If you're trying to learn Finnish, go and read Mika Waltari's books, from the Komisario Palmu novels to his literary classics such as Sinuhe. Palmu is Finland's answer to James Bond.

Here you will learn three things:
  1. there's huge amount you don't understand
  2. the bits that you do understand greatly compensate for the bits you don't, and you'll learn to guess and work around the gaps.
  3. you'll have fun and gain confidence
If you don't understand something, GUESS! This works really well in speech as well as in comprehension and reading.

In Anthony's talk he described the two-step process for learning 10 languages (based on Peano's axioms, apparently!):
  • Step 1: Learn 9
  • Step 2: Add 1
Of course, the first new language is the hardest, but once you've learnt the patterns, acquired a core vocabulary and found out what you like reading and discussing, the next one is much easier; just like applying the successor function in Peano's axioms.

Actually, at the end of the day you'll be surprised what a little vocabulary and a heap of confidence will do. I'm in no way fluent or even reasonably competent in French, but a knowledge of menus and of how to order beer and food gets me remarkably far in France, and seems to be much appreciated by the natives. In other words, a level of fluency my school language teachers could only have hoped for.

One final word: when speaking with a native in that person's language, resist as much as you can any attempt by that native to speak your language. Finns will almost invariably speak English to a foreigner because "no foreigner learns Finnish" and because "they need the practice in English anyway" (despite the fact that most Finns speak English better than most native English speakers). Don't worry about code switching, that is, mixing languages when you don't know a word; keep the flow of conversation going rather than worrying about correct grammar, pronunciation etc. Indeed, that is the very essence of fluency.


References:

[1] Eugene Jackson, Elizabeth Gordon, Geoffrey Braithwaite, Albina Tarasova (1977) Russian Made Simple. W.H. Allen, London. 0-491-01582-B

Thursday, 19 September 2013

Where are all the Information Scientists?

What happened to all the scientists and engineers with a real, deep understanding of what information is? We're now apparently firmly in the era of Big Data, and yet most Big Data implementations seem to stop at building large Hadoop clusters and running modified versions of the canonical word-counting programs over Apache log files.
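
For reference, the "canonical word counting program" is little more than this sketch (the log file name is hypothetical):

    # The canonical word count, sketched in Python over an Apache-style
    # log file. The file name is hypothetical.
    from collections import Counter

    counts = Counter()
    with open("access.log") as log:
        for line in log:
            counts.update(line.split())

    for word, n in counts.most_common(10):
        print(n, word)
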

I had the pleasure of working with a group of people who were blessed with PhDs in mathematics, physics and statistics and yet they complained bitterly about the lack of good, interesting and relevant data in order to properly ply their skills and generate valuable insights.

I know of one person who is utilising some quite cool natural language processing over massive data sets to extract all kinds of context, and yet no-one seems to be interested in what this is achieving.

The current Gartner Hype Cycle for Big Data has us on the edge of the precipice into the Trough of Disillusionment, and once again, when the negative press starts, everyone will be surprised. Worse, adopters of Big Data and analytics will not question why it went wrong but rather blame Big Data for not delivering.

One of the reasons why Big Data is heading for this "failure" is that we do not understand, or more accurately do not want to understand, our data. We do not want to put the investment into understanding the subtle and complex interactions between the various aspects of our data, the classification structures, the processing etc.

Indeed, I find much of the current trend to NoSQL to be an excuse for ignoring the very basics of understanding data. Apparently SQL is bad: I'm told that SQL means static table structures, fixed schemas, inflexible query languages, and that all that relational theory about data organisation is obsolete and irrelevant. Multi-terabyte "traditional" databases are uncool, but cloudified Linux clusters in AWS running Hadoop and canonical word-counting programs over web log files are COOL? At one level we've just made OS/360 and COBOL more complicated without doing much more than was possible 30 years ago.

Have we really forgotten how to manage and construct our data so that it becomes information? Or, as I also worry about, have we lost the ability to understand what our data could tell us and been blinded by the hype behind a technology? Not that it hasn't happened before of course.

Much of our emphasis is on building products and getting cool apps to users; sometimes we get data out of these apps by either primary or secondary means. Surprisingly little thought is given to what data we collect and how we could really understand it, and even less to the interactions between different datasets and what those could tell us.

Rather than understanding the context in which our information exists we've become obsessed with generating abstract counts of things which translate easily into bar charts on a PowerPoint slide. We're failing at extracting the deeper context.

It's fine to learn that an application is being used by one million users this week and two million the next, but does this really tell you how people are using that app, who is using it, why, where and what for? Even more importantly, we don't even consider the inverse question: why aren't people using my app?

Consider air miles or store cards - when was the last time an airline or supermarket contacted you to ask why you are, or are not, flying or shopping with them? I can guarantee that the airline or supermarket has counts of how many things were bought or flights flown... hardly Big Data and deep analytics, is it?

To solve this we really do need the return of the information scientist - someone who understands data, information and knowledge, who understands taxonomies, ontologies and the mathematical underpinnings of these. He or she also needs to know how to implement these: how best to organise a database or a data structure, whether it be physically implemented in a triple store, a relational database, a flat log file etc.

For many this then rears the spectre of data governance and huge data-management processes that slow the business down - you're not being agile then, are you? Big Process is often the result of not understanding, or not wanting to understand, how to do things. If you have a number of apps and services, wouldn't it be better to harmonise the information collection and ensure that the schemata and underlying semantics are consistent across all your collected data? Surely spending less time on data quality and consistency, because quality and consistency are already inherent in your data, is a much better use of your resources.

So, if you're having arguments over which NoSQL database to use, or whether triple stores and graph databases have the performance you require, or even whether the Semantic Web is just an academic construct, then you're certainly not doing Big Data, and the knowledge you gain from the data you are capturing will only support your business in the most ephemeral of ways.

Wednesday, 18 September 2013

DNT Dies...

It seems the W3C's Do Not Track effort has had more lives and deaths than Schrödinger's cat and a recently bought Norwegian Blue parrot. After the previous episode of opening the cat's box and finding it alive, or at least merely sleeping, or maybe dumping some bird seed inside and hoping for the best, DNT might actually be dead - or at least, either the vial of poison has been observed to break, or John Cleese has just beaten it against a shop counter.

So, while the Digital Advertising Alliance exits the DNT group citing a number of reasons, including ill-defined scope and the lack of a definition of tracking, we're back in the same place: we have once again failed to define what privacy actually means.

In my paper for the W3C's Do Not Track and Beyond workshop, held back in 2012, I wrote the following:

We still do not have a good theory of privacy or even common terminological framework that unifies the engineers, scientists, mathematicians, lawyers and consumer advocates - let alone the end user - yet.

The failure to define what DNT is (or was), the semantics of tracking, and the very definition of privacy in this context was probably the major factor in DNT's failure to progress and to bring together disparate groups with differing notions of privacy.

Indeed, the above is just one symptom of the malaise affecting privacy and, on a wider scale, the very ideas of data processing and Big Data. We are not spending any time thinking about the semantics or meaning of these. Within the privacy community, even trying to start a discussion on the semantics of privacy is fraught with difficulties that only Machiavelli and Kafka could have dreamt about.

In an earlier article entitled On the Naivety of Privacy I wrote:

I think we're missing the point what privacy really is and certainly we have little idea at this time how to effectively build information systems with inherent privacy [3] as a property of those systems. I have one initial conclusion:
WE HAVE NO UNDERLYING THEORY OF PRIVACY
We have no common definitions, common language, common semantics nor mappings between our individual worlds: legal, advocacy and engineering. Worse, in each of these worlds terminology and semantics are not always so well internally defined.

Maybe the current difficulties of the DNT group will force us to reassess what we mean by "privacy" and "tracking"?


Wednesday, 4 September 2013

The Art of Writing Good Documentation: Teach, Don't Tell

I've just come across this posting on how to write documentation, or more specifically, how to write GOOD documentation. It is by Steve Losh and called Teach, Don't Tell.

For example:

If you use many open source libraries you’ve undoubtedly encountered some whose README says something like “read the source”. Every time I see one, I die a little bit inside.

Source code is not documentation. Can you learn to be a guitarist by simply listening to a piece of music intently? Can you become a painter by visiting a lot of museums? Of course not!

This is one of the main reasons why we have given up on otherwise brilliant languages, frameworks etc; for example, it is the main reason why, sadly, we had to give up on Opa.

This is almost always the case when the experts in a language, framework or library are the very people developing it. They are so consumed by developing that they forget the people who want to develop with their creation.

Teach developers how to do something; don't fob them off with function signatures, obscure examples etc. If a developer is asking naive and "stupid" questions then there's probably a damned good reason why.

Maybe reading Zen and the Art of Motorcycle Maintenance should be compulsory for all?