This is the second in a series of posts on data breaches. Last time I sounded fatalistic about big data breaches. Here I attempt to understand how data breaches happen. As it turns out, there is a lot of hype and confusion on this question. We have all heard about the 100 million data disclosures, and by now most of us have received multiple notifications from places that lost our data.
Can we answer a simple question: when enterprises lose data, how do they lose it? To answer this, I took the raw statistics compiled by Privacy Rights Clearinghouse, a site that records publicly known data breach incidents. Attrition has gone further and recorded these in an Excel spreadsheet. My research team (thanks: Vai Osborne – my cheeky office life-support & Stephanie Hartenstein) cranked through this spreadsheet and filtered it down to incidents where there is a clear record of how data was lost. We assume that data loss could fall into four categories –
- Email – someone emailed lots of sensitive data in an unauthorized way from inside an enterprise
- Tape – a tape with sensitive data got lost, perhaps from the back of a truck
- Laptop – someone lost a laptop that had sensitive data
- Database – someone accessed or hacked into a data server (database or file server) that stores lots of sensitive data.
So our question becomes – what percentage of data loss incidents fall into each of these four categories? Before you read further, please stop here and do a fun experiment. Come up with your own guess of how frequently each of these four types of incident occurs.
Here is what we found –
1. Data breach incidents – When we count the data breach incidents to date (2004-2007), we find a total of 318 filtered incidents where the way the data was lost was clearly recorded. Of these incidents, laptops account for the largest share (47% - 149 incidents), databases come next (40% - 126 incidents), then tapes (11%), and email last (2%).
2. Data loss exposure – When we quantify data breaches by the amount of data lost in each breach (we call this the data loss exposure), the ranking changes somewhat. Of a total of roughly 127 million lost records, databases rank first (64% - 84 million), laptops next (25% - 32 million), then tapes (10%), and then email (1%). Note that data loss exposure is fairly approximate: enterprises are still not widely monitoring incidents for data exposure. Even so, this number is quite revealing.
3. Data theft risk – Data loss does not necessarily mean theft, but each theft has to begin with a loss. If we want to calculate the risk of data theft, we need to factor in the probability that the lost data actually falls into bad hands, or in other words, establish that there was “intent” to steal after a data loss incident. To measure data theft risk, we have no data to rely on: we will literally have to make it up. In the attached spreadsheet, I arbitrarily set this risk at 60% for email, 60% for database, 20% for tape, and 20% for laptop. I think this is reasonable: an internal user emailing data out, or a user hacking into a database, seems to have much more willful intent than a tape or laptop getting lost in the general population. (I once lost a laptop on a plane. I am willing to bet it was cleaned out and sold for parts. The odds that someone stole it for the data are low; certainly less than the 20% I conservatively assumed above.) If you don’t agree with my theft risk, feel free to play with these numbers in the spreadsheet. Multiplying data exposure by theft risk, and then normalizing so the four categories add up to 100%, we find that data theft risk ranks in the following sequence: first databases by an overwhelming measure (84%), followed by laptops (11%), tapes (4%), and email (1%).
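The multiplication in item 3 can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet itself: it assumes the rounded exposure percentages quoted in item 2 as inputs (the spreadsheet presumably works from the underlying counts), and the function and variable names are made up for this sketch. Because of that rounding, the results land within about a percentage point of the figures quoted above rather than matching them exactly.

```python
# Exposure shares (percent of data lost) as quoted in item 2, already rounded.
exposure_share = {"database": 64, "laptop": 25, "tape": 10, "email": 1}

# Guessed probability that a loss turns into a theft (the arbitrary values from item 3).
theft_risk = {"database": 0.60, "laptop": 0.20, "tape": 0.20, "email": 0.60}


def theft_risk_shares(exposure, risk):
    """Weight each category's exposure by its theft risk, then normalize to percent."""
    weighted = {source: exposure[source] * risk[source] for source in exposure}
    total = sum(weighted.values())
    return {source: 100 * value / total for source, value in weighted.items()}


shares = theft_risk_shares(exposure_share, theft_risk)

# Print the categories from highest to lowest share of theft risk.
for source, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{source:9s} {pct:5.1f}%")
```

Changing the values in `theft_risk` is a quick way to see how sensitive the ranking is to the guessed probabilities.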
These three charts are shown together in the graph below.

Conclusion: Data breaches are not all equal. The source of data breaches matters.
Contrary to general belief, the top two sources of data breaches are databases and laptops, probably in that order, and not email or tapes.
What do you think? [I have posted the spreadsheet with this data here, for your use.]