I get a lot of requests from people for data from Have I been pwned (HIBP) that they can analyse. Now obviously, there are a bunch of people up to no good requesting the data but equally, there are many others who just want to run statistics. Regardless, the answer has always been "no": I'm not going to redistribute data to you. In fact, the requests were happening so frequently that I even wrote the blog post No, I cannot share data breaches with you.
However, as part of HIBP's 3rd birthday celebrations, I am going to share data with you, quite a lot of it. In fact, I'm opening up almost all the data in HIBP with a few very important caveats:
- All personally identifiable information has been removed
- All information about the domain each account is on has been removed
- All sensitive data breaches have been removed
As much as I want to provide data for analysis, I don't want to put anyone at any further risk, which is why the personally identifiable data is gone. I've been careful to ensure the system is not open to abuse through measures such as the API rate limit, and serving up all the raw data in one big file would obviously undermine that objective.
Removing the domain data means that insights about who's been impacted where can't be inferred. I don't want someone working out precisely which services a company's staff has accounts on. Obfuscating the domain would still pose a risk: if you can work out what just one account on an obfuscated domain is (say because of the unique set of breaches they appear in when querying the live system), you could resolve the domain.
Removing sensitive breaches ensures that creative people can't find where an account was compromised in a breach I deliberately keep from being publicly searchable. For example, I myself am in 8 non-sensitive breaches and if you took those 8, found a row in the data I'm opening up and saw that those very specific 8 and an adult website were in the same row then that poses a privacy risk. Sensitive breaches account for 11% of the total data in the system so in the overall scheme of things, it's a small loss.
Let me talk about what I've actually done here. All the data about accounts in breaches is stored in Azure Table Storage. I've written at length about Table Storage before and how I designed the partitioning. It means that a single account in there looks like this:
That's obviously my record and you'll see that the domain of my email address is the partition key, the alias is the row key then there's a semicolon delimited array of pwned sites and a time stamp of when the record was last modified. What I'm doing is extracting just the websites for each account, removing any sensitive ones then representing it all in one line as follows:
There are exactly 1,431,112,732 rows adhering to this pattern in the file I'm giving everyone today. That's a 15.3GB file which I could have distributed out just like this, but that would mean massive redundancy because you get a bunch of rows that look just like this:
So what I've done instead is aggregated the results, grouping them by the impacted sites and putting a count next to them which brings it down to a 135MB text file. This means that there's only one row that shows only LinkedIn and it looks like this:
In other words, there are 105M email addresses that appear only against LinkedIn but rather than using that many rows in a text file to describe it, there's just the one row. Other times you'll see rows like this:
There are 20 people that have been pwned in that unique combination of 5 websites. Of course there are many, many combinations with only 1 entry. My own, for example, is the only email address which appears in those 8 unique sites. (This should also help explain why I needed to remove the sensitive data breaches in order to protect privacy.)
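The extraction and grouping described above can be sketched roughly as follows. This is a minimal illustration rather than the actual code behind HIBP: the only detail taken from the real system is that each account carries a semicolon delimited list of pwned sites. The `SENSITIVE` set, the breach name `SensitiveSiteA`, the function names and the sample records are all made up for the example.

```python
from collections import Counter

# Hypothetical sensitive breach name, purely for illustration.
SENSITIVE = {"SensitiveSiteA"}


def non_sensitive_sites(pwned_sites, sensitive):
    """Split a semicolon delimited site list and drop sensitive breaches.

    The result is sorted so the same set of sites always produces the
    same key, regardless of the order they were stored in.
    """
    sites = {s.strip() for s in pwned_sites.split(";") if s.strip()}
    return tuple(sorted(sites - sensitive))


def aggregate(records, sensitive):
    """Count how many accounts share each combination of sites."""
    counts = Counter()
    for pwned_sites in records:
        combo = non_sensitive_sites(pwned_sites, sensitive)
        if combo:  # accounts seen only in sensitive breaches are dropped
            counts[combo] += 1
    return counts


# Made-up records: one string of pwned sites per account.
records = [
    "LinkedIn",
    "LinkedIn",
    "Adobe;LinkedIn",
    "LinkedIn;Adobe",
    "SensitiveSiteA",
    "Adobe;SensitiveSiteA",
]

counts = aggregate(records, SENSITIVE)
for combo, n in sorted(counts.items()):
    print(";".join(combo), n)
```

Sorting the sites before using them as a key is what collapses "Adobe;LinkedIn" and "LinkedIn;Adobe" into a single combination. The account that only appears in the sensitive breach contributes nothing to the output.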
Let me summarise exactly how these numbers break down as I know people who analyse data in depth like to understand these things precisely:
- 1,989,141,353 - the number of accounts currently represented as being in HIBP. This is every occurrence of an account in a breached system so my email address has 8 records in there. This number also includes usernames; in cases like Snapchat, there are 4.6M records and none of them are email addresses.
- 1,574,694,164 - the number of unique email addresses. It's roughly 414M lower than the previous number because a bunch of email addresses have appeared in multiple data breaches and it doesn't include any usernames. (Sidenote: I don't tend to load usernames where email addresses are available instead.)
- 1,431,112,732 - the number of unique email addresses which appear in at least one breach that is not sensitive. This means that almost 144M email addresses only appeared in sensitive breaches. (Incidentally, there are a total of 221M accounts in breaches marked as sensitive in HIBP.)
- 2,399,307 - the unique number of website combinations accounts have appeared in. This is how many rows are in the file I'm sharing and there's a count against each one showing how many times this combination of sites appears against an email address.
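For anyone who wants to sanity check how the figures in the list above relate to each other, here's a small sketch of the arithmetic. The variable names are mine; the numbers are the ones quoted above, with the 221M sensitive accounts being a rounded figure.

```python
# Figures quoted in the list above.
total_accounts = 1_989_141_353        # every account occurrence, incl. usernames
unique_emails = 1_574_694_164         # unique email addresses
non_sensitive_emails = 1_431_112_732  # emails in at least one non-sensitive breach
combinations = 2_399_307              # rows in the aggregated file
sensitive_accounts = 221_000_000      # rounded figure for sensitive breaches

# Gap explained by repeat appearances across breaches plus usernames.
repeats_and_usernames = total_accounts - unique_emails

# Email addresses that only ever appeared in sensitive breaches.
sensitive_only_emails = unique_emails - non_sensitive_emails

# Share of all account records that sit in sensitive breaches.
sensitive_share = sensitive_accounts / total_accounts * 100

# Average number of email addresses behind each row of the file.
average_per_combination = non_sensitive_emails / combinations

print(f"{repeats_and_usernames:,}")
print(f"{sensitive_only_emails:,}")
print(f"{sensitive_share:.1f}%")
print(f"{average_per_combination:.1f}")
```

The first two results are the "roughly 400M" and "almost 144M" gaps mentioned in the list, and the share works out to about 11%.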
For each row, you can then take the breach names and reconcile them to the list of breaches exposed in the API. What this means is that you can access all the other attributes of the incident, for example here's Dropbox:
"Description":"In mid-2012, Dropbox suffered a data breach which exposed the stored credentials of tens of millions of their customers. In August 2016, <a href=\"https://motherboard.vice.com/read/dropbox-forces-password-resets-after-user-credentials-exposed\" target=\"_blank\" rel=\"noopener\">they forced password resets for customers they believed may be at risk</a>. A large volume of data totalling over 68 million records <a href=\"https://motherboard.vice.com/read/hackers-stole-over-60-million-dropbox-accounts\" target=\"_blank\" rel=\"noopener\">was subsequently traded online</a> and included email addresses and salted hashes of passwords (half of them SHA1, half of them bcrypt).",
The value in the file I'm distributing is the "Name" attribute you see above. It often differs from the "Title" attribute in that the former is a stable, alphanumeric value not intended for public display whilst the latter is a human-facing value that may change (e.g. if Dropbox gets popped again and I need to differentiate the two incidents). For example, the name "ModernBusinessSolutions" is different from the title "Modern Business Solutions". Just something to keep in mind if you're reconciling data back to the breach entities.
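As a rough sketch of that reconciliation, the snippet below indexes breach entities by their "Name" and looks up the "Title" for each name in a row. It assumes you've already saved the breach list from the API as JSON; the two entries here are trimmed, hand-written stand-ins with only the "Name" and "Title" attributes, and `titles_for` is a hypothetical helper of my own naming.

```python
import json

# Trimmed stand-in for the breach list returned by the API. Real breach
# entities carry many more attributes (e.g. "Description").
breaches_json = """
[
  {"Name": "Dropbox", "Title": "Dropbox"},
  {"Name": "ModernBusinessSolutions", "Title": "Modern Business Solutions"}
]
"""

# Index by the stable "Name" value, which is what the data file contains.
breaches_by_name = {b["Name"]: b for b in json.loads(breaches_json)}


def titles_for(names):
    """Map breach names from a row to their human-facing titles.

    Names with no matching breach entity come back as None so that
    mismatches are visible rather than silently dropped.
    """
    return [
        breaches_by_name[name]["Title"] if name in breaches_by_name else None
        for name in names
    ]


print(titles_for(["Dropbox", "ModernBusinessSolutions", "NotARealBreach"]))
```

Keying on "Name" rather than "Title" matters because the title is the value that may change over time.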
This should give those who are interested in analysing data breach patterns a heap of info to work with. Ideas that come to mind include the number of breaches accounts appear in, the data attributes exposed about them, their exposure over time and all sorts of other things I haven't even thought of. I'd love to see people turn this into some awesome visualisations; my mate John Bristowe did this neat dashboard in Microsoft's Power BI recently and that's just with data already available via the public API:
If you create visualisations or other insights into the data, do leave a comment below and share what you've done.
Now, how to get the data: in order to save me bandwidth costs and help it easily spread to those who want it, you can download the torrent file or grab the magnet link here:
And for the extra cautious, the SHA1 hash of the zip file is:
I suspect there'll be a large amount of interest in this so I'd like to ask anyone torrenting it to leave it seeding for a while too if they can, especially in the early days while it's distributed around. The goal is to make this data broadly available and enable people to do awesome things with it, so for that I need some community support.
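If you'd like to check a download against the published SHA1 hash, something along these lines would do it. It's a minimal sketch using Python's standard `hashlib`; the function names are mine, and the file path and published hash are values you'd supply yourself.

```python
import hashlib


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA1 hex digest of a file, read in chunks.

    Reading in chunks keeps memory use flat even for large files.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hex):
    """Compare a file's SHA1 with a published hex string, ignoring case."""
    return sha1_of_file(path) == published_hex.strip().lower()
```

Call `matches_published_hash` with the path to the downloaded zip and the hash string as published. A `False` result means the file differs from the one that was hashed.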
If you have questions about the data, please leave them in the comments section below. Keep in mind that there may be - no, will be - discrepancies. If you retrieve the same data breach from elsewhere online and extract the accounts, they probably won't match exactly because the pattern matching for email addresses may differ slightly. If you crunch the numbers from the data I'm providing here, they may not match exactly with what I've represented for each incident on HIBP as I've done multiple loads of some incidents. There'll be rows in the file which appear to have too many pwned sites or an odd collection of them due to fabricated email addresses (that firstname.lastname@example.org person has been really pwned!). In short, expect to see some inconsistencies, but what I can say for sure is that once you "rehydrate" that data and add up the counts on each row, you'll get to precisely 1,431,112,732 records.
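To make the "rehydrate" idea concrete, here's a small sketch that works on rows already parsed into a tuple of breach names plus a count. I'm deliberately not assuming anything about the exact line format of the file, so the parsing is left out, and the sample rows and function names are invented for illustration.

```python
from collections import Counter


def rehydrated_total(rows):
    """Add up the counts on every row, i.e. the number of email addresses."""
    return sum(count for _sites, count in rows)


def breaches_per_account(rows):
    """Histogram: number of breaches -> number of email addresses."""
    histogram = Counter()
    for sites, count in rows:
        histogram[len(sites)] += count
    return histogram


# Made-up rows: (combination of breach names, number of email addresses).
rows = [
    (("LinkedIn",), 5),
    (("Adobe", "LinkedIn"), 3),
    (("Adobe", "Dropbox", "LinkedIn"), 1),
]

print(rehydrated_total(rows))
print(sorted(breaches_per_account(rows).items()))
```

Run over the full file, the total should come to the 1,431,112,732 figure above. The histogram is one simple way of getting at how many breaches accounts tend to appear in.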
One more thing: when you take a look in the zip you'll see a license.txt file. This is exactly the same license as on the API itself, that is, it's a Creative Commons Attribution license. You can use the data to do whatever you'd like (including for commercial purposes), but just be clear about where it came from. This is all part of being transparent and, particularly when it comes to a file full of data sourced from breaches, I want it to be crystal clear what it is and where it's from so that it's not misrepresented.
Please take this data and do awesome things with it. If you find it useful and want to contribute back to the project, check out the donations page. Regardless, do share any insights you've gained in the comments below; I'd love to see what people can do with this!