Troy Hunt: Here's how I verify data breaches

Exclusive: Big data breaches found at major email services - expert

Other headlines went on to suggest that you need to change your password right now if you're using the likes of Hotmail or Gmail, among others. The strong implication across the stories I've read is that these mail providers have been hacked and now there's a mega-list of stolen accounts floating around the webs.

The chances of this data actually coming from these service providers are near zero. I say this firstly because there's a very small chance that providers of this calibre would lose the data, secondly because if they did then we'd be looking at very strong cryptographically hashed passwords which would be near useless (Google isn't leaving them sitting around in plain text or MD5) and thirdly because I see data like this, which can't be accurately attributed back to a source, all the time.

That's all I want to say on that particular headline for now; instead, I'd like to focus on how I verify data breaches and ensure that when reporters cover them, they report accurately and in a way that doesn't perpetuate FUD. Here's how I verify data breaches.

Sources and the importance of verification

I come across breaches via a few different channels. Sometimes it's a data set that's broadly distributed publicly after a major incident such as the Ashley Madison attack, other times people who have the data themselves (often because they're trading it) provide it to me directly and increasingly, it comes via reporters who've been handed the data from those who've hacked it.

I don't trust any of it. Regardless of where it's come from or how confident I "feel" about the integrity of the data, everything gets verified. Here's a perfect example of why: I recently wrote about How your data is collected and commoditised via "free" online services which was about how I'd been handed over 80 million accounts allegedly from a site called Instant Checkmate. I could have easily taken that data, loaded it into Have I been pwned (HIBP), perhaps pinged a few reporters on it then gone on my way. But think about the ramifications of that...

Firstly, Instant Checkmate would have been completely blindsided by the story. Nobody would have reached out to them before the news hit and the first they'd know of them being "hacked" is either the news headlines or HIBP subscribers beating down their door wanting answers. Secondly, it could have had a seriously detrimental effect on their business; what would those headlines do to customer confidence? But thirdly, it would have also made me look foolish as the breach wasn't from Instant Checkmate - bits of it possibly came from there but I couldn't verify that with any confidence so I wasn't going to be making that claim.

This week, as the news I mentioned in the intro was breaking, I spent a great deal of time verifying another two incidents, one fake and one legitimate. Let me talk about how I did that and ultimately reached those conclusions about authenticity.

Breach structure

Let's start with an incident that has been covered in a story just today titled One of the biggest hacks happened last year, but nobody noticed. When Zack (the ZDNet reporter) came to me with the data, it was being represented as coming from Zoosk, an online dating site. We've seen a bunch of relationship-orientated sites hacked recently that I've successfully verified (such as Mate1.com and Beautiful People), so the concept of Zoosk being breached sounded feasible but had to be emphatically verified.

The first thing I did was look at the data which appears like this:
Alleged Zoosk accounts

There were 57,554,881 rows of this structure; an email address and a plain text password delimited by a colon. This was possibly a data breach of Zoosk, but right off the bat, only having email and password makes it very hard to verify. These could be from anywhere which isn't to say that some wouldn't work on Zoosk, but they could be aggregated from various sources and then simply tested against Zoosk.
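As a rough illustration of that structure (a sketch only, not the tooling used for this post), rows of the form `email:password` can be split on the first colon, since a password may itself contain colons. The function name and the sample rows below are made up for the example.

```python
from typing import Iterable, Iterator, Tuple


def parse_credentials(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (email, password) pairs from 'email:password' rows."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        # Split on the FIRST colon only: the password may contain colons.
        email, sep, password = line.partition(":")
        if not sep or "@" not in email:
            continue  # skip malformed rows rather than guess at them
        yield email.strip().lower(), password


# Hypothetical sample rows, not taken from any real data set.
sample = [
    "alice@example.com:hunter2\n",
    "bob@example.org:pa:ss\n",
    "not a credential line\n",
]
records = list(parse_credentials(sample))
print(records)
```

Parsing like this says nothing about where the rows came from, which is exactly the problem with a file that holds only two columns.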

One thing that's enormously important when doing verification is the ability to provide the organisation that's allegedly been hacked with a "proof". Compare that Zoosk data (I'll refer to it as "Zoosk data" even though ultimately I disprove this), to this one:

Fling.com data

This data was allegedly from fling.com (you probably don't want to go there if you're at work...) and it relates to this story that just hit today: Another Day, Another Hack: Passwords and Sexual Desires for Dating Site 'Fling'. Joseph (the reporter on that piece) came to me with the data earlier in the week and as with Zack's 57 million record "Zoosk" breach, I went through the same verification process. But look at how different this data is - it's complete. Not only does this give me a much higher degree of confidence it's legit, it meant that Joseph could send Fling segments of the data which they could independently verify. Zoosk could easily be fabricated, but Fling could look at the info in that file and have absolute certainty that it came from their system. You can't fabricate internal identifiers and time stamps and not be caught out as a fraud when they're compared to an internal system.

Here are the full column headings for Fling:

CREATE TABLE `user` (
  `duid` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `username` varchar(64) NOT NULL,
  `password` varchar(32) NOT NULL,
  `email` varchar(255) NOT NULL,
  `email_validated` enum('N','Y') NOT NULL DEFAULT 'N',
  `accept_email` enum('N','Y') NOT NULL DEFAULT 'Y',
  `accept_im` enum('N','Y') NOT NULL DEFAULT 'Y',
  `md5` varchar(32) NOT NULL,
  `membership` enum('FREE','PROMO','GRANDFATHERED','BRONZE','SILVER','GOLD','ADMIN') NOT NULL DEFAULT 'FREE',
  `join_date` datetime NOT NULL,
  `birth_date` date NOT NULL,
  `location_id` varchar(8) NOT NULL,
  `gender` enum('COUPLE','MAN','WOMAN','TS','UNSPECIFIED') NOT NULL DEFAULT 'UNSPECIFIED',
  `seeking` set('COUPLE','MAN','WOMAN','TS','UNSPECIFIED') NOT NULL DEFAULT 'UNSPECIFIED',
  `interested_in` set('FETISH','GROUPSEX','SEXUAL RELATIONS','ONLINE FLIRTING','OTHER','UNSPECIFIED') NOT NULL DEFAULT 'UNSPECIFIED',
  `last_login` datetime NOT NULL,
  `mobile_user` enum('N','Y') NOT NULL DEFAULT 'N',
  `mobile_phone_no` varchar(16) DEFAULT NULL,
  `mobile_carrier` varchar(20) DEFAULT NULL,
  `discreet_profile` enum('N','Y') NOT NULL DEFAULT 'N',
  `featured_profile` enum('N','Y') NOT NULL DEFAULT 'N',
  `power_user` enum('N','Y') NOT NULL DEFAULT 'N',
  `account_status` enum('ACTIVE','USER_DISABLED','ADMIN_DISABLED','SCAMMER_DISABLED') NOT NULL DEFAULT 'ACTIVE',
  `advert_id` varchar(25) DEFAULT NULL,
  `ip_address` varchar(16) NOT NULL,
  `mtime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`duid`),
  UNIQUE KEY `username` (`username`),
  UNIQUE KEY `email` (`email`),
  KEY `location_id` (`location_id`),
  KEY `md5` (`md5`),
  KEY `join_date` (`join_date`),
  KEY `ip_address` (`ip_address`),
  KEY `password` (`password`),
  CONSTRAINT `user_ibfk_1` FOREIGN KEY (`location_id`) REFERENCES `geo_location` (`location_id`) ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=64192949 DEFAULT CHARSET=utf8;

The other thing in terms of structure is that the Fling data begins with this:

-- MySQL dump 10.11
--
-- Host: 192.168.1.28    Database: fling
-- ------------------------------------------------------
-- Server version	5.1.41-enterprise-gpl-advanced-log

It's a mysqldump of the data with enough version and host info to, again, create a much higher degree of confidence in the data, not just for me in terms of how it "feels", but for Fling themselves to be able to verify.
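To show how those header lines could be turned into facts a site owner can check against their own environment, here is a minimal sketch that pulls the dump version, host, database name and server version out of the header quoted above. The regular expressions are assumptions based only on that excerpt and the function name is hypothetical; other mysqldump versions may format the header differently.

```python
import re

# The header exactly as quoted above (the server version line uses a tab).
HEADER = """-- MySQL dump 10.11
--
-- Host: 192.168.1.28    Database: fling
-- ------------------------------------------------------
-- Server version\t5.1.41-enterprise-gpl-advanced-log
"""


def parse_dump_header(text: str) -> dict:
    """Extract the identifying fields from a mysqldump-style header."""
    patterns = {
        "dump_version": r"^-- MySQL dump (\S+)",
        "host": r"^-- Host: (\S+)",
        "database": r"Database: (\S+)",
        "server_version": r"^-- Server version\s+(\S+)",
    }
    info = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match:
            info[key] = match.group(1)
    return info


info = parse_dump_header(HEADER)
print(info)
```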

I'm very suspicious of data presented in the way the Zoosk breach was and compared to Fling, you can see how both would impact my confidence levels in different ways. Let's move on though and increase that confidence level a bit.

Enumeration

Most websites will tell you if an email address exists on the site; you just need to ask. For example, enter an email address into Adult Friend Finder's password reset feature and they'll tell you very clearly if it's already in their database or not. It's not always that explicit; Ashley Madison used to disclose account existence by returning slightly different responses. If a site isn't facilitating enumeration on the password reset, then it frequently is on the registration feature ("this email address is already registered") and it's rare not to be able to simply plug in an email address and be told via one channel or another if it already exists on the site.

Enumeration risks such as these are not "silent" in that something like a password reset will send an email to the recipient. Whilst it's by no means compromising their personal security in any way, I also don't particularly want to inconvenience people. But there's a way around that and it provides another upside too.

Mailinator accounts in data breaches

If you haven't used Mailinator before, you're missing out. It's an awesome way of standing up free, disposable email addresses and you can simply send a mail to [anything]@mailinator.com then check it on their site. There's also zero security and consequently, zero privacy. People often use Mailinator accounts simply as a means of passing the "please verify your email address" test that many sites pose before you can access them.

Mailinator accounts are perfect for testing enumeration risks. For example, the email address bigbob******@mailinator.com is the first one in Fling and if you plug that into their password reset form, you get this:

Fling confirming an email exists

Curiously, Fling returns exactly the same message when the email is entirely fabricated; fat-finger the keyboard and you'll get the same response. In that regard, password reset may not be an enumeration vector on Fling but it doesn't matter because when testing a Mailinator account, the reset email is publicly accessible anyway:

Fling password reset sent to a Mailinator account

It turns out that Big Bob also has a password of commensurate security to his choice of mail provider, and this gives us another verification data point:

Mailinator password match

Of course you can only do this with a breach where the site actually emails the password which (fortunately) isn't that common, but you can see how each of these processes starts to build confidence in the authenticity of the breach. That can be confidence that it is genuine as well as confidence that it isn't.

The Zoosk data had way too many accounts that weren't checking out. Some Mailinator accounts would cause their password reset to respond confirming an email had been sent but many others didn't. It's possible that accounts had been deleted from their end post-breach (sometimes this is just a "soft" delete - the record is still there but flagged as inactive), but the low hit-rate wasn't inspiring much confidence.
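A minimal sketch of the bookkeeping behind that hit-rate observation: pick the Mailinator addresses out of the data, then record for each one whether a reset email showed up in the public inbox. The checks themselves are manual; the function names, addresses and outcomes below are all invented for illustration.

```python
def mailinator_addresses(emails):
    """Return only the addresses hosted at mailinator.com."""
    return [e.strip().lower() for e in emails
            if e.strip().lower().endswith("@mailinator.com")]


def hit_rate(outcomes):
    """Share of tested addresses for which a reset email was observed.

    `outcomes` maps address -> True/False as recorded by hand.
    """
    if not outcomes:
        return 0.0
    return sum(outcomes.values()) / len(outcomes)


# Hypothetical addresses and hand-recorded results.
emails = ["one@mailinator.com", "two@example.com",
          "Three@Mailinator.com", "four@mailinator.com"]
candidates = mailinator_addresses(emails)
outcomes = {"one@mailinator.com": True,
            "three@mailinator.com": False,
            "four@mailinator.com": False}
print(candidates, f"{hit_rate(outcomes):.0%}")
```

A low rate on its own proves nothing (accounts get deleted), but it is one more data point to weigh.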

But there's another avenue I have available that's proven very reliable, and that's HIBP subscribers.

Verifying with HIBP subscribers

I'm now approaching 400k verified subscribers to HIBP; that is, they've gone to the free notification service page, entered their email address then received an email at that address and clicked on a verification link. These are people who have an interest in protecting their online identities and they want to know about it when an incident occurs that impacts them.

What I've been doing with breaches that are harder to verify or that I want to have a greater degree of confidence in, is temporarily loading the email addresses into the SQL database in HIBP which stores the notification users (this doesn't contain the accounts the service allows you to search, those are stored in Azure Table Storage), then running a query that gives me results like this:

Recent HIBP subscribers in the Zoosk data

These are the most recently verified HIBP subscribers who appear in the Zoosk data or in other words, those who have a recent recollection of signing up to the service I run. I'll take 30 of those and send them an email such as this one:

Hi, I’m emailing you as someone who has recently subscribed to the service I run, "Have I been pwned?"

I’m after your support in helping to verify whether a data breach I’ve been handed is legitimate or not. It’s one that I need to be absolutely confident it’s not a fake before I load the data and people such as yourself receive notifications. This particular one is quite personal hence the extra due diligence.

If you’re willing to assist, I’ll send you further information on the incident and include a small snippet of your (allegedly) breached record, enough for you to verify if it’s accurate. Is this something you’re willing to help with?

I send this off with everyone BCC'd so inevitably a bunch of them go to spam whilst others are ignored or simply not seen for quite a while hence why I email 30 people at a time. People who *do* respond are always willing to help so I send them back some segments of the data to verify, for example:

This relates to the website fling.com which an attacker has allegedly breached. Your email address is in there with the following attributes:

1. A password that begins with “[redacted]”
2. An IP address that belongs to [redacted] and places you in [redacted]
3. A join date in [month] [year]

Does this data seem legitimate? Other indicators suggest it’s highly likely to be accurate and your confirmation would be enormously helpful.

I sent this exact message back to a number of HIBP subscribers in the Fling data set and all of them confirmed the data with responses such as this:

That is indeed accurate. Lovely plaintext password storage I see.

There's a risk that people merely respond in the affirmative to my questions regardless of whether the data is accurate or not. However firstly, I've already found them in the breach and reached out to them - it's already likely they're a member. Secondly, I rely on multiple positive responses from subscribers so we're now talking about people lying en masse which is much less likely than just one person with a confirmation bias. Finally, if I really feel even greater confidence is required, sometimes I'll ask them for a piece of data to confirm the breach, for example "what month were you born in".
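The subscriber lookup described in this section can be sketched as a join between a table of verified subscribers and a temporary table of breached email addresses, newest verifications first. The schema, table names and data below are entirely hypothetical, and SQLite stands in for the real database purely so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema; the real HIBP tables are not described in the post.
    CREATE TABLE subscribers (email TEXT PRIMARY KEY, verified_at TEXT);
    CREATE TABLE breach_emails (email TEXT PRIMARY KEY);
""")
conn.executemany(
    "INSERT INTO subscribers VALUES (?, ?)",
    [("a@example.com", "2016-05-01"),
     ("b@example.com", "2016-04-20"),
     ("c@example.com", None),          # signed up but never verified
     ("d@example.com", "2016-05-03")],
)
conn.executemany(
    "INSERT INTO breach_emails VALUES (?)",
    [("a@example.com",), ("c@example.com",), ("d@example.com",), ("z@example.com",)],
)


def recent_subscribers_in_breach(conn, limit=30):
    """Most recently verified subscribers whose address is in the breach."""
    return conn.execute(
        """
        SELECT s.email, s.verified_at
        FROM subscribers AS s
        JOIN breach_emails AS b ON b.email = s.email
        WHERE s.verified_at IS NOT NULL
        ORDER BY s.verified_at DESC
        LIMIT ?
        """,
        (limit,),
    ).fetchall()


rows = recent_subscribers_in_breach(conn)
print(rows)
```

Ordering by verification date matters because the people most likely to respond are those who remember signing up recently.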

The Fling data was emphatically confirmed. The Zoosk data was not, although some people gave responses indicating they'd previously signed up. Part of the problem with verifying Zoosk though is that there's just an email address and a password, both of which could conceivably have come from anywhere. Those who denied membership also denied they'd ever used the password which appeared next to their email address in the data that was provided to me so the whole thing was looking shakier and shakier.

Zoosk wasn't looking legit, but I wanted to try and get to the bottom of it which called for more analysis. Here's what I did next.

Other verification patterns

In a case like Zoosk where I just can't explain the data, I'll often load the data into a local instance of SQL Server and do further analysis (I don't do this in Azure as I don't want to put other people's credentials up there in the cloud). For example, I'm interested in the distribution of email addresses across domains:

Zoosk email addresses by domain

See anything odd? Is Hotmail having a resurgence, perhaps? This is not an organic distribution of email service providers because Gmail should be way out in front, not at 50% of Hotmail. It's more significant than that too because rows 4, 5 and 10 are also Hotmail so we're talking 24 million accounts. It just doesn't smell right.
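The same kind of breakdown can be sketched outside a database. This is an illustration with invented addresses rather than the query actually run: it counts addresses per domain, then adds up every regional Hotmail domain to show how the "rows 4, 5 and 10" style of grouping works.

```python
from collections import Counter


def domain_counts(emails):
    """Count email addresses per domain, ignoring rows without an '@'."""
    counts = Counter()
    for email in emails:
        _, sep, domain = email.strip().lower().rpartition("@")
        if sep and domain:
            counts[domain] += 1
    return counts


# Invented sample; a real run would stream tens of millions of rows.
emails = ["a@hotmail.com", "b@hotmail.com", "c@gmail.com", "d@Hotmail.fr",
          "e@hotmail.com", "f@gmail.com", "no-at-sign"]
counts = domain_counts(emails)
top = counts.most_common(3)

# Regional variants (hotmail.fr and so on) belong to the same provider.
hotmail_family = sum(n for d, n in counts.items() if d.startswith("hotmail."))
print(top, hotmail_family)
```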

Then again, what does smell right is the distribution of email accounts by TLD:

Email accounts by TLD

I was interested in whether there was an unexpected bias towards any one particular TLD, for example we'll often see a heap of .ru accounts. This would tell me something about the origin of the data but in this case, the spread was the kind of thing I'd expect of an international dating service.
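For the TLD view, expressing each TLD as a share of the total makes a regional skew easier to spot than raw counts. A small sketch with invented addresses follows; the function name is hypothetical.

```python
from collections import Counter


def tld_shares(emails):
    """Return {tld: fraction of valid addresses}, largest share first."""
    counts = Counter()
    for email in emails:
        _, sep, domain = email.strip().lower().rpartition("@")
        if not sep or "." not in domain:
            continue  # not a usable address
        counts[domain.rsplit(".", 1)[1]] += 1
    total = sum(counts.values())
    if not total:
        return {}
    return {tld: n / total for tld, n in counts.most_common()}


# Invented sample addresses.
shares = tld_shares(["a@gmail.com", "b@hotmail.com", "c@mail.ru",
                     "d@yahoo.co.uk", "bad"])
print(shares)
```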

Another way I sliced the data is by password which was feasible due to the plain text nature of them (although it could also be done with salt-less hashes). Here's what I found:

Password prevalence

With passwords, I'm interested in whether there's either an obvious bias in the most common ones or a pattern that reinforces that they were indeed taken from the site in question. The most obvious anomaly in the passwords above is that first result; 1.7M passwords that are simply the escape character for a new line. Clearly this doesn't represent the source password so we have to consider other options. One is that those 1.7M passwords were uncrackable; the individual that provided the data to Zack indicated that storage was originally MD5 and that he'd cracked a bunch of the passwords. However, this would represent a 97% success rate when considering there were 57M accounts and whilst not impossible, that feels way too high for a casual hacker, even with MD5. The passwords which do appear in the clear are all pretty simple which you'd expect, but there's simply not enough diversity to represent a natural spread of passwords. That's a very "gut feel" observation, but with other oddities in the data set as well it seems feasible.
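The 97% figure follows directly from the two numbers in the post: roughly 1.7M placeholder rows out of 57,554,881. The sketch below works that out and shows the kind of frequency ranking involved; the sample passwords are invented and only the two totals come from the text.

```python
from collections import Counter

TOTAL_ROWS = 57_554_881        # row count stated earlier in the post
PLACEHOLDER_ROWS = 1_700_000   # approx. rows holding only a newline escape

# If the placeholders are the uncracked hashes, the rest were cracked.
cracked_share = 1 - PLACEHOLDER_ROWS / TOTAL_ROWS
print(f"implied crack rate: {cracked_share:.1%}")

# Ranking passwords by prevalence (invented sample values).
passwords = ["123456", "123456", "zoosk", "\\n", "\\n", "\\n", "badoo"]
ranking = Counter(passwords).most_common(2)
print(ranking)
```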

But then we have indicators that reinforce the premise that the data came from Zoosk, just look at the 11th most popular one - "zoosk". As much as that reinforces the Zoosk angle though, the 17th most popular password implicates an entirely different site - Badoo.

Badoo is another dating site so we're in the same realm of relationship sites getting hacked again. Not only does Badoo feature in the passwords, but there are 88k email addresses with the word "badoo" in them. That compares to only 6.4k email addresses with Zoosk in them.

While we're talking about passwords, there are 93k of them matching a pattern similar to this: "$HEX[73c5826f6e65637a6e696b69]". That's a small portion of the 57M of them, but it's yet another anomaly which decreases my confidence in the data breach being what it was represented as - a straight out exploit of Zoosk.
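Values in that form are hex-encoded bytes wrapped in a `$HEX[...]` marker, a notation that password cracking tools such as hashcat use when a recovered password contains characters outside plain ASCII. A minimal sketch for decoding them, assuming the bytes are UTF-8 (an assumption; other encodings are possible):

```python
import re

HEX_PATTERN = re.compile(r"^\$HEX\[([0-9a-fA-F]*)\]$")


def decode_hex_password(value: str, encoding: str = "utf-8") -> str:
    """Decode a '$HEX[...]' wrapped value; return other values unchanged."""
    match = HEX_PATTERN.match(value)
    if not match or len(match.group(1)) % 2:
        return value  # not wrapped, or not a whole number of bytes
    raw = bytes.fromhex(match.group(1))
    # Assumes UTF-8; undecodable bytes are replaced rather than raising.
    return raw.decode(encoding, errors="replace")


decoded = decode_hex_password("$HEX[73c5826f6e65637a6e696b69]")
print(decoded)  # -> słoneczniki
```

The example from the post decodes to a word containing a non-ASCII letter, which is consistent with the wrapper being applied by a tool rather than typed by a user.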

Another really important step though is actually confirming a breach with the owner of the site that allegedly lost it. Let's delve into that.

Verifying with the site owner

Not only is the site owner in the best position to tell whether the breach is legit or not, it's also just simply the right thing to do. They deserve an early heads up if their asset has been accused of being hacked. However, this is by no means a foolproof way of getting to the bottom of the incident in terms of verification.

A perfect example of this is the Philippines Election Committee breach I wrote about last month. Even whilst acknowledging that their site had indeed been hacked (it's hard to deny this once you've had your site defaced!), they still refused to confirm or deny the legitimacy of the data floating around the web even weeks after the event. This is not a hard job - it literally would have taken them hours at most to confirm that indeed, the data had come from their system.

One thing I'll often do for verification with the site owner is use journalists. Often this is because data breaches come via them in the first place, other times I'll reach out to them for support when data comes directly to me. The reason for this is that they're very well-practiced at getting responses from organisations. It can be notoriously hard to ethically report security incidents but when it's a journalist from a major international publication calling, organisations tend to sit up and listen. There are a small handful of journalists I often work with because I trust them to report ethically and honestly and that includes both Zack and Joseph who I mentioned earlier.

Both the breaches I've referred to throughout this post came in via journalists in the first place so they were already well-placed to contact the respective sites. In the case of Zoosk, they inspected the data and concluded what I had - it was unlikely to be a breach of their system:

None of the full user records in the sample data set was a direct match to a Zoosk user

They also pointed out odd idiosyncrasies with the data that suggested a potential link to Badoo and that led Zack to contact them too. Per his ZDNet article, there might be something to it but certainly it was no smoking gun and ultimately both Zoosk and Badoo helped us confirm what we'd already suspected: the "breach" might have some unexplained patterns in it but it definitely wasn't an outright compromise of either site.

The Fling breach was different and Joseph got a very clear answer very quickly:

The person who the Fling.com domain is registered to confirmed the legitimacy of the sample data.

Well that was simple. It also confirmed what I was already quite confident of, but I want to stress how verification involved looking at the data in a number of different ways to ensure we were really confident that this was actually what it appeared to be before it made news headlines.

Testing credentials is not cool

Many people have asked me "why don't you just try to login with the credentials in the breach" and obviously this would be an easy test. But it would also be an invasion of privacy and depending on how you look at it, potentially a violation of laws such as the US Computer Fraud and Abuse Act (CFAA). In fact it would clearly constitute "having knowingly accessed a computer without authorization or exceeding authorized access" and whilst I can't see myself going to jail for doing this with a couple of accounts, it wouldn't stand me in good light if I ever needed to explain myself.

Look, it'd be easy to fire up Tor and plug in a username and password for say, Fling, but that's stepping over an ethical boundary I just don't want to cross. Not only that, but I don't need to cross it; the verification channels I've already outlined are more than enough to be confident in the authenticity of the breach and logging into someone else's porn account is entirely unnecessary.

Summary

Before I'd even managed to finish writing this blog post, the excitement about the "breach" I mentioned in the opening had begun to come back down to earth. So far down to earth in fact that we're potentially looking at only about one in every five and a half thousand accounts actually working on the site they allegedly belonged to:

Mail.Ru analyzed 57 mil of the 272 mil credentials found this week in alleged breach: 99.982% of those are "invalid" https://t.co/fOrfJoZb12
— Lorenzo Franceschi-B (@lorenzoFB) May 6, 2016
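The "one in every five and a half thousand" figure can be checked from the numbers in that tweet. A short worked calculation, using exact fractions to avoid rounding noise (the variable names are just for the example):

```python
from fractions import Fraction

invalid_pct = Fraction("99.982")          # share reported as invalid
valid_share = (100 - invalid_pct) / 100   # 0.018% expressed as a fraction
one_in = 1 / valid_share                  # accounts per working credential

analysed = 57_000_000                     # credentials Mail.Ru analysed
estimated_valid = analysed * valid_share  # implied number that worked

print(f"valid share: {float(valid_share):.3%}")
print(f"about one in {float(one_in):,.0f}")
print(f"roughly {int(estimated_valid):,} of {analysed:,}")
```

That works out to about one working credential in every 5,556, or roughly 10,260 out of the 57 million analysed.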

That's not just a fabricated breach, it's a very poor one at that, as simply taking credentials from another breach and testing them against the victims' mail providers would yield a significantly higher success rate (more than 0.02% of people reuse their passwords). Not only was the press starting to question how legitimate the data actually was, they were getting statements from those implicated as having lost it in the first place. In fact, Mail.ru was pretty clear about how legitimate the data was:

none of the email and password combinations work

Breach verification can be laborious, time consuming work that frequently results in the incident not being newsworthy or HIBP-worthy but it's important work that should - no "must" - be done before there are news headlines making bold statements. Often these statements turn out to not only be false, but unnecessarily alarming and sometimes damaging to the organisation involved. Breach verification is important.
