Troy Hunt: Introducing "fabricated" data breaches to Have I been pwned

I've written before about how I verify data breaches and discussed it at length in various conference talks. I take verification very seriously because misattribution can have serious consequences for the company involved, those in the alleged breach and indeed, for myself as well. To give you a sense of how much effort can go into verification, last month I wrote about a data breach investigation blow by blow where ultimately, I failed to verify the authenticity of the data. Due to the prevalence of legitimate data in there though, I still loaded it into HIBP and flagged it as "unverified", a concept I introduced in the middle of last year.

The point of unverified data breaches is that they have a lot of accurate personal information in them yet I'm unable to conclusively tie them back to any one particular service that's been breached. I introduced the concept as a way of representing the incident when my confidence dropped below a certain threshold. The more thought I've given this recently, the more I've been thinking about the whole "confidence" side of verification. Consider the following chart:

Verified and unverified breach confidence levels

There are a few important things I'd like to point out here:

Firstly, even for verified breaches, confidence is not always going to be 100%; there's a range at play here. For example, I was very confident when I verified the Dropbox data, in part because they'd acknowledged the incident and in part because both my wife's and my strong, password-manager-generated, totally random passwords were in there as bcrypt hashes. Other times, there's no acknowledgement from the company involved and I'm relying on a combination of the verification techniques I explained in that earlier post and a gut feel of "how likely was this site to be breached" (e.g. it's running unpatched vBulletin).

Secondly, this is subjective; I'm making a judgement decision here. A lot of how this system runs boils down to me making a call on how things should be classified. For example, it's up to me to decide if I flag a breach as "sensitive" therefore making it no longer publicly searchable. I don't always get this right either; yesterday I loaded the HongFire breach which is an anime and manga forum only to later have someone point out that the content on the site often involves some pretty hardcore cartoon porn. I subsequently flagged it as sensitive because one's presence in this incident could cause them pain were it to be publicly learned.

And thirdly, that chart above has a missing segment on the far right. Until now, there's been no construct for me to handle breaches within HIBP once my confidence level dropped beneath what I'd reasonably classify as an unverified breach. This brings me to "fabricated" breaches.

The catalyst for introducing this classification came when someone sent me a breach that had allegedly come from Justdate.com. As the name implies, it's a dating website and the data I was sent contained over 24M rows. My usual first glance verification attempts were inconclusive; on the one hand, the site looks like security isn't exactly a high priority (i.e. missing transport layer security on login and no ability to serve content over HTTPS), yet on the other hand, tried and tested techniques such as account enumeration were coming up negative (password reset and registration said the accounts in the alleged breach don't exist). However, it was broadly acknowledged that the site had been breached:

Justdate.com on Vigilante.pw

This is from Vigilante.pw and I regularly use this site as a reference point for data that's believed to have been breached (it's only a list of alleged incidents, not a redistribution channel). There were many other references I found in various locations as well so clearly there's something to this.

I started contacting HIBP subscribers which is something I regularly do when I'm having trouble verifying a breach. I wanted to ask them questions about the data which looked like this:

user_id	first	last	email	dob	postcode	country	ip_country	alerts_email	alerts_online	unsubscribed	bounce
100	[redacted]	[redacted]	[redacted]	1992-03-03	PO5	GB	GB	1023	15	0	0

There wasn't a lot to go on, primarily just a name, email, DOB, country and postcode. Plus of course their presence on the dating site, something they'd likely recall signing up to. About a dozen people responded and as the replies started coming in, it quickly became clear that this data certainly wasn't going to fit in the left side of the graph above. Nobody - nobody - recalled signing up to Justdate.com. Some people had used other dating sites in the past, but others had never used a site of this nature at all. As more replies arrived, I started questioning whether it could even be classified as "unverified". Many people said the birth date was wrong. And the post code. And the country. Not all, mind you, there were certainly some valid entries there but it was less than 50% of respondents. My confidence level in the legitimacy of the incident fell beneath the threshold of what I felt comfortable loading into the system. This certainly wasn't the first case of this nature either.

That's when I started giving the whole thing more thought. Here we have tens of millions of records floating around the web alleging that people had participated in a dating site. Real people too because regardless of how legit the other data attributes were, the email addresses were accurate and they belonged to actual people, many of whom were relying on HIBP to let them know when their info was found. I started toying with the idea of whether it made sense to define a construct under which the data could be loaded, so I reached out and asked them:

Thanks for confirming, looks like this is very likely a fake in terms of coming from that site, but multiple subscribers have confirmed that parts of the data are legitimate. I may consider a new “fake” category in HIBP – do you think people such as yourself would like to know about the presence of this data with your email address even if it didn’t come from Just Date?

I'm going to list off the pertinent parts of each and every response I got to that email in the order in which they came in. Here's what HIBP subscribers had to say:

I think it would definitely be helpful to know where our data is appearing. I like your idea of considering a "fake" or "spoof" category.

I think yes, just for checking, we are not 100% safe & its better to check it even if its fake breach.

Absolutely it's good to know no matter what do you know the service goes onto the internet and delete all of your information more selective I guess I know that there's ways to delete everything but maybe more selective deletion program to find out where your information is and then ask you if you decide kind of like LifeLock but but more aggressive.

Yes I would appreciate any information that has a berring on my privacy. Thank you and I hope I have been some help to you.

Absolutely 100%. I use different emails for forums/subscriptions as I do for work, and another development only email. But that does not help me if my email makes it into a recruitment/spam DB, or even worse leaked in a data dump. I would be very interested to know whenever my name and email address was displayed in dumps.

Yes, I personally would like to know.

Yes I would certainly be interested in this information. It's nice to know who's got my details, even if they're somewhat incorrect.

These responses are exactly as they came in and they were unanimous - people still want to know their info is circulating even if the breach isn't legit. So I gave it some thought and ultimately added the following piece to the earlier chart:

Fabricated breaches

I ultimately elected to use the word "fabricated" rather than "fake" as I felt the latter implied nothing within there was real. However, as I mentioned earlier, a number of people had accurate dates of birth in the incident. As I also mentioned earlier, all of this is ultimately a judgement decision on my behalf and not an absolute; I have a very low degree of confidence that Justdate.com was breached and I think it's a highly unlikely proposition based on the data. There's a tiny possibility it's real, but they're not good odds.

This concept now manifests itself in several ways within HIBP. Firstly, as with unverified breaches, there's now a visual indicator next to every fabricated breach in the form of a triangle with an exclamation mark:

Justdate.com breach description

Because it's a dating site, the breach is also flagged as "sensitive" which is why the little flame is present.

Secondly, there's an attribute on the breach entity returned by the API called "IsFabricated" so consumers of the service can identify breaches of this nature. It means a breach such as Justdate.com appears like this when described in JSON:

{
	"Title": "Justdate.com",
	"Name": "JustDate",
	"Domain": "justdate.com",
	"BreachDate": "2016-09-29",
	"AddedDate": "2017-02-07T01:28:41Z",
	"PwnCount": 24451312,
	"Description": "An alleged breach of the dating website...",
	"DataClasses": [
		"Dates of birth",
		"Email addresses",
		"Geographic locations",
		"Names"
	],
	"IsVerified": false,
	"IsFabricated": true,
	"IsSensitive": true,
	"IsActive": true,
	"IsRetired": false,
	"IsSpamList": false
}

Note that I've still flagged this incident as "unverified" as well. There are people that have created dependencies on this attribute that deal with an unverified breach differently to one that I'm confident in. In an ideal world, I'd represent the three different states via one attribute (verified, unverified, fabricated), but existing dependencies mean I need to use each of those fields here.
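For consumers who would rather work with a single state, here is a minimal sketch of how the two flags might be collapsed into one value. It assumes the breach entity has already been parsed into a Python dict; the function name `confidence_state` and the handling of missing keys are illustrative choices, not part of the API.

```python
import json

# Trimmed sample of the Justdate.com breach entity shown above.
SAMPLE = json.loads("""
{
  "Name": "JustDate",
  "IsVerified": false,
  "IsFabricated": true,
  "IsSensitive": true
}
""")


def confidence_state(breach: dict) -> str:
    """Collapse IsVerified / IsFabricated into one of three states.

    Hypothetical helper: missing keys are treated as False, which is an
    assumption made for this sketch rather than documented API behaviour.
    """
    # Check IsFabricated first: a fabricated breach also has
    # IsVerified == false, so testing IsVerified alone would report it
    # as merely "unverified".
    if breach.get("IsFabricated", False):
        return "fabricated"
    if breach.get("IsVerified", False):
        return "verified"
    return "unverified"


print(confidence_state(SAMPLE))  # fabricated
```

The order of the checks is the point here: existing code that only inspects "IsVerified" keeps working and sees a fabricated breach as unverified, while newer code can test "IsFabricated" first to tell the two apart.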

Thirdly, as you'll see in the Justdate.com breach image above, I'm being very explicit in the description of the breach as to why it's being classed as fabricated. I want to give people as much information as possible so that they can understand what the data is and why I believe it isn't real. I also want to make sure it's abundantly clear that there's insufficient evidence to suggest that the company a fabricated breach has been attributed to did indeed have a breach at all.

And finally, the notification emails sent to subscribers clearly indicate the fabricated status and of course also include the description of the data explaining why I've flagged it as such:

Notification email

~~(And no, my account isn't in the data! Just testing...)~~ I was so certain there was absolutely no reason my account would be in there that I didn't even check. And then I got a breach notification from HIBP! So now, being a "victim", I'm glad I know and I concur with the other comments from subscribers above.

There's a whole other discussion to be had about what causes a bundle of data to be fabricated and called a breach in the first place. Attempts to monetise the data by selling the alleged breach, extortion of the company involved or just simple big-noting by individuals seeking notoriety are all feasible explanations for many of the fabricated breaches I see. For now, the important thing is that if your data is circulating in one of these dumps, there's now a way to know about it.

The Justdate.com data is now in HIBP. Because it's also a sensitive breach, you can only search through it by using the free notification service. I'll load more existing fabricated breaches as time permits and inevitably, as new ones emerge in the future.
