One of the things I'm finding with running Have I been pwned (HIBP) is that over time, my approach is changing. Nothing dramatic thus far, usually just what I'd call "organic" corrections in direction and usually in response to things I've learned, industry events or changes in the way people are using the service. For example, the Ashley Madison hack led to the concept of a sensitive breach which meant ensuring that data from certain incidents is not publicly searchable. More recently I introduced API rate limiting as I was seeing the service being used in ways that worried me. Times change, things move on.
Recently, I came across a massive spam list with a bunch of personal data and it got me thinking again about whether this has a role to play in HIBP. In fact, the context was that someone sent it to me as a claimed "OneDrive breach", something I was highly suspicious of right from the outset (only last week I was lamenting how many "breaches" are entirely fake). But be that as it may, it actually had a lot of personal data in it and with a little digging around I tracked it down to a commercial spam list. Actually, it was very similar to the Special K spam list I wrote about in March, even the total numbers were close (31M in the first one, 33M in the second). They weren't the same thing (only 5.6M common accounts), but they adhered to the same pattern in terms of a large amount of personal data floating around the web, inevitably unbeknownst to the owners of it.
I thought I'd put some feelers out to see how people would receive this class of data appearing in HIBP, and the response painted a pretty clear picture:
"If I have a MASSIVE spam list full of personal data being sold to spammers, should I load it into @haveibeenpwned?" (Troy Hunt, @troyhunt, November 15, 2016)
The caveat that accompanied many of the written responses to that tweet was that it should be clear when the data comes from an identifiable breach as opposed to when it's sold for spam purposes.
Here's the way I look at it all and I'm going to share both the negative and positive in terms of whether it makes sense to load it into HIBP: The main negative is that if it's loaded, then what? I mean someone finds themselves in a spam list, what can they actually do? They can't change their password like they can in a data breach nor do they really have any viable recourse against the people selling their data. There's really nothing actionable about all of this.
On the other hand, people want to know about their exposure. They want to know where their personal data appears and where it's being redistributed. So many times after loading a data breach I've had people contact me relieved that they've been able to attribute a potential source of abuse. Of course there are no guarantees that what they're seeing in HIBP is genuinely the source of the abuse, but it helps them complete the personal picture of their exposure. It's not just that, but HIBP subscribers expect to be told when I find them floating around the web; there are 10k subscribers in that latest list and 16k in the Special K one and they're going to want to know that they're in there.
But here's the clincher for me:
Spam lists containing personal information are indistinguishable from data breaches.
This is not just a collection of email addresses; we're talking about multiple personal identity attributes which are regularly used for nefarious purposes. For example, here's what's in that Special K spam list I mentioned:
email,ip,url,joindate,fname,lname,address,address2,city,state,zip,phone,mobile,dob,gender
[redacted]@gmail.com,162.158.22.[redacted],instantcheckmate.com,2015-08-06,[redacted],[redacted],,,San Francisco,CA,94107,,,,
[redacted]@gmail.com,70.198.4.[redacted],creditcardguide.com,2015-08-06,,[redacted],,,Mitchell,SD,57301,,,,
[redacted]@gmail.com,166.137.139.[redacted],creditcardguide.com,2015-08-06,[redacted],[redacted],,,,,,,,,
Names, addresses, birth dates, genders, phone numbers - this is precisely the sort of data people are so worried about being misused. The only thing not in there that you'd usually see in breach data is a password. Even then, not all breaches have a password in them anyway. For example, neither the Modern Business Solutions breach nor the Regpacks breach had them, yet both had many of the fields you see above, and this is precisely the sort of data people want to know is circulating about them.
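To make that a bit more concrete, here's a minimal sketch of how you might tally which personal attributes are actually populated in a list like this. It uses the redacted sample rows above verbatim; the function names (`populated_fields`, `exposure_summary`) are made up for illustration and this is not how HIBP itself processes the data.

```python
import csv
import io

# The redacted sample from the post, one record per line.
SAMPLE = """email,ip,url,joindate,fname,lname,address,address2,city,state,zip,phone,mobile,dob,gender
[redacted]@gmail.com,162.158.22.[redacted],instantcheckmate.com,2015-08-06,[redacted],[redacted],,,San Francisco,CA,94107,,,,
[redacted]@gmail.com,70.198.4.[redacted],creditcardguide.com,2015-08-06,,[redacted],,,Mitchell,SD,57301,,,,
[redacted]@gmail.com,166.137.139.[redacted],creditcardguide.com,2015-08-06,[redacted],[redacted],,,,,,,,,
"""


def populated_fields(row):
    """Return the column names that hold a non-empty value in this row."""
    return [
        name
        for name, value in row.items()
        if isinstance(value, str) and value.strip()
    ]


def exposure_summary(csv_text):
    """Count, per column, how many records have that attribute filled in."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    total = 0
    for row in reader:
        total += 1
        for name in populated_fields(row):
            if name in counts:
                counts[name] += 1
    return total, counts


if __name__ == "__main__":
    total, counts = exposure_summary(SAMPLE)
    for name, count in counts.items():
        print(f"{name}: {count}/{total}")
```

Even in three rows you can see the pattern: every record has an email, IP and source URL, most have a name and location, and the sparser fields (phone, date of birth, gender) are empty in this particular sample.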
Including widely circulating spam lists helps HIBP users assess the overall exposure of their personal data. Subscribers expect to be notified when their data is misused in this way.
I've now integrated the concept of spam lists and have done a number of things to clearly identify what they are. Firstly, these incidents are flagged each and every time they're represented in the HIBP interface, both via the title and with an icon:
That's if it makes one of the top 10 incidents presently represented on the home page (which neither of the lists I'm kicking off with does), and it's then explained clearly in the description that pops up when drilling into it:
Same again on the page listing all pwned sites:
There's also now a spam list entry on the FAQs page which explains much of what you've read here, albeit more succinctly. That's linked from anywhere it makes sense to provide a definition, which now also includes the notifications sent to HIBP subscribers:
And finally, the API docs now also include a reference to the attribute:
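If you're consuming the API, the practical upshot is that you can treat spam lists differently from conventional breaches. Here's a rough sketch of that idea. It assumes the attribute is a boolean on each breach record and that it's named `IsSpamList`; this post doesn't spell the name out, so check the API docs before relying on it. The sample JSON is invented for illustration and the code makes no network calls.

```python
import json

# Invented sample data shaped like a list of breach records.
# ASSUMPTION: the spam list attribute is a boolean called "IsSpamList".
SAMPLE_RESPONSE = """[
  {"Name": "ExampleBreach", "Title": "Example Breach", "IsSpamList": false},
  {"Name": "ExampleSpamList", "Title": "Example Spam List", "IsSpamList": true},
  {"Name": "OlderRecord", "Title": "Record Without The Flag"}
]"""


def split_by_spam_flag(breaches, flag="IsSpamList"):
    """Separate records flagged as spam lists from everything else.

    Records that lack the flag entirely are treated as regular breaches.
    """
    spam_lists, regular = [], []
    for breach in breaches:
        if breach.get(flag, False):
            spam_lists.append(breach)
        else:
            regular.append(breach)
    return spam_lists, regular


if __name__ == "__main__":
    spam_lists, regular = split_by_spam_flag(json.loads(SAMPLE_RESPONSE))
    for breach in spam_lists:
        print(f"{breach['Title']} (spam list)")
    for breach in regular:
        print(breach["Title"])
```

Defaulting a missing flag to "not a spam list" is a judgment call in this sketch; a stricter client might prefer to surface records without the flag separately.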
And that's pretty much everywhere you'll see spam lists represented in the system. I've tried to make it as transparent as possible and while I'm sure there'll be further tweaks (the nomenclature in combination with the word "breach" in certain places, for example), this all contributes to the overall objective of helping people understand more about their exposure on the web.
So that's spam lists, and the Special K incident is now live and searchable. Plus, as you'd expect, HIBP subscribers are now receiving notifications about something they almost certainly weren't aware of before. I've also got that second incident I referred to at the start of this post to load today. And I'm sure that just as our data will continue to be unwillingly exposed by breaches, it will continue to appear in spam lists as well, so inevitably we'll see a lot more of this in the future.