Troy Hunt: Introducing paste searches and monitoring for “Have I been pwned?”

I’ve got 174,451,409 breached accounts in Have I been pwned? (HIBP) as of today which probably sounds like a lot, but it’s not. Why is it not a lot? Because whilst that list spans a lot of the big breaches I could get my hands on, as of the middle of this year (now a couple of months ago already), there were over half a billion accounts breached in just six months. That’s just nuts and as that article explains, it’s set us on a track that will make 2014 the most hacked year to date by a fairly significant margin over last year which was the previous most hacky year.

Every time a major breach occurs (and frequently after smaller ones too), I go through the process of seeking any publicly dumped data (which frequently can’t be found as it remains in the attackers’ hands), then verifying the legitimacy of the breach and, when it checks out, publishing the data to HIBP. It can be time consuming and labour intensive if I want to avoid any false positives. Combine that with the fact that only a small portion of breaches ever see the light of day and you realise that despite my best efforts, I’m really only scratching the surface of pwned data.

But along with those massive, sporadic public dumps, there’s another channel by which we very frequently see breached accounts appear and that’s “pastes”. As it turns out, there are literally tens of thousands of email addresses appearing in pastes every day unbeknownst to their rightful owners. Also unbeknownst to them is that alongside their email address is often their password and other personal data, all on public display for anyone who cares to look.

But I’m getting ahead of myself here, let me explain what a paste is and the relevance to HIBP.

What’s a paste and what does it have to do with pwned accounts?

There exists a concept colloquially known as “pastes”. A paste is nothing more than text quite literally pasted onto a website whereupon it receives its own unique URL so that it can then be shared with others who may want to view the paste. The contents of a paste could be anything – a recipe, a block of code or of particular interest here, a dump of breached accounts. The latter often turns up on paste websites for a few key reasons:

Creating a paste is a very low-friction process, it literally involves copying text onto the clipboard then pasting it onto a paste site
It’s an extremely easy means of distribution as the attacker simply shares the resultant unique link to the paste
It can be totally anonymous as there’s not usually a requirement to create a login or identify yourself

What it means is that you see a lot of this sort of thing:

This is (allegedly) a dump of the database behind rabbitears.info. This is a pretty typical pattern – an introduction followed by a dump of the database, in this case including the user ID and username, a hashed password and an email address. In this case, the dump is on Pastebin and whilst this site has a huge stake of the paste market, there are other smaller players like Pastie and Slexy which also often feature similar data.

So wait – Pastebin and co are a huge dumping ground of illegally obtained data? Yes, despite their very explicit acceptable use policy:

Please do NOT post:
- email lists
- login details
- stolen source code
- password lists
- personal information / data

Good intent (or legal arse-covering, depending on how you look at it), but the fact remains that Pastebin is a veritable treasure trove of breached user credentials. In fact it’s very frequently used as the first place a new breach is announced and the likes of “Anonymous” and their hacktivist peers have always been very fond of this style of release.

Getting to the point, if you could monitor Pastebin and identify these dumps then make them searchable in the same way breaches in HIBP have always been searchable, that would be a rather neat little tool. However, with tens of millions of pastes and thousands of new ones every day, this is no small task. A little creativity was needed.

The role of Dump Monitor

There exists a Twitter account named @dumpmon and it looks like this:

Recent tweets from @DumpMon

Dump Monitor monitors Pastebin, Pastie and Slexy for the occurrences of several different likely data breach patterns including email addresses, hashes and API keys. It was created by a very clever bloke by the name of Jordan Wright and I’ve been working with him to make the data available to HIBP. I’ll write a follow-up post to this with a lot of technical details later on but in simple terms, HIBP has a service which monitors @dumpmon tweets, looks for ones which announce the presence of emails then goes off to Pastebin (or other) and retrieves the data. Any identified emails go into HIBP and are immediately searchable.
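To make the "looks for ones which announce the presence of emails" step a little more concrete, here’s a minimal Python sketch. It is not the actual HIBP implementation (that’s for the follow-up post); the tweet layout of "paste URL, then Emails: N" is an assumption made purely for illustration, as is the function name `parse_dump_tweet`.

```python
import re
from typing import Optional, Tuple

# Hypothetical tweet layout: "<paste URL> Emails: <count> ...".
# This is an assumed format for illustration, not a documented one.
TWEET_PATTERN = re.compile(
    r"(?P<url>https?://\S+)\s.*?\bEmails:\s*(?P<count>\d+)",
    re.IGNORECASE,
)


def parse_dump_tweet(text: str) -> Optional[Tuple[str, int]]:
    """Return (paste_url, email_count) if the tweet reports emails, else None."""
    match = TWEET_PATTERN.search(text)
    if match is None:
        return None  # no URL or no "Emails:" marker in the tweet
    count = int(match.group("count"))
    if count == 0:
        return None  # the tweet mentions emails but reports none found
    return match.group("url"), count


if __name__ == "__main__":
    print(parse_dump_tweet("http://pastebin.com/abc123 Emails: 45 #infoleak"))
    print(parse_dump_tweet("http://pastebin.com/abc123 Hashes: 30"))
```

Only tweets that come back with a URL and a non-zero count would be worth the trip out to the paste site.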

One thing worth pointing out now is that this is not an exact science. Jordan has some algorithms to identify likely data breaches as opposed to innocuous lists of email and he’s done a great job of sorting the wheat from the chaff, but inevitably there are going to be false positives. It’s certainly a small portion of the pastes his service tweets (at least that’s my anecdotal observation after looking at heaps of them over the last few months), but it’s the sort of thing where if your email address was in a paste, you’d still want to take a look at it and decide for yourself whether it actually presented any risk to you or not.

So that’s the background on pastes, why they matter and how HIBP gets them, let me show you how the search now looks.

Searching for pastes in HIBP

Here’s how the new search looks:

Searching for pastes and breaches

Uh, isn’t this almost the same as the old search? Yes, precisely, it’s exactly the same user experience to begin the search and that was a very conscious decision. I wanted to keep the process absolutely seamless. Whilst it was suggested by some, I didn’t feel it made sense to give people two different searches, I’d rather a unified user experience then whatever information is relevant to them will appear in the search results. It means that a search result now looks like this:

Searching for foo@bar.com with 3 breaches and 4 pastes found

One change I’ve made here is to add the breach description you see above. I did extensive beta testing on the new paste feature (more on that later), and one of the clear messages I got was to make the information on the site more easily consumable by mere mortals (we sometimes call them “consumers”). The breach description is one new addition, the “Compromised data” at the bottom of each breach is another. I felt it was important to give them more context on what data they might have at risk from the breach and this was the right place to surface it.

Moving on to the pastes, this now appears under the list of breaches, assuming any were found, of course:

A list of pastes for foo@bar.com with title, date and email count

I’m providing as much information as I can here to try and help people determine the risk. Often the paste title will make the context pretty clear, but equally often there’s no title at all. Either way, I link through to the paste so that people can review the data and decide for themselves. Obviously the date of the paste is useful for time context and the number of emails found in the paste gives a sense of whether an address was very targeted or just caught up in a larger dump.

It’s another thing I’ll write more about later on, but the paste search actually happens asynchronously with the existing breached sites search so the response is just as fast as ever (both searches take a comparable amount of time). Of course it also means more requests going to the server(s) but as I wrote about in Scaling a standard Azure website to 380k queries per minute of 163M records with loader.io, I’ve got a heap of scale left to go!
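The general idea of running the two searches side by side can be sketched in a few lines of Python. This is only an illustration of the concurrency pattern, not how the site itself is built: the lookup functions, the stand-in data and the simulated delay are all hypothetical.

```python
import asyncio
from typing import Dict, List

# Stand-in data for illustration only; the real site queries its own storage.
FAKE_BREACHES = {"foo@bar.com": ["ExampleBreach"]}
FAKE_PASTES = {"foo@bar.com": ["example-paste-id"]}


async def search_breaches(email: str) -> List[str]:
    await asyncio.sleep(0.01)  # simulate the latency of a breach lookup
    return FAKE_BREACHES.get(email, [])


async def search_pastes(email: str) -> List[str]:
    await asyncio.sleep(0.01)  # simulate the latency of a paste lookup
    return FAKE_PASTES.get(email, [])


async def search(email: str) -> Dict[str, List[str]]:
    # Both lookups start together, so the total wait is roughly the slower
    # of the two rather than the sum of both.
    breaches, pastes = await asyncio.gather(
        search_breaches(email),
        search_pastes(email),
    )
    return {"breaches": breaches, "pastes": pastes}


if __name__ == "__main__":
    print(asyncio.run(search("foo@bar.com")))
```

Because neither lookup waits on the other, adding the paste search doesn’t add to the response time so long as it’s no slower than the breach search.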

The searches from the front page are great and I’m enormously happy with how they’ve turned out, but there’s much, much more to it than that and I’ll start out by taking you through the notification service.

Notifications

Shortly after HIBP launched, I added a notification service so that you could subscribe (for free, I might add!) and be told if your email address showed up in a subsequent breach loaded into the system. Since then there’s been many new breaches and a heap of notifications so clearly it’s (somewhat unfortunately) proving useful to people.

The paste service extends the notification feature so that subscribers get told about their appearance on Pastebin as soon as it happens. I’ll talk more about how fast everything happens shortly, but on average we’re looking at minutes and sometimes even seconds after a paste appears that a subscriber has an email in their inbox telling them about it. That’s enormously useful as it gives them the earliest possible warning of the incident so that they can go and change passwords or cancel credit cards or take whatever other action is required based on the nature of the data in the paste.

The notification looks like this:

A sample notification of an account appearing on Pastebin

Pretty self-explanatory and it’s now in effect for all existing subscribers of the service. Like everything else in HIBP, it’s entirely free and you can subscribe now from the “Notify me” link in the navigation:

The "Notify me" link in the navigation

At the time of writing, there are around 80k verified subscribers in the system and another 25k that never responded to the verification email that gets sent after signup (spam filters, someone else signing them up etc.) All of these subscribers now have paste monitoring by default.

That’s notifications, but I’ve extended other existing services as well. Let’s look at domain searches.

Domain searches

I introduced domain searches back in Jan and you can read that post if you want all the details but in a nutshell, domain searches are a means of a verified domain owner searching for *@mydomain.com. A lot of organisations are now using this feature to identify breaches that impact their staff and may lead to other events such as attacks that exploit credential reuse or spear phishing attacks. Domain searches also offer the ability to get notifications so those responsible for domains can keep abreast of future incidents that might impact their organisation.
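Conceptually, a *@mydomain.com search is just a filter on the part of each address after the "@". Here’s a small Python sketch of that idea; the function name is hypothetical and this says nothing about how HIBP actually indexes or queries its data.

```python
from typing import Iterable, List


def emails_on_domain(emails: Iterable[str], domain: str) -> List[str]:
    """Return the addresses whose domain part equals `domain` (case-insensitive)."""
    wanted = domain.lower()
    matches = []
    for email in emails:
        # Split on the last "@" so the host is everything after it.
        local, sep, host = email.rpartition("@")
        if sep and local and host.lower() == wanted:
            matches.append(email)
    return matches


if __name__ == "__main__":
    sample = ["a@mydomain.com", "b@other.com", "C@MyDomain.com", "x@sub.mydomain.com"]
    print(emails_on_domain(sample, "mydomain.com"))
```

Note that this sketch treats subdomains as different domains, which is an assumption on my part for the example.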

As of now, pastes are fully integrated into domain searches and domain notifications. It means that when you search across a domain, you get results like this:

A list of pastes found in a domain search

This is a real live search against a domain that has a large number of emails that have appeared in breaches as well as 48 instances of emails in pastes.

As with the notifications discussed earlier, those who have subscribed to domain notifications now also get an email when an email address on a domain they’re monitoring appears in a paste. It looks like this:

A sample domain notification email

As with notifications sent to individual subscribers, these happen almost instantaneously after a paste is retrieved so it’s something that those responsible for the domain can action very quickly if required.

The domain search feature has been a real hit with everyone from individuals who run multiple addresses of their own personal domain to large institutions monitoring the accounts of hundreds of thousands of colleagues. On the latter, I’ve worked directly with a number of financial institutions who are monitoring a large number of domains across their portfolio (there are often various products and services email addresses get attached to) so if you need help setting up something similar where the existing constructs make verification laborious or impossible, email me and I’ll help get you set up behind the scenes.

The API

I’ve had a public API available from the very early days, originally launching in December last year. I revised it and went to version 2 in Feb which introduced a bunch of new ways of querying the data and getting more info out of the system. I’ll continue to refine that version and certainly there are very specific things I already have in mind to make it significantly faster for certain use cases. I often see API activity spike over the period of an hour or two so clearly people are getting use out of it, I assume to check specific lists of data.

For now though, paste searches have been fully integrated into the API and documented accordingly. It means you can make a request like this:

GET https://haveibeenpwned.com/api/v2/pasteaccount/foo%40bar.com

And get a response like this:

[
  {
    "Source": "Pastebin",
    "Id": "wXb5W8GV",
    "Title": "#freenode-log",
    "Date": "2014-07-06T19:07Z",
    "EmailCount": 187
  },
  {
    "Source": "Pastebin",
    "Id": "EE8GM0ed",
    "Title": null,
    "Date": "2014-03-26T17:03Z",
    "EmailCount": 80
  },
  {
    "Source": "Pastebin",
    "Id": "8Q0BvKD8",
    "Title": "syslog",
    "Date": "2014-03-04T19:03Z",
    "EmailCount": 139
  },
  {
    "Source": "Pastebin",
    "Id": "C4GdBDnP",
    "Title": "#secuinside13 logs",
    "Date": "2013-05-26T22:05Z",
    "EmailCount": 255
  }
]

The source is one of “Pastebin” (which it almost always is), “Slexy” or “Pastie” and the ID is theirs, so that first result means the paste is over at http://pastebin.com/wXb5W8GV

As with the existing APIs, there’s no authentication required, no rate limiting and CORS is fully supported if you want to integrate a direct API hit via the browser from another service.
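If you want to consume the paste API from code, something like the following Python sketch would be a starting point. It only builds the request URL and turns a response body into Pastebin links, leaving the HTTP call itself to whatever client you prefer. It assumes the response body is standard JSON, and it only handles the “Pastebin” source because the link formats for Slexy and Pastie aren’t covered in this post. The function names are my own.

```python
import json
from typing import List
from urllib.parse import quote

API_BASE = "https://haveibeenpwned.com/api/v2/pasteaccount/"


def paste_account_url(email: str) -> str:
    """Build the request URL, percent-encoding the address ('@' becomes '%40')."""
    return API_BASE + quote(email, safe="")


def pastebin_links(response_body: str) -> List[str]:
    """Turn a JSON response body into links for the Pastebin-hosted results."""
    pastes = json.loads(response_body)
    return [
        "http://pastebin.com/" + paste["Id"]
        for paste in pastes
        if paste["Source"] == "Pastebin"  # other sources are skipped in this sketch
    ]


if __name__ == "__main__":
    print(paste_account_url("foo@bar.com"))
    body = '[{"Source": "Pastebin", "Id": "wXb5W8GV", "Title": "#freenode-log", "Date": "2014-07-06T19:07Z", "EmailCount": 187}]'
    print(pastebin_links(body))
```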

Last thing on the API: if you’re using it either for breaches or the new paste feature, I’d really love to hear about how you’re using it and particularly what can be done to make it more effective for you. There are a lot of discrete use-cases that can be better serviced by providing more options for how data is queried, what data is returned (i.e. returning smaller responses) and other tweaks that can make it an altogether more useful service. Ping me.

The unrelenting velocity of pastes

One of the things that really struck me about pastes is the absolutely non-stop nature of them. Whilst there might be in the order of 174M breached accounts in HIBP versus “only” about 4M paste accounts, breaches are few and far between and require quite a whack of effort on my part. Pastes, however, are flowing in all the time, every day and in fact usually every hour.

Let’s just take the last two months before the time of writing. Here’s the velocity of emails found in pastes each day:

A graph of emails found in pastes by day

Now there are actually a few gaps there (DumpMon may have been having a little nap), but what you’re looking at over this 62 day period is 2,037 pastes containing 1,177,473 emails. In terms of rate, that’s 33 pastes and 19k emails a day and the vast majority of those are email addresses the owner has no idea are sitting out there on Pastebin. Sometimes they might be completely indifferent about this, other times (such as when it accompanies personal data), they probably want to care a great deal.
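For anyone who wants to check the arithmetic behind those rates, here it is as a few lines of Python using only the figures quoted above. The per-paste average is simply derived from the same two totals.

```python
# Figures quoted in the post for the 62 day period.
DAYS = 62
PASTES = 2_037
EMAILS = 1_177_473

pastes_per_day = PASTES / DAYS    # a little under 33
emails_per_day = EMAILS / DAYS    # roughly 19 thousand
emails_per_paste = EMAILS / PASTES  # average size of a paste, in emails

if __name__ == "__main__":
    print(f"{pastes_per_day:.1f} pastes per day")
    print(f"{emails_per_day / 1000:.1f}k emails per day")
    print(f"{emails_per_paste:.0f} emails per paste on average")
```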

If you want to get a sense of the rate of new pastes and have a look at specifically what they contain, check out the latest pastes page which looks like this:

A list of the latest pastes on HIBP

New pastes continue to flow at a pretty non-stop rate, the trick to doing something really useful with them is getting them into HIBP as fast as possible. Let me show you just how fast…

Getting pastes into HIBP fast

I had a goal from the outset for paste processing speed – I wanted it in HIBP and searchable within 60 seconds of the paste appearing on Pastebin. There’s a lot that needs to happen in that time, including:

Paste first appears
Dump Monitor retrieves it… (let’s not trivialise this step, there’s a huge volume of pastes appearing on Pastebin every single day)
…then figures out if it contains any emails…
…then tweets it out if it does
HIBP retrieves the tweet from the feed…
…then inspects it to see if it’s reporting emails are present…
…then goes out to Pastebin and downloads it…
…then extracts all the unique emails…
…then stores everything in the system (after this it then sends the notifications too)
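The "extracts all the unique emails" step in that list is easy to illustrate in isolation. Here’s a minimal Python sketch; the regular expression is deliberately simple and is my own assumption rather than the pattern HIBP or Dump Monitor actually uses, and lower-casing addresses before de-duplicating is likewise just a choice made for the example.

```python
import re
from typing import List

# Deliberately simple pattern; real-world address matching has many more
# edge cases than this covers.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_unique_emails(paste_text: str) -> List[str]:
    """Return each address once, lower-cased, in order of first appearance."""
    seen = set()
    unique = []
    for match in EMAIL_PATTERN.finditer(paste_text):
        email = match.group(0).lower()
        if email not in seen:
            seen.add(email)
            unique.append(email)
    return unique


if __name__ == "__main__":
    dump = (
        "1,alice,5f4dcc3b,Alice@Example.com\n"
        "2,bob,abc123,bob@example.org\n"
        "3,alice2,xyz789,alice@example.com\n"
    )
    print(extract_unique_emails(dump))
```

The length of the returned list is the sort of number that would feed the email count shown against each paste.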

As I’ve mentioned a couple of times, I’m going to write up a detailed design blog post on how all this actually works a little later on, but there are all sorts of issues here ranging from avoiding getting blocked by Pastebin to Twitter rate limits to actually orchestrating everything in a scalable, idempotent and ultimately expeditious fashion.

So how does it all stack up speed wise? I just pulled the last 432 pastes from the system (everything since I last actually touched the import process) and looked at the median values. Check this out:

DumpMon retrieving paste and tweeting in 28 secs, HIBP retrieving paste in 5 secs, HIBP importing emails in 4 secs

In other words, 33 seconds after the paste appears on Pastebin, the first emails in it are searchable on HIBP. 4 seconds later and every email from the paste is now in the system and searchable. I’m enormously happy with this, particularly given the scope within my control is really only that 9 second block at the end. (Incidentally, I could theoretically reduce the 28 second block if I could get the tweet earlier, but I’m maxing out the Twitter rate limit already by checking it every 5 seconds.)
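Those timings add up as follows. This is just the sum of the three median figures above written out in Python; strictly speaking, adding medians together only approximates the median of the end-to-end time, but it’s a fair way to get a feel for the pipeline.

```python
from itertools import accumulate

# Median seconds per stage, as quoted above.
STAGES = [
    ("DumpMon retrieves the paste and tweets", 28),
    ("HIBP retrieves the paste", 5),
    ("HIBP imports the emails", 4),
]

# Running total after each stage: seconds elapsed since the paste appeared.
cumulative = list(accumulate(seconds for _, seconds in STAGES))

# The portion within HIBP's control is everything after the tweet goes out.
hibp_seconds = sum(seconds for _, seconds in STAGES[1:])

if __name__ == "__main__":
    for (name, _), elapsed in zip(STAGES, cumulative):
        print(f"{elapsed:>3}s  {name}")
    print(f"HIBP's share: {hibp_seconds}s")
```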

The speed is great, but it does beg the question – why does it matter? I mean what difference would it make if someone couldn’t find their account on HIBP until, say, 5 minutes after the paste appeared? Well that’s not really the problem I’m trying to address here, but to explain it properly we need to talk about just how fickle pastes can be.

The transient nature of pastes

When I originally fired the service up, it went back through all the Dump Monitor tweets as far as the Twitter API would allow which is about 3.2k tweets. Given that at the time of writing Dump Monitor has going on 19k tweets, that’s a whole lot of data I didn’t have. Jordan kindly exported his tweet history and sent it over to me which means I had the full gamut of historical tweets to play with. That’s when the scale of something I knew already became completely clear: pastes of this nature are very transient.

What I found was that a huge number of pastes were no longer on Pastebin, they’d simply disappeared. It kinda makes sense when you take a look at Pastebin’s acceptable use policy I posted earlier – the data that’s of most interest to me is precisely the sort of data that shouldn’t be on Pastebin in the first place! So it gets deleted, sometimes I assume that’s as a result of a complaint (there’s a “report abuse” link at the top of every paste), and other times the paster inevitably removes it themselves for a variety of reasons (e.g. they wanted to briefly share some data then delete it before anyone else got hold of it).

What it all means is that the speed of the HIBP import process is absolutely essential. The longer the lead time between the original paste and when I try to retrieve it, the greater the likelihood it will be deleted before I get there. The very first picture in this blog post is a perfect example – the paste is now gone but those email addresses are still searchable and the owners can discover that their accounts were indeed breached. This is precisely why that 28 second period above is so important because it means there’s a very small window of opportunity to miss grabbing it.

Talking of paste retention, one thing worth noting is that I don’t save the original paste. The only data saved in HIBP is the email address and other metadata such as the title, date and number of emails found in it. Many of these pastes have user credentials in them and many of them are plain text. I don’t need the responsibility of handling these and as I’ve always said, you can’t lose what you don’t have. What this also means is that I can’t show what I don’t have; if the paste is deleted from Pastebin, there’s nothing by which to gauge the impact on individuals who were mentioned there other than whatever metadata I’m able to show. On balance, I believe that’s the most responsible approach.

I got pwned in a paste (kind of)

Right towards the tail end of the testing of this new feature, I did a search on my own email address. Now I expected to be pwned on a breach (yeah, yeah, Adobe just like everyone else), but I didn’t expect to see this:

My email address appearing on a paste

Turns out someone had taken the liberty of posting this paste:

A dump of the "Supercar Showdown" database including my email address

Crikey! Well kind of crikey, it turns out that this is a dump of Supercar Showdown, the sample vulnerable app I used for my top rating Pluralsight courses Hack Yourself First and more recently, Hack Your API First. It’s meant to be vulnerable and it’s full of holes including things like SQL injection which was inevitably the vector used to suck out the data you see in the paste above. So it’s actually good that this is here; unlike the Adobe situation (damn you guys!), I wasn’t actually pwned but just as Supercar Showdown is a great demo for a vulnerable website, my account from there appearing in Pastebin and having been automatically picked up by HIBP is an excellent example of the paste service.

What will be of interest now that this new feature is launched is how many other people find themselves unexpectedly pwned by way of paste. I’m very interested to hear stories of examples so assuming you go out and do a search after this, do leave a comment and let me know if you found your address on a paste and can share the context.

Seeding additional sources of data

When I designed the paste service, in my mind what I was designing was a service that could take a URL, identify potentially breached accounts then make them searchable. That is all. The significance of that statement is that I expect there will be other sources of likely compromised data that can be added to this service.

I’d really like to hear from you, especially those of you who lurk within the circles that frequently see this sort of data floating around the webs. If there are other common sources of breach data either on the clearnet or in the undergrounds that can be consumed in a similar fashion, I’d love to hear about it. Same again for any other Twitter accounts that might help contribute to building a broader inventory of breached accounts.

Send me your ideas folks, I’m only nine months into this service so it’s very early days yet.

Supporting HIBP with your donations

I was enormously surprised earlier this year when people actually wanted to give me money for a free service! I don’t want to charge for the service as it stands not only because it would obviously hinder uptake, but it would require a bunch of extra effort on my part. There’s also the fact that at the end of the day, these are peoples’ personal details that have been compromised and charging them for the right to know where they were compromised and be notified when it happens just doesn’t seem right to me. Yes, I know that other services do it and I’m sure they have their reasons (which are probably pretty obvious), but I just don’t need to. As I explained in that post above, the services I use are very cheap and very flexible and there’s less cost to cover than what I’m spending on coffee (that’s a separate problem!)

What I decided to do instead was focus primarily on the personal cost to providing the service, that is the sacrifices I make and the things I spend money on that help me do what I do, namely buying coffee and alcohol! The donations page talks about 10 things:

If you love the service and want to show your support, donations are always very warmly received! But support doesn’t have to be monetary, do leave your feedback on any of the things I’ve written about here or observe on the site itself. There are so many little ways the service can be improved and I’d love your feedback on these, no matter how small. Speaking of improving the service, one group has been enormously helpful – the very awesome HIBP beta testers.

Thanking beta testers and SendGrid

I had about two dozen people offer to beta test this service over the last month and I had heaps of awesome feedback. This is from people who were happy to give up their time to help make this service better in all sorts of different ways. I had feedback that ranged from typos to colour contrast to some pretty fundamental functional suggestions. Most of the feedback was implemented and what didn’t make the cut was not due to the quality of what was being suggested, but rather how it fit with the overall usability of the service and the roadmap I see for it in the future.

I also got great support from SendGrid courtesy of them sponsoring a bulk notification to all 80k verified subscribers right after publishing this blog post. It saved me some cash and they were very helpful with a few logistical things too. I originally used the SendGrid service because it’s baked into Azure as an add-on and then later used their loader.io service when I scaled a standard Azure website to 380k queries per minute of 163M records. Great products, great support and their generosity has helped HIBP grow in a number of ways now. Thanks guys!

So thanks everyone for supporting my little project, I’m enormously happy with the result and there’ll be much more to come yet. Do share the service generously with others and stay tuned on the HIBP Twitter account and on this website for more.
