It’s about six weeks into the life of Have I been pwned? now and I’m enormously pleased with the reception it’s received. The fact that I’ve had to write posts like the micro optimisation one or the one about getting too big for Google, and deal with all the problems I’ve discussed there, has actually been a very rewarding process and by all accounts, people have enjoyed reading about it and using the new features I’ve been introducing too.
I’ve had a bunch of very awesome suggestions over the last month and a half and two in particular have kept coming up over and over again. The first one is the subscription service so that you can be automatically notified when you’re pwned in a future data breach. I pushed this out on Xmas eve and as of now there are 16,000 subscribers in the database.
The next one – and this is the focus of today’s post – is domain wide searches:
This is all about significantly broadening the scope of the pwned account search such that all accounts on a particular domain can be returned. There are actually some really good use cases for this so let me walk you through those then introduce the new service.
Why would you want to search by domain?
Good question! The first scenario is all about individuals who use multiple mail accounts or aliases on their own domain. For example, I might choose to register with Adobe using a mail account called email@example.com. If I start getting spam to that email address later on, I know who to blame. Whilst there are many people that asked for this feature for that exact reason, it’s also a bit niche; across the breadth of accounts that get pwned, those of us who actually have our own domains and manage mail records on them are a small few indeed.
Here’s what’s not so small – companies who are worried about their employee accounts being pwned. Now you might ask “Why would my employer care that I have an account pwned via Adobe?” One reason is that because people tend to be a bit naughty and reuse their passwords, that individual may have just introduced a risk to the organisation. If an attacker gets the breached record for firstname.lastname@example.org and pulls his password from the dump, what do you reckon is the first thing he’s going to try when trying to compromise John’s mail account on Acme’s web mail environment? Exactly. Then of course we see other data that may heighten the risk an employee poses to their employer, for example data that may enable a social engineering attack against services the employee uses on the company’s behalf. Making it possible for employers to assess the breadth of impact when their staff’s accounts are compromised allows them to make sensible decisions on how they want to handle it. For that, they need the data.
There are other use cases beyond these (penetration testers – sit tight – I know what you want!) but these are the ones that kept coming up over and over. Now I don’t just want anyone to be able to come and grab all the breached accounts for a particular domain, there has to be some due diligence and that’s why I’ve implemented a verification process.
Verifying domain authorisation
One of the things I’ve really focussed on with this project is usability. Everything has to be as easy as I can humanly make it which is why searches are ridiculously fast, everything is designed responsively and plays nice across devices like phones and tablets and I’ve made the barriers to use as low as I possibly can. The one exception – much to my chagrin – is the CAPTCHA. I explained in detail why I needed this in the post earlier this week so I won’t harp on it, suffice to say that I needed some verification of the organic nature of the being using the service (i.e. they’re not a “robot” automating or abusing the service).
With the domain wide searches, I wanted to establish a reasonable degree of confidence that the individual performing the search was indeed responsible for the domain. Often people have said “Hey, doesn’t this site just make it easy for attackers to figure out who’s been compromised?” to which I politely explain that being able to pull back single records is not only difficult to exploit en masse, it’s also easily substituted by an attacker simply downloading the same publicly available breaches that I have. Being able to pull large numbers of records in one go, however, is a different level altogether. It’s also a resource intensive process (at least compared to the existing mechanism of pulling single records), so I need to be sensitive to the impact allowing just anyone to, say, search for *@gmail.com would have on the system.
Enough of why verification is necessary, let’s jump into it and you can go and kick off a domain search right now by either going to that link or following the nav item on the site:
This’ll kick the process off and present you with the following UI:
Firstly, you obviously need a domain name. You already know by now that there’s going to be a verification process, but there’s a whack of bold text there just to reinforce the fact.
Next is the option to subscribe. Remember how I talked about the ability to get notified when your email address is pwned in the future? Well this is the domain-wide equivalent of that which means you can be notified when, say, anything @troyhunt.com shows up in a future breach (that’s if you can verify your authority over the domain!)
You don’t have to use an email address on the domain in order to receive notifications, you can use whatever you like and so long as you’ve verified the domain, you’ll get notified. These notifications won’t send you the actual email addresses that have been breached (many people are quite sensitive to the way this info is handled), but they will advise you of the scale and help you pull that data back from the website via the browser.
And then another CAPTCHA. Dammit. As you’ll see in a moment, there are multiple options within the verification process that can make life really hard on me if they get abused and as you’ll know from the previous post on why I need anti-automation, this is a risk I just can’t take at the moment. Anyway, let’s begin the process and I’ll walk you through those options.
Verifying by email
As I said earlier, I want to make everything about this site as easy as possible and when it comes to domain verification, the simplest possible way for many people will be to verify by email which is why this is the first thing you’ll see when verification begins:
What you’re seeing here is four different email addresses from two different origins. The first email address is not the one I entered on the previous screen! Ok, it is but that’s not where it’s come from, it’s come from a WHOIS service that has reported that this is one of the email addresses on the domain registration record. Sometimes there will be more than one address available to choose from (i.e. the registration record has different admin and tech contacts) and you can see what these look like for your domain on any WHOIS service that reports this info. This won’t work if you’re using a privacy service to obfuscate your identity as you’ll see their email address in first place which won’t do you much good when the site emails them for verification. One of the reasons I need the CAPTCHA is that I pay for the service that provides the WHOIS API and returns the email addresses on the domain so obviously I can’t have this getting abused for simple financial reasons.
The next three addresses are “canned” defaults, that is they’re preloaded into the system as they’re frequently used for administrative purposes. Particularly if you’re an individual and you have a “catch-all” model set up where you can send mail to anything at your domain then this will work for you. Then again, you might already be using one of these on your domain or can often just easily create one as an alias to an existing account.
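To make the "two origins" idea concrete, here's a minimal sketch in Python of how a list like that could be assembled. It isn't the code behind the site: the function name is made up, and the three canned prefixes are my own assumption for illustration since the real ones only appear in the screenshot.

```python
# Hypothetical sketch: build the list of addresses offered for verification.
# The canned prefixes below are assumed, not taken from the site.
ASSUMED_CANNED_PREFIXES = ("postmaster", "hostmaster", "webmaster")


def candidate_addresses(domain, whois_emails, canned_prefixes=ASSUMED_CANNED_PREFIXES):
    """Return WHOIS contacts first, then canned defaults, without duplicates."""
    domain = domain.strip().lower()
    canned = [f"{prefix}@{domain}" for prefix in canned_prefixes]
    seen = set()
    candidates = []
    for address in list(whois_emails) + canned:
        key = address.strip().lower()
        if key and key not in seen:
            seen.add(key)
            candidates.append(key)
    return candidates
```

If the WHOIS record already lists one of the canned addresses, it only appears once, so the list never offers the same mailbox twice.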
Moving on, when you choose an address and send a verification email you’ll be presented with a prompt to enter the verification token:
No, this is not the same token as you’ll see a little later on (I know how some of you guys think!) rather it’s the one that’s just landed in your inbox:
I can now take that token and plug it into the verification box on the website and the job is done. That’s the fastest possible way to verify the domain, but I also know it’s not going to be feasible for everyone so I’ve provided a few more options.
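For the curious, an emailed token of this kind needs two properties: it has to be unguessable, and it has to be compared safely when it comes back. Here's a rough Python sketch of that general technique using only the standard library. The function names and the token length are assumptions of mine, not a description of how the site actually does it.

```python
import hmac
import secrets


def new_email_token():
    """Create an unguessable token to send to the chosen mailbox (32 random bytes)."""
    return secrets.token_urlsafe(32)


def token_matches(expected, submitted):
    """Compare in constant time, ignoring whitespace picked up by copy and paste."""
    return hmac.compare_digest(expected.encode(), submitted.strip().encode())
```

The constant-time comparison is a small thing, but it avoids leaking how much of a guessed token was correct.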
Verifying by meta tag
This is pretty self-explanatory:
Geddit? If you control the website such that you can modify the HTML at the root of the site, you can verify the domain. I could easily use this option for troyhunt.com as I control the template used on the Blogger engine, indeed I’ve frequently done this with other services that required domain verification in the past. When you use this means of verification, an HTTP request is issued to the root of the site and the meta tag in the code block above is expected. It doesn’t matter if the site is HTTPS only so long as it responds with a 301 or 302 redirect to a resource that has the meta tag.
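If you want a feel for what a check like this involves, here's a small Python sketch that looks for a verification meta tag in a page's HTML. Treat it as an illustration only: the tag name "domain-verification" is a placeholder I've invented, and fetching the page (including following any redirect) is left out so the example stays self-contained.

```python
from html.parser import HTMLParser


class MetaTagFinder(HTMLParser):
    """Collect the content attribute of every <meta> tag with a given name."""

    def __init__(self, name):
        super().__init__()
        self.name = name.lower()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == self.name:
            self.values.append(attributes.get("content") or "")


def page_has_token(html, token, meta_name="domain-verification"):
    """Return True if the HTML contains the named meta tag with the expected token."""
    finder = MetaTagFinder(meta_name)
    finder.feed(html)
    return token in finder.values
```

Parsing the markup rather than searching for the raw string means the token only counts when it's actually inside the expected tag.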
A quick point on tokens – the publicly visible ones such as the one in the meta tag and the ones you’re about to see are useless to anyone that observes them, they’re only usable via the verification process on the site. They also don’t need to be persistent and can be deleted once verification and breach retrieval are complete. The token sent by email earlier on is different as it needs to be something that only someone who can receive that email could possibly know, otherwise anyone could easily verify the domain.
Of course some people don’t want to mess with their site markup and that’s just fine, there are other options.
Verify by file upload
This, again, is rather self-explanatory:
This works well where you have, say, easy FTP access to the site. Whack the text file up there, hit the verify button and let it make an HTTP request then nuke the file once you’re done. Dead easy.
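As a rough idea of what that HTTP request could look like, here's a Python sketch. The filename, the assumption that the file contains nothing but the token, and the function names are all mine for illustration. The fetch function is passed in so the logic can be exercised without touching the network.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Download a URL and return its body as text (urlopen follows redirects)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def file_verifies(domain, token, filename="verification.txt", fetch=fetch_text):
    """Return True if the file at the site root contains exactly the token."""
    url = f"http://{domain}/{filename}"  # hypothetical location of the uploaded file
    try:
        body = fetch(url)
    except OSError:
        # URLError and HTTPError are both OSError subclasses: missing file, DNS failure, etc.
        return False
    return body.strip() == token
```

A call such as file_verifies("example.com", "abc123") would make one request and give a plain yes or no, which is all the verify button needs.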
There’s just one more option I want to cover.
Verify by TXT record
Know what a domain TXT record is? It’s like this:
Use TXT records to verify domain ownership or employ security measures, such as DKIM, DMARC, and SPF. You create TXT records using the administration tools available from your domain provider.
In short, you can jump into your domain provider’s control panel and add a bit of arbitrary information to the record that looks somewhat like this:
This is pretty non-invasive; you’re not changing the site in any way. It also may not work for subdomains so if you’re looking for breached accounts on something like tech.troyhunt.com then you’ll probably need to use one of the other three options.
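Checking a TXT record boils down to looking through whatever values the domain publishes for one that carries the token. Here's a minimal Python sketch of that comparison. The "domain-verification=" prefix is a made-up format, and the DNS lookup itself isn't shown because the standard library doesn't do TXT queries. The function simply takes the record values as a list of strings.

```python
def txt_records_verify(txt_records, token, prefix="domain-verification="):
    """Return True if any TXT value equals the (assumed) prefix plus the token."""
    expected = prefix + token
    for record in txt_records:
        # Some resolvers hand TXT values back wrapped in double quotes.
        if record.strip().strip('"') == expected:
            return True
    return False
```

Because other TXT values such as SPF entries live alongside it, the check matches the whole value rather than just looking for the token somewhere in the text.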
Between those four avenues I reckon the vast majority of use-cases are pretty well covered so let’s take a look at what you can get out of a verified domain search.
Domain search results
Once everything checks out and you seem legit, here’s what you’ll get:
I’ve chosen those three data formats as I believe they’re the most universally recognised and cater for most of the ways I can conceive of people using the data. If you just want to take a quick look at the pwned accounts then grab it in the browser. If you want to save it offline or distribute it within your organisation then grab it in Excel. If you want to import it into another system then JSON may be your best bet. You choose.
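If you're planning to feed the results into something else, here's a small Python sketch of turning one set of results into two text formats. Everything about the shape of the data is an assumption on my part (an "email" field and a list of "breaches"), and CSV stands in for Excel here simply because it opens in Excel and needs nothing beyond the standard library.

```python
import csv
import io
import json


def to_json(results):
    """Serialise the results as indented JSON for import into another system."""
    return json.dumps(results, indent=2)


def to_csv(results):
    """Serialise the results as CSV, one row per account, breaches joined by '; '."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["email", "breaches"])
    for row in results:
        writer.writerow([row["email"], "; ".join(row["breaches"])])
    return buffer.getvalue()
```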
One thing you’ll notice is that I’m limiting the validity period of the verification to ten minutes. This is more than enough to grab every one of those formats (many times over, I suspect) but it also means that you don’t retain perpetual access to the data. If, for example, someone moves to another organisation or on-sells the domain, I don’t want them having ongoing access just because at some point in the past they were in a position of authority over the domain. Yes, they can still receive notification emails but as I mentioned earlier, they’ll contain only summaries and not the actual addresses of the pwned accounts.
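A time-boxed verification like this is simple to reason about: record when the domain was verified and refuse anything outside the window. Here's a minimal Python sketch of that check. The names are hypothetical and it says nothing about how the site stores the timestamp.

```python
from datetime import datetime, timedelta, timezone

VALIDITY_PERIOD = timedelta(minutes=10)  # the ten minute window described above


def is_still_valid(verified_at, now=None):
    """Return True while the current time is within ten minutes of verification."""
    if now is None:
        now = datetime.now(timezone.utc)
    elapsed = now - verified_at
    # Reject timestamps from the future as well as expired ones.
    return timedelta(0) <= elapsed < VALIDITY_PERIOD
```

Passing the current time in as an argument keeps the check easy to test at the boundaries.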
But of course the really important bit is what the domain search result looks like so here’s an example only for troyhunt.com:
And there you have it – domain wide breach searching made easy!
What’s the limit?
I mean how big can a domain search be? Well put it this way – there were over 32 million emails on the Hotmail domain alone when I went through the Adobe breach and no, you’re not going to be able to pull that back! Of course you’d have to be able to verify the domain to begin with but even if you can, sorry, you’ll see something like this:
With the tolerance threshold I’ve set for big data searches, the system maxes out at about 40,000 records. Search for a domain larger than that and you’ll see the message above. Having said that, I expect it will be a very, very rare occurrence. With the exception of literally just a couple of major domains (including adobe.com), the only ones that exceed this size are mail providers. Companies in other lines of business – even the world’s largest companies – don’t come anywhere near that number.
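Here's a sketch in Python of how a cap like that can be enforced while searching, bailing out as soon as the threshold is crossed rather than collecting everything first. The 40,000 figure is the approximate one mentioned above; the function, the exception and the idea of scanning a plain list of addresses are purely illustrative and not how the real data store works.

```python
MAX_RESULTS = 40_000  # approximate threshold from the text


class DomainTooLarge(Exception):
    """Raised when a domain has more breached accounts than the cap allows."""


def search_domain(addresses, domain, limit=MAX_RESULTS):
    """Return addresses on the exact domain, or raise once the limit is exceeded."""
    suffix = "@" + domain.strip().lower()
    matches = []
    for address in addresses:
        # Including the "@" means tech.example.com does not match example.com.
        if address.lower().endswith(suffix):
            matches.append(address)
            if len(matches) > limit:
                raise DomainTooLarge(domain)
    return matches
```

Stopping early matters for the big mail providers: there's no point gathering millions of rows only to throw them away.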
That said, if there was a genuine need then there are various avenues to make this data available. If you own one of these domains and want the data, ping me.
Verification, gaming the system and security controls
If you’re a regular reader, by now you would have heard me frequently berate the assertion that a system is “secure” as though it’s an absolute state of binary enlightenment that one can achieve. Web security – any security for that matter – always comes down to a balance between the nature of the resource it’s protecting, the value it poses to an attacker, the effort required to circumvent the security controls, the impact of a breach and if you’re smart, the compromises made to usability. I can think of ways that an attacker might game the above verification processes, mind you that’s usually by compromising other externalities rather than any known weaknesses in the process, but certainly it’s conceivable. The question is what purpose this would serve them. To compromise personal data? I don’t have any that’s not public anyway. To circumvent a pay wall? I don’t charge any money! These things actually make security significantly easier as the value proposition to an attacker is very different to most commercial websites.
That said, I’ve implemented various controls which limit the ability for the system to be abused. Things like CAPTCHA are obvious, but there are many other provisions that aren’t quite so obvious, which is just the way I’ll keep them :)
Before launching this service I opened it up to an audience of private beta testers. All told, between people who offered to jump in on the domain testing and those who had previously helped test the notifications and before that the first version of the site, I had a total of 57 testers. Many of these people gave me some very excellent feedback, much of which was used to improve the site as you find it today. They did so selflessly and out of their own desire to help a fledgling project and I’m enormously grateful for their support.
So this is one more major feature out and I’ll be very closely watching the usage and responding to any hiccups that might appear. I don’t expect it all to go smoothly without any unexpected issues; it rarely does. What I’d really like to continue encouraging though is direct, honest and open feedback on what’s working well but even more importantly, what’s not working well or what other features you’d like to see this project support. It’s an ongoing evolution that I suspect has a long albeit not entirely clear roadmap in front of it so do please keep those comments and tweets flowing.
Enjoy your domain searching!