Troy Hunt: Have I Been Pwned Domain Searches: The Big 5 Announcements!

There are presently 201k people monitoring domains in Have I Been Pwned (HIBP). That's massive! That's 201k people that have searched for a domain, left their email address for future notifications when the domain appears in a new breach and successfully verified that they control the domain. But that's only a subset of all the domains searched, which totals 231k. In many instances, multiple people have searched for the same domain (most likely from the same company given they've successfully verified control), and also in many instances, people are obviously searching for and monitoring multiple domains. Companies have different brands, mergers and acquisitions happen and so on and so forth. Larger numbers of domains also means larger numbers of notifications; HIBP has now sent out 2.7M emails to those monitoring domains after a breach has occurred. And the largest number of the lot: all those domains being monitored encompass an eye watering 273M breached email addresses 😲

The point is, just as HIBP itself has escalated into something far bigger than I ever expected, so too has the domain search feature. Today, I'm launching an all new domain search experience and 5 announcements about major changes surrounding it. Let's jump into it!

Announcement 1: There's an all new domain search dashboard

Every time I look at numbers related to domain searches, they stagger me. One of the stats I found particularly interesting was that of those 200k people monitoring domains, 23k of them were monitoring 2 or more domains. 8.5k were monitoring 3 or more. 4.6k were 4 or more and so on and so forth. The point being that there are a very large number of people monitoring multiple domains. In fact, 1k people are monitoring 9 or more and hundreds have gone through the manual verification process at least 2 dozen times.

To make life much, much easier on those folks monitoring multiple domains, they're now all bundled up into a centralised dashboard accessible from the existing Domain search link on the website. Because I already know who is monitoring which domains and the email address they're using for notifications, that same email address can be used to verify your identity and drop you straight into the dashboard. Here's mine:

One of the problems the dashboard approach helps tackle is unsubscribing on an individual domain basis. In the past, the only way to unsubscribe from domain notifications was to wait until one landed in your inbox then unsubscribe from every single monitored domain in one go. It was an all or nothing affair that nuked the lot of them whereas now, it's a domain-by-domain exercise.

Another problem this solves is how I respond to an often-received question: "Hey, can you tell me which domains I'm currently subscribed to". Uh, the ones you verified? Like, possibly almost a decade ag... ah, yeah, that's a poor answer! The dashboard now makes the answer crystal clear.

And finally, another massive problem it helps tackle is verification, and that brings me to the second big announcement:

Announcement 2: From now on, domain verification only needs to happen once

I originally introduced domain searches to HIBP only 6 weeks after the project first launched. Up until this week, it functioned exactly the same way for almost a decade: plug in a domain name, verify control of it then see the results. Each and every time. What it meant is that if you wanted to search a domain, you successfully demonstrated control then you came back later and tried to search it again, you had to go back through the same process:

You'd be surprised at how many emails I get about the difficulty this poses. We don't have any of those 4 aliases on our domain. We can't add a meta tag. We can't upload a file. We can't touch DNS. It leaves me prone to asking "well do you really have control of the domain?" Thing is, "control" is a bit of a nuanced term; there are many people in roles where they don't have access to any of the above means of verification but they're legitimately responsible for infosec and responding to precisely the sorts of notifications HIBP sends out after a breach. Usually in these cases they can get support to go through the verification process, but it involves formal internal processes, ticketing, documentation and having to explain to some IT ops person why a data breach website with a funny name needs one of the above things to happen. This doesn't fix the pain of doing it once, but it does mean that it's now a one-off pain.

Announcement 3: Domain searches are now entirely "serverless"

As the popularity of HIBP and domain searches has grown over the years, another challenge has emerged. Let me illustrate by example: in January this year, I loaded a rather large breach into HIBP:

New scraped data: Twitter had over 200M accounts scraped from a vulnerable API in 2021. Email addresses were passed in and Twitter profiles returned. 98% were already in @haveibeenpwned. Read more: https://t.co/FRBDFk3nkp
— Have I Been Pwned (@haveibeenpwned) January 5, 2023

That's a sizeable whack of data, in fact it was the 14th largest in HIBP out of the existing 644 in there at the time. It also had a massive impact on HIBP subscribers; I sent over 1 million emails to individuals using the notification service which made it the single largest corpus of notification emails we'd ever sent by a significant margin. But further, I also sent 60,851 emails to people monitoring domains. And that's when this started happening:

6 minutes later...

And so on and so forth until my inbox looked like this:

This was Azure auto-scale doing its thing and it was one of the early attractions for me building HIBP on Microsoft's PaaS offering way back in 2013. Need more resources? Just add more cloud! Job done, next problem. Except there are 2 major drawbacks with this:

Auto-scale is reactive. You get extra capacity in response to demand but if demand spikes too fast, you're left without sufficient resources. I learned this the hard way and wrote about it in detail in 2016.
I pay for it. When load spikes and additional instances are scaled out, I'm billed for it whilst those instances are spun up. It's great that domain searches are free for the end user, but they're not free for me 😔

Domain searches were actually one of the last remaining remnants of a resource intensive process still running on PaaS; most of the other important bits (namely email address searches and Pwned Password's k-anonymity searches) had been on Azure Functions for ages. Functions are awesome as they're "serverless" (except for the servers they run on, but don't let me get in the marketing team's way here), in that you're never deploying large logical containers of compute like with auto-scale so that solves problem 1 above.

As of now, all domain searches run on Azure Functions. There's literally no domain search logic remaining in the Azure App Service PaaS model, it's all gone. That moves things over to much more scalable infrastructure and massively reduces the likelihood of a timeout when searching a larger domain.

Announcement 4: There are lots of little optimisation tweaks

I didn't just want to ship a model from years ago and reproduce all the assumptions of the day, so I made a bunch of tweaks to further optimise things. These are all things that benefit both those searching domains and me running the platform as they reduce overhead on everyone.

For example, there was no point searching for a domain then listing every alias on it "@domain.com" so now you'll just see "alias@" instead. Doesn't sound like a lot, but imagine a domain with tens of thousands of results and then a heap of orgs running searches on them. More data equals more processing equals more egress bandwidth equals more latency and more cost. (Sidenote: if you're wondering "how costly can a bit of bandwidth really be", read my post from last year on How I Got Pwned by My Cloud Costs.)

The same logic extended to exporting the domain search results in Excel or JSON format - strip out the redundant data. I went even harder on the JSON front as this format is primarily used for ingestion into other apps where there's a large amount of programmatic control. So, rather than returning a heap of redundant breach metadata over and over again, now each alias just lists the name of the breach and you can match that up to the data from the breaches API. To be clear, the domain search JSON format itself was never an "API"; it wasn't designed for programmatic consumption, it required manual verification first and I set no expectation of stability. That's something that will change soon - there'll be a proper API - but I'll come back to that at the end of this post.

Something else I've been working away on in the background is to better leverage Cloudflare's WAF to minimise the impact on the origin services. For example, last week I did a thread on blocking 401 and excessive 428 responses at the edge rather than having to process them (and pay to process them) at the origin. I've been using similar logic to keep some, well, let's just call it "very excessive" domain queries under control. For example, one particular domain was searched 140 times after a breach was loaded in April, followed by another 40 times immediately after a breach the following month:

Clearly, this is just unnecessary. Remember how domain searches are a resource intensive process that hits my bottom line pretty hard? Yeah, well, not any more!

And finally on the performance front, if you were previously monitoring multiple domains and you got a breach alert, you could run a single search that bundled all the results in together. You reckon searching for one domain can be resource intensive? Try throwing a bunch of them into the one search! As the system grew and grew, this model became increasingly hard to sustain and equally, it became increasingly noisy. So now, exactly the same domains can be searched one by one which breaks the processing down into smaller, more manageable units. Hey, wouldn't it be great to have an API around that so you could just automate the entire thing? Read on!

All these tweaks along with the move to Azure Functions has made a massive difference to the performance problem mentioned earlier, but another problem remains: I'm still paying for your domain searches. Azure Functions are charged based on a combination of how long they run for and how many resources they consume. Both those factors are extraordinarily small for individual email address searches, but they're not for domain searches. That's why soon, the largest users of the service are going to see a small fee.

Announcement 5: Searches for small domains will remain free whilst larger domains will soon require a commercial subscription

Pick a brand. A big brand. If I was to bet you that either the brand directly or its parent company has used the HIBP domain search feature in the past, I'd win. I wouldn't win every bet, but I'd come out on top over a bunch of them and I know this because I have the data to be confident of my odds 🙂

Knowing which big brands use which domains for their email is actually a hard metric to define:

Anyone know where I can find a list of the Fortune 500’s domains used for email accounts? There may be more than 1 per company and it may be different to their primary website.
— Troy Hunt (@troyhunt) January 15, 2023

But by cobbling enough OSINT data together, I was able to confidently demonstrate that more than half the Fortune 500 have used this service and the vast majority of those continue to do so via ongoing domain monitoring. That's awesome! And that pattern extends all the way down to much more localised brands too; My bank. My telco. My supermarket. All sorts of commercial organisations running businesses and using data sourced from HIBP to help them do so.

I started analysing the metrics back at that tweet in Jan, just the week after all the domain searches following the scraped Twitter data going into HIBP. For the last 5 months, I've been trawling through the usage patterns and watching how organisations are using the service. I also paid a lot of attention to the reactions following the change in rate limits and annual billing for the public API that enables email address searches last Nov. That's now given me a pretty good sense of how to structure a commercial domain search model. It's not final yet, but I do hope to put the finishing touches on it next month and in the interim, welcome feedback on the high-level overview of how it'll work that I'll list here in point form:

I can reliably establish the size of a domain based on the number of email addresses that appear against it in breaches
There is a size at which domain searches should remain totally free and that size will usually indicate a small business or website or a personal domain (certainly every domain you see in the hero image of this blog post, for example)
Like with the aforementioned API for email address searches, there should be tiers of scale that reflect domain size and increase proportionately in price for larger organisations
Commercial subscribers should get more than they do now - they should get domain searches by API!

That last point in particular is hotly requested and as of a couple of months ago, already under development:

UserVoice suggestion for @haveibeenpwned to add domain search capability to the API now started! Follow along, vote and subscribe to updates here: https://t.co/Z32eC0d9nb
— Troy Hunt (@troyhunt) April 20, 2023

I'm still working through the mechanics of all this, both technically and commercially. One part of that is looking at raw numbers, for example about half of all the domains being monitored have 10 or less breached accounts on them. These aren't commercial entities of any scale and whilst I'm not saying "10 is the free tier number", clearly there are a massive number of domains that are tiny and shouldn't be at all impacted by this.

To be honest, the experience with the public API keys has taught me that it's usually not money that's the barrier to using commercial services, it's corporate procurement bureaucracy. Onboarding documentation. Vendor assessments. Tax forms. All sorts of things that demand hours of our time, often for the sake of only $3.50 per month. So we politely decline 😊 I know that will be an issue, in fact I suspect it will be the issue and a lot of the work we've been doing this year is to try and ease that pain to the fullest extent possible. I'll talk more about that once things finally launch but for now, that's the direction we're heading and the sorts of issues we're tackling in preparation.

Summary

As we approach the 10th birthday of HIBP later this year, it's hard not to look back and reflect. So much has changed in that time, yet the service still feels very much like what it was on day 1. The challenge for me over this time has been to work out how to adapt to the changes whilst keeping true to the original intent of service. Nothing has happened quickly in that regard, and the transparent fashion in which I've chosen to run HIBP has made the rationale for any change very clear to everyone. Even this blog post has been 5 months in the making, gradually evolving to reflect my thinking on the issues until I was confident enough in the path forward.

Go and use the new dashboard. Give it a good run and let me know what you think as I'm sure there are many things we can do better. And do provide your feedback on the both the changes announced here and those to come regarding the commercial tiers too, the more input we get on this the better equipped we are to make good decisions.

Have I Been Pwned

Have I Been Pwned Domain Searches: The Big 5 Announcements!