The headline is pretty self-explanatory so in the interest of time, let me just jump directly into the details of how this all works. There's been huge interest in this incident, and I've seen near-unprecedented traffic to Have I Been Pwned (HIBP) over the last couple of days, so let me do my best to explain how I've approached the phone number search feature. Or if you're impatient, you can head over to HIBP right now and search for your number.
I'd never planned to make phone numbers searchable and indeed this User Voice idea sat there for over 5 and a half years without action. My position on this was that it didn't make sense for a bunch of reasons:
- Phone numbers appear far less frequently than email addresses
- They're much harder to parse out of most data sets (i.e. I can't just regex them out like email addresses)
- They very often don't adhere to a consistent format across breaches and countries of origin
Plus, when the whole modus operandi of HIBP is to literally answer that question - Have I Been Pwned? - so long as there are email addresses that can be searched, phone numbers don't add a whole lot of additional value.
The Facebook data changed all that. There are over 500M phone numbers but only a few million email addresses, so >99% of people were getting a "miss" when they should have gotten a "hit". The phone numbers were easy to parse out from (mostly) well-formatted files. They were also all normalised into a nice consistent format with a country code. In short, this data set completely turned all my reasons for not doing this on their head.
And finally, when I asked the masses, the responses were "for" rather than "against" by a ratio of more than 2 to 1:
"Should the FB phone numbers be searchable in @haveibeenpwned? I’m thinking through the pros and cons in terms of the value it adds to impacted people versus the risk presented if it’s used to help resolve numbers to identities (you’d still need the source data to do that)." Troy Hunt (@troyhunt), April 4, 2021
Have I Been [Something'd]
Another reason for pushing this feature out now is the sudden emergence of HIBP clones. I use this term endearingly; it's flattering to see my project influence others 🙂 But I also have absolutely no idea how trustworthy any of the multiple variations I've seen pop up already are. So, to avoid any shadow of doubt, I wanted to make sure that if you'd like to know if you've been pwned in the Facebook data, you can ask HIBP regardless of whether it's an email address or a phone number you're interested in.
Phone Number Format
The existing search endpoints simply identify that the string being searched for isn't an email address and that it adheres to a basic phone number pattern, namely that it's between 10 and 14 digits long. All phone numbers are stored with their country calling code so Aussie numbers begin with 61, the UK is 44, North America is 1 and so on and so forth. And just like when you call an international number, the leading 0 gets dropped off so an Aussie number we might normally dial as 0403... becomes 61403...
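To make that conversion concrete, here's a minimal Python sketch of turning a locally dialled number into the stored format. It's purely illustrative and not HIBP's actual code: the function name `to_international` and the sample number are made up, and it assumes the national format uses a single leading 0 as the trunk prefix (true for Australia and the UK, but not for every country).

```python
def to_international(national_number: str, country_code: str) -> str:
    """Convert a locally dialled number to country-code-prefixed digits.

    Hypothetical helper for illustration only. Assumes the national
    format uses a single leading 0 as its trunk prefix.
    """
    # Keep only the digits so spaces and dashes don't matter
    digits = "".join(ch for ch in national_number if ch.isdigit())

    # Drop the leading 0, just like when dialling internationally
    if digits.startswith("0"):
        digits = digits[1:]

    return country_code + digits


# Made-up Aussie mobile number: 0403 000 000 becomes 61403000000
print(to_international("0403 000 000", "61"))
```

The country calling code is passed in explicitly because there's no reliable way to infer it from a national number alone.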
This style is known as E.164 international phone number formatting and for many people, it's a very familiar pattern. But just in case it's not, there's a great guide put together by Twilio (a previous blog sponsor - thanks folks!) which explains it very clearly.
When you search any of the endpoints on Have I Been Pwned, you can add a + prefix if you like and it'll be automatically stripped off when performing the search. Same with spaces and same with dashes.
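As a rough illustration of the behaviour described above, the following Python sketch strips a leading +, removes spaces and dashes, then checks that what's left is between 10 and 14 digits. The function name `normalise_search_term`, the "contains an @" email check and the sample inputs are assumptions made for the example rather than how the HIBP endpoints are actually implemented.

```python
import re
from typing import Optional


def normalise_search_term(term: str) -> Optional[str]:
    """Return a digits-only phone number, or None if the term isn't phone-like.

    Hypothetical sketch: treats anything containing "@" as an email address.
    """
    term = term.strip()

    # Anything that looks like an email address isn't a phone number
    if "@" in term:
        return None

    # An optional + prefix is stripped off...
    if term.startswith("+"):
        term = term[1:]

    # ...and so are spaces and dashes
    cleaned = re.sub(r"[ -]", "", term)

    # Basic phone number pattern: between 10 and 14 digits
    if re.fullmatch(r"[0-9]{10,14}", cleaned):
        return cleaned
    return None


print(normalise_search_term("+61 403-000-000"))   # made-up number
print(normalise_search_term("test@example.com"))  # not a phone number
```

Returning `None` for anything that doesn't match keeps the "is this a phone number?" decision in one place, so the caller can fall back to whatever other handling it has.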
There's No Notifications or Verifications
At this stage, you can't subscribe for a notification when a phone number is pwned nor is there any concept of verifying a number to search sensitive breaches. That would require SMS delivery which obviously has a cost, but also a workload I just can't justify investing at this time.
Will Phone Numbers from Other Breaches be Searchable?
Not unless there's a similar value proposition to the Facebook data. I'm not about to go trawling back through huge volumes of previous breach data and parsing out phone numbers. But if there's a repeat of the Facebook situation in the future, I'll be well-positioned to get the data loaded in.
The Data Is... Varied
Every single time I deal with a large incident that's not sourced from one clear location (e.g. credential stuffing lists), I get a heap of "but my copy of the data is different" or "this and that is fake" messages (just look at the comments on this Gist as an example). I've had the same experience even just tweeting about the Facebook data in a Twitter thread.
I was sent data a couple of weeks before the headlines went nuts with reports of "533M Facebook Accounts". The data I was sent had 370M records and I later obtained the larger corpus which is now in very broad circulation. A lot of it is the same, but a lot of it is also different. Read through that thread and the discussion that has ensued but take one thing away with you now: there is not one clear source of this data and people will argue about numbers, formats and all sorts of other things. Consider this a "best effort" based on the information at my disposal.
There are still a few areas within the HIBP website that need to be updated to reflect the phone number paradigm (i.e. API docs and FAQs). I've prioritised making data searchable and will come back to these later.
There's no k-anonymity implementation for phone numbers at this point in time. Consequently, the model used by the likes of Mozilla and 1Password won't cover Facebook phone numbers, only email addresses. I'll revisit this in the future if there's sufficient demand.
The origin of all this data is still not clear. The initial set I was given adhered to a very consistent format, whereas the set in broader circulation is more varied, suggesting the data possibly came from multiple sources. Some people have suggested WhatsApp or Instagram as potential additional sources, but I've seen nothing to substantiate those claims.
Facebook are yet to put out a clear position on this. They've alluded to a 2019 incident being the root cause, but that doesn't go far enough to explain the data in circulation. There's a vacuum of information right now, and that vacuum is being filled by a lot of speculation.
And finally, one last note on the data load process: At the time of publishing this blog post, all phone numbers beginning with international codes "4", "6", "7" and "8" have completed loading. The other codes are in progress and may take several hours more before they're searchable. I'll add an edit below once I can confirm they're all complete.
Edit 1: "1" is also complete.
Edit 2: "3" and "8" are also complete.
Edit 3: All data has now completed loading