I've got 2 massive things to announce today that have been a long time in the works and, by pure coincidence, have aligned such that I can share them together here today. One you would have been waiting for and one totally out of left field. Both these announcements are being made at a time when Pwned Passwords is seeing unprecedented growth.
That's significant because the sheer volume of requests greatly amplifies the effectiveness of the announcements below. So, keeping in mind this will all be leveraged nearly 1 billion times a month (and much more in the future), read on...
Pwned Passwords is Now Open Source via the .NET Foundation
Back in August I announced that I planned to open source the HIBP code base. I knew it wouldn't be easy, but I also knew it was the right thing to do for the longevity of the project. What I didn't know is how non-trivial it would be for all sorts of reasons you can imagine and a whole heap of others that aren't immediately obvious. One of the key reasons is that there's a heap of effort involved in picking something up that's run as a one-person pet project for years and moving it into the public domain. I had no idea how to manage an open source project, establish the licencing model, coordinate where the community invests effort, take contributions, redesign the release process and all sorts of other things I'm sure I haven't even thought of yet. This is where the .NET Foundation comes in.
After announcing the intention to go open source, my friend and executive director of the foundation Claire Novotny reached out and offered support, thus beginning a new conversation. I've known Claire for years, first as another Microsoft Regional Director and subsequently as a Microsoft employee and Project Manager on the .NET team. But the .NET Foundation isn't part of Microsoft, rather it's an independent 501(c) non-profit organisation:
The .NET Foundation is an independent, non-profit organisation established to support an innovative, commercially friendly, open-source ecosystem around the .NET platform.
There's a whole page dedicated to the advantages of leaning on the .NET Foundation but in short, they have the answers to all the questions I have no idea about and the dependency HIBP has on the Microsoft stack makes it a natural fit. That it's staffed by a bunch of people I've known and respected for many years and in turn, people that are already familiar with HIBP, makes the decision even easier.
Speaking of natural fits, Pwned Passwords is perfect for this model and that's why we're starting here. There are a number of reasons for this:
- It's a very simple code base consisting of Azure Storage, a single Azure Function and a Cloudflare worker.
- It has its own domain, Cloudflare account and Azure services so can easily be picked up and open sourced independently to the rest of HIBP.
- It's entirely non-commercial without any API costs or Enterprise services like other parts of HIBP (I want community efforts to remain in the community).
- The data that drives Pwned Passwords is already freely available in the public domain via the downloadable hash sets.
So, I can proverbially "lift and shift" Pwned Passwords into open source land in a pretty straightforward fashion which makes it the obvious place to start. It's also great timing because as I said earlier, it's now an important part of many online services and this move ensures that anybody can run their own Pwned Passwords instance if they so choose. My hope is that this encourages greater adoption of the service both due to the transparency that opening the code base brings with it and the confidence that people can always "roll their own" if they choose. Maybe they don't want the hosted API dependency, maybe they just want a fallback position should I ever meet an early demise in an unfortunate jet ski accident. This gives people choices.
That's the open sourcing covered, but what Pwned Passwords really needs to be successful is fresh passwords as they become compromised, and this is where the FBI comes in.
The FBI's Feed of Pwned Passwords
As you can imagine, the FBI is involved in all manner of digital investigations. For example, they recently made headlines for their role in taking down the Emotet botnet in conjunction with their law enforcement counterparts in other parts of the world. They play integral roles in combatting everything from ransomware to child abuse to terrorism and in the course of their investigations, they regularly come across compromised passwords. Often, these passwords are being used by criminal enterprises to exploit the online assets of the people who created them. Wouldn't it be great if we could do something meaningful to combat that?
And so, the FBI reached out and we began a discussion about what it might look like to provide them with an avenue to feed compromised passwords into HIBP and surface them via the Pwned Passwords feature. Their goal here is perfectly aligned with mine and, I dare say, with the goals of most people reading this: to protect people from account takeovers by proactively warning them when their password has been compromised. Feeding these passwords into HIBP gives the FBI the opportunity to do this almost 1 billion times every month. It's good leverage 🙂
I asked the folks there if they'd like to add anything to this blog post and they provided the following statement:
We are excited to be partnering with HIBP on this important project to protect victims of online credential theft. It is another example of how important public/private partnerships are in the fight against cybercrime.
- Bryan A. Vorndran, Assistant Director, Cyber Division, FBI
The passwords will be provided in SHA-1 and NTLM hash pairs which aligns perfectly to the current storage constructs in Pwned Passwords (I don't need them in plain text). They'll be fed into the system as they're made available by the bureau and obviously that's both a cadence and a volume which will fluctuate depending on the nature of the investigations they're involved in. The important thing is to ensure there's an ingestion route by which the data can flow into HIBP and be made available to consumers as fast as possible in order to maximise the value it presents. To do that, we're going to need to write some code and thus begins the first piece of open source work for HIBP.
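To make the "hash pair" idea a bit more concrete, here's a minimal sketch of how an incoming pair might be sanity-checked before it goes anywhere near storage. The only facts it leans on are that a SHA-1 hash is 160 bits (40 hex characters) and an NTLM hash is 128 bits (32 hex characters). The names `HashPair` and `parse_hash_pair` are hypothetical and exist purely for illustration, and normalising to upper case is an assumption on my part rather than a description of the real code:

```python
import re
from dataclasses import dataclass

# SHA-1 is 160 bits, i.e. 40 hexadecimal characters.
SHA1_RE = re.compile(r"[0-9A-F]{40}")
# An NTLM hash is 128 bits, i.e. 32 hexadecimal characters.
NTLM_RE = re.compile(r"[0-9A-F]{32}")


@dataclass(frozen=True)
class HashPair:
    """Hypothetical container for one SHA-1 / NTLM pair (no plain text)."""
    sha1: str
    ntlm: str


def parse_hash_pair(sha1: str, ntlm: str) -> HashPair:
    """Normalise a pair to upper-case hex and reject malformed input.

    Upper-casing is an assumption made for this sketch.
    """
    sha1 = sha1.strip().upper()
    ntlm = ntlm.strip().upper()
    if not SHA1_RE.fullmatch(sha1):
        raise ValueError("SHA-1 hash must be 40 hexadecimal characters")
    if not NTLM_RE.fullmatch(ntlm):
        raise ValueError("NTLM hash must be 32 hexadecimal characters")
    return HashPair(sha1=sha1, ntlm=ntlm)
```

Anything that doesn't look like a well-formed pair raises a `ValueError`, so bad records can be rejected at the door rather than after they've been written somewhere.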
Help Me Build the Code for Password Ingestion
This is a great little first project to distribute to the community and I'm really excited not just about collaboratively working on the code, but that we're doing it in conjunction with a major law enforcement agency to make a positive difference to the world via a free community service. It's wins all round. Here's what I'm thinking:
- There's an authenticated endpoint that'll receive SHA-1 and NTLM hash pairs of passwords. The hash pair will also be accompanied by a prevalence indicating how many times it has been seen in the corpus that led to its disclosure. As indicated earlier, volumes will inevitably fluctuate and I've no idea what they'll look like, especially over the longer term.
- Upon receipt of the passwords, the SHA-1 hashes need to be extracted into the existing Azure Blob Storage construct. This is nothing more than 16^5 (1,048,576) different text files (because each SHA-1 hash is queried by a 5 character prefix), each containing the 35 character SHA-1 hash suffix of each password previously seen and the number of times it's been seen.
- "Extracted into" means either adding a new SHA-1 hash and its prevalence or updating the prevalence where the hash has been seen before.
- Both the SHA-1 and NTLM hashes must be added to a downloadable corpus of data for use offline and as per the previous point, this will mean creating some new entries and updating the counts on existing entries. Due to the potential frequency of new passwords and the size of the downloadable corpuses (up to 12.5GB zipped at present), my thinking is to make this a monthly process.
- After either the file in blob storage or the entire downloadable corpus is modified, the corresponding Cloudflare cache item must be invalidated. This is going to impact the cache hit ratio which then impacts performance and the cost of the services on the origin at Azure. We may need to limit the impact of this by defining a rate at which cache invalidation can occur (e.g. not more than once per day for any given cache item).
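To illustrate the steps above, here's a rough sketch in Python. It is not the real implementation: a plain dictionary stands in for Azure Blob Storage, all function names are hypothetical, and it makes three assumptions that still need deciding, namely that "updating the prevalence" means adding the incoming count to the existing one, that each file stores `SUFFIX:COUNT` lines, and that the once-per-day limit is tracked per prefix.

```python
from datetime import datetime, timedelta, timezone

PREFIX_LEN = 5  # each SHA-1 hash is queried by its first 5 hex characters

# Hypothetical in-memory stand-in for blob storage: prefix -> {suffix: count}
PrefixStore = dict[str, dict[str, int]]


def merge_hashes(store: PrefixStore, incoming: list[tuple[str, int]]) -> set[str]:
    """Merge (sha1_hex, prevalence) records into the per-prefix files.

    New suffixes are added; existing ones have their count increased
    (assumption: prevalence is additive). Returns the prefixes that changed.
    """
    touched: set[str] = set()
    for sha1_hex, prevalence in incoming:
        sha1_hex = sha1_hex.upper()
        prefix, suffix = sha1_hex[:PREFIX_LEN], sha1_hex[PREFIX_LEN:]
        bucket = store.setdefault(prefix, {})
        bucket[suffix] = bucket.get(suffix, 0) + prevalence
        touched.add(prefix)
    return touched


def render_file(bucket: dict[str, int]) -> str:
    """Serialise one prefix file as SUFFIX:COUNT lines (assumed layout)."""
    return "\n".join(f"{suffix}:{count}" for suffix, count in sorted(bucket.items()))


def prefixes_to_purge(
    touched: set[str],
    last_purged: dict[str, datetime],
    now: datetime,
    min_interval: timedelta = timedelta(days=1),
) -> set[str]:
    """Pick the changed prefixes whose cache item may be invalidated now.

    A prefix purged less than `min_interval` ago is skipped, which caps
    invalidation at once per interval for any given cache item.
    """
    due: set[str] = set()
    for prefix in touched:
        last = last_purged.get(prefix)
        if last is None or now - last >= min_interval:
            due.add(prefix)
            last_purged[prefix] = now
    return due
```

In this sketch, a prefix that was skipped by the throttle would need to be remembered and purged on a later run, otherwise the cached copy could stay stale. That's exactly the kind of design decision I'd love community input on.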
It's also my hope that the scope of this facility may expand in the future should other law enforcement agencies or organisations that come across compromised passwords wish to contribute. This is just a starting point and I'm really excited to see what direction the community will drive this in.
If I'm completely honest, I don't have all the answers on how things will proceed from here so let me just start with the basics: there's a Have I Been Pwned organisation in GitHub that has the following 2 repositories:
The .NET Foundation folks have helped me out with the Azure Function repository and the Cloudflare folks with the Cloudflare worker repository. They'll continue to help support them as community contributions come in and as the project evolves to achieve the objectives above re supporting the FBI with their goals. Running an open source project is all new for me and I'm enormously appreciative of the contributions already made by those mentioned above. Bear with me as I navigate my own way through this process and a massive thanks in advance to all those who decide to contribute and support this initiative in the future.
Just one more thing - there's a third repository in that organisation. Because there was so much enthusiasm over this 3D print earlier in the week, I've dropped the .stl into the 3D Prints repository so you can go and grab it and print it yourself. And if you don't have a 3D printer, I'll be sending a bunch of these out I've printed myself to people that make significant contributions to the project 🙂
"Pretty happy with this now, might need to start some mass production: pic.twitter.com/L3GkZOxBWZ" - Troy Hunt (@troyhunt), May 25, 2021