
I'm Open Sourcing the Have I Been Pwned Code Base

Let me just cut straight to it: I'm going to open source the Have I Been Pwned code base. The decision has been a while coming and it took a failed M&A process to get here, but the code will be turned over to the public for the betterment of the project and frankly, for the betterment of everyone who uses it. Let me explain why and how.

HIBP is a Community Project

Lately, I've been giving a great deal of thought to how I want this project to evolve, especially in the wake of the M&A process that ended earlier this year right back where I'd started: with me being solely responsible for everything. The single most important objective of that process was to seek a more sustainable future for HIBP and that desire hasn't changed; the project cannot be solely dependent on me. Yet that's where we are today and if I disappear, HIBP quickly withers and dies.

As I've given further thought to the future since the M&A process, the significance of community contributions has really hit home. Every single byte of data that's been loaded into the system in recent years has come from someone who freely offered it in order to improve the security landscape for everyone. Many of the services that HIBP runs on are provided free by the likes of Cloudflare. Much of the code that's been written has drawn on community contributions either by virtue of content people have published publicly or support that's been provided to me directly.

I was reminded of this just yesterday when my friend from Cloudflare, Junade Ali, posted this:

This tweet isn't entirely accurate; it was all Junade's idea and he designed the k-anonymity implementation for HIBP's Pwned Passwords. For free, because he's a good bloke and Cloudflare supported him. LastPass has now employed that same model and they follow the other notable names Junade mentioned. I'm sure I speak for him as well when I say we couldn't be happier that other companies have taken the model we pioneered and applied it to their own services too because at the end of the day, that's in everyone's best interests.
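
For readers who haven't seen the k-anonymity model in action, here is a minimal Python sketch of the client side of it. It is not HIBP's own code; it assumes the publicly documented Pwned Passwords range format, where the client SHA-1 hashes the password, sends only the first 5 hex characters of the hash, and matches the remaining 35 characters locally against the SUFFIX:COUNT lines that come back. The function names and the sample response below are hypothetical and exist only for illustration.

```python
import hashlib
from typing import Tuple


def split_hash(password: str, prefix_len: int = 5) -> Tuple[str, str]:
    """Hash the password with SHA-1 and split the hex digest in two.

    Only the prefix would ever leave the client; the suffix stays local.
    """
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:prefix_len], digest[prefix_len:]


def count_in_range(suffix: str, range_body: str) -> int:
    """Look for our suffix in a range response of SUFFIX:COUNT lines.

    Returns the reported count, or 0 if the suffix is not present.
    """
    wanted = suffix.upper()
    for line in range_body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == wanted:
            return int(count)
    return 0


if __name__ == "__main__":
    prefix, suffix = split_hash("password")
    # Made-up response body standing in for what a range lookup on
    # `prefix` might return; the counts here are not real data.
    sample_body = "0" * 35 + ":3\r\n" + suffix + ":42\r\n"
    print(prefix, count_in_range(suffix, sample_body))
```

The point of the design is that the service only ever sees a 5-character prefix shared by many different hashes, so it can't tell which password (if any) the client was actually checking.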

The philosophy of HIBP has always been to support the community; now I want the community to help support HIBP.

Open sourcing the code base is the most obvious way to do this. It takes the nuts and bolts of HIBP and puts them in the hands of people who can help sustain the service regardless of what happens to me. But this isn't just a philosophical decision based on a desire to offload work, it's also common sense for a number of reasons. Let me explain:

HIBP Has Always Been Open in Spirit

I've already written extensively about the architecture of the system across many of the 128 previous blog posts tagged as Have I Been Pwned. The very second blog post on that tag was about how I used Azure Table Storage to make it so fast and so cheap. As soon as it got popular, I wrote about how I optimised it for performance. When I started using Azure Functions, I wrote about the joy of serverless computing and how I'd implemented it in HIBP. I levelled that up even further when I wrote about using Cloudflare Workers to further optimise performance and drive down cost.

The point is that it was always the intention to be completely open about the design of HIBP; it's not like there's any proprietary secret sauce I've been trying to protect here.

Open Source is Everywhere

A heap of really amazing projects are open source these days. Visual Studio Code, for example, is open source. The platform this very blog runs on, Ghost, is open source. Most of the libraries HIBP uses are open source. And I'm not just talking open source in the "source open" kind of way where other people are free to read it, but I'm talking open source in terms of taking contributions as well.

It actually got me thinking - how many of the products and services I use every day are open source? I asked on Twitter earlier today, and it's, well, extensive:

I also love that Microsoft remains one of the largest corporate contributors to open source, maybe even the largest depending on how you want to define the metric. Open source is in the DNA of everything that HIBP is built on.

Because Transparency

Putting the code out there in public goes a long way to addressing concerns people have about the way the service operates. For example, people have often questioned whether I'm logging searches in order to build up a new list of email addresses. No, I'm not, but at present that assertion effectively just boils down to "trust me". Showing the code - the actual code - and demonstrating that things aren't logged is a very different proposition.

Transparency of code mirrors the ethos I've applied time and time again to the way I run HIBP. I'm transparent about how I verify data. I'm transparent about government usage of the service. I'm transparent when I screw up and have system outages. Being transparent with the code feels like the most natural thing ever!

It's (Almost) All About the Contributions

Open sourcing the HIBP code base gives me the opportunity to address that original problem I set out to solve with the M&A process: finding other people that can help sustain the project. All that backlog, all those bugs, all the great new ideas people have but I simply can't implement myself can, if the community is willing, finally be contributed back into the project.

And that's something that I'm adamant about; the goal here isn't just to say "hey, look at the code, it's not logging your searches", it's fundamentally about making HIBP a more sustainable, more robustly featured community service. Frankly, I can't think of a single good reason why I wouldn't do this. But that said, it's also not as trivial as it sounds so let me talk about the practicalities of the whole thing.

Practically, There's Work to be Done

I started writing HIBP on a plane to the Philippines in 2013 and finished up a bunch of it in a hotel room once I landed. In the near 7 years since then, I've chipped away at it in little bits and pieces, frequently from a laptop while travelling, jet lagged and preoccupied. I've taken shortcuts. I've hacked together some pretty messy stuff. I've probably checked in secrets before and when you're the only person touching a project you can get away with all that stuff, but not once you start opening up source.

HIBP isn't in a state where I can simply flick its visibility in GitHub over to public, but it needs to get to that point. To get there, I need to choose the right parts of the project to open up in the right way at the right time. That exercise alone requires help and for a while now, I've been talking to some of the smartest people I know in this space. People who live and breathe open source, people who understand .NET and Azure inside and out, people who know HIBP well and above all, people I trust to expose my own shortcomings so that they can help me make this thing more sustainable. With their support, the transition from completely closed to completely open will happen incrementally, bit by bit and in a fashion that's both manageable and responsible. Let me be clear: I don't have a timeline for each step along the way yet as HIBP remains something I do in my spare time and I've always got a bunch of other stuff on my plate, but the process has already begun and I'll be sharing more on that as soon as I can.

I want to get to a point where everything possible is open. I want the infrastructure configuration to be open too and I want the whole thing to be self-sustaining by the community such that I make myself redundant. That's not to say I'm planning an exit (far from it), but it's not good for HIBP that I can't exit right now and frankly, it's not good for me either.

The point is that the goals outlined in this blog post will take time to reach and they're not as trivial as they may sound at face value. HIBP remains a pet project run when I have the chance and somewhere within there I need to make the commitment to get it to the point I'm aiming for in this blog post.

What About the Data?

I need to really clearly break this part of the discussion out because whilst open sourcing the code base is one thing, how the data is handled is quite another. There's no way to sugar coat this so I'll just lay it out bluntly: HIBP only exists due to a whole bunch of criminal activity resulting in data that's ultimately ended up in my possession. Of course, the situation is a bit more nuanced than that with the vast bulk of data in HIBP already being in broad public circulation and passing through many hands. But be that as it may, even the legality of possessing it remains grey and whilst there are many internet armchair experts chiming in with their own opinions on the topic, here's what the legal guidance I've consistently been given boils down to:

We invite parties to form their own views on the legality of the data

Great, nice lawyer speak there guys. (And seriously, yes, that's what the KPMG lawyers from the M&A process advised, and I paid them an eye-watering amount for it.) Yet clearly, many of the world's largest companies do see value in it and conclude that holding the data is acceptable. Big tech companies, for example, pull down precisely the same breaches that go into HIBP and use them to identify credential reuse across their own platforms:

Then there's the privacy side of it all: my own personal data is in those breaches and your data almost certainly is too because there are literally billions of people that have been impacted by data breaches. Regardless of how broadly that information is circling, I still need to ensure the same privacy controls prevail across the breach data itself even as the code base becomes more transparent. That's non-trivial. Doable, but non-trivial.

Summary

This is something I've given a lot of thought to for a long time now. The concept of open sourcing HIBP has been floated over and over again and it's taken a failed M&A process to help me realise that this was the best path forward, but now here we are.

I've used the word "community" a lot throughout this post and I can't overstate the importance of the role other people have played in the project's success. Just to really drive that point home, look at how many breaches have gone into HIBP in the last two weeks. At the time of writing, that's 16 breaches encompassing 95,850,490 records and every single one of those has been a community contribution; someone selflessly standing up and trusting me to handle the data in the best interest of others. I focus on that short time frame in particular here because it also demonstrates the constant flood of data and the need to scale myself more efficiently.

So that's where HIBP is heading. I know this blog post will be met with much enthusiasm because that's what many of you have been telling me to do for a long time. I've listened, now it's time to make it a reality 😊

Hi, I'm Troy Hunt, I write this blog, create courses for Pluralsight and am a Microsoft Regional Director and MVP who travels the world speaking at events and training technology professionals