Working on Have I been pwned (HIBP), I come across a lot of interesting things. Interesting people dealing in data breaches, interesting vulnerabilities in systems which have been compromised and interesting requests from people wanting the data. In fact, I was getting so many requests for data I ended up writing No, I cannot share data breaches with you where I very explicitly laid out how I wouldn’t give people their own record from a breach, I wouldn’t give the data to researchers and I wouldn’t trade data breaches. I still hold that view – nothing has changed there – but I’ve been receiving some requests recently for access to data which is causing me to stop, think and, well, write this blog post seek your feedback. Let me explain.
HIBP is used in a number of different ways by large organisations. Some of them rely on the public API to check the exposure of their users and notify them, many have domain subscriptions which send them a notification if one of their colleagues is in a breach and a small number are using the commercial callback implementation to notify them when an account they’re monitoring is impacted. However, there are a few things stopping other orgs from using the service in any of these ways, things I can’t overcome with the current model.
The barriers I’m hearing from organisations who would like access to the data to do good things (I cannot emphasise that point enough) are two-fold; service location and privacy. On the former, these orgs are typically European and are beholden to local legislation which doesn’t like their data being sent outside the EU and subscribing emails to or querying HIBP would do just that. I could stand up an EU instance of the service, but it doesn’t solve the next issue which is privacy. These companies are understandably worried about sending me any of their data. They don’t want a situation regardless of where the service runs where I know who their customers are and I totally get that – I’d have the same concern.
So it leaves me in somewhat of a quandary; these organisations want to do good things with the data but my existing constructs make that impossible. Let me talk about the sorts of things they want to do though as that will put things in more perspective and I’ll start with an example. Someone sent me this from Amazon just the other day, have a read:
From: email@example.com Subject: Your Amazon password has been changed Date: March 16, 2016 at 4:02:41 PM CDT To: [redacted]
At Amazon we take your security and privacy very seriously. As part of our routine monitoring, we discovered a list of email address and password sets posted online. While the list was not Amazon-related, we know that many customers reuse their passwords on several websites. We believe your email address and password set was on that list. For your security, we have assigned a temporary password to your account.
You will need to reset your password when you return to the Amazon.com site. To reset your password, click "Your Account" at the top of any page on Amazon.com. On the Sign In page, click the "Forgot your password?" link to reach the Amazon.com Password Assistance page. After you enter your email or mobile phone number, you will receive an email containing a personalized link. Click the link from the email and follow the directions provided.
Your new password will be effective immediately. We recommend that you choose a password that you have never used with any website.
This is actually really cool – Amazon are sourcing data breaches then notifying their customers when they find a credential match. This isn’t entirely new, in fact I wrote about how both them and LinkedIn were doing it back in November. There are many fascinating things about this: Firstly, they’re actively acquiring data breaches, an activity that some have suggested is on the shady side of legal yet here we have some of the web’s largest properties doing it to do good things. Secondly, given that the vast majority of breaches have some form of cryptographic storage for passwords (even if it’s just plain old MD5), they’re either cracking hashes or hashing their customers’ passwords with the algorithm of the breached site when they log in (or register or change their password). It’s quite possibly the former as the individual who sent me the email above noted that it was an account he hadn’t used for years.
I can only see upsides to the likes of Amazon doing this. They’re decreasing the risk to the individual and they’re decreasing the likelihood of them having to deal with account takeovers by malicious parties. If they get onto this fast enough after a breach is found (and that’s always going to be the challenge), they have the ability to put a serious dent in the value of the data for those who wish to do harm.
All this brings me back to the requests I’ve been getting for access to HIBP data. These are coming from large names you’d recognise in various technology sectors and the common thread from these legitimate big players is that they genuinely want to improve the security of their existing customers. They want to do the same thing as Amazon and LinkedIn except they want the ability to consume that data “as a service”. But they don’t want the full data breach; they don’t want birth dates and genders and in a case like VTech, the names of people’s kids or for Ashley Madison, what your bedroom preferences are. They only want email addresses and passwords and in some cases, only just the email addresses.
I’ve thought of many different ways to do this without sending large volumes of data to the organisations keen on this approach. For example, I could have them send me a hash of the email they want to monitor and then hash each incoming email in a new breach and match based on those. In fact, this is already a suggestion in the HIBP User Voice but the problem is that I can still derive the original value of an address once I match it to one in a breach and they’re still “giving” me addresses which is against their rules. I’ve thought of just providing directions on where an org can obtain the breach from but not only is this not always possible (many are sent to me privately), there’d still be a bunch of processing work required on their behalf, work that I already do and they’d like to consume “as a service”.
The more I’ve thought about this, the more I keep coming back to one key principle: could this help data breach victims? Would it reduce the number of accounts they’ve reused credentials across getting compromised and increase their awareness of the risks they face? It’s an emphatic “yes” and the remaining questions about whether this should happen or not are related to the downsides. Would this violate their privacy? That’s largely dependent on how the organisation I send this to handles it. Could it be done securely? Certainly we have mechanisms to transfer data in a highly secure fashion, it potentially wouldn’t even need to touch the Azure-based HIBP service (i.e. could be distributed directly from the environment in which the processing is done).
Here’s what I’d like from you: On balance, is this a good thing for data breach victims? Which organisations should have access to this data? What controls would you put in place? Is this something that would be valuable to your organisation? Anything else?
If you’ve been following HIBP for any period of time, you’ll know that I’m super cautious around issues of privacy and using the data for good. The last thing I’d ever want is to have all sorts of issues arise that I’d never even thought of. It’s not a small thing either – I’d have to make it a commercial offering – and there’d be a significant time commitment from me not just to build the mechanism but also each and every time I process a data breach. An agreement in place with a consuming organisation would also need to strictly outline how the data may be used, for example only to provide services that improves the security of their existing customers and not to elicit business from new ones. But I think it has the potential to do good things in the right organisations, such as what you see above with Amazon.
Leave your comments below folks or if you want to share anything privately, see the contact page. I’d love to hear your feedback.
Update: Have a read of Nzall's comment about hashing. This sounds like a good model for reducing the risk of abuse as the customer would need to hash and compare each and every email on their end, they could never simply pull the entire list of addresses from the breach.