Troy Hunt: Observations and thoughts on the LinkedIn data breach

Last week there was no escaping news of the latest data breach. The LinkedIn hack of 2012 which we thought had "only" exposed 6.5M password hashes (not even the associated email addresses so in practice, useless data), was now being sold on the dark web. It was allegedly 167 million accounts and for a mere 5 bitcoins (about US$2.2k) you could jump over to the Tor-based trading site, pay your Bitcoins and retrieve what is one of the largest data breaches ever to hit the airwaves.

But this is not a straightforward incident for many reasons and there are numerous issues raised by the data itself and the nature of the hack. I've had a heap of calls and emails from various parties doing stories on it over the last week so I thought I'd address some of those queries here and add my own thoughts having now seen the data. I'll also talk about Have I been pwned (HIBP) and the broader issue of searchable breach data.

Why a 4 year lead time?

This is one of the most common questions that comes up - what's been happening since 2012? Why have we only just now seen the data? The easy answer is that I don't know and it's quite possible that LinkedIn doesn't know either.

I wrote a longer piece about this last week in my Security Sense column titled There's a Lot of Hacked Companies We Don't Even Know About and the title pretty much sums it up. I cite other incidents there which demonstrate how often it can be years - sometimes longer than the LinkedIn lead time - between the hack and the subsequent public release of the data. Inevitably there's a catalyst, but it could be many different things: the attacker finally deciding to monetise it, they themselves being targeted and losing the data, or ultimately trading it for something else of value.

But speaking of value, how much is the data actually worth?

Is the data worth $2.2k?

This is a recurring question - "is it worth it"? Are 167 million records really worth $2.2k?

Well firstly, I'm fond of the adage that "something is only worth what someone is willing to pay for it" and by all accounts, people have indeed paid for it (more on that later). But if you want to look at it another way, 167 million accounts selling for $2.2k is only 0.001 cents per account which, at least to me, feels very cheap indeed.
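To make that arithmetic concrete, here's a quick sketch of the per-account price using the figures quoted above. The US$2.2k figure is itself an approximation of 5 bitcoins at the time, so treat the result as a ballpark rather than a precise valuation; the function name is just an illustrative choice.

```python
# Figures quoted in the post; the listing price is approximate.
PRICE_USD = 2_200        # roughly 5 BTC at the time of the listing
ACCOUNTS = 167_000_000   # alleged number of accounts in the dump


def cents_per_account(price_usd: float, accounts: int) -> float:
    """Return the price of a single account in US cents."""
    return price_usd / accounts * 100


per_account = cents_per_account(PRICE_USD, ACCOUNTS)
print(f"{per_account:.4f} cents per account")
```

That works out to a little over a thousandth of a cent per account, which is where the rounded 0.001 cents figure comes from.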

Well it was 0.001 cents per account but already, we've seen that 5 BTC price drop:

The LinkedIn data has dropped in price and the seller is referencing media coverage of credentials being exploited pic.twitter.com/M8k06TBGjY
— Troy Hunt (@troyhunt) May 22, 2016

This is curious and I suspect that both the points I made in that tweet are related. Earlier this year I observed that HIBP was having an impact on data breach prices due to the increased awareness it raised with those who've had their data exposed. Whilst the data wasn't in HIBP at the time of that tweet, it's possibly what we're seeing here in terms of it having been spread around more broadly. In fact, since then, the price has fallen even further:

The LinkedIn data has dropped in price again, it's now almost half of what it originally was: pic.twitter.com/lLvoyBFmnl
— Troy Hunt (@troyhunt) May 23, 2016

The seller is quoted in a Motherboard story as saying:

[The] more i sell, and more days pass, [the] value drops

He then goes on to claim he's sold the data 6 times over for a grand total of $12k.

The lower price point will obviously make the data more accessible to more people, but it also likely indicates that the value is diminishing as the data is being abused. Speaking of which...

Accounts are being hijacked

The first screen cap I tweeted above references the Motherboard article about hackers hijacking the accounts of prominent individuals. There'll be a small window where this is possible and over time the likelihood of it will diminish as people change passwords on the various services where they've reused the one from LinkedIn. Let's be clear about that too - it's not just LinkedIn accounts that are at risk, the data dump puts many others at risk too.

I had someone contact me after receiving an email from Groupon who'd proactively reset his password right around the time the LinkedIn data started doing the rounds. I asked if perhaps it was related to LinkedIn and he replied:

I had long since changed my LinkedIn, but I guess they used that for Groupon

Then there was this case yesterday:

Possible a result of password reuse with LinkedIn from my less educated days? cc: @troyhunt
— Christopher Moore (@chrisdcmoore) May 22, 2016

It's extremely difficult to prove that the LinkedIn data was the source of subsequent account takeovers on Groupon or Rockstar because that's the nature of password reuse - information obtained from many different sites can authenticate someone to many other sites. But I've seen similar reports from individuals that seem to have increased in volume since the LinkedIn incident and it would be unusual for this not to happen. Treat this as merely anecdotal as I've no way to verify it, but certainly it's a pattern we've seen many times before.

But all of this points to another important observation about the breach: the data is spreading. Account hijacks and proactive password resets in other services both point to the likely redistribution of the data, but this next point absolutely and emphatically confirms how those 167M records are now doing the rounds.

Password cracking

Very soon after the original news last week, we began seeing analysis of the passwords in the dump. KoreLogic runs a password recovery service and obviously they're going to be interested in analysing data of this nature as their ability to crack passwords en masse demonstrates the effectiveness of their service. As of the last update before writing this post, they'd cracked 49,999,999 unique hashes with 11,863,000 remaining. However, that's a much greater proportion of all accounts than the numbers suggest due to multiple people using the same password as other users (there were 1,135,936 LinkedIn members using the password "123456"). 5 days ago (and things have moved on since then), 86% of all accounts had cracked passwords which is a staggeringly high number courtesy of the choice of SHA1 storage (without a salt, that is).
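A small sketch may help show why the share of cracked accounts runs ahead of the share of cracked unique hashes. With unsalted SHA1, everyone who picked the same password ends up with the same hash, so one successful guess exposes every account that shares it. The accounts and wordlist below are made up for illustration and bear no relation to the real data or to how KoreLogic actually works.

```python
import hashlib
from collections import Counter


def sha1_hex(password: str) -> str:
    """Unsalted SHA1 of a password, as a hex string."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


# Hypothetical accounts, invented purely for this example.
accounts = {
    "alice@example.com": "123456",
    "bob@example.com": "123456",
    "carol@example.com": "linkedin",
    "dave@example.com": "123456",
    "erin@example.com": "x9$Tq!v7Lr#2",
}

# What an unsalted store would hold: identical passwords, identical hashes.
stored = {email: sha1_hex(pw) for email, pw in accounts.items()}
hash_counts = Counter(stored.values())

# A tiny attacker wordlist; hashing each guess once covers every account.
wordlist = ["123456", "password", "linkedin"]
cracked = {sha1_hex(guess): guess for guess in wordlist}

unique_cracked = sum(1 for h in hash_counts if h in cracked)
accounts_cracked = sum(n for h, n in hash_counts.items() if h in cracked)

print(f"unique hashes cracked: {unique_cracked} of {len(hash_counts)}")
print(f"accounts exposed:      {accounts_cracked} of {len(stored)}")
```

In this toy data set two of the three unique hashes fall to the wordlist, yet four of the five accounts are exposed, because three people chose "123456".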

There's a bunch of stats on that blog about the choice of passwords but there's really not much new information to glean; people still make bad password choices. For me, what was more interesting about the whole thing was to witness both how the data was spreading and how comprehensively the weak cryptographic storage was being cracked.

It's not just KoreLogic looking at cracking passwords though, there was this from Jeremi Gosney shortly after news of the leak broke:

First 24 hrs: 151,768,060 / 177,500,189 (85.5%) passwords cracked. Large list + high hit rate slowing things down. #leakedin #passwords16
— Jeremi M Gosney (@jmgosney) May 20, 2016

The cracking stats from professional password crackers are one thing, but we're also seeing cracked passwords turn up in other places too, for example:

Cracked Passwords

This is on a site which frequently redistributes data breaches, a very different category of site to KoreLogic's with a very different MO to how Jeremi Gosney operates (he frequently provides services to law enforcement). It's just another observation about how the data is spreading.

Many LinkedIn members have not received emails (but some non-members have)

One of the things I've found quite intriguing about how LinkedIn has dealt with this data breach is the way in which notifications have been delivered to those impacted. For example, they started sending out emails to people the day after the news broke last week yet as of now, neither I nor a number of other people who were members before 2012 have received an email:

@troyhunt and still no request from LinkedIn to change my password!
— David (@duckinwales) May 23, 2016

We were definitely in the data breach, but no email. Oddly enough, both David's and my records don't have associated password hashes in the breach data. If notifications are only being sent based on data attributes that are present (or not present) in the leaked dataset, does this mean LinkedIn are sending emails based on the data floating around publicly rather than what was in their membership database back in 2012?

Edit: It's possible that emails were only sent to those who hadn't changed their password since the 2012 incident. If this is the case, many people now have a password - possibly the password they used everywhere - floating around without having been notified. I'm only speculating here but a couple of people made that suggestion and it certainly seems feasible.

Conversely, other people who weren't LinkedIn members have allegedly received notifications. The allegation being made here is that as I observed above, LinkedIn contacted people based on the data in the breach that was being sold. I'm not certain if this is true and there are other possible explanations for how this might happen. But be that as it may, based on what's been written in that post (and keep in mind there's not enough information there to verify the claims), the decision process behind who gets emailed and who doesn't appears to be quite unusual.

Another thing that struck me as odd with the emails is the call to action:

Here the LinkedIn password reset email some people are receiving: pic.twitter.com/qhGM7jFLoh
— Troy Hunt (@troyhunt) May 19, 2016

Edit: Note Harry's comment below and my response regarding the next paragraph. I'm inclined to agree with him and say I've misinterpreted the next paragraph, but I still find the wording ambiguous as to whether the old password remains active or not.

What this is saying is that the user's current password (which for many people, will be the one in the data breach), is still active. It can be used to login to the site after which the person using it (be that the legitimate owner or someone else who obtained the password) can then set a new one. Now clearly if they'd effectively invalidated all impacted accounts right off the bat it would have caused other issues and perhaps this was ultimately the best of a bad set of options, but it did strike me as unusual.

Having said that, there was another story yesterday that quoted LinkedIn as having now invalidated old passwords:

We have invalidated the passwords of all accounts that were created prior to the 2012 breach that hadn’t updated their password since then

The statement isn't entirely clear, but I'd like to believe it means people who hadn't changed their passwords are now getting reset emails as opposed to being able to use those old passwords in any sort of functional way. Then again, you could also read it as meaning an attacker who logged in and changed a victim's password before this statement came into effect still has control of the account. And of course none of this is at all relevant to the fact that other services where passwords were reused likely remain vulnerable.

But there's another problem with breach notification emails like these and it's one we see time and time again after any significant event like this...

Phishing

Inevitably, we'll see a heap of stuff like this:

@troyhunt Pls help! after getting this email yesterday,today while changing my LI password,I've lost all my professional data.@LinkedInHelp
— Debayan Das(Vikram) (@debayan4u) May 20, 2016

It's not immediately clear as to whether or not Vikram did indeed receive a phishing email, but we know scammers frequently leverage events like this to extract personal data from people. For example, after Heartbleed back in 2014 we saw a surge of phishing emails preying on the fact that people were expecting password reset notifications. It's a clever piece of social engineering that exploits the fact that many people will have their defences down - "oh, that'll be the password reset I heard I should be expecting, let me just enter my credentials..."

As painful as this entire incident is becoming for some LinkedIn members, it's hard to blame the choice of password storage on the folks who are now dealing with this incident. Here's why:

Inheriting a data breach

There's been some pretty vocal criticism of LinkedIn's handling of the breach and Per makes many important observations in there, including ones similar to mine regarding the reset process. From a pure security perspective, I agree wholeheartedly with him, but I'm also a little sympathetic at the same time.

Spare a thought for Cory Scott, LinkedIn's CISO. Cory has publicly blogged about the hack and is no doubt bearing the brunt of the subsequently unfolding security dramas. But the attack didn't happen under his watch; he inherited what would later become a very public incident after arriving at the company in 2013.

There will be many others in the same boat as Cory who are now finding themselves at the receiving end of feedback such as this:

#linkedin and their #stupid way of saving #passwords in SHA1 unsalted. Very professional... OHG.
— leosos (@leosos) May 20, 2016

Of course SHA1 was a bad choice. It was a bad choice in 2012 let alone years later when the news is hitting (although I believe the password hashing algorithm was changed around the time of the original leak 4 years ago). But it wasn't Cory's choice and it didn't happen under his watch.
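For contrast, here's a minimal sketch of what salted, deliberately slow password storage can look like using PBKDF2 from Python's standard library. This is an illustration only: I don't know what LinkedIn moved to, and the iteration count, salt length and function names here are assumptions rather than a recommendation for any particular system.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # assumed work factor for illustration; tune to your hardware


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) using PBKDF2-HMAC-SHA256 with a per-user random salt."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


# Two users with the same password no longer share a stored value.
salt_a, digest_a = hash_password("123456")
salt_b, digest_b = hash_password("123456")
print(digest_a != digest_b)
print(verify_password("123456", salt_a, digest_a))
```

The salt means an attacker can't crack every "123456" in one go, and the iteration count makes each individual guess far more expensive than a single SHA1 computation. A weak password is still a weak password, it just takes longer to fall.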

The point is that the folks dealing with this incident today are cleaning up someone else's mess. I'm sympathetic to those dealing with the breach and as much as we may feel tempted to blame LinkedIn as an organisation, it's worth remembering there are people dealing with this that are having a pretty miserable time through absolutely no fault of their own.

But of course for me, one of the things at the forefront of my mind after such an incident is how I'm going to deal with the data as it relates to HIBP. This is a bit nuanced, so let me try and fully explain it here:

Have I been pwned

I want to talk about HIBP and the LinkedIn data in a moment, but I want to share a portion of an email I received from someone this weekend first:

I'm not sure if you're aware of leakedsource.com - it's vaguely similar to haveibeenpwned.com in that it allows visitors to find out if they appear in dumps, except it feels like it's being run with more criminal intent. In contrast to your site, they don't give any background on themselves, they also have a paid subscription which allows access to the entire database, including other people's details and plain text passwords. On top of that, to remove one's self from the database for free, you're expected to send even more personal data to them, and then it's done manually, for which they conveniently have a "huge backlog".

I've been aware of Leaked Source for a while and I have thoughts on how they operate which are similar to what's represented in the comment above. What the person who sent me that email wasn't aware of at the time though was that Leaked Source subsequently received some pretty stern words from LinkedIn, most notably the following:

We have demanded that parties cease making stolen password data available and will evaluate potential legal action if they fail to comply

Obviously redistributing passwords is not going to go down well given the likelihood of abuse not just on people's LinkedIn accounts, but on their other online assets too. I don't necessarily think these guys have malicious intent, but I hope they can find a way to run the service that doesn't involve exposing sensitive data such as credentials (to their credit, they did remove the LinkedIn passwords from visibility after they got the notice). I'm not going to comment further on their service in particular but I do think it's important to talk about how data breaches are handled.

Following many of these incidents, breached data is publicly distributed and easily come by; in fact, it was in the wake of the broadly redistributed Adobe breach of 2013 that I originally created HIBP. Since then we've seen many other times where breached data has rapidly spread across the web including Ashley Madison (probably the most downloaded breach of all time) and more recently, the Philippines Electoral Commission breach that impacted 55 million people. But just as there is an opportunity to do useful things with the data that genuinely helps those impacted, it's also very easy to make the whole situation a lot worse for all those involved.

My views on how data breaches are handled continue to evolve over time as I understand the impact of them better. For example, before the Ashley Madison data even went public I'd decided that it should never be publicly searchable. Whilst I knew it would be broadly distributed if it ever saw the light of day, it wasn't going to be HIBP that told someone's boss or wife or kids that the guy had an AM account. With the benefit of hindsight, that's almost certainly the single biggest factor that ensured I never received a DMCA take down from Avid Life Media when so many other search services did (many would actually return all the data found on someone to anyone who searched for it).

More recently, I made the decision to permanently delete the VTech data. Not just make it privately searchable but actually permanently and irrevocably delete it. I explain why in that blog post and it essentially boiled down to me being one of only 3 people to ever have it (including the since-arrested hacker and the journalist who sent it to me), combined with the fact that the data included kids who just shouldn't be caught up in this sort of thing (their average age was only 5 years old). A year ago I probably wouldn't have seen that as being necessary but again, my views are evolving as I better understand the data breach landscape.

What I'd most love for those of us dealing with this class of data to continually ask ourselves is this question:

How can we help people impacted by data breaches without making life worse for them?

I have no problem with people commercialising services that do that, but let's do it in a way that doesn't put people at further risk.

In terms of LinkedIn, a couple of days after the incident I had multiple people approach me with the data, people from very different walks of life at that (I highly doubt they'd shared the data with each other). Combined with the other indicators I mentioned earlier suggesting that the data is now starting to spread, I made the call to load it and as of now, it's searchable within HIBP. As with any other data breach that's searchable, it remains only searchable by email address and will only return a [yes/no] for whether an account exists in a breach, never any data attributes contained in the breach such as passwords.
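To illustrate the "yes/no only" idea, here's a hypothetical sketch of a lookup that keeps just the email address from each breached record and can only ever answer with a boolean. It is not how HIBP is actually implemented; the class, method names and record layout are all invented for the example.

```python
from typing import Dict, Iterable, Mapping, Set


class BreachIndex:
    """Hypothetical index that stores only email addresses per breach."""

    def __init__(self) -> None:
        self._emails_by_breach: Dict[str, Set[str]] = {}

    def load(self, breach_name: str, records: Iterable[Mapping[str, str]]) -> None:
        """Keep the normalised email from each record and discard everything else."""
        emails = {r["email"].strip().lower() for r in records if r.get("email")}
        self._emails_by_breach.setdefault(breach_name, set()).update(emails)

    def is_in_breach(self, breach_name: str, email: str) -> bool:
        """Answer yes or no; no other attribute is available to return."""
        return email.strip().lower() in self._emails_by_breach.get(breach_name, set())


index = BreachIndex()
index.load(
    "LinkedIn",
    [
        # Made-up record; the hash field is dropped when the record is loaded.
        {"email": "Alice@Example.com", "password_hash": "not-stored"},
    ],
)
print(index.is_in_breach("LinkedIn", "alice@example.com"))
print(index.is_in_breach("LinkedIn", "nobody@example.com"))
```

The design choice that matters is that the password data never makes it into the index at all, so there's nothing sensitive to leak through the search feature even if someone tries.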

I took some time before loading the data to get a good sense of how it was spreading, who had it and whether it was being abused. It's now obvious that somewhat predictably, it's in many hands and people are doing bad things with it. One of the key factors that ultimately drives me to load any data is that people have come to rely on HIBP as a trustworthy if not canonical resource for data breach information. I've had a lot of queries like this:

@troyhunt Do you know if https://t.co/bVYm1Ugdvk use the data of the last linkedin leaks?
— Tom (@kermiite) May 23, 2016

Not only was there a marked up-tick in traffic once the LinkedIn news broke, I've also now got 430k subscribers whom I've committed to notifying when they're exposed in a breach. They don't pay me anything (it's a free service), but this is precisely the type of incident they're relying on me to notify them of.

This breach now pushes HIBP past half a billion records. Thank you to everyone who's supported the project via encouragement, suggestions, technical know-how and of course donations. Whilst I don't particularly want to see circumstances that add another half billion records, I'll continue to run the service whilst it remains both useful and viable.
