Troy Hunt: Gab Has Been Breached

I've investigated hundreds of data breaches over the years (there are 514 of them in Have I Been Pwned as I write this), and for the most part, the situation with Gab is just another day on the internet. But Gab is also different, having grown dramatically in recent months as an alternative to mainstream incumbent platforms such as Twitter and Facebook and drawing a crowd primarily focused on right wing American politics.

A couple of days ago, I posted a thread about their alleged breach. I want to go back through that thread here, explain the thinking further and then provide some commentary on the actual data that was exposed. It all began here:

So, the @getongab data breach situation: Let's start the bizarreness with their CEO's ridiculous statement tweeted yesterday: https://t.co/NyKmEPI0I8
— Troy Hunt (@troyhunt) March 2, 2021

Much of the problem with objectively discussing this breach is that it's impossible to escape the transphobic slurs and religious rhetoric being dished out from the guy at the top. I don't care which god (or demon) you've picked, nor what gender you were born with (or if you decided to change it at some time), nor do I care whose politics you like and whose you don't, I only care about the data. More specifically, I care about the data that's been exposed in the breach, especially when that data may include my own (I'm very serious).

This came a couple of days after their post about an "alleged data breach" which is full of pretty bizarre statements: https://t.co/qmSIkdKg4l
— Troy Hunt (@troyhunt) March 2, 2021

It's pretty standard practice for an organisation to post a public statement following a breach or even, as the opening sentence of that page suggests, an "alleged" breach. Most organisations begin with "we take the security of your data seriously", layer on lawyer speak, talk about credit cards not being exposed and then promise to provide further updates as they come to hand. Gab's approach... differs:

For example, because they couldn't find any public discussion about the breach they assumed that @WIRED reporters were "essentially assisting the hacker in his efforts to smear our business". There are *always* discussions held in private about a breach before it's made public.
— Troy Hunt (@troyhunt) March 2, 2021

Because Gab "searched high and low for chatter on the breach on the Internet and found nothing", they've drawn the conclusion that reporters are maliciously working with hackers. I've had dozens of occasions where I've known about a breach, there's been no public discussion on it, and I've worked with reporters to help get to the bottom of what's happened. This is normal. It's so normal that the last time I did this was earlier this week with Lawrence Abrams from Bleeping Computer on the Dutch Ticketcounter breach.

"It is standard practice for passwords to be hashed. If the alleged breach has taken place as described, your passwords have not been revealed." This is misleading and ignores the simplicity of hash cracking. If your password is "maga2020!" (or similar), it has been revealed.
— Troy Hunt (@troyhunt) March 2, 2021

If you're not familiar with hashing, how it's not the same as encryption and how it can still leave passwords vulnerable, read this primer from September first. As it relates to passwords being revealed, you can't "unhash" a hash in the same way as you can decrypt an encrypted piece of text. However, you can always guess passwords, hash them with the same algorithm (and salt if present) and see if the output matches what was stored. For example, when I wrote about the Dropbox hack in 2016, I was able to verify my own record simply by hashing the password I had stored in 1Password and comparing the output to the one in the breach. It matched, therefore verifying the legitimacy of the breach. The following year I showed how even though CloudPets had chosen the very robust bcrypt algorithm for password storage, I was still able to crack a bunch of them courtesy of their extremely weak password rules.
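To make that guess, hash and compare loop concrete, here's a minimal Python sketch. It's an illustration only: it uses PBKDF2 from the standard library as a stand-in for whatever algorithm a real site uses (bcrypt, for instance, needs a third-party package in Python), and the salt, iteration count, function names and guess list are all made up for the example.

```python
import hashlib
import hmac

# Assumption: a deliberately low iteration count so the demo runs quickly.
ITERATIONS = 1_000


def hash_password(password: str, salt: bytes) -> bytes:
    """Hash a password with PBKDF2-HMAC-SHA256 (a stand-in for the site's real algorithm)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)


def guess_password(stored_hash: bytes, salt: bytes, guesses):
    """Hash each guess with the same salt and return the first one that matches, else None."""
    for guess in guesses:
        if hmac.compare_digest(hash_password(guess, salt), stored_hash):
            return guess
    return None


# Hypothetical leaked record: a salt and the hash of a weak password.
salt = b"demo-salt"
stored = hash_password("maga2020!", salt)

found = guess_password(stored, salt, ["password", "letmein1", "maga2020!"])
missed = guess_password(stored, salt, ["password", "letmein1"])
print(found, missed)
```

Nothing is ever "unhashed" here. The hash is only as protective as the password is hard to guess, which is why a strong algorithm doesn't save a weak password.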

"It is entirely possible for a user of the site to be unidentifiable based on the information they provide at login." You login with your email address. This (almost always) identifies you, it's literally how people communicate with *you*! pic.twitter.com/86klDc37nF
— Troy Hunt (@troyhunt) March 2, 2021

I do actually agree with the quoted sentence insofar as someone could create an email address completely disassociated from them, register for Gab and then log in with that account. But that almost never happens because Gab is used by normal humans just wanting to interact with other normal humans and it's not a platform where people are likely to take extra precautions to conceal their true identity. When faced with a registration form that requests an email address, the vast majority of people will simply provide the same email address they use everywhere else, hence my "almost always" comment.

"In our subscriber records we do not collect health or financial information; we do not collect dates of birth; we do not collect [blah blah]." When you've just had your neo-Nazi hate speech associated to your email address leaked, DoB is the least of your worries!
— Troy Hunt (@troyhunt) March 2, 2021

This isn't an unusual response to a data breach; many companies try to downplay the significance in order to reduce the perceived impact of it. I wrote about this in 2015, specifically as it relates to organisations focusing on the security of credit cards which are one of the most easily replaceable and low-impact classes of data to have exposed. All of the classes of data Gab mentions pale in comparison to the impact of having extremist messaging exposed in connection to a personally identifiable data attribute such as someone's email address. And regardless of your political persuasion, it's clear that a platform designed to have a bare minimum of controls on content (although they do define content standards) is going to attract and retain more extreme views; that's part of the attraction for many people.

"Every major tech company – from Facebook to Twitter – has been the target of multiple and continued data breaches." AFAIK, neither of these companies have ever had their entire DB dumped in the style @getongab appears to have, nor would that be an excuse if they had.
— Troy Hunt (@troyhunt) March 2, 2021

This is also fairly common to see in a post-breach announcement, either in generic terms ("as you know, data breaches are very common") or in Gab's case, directly pointing the finger at competing services. The comment is intended to normalise the data breach and downplay its significance, the exact opposite of what we want to encourage in this industry. A few years ago I wrote about how to construct a breach disclosure notice and paid particular attention to how well the Red Cross Blood Service handled theirs. It's little things like apologising; rather than downplaying the incident and directing attention elsewhere, we need to see organisations standing up, copping it on the chin and acknowledging their faults.

Then there's the @WIRED piece from @a_greenberg, a top-notch journo I've got a lot of respect for based on previous pieces he's written and many discussions I've had with him personally: https://t.co/JHp0nMNE0a
— Troy Hunt (@troyhunt) March 2, 2021

The WIRED piece is well worth a read and sheds more light on the events leading up to the breach. I've always found Andy Greenberg to be not just a very switched on infosec journalist, but also a genuinely nice guy I've enjoyed speaking with in the past. I can't imagine Andy being anything but professional in his interactions with Gab and it was only whilst writing this very paragraph that I saw a tweet which might explain why he was treated with such disdain - he may have picked the wrong religion:

As per my policy of not communicating with non-Christian and/or communist journos, I will not be replying to this non-story.

It's not a real email address, therefore it is not checked. It's just a placeholder email I used when creating the account almost five years ago. pic.twitter.com/BK1bCldFNe
— Gab.com (@getongab) March 3, 2021

As much as I didn't want this post to touch on religion, it's hard to ignore a comment like that which literally excludes the vast majority of the earth's population (and I'm guessing a fair chunk of Christians would be appalled by this statement as well).

DDoSecrets has a @getongab page saying: "Due to these concerns, along with presence of passwords and other PII, this dataset is currently only being offered to journalists and researchers." I'd love to get this into @haveibeenpwned, if anyone knows anyone there, ping them for me.
— Troy Hunt (@troyhunt) March 2, 2021

Following this tweet, I did indeed get in touch with someone and obtain a copy of the data. But before I delve into that, there's just one more tweet in that thread I want to embed:

Weak, pathetic, and emasculated men like you are why the West is failing.
— Gab.com (@getongab) March 2, 2021

I'm amused by this, more than anything. For the most part I thought my analysis was pretty objective and Gab (whose account seems to simply be the mouthpiece of their CEO, Andrew Torba) hasn't really made it clear which bit they disagreed with, so let's soldier on and objectively look at the data just like with any other data breach.

In a 2.99GB file called accounts.sql, there are just over 4M rows of data largely consisting of user records. Because I myself have a Gab account which I created when I started making commentary on them and Parler in Jan, naturally the first thing I did was to pull out my own record:

Looking into the (alleged) @getongab data breach, many records don't have an email address or a password hash (mine has the former, but not the latter). But for verification, don't those dates and times look... similar. Coincidence? Or real breach? (Aus time in @1Password) pic.twitter.com/13ihm27lsV
— Troy Hunt (@troyhunt) March 3, 2021

Per the tweet, there's no hash against my record so I can't verify the password matched the long random one I created in 1Password, but it's obviously pretty clear the data is legit based on the alignment of the dates. In total, the file has 43,015 unique email addresses (including mine) which is far fewer than the total row count. Why? At a guess it would come down to how the data was dumped. There are actually bcrypt password hashes against many records, but they also only represent a subset of the total with 7,097 of them in all. Having access to these hashes gives us an opportunity to debunk Gab's earlier claim that "your passwords have not been revealed", an exercise that's made particularly easy due to their password criteria which can be seen on the registration page:

Requiring 8 characters isn't unusual (it's possibly even on the high side), but that's the only criterion. What that means is that it's easy to take a list of the most common passwords of 8 characters or more, pass them into hashcat and bingo, "your password has been revealed":

Yes, apparently Gab will let you have a password that is literally "password".
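The candidate selection step is trivial, which is rather the point. A rough Python sketch of it follows, assuming a hypothetical list of common passwords and treating the 8 character minimum as the only rule. The names are mine, and the actual cracking would still be left to a tool such as hashcat.

```python
# Assumption: the only rule is a minimum length of 8 characters.
MIN_LENGTH = 8


def meets_policy(password: str, min_length: int = MIN_LENGTH) -> bool:
    """Return True if the password would be accepted under a length-only rule."""
    return len(password) >= min_length


def candidates_for_policy(wordlist):
    """Keep only the guesses the site would have allowed, preserving their order."""
    return [word for word in wordlist if meets_policy(word)]


# Hypothetical sample of very common passwords.
common = ["123456", "password", "qwerty", "iloveyou", "12345678"]
candidates = candidates_for_policy(common)
print(candidates)
```

Every entry that survives the filter is both extremely common and perfectly valid under a length-only rule, so those are the guesses worth trying first.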

Andy mentioned the presence of a chatlog.txt file in his story and the data is pretty limited here at only 9.53MB in size. The content ranges from an extensive amount of religious scripture to very intimate messaging between 2 members to someone sharing a radio show which they close with "We hope you enjoy the show and share it with white families". To be clear, these are intended to be private messages and not something Gab should be responsible for moderating (for obvious privacy reasons), but they do give an insight into the interests of their members. It also speaks to my earlier point about this breach being significant as it ties identities to their messaging. Some of the private messages are, by most standards, recalcitrant, and they sit alongside the Gab username which then exists in the accounts.sql file which then points to their online profile and may also include their email address. Plus, there are multiple messages in which people have shared their personal phone number, often to take the conversation onto WhatsApp. You can immediately see the risk to individuals.

The groups.sql file Andy also mentioned is much more benign. It's 31.8MB worth of Gab group information spread over nearly 32k lines. I suspect there's little risk posed by the exposure of the data other than that it simplifies the exercise of analysing the nature of the groups people have created. One thing that seeing this file helped me understand is that as much as Gab has gained notoriety for housing certain types of content, there's a heap of run of the mill stuff that'd barely raise an eyebrow. For example, there's the German Shepherds group, the brewing group or even the Dads of Gab group which is all about "A group by fathers, for fathers. Topics should be about how to properly parent your sons...", ok, good, this is sounding good "...and how to police your wives". Aw crap. I honestly tried to focus on the positive but it's very hard to go far without running into content which, well, let's just say "doesn't sit well with most people". Micah Lee from The Intercept did a quick analysis on the largest groups:

Of the top 20 biggest groups (sorted by most members) on Gab:

5 are devoted to Trump/election misinformation
5 are devoted to QAnon pic.twitter.com/L0sLHJtEAh
— Micah (@micahflee) March 2, 2021

And then there's the big file - statuses.sql with 62.4GB of data in it. This appears to be precisely what the file name suggests - statuses posted to Gab. For example, the first row appears as follows:

105295113355799222	3146	{"id": "105295113355799222", "url": "https://gab.com/mwill/posts/105295113355799222", "card": null, "poll": null, "tags": [], "group": null, "quote": null, "emojis": [], "reblog": null, "content": "@TImW381 There's a reason I blocked you.", "language": "en", "mentions": [{"id": "979864", "url": "https://gab.com/TImW381", "acct": "TImW381", "username": "TImW381"}], "pinnable": false, "has_quote": false, "reblogged": false, "sensitive": false, "created_at": "2020-11-29T18:52:04.042Z", "expires_at": null, "favourited": false, "revised_at": null, "visibility": "public", "quote_of_id": null, "rich_content": "", "spoiler_text": "", "reblogs_count": 0, "replies_count": 0, "in_reply_to_id": "105294344326089419", "plain_markdown": null, "favourites_count": 0, "media_attachments": [], "pinnable_by_group": false, "bookmark_collection_id": null, "in_reply_to_account_id": "979864"}	2020-11-29 13:52:04.042	\N
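For anyone wanting to pick a row like this apart, here's a rough Python sketch. It assumes the columns are tab separated and that \N marks a null, which is how the row above appears. The column names are guesses made for illustration, the sample is a trimmed-down version of the row above, and real rows may contain escaped characters this doesn't handle.

```python
import json

# Assumption: a literal backslash-N marks a null column, as in the row above.
NULL_MARKER = r"\N"


def parse_status_row(line: str) -> dict:
    """Split one tab-separated row into labelled fields and decode the JSON column."""
    fields = [None if f == NULL_MARKER else f for f in line.rstrip("\n").split("\t")]
    status_id, second_column, payload, timestamp, last_column = fields
    return {
        "status_id": status_id,
        "second_column": second_column,  # meaning not established here
        "status": json.loads(payload),
        "timestamp": timestamp,
        "last_column": last_column,
    }


# Trimmed-down version of the first row, keeping only a few JSON keys.
sample_json = json.dumps({
    "id": "105295113355799222",
    "url": "https://gab.com/mwill/posts/105295113355799222",
    "content": "@TImW381 There's a reason I blocked you.",
    "visibility": "public",
    "created_at": "2020-11-29T18:52:04.042Z",
})
sample_row = "\t".join(
    ["105295113355799222", "3146", sample_json, "2020-11-29 13:52:04.042", NULL_MARKER]
)

parsed = parse_status_row(sample_row)
print(parsed["status"]["url"], parsed["status"]["visibility"])
```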

The URL at the beginning is a public post reflecting the contents in the .sql file:

Based on the reporting I've seen and as best I can tell from reviewing the data, this is all public information. That also appears to be Gab's position and reflects their messaging in their Update on the Gab Breach post from yesterday:

From what has been reported and from what we know thus far, the overwhelming majority of the data in this breach is already public on the website for anyone to see.

So, what's the harm? Is it a "breach" if it merely exposes data that's public by design anyway? Well, it depends, primarily because large volumes of easily queryable data downloadable in aggregate can offer a lot more insight than manually trawling through individual pages. I don't think this presents any (additional) risk or harm to the individual users of Gab who posted public content, rather it makes it a hell of a lot easier for researchers to analyse and form empirically derived conclusions across large volumes of data. It also made it easy for me to identify more email addresses that appeared in public comments, 24,017 of them in total. A bunch of those were already in the accounts.sql file and a portion were also false positives (email addresses are extracted via regular expression and may include other data that matches a valid address pattern). But many others appeared in posts, for example a big list of UK politicians in a post about "traitors". In total across all files in the breach, I identified 66,521 unique addresses. This is obviously only a very small subset of the ~4M users of the service and on that basis, it's a minor breach in terms of personal information exposure.
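To show why regular expression extraction throws up false positives, here's a small Python sketch. The pattern is a deliberately simple one chosen for the example and is not the expression used for HIBP. The sample text and function name are likewise made up.

```python
import re

# Assumption: a simple, permissive pattern; real-world extraction rules differ.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def unique_addresses(text: str) -> set:
    """Return the distinct address-like strings in the text, lowercased to merge case variants."""
    return {match.lower() for match in EMAIL_PATTERN.findall(text)}


sample = (
    "Mail alice@example.com or ALICE@Example.com, cc bob@example.org. "
    "Logo: image@2x.png"
)
addresses = sorted(unique_addresses(sample))
print(addresses)
```

The two spellings of the first address collapse into one, while "image@2x.png" is picked up even though it's a file name. That's exactly the kind of noise that inflates a raw count.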

And what about the bug that caused all this havoc to begin with? Dan Goodin (another infosec journo I respect) wrote up a piece yesterday on the rookie coding mistake made by the site's CTO, namely the presence of SQL injection. It might be a simple mistake, but it's also an enormously common one and still rates as the number one application security risk on the web today. It's an easy mistake to make, but also an easy one to catch and I can't help wondering if its presence reflects a lack of both peer review and automated analysis, both of which would likely be neglected by an under-resourced company dealing with the massive rate of growth Gab has seen this year.
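For readers who haven't seen the pattern before, here's a generic Python illustration of SQL injection and the parameterised query that prevents it. To be clear, this is not Gab's code. The table, data and function names are invented, and SQLite is used only because it ships with Python.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with two made-up accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (username TEXT, email TEXT)")
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)",
        [("alice", "alice@example.com"), ("bob", "bob@example.com")],
    )
    return conn


def find_unsafe(conn, username):
    """Vulnerable: the input is concatenated straight into the SQL text."""
    query = "SELECT email FROM accounts WHERE username = '" + username + "'"
    return [row[0] for row in conn.execute(query)]


def find_safe(conn, username):
    """Parameterised: the input is passed as data and never parsed as SQL."""
    query = "SELECT email FROM accounts WHERE username = ?"
    return [row[0] for row in conn.execute(query, (username,))]


conn = make_db()
payload = "nobody' OR '1'='1"

unsafe = find_unsafe(conn, payload)  # the OR clause matches every row
safe = find_safe(conn, payload)      # no user has that literal name
print(sorted(unsafe), safe)
```

The difference between the two functions is a single line, which is why this sort of flaw is both easy to introduce and easy for a reviewer or a static analysis tool to spot.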

As I was analysing this data and considering how to approach it with HIBP, I was quickly drawing the conclusion that this shouldn't be publicly searchable. That view was reinforced by James Roper's tweet:

Curious, will you consider this breach a "sensitive" breach? Given that an email addresses presence in the dataset potentially reveals something about that persons political beliefs? I could imagine employers looking up a candidates email address and excluding them based on that.
— James Roper (@jroper) March 3, 2021

James is spot on in that there is a very strong political leaning implied by a presence in the data, and likewise a strong religious affiliation to Christianity. It differs from place to place but for the most part, both political and religious views are classified as sensitive information and it's only fair for me to treat the data with that level of respect. There's also the risk of incorrectly assuming that a presence in the breach implies your views have some degree of alignment with those regularly expressed on the site, yet clearly based on the presence of my own email address, that assumption is incorrect. If you want to see if your own email address is in the Gab breach, you'll need to verify that you control it via HIBP's notification service. If you use the domain search feature and can verify that you control the domain, you'll be able to see which aliases on that domain appear in the breach. The data is not publicly searchable from the front page nor is it available via the API.

One final observation based on my time spent trawling through Gab: a platform like Facebook is global in nature. It's approaching 3B monthly active users and a "mere" 190M of them are in the US, so call it about 7% of the audience. But 190M is still a huge number and within that number are tens of millions of active users with the same political alliances as those on Gab. Facebook has many, many flaws, but lack of a diverse global audience is not one of them. That's Gab's flaw. The content is extremely US-centric (it's difficult to derive from the leaked data, but my gut feel is 90%+) and it's obviously very much right-wing politics. If in any doubt about the singularity of Gab's content, just browse through the explore page: at the time of writing, the first few pieces of content are a climate change denying piece, a post deriding those pushing back against the Texas governor ending the state-wide mask mandate, a post from a Republican politician Wikipedia describes as a "far-right conspiracy theorist", a derogatory piece on Mitch McConnell, someone complaining they were censored on Parler and posting a Photoshopped image of Biden in front of a Chinese flag, a post from "far-right fake news website" (again, Wikipedia's description) about 2020 election fraud, one claiming the FBI is corrupt and that the KGB is more trustworthy... And the trend just continues. Gab isn't a free speech alternative to incumbent tech behemoths with a rich collection of wide-ranging views, rather it's an echo chamber of the same political, religious and scientific opinions.

Gab is now the 515th data breach in HIBP. At the time of writing, I'm yet to receive a breach notification about the exposure of my Gab data.
