Handling people's personal data is sensitive business

Last week I wrote about how 8 million GitHub profiles were leaked from GeekedIn's MongoDB, which is always a risk when you expose a DB with no auth whatsoever! For any other website, this would be a typical data breach scenario in that info that was meant to remain private was made public. However, GeekedIn lost publicly accessible GitHub data so whilst yes, there was a breach, no, it wasn't anything you couldn't get publicly anyway. So what's the big deal?

I expected there'd be people in both camps on this issue - those who couldn't care less and those who were upset - but I was surprised at both how passionate each side was and how biased the vast majority was towards one end of that scale. Here's how I personally felt about the incident:

whilst this appeared to be data that was obtained from publicly facing GitHub profiles, it felt wrong to see it all aggregated together here like this

But if you have a read through the comments on that blog post, there were a couple of people in particular who were ardently against me on this. That's fine, you're always going to have different views on these things, but there were some pretty strong opinions usually along these lines:

Why is it wrong to see publicly available data aggregated?

And this is the paradox: if you place your information on a publicly facing resource, should it then be a free-for-all for anyone to do whatever they want with? I mean it's now public domain data, right? Others agreed:

In this case that expectation should be that it is completely public and can be used for anything. Github is in no position to forbid anything about content they serve publicly.

But this is where it starts to get a bit murkier - the assertion that a website is unable to define the terms by which people use it - because that's incorrect. Here's where GitHub stands on that:

Using scraped information for a commercial purpose violates our privacy statement and we do not condone this kind of use.

There are many, many precedents of legal trouble arising as a result of violating a website's terms of service and using otherwise publicly accessible data in a way they prohibit. Most notable was Aaron Swartz pulling data from MIT and as the linked article states:

The "terms of service" (TOS) of any website are basically a contract. They constitute an agreement about what you can and can't do, and what the provider can and can't do.

That "agreement" was pursued to tragic consequences in Aaron's case and I highly doubt that GitHub would ever actually go on the offence against a little startup like GeekedIn, but nobody should be under the illusion that someone else's website is fair game to do whatever they want with. That's a little bit tangential to the whole point about handling personal data though, let's get back to that:

I find it amusing that people would be worried that their email, which they published on github already, was "leaked", but then they would have no issue confirming it's an actively used email by signing up at HIBP

This one starts to get to the heart of the issue and it's simply this:

People provided their data to GitHub with an expectation of how it would be used. GeekedIn took that data and in violation of GitHub's terms, re-purposed it to do something completely different and for their own commercial gain.

I was seeing a lot of comments from people who were pretty unhappy about this situation, indeed many debating with those who'd left the comments above. So I thought I'd send out a poll:

When two thirds of respondents are unhappy about this and only a single digit percentage genuinely couldn't care, there's obviously something upsetting people. Someone pointed me over to a page on LinkedIn where the GeekedIn service was originally launched (you need to be logged in to LinkedIn to read this) and there are some very strongly worded comments there. I want to touch on some of those because they give us an insight into how the vast majority of people feel about their data being used in this way.

For many people, it's a consent issue:

Well thanks for scraping my personal data without any signup/consent and failing to keep it secure.

And that comes through time and time again:

Thanks for scraping my data off github, dude. I didn't ask to be in your dumbshit database.

And again:

Cheers, mate! Thanks a lot for scraping my data without any kind of consent.

And this one raised another interesting point:

I never gave consent to you to allow my PERSONAL email address to be used by recruiters. Where is your OPTIN which is European Law for personal emails?

As did this one:

Thanks for breaking EU and Spanish data protection laws. You know the drill - as required by law, please explain immediately where you got my data, what is in there, how you store it and how you process it. Then delete all of it, and provide proof, as required by law.

Quite a few people actually brought this up, asking about opt-in and whether GeekedIn's actions could be considered illegal under European law. Particularly in the EU (which incidentally, is where both the founder and the service are based), folks are very sensitive about data privacy and they have some of the world's strictest laws there governing how information can be handled. As a society as well, I find there's a much lower tolerance for privacy violations there than in, say, the US or Australia.

Legal aspects aside, much of this goes back to the earlier point I highlighted around the expectation of context; yes, the data was publicly facing on GitHub's site, but that's the context people were happy to have their data exposed in - on GitHub's site, not necessarily elsewhere. In fact that theme is very clear:

It's completely fair to be pissed about this, as it is 100% not the way the data is intended to be used

The commercial intentions were what upset many people; it's one thing to have your data scraped in this fashion, but it's another thing again to have someone attempt to monetise it:

You scraped a bunch of websites collecting information about individuals, without telling people about it? AND you charged money (quite a lot of money) for giving companies and recruiters that information so they can spam us with crap?

Within the comments on that LinkedIn page there were a few messages of support, albeit from the one individual who seems to be associated with the service. His defence consisted of a combination of "don't put your data on the web if you don't want it scraped", "web crawlers do this too" and "but the information isn't sensitive anyway". Now I get the defence and why he'd think this way but again, people have very different expectations here. You expect public data to be indexed for the purposes of searchability and when it comes to personal information, there are very different views out there about how sensitive it is.

Here's what really gets me with all of this though:

Despite the overwhelming majority of survey respondents and commenters being unhappy about this, there's a small minority that insist, against all evidence, that these people are wrong. They don't believe anyone has the right to feel at all violated that their data has been misused in this fashion.

The simple premise of "it's already public" is not a sufficient defence and there are many examples of where that rationale doesn't play out. I'm "already public" when I'm at the beach with my family but I don't wish for unknown people to share that with others via photographs. People may visit certain public establishments which they don't want to broadcast to the world. The decision of what should and should not be public and how it's used once it is remains a personal decision, not one upon which others project their own views to the detriment of others.

The lesson from all of this regardless of whether you think this issue is a non-event or not is precisely the title of this blog post: handling people's personal data is sensitive business. It doesn't matter where they left it or how you got it, the vast majority of people feel deeply about how it is reused and if you don't respect that, you could be heading towards the same hot water GeekedIn now find themselves in.


Hi, I'm Troy Hunt, I write this blog, create courses for Pluralsight and am a Microsoft Regional Director and MVP who travels the world speaking at events and training technology professionals