Data Enrichment, People Data Labs and Another 622M Email Addresses

Until this month, I'd never heard of People Data Labs (PDL). I'd certainly heard of the sector they operate in - "Data Enrichment" - but I'd never heard of the company itself. I've become more familiar with this sector over recent years due to the frequency with which it's been suffering data breaches that have ultimately landed in my inbox. For example, there's Dun & Bradstreet's NetProspex, which leaked 33M records in 2017, Exactis, which had 132M records breached last year, and the Apollo data breach, which exposed 126M accounts, one of which was my own.

When Vinny Troia and Bob Diachenko recently reached out and sent me a massive set of data allegedly originating from PDL, I was unsurprised to see more of the same, yet still surprised to find my own data:

{ 
   "id":null,
   "status":"created",
   "guid":null,
   "positions":[ 
      { 
         "id":null,
         "title":"founder",
         "description":null,
         "location":null,
         "position_type":"Past",
         "company_name":"have i been pwned",
         "company_url":"linkedin.com/company/haveibeenpwned",
         "start_date_year":2013,
         "end_date_year":null,
         "start_date_month":12,
         "end_date_month":null,
         "company_website":"haveibeenpwned.com",
         "company_size":"1-10",
         "company_industry":"information technology and services"
      },
      { 
         "id":null,
         "title":"partner",
         "description":null,
         "location":null,
         "position_type":"Past",
         "company_name":"report uri",
         "company_url":"linkedin.com/company/report-uri",
         "start_date_year":2017,
         "end_date_year":null,
         "start_date_month":11,
         "end_date_month":null,
         "company_website":null,
         "company_size":"1-10",
         "company_industry":"internet"
      },
      { 
         "id":null,
         "title":"director",
         "description":null,
         "location":null,
         "position_type":"Past",
         "company_name":"superlative enterprises",
         "company_url":"linkedin.com/company/superlativeenterprises",
         "start_date_year":1997,
         "end_date_year":null,
         "start_date_month":7,
         "end_date_month":null,
         "company_website":"superlative.com.au",
         "company_size":"1-10",
         "company_industry":"information technology and services"
      },
      { 
         "id":null,
         "title":"author",
         "description":null,
         "location":"farmington, utah, united states",
         "position_type":"Current",
         "company_name":"pluralsight",
         "company_url":"linkedin.com/company/pluralsight",
         "start_date_year":2012,
         "end_date_year":null,
         "start_date_month":11,
         "end_date_month":null,
         "company_website":"pluralsight.com",
         "company_size":"1001-5000",
         "company_industry":"e-learning"
      }
   ],
   "source":"PDL",
   "full_name":"troy hunt",
   "first_name":"troy",
   "last_name":"hunt",
   "url_profile":"https://www.linkedin.com/in/troyhunt",
   "id_external_profile":"troyhunt",
   "short_bio":"i'm a pluralsight information security author & instructor, microsoft regional director and most valued professional (mvp) specialising in online security and cloud development. i speak at conferences around the world and run workshops on how to build more secure software within organisations. i'm also the creator of the data breach aggregation service known as \"have i been pwned\".. i'm a pluralsight author, microsoft regional director and most valued professional (mvp) specialising in online security and cloud development. i speak at conferences around the world and run workshops on how to build more secure software within organisations. i'm also the creator of the data breach aggregation service known as \"have i been pwned\".. pluralsight author and microsoft most valued professional (mvp) focusing on security concepts and process improvement in software delivery within a large enterprise environment.\n\nspecialties: security, c# asp.net, azure, sql server, soa, continuous integration. information security author & instructor at pluralsight, microsoft regional director & mvp, founder of have i been pwned.",
   "is_deleted":false,
   "created_id":1111,
   "created_dt":1565870400000,
   "updated_id":1111,
   "updated_dt":null,
   "timezone_id":null,
   "timezone_name":null,
   "timezone_geocoding_latitude":null,
   "timezone_geocoding_longitude":null,
   "lip_location":"brisbane, queensland, australia",
   "is_tc":null,
   "is_payment":null,
   "headline":null,
   "industry":"information technology and services",
   "linkedin_recruiter_profile_url":null,
   "location_shape":{ 
      "coordinates":[ 
         153.02,
         -27.47
      ],
      "type":"point"
   },
   "location_level":null,
   "emails":"[redacted]",
   "phone_numbers":"[redacted]",
   "experience_years":22,
   "is_scheduled":null
}

I've redacted only my email address and phone number; everything else above is precisely as it appears in the source data. What's immediately obvious is the alignment to data I've consciously and deliberately published to LinkedIn so that it's publicly accessible. One look at my LinkedIn profile makes it very clear where much of this data has been sourced from, so on the one hand, you could reasonably argue there's nothing either sensational or particularly newsworthy about this breach. Yet on the other hand...
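
To make that alignment a little more concrete, here's a minimal Python sketch of how a record shaped like the one above could be summarised. It assumes the record has already been obtained as JSON text; the trimmed `SAMPLE` keeps only a handful of fields from the record above, and the helper names `populated_fields` and `describe_positions` are hypothetical, not part of any PDL tooling.

```python
import json

# Trimmed copy of the record above; only a few fields are kept for brevity.
SAMPLE = json.loads("""
{
  "source": "PDL",
  "full_name": "troy hunt",
  "headline": null,
  "lip_location": "brisbane, queensland, australia",
  "emails": "[redacted]",
  "positions": [
    {"title": "founder", "company_name": "have i been pwned",
     "position_type": "Past", "start_date_year": 2013, "start_date_month": 12},
    {"title": "author", "company_name": "pluralsight",
     "position_type": "Current", "start_date_year": 2012, "start_date_month": 11}
  ]
}
""")


def populated_fields(record):
    """Return the top-level keys whose value is not null, sorted by name."""
    return sorted(key for key, value in record.items() if value is not None)


def describe_positions(record):
    """Return one human-readable line per entry in the positions array."""
    lines = []
    for pos in record.get("positions", []):
        start = f"{pos['start_date_year']}-{pos['start_date_month']:02d}"
        lines.append(
            f"{pos['title']} at {pos['company_name']} "
            f"({pos['position_type'].lower()}, from {start})"
        )
    return lines


if __name__ == "__main__":
    print("Populated fields:", ", ".join(populated_fields(SAMPLE)))
    for line in describe_positions(SAMPLE):
        print(" -", line)
```

Run against the full record, a summary like this makes it easy to see which fields are actually populated and how closely the positions mirror a public LinkedIn profile.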

The recurring theme I'm finding with exposed data of this nature is increasing outrage that the data aggregator obtained and used personal information in a fashion the owner of the data (i.e. me) didn't consent to. It's not about how public the data might be through the channels people choose to publish it, rather it's about the use of the data outside its intended context. And if you're of the mind that data of this nature isn't particularly important and really doesn't warrant loading into Have I Been Pwned (HIBP), then you're part of a small minority:

Time and time again since that poll, a pulse check of subscriber sentiment has returned similar results. The vast majority of people want to know where their data has been exposed. And in this case, I'm going to keep using that term - "exposed" - rather than breached because PDL has a clear position on that (great article by Lily Hay Newman from WIRED on this incident):

"The owner of this server likely used one of our enrichment products, along with a number of other data enrichment or licensing services"

It's entirely possible that this data came from a PDL subscriber and not PDL themselves. Someone left an Elasticsearch instance wide open and, by definition, that's a breach on their part and not PDL's. Yet it doesn't change the fact that PDL is indicated as the source in the data itself, and it definitely doesn't change the fact that my data (and probably your data too) is freely available to anyone who wishes to query their API. I signed up for a free API key just to see how much they have on me (they'll give you 1,000 free API calls a month) and the result was rather staggering.

Both Vinny's and Lily's articles above mention that references to Oxydata (another data enrichment company) were also found in another of the exposed Elasticsearch indexes. Lily got a statement from them on the potential for their data to be misused by one of their customers:

"We sign the agreements with all our clients that strictly forbids the data reselling and obliges them to ensure that all of the appropriate security measures are taken. However, there is no way for us to enforce all of our clients to follow the best data protection practices and guidelines."

And this is the real problem: regardless of how well these data enrichment companies secure their own systems, once they pass the data downstream to customers, it's completely out of their control. My data - almost certainly your data too - is replicated, mishandled and exposed and there's absolutely nothing we can do about it. Well, almost nothing...

For folks curious as to what data PDL holds on them, they have an online contact us form, a chatbot on their home page, a Twitter account, a LinkedIn page, a published email address of support@peopledatalabs.com and a dedicated opt out form. Their privacy policy states that people may "access any information we have on them" and that they will "reply to a person’s request within five business days" or delete it outright. It'll be interesting to see how that scales if even a very small slice of the 622M impacted individuals takes them up on that offer.
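
For a rough sense of that scale, the arithmetic can be sketched in a few lines of Python. The 622M figure is the one used in this post; the request fractions are purely illustrative assumptions, not measured or predicted rates.

```python
# Number of impacted individuals, using the 622M figure from this post.
IMPACTED = 622_000_000

# Hypothetical fractions of impacted people who might send a request.
ASSUMED_FRACTIONS = [0.0001, 0.001, 0.01]


def expected_requests(impacted, fraction):
    """Return how many requests arrive if `fraction` of people ask."""
    return round(impacted * fraction)


if __name__ == "__main__":
    for fraction in ASSUMED_FRACTIONS:
        count = expected_requests(IMPACTED, fraction)
        print(f"{fraction:.2%} of {IMPACTED:,} people -> {count:,} requests")
```

Even at the smallest assumed fraction (0.01%), that's tens of thousands of requests to answer within the stated five business days.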

The email addresses are now searchable in HIBP and the incident has been titled "Data Enrichment Exposure From PDL Customer".

Hi, I'm Troy Hunt, I write this blog, create courses for Pluralsight and am a Microsoft Regional Director and MVP who travels the world speaking at events and training technology professionals