When Is Data "Public"? (And 2.5M Public Factual Records in HIBP)

When is data "public"? And what does "public" even mean? Does it mean it's merely visible to the public? Or does it mean the public can do anything they like with it? This discussion comes up time and time again, as it did with the huge leak of PDL data only last month. For the most part, the impacted data in this incident came from LinkedIn, a service where by design we (including myself) publish personal information about ourselves for public consumption. So what's the problem? Willingly publishing your personal data online in a specific context is one thing, an organisation then taking it and providing it in another context is... unsettling.

As I said in the intro to the PDL blog post, the pattern of data being collected via data aggregators then being redistributed outside their control (whether you call it a "leak", a "spill" or a "breach") is becoming alarmingly common. Subsequently, when someone recently sent me an alleged breach titled "factual.com_places_db_8M", it wasn't overly surprising. Factual is "a location data company that helps marketers and their organizations use location to better understand, reach and engage consumers". They were mentioned in the recent New York Times piece titled Twelve Million Phones, One Dataset, Zero Privacy which is an absolutely fascinating story. To be fair to Factual, their position in that story represents them on a higher moral ground and there are certainly many shades of grey when it comes to the ethics data aggregators operate under.

Moving on, the data allegedly sourced from Factual included a 1.1GB CSV file which contained fields for name, email, country, region, locality, address, postcode, latitude, longitude, tel, fax and website. The create date on the file was 22 March 2017. As the filename suggests, there were almost 8M records although "only" 2.5M unique email addresses (many records didn't include this field). For the most part, the data fell into the "public" category insofar as I'd expect to be able to go and locate much of it on a record-by-record basis, yet there's also the question of whether it should exist in one aggregated location in the first place and whether the owners of the data expected it to be used in this fashion. So I asked people; I emailed a handful of the nearly 3M subscribers I have in HIBP who appeared in the data set and I asked them 3 questions:

  1. Is this data about you accurate?
  2. Do you consider it to be public domain info that should be redistributed?
  3. Should it appear in HIBP?
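
As an aside on the numbers above, here's a minimal sketch of how a total record count and a unique email count could be derived from a CSV like the one described. It is not how the file was actually processed; it assumes the file has a header row with a column named "email" (taken from the field list above), and that addresses are compared case-insensitively after trimming whitespace, which is an assumption on my part. The function name `count_unique_emails` and the sample rows are purely illustrative.

```python
import csv
import io


def count_unique_emails(csv_file, email_field="email"):
    """Return (total_records, unique_email_count) for a CSV with a header row.

    Assumption: the email column is called "email" and addresses are treated
    as case-insensitive. Rows with a blank email still count as records.
    """
    total = 0
    seen = set()
    for row in csv.DictReader(csv_file):
        total += 1
        # Many records have no email at all, so skip blanks instead of
        # counting the empty string as an address.
        email = (row.get(email_field) or "").strip().lower()
        if email:
            seen.add(email)
    return total, len(seen)


# Tiny made-up sample standing in for the real file.
sample = io.StringIO(
    "name,email,country\n"
    "Shop A,owner@example.com,US\n"
    "Shop B,,NO\n"
    "Shop A (dup),Owner@Example.com,US\n"
)
total, unique = count_unique_emails(sample)
print(total, unique)
```

For a real file you'd pass an open file handle (opened with `newline=""`, as the `csv` module documentation recommends) rather than the in-memory sample. The gap between the two counts comes from both the blank email fields and the same address appearing on multiple records.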

On the first point, responses varied:

It is not accurate
It appears to me that all of the information below is publicly available either on our website or job postings currently.
The data is not accurate and most is fake
Yes, I was a travelling notary at one time.
I don’t work for [redacted] and that is not my phone number or address. The only thing correct was my name, email and website.

I found these responses interesting from the perspective of how reliable services from aggregators really are. With the caveat that I have zero insight into where Factual actually gets their data from, I would imagine that large-scale aggregation from public sources would be fraught with data integrity challenges. Be that as it may, we're still talking about millions of people's email addresses popping up in unexpected places, which brings me to the responses to the second question about whether this data should be online in this fashion:

No, that address is private, and is a home address on my credit report.
As the information appears to be fake it should not be available online to the public
My email should not be associated with those other names/adresses/website that you found..
I live in Norway, don't run any businesses - other than just the tiny personal etsy shop you see linked below!

These responses speak to my point in the opening paragraph about the expectations people have about how their data is used, regardless of how visible it is online. And as for the final question about whether the data should be loaded into HIBP, the responses were a unanimous "yes" so as of now, it's searchable along with the other 9.3B records already in the system.

To ensure Factual were aware there was data circulating that was attributed to them and claiming to be a "breach", I got in touch with them and privately disclosed the incident. To their credit, they were receptive, responsible and professional in the way they responded and provided the following quote:

Factual has reviewed a data file provided by Troy Hunt and determined that the file contains publicly available information about businesses and other points of interest that Factual makes available on its website and to customers. The company does not believe the information was obtained from a source other than its public website.

The data includes business names, locations, website addresses, hours of operation, and contact information that the businesses themselves have made public, such as on their websites, in directories, and on social media. It is similar to data available from other public sources such as yellow pages data and mapping apps on mobile phones.

Those interested correcting business information that may be personal data under GDPR or other applicable privacy law are encouraged to reference our privacy policy for more information:

We appreciate Mr. Hunt’s efforts to notify us and his assistance with information to facilitate our investigation.

The last thing I'll leave you with is a tweet from earlier this month which is relevant to the Factual situation. Jeremiah poses a really interesting question and the responses make for some good reading about what's changed culturally:


Hi, I'm Troy Hunt, I write this blog, create courses for Pluralsight and am a Microsoft Regional Director and MVP who travels the world speaking at events and training technology professionals