Back in July last year, Scott Helme and I shipped a little pet project that tracked the world's largest websites not implementing HTTPS by default. We called it Why No HTTPS? and it gave people a way to see the largest websites not taking transport layer security seriously. We also broke the list down on a country-by-country basis and it quickly became a means of highlighting security gaps and serving as a "list of shame". I've had many organisations reach out and ask to be removed once they'd done their TLS things properly so clearly, the site is driving the right behaviour. Today, we're happy to share the first update since November last year.
The Web is More Secure More of the Time
Let's start with the good news: since the first release of this little project, HTTPS adoption has steadily trended upwards:
We've gone from 70% of all HTTP requests being over the secure scheme to 80% which is a pretty good effort in a relatively short amount of time. But, of course, it's the websites serving that remaining 20% of traffic that I want to focus on here. Let's being with where we source the list of top sites from and that's something we've changed for this release.
Bye Bye Alexa, Hello Tranco
When we launched the site, the list was based on the Alexa Top 1M. However, this list was becoming somewhat tricky to use reliably as Scott explained in October:
I used to use the Alexa Top 1 Million for this research but I've been having issues with the list. They tried to remove access at one point and while I managed to have it restored, there are other issues too. The accuracy of the data has been called into question and also the list itself has been having weird issues recently like not returning 1 million entries... Yep, that's right, the Alexa Top 1 Million list has been returning, in some cases, only ~650,000 entries recently, which is of course a problem.
Consequently, there are some differences in the way sites are ranked and as a result, there are some unexpected appearances. For example, the 21st largest site on the global list is googletagmanager.com. Now obviously this isn't a website in the sense that folks go there looking for useful content (many would argue quite the contrary), but based on the Tranco data it's one of the most traffic'd websites in the world so it's within scope of this project.
So that's our starting point in terms of identifying which sites we assess, let's move onto the methodology around how a site ultimately makes our list.
Methodology and False-Positives
A quick recap on our methodology first: Scott runs a service which indexes a whole bunch of security things on the world's top million websites each day. He publishes the results of that effort via his free crawler.ninja website (really Scott, .ninja?!) and I then roll the HTTP sites and HTTPS sites list into the Why No HTTPS? website. In that regard, it's quite simple. Except it's not really...
As I explained in this Q&A blog post last year, there are a whole bunch of reasons why a site that you see apparently doing things right might still be on our list. If you're going to chime in here with a bit of "But [blah].com loads over HTTPS by default for me", do please start by reading that blog post.
Read the post? Good! What we're left with pretty much boils down to an expectation that a site responds to an HTTP request over the insecure scheme with either a 301 or 302 (ideally the former so it's a permanent redirect) to a secure URL (multi-hop is also ok: a 301 to an HTTP address that then 301s to an HTTPS address is fine). If I make an insecure curl request from here in Australia, for example, and I get an HTTP 401 then the site goes onto the list. There has been some dissatisfaction over this methodology due to how much website behaviour can vary from location to location, so in this update we've added a means of getting a "free pass" that will automatically exclude a site from the list.
HSTS Preload Gives You an Immediate "Free Pass"
Preloaded HSTS is awesome (here's an old blog post that explains why). Once a site is pinned into the browser's static list of HSTS sites, insecure requests will always be upgraded and the 301 / 302 done by the website becomes redundant. Further, check out the requirements to be preloaded in the first place, in particular, this one:
Redirect from HTTP to HTTPS on the same host, if you are listening on port 80.
What this means is that if a site is in the preload list, we're comfortable excluding it from our list of shame. A great example of this is the domain I mentioned earlier - googletagmanager.com. When I curl that address insecurely, here's what happens:
Arguably, this should keep the site in scope of being on our list but because it's been successfully preloaded and the browser simply won't allow an insecure request, it gets a free pass. Other notable "free pass" sites include hyatt.com (a curl for me just 301s to a www prefixed address served insecurely) and... haveibeenpwned.com:
Over many years I've carefully honed a bunch of Cloudflare firewall rules to identify non-browser traffic that doesn't adhere to expected norms. The response above serves a body containing anti-automation (CAPTCHA) over the same scheme the request was made to (a Cloudflare behaviour). You shouldn't ever get that response in an actual browser but if you did, the fact that HSTS has been preloaded for the domain for years means the request would automatically be upgraded hence HIBP is really a false positive.
This practice of giving HSTS preloaded sites a free pass is something we hope will drive more websites in this direction. The next time someone reaches out and claims their site is incorrectly categorised that's going to be my first response - preload your domain then the next update to the site will keep you excluded.
Check the Diffs on GitHub
Lastly, if you'd like to see exactly what's changed in the data set, check out the public GitHub repository. You'll see all the input data and all the output data, the latter being precisely the files that drive the Why No HTTPS? website. I personally find it interesting to look at diffs on files such as the top50-au.json one as it gives me a really good sense of what's changed. I've ordered these files by domain name rather than rank to make things a little easier, but of course with ranks regularly changing anyway then the move from Alexa to Tranco there's going to be a heap of changes from last time even if the HTTPS status hasn't changed. At the very least though, it makes it super easy to see which sites have now dropped off the list altogether.
There's always a bunch of feedback on these releases and people often find really interesting things in the data. Do chime in below, keeping in mind the earlier point about reading the Q&A blog post first. And, of course, please continue to use this site as leverage to move more organisations in the "secure by default" direction.