Because I don't have enough things to do, I decided to launch yet another project! This one came with a slightly lower overhead though, so it's not too bad, and I'd like to introduce you to crawler.ninja!
My new and improved crawler
Regular readers will know that I've got a crawler fleet that crawls the Alexa Top 1 Million sites and produces a report every 6 months on the state of security across them. You can see all of those reports going back 3 years now:
February 2018
August 2017
February 2017
August 2016
February 2016
August 2015
I've made countless improvements to the crawler, which are detailed through each of the reports above. I'm now tracking everything from Security Headers and TLS configuration to CAA records and issuing CAs. The crawler fleet runs every day and the raw data from each crawl is available, with the full reports being published every 6 months.
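This isn't the crawler's actual code, but to make the security header metric concrete, here's a minimal sketch of the kind of per-site check involved, using only the Python standard library; the header list and target URL are purely illustrative.

```python
# Illustrative only: a minimal check for common security headers on a single
# site. This is not how the real crawler works, just a sketch of the idea.
import urllib.request

SECURITY_HEADERS = [
    "Strict-Transport-Security",
    "Content-Security-Policy",
    "X-Content-Type-Options",
    "X-Frame-Options",
    "Referrer-Policy",
]

def check_security_headers(url: str) -> dict:
    """Fetch a URL and report which common security headers are present."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        headers = response.headers
    return {name: headers.get(name) for name in SECURITY_HEADERS}

if __name__ == "__main__":
    for name, value in check_security_headers("https://example.com").items():
        print(f"{name}: {value or 'missing'}")
```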
Why No HTTPS?
The data from the daily crawl is now being used to power the project that Troy Hunt and I have just launched, whynothttps.com!
You can read all about what the project is on the site itself or in Troy's launch blog post.
The data
There are a few different things available from the crawler, so here's a quick summary.
Raw Data
This is the mysqldump of the crawler's table after the crawl is completed and contains all of the raw data the crawler uses to generate the statistics. The zip files also contain the stats files themselves in both txt and json format, along with the Excel spreadsheet used to generate the graphs. This is literally the whole lot and each day the crawler produces a ~1.4GB file.
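If you want to work with the raw data locally, something like the following sketch would do it. The archive filename, its internal layout and the local database name are assumptions based on the description above, not a documented format.

```python
# A rough sketch of working with a downloaded daily archive. Filenames and the
# database name are placeholders, not documented values.
import subprocess
import zipfile

ARCHIVE = "crawler-data.zip"  # hypothetical local filename of the daily download

with zipfile.ZipFile(ARCHIVE) as archive:
    # The zip is described as containing the mysqldump plus the txt/json stats
    # and the Excel spreadsheet; list everything and pick out the SQL dump.
    names = archive.namelist()
    print("\n".join(names))
    sql_dumps = [n for n in names if n.endswith(".sql")]
    archive.extractall("crawler-data")

# Restoring the dump into a local MySQL database uses the standard mysql
# client; "crawler" here is just a placeholder database name.
if sql_dumps:
    with open(f"crawler-data/{sql_dumps[0]}", "rb") as dump:
        subprocess.run(["mysql", "crawler"], stdin=dump, check=True)
```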
Latest Crawl Report
This will always link through to my blog post containing the latest crawl report. It will detail all of the changes since the last report, have all the pretty graphs and statistics, and cover any new metrics added to the crawler.
High Level Statistics
This links through to the txt and json files I mentioned above. They're available for easy reading/parsing if you want to use the data each day without having to download the entire raw file. These files are also updated automatically each day.
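As an example of consuming the json stats in an automated way, here's a short sketch. The URL is a placeholder (grab the real link from crawler.ninja) and the structure of the file isn't documented here, so the code just prints whatever comes back.

```python
# A minimal sketch of pulling the daily JSON stats for automated use.
# The URL below is a placeholder; use the real link from crawler.ninja.
import json
import urllib.request

STATS_URL = "https://crawler.ninja/files/stats.json"  # hypothetical path

with urllib.request.urlopen(STATS_URL, timeout=30) as response:
    stats = json.load(response)

# The exact structure isn't documented here, so just inspect what came back.
for key, value in stats.items():
    print(key, "->", value)
```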
Support the project
It takes a good amount of resources to crawl and analyse 1 million sites every single day! That's a lot of HTTP requests and redirects, DNS lookups, connections with s_client (sketched below) and much more, all of which take bandwidth, CPU and RAM. Then there's the database to store all of the data and generate the statistics from, plus the hosting for the site and the files themselves. I run this project out of my own pocket and all of the data is freely available and openly licensed because I want it to be useful. That said, it does cost money and it's not exactly a small amount every month! If you can support the project then PayPal donations can be sent to [email protected], you can donate through PayPal Me, support me on Patreon or even consider sponsoring my site!
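To give a feel for those s_client connections, here's a sketch of a single TLS probe driven from Python. This isn't the crawler's actual tooling, just an illustration of one such lookup using the standard openssl s_client command.

```python
# Illustrative only: a single openssl s_client connection driven from Python.
import subprocess

def tls_probe(host: str, port: int = 443) -> str:
    """Run openssl s_client against a host and return its raw output."""
    result = subprocess.run(
        ["openssl", "s_client", "-connect", f"{host}:{port}", "-servername", host],
        input="",              # close stdin so s_client exits after the handshake
        capture_output=True,
        text=True,
        timeout=15,
    )
    return result.stdout

if __name__ == "__main__":
    output = tls_probe("example.com")
    # The session summary includes the negotiated protocol and cipher.
    for line in output.splitlines():
        if "Protocol" in line or "Cipher" in line:
            print(line.strip())
```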
Use the data!
The data is all licensed CC BY-SA 4.0, as detailed in the footer of the site, so you can take the data and use it, even in commercial projects, with attribution. If you do find an interesting use for the data, please do let me know; it'd be great to see how it's being used by the wider community.