Publishing my daily crawler data for wider analysis

I've been running crawls of the Alexa Top 1 Million and publishing the results every 6 months for the last 2 years. As promised, I'm now opening up my daily crawl data to the wider community to see what awesome things you can do with it.


The crawls

I announced in my most recent crawl, Alexa Top 1 Million Analysis - Feb 2017, that I was now crawling the Alexa Top 1 Million and analysing it daily, and that I wanted to open up the data. The problem I had was that the Alexa crawl outputs ~1.4GB of data and I also wanted to start crawling the Cisco Umbrella Top 1 Million. The horsepower to run the crawls wasn't the issue, as it's cheap enough to come by a fair quantity of compute power for a short period every day. The issue was how and where to host the growing quantity of data, and how to cover the bandwidth requirements that come with it. Fortunately, someone has stepped forward and offered to host the data for me, free of charge, so it can be available to anyone who might wish to use it!


Censys to the rescue!

I'm really pleased to announce that Zakir Durumeric (@zakirbpd) of the Censys.io project stepped forward to offer me hosting and bandwidth for my crawl data, completely free of charge. I'm really grateful to Zakir for such a kind offer and I'm happy that the data will be freely and openly available to the community. I've already started uploading and indexing the existing data and, thanks to the access I've been given, I will be able to automate this process in the future too. That means there will be fresh data available to download and use each day. Simply head over to the Scans.io site and you can see my data in the list.


[Image: the Scans.io site]



You can jump straight to the download page for my crawl data. Most of the data from this year should already be up there and available for download by the time I hit publish on this post.


[Image: file download links]


Further analysis

I've already seen some really interesting insights gleaned from the data, things that I hadn't considered or had overlooked, so I'm hoping there's a lot more to come from it yet. Please feel free to grab the data and have a play around with it yourself. If you do anything with it, it'd be awesome to hear about it, so drop something in the comments below!

About Scott
Researcher, blogger and international speaker. I'm the creator of report-uri.io and securityheaders.io, free tools to help improve online security.