When crawlers are hungry for porn...

I had a bit of a strange issue on Security Headers this week and at first I thought someone was trolling me. Turns out it wasn't someone playing a practical joke, it was search engine crawlers that were hungry for porn...


Security Headers

This is my free service that allows you to quickly and easily scan any website to check its HTTP response headers for security features like CSP and RP.


[Screenshot: the Security Headers homepage]


Live data

As you can see, the homepage shows some stats on the total number of scans, along with lists of the recent, good and bad scans.


[Screenshot: the scan stats on the homepage]


In those lists you can see the addresses of sites that have been scanned recently, and just the other day I had this up on a big projector behind me, live. Someone in the room shouted out "Hey Scott, why are there so many porn sites on there?!". I glanced over my shoulder and indeed most of the Recent Scans and Hall of Shame entries were addresses that were, shall we say, quite obviously pornographic in nature! Now, it's a live feed, so that's ok. Bad timing. I can refresh the page and they will all be gone... Cmd+R... A load more porn sites... Cmd+R again... Another load of porn sites... Damn!!


[Screenshot: the homepage lists filled with porn site addresses]


So we had a bit of a laugh together about the dangers of live demos and how there was probably a joker in the audience who got me pretty good. All in all it wasn't really a bad experience and everyone there seemed to have a good laugh at my expense, which I'm totally cool with. I thought that was the end of it, until the following day I noticed that the site was still scattered with porn sites.


Digging deeper

At the time I'd thought it was just someone in the audience and didn't give it any further thought, but for this still to be the case a whole day later, that's not a coincidence. Time to start digging. The first thing I hit was my server logs to see who was making these requests. Perhaps the joker had left a cron job running to fetch the results for porn sites!


[Screenshot: server log search results for "porn"]


Holy crap, that's a lot of scan results with porn or xxx in them alone! I didn't even try to expand the search out into some of the more, err, exotic(?) terms that I was seeing in some of the domain names! Straight away though it was obvious what the culprit was. Almost all of the requests were coming from YandexBot and a very small number from Googlebot.
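For anyone wanting to do a similar dig, here is a rough sketch of the kind of tally that points at the culprit. It is not what I actually ran: it assumes the access log is in Combined Log Format and that the scanned address shows up in the request path (the `?q=` parameter, the `tallyUserAgents` name and the sample lines are all made up for illustration).

```php
<?php
// Hypothetical helper: count User-Agent strings for requests whose path
// contains any of the given keywords. Assumes Combined Log Format lines.
function tallyUserAgents(array $lines, array $keywords): array
{
    // Capture group 1 is the request path, group 2 is the quoted User-Agent.
    $pattern = '/^\S+ \S+ \S+ \[[^\]]+\] "\S+ (\S+)[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"\s*$/';
    $counts = [];
    foreach ($lines as $line) {
        if (!preg_match($pattern, $line, $m)) {
            continue; // skip lines that are not in the assumed format
        }
        [, $path, $userAgent] = $m;
        foreach ($keywords as $keyword) {
            if (stripos($path, $keyword) !== false) {
                $counts[$userAgent] = ($counts[$userAgent] ?? 0) + 1;
                break; // count each request once, even if several keywords match
            }
        }
    }
    arsort($counts); // most frequent User-Agent first
    return $counts;
}

// Made-up sample data, for illustration only.
$yandex = 'Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)';
$google = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
$chrome = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';
$prefix = '192.0.2.1 - - [01/Jan/2018:10:00:00 +0000] ';
$sample = [
    $prefix . '"GET /?q=xxx.example.com HTTP/1.1" 200 512 "-" "' . $yandex . '"',
    $prefix . '"GET /?q=porn.example.net HTTP/1.1" 200 512 "-" "' . $yandex . '"',
    $prefix . '"GET /?q=porn.example.net HTTP/1.1" 200 512 "-" "' . $google . '"',
    $prefix . '"GET /?q=example.org HTTP/1.1" 200 512 "-" "' . $chrome . '"',
    'this line is not a log entry',
];

print_r(tallyUserAgents($sample, ['porn', 'xxx']));
```

Sorting by count puts the noisiest client at the top, which is usually all you need to spot a crawler.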


robots.txt

In my robots.txt I do allow crawlers to index the search results page because that's where most of the value of the site is. The difference now was that I was suddenly seeing a huge interest in results pages on Security Headers, but for some reason they were all for some pretty interesting porn sites. I didn't want to block the crawlers because there is value in them indexing these pages so instead I implemented a simple change.


Don't let crawlers put items on the front page

When Security Headers performs a scan there are some protections in place to prevent abuse. Multiple scans against the same remote host are limited and don't result in multiple assessments, nor do they count in the statistics. There are also some basic limits on what can go into the Recent Scans section, the Hall of Fame or the Hall of Shame. All it needed was a simple addition:


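// Skip the homepage lists when the User-Agent contains "bot" (case-sensitive).
// The ?? '' guards against requests that send no User-Agent header at all.
if (strpos($_SERVER['HTTP_USER_AGENT'] ?? '', 'bot') !== false) {
    return true;
}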

This little snippet of PHP will check to see if the User Agent string of the calling client contains the string bot and if it does, it returns out of the current function that adds the scan to the list on the homepage. I had a quick look around at some of the most common UA strings in the world and this shouldn't catch out anyone with a remotely mainstream browser it seems. Fingers crossed this was a pretty simple solution and you can now navigate back to Security Headers without having to worry about learning about whole new worlds of porn sites!
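To show what that check does and doesn't catch, here is a small standalone sketch. The `looksLikeBot` wrapper is a hypothetical name, not a function from the site, and treating a missing header as "not a bot" is my assumption. The two crawler strings are the User Agents those bots commonly send, and the browser string is a typical desktop Chrome one.

```php
<?php
// Hypothetical wrapper around the same case-sensitive substring check.
function looksLikeBot(?string $userAgent): bool
{
    // Assumption: a request with no User-Agent header is not treated as a bot.
    return $userAgent !== null && strpos($userAgent, 'bot') !== false;
}

$examples = [
    'Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)',
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
];

foreach ($examples as $userAgent) {
    // "skip" means the scan would be kept off the homepage lists.
    echo (looksLikeBot($userAgent) ? 'skip ' : 'list ') . $userAgent . PHP_EOL;
}
```

One thing worth knowing: strpos is case-sensitive, so YandexBot only matches here because of the lowercase "bots" in the URL inside its User Agent. A crawler that announces itself only as something like "ExampleBot/1.0" would slip through, and swapping in stripos would be the obvious way to catch those too.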