Blocking web crawlers

Detecting and blocking bad web crawlers can be done in several ways, but blocking alone is only a half measure.

Detecting bad web crawlers

It is a given that web crawlers visit your site. Some are worth having around, such as Googlebot and Bingbot, because they enable search engine visibility.

[Image: Web crawlers causing access forbidden]

Others provide little value, and then there are the useless ones that just use up bandwidth and fill up log files.

Detecting these robots requires a little effort, as they pretend to be regular browsers by using fake user agents.

What exposes the rather silly ones is that they request only pages and never the images associated with each page. They also have a habit of doing what can be called page stuttering: repeated requests for the same page over time. If blocked, these bad web crawlers will switch between multiple IPs for the same request, leaving tracks in the web server's log files (see image above).
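The two signs described above can be looked for in an access log with a short script. The following is a minimal sketch, assuming the server writes the common Combined Log Format; the function names, the list of asset extensions and the thresholds are illustrative assumptions, not fixed rules.

```python
import re
from collections import Counter, defaultdict

# Assumption: Combined Log Format, e.g.
# 203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /about/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3}) ')

# Assumption: these extensions identify the assets a real browser would also load.
ASSET_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg", ".css", ".js", ".ico")


def summarize(lines):
    """Count page requests (per path) and asset requests for every client IP."""
    pages = defaultdict(Counter)  # ip -> Counter of requested page paths
    assets = Counter()            # ip -> number of asset requests
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        ip, _method, path, _status = match.groups()
        path = path.split("?", 1)[0].lower()  # ignore the query string
        if path.endswith(ASSET_EXTENSIONS):
            assets[ip] += 1
        else:
            pages[ip][path] += 1
    return pages, assets


def suspicious_clients(lines, min_pages=3, stutter_threshold=3):
    """Flag IPs that fetch pages without assets, or that keep repeating one page."""
    pages, assets = summarize(lines)
    flagged = {}
    for ip, paths in pages.items():
        reasons = []
        if sum(paths.values()) >= min_pages and assets[ip] == 0:
            reasons.append("pages only, no assets")
        top_path, repeats = paths.most_common(1)[0]
        if repeats >= stutter_threshold:
            reasons.append(f"stutter on {top_path} ({repeats}x)")
        if reasons:
            flagged[ip] = reasons
    return flagged
```

Calling `suspicious_clients(open("access.log"))` (the file name is a placeholder) returns a dictionary of IPs with the reasons they were flagged. Browsers with a warm cache can also skip images, so the result is a list of candidates to review rather than a verdict.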

Additionally, bad crawlers using fake user agents will appear as normal browsers in the log file, but very few will actually fire off any tracking JavaScript on pages. This is a positive thing, as they do not pollute the data of whatever digital analytics solution is in place, but it is also a clear sign of what they are.

How to block web crawlers

The first thing to sort out is an up-to-date robots.txt file, but that only works with crawlers that have been built properly. Next, if you experience crawlers that ignore the robots.txt settings, set up an .htaccess file on the server to block crawlers based on their user agent strings.
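Both steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the crawler names `ExampleBadBot` and `ExampleScraper` are made up, the helper names are hypothetical, and the generated rules assume an Apache server with mod_rewrite enabled. The robots.txt check uses the standard library's `urllib.robotparser`.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is shut out completely,
# everyone else is only kept away from /private/.
ROBOTS_TXT = """\
User-agent: ExampleBadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def robots_allows(robots_txt, user_agent, url):
    """Return True if a well-behaved crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def htaccess_block(user_agents):
    """Build mod_rewrite rules that answer 403 Forbidden to matching user agents."""
    # Escape each name so it is matched literally inside the regular expression.
    pattern = "|".join(re.escape(name) for name in user_agents)
    return "\n".join([
        "<IfModule mod_rewrite.c>",
        "RewriteEngine On",
        # [NC] makes the user agent match case-insensitive.
        f'RewriteCond %{{HTTP_USER_AGENT}} "({pattern})" [NC]',
        # [F] returns 403 Forbidden without serving the page.
        "RewriteRule .* - [F]",
        "</IfModule>",
    ])


if __name__ == "__main__":
    print(robots_allows(ROBOTS_TXT, "ExampleBadBot", "/index.html"))
    print(htaccess_block(["ExampleBadBot", "ExampleScraper"]))
```

The robots.txt check only tells you what a polite crawler would do; the .htaccess rules are what actually stop the impolite ones, and only as long as they keep sending a recognisable user agent.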

Finally, with some bad crawlers you will need to resort to brute-force blocking on an IP address range basis. Once blocked, these are the ones that show up in log files as 403 Access Forbidden errors (they account for a subset of the 403 errors logged). It is also worth automating the reporting of the IPs that trigger these 403s to sites such as AbuseIPDB in order to alert others.
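As a rough sketch of how candidate ranges could be found, the script below collects the IPs that received a 403 and groups them into networks. It assumes the Combined Log Format and IPv4 clients; the /24 prefix, the `min_hosts` threshold and the function names are assumptions chosen for illustration. Reporting to AbuseIPDB is not shown here.

```python
import ipaddress
import re
from collections import defaultdict

# Assumption: Combined Log Format; captures the client IP and the status code.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) ')


def forbidden_ips(lines):
    """Return the set of client IPs that received a 403 response."""
    ips = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or match.group(2) != "403":
            continue
        try:
            ips.add(ipaddress.ip_address(match.group(1)))
        except ValueError:
            continue  # the log holds a hostname here, not an IP address
    return ips


def ranges_to_block(ips, prefix=24, min_hosts=2):
    """Group IPv4 addresses by network and keep networks with several offenders."""
    by_network = defaultdict(set)
    for ip in ips:
        if ip.version != 4:
            continue  # this sketch only handles IPv4
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        by_network[network].add(ip)
    return sorted(net for net, hosts in by_network.items() if len(hosts) >= min_hosts)
```

A network is only suggested when at least `min_hosts` different addresses from it were refused, which fits the pattern of one crawler rotating through several IPs in the same range. A single offender is better blocked as a single address.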

Crypts of the bad crawlers

[Image: Bad web crawler hitting 100% score]

A majority of the bad web crawlers appear to crawl out from China, Russia and the USA, typically hiding behind some colocation server park IP range or a VPS with multiple IPs, but many more countries pop up once the analysis is done. Large players like Amazon seem to be hosting such bad crawlers as well, so setting up blocking might require crafting block lists carefully in order not to break legitimate crawler traffic.
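One way to craft a block list carefully is to check every candidate range against an allowlist of ranges that legitimate crawlers use before it goes live. The sketch below does that with the standard `ipaddress` module; the function name is hypothetical and the ranges in the usage example are documentation placeholders, so a real allowlist would have to be filled from the ranges the search engines themselves publish.

```python
import ipaddress


def safe_block_list(candidates, allowlist):
    """Split candidate ranges into those safe to block and those that would
    overlap an allowlisted (legitimate crawler) range."""
    allowed = [ipaddress.ip_network(net) for net in allowlist]
    safe, conflicts = [], []
    for candidate in candidates:
        network = ipaddress.ip_network(candidate)
        # Only compare networks of the same IP version (IPv4 with IPv4, etc.).
        clashes = any(
            network.version == other.version and network.overlaps(other)
            for other in allowed
        )
        (conflicts if clashes else safe).append(str(network))
    return safe, conflicts


if __name__ == "__main__":
    # Placeholder ranges for illustration only.
    safe, conflicts = safe_block_list(
        ["203.0.113.0/24", "198.51.100.0/25"],
        ["198.51.100.0/24"],
    )
    print("block:", safe)
    print("review by hand:", conflicts)
```

Ranges that end up in the conflicts list are not necessarily innocent; they just need a narrower block, such as individual addresses, instead of the whole range.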

With the above blocking in place, a decline in bad crawler traffic should become visible over time, but only if the crawler has at least a minimal amount of logic built into it to stop wasting time on sites that block it. The number of crawlers that do will, however, be very limited.

A core requirement for a hosted site is to have tools such as IP blocking available in the management console, as this makes blocking IP ranges a breeze. Lacking that, it's a dive into firewall rules.

External link:

Block bad web crawlers