Webbanalys

User agent switching web crawler

Detecting a web crawler or bot hitting a site is challenging because of the efforts made to hide such activities. User agent switching is one such evasion technique, but it all boils down to detection and abuse prevention in order to minimize unwanted traffic.

Evasive web crawlers

Hordes of web crawlers

Web crawlers normally identify themselves as such with a distinct name in the user agent, together with a link to more information about the bot.

Another characteristic is that a web crawler checks the robots.txt file prior to crawling and then follows the crawling rules specified there.
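
As an illustration, here is a minimal Python sketch of the robots.txt check a well-behaved crawler performs before fetching a page, using the standard library's urllib.robotparser; the bot name and URLs are placeholders, not real crawlers or sites.

# Minimal sketch of a well-behaved crawler consulting robots.txt before fetching.
# "ExampleBot" and the example.com URLs are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse robots.txt

# Only crawl the page if robots.txt allows it for this user agent
if rp.can_fetch("ExampleBot", "https://example.com/some/page.html"):
    print("Allowed to crawl")
else:
    print("Disallowed by robots.txt")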

This is no longer always the case, as web crawlers are today used for many purposes. Here are some examples, in no way a complete list:

Data scraping to feed datasets for AI training
Website content change monitoring
Detecting web attack vectors
Generic online fuckery

Evasive web crawler using user agent switching


Time-spacing the web crawler requests to a website is one method used to "hide in the log files", however the many connections from a single IP calling the same content are a dead giveaway.
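
To illustrate how this shows up, here is a minimal Python sketch that tallies requests and distinct user agents per IP from a web server access log; it assumes the Apache combined log format, and the log path is an assumption to adjust for the server at hand.

# Minimal sketch: tally requests and distinct user agents per IP.
# Assumes the Apache combined log format; the log path is an assumption.
from collections import defaultdict

hits = defaultdict(int)
agents = defaultdict(set)

with open("/var/log/apache2/access.log") as log:
    for line in log:
        parts = line.split('"')
        if len(parts) < 6:
            continue  # not a combined-format line
        ip = parts[0].split()[0]   # client IP is the first field
        user_agent = parts[5]      # sixth quote-split field is the user agent
        hits[ip] += 1
        agents[ip].add(user_agent)

# An IP with many requests and many different user agents stands out
for ip in sorted(hits, key=hits.get, reverse=True)[:10]:
    print(f"{hits[ip]:8d} requests, {len(agents[ip]):3d} user agents  {ip}")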

Dodgy web crawler signatures

Being evasive by rotating the user agent is another method, but three things make such crawlers stand out (see the log-analysis sketch after the list):

They do not request the robots.txt file and/or ignore its settings
They do not load images on the requested pages
They take no ownership by pointing to a bot info URL
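
A minimal Python sketch along those lines, again assuming the Apache combined log format, flags IPs that request many pages without ever touching robots.txt or any images; the log path and request threshold are assumptions, not fixed recommendations.

# Minimal sketch: flag IPs that never request robots.txt and never load images.
# Assumes the Apache combined log format; the log path and the threshold
# of 50 requests are assumptions.
import re
from collections import defaultdict

IMAGE_RE = re.compile(r'\.(?:png|jpe?g|gif|webp|svg|ico)(?:\?|$)', re.IGNORECASE)

paths_by_ip = defaultdict(list)
with open("/var/log/apache2/access.log") as log:
    for line in log:
        parts = line.split('"')
        if len(parts) < 2:
            continue
        ip = parts[0].split()[0]
        request = parts[1].split()   # e.g. ["GET", "/page.html", "HTTP/1.1"]
        if len(request) >= 2:
            paths_by_ip[ip].append(request[1])

for ip, paths in paths_by_ip.items():
    fetched_robots = any(p.startswith("/robots.txt") for p in paths)
    loaded_images = any(IMAGE_RE.search(p) for p in paths)
    # Many page requests, no robots.txt, no images: a typical evasive crawler signature
    if len(paths) > 50 and not fetched_robots and not loaded_images:
        print(f"Suspect crawler: {ip} ({len(paths)} requests)")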

The conclusion is that web crawlers can run, but they can't hide. Web server logs always provide insight.

Dodgy hosting companies home to shifty web crawlers

Shifty web crawler dungeons

Dodgy hosting companies are where many of the shifty web crawlers come from; in some cases, proxy servers from well-known companies such as Google are abused.

Hosting companies such as OVH, Amazon, Hetzner etc. apparently have no clue at all what shenanigans their clients are up to. Then there is also the issue of the seemingly uncoordinated barrage of requests coming from China.

Defence options

One quite useful defence option is Fail2ban which, integrated with other API solutions, provides a method to block, among other things, shifty web crawlers. Using the web server's .htaccess file, specific deny or redirect settings can dispatch annoying web crawlers away from the server; a selection of IP, user agent and pattern based rules deals with them summarily.

Example of a .htaccess rule dealing with denied web crawlers:

# Permanent ban
RewriteEngine On
# Match shifty user agents case-insensitively and redirect them off the server
RewriteCond %{HTTP_USER_AGENT} (facebookexternalhit|curl|Go-http-client|GuzzleHttp|python) [NC]
RewriteRule .* https://gtfo.webbanalys.se [L,R=308]

The currently most annoying web crawler abusing web servers with bulk requests is Bytespider, but it is in no way alone. Redirection as above is one way to quantify shifty web crawlers; the simplest method is to serve them a 403 response code.

External links:

Fail2ban - intrusion prevention software framework

OpenAI GPTbot is quite daft

---