LISTSERV 16.5 - CODE4LIB Archives

> 
> On Apr 11, 2024, at 12:47 PM, Paul R. Pival wrote:
> 
> May not be of much use if they're ignoring robots.txt, but I just tripped across the following fascinating resource that's attempting to provide "A List of Known AI Agents on the Internet" https://darkvisitors.com/


I don’t know if it would help, but decades ago when I worked for an ISP, the concern was email-harvesting bots.

Our webserver was configured so that when it saw a misbehaving bot, it sent the bot to a CGI script that returned its output VERY slowly… lots of pauses between lines of text, but not so slow that the bot would time out and give up.  The response was filled with bogus email addresses and an occasional one that resolved to our spam blackhole (it collected the mail so we could identify patterns to filter on, then blacklisted the incoming IP address at the router).  Then the CGI would provide a couple of links back to itself with some random PATH_INFO so they looked like new links to follow.

If I recall correctly, the first page that they got was a warning page, saying that we thought they were a bot and misbehaving, which then gave them a link to the CGI (which had a disallow in robots.txt).  And the CGI’s response had a warning, too.
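
For anyone who wants to experiment, here is a rough Python sketch of that kind of tarpit CGI. It is not the original script, only a guess at its shape: the script path, the five-second delay, the address counts, and the trap mailbox `trap@blackhole.example` are all made-up placeholders.

```python
import random
import string
import time


def random_token(rng, length=8):
    """Return a random lowercase token for fake names and paths."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def tarpit_lines(script_name, rng, n_addresses=20, n_links=2,
                 trap_address="trap@blackhole.example"):
    """Yield the lines of one tarpit page, one at a time."""
    yield "<html><body>"
    yield "<p>Warning: this page is meant for misbehaving bots.</p>"
    for i in range(n_addresses):
        # Mostly bogus addresses; every tenth one is the (hypothetical)
        # trap mailbox that feeds the spam blackhole.
        if i % 10 == 9:
            addr = trap_address
        else:
            addr = f"{random_token(rng)}@{random_token(rng)}.invalid"
        yield f'<a href="mailto:{addr}">{addr}</a><br>'
    for _ in range(n_links):
        # Links back to this script with random PATH_INFO look like new pages.
        yield f'<a href="{script_name}/{random_token(rng)}">more</a><br>'
    yield "</body></html>"


def serve(out, script_name="/cgi-bin/tarpit", delay=5.0, rng=None,
          sleep=time.sleep):
    """Write the CGI response to `out`, pausing after every line."""
    rng = rng or random.Random()
    out.write("Content-Type: text/html\r\n\r\n")
    out.flush()
    for line in tarpit_lines(script_name, rng):
        out.write(line + "\n")
        out.flush()   # push each line out so the client keeps waiting
        sleep(delay)  # assumed delay; keep it shorter than the bot's timeout
```

Run as a CGI, this would be called as `serve(sys.stdout, os.environ.get("SCRIPT_NAME", "/cgi-bin/tarpit"))`. The delay has to be tuned by hand, since the point is to stay just under whatever timeout the bot uses.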

You might be able to do something similar for misbehaving AI crawlers, serving them poisoned images and garbled text instead of bad links.  I’ve heard that AI companies are having issues finding clean training data, as using AI-generated text and images messes up their training, so you could always serve them AI-generated content, too.

And if you’re using iptables or similar, there are ways to restrict the number of active connections from a single IP address.  I typically use:

-A INPUT -p tcp -m tcp --dport 80 --tcp-flags FIN,SYN,RST,ACK SYN -m connlimit --connlimit-above 5 --connlimit-mask 32 -j REJECT --reject-with tcp-reset

‘--connlimit-mask’ is the number of bits to use when determining what gets grouped together for the rule, so a mask of 32 (as above) counts each IP address on its own, while a mask of 24 would group a whole class C subnet (256 addresses).
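
To make the grouping concrete, here is a small Python illustration of what the mask does to the count. It only mimics the arithmetic using the standard `ipaddress` module; it is not how netfilter implements connlimit, and the function names and sample addresses are invented for the example.

```python
import ipaddress
from collections import Counter


def connlimit_group(ip, mask_bits):
    """Return the network that `ip` falls into for a given prefix length."""
    # strict=False lets us pass a host address and have the host bits zeroed.
    return ipaddress.ip_network(f"{ip}/{mask_bits}", strict=False)


def over_limit(source_ips, mask_bits=32, limit=5):
    """Return the groups with more than `limit` connections.

    `source_ips` has one entry per open connection, loosely mirroring
    what --connlimit-above counts.
    """
    counts = Counter(connlimit_group(ip, mask_bits) for ip in source_ips)
    return {str(net): n for net, n in counts.items() if n > limit}


# Six connections, each from a different host in the same /24.
ips = [f"192.0.2.{i}" for i in range(1, 7)]
print(over_limit(ips, mask_bits=32))  # no single address exceeds 5
print(over_limit(ips, mask_bits=24))  # the /24 as a whole does
```

With a mask of 32 a crawler spread across many addresses in one subnet stays under the limit, while a mask of 24 catches it, at the cost of also lumping together unrelated users who share that subnet.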

You will need to set similar rules with ip6tables if you’re also using IPv6 (where the mask can be anywhere up to 128 bits).


For some other notes about blocking abusive clients in Apache, see https://docs.virtualsolar.org/wiki/WebserverSetup

(But it hasn’t been updated for a decade, so there might be other better ways now … and I don’t think I ever posted my homebrew mod_security rules online)

-Joe
(Unaffiliated)