Good Morning,
I wanted to express my appreciation for everyone who has responded in this thread. Thank you for sharing and for providing some great information and options. A quick note back to Tim at UNC regarding the survey: our app admin for our ILS has been in touch with one of the admins there previously, and I confirmed we did participate in the survey. Thank you for putting that together.
There's been some great information to look through with my team, and again, I appreciate all of the current (and future) replies.
Regards,
Eric C. Blevins (He/Him/His)
Sr. Manager of Library Technology
RIT Libraries
Rochester Institute of Technology
Email: [log in to unmask]
-----Original Message-----
From: Code for Libraries <[log in to unmask]> On Behalf Of Joe Hourclé
Sent: Wednesday, March 26, 2025 11:27 AM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Bot scraping/DDoS against ILS and discovery layers
> On Mar 26, 2025, at 7:26 AM, Eric Blevins <[log in to unmask]> wrote:
>
> Good morning,
>
> First time posting to Code4Lib, but have been a watcher for several years. I'm curious from strictly a numbers standpoint how many libraries might've been impacted recently (say the last couple of weeks or so) by massive bot harvesting of data, basically resulting in a DDoS attack ...
(resending, as it looks like my response this morning got rejected for not having a plain-text alternative)
I know you said you were only trying to figure out the scope of the issue, and I can't answer your main question (as I'm not currently working), but I've dealt with similar issues at both an ISP in the 90s (when the issue was email harvesting) and at a NASA archive (when it was a combination of attacks, people automating tasks in an abusive way, and the occasional 'crackpot story about UFOs circling the sun got mentioned by a website in China' event), which we had to balance against legitimate "the sun did something interesting, and now everyone wants a copy of the data" traffic... so I have 25+ years of dealing with abusive traffic.
For the ISP, if we identified a suspected misbehaving bot (either manually, by seeing patterns and adding their IP or user agent to some Apache mod_rewrite rules, or because they followed a hidden link to a path that was denied in /robots.txt), I would send them to a CGI that would:
1. Post a text warning that they were suspected of being a spam harvester and needed to stop
2. Randomly generate e-mail addresses, with a sleep just long enough so they didn't time out, but didn't bog down our servers
3. The email addresses were all bogus for the most part, but were of three types:
a. host name did not resolve
b. a spam trap that I used for tuning the ISP's spam filters
c. a host name that resolved to an IP on our network that would blackhole at the router any IP address that connected to it.
... There have been reports that Cloudflare recently started doing something similar to poison AI harvesters, serving them AI-generated true-but-useless information: https://arstechnica.com/ai/2025/03/cloudflare-turns-ai-against-itself-with-endless-maze-of-irrelevant-facts/
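A rough sketch of what that sort of tarpit CGI can look like, in Python (this is not the original code; the trap domain names are placeholders standing in for the three address types above):

#!/usr/bin/env python3
# Sketch of a spam-harvester tarpit CGI.  Not the original code; the
# domain names below are placeholders, not real trap hosts.
import random
import string
import sys
import time

TRAP_DOMAINS = [
    "no-such-host.invalid",    # type a: host name does not resolve
    "spamtrap.example.net",    # type b: mailbox that feeds the spam filters
    "blackhole.example.net",   # type c: connecting IPs get null-routed
]

def random_localpart(length=8):
    """Generate a plausible-looking mailbox name."""
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))

def main():
    sys.stdout.write("Content-Type: text/html\r\n\r\n")
    sys.stdout.write("<html><body><p>You appear to be harvesting email "
                     "addresses.  Please stop.</p>\n")
    sys.stdout.flush()
    # Dribble out bogus addresses slowly: fast enough that the client
    # doesn't time out, slow enough that it doesn't load the server.
    for _ in range(200):
        addr = "%s@%s" % (random_localpart(), random.choice(TRAP_DOMAINS))
        sys.stdout.write('<p><a href="mailto:%s">%s</a></p>\n' % (addr, addr))
        sys.stdout.flush()
        time.sleep(2)
    sys.stdout.write("</body></html>\n")

if __name__ == "__main__":
    main()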
...
For the NASA situation, most webservers have a way to set rate-limiting these days, but in the early days, we relied on iptables / ipfilters / ipf or similar.
Checking my (pretty old) notes that I had written for people configuring data servers:
__BEGIN__
Slowing Abusive Parallel Downloading
There are modules that allow you to do rate limiting within the webserver, but if you have a machine using IPTables, you can limit a given IP address to only 5 connections at once using:
-A INPUT -p tcp -m tcp --dport 80 --tcp-flags FIN,SYN,RST,ACK SYN -m connlimit --connlimit-above 5 --connlimit-mask 32 -j REJECT --reject-with tcp-reset

You can also set limits per IP block by reducing --connlimit-mask. Use --connlimit-mask=24 for a 256 IP address block.
__END__
(This is 10 years old; I got into trouble after pushing back against the HTTPS-Only thing, as it screwed up our intrusion detection system, so I never updated my notes for HTTPS or IPv6 ... look up 'connlimit-mask' examples and you should find stuff.)
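(For reference, per-IP rate limiting at the application layer looks roughly like the following as Python WSGI middleware. This is just a sketch, not something from those notes, and the limits are arbitrary examples:)

# Sketch of per-IP rate limiting at the application layer (WSGI middleware).
# The limits here are arbitrary examples, not recommendations.
import time
from collections import defaultdict, deque

class RateLimitMiddleware:
    """Reject clients that make more than `limit` requests per `window` seconds."""

    def __init__(self, app, limit=30, window=60):
        self.app = app
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)   # client IP -> timestamps of recent requests

    def __call__(self, environ, start_response):
        ip = environ.get("REMOTE_ADDR", "unknown")
        now = time.time()
        recent = self.hits[ip]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            start_response("429 Too Many Requests",
                           [("Content-Type", "text/plain"),
                            ("Retry-After", str(self.window))])
            return [b"Too many requests; slow down.\n"]
        recent.append(now)
        return self.app(environ, start_response)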
...
The "lots of legitimate traffic" was more difficult (and unless you have a blatant UserAgent string, DDoS and cloud harvesting are going to look similar). As the servers got bogged down, some of them would switch over to a 'low resource' mode where they'd stop serving 'branding' type images (switched to header & footers that didn't try to wrap the content in pretty borders and such)
...
I also did a fair bit of database and webserver/website tuning... but some of it took a while, as you had to talk people into changing their sites... like convincing our web designer to stop using images with rollovers for every damned menu item on the sidebar, and replacing them with CSS. (She finally agreed when it came time to change the menu and realized my way meant she didn't have to re-generate everything and spend a day on it; she could just add an item to the list.)
...there have been changes in web design practices in the 20 years since that have improved things some, but users having higher bandwidth means that a lot of people don't even attempt to optimize their sites anymore, so they'll just load a crapload of javascript files that they don't use, or pull in a large framework for just a single function from it.
I also try to serve as much static content as possible... in the early days of Fark, we actually had a CGI to splice new items onto the page, rather than dynamically generating it.
When you can't do fully static content, check to see how much caching your webserver can do, or put it behind a caching proxy. It won't help with harvesters (which are attempting to look at every last page), but will help with the accidental DDoS when the 'UFOs as big as the earth!' crazies hit your site.
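(Caching proxies generally key off the response headers, so for anything dynamic it helps to send something cache-friendly. Roughly, as a sketch with arbitrary values:)

# Sketch: send cache-friendly headers so an upstream caching proxy (or the
# browser) can reuse responses.  The max-age value is an arbitrary example.
import hashlib

def cached_response(start_response, body, content_type="text/html", max_age=300):
    """Serve `body` (bytes) with Cache-Control and a strong ETag."""
    etag = '"%s"' % hashlib.sha1(body).hexdigest()
    headers = [
        ("Content-Type", content_type),
        ("Content-Length", str(len(body))),
        # public: shared caches (proxies) may store it; max-age is in seconds.
        ("Cache-Control", "public, max-age=%d" % max_age),
        ("ETag", etag),
    ]
    start_response("200 OK", headers)
    return [body]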
Depending on your situation, it might also make sense to split out highly requested static content (eg, branding images, CSS, javascript, and highly requested data) and your 'long tail' data (static, but infrequently requested) to different servers (even if just additional services running on the same machine), as they can then be tuned differently. You may want to use nginx or some other reverse proxy so that it still looks like one server to the outside world (it gives you a single point to monitor, and keeps existing links from breaking (https://www.w3.org/Provider/Style/URI)).
-Joe
(not currently affiliated)
/also got crushed by problem traffic when I worked for a university in the early 90s
//but that was because a staff member had decided to start a porn site on our student/staff webserver
///and had to fight with the folks who ran the registrar's site to actually tune their databases, as almost all freshmen registered for their classes on one of two days during orientation