I'd be interested in seeing the code.
Currently, we take a very manual approach: we look at bad traffic patterns by hand, or in some cases via Splunk, then apply manual IP or host agent bans in Apache configuration files, then pass them on to our CyberOps team to evaluate for a campus-wide ban.
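As a rough illustration of the log-review step described above, here is a minimal Python sketch that counts requests per client address in Apache common/combined-format log lines and prints candidate ban lines. The function names, the threshold, and the log format are assumptions for illustration, not our actual tooling. The `Require not ip` lines assume Apache 2.4 and would need to sit inside a `<RequireAll>` block.

```python
import re
from collections import Counter

# Start of an Apache common/combined log line; the client address comes first.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} ')


def count_requests_by_ip(lines):
    """Count requests per client address, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def heavy_hitters(counts, threshold):
    """Return the addresses with at least `threshold` requests, sorted."""
    return sorted(ip for ip, n in counts.items() if n >= threshold)


def apache_deny_lines(ips):
    """Format Apache 2.4 directives (to be placed inside <RequireAll>)."""
    return [f"Require not ip {ip}" for ip in ips]


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python review_log.py < access.log
    totals = count_requests_by_ip(sys.stdin)
    for directive in apache_deny_lines(heavy_hitters(totals, threshold=1000)):
        print(directive)
```

A per-address count only catches the single-IP harvesters. Distributed scraping would need grouping by network or user agent instead, and anything flagged this way still deserves a human look before a ban goes in.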
Bruce Orcutt
UTSA Libraries: Systems
(210) 458 – 6192
________________________________
From: Code for Libraries <[log in to unmask]> on behalf of Demian Katz <[log in to unmask]>
Sent: Friday, March 28, 2025 8:25 AM
To: [log in to unmask] <[log in to unmask]>
Subject: [EXTERNAL] Re: [CODE4LIB] [EXTERNAL] Re: [CODE4LIB] Bot scraping/DDoS against ILS and discovery layers
We've built a local PHP-based "firewall" that we can simply include in other PHP applications to block traffic matching unwanted patterns. This has proven effective when we need to respond to an incident quickly and can't wait for the upstream information security team to make changes to the "real" firewalls. 😊 I'm happy to share code if anyone is interested in this approach. I don't claim that it's very sophisticated, but neither are most of the bots that we are currently fighting.
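To give a sense of the general idea, here is a minimal sketch of pattern-based request blocking. It is written in Python rather than PHP and is not the actual code mentioned above. The `BlockRules` structure, the `is_blocked` function, and the example patterns are all hypothetical.

```python
import ipaddress
import re
from dataclasses import dataclass, field


@dataclass
class BlockRules:
    """Hypothetical rule set: CIDR networks plus regexes for agent and path."""
    networks: list = field(default_factory=list)
    user_agent_patterns: list = field(default_factory=list)
    path_patterns: list = field(default_factory=list)


def is_blocked(remote_addr, user_agent, path, rules):
    """Return True if the request matches any rule in `rules`."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        addr = None  # unparseable address: skip network checks only

    if addr is not None:
        for cidr in rules.networks:
            # strict=False accepts CIDR strings with host bits set
            if addr in ipaddress.ip_network(cidr, strict=False):
                return True

    if any(re.search(p, user_agent or "") for p in rules.user_agent_patterns):
        return True
    return any(re.search(p, path or "") for p in rules.path_patterns)


# Example rules (illustrative values only, using documentation IP ranges).
EXAMPLE_RULES = BlockRules(
    networks=["203.0.113.0/24"],
    user_agent_patterns=[r"(?i)badbot"],
    path_patterns=[r"^/Search/Results\?.*page=\d{4,}"],
)
```

An application would call something like `is_blocked(...)` at the very top of request handling and return a 403 before doing any expensive work such as a search query.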
- Demian
-----Original Message-----
From: Code for Libraries <[log in to unmask]> On Behalf Of Tod Olson
Sent: Wednesday, March 26, 2025 10:08 AM
To: [log in to unmask]
Subject: [EXTERNAL] Re: [CODE4LIB] Bot scraping/DDoS against ILS and discovery layers
I can also say that we've seen a fair amount of this sort of scraping/DDoS; it's been happening since late December. (We've also had one or two incidents in that timeframe of harvesting coming from single IPs, which of course are easier to deal with.) We're a FOLIO shop running VuFind locally, and have also seen similar scraping/DDoS against our image database.
We have also implemented Cloudflare's Turnstile to good effect. We are also exploring some Web Application Firewall options, in case things evolve past the point where Turnstile is effective.
Best,
-Tod
Tod Olson <[log in to unmask]> (he/him)
Director of Integrated Library Systems
University of Chicago Library
Local Host Committee, Open Repositories 2025 <https://or2025.openrepositories.org/>
On Mar 26, 2025, at 6:56 AM, Esmé Cowles <[log in to unmask]> wrote:
Eric-
We have seen a lot of bot traffic in the last few weeks, and we are a Clarivate (Alma) shop, though our discovery layer is Blacklight. Something we've noticed as we've tried to block the bot traffic is that the spikes of bot activity that have been DoSing us for many months now are only part of the picture, and we actually have a very high baseline level of bot activity at all times. So much so that we're reconsidering our analytics picture, because so much of our recent historical traffic is undetected bots (e.g., in one report China represented about 90% of our traffic).
We've also heard of similar levels of problems from digital collections and other kinds of sites (e.g., SourceHut <https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/>). So my general impression is that this isn't targeted at one technology stack or libraries, but is basically everybody with any content on the internet.
The thing we've implemented recently, and the first thing that's been really successful, is Turnstile. Jonathan Rochkind wrote up this approach:
https://bibwild.wordpress.com/2025/01/16/using-cloudflare-turnstile-to-protect-certain-pages-on-a-rails-app/
And we adapted that to our setup using Traefik:
https://github.com/pulibrary/princeton_ansible/tree/main/nomad/traefik-wall
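For anyone unfamiliar with how Turnstile fits in: the browser widget produces a token (submitted as the `cf-turnstile-response` form field), and the server confirms it against Cloudflare's siteverify endpoint before letting the request through. A minimal Python sketch of that server-side check follows. It is not the Rails or Traefik code linked above. The function name and the injectable `opener` parameter are assumptions made so the sketch can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request

SITEVERIFY_URL = "https://challenges.cloudflare.com/turnstile/v0/siteverify"


def verify_turnstile(token, secret, remote_ip=None,
                     opener=urllib.request.urlopen):
    """Return True if Cloudflare reports the Turnstile token as valid.

    `opener` defaults to urllib's urlopen; it is a parameter only so that
    a stand-in can be supplied (hypothetical design, for illustration).
    """
    fields = {"secret": secret, "response": token}
    if remote_ip:
        fields["remoteip"] = remote_ip  # optional field per the siteverify API
    body = urllib.parse.urlencode(fields).encode("ascii")
    request = urllib.request.Request(SITEVERIFY_URL, data=body, method="POST")
    with opener(request, timeout=5) as reply:
        result = json.load(reply)
    # The JSON reply carries a boolean "success" field.
    return bool(result.get("success"))
```

In practice the result would typically be cached in the user's session so that a person only sees the challenge once, and only the expensive pages (search results, for example) would be gated.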
There has been a fair amount of discussion of this on the Code4Lib and Samvera Slack workspaces (in the #bots channel in each), so I'd encourage anyone who's battling this to check those out.
-Esmé
--
Esmé Cowles <[log in to unmask]>
Asst. Director, Library Software Engineering
Princeton University Library
On Mar 26, 2025, at 7:26 AM, Eric Blevins <[log in to unmask]> wrote:
Good morning,
First time posting to Code4Lib, but have been a watcher for several years. I'm curious from strictly a numbers standpoint how many libraries might've been impacted recently (say the last couple of weeks or so) by massive bot harvesting of data, basically resulting in a DDoS attack, against your ILS, discovery layers, or other systems. I'm also curious whether non-Innovative/Clarivate product libraries are seeing similar issues. We are an Innovative/Clarivate product shop, so we have some awareness that others with those products were impacted. Again, aside from curiosity if you're a non-Clarivate shop, I'm not looking for specifics, just wondering about the scope of the attacks against other institutions/orgs.
Regards,
Eric C. Blevins
Sr. Manager of Library Technology
RIT Libraries
Rochester Institute of Technology
Email: [log in to unmask]<mailto:[log in to unmask]>