I was stopping by just to share the same note you mentioned, Tim, after reading Tod's mention of Cloudflare Turnstile:
Trapping misbehaving bots in an AI Labyrinth
https://blog.cloudflare.com/ai-labyrinth/
And the good news is that it "is available on an opt-in basis to all customers, including the Free plan".
Disclaimer: big fan of Cloudflare services (free plan), use them across a lot of my own services.
Best,
- Filipe
-----Original Message-----
From: Code for Libraries <[log in to unmask]> On Behalf Of Tim Spalding
Sent: 26 March 2025 14:26
To: [log in to unmask]
Subject: Re: [CODE4LIB] Bot scraping/DDoS against ILS and discovery layers
Not a library, but we run several library products and have several bookish websites with many millions of pages.
* We've seen an overall rise in scraping over the last two years. We and others attribute the rise to bots scraping for LLM development.
* We have anti-LLM stuff in our robots.txt, but it doesn't matter. The problem is the bad actors.
* We put ourselves behind Cloudflare several years ago after a multi-day DDoS attack (a real one, with actual extortion demands). The rise of AI scraping has meant we spend time tweaking our Cloudflare settings. CF is free, but we pay for a higher level of service.
* Much or most of the traffic is from China and Singapore, which is where a lot of cloud-computing resources are located. On several occasions we've literally shut down all traffic from China, but, alas, we have a big customer in Singapore.
* We reduced our attack surface. In our case this meant killing off our many translated language sites (LibraryThing.fr, LibraryThing.de, dk.LibraryThing.com) in favor of having language-pickers on the main site.
* Cloudflare has specific anti-AI filters, as well as a new "maze" feature to lead bots on a merry chase forever.
Tim
On Wed, Mar 26, 2025 at 10:08 AM Tod Olson <[log in to unmask]> wrote:
> I can also say that we've seen a fair amount of this sort of
> scraping/DDoS, it's been happening since late December. (We've also
> had one or two incidents in that timeframe of harvesting coming from
> single IPs, which of course are easier to deal with.) We're a FOLIO
> shop running VuFind locally, and have also seen similar scraping/DDoS
> against our image database.
>
> We have also implemented Cloudflare's Turnstile to good effect. We are
> also exploring some Web Application Firewall options, in case things
> evolve past the point of where Turnstile is effective.
>
> Best,
>
> -Tod
>
> Tod Olson <[log in to unmask]> (he/him)
> Director of Integrated Library Systems University of Chicago Library
>
> Local Host Committee, Open Repositories 2025 <https://or2025.openrepositories.org>
>
> On Mar 26, 2025, at 6:56 AM, Esmé Cowles <[log in to unmask]> wrote:
>
> Eric-
>
> We have seen a lot of bot traffic in the last few weeks, and we are a
> Clarivate (Alma) shop, though our discovery layer is Blacklight.
> Something we've noticed as we've tried to block the bot traffic is
> that the spikes of bot activity that have been DoSing us for many
> months now are only part of the picture, and we actually have a very
> high baseline level of bot activity at all times. So much so that
> we're reconsidering our analytics picture because so much of our
> recent historical traffic is undetected bots (e.g., in one report China represented about 90% of our traffic).
>
> We've also heard of similar levels of problems from digital
> collections and other kinds of sites (e.g., SourceHut
> https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/). So my general
> impression is that this isn't targeted at one technology stack or
> libraries, but is basically everybody with any content on the internet.
>
> The thing we've implemented recently, which is the first thing that's
> been really successful, is using Turnstile. Jonathan Rochkind wrote up
> this approach:
>
>
> https://bibwild.wordpress.com/2025/01/16/using-cloudflare-turnstile-to-protect-certain-pages-on-a-rails-app/
>
> And we adapted that to our setup using Traefik:
>
> https://github.com/pulibrary/princeton_ansible/tree/main/nomad/traefik-wall
>
> There has been a fair amount of discussion of this on the Code4Lib and
> Samvera Slack workspaces (in the #bots channel in each), so I'd
> encourage anyone who's battling this to check those out.
>
> -Esmé
> --
> Esmé Cowles <[log in to unmask]>
> Asst. Director, Library Software Engineering Princeton University
> Library
>
> On Mar 26, 2025, at 7:26 AM, Eric Blevins <
> [log in to unmask]> wrote:
>
> Good morning,
>
> First time posting to Code4Lib, but have been a watcher for several years.
> I'm curious from strictly a numbers standpoint how many libraries
> might've been impacted recently (say the last couple of weeks or so)
> by massive bot harvesting of data, basically resulting in a DDoS
> attack, against your ILS, Discovery Layers, or other systems. I'm
> actually also curious if non-Innovative/Clarivate product libraries
> are seeing similar issues. We are an Innovative/Clarivate product
> shop, so we have some awareness that others with those products were
> impacted. Again, aside from curiosity if you're a non-Clarivate shop,
> I'm not looking for specifics, just wondering about the scope of the attacks against other institutions/orgs.
>
> Regards,
>
> Eric C. Blevins
> Sr. Manager of Library Technology
> RIT Libraries
> Rochester Institute of Technology
> Email: [log in to unmask]
>
>
--
Check out my library at https://www.librarything.com/profile/timspalding