Not a library, but we run several library products and have several bookish
websites with many millions of pages.
* We've seen an overall rise in scraping over the last two years. We and
others attribute the rise to bots scraping for LLM development.
* We have anti-LLM rules in our robots.txt, but they don't matter. The
problem is the bad actors, who ignore robots.txt entirely.
* We put ourselves behind Cloudflare several years ago after a multi-day DDoS
attack (a real one, with actual extortion demands). The rise of AI scraping
has meant we spend time tweaking our Cloudflare settings. Cloudflare has a
free tier, but we pay for a higher level of service.
* Much or most of the traffic comes from China and Singapore, which is where
a lot of cloud-computing resources are located. On several occasions we've
literally shut down all traffic from China, but, alas, we can't do the same
for Singapore, where we have a big customer.
* We reduced our attack surface. In our case this meant killing off our
many translated language sites (LibraryThing.fr, LibraryThing.de,
dk.LibraryThing.com) in favor of having language-pickers on the main site.
* Cloudflare has specific anti-AI filters, as well as a new "maze" feature
to lead bots on a merry chase forever.
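
To illustrate the robots.txt point: the rules are advisory, and they only
take effect when the crawler itself chooses to check them. Below is a minimal
Python sketch using the standard library's urllib.robotparser. The robots.txt
content is a made-up example (GPTBot and CCBot are real crawler user-agent
names, but our actual rules are not shown here), and is_allowed and
example.org are hypothetical names used only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url.

    This is the check a polite crawler runs on its own side before
    requesting a page. Nothing on the server enforces it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Mozilla/5.0"):
        print(agent, is_allowed(agent, "https://example.org/work/1"))
```

A well-behaved crawler runs a check like is_allowed before every fetch. A bad
actor simply never does, or sends a browser-like user agent, which is why the
blocking has to happen on the server side.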
Tim
On Wed, Mar 26, 2025 at 10:08 AM Tod Olson wrote:
> I can also say that we've seen a fair amount of this sort of
> scraping/DDoS; it's been happening since late December. (We've also had one
> or two incidents in that timeframe of harvesting coming from single IPs,
> which of course are easier to deal with.) We're a FOLIO shop running VuFind
> locally, and have also seen similar scraping/DDoS against our image
> database.
>
> We have also implemented Cloudflare's Turnstile to good effect. We are
> also exploring some Web Application Firewall options, in case things evolve
> past the point of where Turnstile is effective.
>
> Best,
>
> -Tod
>
> Tod Olson (he/him)
> Director of Integrated Library Systems
> University of Chicago Library
>
> Local Host Committee, Open Repositories 2025
> <https://or2025.openrepositories.org>
>
> On Mar 26, 2025, at 6:56 AM, Esmé Cowles wrote:
>
> Eric-
>
> We have seen a lot of bot traffic in the last few weeks, and we are a
> Clarivate (Alma) shop, though our discovery layer is Blacklight. Something
> we've noticed as we've tried to block the bot traffic is that the spikes
> of bot activity that have been DoSing us for many months now are only part
> of the picture, and we actually have a very high baseline level of bot
> activity at all times. So much so that we're reconsidering our analytics
> picture because so much of our recent historical traffic is undetected bots
> (e.g., in one report China represented about 90% of our traffic).
>
> We've also heard of similar levels of problems from digital collections
> and other kinds of sites (e.g., SourceHut
> https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/). So my general
> impression is that this isn't targeted at one technology stack or
> libraries, but is basically everybody with any content on the internet.
>
> The thing we've implemented recently, and the first thing that's been
> really successful, is Turnstile. Jonathan Rochkind wrote up this
> approach:
>
>
> https://bibwild.wordpress.com/2025/01/16/using-cloudflare-turnstile-to-protect-certain-pages-on-a-rails-app/
>
> And we adapted that to our setup using Traefik:
>
> https://github.com/pulibrary/princeton_ansible/tree/main/nomad/traefik-wall
>
> There has been a fair amount of discussion of this on the Code4Lib and
> Samvera Slack workspaces (in the #bots channel in each), so I'd encourage
> anyone who's battling this to check those out.
>
> -Esmé
> --
> Esmé Cowles
> Asst. Director, Library Software Engineering
> Princeton University Library
>
> On Mar 26, 2025, at 7:26 AM, Eric Blevins wrote:
>
> Good morning,
>
> First time posting to Code4Lib, but have been a watcher for several years.
> I'm curious from strictly a numbers standpoint how many libraries might've
> been impacted recently (say the last couple of weeks or so) by massive bot
> harvesting of data, basically resulting in a DDoS attack, against your ILS,
> Discovery Layers, or other systems. I'm actually also curious if
> non-Innovative/Clarivate product libraries are seeing similar issues. We
> are an Innovative/Clarivate product shop, so we have some awareness that
> others with those products were impacted. Again, aside from curiosity if
> you're a non-Clarivate shop, I'm not looking for specifics, just wondering
> about the scope of the attacks against other institutions/orgs.
>
> Regards,
>
> Eric C. Blevins
> Sr. Manager of Library Technology
> RIT Libraries
> Rochester Institute of Technology
>
>
--
Check out my library at https://www.librarything.com/profile/timspalding