On a related note:
I've recently been made aware of Glaze and Nightshade, two tools
designed to protect images from being used as AI training data. Both
introduce subtle artifacts into the publicly available image, confusing
models that are trained on data scraped from public sources without
permission. Glaze aims to make the artist's style hard for a model to
learn, whereas Nightshade actively poisons the model into mis-categorizing
the images it trains on.
2kliksphilip has a good overview of it here:
https://www.youtube.com/watch?v=nDrCC2Uee3k
Subtlety aside, both tools do alter the original image. The video above
mentions artists using the altered version for publicly available posts and
paywalling the original... which is not as applicable for public
repositories. Perhaps any free account could access the original?
I'm also unsure as to how long the alteration process takes, and what
support they have for batch editing tools.
If anyone's gone deeper than me into this rabbit hole, please share your
thoughts!
- Kaleb
On Tue, Apr 9, 2024 at 12:05 AM Stefano Bargioni <[log in to unmask]> wrote:
> The URBE Consortium is planning to apply fail2ban rules to detect and
> block high rate accesses.
> sb
>
> --
> Dott. Stefano Bargioni
> Pontificia Universita' della Santa Croce - Roma
> Vicedirettore della Biblioteca
> <mailto:[log in to unmask]> <http://www.pusc.it>
> --- "Non refert quam multos habeas libros, sed bonos" (Seneca) ---
>
>
> > On 9 Apr 2024, at 00:15, Bruce Orcutt <[log in to unmask]> wrote:
> >
> > Also following, as we've also been finding some crawlers to be less and
> > less well behaved: ignoring robots.txt, scanning fast enough to impact
> > performance, etc. It used to be just a handful of badly behaving bots,
> > but the number is definitely growing of late.
> >
> > Bruce Orcutt
> > UTSA Libraries: Systems
> > (210) 458- 6192
> > ________________________________
> > From: Code for Libraries <[log in to unmask]> on behalf of Jason Casden <[log in to unmask]>
> > Sent: Monday, April 8, 2024 3:30:14 PM
> > To: [log in to unmask] <[log in to unmask]>
> > Subject: [EXTERNAL] Re: [CODE4LIB] blocking GPTBot?
> >
> >
> > Thanks for bringing this up, Eben. We've been having a horrible time with
> > these bots, including those from previously fairly well-behaved sources
> > like Google. They've caused issues ranging from slow response times and
> > high system load all the way up to outages for some older systems. So far,
> > our systems folks have been playing whack-a-mole with a combination of IP
> > range blocks and increasingly detailed robots.txt statements. A group is
> > being convened to investigate more comprehensive options so I will be
> > watching this thread closely.
> >
> > Jason
> >
> > On Mon, Apr 8, 2024 at 4:18 PM Eben English <[log in to unmask]> wrote:
> >
> >> Hi all,
> >>
> >> I'm wondering if other folks are seeing AI and/or ML-related crawlers
> >> like GPTBot accessing your library's website, catalog, digital
> >> collections, or other sites.
> >>
> >> If so, are you blocking or disallowing these crawlers? Has anyone come
> >> up with any policies around this?
> >>
> >> We're debating whether to allow these types of bots to crawl our digital
> >> collections, many of which contain large amounts of copyrighted or "no
> >> derivatives"-licensed materials. On one hand, these materials are
> >> available for public view, but on the other hand the type of use that
> >> GPTBot and the like are after (integrating the content into their
> >> models) could be characterized as creating a derivative work, which is
> >> expressly discouraged.
> >>
> >> Thanks,
> >>
> >> Eben English (he/him/his)
> >> Digital Repository Services Manager
> >> Boston Public Library
> >>
>
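
For anyone wanting to test the robots.txt side of this thread, here is a
minimal sketch of how a "disallow GPTBot" rule could be checked with Python's
standard urllib.robotparser module. The robots.txt content, the example.org
URL, and the is_allowed() helper are all hypothetical illustrations, not
anything posted by the participants above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration):
# disallow GPTBot everywhere, leave all other user agents unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical item URL in a digital collection.
    url = "https://example.org/collections/item/1"
    for agent in ("GPTBot", "SomeOtherBot"):
        print(agent, is_allowed(agent, url))
```

This only shows what a crawler that honors robots.txt would do. As noted
earlier in the thread, some crawlers ignore robots.txt entirely, which is
where IP range blocks or rate-based blocking come in.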