A screen-scrape harvest will likely generate more search activity than all
actual searches combined, and it will still be really slow. Even assuming
every call returns a non-duplicated record, a relatively modest catalog of
2.5 million records would take about a month (roughly 29 days) to retrieve
at one record per second.
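
To make that arithmetic concrete, here is a quick back-of-the-envelope
sketch in Python. The function name is just something I made up; the
figures are the ones above, and it assumes strictly sequential requests
with no retries or duplicates.

```python
# Rough harvest-time estimate: one request returns one new record.
RECORDS = 2_500_000        # catalog size used in the example above
SECONDS_PER_RECORD = 1.0   # assumed pace: one record per second
SECONDS_PER_DAY = 86_400


def harvest_days(records: int, seconds_per_record: float) -> float:
    """Return the wall-clock days needed to fetch every record in sequence."""
    return records * seconds_per_record / SECONDS_PER_DAY


days = harvest_days(RECORDS, SECONDS_PER_RECORD)
print(f"{days:.1f} days")  # 28.9 days
```

Slowing down to be polite (say, one request every 5 seconds) multiplies
that figure accordingly.
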
So I also think reaching out is the better approach. Even if the robots file
doesn't disallow scraping, it would be better form to work with the library
and ask for a dump. That will be far more efficient, get you better and more
complete data, and carry no risk of getting locked out. Btw, it is common
for libraries to load enormous numbers of records for subscription content
or other types of resources that might not be relevant to your project.
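
On the robots file point: Python's standard library can tell you what a
robots.txt permits, though that only answers "is it disallowed", not "is
it good form". A minimal sketch, assuming a made-up robots.txt, a made-up
user agent name, and the placeholder host opac.example.org (none of these
come from a real library):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 10
Disallow: /search
"""


def check_access(robots_txt: str, agent: str, url: str):
    """Return (allowed, crawl_delay) for `agent` fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network call
    return parser.can_fetch(agent, url), parser.crawl_delay(agent)


# "my-harvester" and both URLs are placeholders.
search_ok, delay = check_access(
    EXAMPLE_ROBOTS, "my-harvester", "https://opac.example.org/search?q=x"
)
record_ok, _ = check_access(
    EXAMPLE_ROBOTS, "my-harvester", "https://opac.example.org/record/123"
)
print(search_ok, record_ok, delay)
```

Against a live site you would point the parser at the real robots.txt with
set_url() and read() instead of parse(). Either way, a permissive file is
not the same thing as permission from the library.
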
Kyle
On Fri, Nov 26, 2021, 02:32 Stefano Bargioni <[log in to unmask]> wrote:
> My policy: contact the library manager (LM) and ask what pace to use.
> Even better: use library dumps, or ask them to periodically publish the
> data you need, so as to comply with the 3rd star of the semantic web.
> This will avoid any scraping :-)
> No way to contact the LM? Start at a very slow pace, then reduce the delay
> while querying the OPAC yourself, to see if its performance is affected by
> your scrape.
> Bye. Stefano
>
> > On 25 Nov 2021, at 20:54, M Belvadi <[log in to unmask]> wrote:
> >
> > Hi, all.
> >
> > What do you all think about code that screenscrapes (e.g. with Python's
> > Beautiful Soup) library OPACs?
> > Is it OK to do?
> > OK if it's throttled to a specific rate of hits per minute?
> > OK if it's throttled AND it's a really big library system where the load
> > might not be significant in relative terms?
> >
> > Not entirely unrelated, is there an API for the new University of
> > California Library Search system?
> >
> >
> > Melissa Belvadi
> > [log in to unmask]
> >
>