As Kyle mentioned, a screenscraping method is inefficient and will
get you incomplete results.  As a vendor to public libraries, I routinely
request (and receive) MARC dumps.  Some libraries are better than
others at pulling these from their ILS, but records based on MARC come
from the Library of Congress and are therefore public information -- to
which you are entitled if you reside in the US.  A number of libraries
make dumps available through various Open Data initiatives; coverage is
spotty, but these can be useful.  Screenscraping can be good for
spot-checking, but if you want a complete catalog, working with an ILS
administrator is, in my view, a better path.

peter


On Sun, Nov 28, 2021 at 2:25 AM Kyle Banerjee <[log in to unmask]>
wrote:

> A screen scrape harvest will likely generate more search activity than all
> actual searches combined while being really slow. Assuming a process where
> every call gets you a non duplicated record, a catalog of a relatively
> modest 2.5 million records would take you a month to retrieve at one record
> per second.
>
> So I also think reaching out is a better approach. Even if the robots file
> doesn't disallow it, would be better form to work with them and ask for a
> dump. This will be way more efficient, get you better and more complete
> info, and not carry a risk of getting locked out. Btw, it is common for
> libraries to load enormous numbers of records for subscription content or
> other types of resources that might not be relevant to your project.
>
> Kyle
>
> On Fri, Nov 26, 2021, 02:32 Stefano Bargioni <[log in to unmask]> wrote:
>
> > My policy: contact the library manager (LM) and ask what pace to use.
> > Even better: use library dumps, or ask them to periodically publish the
> > data you need, so as to be compliant with the 3rd star of the semantic web.
> > This will avoid any scraping :-)
> > No way to contact the LM? Try with a very slow pace, then reduce the delay
> > while querying the opac itself, to see if its performance is affected by
> > your scrape.
> > Bye. Stefano
> >
> > > On 25 Nov 2021, at 20:54, M Belvadi <[log in to unmask]> wrote:
> > >
> > > Hi, all.
> > >
> > > What do you all think about code that screenscrapes (eg python's
> > > Beautiful Soup) library opacs?
> > > Is it ok to do?
> > > Ok if it's throttled to a specific rate of hits per minute?
> > > Ok, if throttled AND is a really big library system where the load might
> > > not be relatively significant?
> > >
> > > Not entirely unrelated, is there an API for the new University of
> > > California Library Search system?
> > >
> > >
> > > Melissa Belvadi
> > > [log in to unmask]
> > >
> >
>
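
As a rough check on Kyle's estimate in the quoted thread, the arithmetic can be sketched in a few lines of Python. The function name `harvest_days` and its parameters are made up for illustration. It assumes strictly sequential requests, a fixed pace, and no duplicates or retries, which is the best case Kyle describes.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def harvest_days(record_count, seconds_per_record=1.0):
    """Days needed to fetch record_count records one after another.

    Hypothetical helper: assumes a fixed pace and that every request
    returns exactly one new, non-duplicated record.
    """
    return record_count * seconds_per_record / SECONDS_PER_DAY


# Kyle's example: 2.5 million records at one record per second.
days = harvest_days(2_500_000)
print(f"{days:.1f} days")  # 28.9 days
```

At a gentler pace of one request every five seconds, the same catalog would take about 145 days, which is why a dump is usually the more practical route.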


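Stefano's suggestion in the quoted thread (start very slowly, then shorten the delay only while the OPAC shows no sign of strain) might look roughly like the sketch below. Everything here is an assumption for illustration: the names `next_delay` and `harvest`, the thresholds, and the use of the scraper's own response time as a stand-in for "is the OPAC affected". The `fetch` argument is whatever function you already use to retrieve one page.

```python
import time


def next_delay(delay, latency, baseline, slowdown_factor=1.5,
               min_delay=1.0, max_delay=60.0):
    """Return the pause (seconds) to use before the next request.

    Assumed policy: if the last response took noticeably longer than the
    baseline, double the pause; otherwise shorten it by 10 percent.
    """
    if latency > baseline * slowdown_factor:
        return min(delay * 2, max_delay)
    return max(delay * 0.9, min_delay)


def harvest(urls, fetch, start_delay=30.0,
            sleep=time.sleep, clock=time.monotonic):
    """Fetch each URL in turn, adapting the pause to observed latency.

    `fetch` is supplied by the caller; `sleep` and `clock` are injectable
    so the pacing logic can be tried out without real waiting.
    """
    delay = start_delay
    baseline = None
    results = []
    for url in urls:
        started = clock()
        results.append(fetch(url))
        latency = clock() - started
        if baseline is None:
            baseline = latency  # first response time sets the baseline
        else:
            delay = next_delay(delay, latency, baseline)
        sleep(delay)
    return results
```

This only paces the requests. It does not replace checking the robots file or, better, asking the library for a dump.
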
-- 

Peter Velikonja
Head of Research, Koios LLC
http://www.koios.co
*Growing library awareness with the power of Search*