Howdy Brad,

Jason hit the nail on the head. Scraping is what you're reduced to when APIs,
extractions, DB calls, shipping drives, mounting data on shared
infrastructure, and the like aren't viable options. Scraping also sometimes
gets you precombined or preprocessed data that would otherwise be a pain to
generate.

I find your question interesting. I avoid scraping like the plague as it
gives me heartburn just thinking about it -- i.e. I'm much more inclined to
figure out how not to use the method rather than how to use it.

Having said that, I have personally used scraping to migrate ILS and
digital collections data, identify corrupted digital assets on systems,
verify embargo compliance, and generate ILL pull lists sorted in correct
order with availability. In the not-too-distant future, I expect to resort
to scraping to solve some consortial collection analysis problems that
can't be solved with the provided analytical tools and extracts. The
Orbis Cascade Alliance used to rely on web scraping to support consortial
borrowing among dozens of standalone systems that did not support NCIP.

In addition to the obvious stability issues, there are a number of other
issues to be mindful of when scraping: the method may violate TOS, look
like a DOS attack, be very slow and/or resource intensive, get mucked up
by spider traps (including unintentional ones), and be much harder or
easier depending on what headers you send. Before harvesting from someone
else's systems, be sure to call and make sure they're cool with it and that
there's not some other undocumented mechanism that will serve you better.
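
If it helps, here's a minimal sketch of what I mean by identifying yourself
and pacing your requests, in Python. The User-Agent contact address, URL,
and delay are placeholders -- tune them to whatever the host says is
acceptable:

    import time
    import requests

    # Identify the harvester so the folks on the other end know who to call.
    # The contact address here is a placeholder, not a real one.
    HEADERS = {
        "User-Agent": "LibraryHarvester/0.1 (contact: sysadmin@example.edu)",
    }

    def fetch(url, delay=2.0):
        """Fetch one page, then pause so the crawl doesn't look like a DOS."""
        response = requests.get(url, headers=HEADERS, timeout=30)
        response.raise_for_status()
        time.sleep(delay)
        return response.text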

Web scraping is all about parsing and cleaning, and the best method/tools
will vary with the specific application. As is the case with many "hacky"
methods, it's fun to do despite its deficiencies. And it works better than
one would think -- you'd be surprised how reliable a process that scrapes
millions of pages can be if you set it up right.
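
To give a rough idea of the parsing-and-cleaning end of it, here's a sketch
using BeautifulSoup. The record page markup and CSS selectors are invented
for illustration; real vendor markup is rarely this tidy:

    from bs4 import BeautifulSoup

    def parse_record(html):
        """Pull a title and call number out of a scraped page, then tidy them."""
        soup = BeautifulSoup(html, "html.parser")
        title = soup.select_one("td.title")          # hypothetical selector
        call_no = soup.select_one("td.call-number")  # hypothetical selector
        return {
            "title": " ".join(title.get_text().split()) if title else None,
            "call_number": call_no.get_text(strip=True) if call_no else None,
        }

Most of the real effort ends up in that last bit -- collapsing whitespace,
normalizing punctuation, and deciding what to do when a field simply isn't
there.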

kyle


On Tue, Nov 28, 2017 at 10:59 AM, Jason Bengtson <[log in to unmask]>
wrote:

> I use web scraping sometimes to extract data from systems that lack APIs.
> I'm doing this to get current library job openings from our University jobs
> application, for instance. I use the structure of their website in a way
> similar to an API query, scrape the results, and extract only what I need.
> I jokingly call it a FIFIO API (Fine, I'll Figure It Out). Obviously, such
> a tool is inherently unstable, and has to be closely managed. When used
> with things like the jobs application, which maintain a relatively stable
> URI structure over time, however, it can be a pretty good tool when you
> have nothing else. I also used screen scraping as part of a tool I built
> years ago to allow authorized staff to create announcements within a
> special LibGuide that they then pushed to the EZproxy login page. I wrote
> a book chapter on that one: "Leveraging LibGuides as an EZProxy
> Notifications Interface." Innovative LibGuides Applications: Real World
> Examples. New York: Rowman & Littlefield, 2016.
>
> Best regards,
>
> Jason Bengtson
>
> http://www.jasonbengtson.com/
>
> On Tue, Nov 28, 2017 at 12:24 PM, Brad Coffield <[log in to unmask]> wrote:
>
> > I think there's likely a lot of possibilities out there and was hoping to
> > hear examples of web scraping for libraries. Your example might just
> > inspire me or another reader to do something similar. At the very least,
> > the ideas will be interesting!
> >
> > Brad
> >
> >
> > --
> > Brad Coffield, MLIS
> > Assistant Information and Web Services Librarian
> > Saint Francis University
> > 814-472-3315
> > [log in to unmask]
> >
>