I agree with all of the below; for quite some time, we were reduced to scraping our Serials Solutions journal title search results for use in our Bento, because SS did not have a search or discovery API. Fortunately, the SS page was semantically very simple, and SS hadn't changed the interface in literally 10-12 years, so it was about as stable a choice as scraping can be. Scraping is a viable option given the right circumstances, and it has its own advantages, as indicated in the statement below. The response times for that page were extremely variable, but then again, APIs can have connectivity issues as well.
S
Steven Turner, MLIS
Manager, Web Technologies and Development, Assistant Professor
University Libraries
The University of Alabama<https://www.ua.edu/>
416 Gorgas Library | Box 870266, Tuscaloosa, AL 35487-0266
office 205-348-1638
steven.j.turner<mailto:[log in to unmask]>@ua.edu | http://www.lib.ua.edu/
On Nov 28, 2017, at 1:27 PM, Kyle Banerjee <[log in to unmask]<mailto:[log in to unmask]>> wrote:
Howdy Brad,
Jason nailed it on the head. Scraping is what you're reduced to when APIs,
extractions, DB calls, shipping drives, mounting data on shared
infrastructure and the like aren't viable options. Also, scraping sometimes
gets you precombined or preprocessed data that would otherwise be a pain to
generate.
I find your question interesting. I avoid scraping like the plague as it
gives me heartburn just thinking about it -- i.e. I'm much more inclined to
figure out how not to use the method rather than how to use it.
Having said that, I have personally used scraping to migrate ILS and
digital collections data, identify corrupted digital assets on systems,
verify embargo compliance, and generate ILL pull lists sorted in correct
order with availability. I expect to resort to scraping to solve some
consortial collection analysis problems that cannot be solved using
provided analytical tools and extracts in the not too distant future. The
Orbis Cascade Alliance used to rely on web scraping to support consortial
borrowing among dozens of standalone systems that did not support NCIP.
In addition to the obvious stability issues, there are a number of other
issues to be mindful of when scraping: the method may violate TOS, look
like a DoS attack, be very slow and/or resource intensive, get mucked up
by spider traps (including unintentional ones), and be much harder or
easier depending on what headers you send. Before harvesting someone
else's systems, be sure to call and make sure they're cool with it and that
there's not some other undocumented mechanism that will serve you better.
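As a rough illustration of the precautions above (not a substitute for
asking the site owner first), here is a minimal Python sketch using only
the standard library. It checks a robots.txt body before fetching, spaces
requests out so a harvest does not look like a DoS attack, and sends an
identifying User-Agent header. The agent string, contact address, and
helper names are all hypothetical.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical identifying string; use your own project name and contact.
USER_AGENT = "ExampleLibraryHarvester/0.1 (contact: webteam@example.edu)"


def allowed_by_robots(robots_txt, url, agent=USER_AGENT):
    """Return True if the given robots.txt text permits `agent` to fetch `url`."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


class Throttle:
    """Enforce a minimum number of seconds between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Sleep if the last request was too recent; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay:
                self._sleep(delay)
        self._last = self._clock()
        return delay


def build_request(url, agent=USER_AGENT):
    """Build a request that identifies the harvester via its User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": agent})
```

In a real harvest you would call throttle.wait() before each
urllib.request.urlopen(build_request(url)), and skip any URL for which
allowed_by_robots() returns False.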
Web scraping is all about parsing and cleaning, and the best method/tools
will vary with the specific application. As is the case with many "hacky"
methods, it's fun to do despite its deficiencies. And it works better than
one would think -- you'd be surprised how reliable a process that scrapes
millions of pages can be if you set it up right.
kyle
On Tue, Nov 28, 2017 at 10:59 AM, Jason Bengtson <[log in to unmask]<mailto:[log in to unmask]>>
wrote:
I use web scraping sometimes to extract data from systems that lack APIs.
I'm doing this to get current library job openings from our University jobs
application, for instance. I use the structure of their website in a way
similar to an API query, scrape the results, and extract only what I need.
I jokingly call it a FIFIO API (Fine, I'll Figure It Out). Obviously, such
a tool is inherently unstable, and has to be closely managed. When used
with things like the jobs application, which maintain a relatively stable
uri structure over time, however, it can be a pretty good tool when you
have nothing else. I also used screen scraping as part of a tool I built
years ago to allow authorized staff to create announcements within a
special LibGuide that they then pushed to the EZProxy login page. I wrote
a book chapter on that one: "Leveraging LibGuides as an EZProxy
Notifications Interface." In Innovative LibGuides Applications: Real World
Examples. New York: Rowman & Littlefield, 2016.
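The "use the site's URL structure like an API query, scrape the results,
and extract only what I need" approach described above might look roughly
like the following Python sketch. It is not the actual tool: the base URL,
the query parameters (dept, page), and the assumed markup (job links
marked with class="job-title") are all hypothetical stand-ins.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical jobs site; a real one will have its own URL structure.
BASE_URL = "https://jobs.example.edu/search"


def build_query_url(department, page=1):
    """Treat the site's query string as if it were an API call."""
    return f"{BASE_URL}?{urlencode({'dept': department, 'page': page})}"


class JobTitleParser(HTMLParser):
    """Collect the text and href of <a class="job-title"> links (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.jobs = []
        self._capturing = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "job-title" in classes:
            self._capturing = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._capturing:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._capturing:
            # Collapse runs of whitespace left over from page layout.
            title = " ".join("".join(self._text).split())
            self.jobs.append({"title": title, "url": self._href})
            self._capturing = False


def extract_jobs(html):
    """Return only the job titles and links, ignoring the rest of the page."""
    parser = JobTitleParser()
    parser.feed(html)
    parser.close()
    return parser.jobs


# Made-up sample of what a results page might contain.
SAMPLE_HTML = """
<div id="nav"><a href="/home">Home</a></div>
<ul>
  <li><a class="job-title" href="/jobs/101">Metadata
      Librarian</a></li>
  <li><a class="job-title" href="/jobs/102">Web Services &amp; Systems Librarian</a></li>
</ul>
"""
```

Because everything hangs on the page keeping the same markup, a change to
the class name or link structure would silently return an empty list, so a
check for "zero results" is worth adding in practice.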
Best regards,
*Jason Bengtson*
*http://www.jasonbengtson.com/ <http://www.jasonbengtson.com/>*
On Tue, Nov 28, 2017 at 12:24 PM, Brad Coffield <[log in to unmask]<mailto:[log in to unmask]>>
wrote:
I think there are likely a lot of possibilities out there and was hoping to
hear examples of web scraping for libraries. Your example might just
inspire me or another reader to do something similar. At the very least,
the ideas will be interesting!
Brad
--
Brad Coffield, MLIS
Assistant Information and Web Services Librarian
Saint Francis University
814-472-3315
[log in to unmask]