Brad et al.,
We use wget scripts to back up our Internet Archive pages, which, oddly
enough, is what the Internet Archive's own instructions recommend. :/
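
If it helps anyone, the shape of it is roughly the Python sketch below (the
URL list, output directory, and flag choices are illustrative, not our exact
production script):

    # Rough sketch only: urls.txt and the output directory are placeholders,
    # and the flags are the usual "polite mirroring" wget options.
    import subprocess
    from pathlib import Path

    URL_LIST = Path("urls.txt")          # one archived page URL per line
    OUTPUT_DIR = Path("wayback-backup")  # local mirror target

    def mirror(url: str) -> None:
        # --mirror turns on recursion and timestamping; --wait keeps us from
        # hammering the server; --page-requisites grabs images/CSS so the
        # saved page is usable offline.
        subprocess.run(
            ["wget", "--mirror", "--page-requisites", "--convert-links",
             "--adjust-extension", "--wait=2",
             "--directory-prefix", str(OUTPUT_DIR), url],
            check=True,
        )

    if __name__ == "__main__":
        for line in URL_LIST.read_text().splitlines():
            if line.strip():
                mirror(line.strip())
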
Kenny Ketner
Information Products Lead
Montana State Library
406-444-2870
[log in to unmask]
kennyketner.com
On Tue, Nov 28, 2017 at 12:31 PM, Brett <[log in to unmask]> wrote:
> Yes, I did ask, and ask, and ask, and waited for 2 months. There was
> something political going on internally with that group that was well
> beyond my pay grade.
>
> I did explain the potential problems to my boss and she was providing
> cover.
>
> I did it in batches as Google Sheets limits the amount of ImportXML that
> you can do in a 24 hour span, so I wasn't hammering anyone's web server
> into oblivion.
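>
> (Outside of Sheets, the same batching idea looks roughly like the Python
> sketch below; the batch size, the pause, and the fetch() helper are made
> up for illustration, not what I actually ran.)
>
>     # Illustrative only: run lookups in small batches with a pause in
>     # between, so neither the target server nor a daily quota gets hit
>     # all at once. fetch() is a placeholder for whatever does one lookup.
>     import time
>
>     def batched(items, size):
>         for i in range(0, len(items), size):
>             yield items[i:i + size]
>
>     def run_in_batches(call_numbers, fetch, batch_size=50, pause_seconds=60):
>         results = {}
>         for batch in batched(call_numbers, batch_size):
>             for call_number in batch:
>                 results[call_number] = fetch(call_number)  # one polite request
>             time.sleep(pause_seconds)  # breathe between batches
>         return results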
>
> It's funny: back in 2010-2011, I actually had to do a fair amount of work to
> get the old V1 LibGuides link checker to stop hammering my ILS offline.
>
>
>
> On Tue, Nov 28, 2017 at 2:18 PM, Bill Dueber <[log in to unmask]> wrote:
>
> > Brett, did you ask the folks at the Large University Library if they
> > could set something up for you? I don't have a good sense of how other
> > institutions deal with things like this.
> >
> > In any case, I know I'd much rather talk about setting up an API or a
> > nightly dump or something rather than have my analytics (and bandwidth!)
> > blown by a screen scraper. I might say "no," but at least it would be an
> > informed "no" :-)
> >
> > On Tue, Nov 28, 2017 at 2:08 PM, Brett <[log in to unmask]> wrote:
> >
> > > I leveraged the IMPORTXML() and XPath features in Google Sheets to pull
> > > information from a large university website to help create a set of
> > > weeding lists for a branch campus. They needed extra details about what
> > > was in off-site storage and what was held at the central campus library.
> > >
> > > This was very much like Jason's FIFO API. The central reporting group had
> > > sent me a spreadsheet with horrible data that I would have had to sort out
> > > almost completely manually, but the call numbers were pristine. I used the
> > > call numbers as a key to query the catalog, with limits for each campus I
> > > needed to check, and then it dumped all of the necessary content
> > > (holdings, dates, etc.) into the spreadsheet.
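> > >
> > > (On the Sheets side this was basically one IMPORTXML() formula per call
> > > number; the Python sketch below is just a rough equivalent of the same
> > > idea. The catalog URL and the XPath expressions are invented, not the
> > > real ones.)
> > >
> > >     # Rough equivalent of the IMPORTXML() approach, outside of Sheets.
> > >     # The search URL and XPath expressions are placeholders.
> > >     import requests
> > >     from lxml import html
> > >
> > >     CATALOG_SEARCH = "https://catalog.example.edu/search?call_number={}"
> > >
> > >     def lookup(call_number: str) -> dict:
> > >         page = requests.get(CATALOG_SEARCH.format(call_number), timeout=30)
> > >         page.raise_for_status()
> > >         tree = html.fromstring(page.content)
> > >         return {
> > >             "call_number": call_number,
> > >             # the same kinds of fields I was dumping into the spreadsheet
> > >             "holdings": tree.xpath("//td[@class='holdings']/text()"),
> > >             "dates": tree.xpath("//td[@class='dates']/text()"),
> > >         }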
> > >
> > > I've also used Feed43 as a way to modify certain RSS feeds and scrape
> > > websites to only display the content I want.
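> > >
> > > (The Feed43 piece boils down to "fetch the feed, keep only the items
> > > that match"; a stdlib-only sketch of that is below, with a made-up feed
> > > URL and keyword.)
> > >
> > >     # Minimal sketch of filtering an RSS feed down to the items I care
> > >     # about. The feed URL and keyword are placeholders.
> > >     import urllib.request
> > >     import xml.etree.ElementTree as ET
> > >
> > >     FEED_URL = "https://example.org/news/rss.xml"
> > >     KEYWORD = "weeding"
> > >
> > >     def filtered_items(url=FEED_URL, keyword=KEYWORD):
> > >         with urllib.request.urlopen(url) as response:
> > >             root = ET.fromstring(response.read())
> > >         for item in root.iter("item"):
> > >             title = item.findtext("title") or ""
> > >             if keyword.lower() in title.lower():
> > >                 yield title, item.findtext("link")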
> > >
> > > Brett Williams
> > >
> > >
> > > On Tue, Nov 28, 2017 at 1:24 PM, Brad Coffield <[log in to unmask]> wrote:
> > >
> > > > I think there are likely a lot of possibilities out there, and I was
> > > > hoping to hear examples of web scraping for libraries. Your example
> > > > might just inspire me or another reader to do something similar. At the
> > > > very least, the ideas will be interesting!
> > > >
> > > > Brad
> > > >
> > > >
> > > > --
> > > > Brad Coffield, MLIS
> > > > Assistant Information and Web Services Librarian
> > > > Saint Francis University
> > > > 814-472-3315
> > > > [log in to unmask]
> > > >
> > >
> >
> >
> >
> > --
> > Bill Dueber
> > Library Systems Programmer
> > University of Michigan Library
> >
>