Brad et al,

We use wget scripts to back up our Internet Archive pages, which, oddly
enough, is exactly what the Internet Archive's own instructions tell you to
do. :/ (A rough sketch of that kind of invocation appears at the bottom of
this message.)

Kenny Ketner
Information Products Lead
Montana State Library
406-444-2870
[log in to unmask]
kennyketner.com

On Tue, Nov 28, 2017 at 12:31 PM, Brett <[log in to unmask]> wrote:

> Yes, I did ask, and ask, and ask, and waited for two months. There was
> something political going on internally with that group that was well
> beyond my pay grade.
>
> I did explain the potential problems to my boss, and she was providing
> cover.
>
> I did it in batches, since Google Sheets limits the number of IMPORTXML
> calls you can make in a 24-hour span, so I wasn't hammering anyone's web
> server into oblivion.
>
> It's funny: I actually had to do a fair amount of work to get the old V1
> LibGuides link checker to stop hammering my ILS offline back in 2010-2011.
>
> On Tue, Nov 28, 2017 at 2:18 PM, Bill Dueber <[log in to unmask]> wrote:
>
> > Brett, did you ask the folks at the Large University Library if they
> > could set something up for you? I don't have a good sense of how other
> > institutions deal with things like this.
> >
> > In any case, I know I'd much rather talk about setting up an API or a
> > nightly dump or something rather than have my analytics (and bandwidth!)
> > blown by a screen scraper. I might say "no," but at least it would be an
> > informed "no" :-)
> >
> > On Tue, Nov 28, 2017 at 2:08 PM, Brett <[log in to unmask]> wrote:
> >
> > > I leveraged the IMPORTXML() and XPath features in Google Sheets to
> > > pull information from a large university website to help create a set
> > > of weeding lists for a branch campus. They needed extra details about
> > > what was in off-site storage and what was held at the central campus
> > > library.
> > >
> > > This was very much like Jason's FIFO API: the central reporting group
> > > had sent me a spreadsheet with horrible data that I would have had to
> > > sort out almost completely by hand, but the call numbers were
> > > pristine. I used the call numbers as a key to query the catalog, with
> > > limits for each campus I needed to check, and it dumped all of the
> > > necessary content (holdings, dates, etc.) into the spreadsheet.
> > >
> > > I've also used Feed43 as a way to reshape certain RSS feeds and to
> > > scrape websites so that they display only the content I want.
> > >
> > > Brett Williams
> > >
> > > On Tue, Nov 28, 2017 at 1:24 PM, Brad Coffield <[log in to unmask]> wrote:
> > >
> > > > I think there are likely a lot of possibilities out there, and I was
> > > > hoping to hear examples of web scraping for libraries. Your example
> > > > might just inspire me or another reader to do something similar. At
> > > > the very least, the ideas will be interesting!
> > > >
> > > > Brad
> > > >
> > > > --
> > > > Brad Coffield, MLIS
> > > > Assistant Information and Web Services Librarian
> > > > Saint Francis University
> > > > 814-472-3315
> > > > [log in to unmask]
> >
> > --
> > Bill Dueber
> > Library Systems Programmer
> > University of Michigan Library
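
For reference, the wget approach Kenny mentions: the Internet Archive's
published bulk-download instructions center on an invocation along these
lines. This is a minimal sketch, not the Montana State Library's actual
script; itemlist.txt and the base URL are illustrative:

    # itemlist.txt: one archive.org item identifier per line (hypothetical file)
    wget -r -H -nc -np -nH --cut-dirs=1 -l1 -e robots=off \
         -i itemlist.txt -B 'https://archive.org/download/'

Here -i reads the list of item identifiers, -B prepends the download URL to
each one, and -nc (no-clobber) skips files that were already fetched, so the
same script can be re-run as an incremental backup.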
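
And a minimal sketch of the IMPORTXML() pattern Brett describes, assuming
call numbers sit in column A of the sheet; the catalog URL and XPath below
are made up, since the real query parameters and markup depend entirely on
the catalog in question:

    =IMPORTXML("https://catalog.example.edu/search?campus=branch&q=" & ENCODEURL(A2),
               "//td[@class='holdings']")

Filled down a column, each formula pulls the matching holdings cell for one
call number into the sheet. Because Sheets caps how many IMPORTXML calls you
can make in a 24-hour span, a list of any size has to be worked through in
batches, as Brett notes.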