Following on what others have said, I would suggest stepping back and asking what you're going to get from the screenscraped records... are you interested in content or design? Screenscrapes won't help you with content the way MARC records would, since a screenscrape is just what's displayed. It may not include all MARC fields (even if it should), and any comparison between libraries might run into different MARC -> display field mappings. Are you studying holdings overlaps between libraries? The number of times a field is used in just one library? Or is it something for which the whole website is needed? We also have some measures which attempt to shut down bots that scrape our site too quickly or too deeply (we're worried about search engine bots just trying to index every search result page).

As both the person responsible for how our catalog displays and someone who has worked on occasion with screenscraped data, I can't imagine getting a ton of use out of a mass screenscrape compared to what I could do with the files.

Asking for extracts might be the way to go, though libraries can't necessarily share vended MARC records. The standard itself is open and much of the content created by other libraries is open, but vendors sometimes put restrictions on theirs. (On a side note, I sometimes wonder if it's less because they feel proprietary than because the records are often godawful in quality.) So if a library subscribes to 3 million ebooks and gets records for them, they may not be able to share most/all of those. But they may be happy to share the rest, give you a large sample, or point you to a way to download them. Most of our records are accessible via Z39.50, but I don't know if that really lends itself to this kind of mass search.

Your project sounds like you're looking for something interesting! I hope the suggestions in here are helpful in making it happen.

Ruth

My working day may not be your working day. Please don't feel obliged to reply to this e-mail outside of your normal working hours.

Ruth Kitchin Tillman
Sally W. Kalin Librarian for Technological Innovations
Assistant Librarian
Penn State University Libraries
Paterno Library 006
[log in to unmask]

she/her/hers