When we have patrons who try to download tens or hundreds of thousands 
of pages -- not uncommonly -- the vendor has software that notices the 
'excessive' use, sends us an email reminding us that bulk downloading 
violates our terms of service, and temporarily blacklists the IP address 
(which could become more of a problem as we move to NAT/PAT, where 
everyone appears to the external internet as one of only a few external 
IPs).

Granted, these users are usually downloading actual PDFs, not just 
citations. I'm not really sure whether they are doing it for personal 
research of some kind, or to share with off-shore 'pirate research 
paper' facilities (I'm not even making that up), but the volume of use 
that triggers the vendor's notices is such that it's definitely an 
automated process of some kind, not just someone clicking a lot.

Bulk downloading from our content vendors is usually prohibited by their 
terms of service. So, beware.

On 11/14/13 10:30 AM, Eric Lease Morgan wrote:
> Thank you for the replies, and after a bit of investigation I learned that I don’t need to do authentication because the vendor does IP authentication. Nice! On the other hand, I was still not able to resolve my original problem.
>
> I needed/wanted to download tens of thousands, if not hundreds of thousands, of citations for text mining analysis. The Web interface to the database/index limits output to 4,000 items, and selecting the set of these items is beyond tedious — it is cruel and unusual punishment. I then got the idea of using EndNote’s z39.50 client, and after a bit of back & forth I got it working, but the downloading process was too slow. I then got the bright idea of writing my own z39.50 client (below). Unfortunately, I learned that the 4,000-record limit is more than just a display limit. A person can only download the first 4,000 records in a found set; requests for record 4001, 4002, etc. fail. This is true in my locally written client as well as in EndNote.
>
> Alas, it looks as if I am unable to download the data I need/require, unless somebody at the vendor gives me a data dump. On the other hand, since my locally written client is so short and simple, I think I can create a Web-based interface to query many different z39.50 targets and provide on-the-fly text mining analysis against the results.
>
> In short, I learned a great many things.
>
> —
> Eric Lease Morgan
> University of Notre Dame
>
>
> #!/usr/bin/perl
>
> # nytimes-search.pl - rudimentary z39.50 client to query the NY Times
>
> # Eric Lease Morgan <[log in to unmask]>
> # November 13, 2013 - first cut; "Happy Birthday, Steve!"
>
> # usage: ./nytimes-search.pl > nytimes.marc
>
>
> # configure
> use constant DB     => 'hnpnewyorktimes';
> use constant HOST   => 'fedsearch.proquest.com';
> use constant PORT   => 210;
> use constant QUERY  => '@attr 1=1016 "trade or tariff"';
> use constant SYNTAX => 'usmarc';
>
> # require
> use strict;
> use ZOOM;
>
> # do the work
> eval {
>
> 	# connect; configure; search
> 	my $conn = new ZOOM::Connection( HOST, PORT, databaseName => DB );
> 	$conn->option( preferredRecordSyntax => SYNTAX );
> 	my $rs = $conn->search_pqf( QUERY );
>
> 	# requests > 4000 return errors
> 	# print $rs->record( 4001 )->raw;
> 			
> 	# retrieve; will break at record 4,000 because of vendor limitations
> 	# loop over the found set; record indices are zero-based, so stop at size - 1
> 	for my $i ( 0 .. $rs->size - 1 ) {
> 	
> 		print STDERR "\tRetrieving record #$i\r";
> 		print $rs->record( $i )->raw;
> 		
> 	}
> 		
> };
>
> # report errors
> if ( $@ ) { print STDERR "Error ", $@->code, ": ", $@->message, "\n" }
>
> # done
> exit;
>
>
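
For what it's worth, the Web-based, multi-target interface Eric mentions 
seems doable with the same ZOOM calls his script already uses. Here is a 
rough, untested sketch; the hosts, ports, and database names are 
placeholders, and any real use would still have to respect each vendor's 
terms of service.

#!/usr/bin/perl

# multi-search.pl - sketch of querying several z39.50 targets with one query
# NOTE: the hosts, ports, and database names below are made-up placeholders

# require
use strict;
use ZOOM;

# placeholder list of targets: [ host, port, database ]
my @targets = (
	[ 'z3950.example.edu',     210, 'catalog'    ],
	[ 'fedsearch.example.com', 210, 'newspapers' ],
);

# placeholder query
my $query = '@attr 1=1016 "trade or tariff"';

foreach my $target ( @targets ) {

	my ( $host, $port, $db ) = @$target;

	eval {

		# connect; configure; search -- same calls as in the script above
		my $conn = new ZOOM::Connection( $host, $port, databaseName => $db );
		$conn->option( preferredRecordSyntax => 'usmarc' );
		my $rs = $conn->search_pqf( $query );

		# report the size of the found set and fetch only the first few records
		print STDERR "$host/$db: ", $rs->size, " hits\n";
		my $max = $rs->size < 10 ? $rs->size : 10;
		for my $i ( 0 .. $max - 1 ) { print $rs->record( $i )->raw }

	};

	# report errors for this target and carry on with the next one
	if ( $@ ) { print STDERR "Error ($host/$db) ", $@->code, ": ", $@->message, "\n" }

}

# done
exit;

A thin CGI wrapper around something like this could accept the query and 
the target list from a form and hand the retrieved MARC off to the text 
mining step.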