On Jul 24, 2009, at 2:20 PM, [Chris Stockwell] wrote:

> Over the next few years, I am tasked to download 30,000 archival masters
> from Internet Archive into an archive for long-term staff access that we
> may preserve with LOCKSS. These are masters of Montana state publications.
> I have a hierarchy in mind to receive these files. The hierarchy is
> state agency\year\title\pub_date\*.pdf. I am intending to download the
> files in batches of 200 - 500 pdfs, but am thinking that if I slot them
> automatically into the archive hierarchy, misplaced or missing files
> could be very hard to find as the total grows. I will be logging the
> downloads, which should give me some control. Are there other strategies
> for ensuring that I can readily correct download errors? I am looking
> for recommendations for the simplest way to maintain reasonable control
> over the download process.

A couple of things:

If you already have identifiers picked out, you can use something like
this Python script to download them all from IA:
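A minimal sketch of such a downloader (not the original script), assuming one identifier and filename per line in a hypothetical `ids.txt`, saved into a flat `masters/` directory:

```python
# Sketch: download named files for a batch of IA identifiers.
# The ids.txt input format and the "masters" directory are assumptions.
import os
import urllib.request

def download_url(identifier, filename):
    """Build the archive.org download URL for one file of an item."""
    return "https://archive.org/download/%s/%s" % (identifier, filename)

def fetch(identifier, filename, dest_dir):
    """Download one file into dest_dir and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, filename)
    urllib.request.urlretrieve(download_url(identifier, filename), dest)
    return dest

if __name__ == "__main__":
    # Hypothetical batch file: "<identifier> <filename>" per line.
    for line in open("ids.txt"):
        identifier, filename = line.split()
        print("saved", fetch(identifier, filename, "masters"))
```

Logging each saved path (as the script prints) gives you the download log you mentioned; comparing it against your identifier list shows what is missing.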

You can use the advanced search engine to produce an XML, JSON, or CSV
file with all identifiers for a particular contributor,
e.g., all identifiers for the Montana State Library (
) as an XML file (change rows=10 to rows=10000 to get them all):

Also, if you have an identifier, then you can get the files.xml, which
contains md5 and sha1 hashes, so you can verify your downloads.

To pull the files.xml, use a /download/<id>/<id>_files.xml URL, e.g.:
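A sketch of that verification step, assuming files.xml lists entries of the form `<file name="..."><md5>...</md5></file>`:

```python
# Sketch: verify local downloads against an item's files.xml md5 hashes.
import hashlib
import os
import urllib.request
import xml.etree.ElementTree as ET

def files_xml_url(identifier):
    """The /download/<id>/<id>_files.xml URL for an item."""
    return "https://archive.org/download/%s/%s_files.xml" % (
        identifier, identifier)

def expected_md5s(xml_text):
    """Map file name -> md5 hex digest from a files.xml document."""
    out = {}
    for f in ET.fromstring(xml_text).findall("file"):
        md5 = f.find("md5")
        if md5 is not None:
            out[f.get("name")] = md5.text
    return out

def verify(identifier, local_dir):
    """Compare each downloaded file's md5 to the files.xml value."""
    xml_text = urllib.request.urlopen(files_xml_url(identifier)).read()
    for name, digest in expected_md5s(xml_text).items():
        path = os.path.join(local_dir, name)
        if os.path.exists(path):
            with open(path, "rb") as fh:
                actual = hashlib.md5(fh.read()).hexdigest()
            print(name, "OK" if actual == digest else "MISMATCH")
```

Running this after each batch, and re-downloading anything flagged MISMATCH, catches corruption before the files are slotted into the archive hierarchy.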