On Jul 24, 2009, at 9:43 PM, raj kumar wrote:
>> Over the next few years, I am tasked to download 30,000 archival
>> masters from Internet Archive into an archive for long-term staff
>> access that we may preserve with LOCKSS. These are masters of
>> Montana state publications. I have a hierarchy in mind to receive
>> these files. The hierarchy is state agency\year\title\pub_date\*.pdf.
>>
>> I am intending to download the files in batches of 200 - 500 pdfs,
>> but am thinking that if I slot them automatically into the archive
>> hierarchy, misplaced or missing files could be very hard to find as
>> the total grows. I will be logging the downloads, which should give
>> me some control. Are there other strategies for ensuring that I can
>> readily correct download errors? I am looking for recommendations
>> for the simplest way to maintain reasonable control over the
>> download process.
>
> A couple things:
>
> If you already have archive.org identifiers picked out, you can use
> something like this python script to download them all from IA:
> http://blog.openlibrary.org/2008/11/24/bulk-access-to-ocr-for-1-million-books/
'Sounds fun, and such a project is something I advocate not only for
retrospective preservation purposes but for general collection
building as well, but that is another story.
Without some sort of metadata it will not be possible for you to save
your files in the hierarchy outlined above: state agency, year, title,
publication date. On the other hand, if metadata containing these
values is readily accessible in the downloaded file itself or, as Ed
mentioned, is part of some sort of manifest (or MARC record), then you
are golden. I used Raj's script as a model for a similar process [1]:
* write a cool query against Open Library returning identifiers
* feed identifiers to mirroring program; I used wget
* download file as well as metadata
* parse metadata and process associated file accordingly
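
For what it is worth, the last step above might be sketched as
follows. This is only a minimal sketch, not the script from [1]. It
assumes the metadata has already been parsed into a Python dictionary,
and the key names (identifier, agency, year, title, pub_date) as well
as the function names are hypothetical. Each file is copied into the
agency/year/title/pub_date hierarchy and a line is appended to a CSV
manifest, so a misplaced or missing file can be traced later:

```python
import csv
import hashlib
import re
import shutil
from pathlib import Path


def safe(name):
    """Reduce a metadata value to a string usable as a directory name."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", str(name).strip())
    # Fall back to "unknown" so a missing value is easy to spot later.
    return cleaned.strip("_.") or "unknown"


def destination(root, meta, filename):
    """Build root/agency/year/title/pub_date/filename from a metadata dict."""
    # The key names here are assumptions; adjust them to the real metadata.
    keys = ("agency", "year", "title", "pub_date")
    parts = [safe(meta.get(key, "")) for key in keys]
    return Path(root).joinpath(*parts, Path(filename).name)


def file_pdf(src, root, meta, manifest):
    """Copy one downloaded PDF into the hierarchy and log it in a manifest."""
    src = Path(src)
    dest = destination(root, meta, src.name)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    # Reads the whole file into memory; fine for a sketch, not for huge files.
    digest = hashlib.md5(dest.read_bytes()).hexdigest()
    with open(manifest, "a", newline="") as handle:
        row = [meta.get("identifier", ""), str(dest), digest]
        csv.writer(handle).writerow(row)
    return dest
```

Comparing the identifiers in the manifest against the list fed to the
mirroring program shows which downloads are missing, and a search for
"unknown" in the paths shows which files had incomplete metadata.
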
If you're really lucky, then the "cool query" written against Open
Library will also return the necessary metadata, and you could use that
as a guide to save your files.
Good luck.
[1] similar process - http://infomotions.com/blog/2009/06/interent-archive-content-in-discovery-systems/
--
Eric Lease Morgan