I'm involved in a migration project that requires identification of local
information in millions of MARC records.
The master records I need to compare against are 14GB total. I don't know
how big the other files will be, but since the masters are deduped and the
source files aren't (plus they contain loads of other garbage), they will
be considerably larger. Roughly speaking, if I compare 1,000 master records
per second, it would take about 2.5 hours to cut through the file. I need
to be able to ask the file whatever questions the librarians might have
(i.e., many), so speed is important.
For reasons I won't go into, I'm stuck doing this on my laptop in Cygwin
right now, and that limits my range of motion.
I'm trying to figure out the best way to proceed. Currently, I'm extracting
specific fields for comparison. Each field tag gets a single line keyed by
OCLC number (repeated fields are catted together with a delimiter). The
idea is that if I deal with only one field at a time, I can slurp the
master info into memory and retrieve it via a hash keyed on the OCLC
control number as I loop through the comparison data. Local data will
either be stored in special files that are loaded separately from the bibs
or recorded in reports for maintenance projects.
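
To make that concrete, the extraction pass looks roughly like this (a
simplified Python sketch using pymarc; treating the 001 as the OCLC number
and pulling the 590 are just placeholders for whatever a real question
needs):

    from pymarc import MARCReader

    SEP = ' || '  # delimiter for catting repeated fields onto one line

    def extract_field(marc_path, out_path, tag):
        """Write one line per record: OCLC number, a tab, then every
        occurrence of `tag` joined with SEP."""
        with open(marc_path, 'rb') as marc, \
             open(out_path, 'w', encoding='utf-8') as out:
            for record in MARCReader(marc):
                if record is None:    # skip anything pymarc couldn't parse
                    continue
                ctrl = record['001']  # assuming the OCLC number lives in the 001
                if ctrl is None or not ctrl.data.strip():
                    continue
                values = [f.value() for f in record.get_fields(tag)]
                if values:
                    out.write(ctrl.data.strip() + '\t'
                              + SEP.join(values) + '\n')

    extract_field('masters.mrc', 'masters_590.tsv', '590')

Parameterizing the tag means each new question is just another extraction
run rather than new code.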
This process is clunky because a special comparison file has to be created
for each question, but it does seem to work (generating the preprocess
files and then running the compare takes minutes rather than hours). I
didn't use a DB because there's no way I could hold the full reference data
in memory, and I figured I'd just end up thrashing my drive.
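
For reference, the compare pass over a pair of those preprocessed files
(using the hypothetical file names from the sketch above) is essentially
just this; the literal string inequality stands in for whatever test a
given question actually needs:

    def load_masters(path):
        """Slurp the preprocessed master file into a dict keyed by OCLC number."""
        masters = {}
        with open(path, encoding='utf-8') as fh:
            for line in fh:
                oclc, _, value = line.rstrip('\n').partition('\t')
                masters[oclc] = value
        return masters

    def compare(master_path, source_path, report_path):
        # One field's worth of master data is far smaller than the full 14GB,
        # so it fits in memory as a plain dict.
        masters = load_masters(master_path)
        with open(source_path, encoding='utf-8') as src, \
             open(report_path, 'w', encoding='utf-8') as report:
            for line in src:
                oclc, _, local_value = line.rstrip('\n').partition('\t')
                master_value = masters.get(oclc)
                if master_value is None:              # no master for this OCLC number
                    continue
                if local_value != master_value:       # placeholder for the real test
                    report.write('\t'.join([oclc, local_value,
                                            master_value]) + '\n')

    compare('masters_590.tsv', 'source_590.tsv', 'report_590.tsv')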
Is this a reasonable approach, and whether or not it is, what tools should
I be thinking of using for this? Thanks,
kyle