Kyle -- if this were me, I'd break the file into a database. You have a lot of different options, but the last time I had to do something like this, I split the data into 10 tables: a control table with a primary key and OCLC number, then a table for the 0xx fields, one for the 1xx, one for the 2xx, and so on, each carrying the OCLC number and the key of the control record it relates to. You can actually do this with MarcEdit (if you have MySQL installed), but on a laptop I'm not going to guarantee the speed of the process, and generating the SQL data will take a significant amount of time -- maybe 15 hours to build the database. Once it's built, though, you'd have it, you could create indexes on it, and you could use it to prep the files for later work.
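Something like this, roughly, for the layout -- I'm sketching it with Python's built-in sqlite3 just to show the shape; MarcEdit's process actually targets MySQL, and the table and column names here are only placeholders:

import sqlite3

conn = sqlite3.connect("marc_split.db")
cur = conn.cursor()

# Control table: one row per record, an internal id plus the OCLC number.
cur.execute("""
    CREATE TABLE control (
        rec_id INTEGER PRIMARY KEY,
        oclc_number TEXT UNIQUE
    )
""")

# One table per tag block (0xx, 1xx, 2xx, ...), each row pointing back to
# the control record it came from, with an index on the OCLC number so the
# later lookups are cheap.
for block in ("0xx", "1xx", "2xx", "3xx", "4xx", "5xx", "6xx", "7xx", "8xx"):
    cur.execute("""
        CREATE TABLE fields_%s (
            rec_id INTEGER REFERENCES control(rec_id),
            oclc_number TEXT,
            tag TEXT,
            field_data TEXT
        )
    """ % block)
    cur.execute("CREATE INDEX idx_%s_oclc ON fields_%s(oclc_number)" % (block, block))

conn.commit()
conn.close()

The payoff is that once the load is done, each new question becomes a query against an indexed table instead of another pass over the whole 14GB file.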
--TR
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Kyle Banerjee
Sent: Wednesday, February 27, 2013 9:45 AM
To: [log in to unmask]
Subject: [CODE4LIB] Slicing/dicing/combining large amounts of data efficiently
I'm involved in a migration project that requires identification of local information in millions of MARC records.
The master records I need to compare with are 14GB total. I don't know what the others will be, but since the masters are deduped and the source files aren't (plus they contain loads of other garbage), there will be considerably more. Roughly speaking, if I compare 1000 master records per second, it would take about 2 1/2 hours to cut through the file. I need to be able to ask the file whatever questions the librarians might have (i.e. many), so speed is important.
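(Back-of-envelope on those numbers, assuming an average MARC record size of roughly 1.5KB, which is just a guess on my part:)

# Rough arithmetic only; the ~1.5KB average record size is an assumption, not a measurement.
master_bytes = 14 * 1024**3                # 14GB of deduped master records
avg_record_bytes = 1500                    # assumed average MARC record size
records = master_bytes / avg_record_bytes  # on the order of 10 million records
hours = records / 1000 / 3600              # at 1000 comparisons per second
print(int(records), round(hours, 1))       # roughly 10 million records, ~2.8 hours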
For reasons I won't go into, I'm stuck doing this on my laptop in cygwin right now, and that limits my range of motion.
I'm trying to figure out the best way to proceed. Currently, I'm extracting specific fields for comparison. Each field tag gets a single line keyed by OCLC number (repeated fields are concatenated with a delimiter). The idea is that if I deal with only one field at a time, I can slurp the master info into memory and retrieve it via a hash keyed on OCLC control number as I loop through the comparison data. Local data will either be stored in special files that are loaded separately from the bibs or recorded in reports for maintenance projects.
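Roughly what I mean, as a sketch -- this assumes pymarc, and the OCLC number location (001 here), the field being compared (590), and the filenames are placeholders rather than what I'm actually running:

# One-field-at-a-time compare; pymarc assumed, tag and filenames are placeholders.
from pymarc import MARCReader

def extract(path, tag, delim="|"):
    """Build {oclc_number: 'value1|value2|...'} for a single field tag."""
    table = {}
    with open(path, "rb") as fh:
        for record in MARCReader(fh):
            if record is None:  # MARCReader yields None for records it can't parse
                continue
            controls = record.get_fields("001")
            if not controls:
                continue
            values = [f.value() for f in record.get_fields(tag)]
            if values:
                table[controls[0].value()] = delim.join(values)
    return table

# Slurp the master field into memory once, then stream the comparison file past it.
master = extract("master.mrc", "590")
with open("source.mrc", "rb") as fh:
    for record in MARCReader(fh):
        if record is None:
            continue
        controls = record.get_fields("001")
        if not controls:
            continue
        oclc = controls[0].value()
        local = "|".join(f.value() for f in record.get_fields("590"))
        if oclc in master and master[oclc] != local:
            print(oclc)  # candidate for local-data review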
This process is clunky because a special comparison file has to be created for each question, but it does seem to work -- generating the preprocess files and then doing the compare is measured in minutes rather than hours. I didn't use a DB because there's no way I could fit the reference data in memory, and I figured I'd just end up thrashing my drive.
Is this a reasonable approach, and whether or not it is, what tools should I be thinking of using for this? Thanks,
kyle