I'm involved in a migration project that requires identifying local information in millions of MARC records. The master records I need to compare against total 14GB. I don't know how big the others will be, but since the masters are deduped and the source files aren't (plus they contain loads of other garbage), there will be considerably more. Roughly speaking, if I compare 1000 master records per second, it would take about 2 1/2 hours to cut through the file. I need to be able to ask the file whatever questions the librarians might have (i.e., many), so speed is important. For reasons I won't go into right now, I'm stuck doing this on my laptop in Cygwin, and that limits my range of motion.

I'm trying to figure out the best way to proceed. Currently, I'm extracting specific fields for comparison. Each field tag gets a single line keyed by OCLC number (repeated fields are concatenated with a delimiter). The idea is that if I deal with only one field at a time, I can slurp the master info into memory and retrieve it via hash (keyed on OCLC control number) as I loop through the comparison data. Local data will either be stored in special files that are loaded separately from the bibs, or recorded in reports for maintenance projects.

This process is clunky because a special comparison file has to be created for each question, but it does seem to work (generating the preprocess files and then doing the compare is measured in minutes rather than hours). I didn't use a DB because there's no way I could store the reference data in memory, and I figured I'd just thrash my drive.

Is this a reasonable approach, and whether or not it is, what tools should I be thinking of using for this?

Thanks,
kyle
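P.S. In case it helps to see what I mean, here's a rough Python sketch of the one-field-at-a-time pass: the master extract is slurped into a dict keyed by OCLC number, then the (undeduped) comparison extract is streamed line by line. The tab-delimited "OCLC<TAB>field value" format and the sample values are purely illustrative; the real preprocess files may use a different delimiter.

```python
# Sketch of the per-field comparison pass. Assumes each preprocess
# line looks like "OCLC<TAB>field value" -- an illustrative format,
# not necessarily what the actual extraction step emits.

def load_master(lines):
    """Slurp the master extract into a dict keyed by OCLC number."""
    master = {}
    for line in lines:
        oclc, _, value = line.rstrip("\n").partition("\t")
        master[oclc] = value
    return master

def compare(master, comparison_lines):
    """Stream the comparison extract; yield records whose field
    value differs from the master copy (i.e., suspected local data)."""
    for line in comparison_lines:
        oclc, _, value = line.rstrip("\n").partition("\t")
        ref = master.get(oclc)
        if ref is not None and ref != value:
            yield oclc, ref, value

# Tiny demo with in-memory lines standing in for the preprocess files.
master = load_master([
    "ocm100\t245 $a Title A",
    "ocm200\t245 $a Title B",
])
diffs = list(compare(master, [
    "ocm100\t245 $a Title A",
    "ocm200\t245 $a Title B (local note)",
]))
# diffs holds the one record where the comparison copy diverges.
```

Only the master side lives in memory, so RAM usage is bounded by the size of one field's extract rather than the full 14GB, and the comparison side can be arbitrarily large since it's streamed.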