[Since you're getting good performance using a relational database, these may not be necessary for you, but since I've been looking at some of the tricks I've used in my own code to see how they can be fitted into the revived marc4j project, I thought I'd write them down.]

If the Tennant/Dean principle holds, fast processing of MARC records is important. Here are some tips for computing over large collections of MARC records.

*1. Compression:*

Depending on how fast or slow your disks are, and how they are configured, you may benefit from using compression to reduce the amount of data that has to be read. With mirrored replicas, the cold-cache times for uncompressed files become competitive, especially if memory-mapped files are used and only part of the record is needed. Decompression typically requires accessing the whole file, and the tables used for decompression put pressure on the CPU cache. Also, since decompression uses CPU, that CPU time is not available for compute-intensive work. On a laptop, or over a SAN with more contention, compression becomes more advantageous, especially as the amount of processing power increases relative to the available I/O bandwidth.

Using 7,030,372 LC records from scriblio as a test set:

  uncompressed  5.4GB (5645008KB)  1:1
  gzip          1.7GB (1753716KB)  3.2:1
  lzma          982MB (1005396KB)  5.6:1

Compression is almost always a win with MARC-XML; however, MARC-XML is generally to be avoided when performance is a consideration.

  xml       16GB
  xml.lzma  980MB

On a Linux i7 server with 16GB of memory, using a single eSATA drive, we see:

  uncompressed: time ( for i in 0 1 2 3 4 5 6 7 8 9; do (cat dat/*$i.dat* | wc -c)& done ; wait)
                cold: 1m20.874s / warm: 0m0.483s
  gzip:         cold: 0m20.950s / warm: 0m7.174s
  lzma:         cold: 0m18.306s / warm: 0m11.913s

On a MacBook Pro (i7, 8GB memory):

  uncompressed: 2m7.319s (data too big to cache; single process: 1m33.348s)
  gzip:         0m30.622s / 0m18.024s (data fits in cache)
  lzma:         0m28.239s / 0m26.642s (data fits in cache)

*2. Use sorted files.*

If all of the records of interest in all the files have a common identifier, sort the files using that identifier. You can then process all of the local batches in parallel, accessing each record only once.

  Assume that the master file is complete.
  Open the master file and all local files.
  Read the first record from each local file.
  While at least one local file is open:
    Find the lowest record-id across all open local files.
    Advance the master file until the current master record has that id
      (if a master record-id is found that is greater than the lowest
      local-id, then that id is missing from the master).
    For every open local file whose current record-id matches the current
      record-id from the master file:
      find and output all differences between this local record and the
        master record;
      move to the next record; close the file if no records are left.

This approach is more or less a traditional merge. (A rough sketch of the merge loop is included after the quoted message below.)

Simon

On Mon, Mar 4, 2013 at 1:01 PM, Kyle Banerjee <[log in to unmask]> wrote:

>
> After trying a few experiments, it appears that my use case (mostly
> comparing huge record sets with an even bigger record set of records on
> indexed points) is well suited to a relational model. My primary goal is to
> help a bunch of libraries migrate to a common catalog so the primary thing
> people are interested in knowing is what data is local to their catalog.
>
> Identifying access points and relevant description in their catalog that
> are not in the master record involves questions like "Give me a list of
> records where field X occurs more times in our local catalog than in the
> master record (or that value is missing from the master record -- thank
> goodness for LEFT JOIN)" so that arrangements can be made.
>
> I'm getting surprising performance and the convenience of being able to do
> everything from the command line is nice.
>
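
For anyone who wants the merge in (2) spelled out, here is a minimal sketch in Python. It assumes each file has already been reduced to an iterator of (record_id, record) pairs in sorted order; merge_pass and compare are made-up names for this illustration, and nothing here is tied to marc4j or any particular MARC parser.

  # Minimal sketch of the sorted-file merge described under (2).
  # Inputs: `master` is an iterator of (record_id, record) pairs, sorted by id;
  # `locals_` is a list of such iterators, one per local batch, also sorted.

  def compare(local_record, master_record):
      # Placeholder: report whatever field-level differences matter to you.
      if local_record != master_record:
          print("differs from master:", local_record)

  def merge_pass(master, locals_):
      heads = [next(f, None) for f in locals_]      # current record per local file
      m_id, m_rec = next(master, (None, None))      # current master record

      while any(h is not None for h in heads):
          # The lowest record-id among the still-open local files drives the merge.
          low = min(h[0] for h in heads if h is not None)

          # Advance the master until it reaches (or passes) that id.
          while m_id is not None and m_id < low:
              m_id, m_rec = next(master, (None, None))

          if m_id != low:
              print("missing from master:", low)

          # Every local file sitting on that id is compared and advanced;
          # exhausting a file leaves None, which effectively closes it.
          for i, f in enumerate(locals_):
              if heads[i] is not None and heads[i][0] == low:
                  if m_id == low:
                      compare(heads[i][1], m_rec)
                  heads[i] = next(f, None)

Because every file is read strictly forward, the whole pass is sequential I/O, and each record is touched only once no matter how many local batches are being compared at the same time.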