[Since you're getting good performance from a relational database, these
may not be necessary for you; but I've been looking at how some of the
tricks I've used in my own code could be fitted into the revived marc4j
project, so I thought I'd write them down.]
If the Tennant/Dean principle holds, fast processing of MARC records is
important.
Here are some tips for computing over large collections of MARC records.
*1. Compression:*
Depending on how fast your disks are and how they are configured, you may
benefit from using compression to reduce the amount of data that has to be
read.
With mirrored replicas, the cold-cache times for uncompressed files become
competitive, especially if memory-mapped files are used and only part of
each record is needed. Decompression typically requires accessing the whole
file, and the tables used for decompression put pressure on the CPU cache.
Also, since the decompression process uses CPU, that CPU time is not
available for compute-intensive work.
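A rough sketch of the memory-mapped case (this is not marc4j code; the
class name and the 1 GiB window size are my own choices): hop from record
to record using the 5-byte record length at the start of each leader, so
record bodies are skipped rather than copied into the Java heap, and the
OS only pages in the bytes that are actually touched.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class MappedMarcScan {
    private static final long WINDOW = 1L << 30;   // 1 GiB per mapping (hard max is 2 GiB)

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
             FileChannel ch = raf.getChannel()) {
            long fileSize = ch.size();
            long offset = 0;
            long count = 0;
            byte[] len = new byte[5];
            while (offset + 24 <= fileSize) {
                // Large files are walked one mapping window at a time.
                long size = Math.min(WINDOW, fileSize - offset);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, offset, size);
                int pos = 0;
                while (pos + 24 <= size) {
                    // Bytes 0-4 of the leader hold the record length as ASCII digits.
                    buf.position(pos);
                    buf.get(len);
                    int recordLength =
                            Integer.parseInt(new String(len, StandardCharsets.US_ASCII));
                    if (pos + recordLength > size) {
                        break;                     // record crosses the window; remap from here
                    }
                    pos += recordLength;           // skip the record body untouched
                    count++;
                }
                if (pos == 0) {
                    break;                         // truncated or garbled input; give up
                }
                offset += pos;
            }
            System.out.println(count + " records");
        }
    }
}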
On a laptop, or over a SAN with more contention, compression becomes more
advantageous, especially as the amount of processing power increases
relative to available I/O bandwidth.
Using 7,030,372 LC records from scriblio as a test set:

  uncompressed  5.4GB (5645008KB)  1:1
  gzip          1.7GB (1753716KB)  3.2:1
  lzma          982MB (1005396KB)  5.6:1
Compression is almost always a win with MARC-XML; however, MARC-XML is
generally to be avoided when performance is a consideration.
  xml       16GB
  xml.lzma  980MB
On a Linux i7 server with 16GB of memory, using a single eSATA drive, we see:
uncompressed:
  time ( for i in 0 1 2 3 4 5 6 7 8 9; do (cat dat/*$i.dat* | wc -c) & done ; wait )
  cold: 1m20.874s / warm: 0m0.483s
gzip:
  cold: 0m20.950s / warm: 0m7.174s
lzma:
  cold: 0m18.306s / warm: 0m11.913s
On a MacBook Pro (i7, 8GB memory):
uncompressed:
  2m7.319s (data too big to cache; single process: 1m33.348s)
gzip:
  0m30.622s / 0m18.024s (data fits in cache)
lzma:
  0m28.239s / 0m26.642s (data fits in cache)
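For completeness, here is roughly what reading the gzipped files looks
like in code: a minimal sketch using marc4j's MarcStreamReader (the
GzipMarcCount class name is mine; an XZ/LZMA input stream, e.g. from
Apache Commons Compress, could be swapped in for GZIPInputStream). All of
the decompression cost lives inside the wrapped InputStream, which is the
CPU-for-I/O trade shown in the timings above.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.marc4j.MarcStreamReader;
import org.marc4j.marc.Record;

public class GzipMarcCount {
    public static void main(String[] args) throws Exception {
        // Decompression happens inside the stream, so the reader never sees
        // the difference between a .dat and a .dat.gz file.
        InputStream in = new GZIPInputStream(
                new BufferedInputStream(new FileInputStream(args[0])));
        MarcStreamReader reader = new MarcStreamReader(in);
        long count = 0;
        while (reader.hasNext()) {
            Record record = reader.next();   // process the record here; this sketch just counts
            count++;
        }
        in.close();
        System.out.println(count + " records");
    }
}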
*2. Use sorted files.*
If all of the records of interest in all the files have a common
identifier, sort the files using that identifier. You can then process
all of the local batches in parallel, accessing each record only once.
Assume that the master file is complete.

Open the master file and all local files.
Read the first record from each local file.
While at least one local file is open:
    Find the lowest record-id across all open local files.
    Advance the master file until the current master record's id is
        greater than or equal to the lowest local-id.
        (If the master id is greater than the lowest local-id, then that
        id is missing from the master.)
    For every open local file whose record-id equals the lowest local-id:
        If it also matches the current master record's id,
            find and output all differences between this local record
            and the master record.
        Move to the next record; close the file if no records are left.
This approach is more or less a traditional merge.
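A minimal sketch of that merge in Java, assuming each input has already
been reduced to a sorted file of record ids (one id per line, sorting
lexicographically, e.g. zero-padded). The SortedMerge class and the
id-per-line format are just for illustration; in practice the ids would
come from field 001 of each record, and the record diffing would happen
where the comment indicates.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SortedMerge {
    public static void main(String[] args) throws IOException {
        // args[0] is the master id file; the rest are local id files.
        // All files are sorted ascending, one record id per line.
        BufferedReader master = new BufferedReader(new FileReader(args[0]));
        List<BufferedReader> locals = new ArrayList<>();
        List<String> current = new ArrayList<>();   // current id for each local file
        for (int i = 1; i < args.length; i++) {
            BufferedReader r = new BufferedReader(new FileReader(args[i]));
            locals.add(r);
            current.add(r.readLine());              // read the first record id
        }
        String masterId = master.readLine();

        // While at least one local file is open (i.e. still has a current id):
        while (current.stream().anyMatch(id -> id != null)) {
            // Find the lowest id across all open local files.
            String lowest = null;
            for (String id : current) {
                if (id != null && (lowest == null || id.compareTo(lowest) < 0)) {
                    lowest = id;
                }
            }
            // Advance the master until its id is >= the lowest local id.
            while (masterId != null && masterId.compareTo(lowest) < 0) {
                masterId = master.readLine();
            }
            if (!lowest.equals(masterId)) {
                System.out.println(lowest + " missing from master");
            }
            // Every local file sitting on the lowest id is handled, then advanced.
            for (int i = 0; i < locals.size(); i++) {
                if (lowest.equals(current.get(i))) {
                    if (lowest.equals(masterId)) {
                        // diff this local record against the master record here
                    }
                    String next = locals.get(i).readLine();
                    if (next == null) {
                        locals.get(i).close();      // no records left in this file
                    }
                    current.set(i, next);
                }
            }
        }
        master.close();
    }
}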
Simon
On Mon, Mar 4, 2013 at 1:01 PM, Kyle Banerjee <[log in to unmask]> wrote:
>
> After trying a few experiments, it appears that my use case (mostly
> comparing huge record sets with an even bigger record set of records on
> indexed points) is well suited to a relational model. My primary goal is to
> help a bunch of libraries migrate to a common catalog so the primary thing
> people are interested in knowing is what data is local to their catalog.
>
> Identifying access points and relevant description in their catalog that
> are not in the master record involves questions like "Give me a list of
> records where field X occurs more times in our local catalog than in the
> master record (or that value is missing from the master record -- thank
> goodness for LEFT JOIN)" so that arrangements can be made.
>
> I'm getting surprising performance and the convenience of being able to do
> everything from the command line is nice.
>