I know there are two parts of this discussion (speed on the one hand,
applicability/features on the other), but for the former, running a little
benchmark just isn't that hard. Aren't we supposed to, you know, prefer to
make decisions based on data?
Note: I'm only testing deserialization because there isn't, as of now, a
fast serialization option for ruby-marc. It uses REXML, and it's dog-slow. I
already looked at marc-in-json vs. marc binary at
http://robotlibrarian.billdueber.com/sizespeed-of-various-marc-serializations-using-ruby-marc/
Benchmark Source: http://gist.github.com/645683
18,883 records as either an XML collection or newline-delimited json.
Open the file, read every record, pull out a title. Repeat 5 times for a
total of 94,415 records (i.e., just under 100K records total).
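(For the curious, the shape of that loop is roughly the sketch below; the
file names are placeholders, and it assumes your ruby-marc has the libxml
parser option and new_from_hash, which recent versions do. The real code is
in the gist above.)

  require 'marc'   # ruby-marc
  require 'yajl'

  # XML collection, parsed with the libxml backend
  MARC::XMLReader.new('records.xml', :parser => 'libxml').each do |record|
    title = record['245']
  end

  # Newline-delimited marc-in-json, one record per line, parsed with yajl
  File.open('records.ndj') do |f|
    f.each_line do |line|
      record = MARC::Record.new_from_hash(Yajl::Parser.parse(line))
      title = record['245']
    end
  end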
Under ruby-marc, using the libxml deserializer is the fastest option. If
you're using the REXML parser, well, god help us all.
ruby 1.8.7 (2010-08-16 patchlevel 302) [i686-darwin9.8.0]. User time
reported in seconds.
xml w/libxml 227 seconds
marc-in-json w/yajl 130 seconds
So....quite a bit faster (more than 40%). For a million records (assuming I
can just say 10*these_values) you're talking about a difference of 16
minutes due to just reading speed. Assuming, of course, you're running your
code on my desktop. Today.
For the 8M records I have to deal with, that'd be roughly 8M * ((227-130)
/ 94,415) = about 8,200 seconds, or roughly 137 minutes. Soooo...a lot.
Of course, if you're using a slower XML library or a slower JSON library,
your numbers will vary quite a bit. REXML is unforgivably slow, and
json/pure (and even 'json') are quite a bit slower than yajl. And don't
forget that you need to serialize these things from your source somehow...
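(For what it's worth, going the other way is only a few lines too -- something
like the sketch below, assuming your ruby-marc has to_hash; file names are
again placeholders:

  require 'marc'
  require 'yajl'

  # Binary MARC in, one marc-in-json hash per line out
  File.open('records.ndj', 'w') do |out|
    MARC::Reader.new('records.mrc').each do |record|
      out.puts Yajl::Encoder.encode(record.to_hash)
    end
  end

...but you still pay that cost somewhere upstream.)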
-Bill-
On Mon, Oct 25, 2010 at 4:23 PM, Stephen Meyer <[log in to unmask]> wrote:
> Kyle Banerjee wrote:
>
>> On Mon, Oct 25, 2010 at 12:38 PM, Tim Spalding <[log in to unmask]>
>> wrote:
>>
>>> Does processing speed of something matter anymore? You'd have to be
>>> doing a LOT of processing to care, wouldn't you?
>>>
>> Data migrations and data dumps are a common use case. Needing to break or
>> make hundreds of thousands or millions of records is not uncommon.
>>
>> kyle
>>
>
> To make this concrete, we process the MARC records from 14 separate ILSes
> throughout the University of Wisconsin System. We extract, sort on OCLC
> number, dedup and merge pieces from any campus that has a record for the
> work. The MARC that we then index and display here
>
> http://forward.library.wisconsin.edu/catalog/ocm37443537?school_code=WU
>
> is not identical to the version of the MARC record from any of the 4
> schools that hold it.
>
> We extract 13 million records and dedup down to 8 million every week. Speed
> is paramount.
>
> -sm
> --
> Stephen Meyer
> Library Application Developer
> UW-Madison Libraries
> 436 Memorial Library
> 728 State St.
> Madison, WI 53706
>
> [log in to unmask]
> 608-265-2844 (ph)
>
>
> "Just don't let the human factor fail to be a factor at all."
> - Andrew Bird, "Tables and Chairs"
>
--
Bill Dueber
Library Systems Programmer
University of Michigan Library