I know there are two parts to this discussion (speed on the one hand, applicability/features on the other), but for the former, running a little benchmark just isn't that hard. Aren't we supposed to, you know, prefer to make decisions based on data?

Note: I'm only testing deserialization because there isn't, as of now, a fast serialization option for ruby-marc. It uses REXML, and it's dog-slow. I already looked at marc-in-json vs. marc binary at
http://robotlibrarian.billdueber.com/sizespeed-of-various-marc-serializations-using-ruby-marc/

Benchmark source: http://gist.github.com/645683

18,883 records as either an XML collection or newline-delimited JSON. Open the file, read every record, pull out a title. Repeat 5 times for a total of 94,415 records (i.e., just under 100K records total). Under ruby-marc, the libxml deserializer is the fastest XML option. If you're using the REXML parser, well, god help us all.

ruby 1.8.7 (2010-08-16 patchlevel 302) [i686-darwin9.8.0]. User time reported in seconds.

  xml w/libxml          227 seconds
  marc-in-json w/yajl   130 seconds

So... quite a bit faster (more than 40%). For a million records (assuming I can just say 10 * these_values) you're talking about a difference of 16 minutes due to reading speed alone. Assuming, of course, you're running your code on my desktop. Today.

For the 8M records I have to deal with, that'd be roughly 8M * ((227 - 130) / 94,415) = 8,219 seconds, or about 137 minutes. Soooo... a lot.

Of course, if you're using a slower XML library or a slower JSON library, your numbers will vary quite a bit. REXML is unforgivingly slow, and json/pure (and even 'json') are quite a bit slower than yajl. And don't forget that you need to serialize these things from your source somehow...

 -Bill-

On Mon, Oct 25, 2010 at 4:23 PM, Stephen Meyer <[log in to unmask]> wrote:

> Kyle Banerjee wrote:
>
>> On Mon, Oct 25, 2010 at 12:38 PM, Tim Spalding <[log in to unmask]>
>> wrote:
>>
>>> Does processing speed of something matter anymore? You'd have to be
>>> doing a LOT of processing to care, wouldn't you?
>>
>> Data migrations and data dumps are a common use case. Needing to break
>> or make hundreds of thousands or millions of records is not uncommon.
>>
>> kyle
>
> To make this concrete, we process the MARC records from 14 separate
> ILSes throughout the University of Wisconsin System. We extract, sort on
> OCLC number, dedup, and merge pieces from any campus that has a record
> for the work. The MARC that we then index and display here
>
> http://forward.library.wisconsin.edu/catalog/ocm37443537?school_code=WU
>
> is not identical to the version of the MARC record from any of the 4
> schools that hold it.
>
> We extract 13 million records and dedup down to 8 million every week.
> Speed is paramount.
>
> -sm
> --
> Stephen Meyer
> Library Application Developer
> UW-Madison Libraries
> 436 Memorial Library
> 728 State St.
> Madison, WI 53706
>
> [log in to unmask]
> 608-265-2844 (ph)
>
>
> "Just don't let the human factor fail to be a factor at all."
>    - Andrew Bird, "Tables and Chairs"

--
Bill Dueber
Library Systems Programmer
University of Michigan Library
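
P.S. For anyone who doesn't want to click through to the gist, the guts of the benchmark look roughly like this. This is a minimal sketch, not the actual gist code: the filenames are made up, and I'm assuming ruby-marc's :parser option on MARC::XMLReader plus a marc-in-json-style constructor along the lines of MARC::Record.new_from_hash, with yajl-ruby doing the JSON parsing.

  require 'rubygems'
  require 'marc'
  require 'yajl'

  # MARC-XML: tell ruby-marc to use libxml instead of the (slow) REXML default
  reader = MARC::XMLReader.new('records.xml', :parser => 'libxml')
  reader.each do |rec|
    title = rec['245'] && rec['245']['a']  # pull out a title
  end

  # marc-in-json: one JSON object per line, each parsed with yajl
  File.open('records.ndj') do |f|
    f.each_line do |line|
      hash  = Yajl::Parser.parse(line)          # JSON line -> ruby hash
      rec   = MARC::Record.new_from_hash(hash)  # hash -> MARC::Record (assumed API)
      title = rec['245'] && rec['245']['a']
    end
  end

Wrap each of those loops in 5.times { ... } inside a Benchmark.measure block (require 'benchmark') and you've basically reproduced the numbers above.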