The Blacklight code isn't currently using XML or XSLT; it indexes
binary MARC files directly. I don't know its exact speed, but I hear
it's pretty fast. Still, for the kind of test set I want, even waiting
half an hour is too long. I want a set where I can change my indexing
configuration and then see the results in a few minutes, if not
seconds. This has become apparent in my attempts to get the indexer
_working_: often I'm not changing the index mapping at all, just
adjusting the indexer configuration until I know it works. I expect
quick turnaround will matter just as much once I start actually
messing with the mapping to try out new ideas.
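
For what it's worth, the loop I have in mind is roughly the sketch
below -- a back-of-the-envelope harness, assuming pymarc and a stock
Solr at localhost:8983 whose schema has 'id' and 'title_t' fields (the
filename, URL, and field names are all placeholders):

    import time
    import urllib2
    from xml.sax.saxutils import escape
    from pymarc import MARCReader

    SOLR_UPDATE = 'http://localhost:8983/solr/update'   # assumed Solr URL

    def map_record(rec):
        # the "indexer configuration" under test, boiled down to one
        # mapping function; swap in whatever fields you're experimenting with
        f001 = rec['001']
        return {'id': f001.value() if f001 else '',
                'title_t': rec.title() or ''}

    def post(body):
        req = urllib2.Request(SOLR_UPDATE, body,
                              {'Content-Type': 'text/xml; charset=utf-8'})
        urllib2.urlopen(req).read()

    start = time.time()
    docs = []
    for rec in MARCReader(open('test-set.mrc', 'rb')):
        fields = ''.join('<field name="%s">%s</field>' % (k, escape(v))
                         for k, v in map_record(rec).items())
        docs.append('<doc>%s</doc>' % fields)
    post('<add>%s</add>' % ''.join(docs))
    post('<commit/>')
    print '%d records indexed in %.1f seconds' % (len(docs), time.time() - start)

If that whole loop runs in seconds on a 10k set, trying out a mapping
idea stops feeling like a batch job.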
Jonathan
Casey Durfee wrote:
> I strongly agree that we need something like this. The LoC records that
> Casey donated are a great resource, but far from ideal for this purpose.
> They're pretty homogeneous. I do think it needs to be bigger than 10,000
> though. 100,000 would be a better target. And I would like to see a
> UNIMARC/DANMARC-based one as well as a MARC21 based one (can one's parser
> handle DANMARC's "ø" subfield?).
>
> I don't know about Blacklight or VuFind, but using our MarcThing package +
> Solr we can index up to 1000 records a second. I know using XSLT severely
> limits how fast you can index (I'll refrain from giving another rant about
> how wrong it is to use XSL to handle MARC -- the Society for Prevention of
> Cruelty to Dead Horses has my number as it is.) But I'd still expect you
> can do a good 50-100 records a second. At that rate, 100,000 records
> is maybe 17 to 35 minutes of machine time. You could run it over your lunch break.
> Seems reasonable to me.
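>
> Back of the envelope, for anyone who wants to check the arithmetic
> (plain Python, nothing assumed beyond the rates above):
>
>     # minutes to index 100,000 records at various sustained rates
>     for rate in (50, 100, 1000):
>         print '%4d rec/sec: %5.1f minutes' % (rate, 100000.0 / rate / 60)
>     # 50 -> 33.3, 100 -> 16.7, 1000 -> 1.7, plus commit and I/O overhead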
>
> In addition to a wide variety of languages, encodings, formats and so forth,
> it would definitely need to have records explicitly designed to break
> things. Blank MARC tags, extraneous subfield markers, non-printing control
> characters, incorrect-length fixed fields, etc. The kind of stuff that
> should never happen in theory, but happens frequently in real life (I'm
> looking at you, Horizon's MARC export utility).
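>
> Something like this throwaway sketch is all I mean -- raw ISO 2709
> bytes on purpose, since a well-behaved MARC library will refuse to
> write most of this garbage (filenames and offsets are arbitrary):
>
>     # take one good record and emit deliberately damaged variants
>     data = open('good-record.mrc', 'rb').read()
>     rec = data.split('\x1d')[0] + '\x1d'   # first record, terminator restored
>
>     variants = [
>         '99999' + rec[5:],                # wrong record length in leader 00-04
>         rec[:100] + '\x1f' + rec[100:],   # stray subfield delimiter mid-record
>         rec[:120] + '\x07' + rec[120:],   # non-printing control character
>     ]
>     # the inserted bytes also throw the leader length and directory
>     # offsets out of true, which is exactly the point
>     out = open('broken-records.mrc', 'wb')
>     for v in variants:
>         out.write(v)
>     out.close()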
>
> The legal aspect of this is the difficult part. We (LibraryThing) could
> easily grab 200 random records from 500 different Z39.50 sources worldwide.
> Technically, it could be done in a couple of hours. Legally, I don't think
> it could ever be done, sadly.
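>
> (The technical half really would be something like the following per
> server, using PyZ3950 -- host, port, database, and query are
> placeholders, and I believe .data holds the raw record bytes in its
> ZOOM binding. The legal half has no such sketch.)
>
>     from PyZ3950 import zoom
>
>     conn = zoom.Connection('z3950.example.edu', 210)   # placeholder target
>     conn.databaseName = 'DEFAULT'                      # placeholder database
>     conn.preferredRecordSyntax = 'USMARC'
>     res = conn.search(zoom.Query('CCL', 'ti="dinosaur"'))
>     out = open('sample.mrc', 'ab')
>     for i in range(min(200, len(res))):
>         out.write(res[i].data)    # raw MARC record
>     out.close()
>     conn.close()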
>
> --Casey
>
>
> On Fri, May 9, 2008 at 9:33 AM, Bess Sadler wrote:
>
>
>> Those of us involved in the Blacklight and VuFind projects have been
>> spending a lot of time recently thinking about MARC record indexing.
>> We're about to start running some performance tests, we want to
>> create unit tests for our MARC-to-Solr indexer, and people who want
>> to download and play with the software need easy access to a small
>> but representative set of MARC records.
>>
>> Based on combined brainstorming between Jonathan Rochkind and me,
>> the ideal record set should:
>>
>> 1. contain about 10k records, enough to really see the features, but
>> small enough that you could index it in a few minutes on a typical
>> desktop
>> 2. contain a distribution of kinds of records, e.g., books, CDs,
>> musical scores, DVDs, special collection items, etc.
>> 3. contain a distribution of languages, so we can test Unicode handling
>> 4. contain holdings information in addition to bib records
>> 5. contain a distribution of typical errors one might encounter with
>> MARC records in the wild
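>>
>> As a sketch of how 1-3 might be built mechanically: stratified
>> sampling over a bigger dump with pymarc, bucketing on leader/06
>> (record type) and the 008 language code. The cap and filenames are
>> made up:
>>
>>     from pymarc import MARCReader, MARCWriter
>>
>>     CAP = 200        # max records per (type, language) bucket -- a guess
>>     TARGET = 10000
>>     buckets = {}
>>     total = 0
>>     writer = MARCWriter(open('test-set.mrc', 'wb'))
>>     for rec in MARCReader(open('big-dump.mrc', 'rb')):
>>         rtype = rec.leader[6]            # a=text, c=score, g=projected, ...
>>         f008 = rec['008']
>>         lang = f008.data[35:38] if f008 else '???'   # 008/35-37 language
>>         key = (rtype, lang)
>>         if buckets.get(key, 0) >= CAP:
>>             continue
>>         buckets[key] = buckets.get(key, 0) + 1
>>         writer.write(rec)
>>         total += 1
>>         if total >= TARGET:
>>             break
>>     writer.close()
>>
>> Holdings (4) and the deliberate errors (5) would have to come from
>> somewhere else, of course.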
>>
>> It seems to me that the set that Casey donated to Open Library
>> (http://www.archive.org/details/marc_records_scriblio_net) would be a
>> good place from which to draw records, because although IANAL, this
>> seems to sidestep any legal hurdles. I'd also love to see the ability
>> for the community to contribute test cases. Assuming such a set
>> doesn't exist already (see my question below) this seems like the
>> ideal sort of project for code4lib to host, too.
>>
>> Since code4lib is my lazyweb, I'm asking you:
>>
>> 1. Does something like this exist already and I just don't know about
>> it?
>> 2. If not, do you have suggestions on how to go about making such a
>> data set? I have some ideas on how to do it bit by bit, and we have a
>> certain small set of records that we're already using for testing,
>> but maybe there's a better method that I don't know about?
>> 3. Are there features missing from the above list that would make
>> this more useful?
>>
>> Thoughts? Comments?
>>
>> Thanks!
>> Bess
>>
>>
>> Elizabeth (Bess) Sadler
>> Research and Development Librarian
>> Digital Scholarship Services
>> Box 400129
>> Alderman Library
>> University of Virginia
>> Charlottesville, VA 22904
>>
>> (434) 243-2305
>>
>>
>
>
--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu