The Blacklight code is not currently using XML or XSLT; it's indexing
binary MARC files. I don't know its speed, but I hear it's pretty fast.
For the kind of test set I want, though, even waiting half an hour is too
long. I want a set where I can make a change to my indexing configuration
and then see the results in a few minutes, if not seconds. This has become
apparent in my attempts to get the indexer _working_, where I might not
actually be changing the index mapping at all; I'm just changing the
indexer configuration until I know it's working. But I believe it will be
just as important once I get to actually messing with the index mapping to
try out new ideas.
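
To make that loop concrete, here's a rough sketch -- in Python with pymarc
and requests rather than the actual Blacklight indexer, posting to a
hypothetical local Solr core named "test", and assuming a Solr new enough
to accept JSON updates. The names here are placeholders, not our real
setup:

    import time
    import requests
    from pymarc import MARCReader

    def to_solr_doc(record):
        # Stand-in for whatever field mapping is under test.
        f001 = record['001']
        f245 = record['245']
        return {
            'id': f001.data if f001 else None,
            'title': f245['a'] if f245 else None,
        }

    start = time.time()
    with open('sample.mrc', 'rb') as marc:
        # MARCReader yields None for records it can't parse; skip those.
        docs = [to_solr_doc(r) for r in MARCReader(marc) if r]
    # Placeholder core name and URL; not the actual Blacklight setup.
    requests.post('http://localhost:8983/solr/test/update?commit=true',
                  json=docs, timeout=60)
    print('%d records in %.1f seconds' % (len(docs), time.time() - start))

Even at a conservative 50 records a second, a 10k set turns around in a
little over three minutes. That's the cycle I'm after.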
Casey Durfee wrote:

> I strongly agree that we need something like this. The LoC records that
> Casey donated are a great resource, but far from ideal for this purpose.
> They're pretty homogeneous. I do think it needs to be bigger than 10,000,
> though; 100,000 would be a better target. And I would like to see a
> UNIMARC/DANMARC-based one as well as a MARC21-based one (can one's parser
> handle DANMARC's "ø" subfield?).
>
> I don't know about Blacklight or VuFind, but using our MarcThing package
> + Solr we can index up to 1,000 records a second. I know using XSLT
> severely limits how fast you can index (I'll refrain from giving another
> rant about how wrong it is to use XSL to handle MARC -- the Society for
> the Prevention of Cruelty to Dead Horses has my number as it is), but I'd
> still expect you can do a good 50-100 records a second. That's only a
> half hour to an hour of work to index 100,000 records. You could run it
> over your lunch break. Seems reasonable to me.
>
> In addition to a wide variety of languages, encodings, formats, and so
> forth, it would definitely need to have records explicitly designed to
> break things: blank MARC tags, extraneous subfield markers, non-printing
> control characters, incorrect-length fixed fields, etc. The kind of stuff
> that should never happen in theory but happens frequently in real life
> (I'm looking at you, Horizon's MARC export utility).
>
> The legal aspect of this is the difficult part. We (LibraryThing) could
> easily grab 200 random records from 500 different Z39.50 sources
> worldwide. Technically, it could be done in a couple of hours. Legally, I
> don't think it could ever be done, sadly.
>
> --Casey
>
>
> On Fri, May 9, 2008 at 9:33 AM, Bess Sadler <[log in to unmask]> wrote:
>
>> Those of us involved in the Blacklight and VuFind projects are
>> spending lots of time recently thinking about MARC record indexing.
>> We're about to start running some performance tests, we want to
>> create unit tests for our MARC-to-Solr indexer, and people wanting
>> to download and play with the software need easy access to a small
>> but representative set of MARC records they can play with.
>>
>> According to the combined brainstorming of Jonathan Rochkind and
>> myself, the ideal record set should:
>>
>> 1. contain about 10k records -- enough to really see the features,
>> but small enough that you could index it in a few minutes on a
>> typical desktop
>> 2. contain a distribution of kinds of records, e.g., books, CDs,
>> musical scores, DVDs, special collection items, etc.
>> 3. contain a distribution of languages, so we can test Unicode
>> handling
>> 4. contain holdings information in addition to bib records
>> 5. contain a distribution of typical errors one might encounter
>> with MARC records in the wild
>>
>> It seems to me that the set Casey donated to Open Library
>> (http://www.archive.org/details/marc_records_scriblio_net) would be
>> a good place from which to draw records, because, although IANAL,
>> it seems to sidestep any legal hurdles. I'd also love to see the
>> ability for the community to contribute test cases. Assuming such a
>> set doesn't exist already (see my question below), this seems like
>> the ideal sort of project for code4lib to host, too.
>>
>> Since code4lib is my lazyweb, I'm asking you:
>>
>> 1. Does something like this exist already and I just don't know
>> about it?
>> 2. If not, do you have suggestions on how to go about making such a
>> data set? I have some ideas on how to do it bit by bit, and we have
>> a certain small set of records that we're already using for
>> testing, but maybe there's a better method that I don't know about?
>> 3. Are there features missing from the above list that would make
>> this more useful?
>>
>> Thoughts? Comments?
>>
>> Thanks!
>> Bess
>>
>>
>> Elizabeth (Bess) Sadler
>> Research and Development Librarian
>> Digital Scholarship Services
>> Box 400129
>> Alderman Library
>> University of Virginia
>> Charlottesville, VA 22904
>>
>> [log in to unmask]
>> (434) 243-2305

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu
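
P.S. Casey's records "explicitly designed to break things" don't have to
wait on the legal question, since we can fabricate them. A minimal
sketch, assuming pymarc's classic flat-list subfield API (newer pymarc
wants Subfield objects instead); the corruption helpers are purely
illustrative, not a real package:

    from pymarc import Record, Field

    def good_record():
        # A tiny, well-formed record to use as raw material.
        rec = Record()
        rec.add_field(Field(tag='245', indicators=['0', '0'],
                            subfields=['a', 'A perfectly ordinary title']))
        return rec

    def corruptions(raw):
        # Yield copies of a serialized record, each damaged differently.
        yield raw.replace(b'245', b'   ', 1)         # blank tag in the directory
        yield raw.replace(b'\x1fa', b'\x1f\x1f', 1)  # extraneous subfield marker
        yield raw.replace(b'ordinary', b'ordi\x00nary', 1)  # control character
                                                     # (also throws lengths off)
        yield b'99999' + raw[5:]                     # wrong record length in leader

    raw = good_record().as_marc()
    with open('broken.mrc', 'wb') as out:
        for bad in corruptions(raw):
            out.write(bad)

Run any parser over the resulting file and you exercise exactly the
failure modes Casey lists.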