I strongly agree that we need something like this. The LoC records that
Casey donated are a great resource but far from ideal for this purpose.
They're pretty homogeneous. I do think it needs to be bigger than 10,000
records, though; 100,000 would be a better target. And I would like to see
a UNIMARC/DANMARC-based one as well as a MARC21-based one (can your parser
handle DANMARC's "ø" subfield?).
I don't know about Blacklight or VuFind, but using our MarcThing package +
Solr we can index up to 1,000 records a second. I know using XSLT severely
limits how fast you can index (I'll refrain from giving another rant about
how wrong it is to use XSL to handle MARC -- the Society for the Prevention
of Cruelty to Dead Horses has my number as it is), but I'd still expect you
can do a good 50-100 records a second. That's only about twenty minutes to
half an hour to index 100,000 records. You could run it over your lunch
break. Seems reasonable to me.
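For what it's worth, batching is most of the trick. A rough sketch of the
kind of loop I mean -- pymarc + pysolr here stand in for whatever stack you
use (this is not what MarcThing actually does), and the Solr URL, field
names, and batch size are all made up:

# Sketch only: batch MARC records into Solr instead of one HTTP call each.
import pymarc
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/marc")  # hypothetical core

def to_doc(record):
    """Flatten a pymarc Record into a trivial Solr document (illustrative)."""
    f001s = record.get_fields("001")
    titles = [sf for f in record.get_fields("245")
              for sf in f.get_subfields("a")]
    return {"id": f001s[0].value() if f001s else None, "title_t": titles}

BATCH_SIZE = 1000
batch = []
with open("records.mrc", "rb") as fh:
    for record in pymarc.MARCReader(fh):
        if record is None:        # newer pymarc yields None for bad records
            continue
        batch.append(to_doc(record))
        if len(batch) >= BATCH_SIZE:
            solr.add(batch)       # one HTTP round trip per 1,000 docs
            batch = []
    if batch:
        solr.add(batch)
solr.commit()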
In addition to a wide variety of languages, encodings, formats, and so
forth, it would definitely need to have records explicitly designed to
break things: blank MARC tags, extraneous subfield markers, non-printing
control characters, incorrect-length fixed fields, and so on -- the kind of
stuff that should never happen in theory but happens frequently in real
life (I'm looking at you, Horizon's MARC export utility).
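To make that concrete, here's what hand-assembling one pathological record
might look like -- every value below is invented, and the leader
deliberately lies about the record length:

# Sketch: hand-assemble a deliberately malformed ISO 2709 record to
# exercise a parser's error handling. All values are invented.
FT, SD, RT = b"\x1e", b"\x1f", b"\x1d"  # field, subfield, record terminators

fields = [
    (b"   ", b"blank tag" + FT),                         # blank MARC tag
    (b"245", b"00" + SD + b"aTitle" + SD + FT),          # stray subfield marker
    (b"500", b"00" + SD + b"aNote\x07with a BEL" + FT),  # control character
]

directory, body = b"", b""
for tag, data in fields:
    directory += tag + b"%04d%05d" % (len(data), len(body))
    body += data
directory += FT
body += RT

base = 24 + len(directory)
length = 24 + len(directory) + len(body) - 10  # lie by ten bytes on purpose
leader = b"%05d" % length + b"nam a22" + b"%05d" % base + b"   4500"
assert len(leader) == 24

with open("broken.mrc", "wb") as out:
    out.write(leader + directory + body)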
The legal aspect of this is the difficult part. We (LibraryThing) could
easily grab 200 random records from 500 different Z39.50 sources worldwide.
Technically, it could be done in a couple of hours. Legally, I don't think
it could ever be done, sadly.
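(The technical half really is a short script -- something like this sketch
using PyZ3950's ZOOM API. The host, database, and query are just the usual
public Library of Congress example, not a real target list, and a real run
would loop over many targets and randomize the picks:)

# Sketch of "grab N records from a Z39.50 target" with PyZ3950's ZOOM API.
from PyZ3950 import zoom

conn = zoom.Connection("z3950.loc.gov", 7090)
conn.databaseName = "VOYAGER"
conn.preferredRecordSyntax = "USMARC"

results = conn.search(zoom.Query("PQF", "@attr 1=4 dinosaur"))  # title search
with open("sample.mrc", "wb") as out:
    for i in range(min(200, len(results))):  # first 200 stand in for "random"
        out.write(results[i].data)           # raw ISO 2709 bytes
conn.close()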
--Casey
On Fri, May 9, 2008 at 9:33 AM, Bess Sadler wrote:
> Those of us involved in the Blacklight and VuFind projects are
> spending lots of time recently thinking about marc records indexing.
> We're about to start running some performance tests, and we want to
> create unit tests for our marc to solr indexer, and also people
> wanting to download and play with the software need to have easy
> access to a small but representative set of marc records that they
> can play with.
>
> According to the combined brainstorming of Jonathan Rochkind and
> myself, the ideal record set should:
>
> 1. contain about 10k records, enough to really see the features, but
> small enough that you could index it in a few minutes on a typical
> desktop
> 2. contain a distribution of kinds of records, e.g., books, CDs,
> musical scores, DVDs, special collection items, etc.
> 3. contain a distribution of languages, so we can test unicode handling
> 4. contain holdings information in addition to bib records
> 5. contain a distribution of typical errors one might encounter with
> marc records in the wild
>
> It seems to me that the set that Casey donated to Open Library
> (http://www.archive.org/details/marc_records_scriblio_net) would be a
> good place from which to draw records, because although IANAL, this
> seems to sidestep any legal hurdles. I'd also love to see the ability
> for the community to contribute test cases. Assuming such a set
> doesn't exist already (see my question below) this seems like the
> ideal sort of project for code4lib to host, too.
>
> Since code4lib is my lazyweb, I'm asking you:
>
> 1. Does something like this exist already and I just don't know about
> it?
> 2. If not, do you have suggestions on how to go about making such a
> data set? I have some ideas on how to do it bit by bit, and we have a
> certain small set of records that we're already using for testing,
> but maybe there's a better method that I don't know about?
> 3. Are there features missing from the above list that would make
> this more useful?
>
> Thoughts? Comments?
>
> Thanks!
> Bess
>
>
> Elizabeth (Bess) Sadler
> Research and Development Librarian
> Digital Scholarship Services
> Box 400129
> Alderman Library
> University of Virginia
> Charlottesville, VA 22904
>
> (434) 243-2305
>