Those of us involved in the Blacklight and VuFind projects have been
spending a lot of time recently thinking about indexing MARC records.
We're about to start running some performance tests, and we want to
create unit tests for our MARC-to-Solr indexer. People who want to
download and play with the software also need easy access to a small
but representative set of MARC records.
According to the combined brainstorming of Jonathan Rochkind and
myself, the ideal record set should:
1. contain about 10k records, enough to really see the features, but
small enough that you could index it in a few minutes on a typical
desktop
2. contain a distribution of kinds of records, e.g., books, CDs,
musical scores, DVDs, special collection items, etc.
3. contain a distribution of languages, so we can test Unicode handling
4. contain holdings information in addition to bib records
5. contain a distribution of typical errors one might encounter with
MARC records in the wild
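To make points 1 and 2 a bit more concrete, here is a rough sketch of
one way to pull a small, mixed sample out of a large file of binary
MARC (ISO 2709) records. It is only an illustration, not something we
are running today: the function names are made up for this example, it
uses only the Python standard library, and it assumes every record has
a valid length in leader bytes 0-4, so it will not cope with the
malformed records from point 5. It groups records by the "type of
record" code in leader byte 6 and keeps a random sample of each type.

```python
import random
from collections import defaultdict


def iter_marc_records(data):
    """Yield raw records from a bytes object of concatenated MARC records.

    Assumes well-formed ISO 2709 input: leader bytes 0-4 hold the total
    record length as five ASCII digits.
    """
    pos = 0
    while pos + 24 <= len(data):
        length = int(data[pos:pos + 5].decode("ascii"))
        if length < 24:
            # A leader alone is 24 bytes; anything shorter is malformed.
            raise ValueError("bad record length at offset %d" % pos)
        yield data[pos:pos + length]
        pos += length


def sample_by_type(records, per_type, seed=0):
    """Keep up to per_type randomly chosen records for each record type.

    The record type is the single character in leader byte 6 (for
    example 'a' for language material, 'j' for musical sound recording).
    Uses reservoir sampling so the input only needs to be read once.
    """
    rng = random.Random(seed)
    seen = defaultdict(int)      # how many records of each type so far
    buckets = defaultdict(list)  # the sample kept for each type
    for rec in records:
        rtype = chr(rec[6])
        seen[rtype] += 1
        bucket = buckets[rtype]
        if len(bucket) < per_type:
            bucket.append(rec)
        else:
            # Replace an existing pick with probability per_type / seen.
            j = rng.randrange(seen[rtype])
            if j < per_type:
                bucket[j] = rec
    return dict(buckets)
```

With a hypothetical file called records.mrc, you would read it with
open("records.mrc", "rb").read(), pass the bytes to iter_marc_records,
and hand the result to sample_by_type with whatever per-type count gets
you to roughly 10k records overall. Sampling by language or by the
presence of holdings fields would need a real MARC parser rather than
this leader-only approach.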
It seems to me that the set that Casey donated to Open Library
(http://www.archive.org/details/marc_records_scriblio_net) would be a
good place from which to draw records, because although IANAL, this
seems to sidestep any legal hurdles. I'd also love to see the ability
for the community to contribute test cases. Assuming such a set
doesn't exist already (see my question below) this seems like the
ideal sort of project for code4lib to host, too.
Since code4lib is my lazyweb, I'm asking you:
1. Does something like this exist already and I just don't know about
it?
2. If not, do you have suggestions on how to go about making such a
data set? I have some ideas on how to do it bit by bit, and we have a
certain small set of records that we're already using for testing,
but maybe there's a better method that I don't know about?
3. Are there features missing from the above list that would make
this more useful?
Thoughts? Comments?
Thanks!
Bess
Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904
(434) 243-2305