I agree with Kyle that a big, wide set of records is better for testing purposes. In processing records for Evergreen imports, I've found that there are often just a handful that throw marc4j for a loop. I suppose I should cull those and attach them to bug reports... instead I've taken the path of least resistance and just used yaz-marcdump. (bad Dan!)
There are, of course, _lots_ of MARC records available for download from
http://www.archive.org/search.php?query=collection%3A%22ol_data%22%20AND%20%28MARC%20records%29
- not just the LoC set. So one could presumably assemble a nice big set of records starting here.
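For what it's worth, culling the troublemakers out of a big download could start with something like the rough Python sketch below. It is not what marc4j or yaz-marcdump do internally. It only assumes the standard MARC 21 transmission layout (24-byte leader, 12-byte directory entries, 0x1E field terminator, 0x1D record terminator), and the function names are made up for illustration. It will miss encoding problems and will mis-split any record that has a stray 0x1D inside its data, so treat it as a first-pass filter only.

```python
RECORD_TERMINATOR = b"\x1d"
FIELD_TERMINATOR = b"\x1e"
LEADER_LENGTH = 24
DIRECTORY_ENTRY_LENGTH = 12


def split_records(data):
    """Split a raw MARC 21 byte stream into records, keeping terminators."""
    records = []
    start = 0
    while start < len(data):
        end = data.find(RECORD_TERMINATOR, start)
        if end == -1:
            tail = data[start:]
            if tail.strip():
                # Non-empty tail with no terminator: keep it so it gets flagged.
                records.append(tail)
            break
        records.append(data[start:end + 1])
        start = end + 1
    return records


def structural_problems(record):
    """Return a list of structural complaints about one raw record."""
    if len(record) < LEADER_LENGTH:
        return ["shorter than the 24-byte leader"]
    problems = []
    declared_length = record[0:5]
    if not declared_length.isdigit():
        problems.append("record length in leader is not numeric")
    elif int(declared_length) != len(record):
        problems.append("leader length does not match actual length")
    base_address = record[12:17]
    if not base_address.isdigit():
        problems.append("base address in leader is not numeric")
    else:
        base = int(base_address)
        if base <= LEADER_LENGTH or base > len(record):
            problems.append("base address out of range")
        elif record[base - 1:base] != FIELD_TERMINATOR:
            problems.append("directory is not terminated at the base address")
        elif (base - 1 - LEADER_LENGTH) % DIRECTORY_ENTRY_LENGTH != 0:
            problems.append("directory length is not a multiple of 12")
    if not record.endswith(RECORD_TERMINATOR):
        problems.append("missing record terminator")
    return problems


def cull(data):
    """Separate records into (clean, suspect); suspect items carry reasons."""
    clean, suspect = [], []
    for record in split_records(data):
        problems = structural_problems(record)
        if problems:
            suspect.append((record, problems))
        else:
            clean.append(record)
    return clean, suspect
```

Reading a file in binary mode and passing its contents to cull() gives back the clean records plus the suspect ones with their reasons. The suspect records could then be written out to a separate file and attached to a bug report.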
Dan
>>> On Fri, May 9, 2008 at 12:33 PM, Bess Sadler wrote:
> Those of us involved in the Blacklight and VuFind projects have been
> spending a lot of time recently thinking about MARC record indexing.
> We're about to start running some performance tests, and we want to
> create unit tests for our MARC-to-Solr indexer. People who want to
> download and play with the software also need easy access to a small
> but representative set of MARC records.
>
> According to the combined brainstorming of Jonathan Rochkind and
> myself, the ideal record set should:
>
> 1. contain about 10k records, enough to really see the features, but
> small enough that you could index it in a few minutes on a typical
> desktop
> 2. contain a distribution of kinds of records, e.g., books, CDs,
> musical scores, DVDs, special collection items, etc.
> 3. contain a distribution of languages, so we can test Unicode handling
> 4. contain holdings information in addition to bib records
> 5. contain a distribution of typical errors one might encounter with
> MARC records in the wild
>
> It seems to me that the set that Casey donated to Open Library
> (http://www.archive.org/details/marc_records_scriblio_net) would be a
> good place from which to draw records, because although IANAL, this
> seems to sidestep any legal hurdles. I'd also love to see the ability
> for the community to contribute test cases. Assuming such a set
> doesn't exist already (see my question below) this seems like the
> ideal sort of project for code4lib to host, too.
>
> Since code4lib is my lazyweb, I'm asking you:
>
> 1. Does something like this exist already and I just don't know about
> it?
> 2. If not, do you have suggestions on how to go about making such a
> data set? I have some ideas on how to do it bit by bit, and we have a
> certain small set of records that we're already using for testing,
> but maybe there's a better method that I don't know about?
> 3. Are there features missing from the above list that would make
> this more useful?
>
> Thoughts? Comments?
>
> Thanks!
> Bess
>
>
> Elizabeth (Bess) Sadler
> Research and Development Librarian
> Digital Scholarship Services
> Box 400129
> Alderman Library
> University of Virginia
> Charlottesville, VA 22904
>
> (434) 243-2305