I agree with Kyle that a big, wide set of records is better for
testing purposes. In processing records for Evergreen imports, I've
found that there are often just a handful that throw marc4j for a
loop. I suppose I should cull those and attach them to bug reports...
instead I've taken the path of least resistance and just used
yaz-marcdump. (bad Dan!)

There are, of course, _lots_ of MARC records available for download
from http://www.archive.org/search.php?query=collection%3A%22ol_data%22%20AND%20%28MARC%20records%29
- not just the LoC set. So one could presumably assemble a nice big
set of records starting here.

Dan

On Fri, May 9, 2008 at 12:33 PM, Bess Sadler <[log in to unmask]> wrote:
> Those of us involved in the Blacklight and VuFind projects are
> spending lots of time recently thinking about marc records indexing.
> We're about to start running some performance tests, and we want to
> create unit tests for our marc to solr indexer, and also people
> wanting to download and play with the software need to have easy
> access to a small but representative set of marc records that they
> can play with.
>
> According to the combined brainstorming of Jonathan Rochkind and
> myself, the ideal record set should:
>
> 1. contain about 10k records, enough to really see the features, but
> small enough that you could index it in a few minutes on a typical
> desktop
> 2. contain a distribution of kinds of records, e.g., books, CDs,
> musical scores, DVDs, special collection items, etc.
> 3. contain a distribution of languages, so we can test unicode handling
> 4. contain holdings information in addition to bib records
> 5. contain a distribution of typical errors one might encounter with
> marc records in the wild
>
> It seems to me that the set that Casey donated to Open Library
> (http://www.archive.org/details/marc_records_scriblio_net) would be a
> good place from which to draw records, because although IANAL, this
> seems to sidestep any legal hurdles. I'd also love to see the ability
> for the community to contribute test cases. Assuming such a set
> doesn't exist already (see my question below) this seems like the
> ideal sort of project for code4lib to host, too.
>
> Since code4lib is my lazyweb, I'm asking you:
>
> 1. Does something like this exist already and I just don't know about
> it?
> 2. If not, do you have suggestions on how to go about making such a
> data set? I have some ideas on how to do it bit by bit, and we have a
> certain small set of records that we're already using for testing,
> but maybe there's a better method that I don't know about?
> 3. Are there features missing from the above list that would make
> this more useful?
>
> Thoughts? Comments?
>
> Thanks!
> Bess
>
>
> Elizabeth (Bess) Sadler
> Research and Development Librarian
> Digital Scholarship Services
> Box 400129
> Alderman Library
> University of Virginia
> Charlottesville, VA 22904
>
> [log in to unmask]
> (434) 243-2305
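
P.S. For what it's worth, here is a rough sketch of how the "distribution of kinds of records" criterion might be approached when drawing a small set from a larger dump. It is not something marc4j, Blacklight, or VuFind provides; the function names iter_marc_records and sample_by_type are made up for illustration. It assumes binary MARC (ISO 2709) input, where every record ends with the 0x1D record terminator and leader position 06 holds the type-of-record code (e.g. "a" for language material, "c" for notated music, "j" for musical sound recordings). Splitting on the terminator rather than trusting the leader's length field is a deliberate choice, since bad lengths are one of the errors seen in the wild.

```python
import random
from collections import defaultdict

RECORD_TERMINATOR = b"\x1d"  # ISO 2709 end-of-record byte
LEADER_LENGTH = 24           # every MARC record starts with a 24-byte leader


def iter_marc_records(data):
    """Yield raw records from a bytes buffer of concatenated binary MARC.

    Chunks shorter than a leader (such as the empty chunk after the
    final terminator) are skipped.
    """
    for chunk in data.split(RECORD_TERMINATOR):
        if len(chunk) >= LEADER_LENGTH:
            yield chunk + RECORD_TERMINATOR


def sample_by_type(records, per_type, seed=0):
    """Pick up to per_type records for each leader/06 type code.

    This is a simple stratified sample (an assumption about what
    "a distribution of kinds of records" could mean), with a fixed
    seed so the resulting test set is reproducible.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[chr(rec[6])].append(rec)  # leader position 06

    sample = []
    for rec_type in sorted(groups):
        candidates = groups[rec_type]
        sample.extend(rng.sample(candidates, min(per_type, len(candidates))))
    return sample
```

To use it, one would read a downloaded file in binary mode, pass the bytes to iter_marc_records, and write the records returned by sample_by_type back out to a new file. Language spread, holdings, and deliberately broken records would each need their own selection pass.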