In my copious spare time over the past week I have created a bigger index; more fun with KinoSearch:

  http://juice.library.nd.edu/sru/sru-client.html

My short-term goal is to identify a robust indexer. My medium-term goal is to demonstrate how a new & improved library catalog could function. To accomplish these goals I have begun experimenting more with KinoSearch and content from Project Gutenberg.

First I downloaded the RDF representation of Project Gutenberg content -- all 44 MB of it. I then parsed the RDF and cached what I needed to a (MyLibrary) database. Next I looped through each Project Gutenberg record -- all 24,000 of them -- downloaded the full text of each item, and fed the whole thing to KinoSearch. After a couple of fits and starts the whole process took about five hours, but I had to scale back my experiment to include only 7,500 records because I ran out of disk space.

Here are a few observations:

1. Parsing the Gutenberg RDF was a pain because the metadata was inconsistent and overly complex. On the other hand, I enhanced my SAX skills.

2. After extracting the necessary metadata and using LWP as a user-agent to acquire the full text, it was fun creating a new set of RDF files and feeding them to KinoSearch. This process was surprisingly fast. I have always been amazed at how much data an indexer can create in such a short period of time.

3. KinoSearch requires a lot of extra disk space in order to optimize. I scaled back my experiment a few times, and the last time I squeaked by with less than a MB to spare. When optimization was complete I got back more than 10 GB of disk space. BTW, my index is 7 GB in size.

4. Searching the index through my Perl-based SRU interface proved painless. The difference in response times with and without the SRU interface seemed negligible.

5. Freetext searches take a long time to execute, longer than most people will be willing to wait. At the same time, my hardware is not very powerful; it is a hand-me-down.
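The SAX-based metadata extraction described in observation #1 was done in Perl; just to illustrate the approach, here is a minimal sketch in Python instead. The element names (rdf:Description, dc:title, dc:creator) and the sample record are assumptions about the shape of the Gutenberg RDF, not its actual schema.

```python
# Sketch only: a SAX handler that pulls title/creator strings out of a
# Dublin-Core-flavored RDF record. Element names are assumptions.
import xml.sax


class GutenbergHandler(xml.sax.ContentHandler):
    """Collect one dict of metadata per rdf:Description element."""

    def __init__(self):
        super().__init__()
        self.records = []
        self.current = {}
        self.field = None
        self.buffer = []

    def startElement(self, name, attrs):
        if name == "rdf:Description":
            self.current = {}
        elif name in ("dc:title", "dc:creator"):
            self.field = name.split(":")[1]
            self.buffer = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate.
        if self.field:
            self.buffer.append(content)

    def endElement(self, name):
        if name in ("dc:title", "dc:creator"):
            self.current[self.field] = "".join(self.buffer).strip()
            self.field = None
        elif name == "rdf:Description":
            self.records.append(self.current)


RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description>
    <dc:title>Walden</dc:title>
    <dc:creator>Thoreau, Henry David</dc:creator>
  </rdf:Description>
</rdf:RDF>"""

handler = GutenbergHandler()
xml.sax.parseString(RDF.encode("utf-8"), handler)
print(handler.records)
```

In a real run, each extracted record would then drive the full-text download and be handed to the indexer.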
6. Indexed fields (title, creator, subject, etc.), whose content is much smaller than the free-text field, respond *much* more quickly. Nice.

7. The creation of excerpts is nice and does not seem to hinder performance too much. Unfortunately, my browser does not render any HTML included in the SRU response, so I am limited to things like * and >< combinations in my output. This will be resolved in an implementation of a less XML-centric interface.

8. Precision and recall are not the greatest because there is too much noise in the full text. For example, searches for "north carolina" return too much irrelevant stuff. Also, Project Gutenberg tricked me: while it gives me a link to a text file, the link is often redirected to a splash page. Grrr...

My next steps involve:

  * adding other types of content to the index, such as MARC records, Wikipedia articles, the full text of open access journals, and things harvestable via OAI

  * creating a smarter client application that will allow people to limit search results by format, suggest alternative queries through the use of dictionaries and thesauri, and provide other enhanced services besides find, identify, and acquire

  * locating a bigger piece of hardware where I can save and index more content

Wish me luck.

P.S. My long-term goal is to facilitate whirled peas.

--
Eric Lease Morgan
University Libraries of Notre Dame

I'm hiring a Senior Programmer Analyst. See http://dewey.library.nd.edu/morgan/programmer/.
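P.P.S. For the curious, the SRU interface behind the demo client is Perl, but the request shape is just the standard SRU 1.1 searchRetrieve URL. Here is a minimal Python sketch of building one; the base URL is my demo server, and the CQL index name (dc.title) is only an example -- check the server's explain record for the real index names.

```python
# Sketch only: construct an SRU 1.1 searchRetrieve URL from a CQL query.
from urllib.parse import urlencode


def sru_url(base, query, maximum_records=10):
    """Return a searchRetrieve URL for the given CQL query string."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": query,
        "maximumRecords": maximum_records,
    }
    return base + "?" + urlencode(params)


url = sru_url("http://juice.library.nd.edu/sru/", 'dc.title = "walden"')
print(url)
```

Fetching that URL returns an XML response whose records a client can then render (or, as noted in observation #7, fail to render) as it sees fit.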