In my copious spare time over the past week I have created a bigger
index; more fun with KinoSearch:
http://juice.library.nd.edu/sru/sru-client.html
My short-term goal is to identify a robust indexer. My medium-term
goal is to demonstrate how a new & improved library catalog could
function. To accomplish these goals I have begun experimenting more
with KinoSearch and content from Project Gutenberg. First I
downloaded the RDF representation of Project Gutenberg content -- all
44 MB of it. I then parsed the RDF and cached what I needed to a
(MyLibrary) database. Next I looped through each Project Gutenberg
record -- all 24,000 of them -- downloaded the full text of each
item, and fed the whole thing to KinoSearch. After a couple of fits
and starts the whole process took about five hours, but I had to
scale back my experiment to include only 7,500 records because I ran
out of disk space.
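For the curious, here is the gist of that loop as a minimal sketch.
It assumes the KinoSearch 0.x API (InvIndexer and friends) plus
made-up field names and paths; it illustrates the technique and is
not the code I actually ran.

  use strict;
  use warnings;
  use LWP::UserAgent;
  use KinoSearch::InvIndexer;
  use KinoSearch::Analysis::PolyAnalyzer;

  my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
  my $indexer  = KinoSearch::InvIndexer->new(
      invindex => '/path/to/invindex',    # hypothetical location
      create   => 1,
      analyzer => $analyzer,
  );

  # declare the fields to be indexed
  $indexer->spec_field( name => $_ ) for qw( title creator subject text );

  my $ua = LWP::UserAgent->new;

  # @records would be read from the (MyLibrary) cache of parsed RDF
  my @records;
  foreach my $record (@records) {

      # fetch the full text; skip items that do not come back
      my $response = $ua->get( $record->{url} );
      next unless $response->is_success;

      # add the metadata and the full text to the index
      my $doc = $indexer->new_doc;
      $doc->set_value( title   => $record->{title} );
      $doc->set_value( creator => $record->{creator} );
      $doc->set_value( subject => $record->{subject} );
      $doc->set_value( text    => $response->content );
      $indexer->add_doc($doc);
  }

  # optimization is the step that temporarily eats the extra disk space
  $indexer->finish( optimize => 1 );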
Here are a few observations:
1. Parsing the Gutenberg RDF was a pain because the metadata was
inconsistent and overly complex. On the other hand, I enhanced my SAX
skills. (A skeleton of such a handler appears after this list.)
2. After extracting the necessary metadata and using LWP as a user
agent to acquire the full text, it was fun creating a new set of RDF
files and feeding them to KinoSearch. This process was surprisingly
fast. I've always been amazed at how much data an indexer can create
in such a short period of time.
3. KinoSearch requires a lot of extra disk space in order to
optimize. I scaled back my experiment a few times, and the last time
I squeaked by with less than a MB to spare. When optimization was
complete I got back more than 10 GB of disk space. BTW, my index is
7 GB in size.
4. Searching the index through my Perl-based SRU interface
proved to be painless. The difference in response times with and
without the SRU interface seemed negligible.
5. Freetext searches take a long time to execute, longer than most
people will be willing to wait. Then again, my hardware is not very
powerful; it is a hand-me-down.
6. Indexed fields (title, creator, subject, etc.), whose content
is much smaller than the free-text field, respond *much* more quickly
(see the search sketch after this list). Nice.
7. The creation of excerpts is nice and does not seem to hinder
performance too much. Unfortunately, my browser does not render any
HTML included in the SRU response, so I am limited to things like *
and >< combinations in my output. This will be resolved in an
implementation of a less XML-centric interface.
8. Precision and recall are not the greatest because there is too
much noise in the full text. For example, searches for "north
carolina" return too much irrelevant stuff. Also, Project Gutenberg
tricked me: while it gives me a link to a text file, the link is
often redirected to a splash page (a way to detect this is sketched
below). Grrr...
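Regarding observation #1, here is the skeleton of the sort of SAX
handler I mean. The element name is illustrative only; the real
Gutenberg RDF is messier than this.

  package GutenbergHandler;
  use base qw( XML::SAX::Base );

  # remember when the parser enters a title element
  sub start_element {
      my ( $self, $el ) = @_;
      $self->{in_title} = 1 if $el->{LocalName} eq 'title';
  }

  # accumulate character data while inside the element
  sub characters {
      my ( $self, $chars ) = @_;
      $self->{title} .= $chars->{Data} if $self->{in_title};
  }

  sub end_element {
      my ( $self, $el ) = @_;
      $self->{in_title} = 0 if $el->{LocalName} eq 'title';
  }

  package main;
  use strict;
  use warnings;
  use XML::SAX::ParserFactory;

  my $handler = GutenbergHandler->new;
  my $parser  = XML::SAX::ParserFactory->parser( Handler => $handler );
  $parser->parse_uri('catalog.rdf');    # hypothetical file name
  print $handler->{title}, "\n";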
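Regarding observations #5 through #7, the searching side looks
something like the sketch below. Again, I am assuming the KinoSearch
0.x API, and the field names match the indexing sketch above.

  use strict;
  use warnings;
  use KinoSearch::Searcher;
  use KinoSearch::Analysis::PolyAnalyzer;
  use KinoSearch::QueryParser::QueryParser;
  use KinoSearch::Highlight::Highlighter;

  my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
  my $searcher = KinoSearch::Searcher->new(
      invindex => '/path/to/invindex',    # hypothetical location
      analyzer => $analyzer,
  );

  # search the small title field instead of the big text field;
  # this is why the fielded searches come back so much faster
  my $parser = KinoSearch::QueryParser::QueryParser->new(
      analyzer => $analyzer,
      fields   => ['title'],
  );
  my $query = $parser->parse('north carolina');
  my $hits  = $searcher->search( query => $query );

  # excerpts are drawn from the full text of each hit
  my $highlighter = KinoSearch::Highlight::Highlighter->new(
      excerpt_field => 'text',
  );
  $hits->create_excerpts( highlighter => $highlighter );

  $hits->seek( 0, 10 );
  while ( my $hit = $hits->fetch_hit_hashref ) {
      print "$hit->{title}\n$hit->{excerpt}\n\n";
  }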
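Finally, regarding the splash pages in #8, one way to detect the
redirection with LWP is to compare the URI asked for against the URI
of the request that was actually answered:

  use strict;
  use warnings;
  use LWP::UserAgent;

  my $url = 'http://example.org/etext.txt';    # hypothetical link
  my $ua  = LWP::UserAgent->new;
  my $response = $ua->get($url);

  # $response->request is the request after any redirections, so a
  # changed URI (or an HTML content type where plain text was
  # expected) signals the splash page
  if (   $response->request->uri ne $url
      or $response->header('Content-Type') !~ m{^text/plain} )
  {
      warn "skipping $url: redirected or not plain text\n";
  }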
My next steps involve:
* adding other types of content to the index such as MARC records,
Wikipedia articles, the full text of open access journals, and things
harvestable via OAI
* creating a smarter client application that will allow people to
limit search results by format, suggest alternative queries through
the use of dictionaries and thesauri, and provide other enhanced
services besides find, identify, and acquire
* locating a bigger piece of hardware where I can save and index
more content
Wish me luck.
P.S. My long-term goal is to facilitate whirled peas.
--
Eric Lease Morgan
University Libraries of Notre Dame
I'm hiring a Senior Programmer Analyst.
See http://dewey.library.nd.edu/morgan/programmer/.