On Feb 23, 2006, at 9:29 PM, Eric Lease Morgan wrote:

> I have mirrored 50 GB of journal data from the open access journal
> literature, and I'm curious to know, what indexer would you use to
> index this data?
>
> --
> Eric Lease Morgan
> University Libraries of Notre Dame


Hi, Eric.

What format is the data in? (e.g. xml? plain text?) Are you willing
to do coding yourself, or do you need an out-of-box solution?

Generally speaking, lucene is a very fast indexing solution, but it
is not something you can just download and use; you have to find an
application that builds on the lucene indexer. If your data is XML
you might want to take a look at the martini project at
http://sourceforge.net/projects/martini. It was built for Olive XML, but it
works with any xml schema. With some tweaking, it will let you
describe your xml documents, put them into a lucene index, and
search / display them via a cocoon application. It's very much in
development, though, so be prepared to shape it to fit your own needs.
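
To make the "describe your documents, index them, search them" idea a
bit more concrete, here is a minimal sketch in Python. It is not Lucene
or martini code, just a toy inverted index built with the standard
library to show the general shape of what such an indexer does. The
sample documents, the element names, and the function names
(tokenize, build_index, search) are all made up for illustration, and
a real indexer would add stemming, ranking, and on-disk storage.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample records; the element names are invented for this sketch.
SAMPLE_DOCS = {
    "doc1": "<article><title>Open Access Journals</title>"
            "<body>Indexing journal literature with XML.</body></article>",
    "doc2": "<article><title>Library Search</title>"
            "<body>Searching full text in a digital library.</body></article>",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids whose XML text contains it."""
    index = defaultdict(set)
    for doc_id, xml_string in docs.items():
        root = ET.fromstring(xml_string)
        # itertext() walks every text node, so no particular schema is assumed.
        text = " ".join(root.itertext())
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return sorted(hits)


index = build_index(SAMPLE_DOCS)
print(search(index, "journal"))
print(search(index, "library"))
```

Note that this sketch matches exact terms only, so a query for
"journal" will not find a document that only says "journals". Handling
that kind of thing (along with speed at the 50 GB scale) is a large
part of what a real indexer like Lucene provides.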

I've also been experimenting with Berkeley DB's XML databases, and
those are very fast, too, and would handle the kind of size you're
talking about. One of my co-workers recently figured out how to pull
documents directly from Berkeley DB XML into a cocoon pipeline, so
that's handy, too. If this is of interest, I can send you some code
that you could use as a template.

Both of these assume the documents are in XML, and that you're
willing to do a lot of coding.

Another option might be eXist. It's fairly easy to get up and
running, and it can be more of an out-of-the-box solution (as long as
you're prepared to open a complicated box) to index, search, and
display documents, but I've heard it can be unstable for very large
data sets like the one you're talking about.

I'd love to hear about other people's recent experiences with eXist.
At my library we keep wanting to use it, but having been burned by a
previous release we're feeling a little hesitant.

Cheers,
Bess


Elizabeth (Bess) Sadler
Metadata Specialist for User Projects

Digital Research and Instructional Services (DRIS)
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904

(434) 243-2305