On Feb 23, 2006, at 9:29 PM, Eric Lease Morgan wrote:

> I have mirrored 50 GB of journal data from the open access journal
> literature, and I'm curious to know, what indexer would you use to
> index this data?
>
> --
> Eric Lease Morgan
> University Libraries of Notre Dame

Hi, Eric.

What format is the data in (e.g., XML or plain text)? Are you willing to do coding yourself, or do you need an out-of-the-box solution?

Generally speaking, Lucene is a very fast indexing solution, but it is not something you can just download and use; you have to find an application that builds on the Lucene indexer. If your data is XML, you might want to take a look at the martini project at http://sourceforge.net/projects/martini. It was built for Olive XML, but it works with any XML schema. With some tweaking, it will let you describe your XML documents, put them into a Lucene index, and search and display them via a Cocoon application. It's very much in development, though, so be prepared to shape it to fit your own needs.

I've also been experimenting with Berkeley DB's XML databases. Those are very fast, too, and would handle the kind of size you're talking about. One of my co-workers recently figured out how to pull documents directly from Berkeley DB XML into a Cocoon pipeline, so that's handy, too. If this is of interest, I can send you some code that you could use as a template.

Both of these assume the documents are in XML, and that you're willing to do a lot of coding.

Another option might be eXist. It's fairly easy to get up and running, and it can be more of an out-of-the-box solution (as long as you're prepared to open a complicated box) to index, search, and display documents, but I've heard it can be unstable for very large data sets like the one you're talking about. I'd love to hear about other people's recent experiences with eXist. At my library we keep wanting to use it, but having been burned by a previous release, we're feeling a little hesitant.

Cheers,
Bess

Elizabeth (Bess) Sadler
Metadata Specialist for User Projects
Digital Research and Instructional Services (DRIS)
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904
(434) 243-2305