Andrew, just as an additional data point, we have millions of records
indexed in our Lucene-based XTF system, and the response isn't too bad
even on a development server.

Roy

On Oct 28, 2005, at 1:31 PM, Andrew Nagy wrote:

> Wow! Thanks for such a detailed reply ... this is awesome.
>
> I am thinking about storing the data from the catalog in an XML
> database as well; however, since I know very little about these I am
> greatly concerned about the scalability ... can they handle the
> 800,000+ records we have in our catalog? If I am just using it as a
> store, and then use some sort of indexer, this shouldn't be a concern?
>
> Lucene seems enticing over Zebra, since Zebra is a Z39.50 interface,
> which from what I can understand will not let me do fancy searches
> such as what was recently cataloged in the past 7 days, etc.
> What about Xapian or XTF? Did you test these out at all? I guess
> Lucene seems like a better product because it is an Apache project?
>
> Thanks for all the info!
>
> Andrew
>
>
> Ross Singer wrote:
>
>> This is pretty similar to the project that Art Rhyno and I have been
>> working on for a couple of months now. Thankfully, I just got the
>> go-ahead to make it the top development priority, so hopefully we'll
>> actually have something to see in the near future. Like Eric, we
>> don't have any problem with (and we aren't touching) any of the
>> back-end stuff (cataloging, acq, circ), but we have major issues
>> with the public interface.
>>
>> Although the way we're extracting records from our catalog is a
>> little different (and there are reasons for it), the way I would
>> recommend getting the data out of the OPAC is not via Z39.50, but
>> through whatever sort of marcdump utility your ILS has. You can then
>> use marc4j (or something similar) to transform the MARC to XML
>> (we're going to MODS, for example). We're currently just writing
>> this dump to a filesystem (broken up by LCC... again, for reasons
>> that don't exactly apply to this project), but I anticipate it will
>> eventually go into a METS record and a Berkeley XML database for
>> storage. For indexing, we're using Lucene (Art is accessing it via
>> Cocoon, I am through PyLucene) and we're, so far, pretty happy with
>> the results.
>>
>> If Lucene has issues, we'll look at Zebra (as John mentioned),
>> although Zebra's indexes are enormous. The nice thing about Zebra,
>> though, is that it would forgo the need for the Berkeley DB, since
>> it stores the XML record. The built-in Z39.50 server is a nice
>> bonus, as well. Our backup options would be XTF
>> (http://www.cdlib.org/inside/projects/xtf/) and Xapian. Swish-e
>> isn't really an option since it can't index UTF-8.
>>
>> The idea then is to be able to make stronger relationships between
>> our site's content... eliminate the silos. A search that brings back
>> a couple of items that are in a particular subject guide would get a
>> link to the subject... or at least links to the other "top" items
>> from that guide (good tie-in with MyLibrary, Eric). Something that's
>> on reserve would have links to reserve policies or a course guide
>> for that course or whatever.
>>
>> Journals would have links to the databases they are indexed in.
>>
>> Yes, there's some infrastructure that needs to be worked out... :)
>>
>> But the goal is to have something to at least see by the end of the
>> year (calendar, not school).
>>
>> We'll see :)
>>
>> -Ross.
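For concreteness, here is a minimal sketch of the dump/transform/index
pipeline Ross describes, written against a recent PyLucene API (his
2005 setup would have looked different). Everything specific in it is
an illustrative assumption, not his actual code: pymarc instead of
marc4j for the MARC parsing, the choice of fields, and the file and
index names.

# Rough sketch only: read a batch MARC dump from the ILS, pull out a
# few fields, and index them with Lucene via PyLucene.
import lucene
from pymarc import MARCReader
from java.nio.file import Paths
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, LongPoint, TextField
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import FSDirectory

lucene.initVM()
writer = IndexWriter(FSDirectory.open(Paths.get("catalog-index")),
                     IndexWriterConfig(StandardAnalyzer()))

with open("records.mrc", "rb") as fh:   # marcdump output from the ILS
    for record in MARCReader(fh):
        doc = Document()
        # 245 $a is the title proper, 100 $a the main author entry
        if record["245"] and record["245"]["a"]:
            doc.add(TextField("title", record["245"]["a"], Field.Store.YES))
        if record["100"] and record["100"]["a"]:
            doc.add(TextField("author", record["100"]["a"], Field.Store.YES))
        # 005 is the date of latest transaction (YYYYMMDDhhmmss.f);
        # storing YYYYMMDD as a number enables date-range queries
        if record["005"]:
            doc.add(LongPoint("added", int(record["005"].data[:8])))
        writer.addDocument(doc)

writer.close()

Storing the 005 date as a number is what makes the "recently
cataloged" style of query (sketched at the end of this thread) cheap.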
>>
>> On Oct 27, 2005, at 5:58 PM, Eric Lease Morgan wrote:
>>
>>> On Oct 27, 2005, at 2:06 PM, Andrew Nagy wrote:
>>>
>>>>> http://mylibrary.ockham.org/
>>>>
>>>> I have been thinking of ways, similar to what you have done that
>>>> you mentioned below with the Ockham project, to allow more
>>>> modern-day access to our library catalog. I have begun to think
>>>> about devising a way to index/harvest our entire catalog (and
>>>> allow this indexing process to run every so often) to allow our
>>>> own custom access methods. We could then generate our own custom
>>>> RSS feeds of new books, allow more efficient/enticing search
>>>> interfaces, etc.
>>>>
>>>> Do you know of any existing software for indexing or harvesting a
>>>> catalog into another datastore (SQL database, XML database, etc.)?
>>>> I am sure I could fetch all of the records somehow through Z39.50
>>>> and dump them into a MySQL database, but maybe there is some
>>>> better method?
>>>
>>> I too have thought about harvesting content from my local catalog
>>> and providing new interfaces to the content, and I might go about
>>> this in a number of different ways.
>>>
>>> 1. I might use OAI to harvest the content, cache it locally, and
>>> provide services against the cache. This cache might be saved on a
>>> file system, but more likely in a relational database.
>>>
>>> 2. I might simply dump all the MARC records from my catalog,
>>> transform them into something more readable, say sets of HTML/XML
>>> records, and provide services against these files.
>>>
>>> The weakest link in my chain would be my indexer. Relational
>>> databases are notoriously ill-equipped to handle free-text
>>> searching. Yes, you can implement it, and you can use various
>>> database-specific features to implement free-text searching, but
>>> they still won't work as well as an indexer. My only experience
>>> with indexers lies in things like swish-e and Plucene. I sincerely
>>> wonder whether or not these indexers would be up to the task.
>>>
>>> Supposing I could find/use an indexer that was satisfactory, I
>>> would then provide simple and advanced (SRU/OpenSearch) search
>>> features against the index of holdings. Search results would then
>>> be enhanced with features such as borrow, renew, review, put on
>>> reserve, save as citation, email, "get it for me", put on hold,
>>> "what's new?", view as RSS, etc. These services would require a
>>> list of authorized users of the system -- a patron database.
>>>
>>> In short, since I would have direct access to the data, and since I
>>> would have direct access to the index, I would use my skills to
>>> provide services against them. For the most part, I don't mind the
>>> back-end, administrative, data-entry interfaces to our various
>>> systems, but I do have problems with the end-user interfaces. Let
>>> me use those back-ends to create and store my data, then give me
>>> unfettered access to the data and I will provide my own end-user
>>> interfaces. Another alternative is to exploit (industry-standard)
>>> Web Services computing techniques against the existing integrated
>>> library system. In this way you get XML data (information without
>>> presentation) back and you can begin to do the same things.
>>>
>>> --
>>> Eric Lease Morgan
>>> University Libraries of Notre Dame
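To close the loop on Andrew's "cataloged in the past 7 days" example:
once you control the index, that is just a range search. Here is a
hedged sketch, again with PyLucene, assuming the hypothetical "added"
field and "catalog-index" directory from the indexing sketch above.

# Sketch of a "what's new in the last 7 days?" query.
import datetime
import lucene
from java.nio.file import Paths
from org.apache.lucene.document import LongPoint
from org.apache.lucene.index import DirectoryReader
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.store import FSDirectory

lucene.initVM()
reader = DirectoryReader.open(FSDirectory.open(Paths.get("catalog-index")))
searcher = IndexSearcher(reader)

today = datetime.date.today()
week_ago = today - datetime.timedelta(days=7)
# Dates were stored as YYYYMMDD integers, so a numeric range works
query = LongPoint.newRangeQuery("added",
                                int(week_ago.strftime("%Y%m%d")),
                                int(today.strftime("%Y%m%d")))

for hit in searcher.search(query, 25).scoreDocs:
    doc = searcher.doc(hit.doc)
    print(doc.get("title"), "/", doc.get("author"))

reader.close()

The same query, run on a schedule and serialized to XML, is about all
the RSS feed of new books that Andrew mentions would need.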