Wow! Thanks for such a detailed reply ... this is awesome.

I am thinking about storing the data from the catalog in an XML
database as well; however, since I know very little about these, I am
greatly concerned about scalability: can they handle the 800,000+
records we have in our catalog? If I am just using the database as a
store and putting some sort of indexer on top of it, this shouldn't be
a concern?

Lucene seems enticing over Zebra, since Zebra is a Z39.50 interface
which, from what I can understand, will not let me do fancy searches
such as "what was cataloged in the past 7 days." What about Xapian or
XTF? Did you test those out at all? I guess Lucene seems like the
better product because it is an Apache project?
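To make the "past 7 days" idea concrete, here is roughly what I picture
doing with Lucene. This is only a sketch: I am assuming the Lucene
1.4-era API, a stored yyyyMMdd "cataloged" keyword field, and an index
directory name I made up, and I have not actually run it:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.RangeQuery;

    public class WhatsNew {
        public static void main(String[] args) throws Exception {
            // Build the index; the third argument (true) means
            // create from scratch rather than append.
            IndexWriter writer = new IndexWriter(
                "/tmp/catalog-index", new StandardAnalyzer(), true);

            // One document per catalog record. The cataloging date
            // is stored as an untokenized yyyyMMdd keyword so that
            // lexicographic order equals chronological order.
            Document doc = new Document();
            doc.add(Field.Text("title", "Cataloging cultural objects"));
            doc.add(Field.Keyword("cataloged", "20051025"));
            writer.addDocument(doc);
            writer.optimize();
            writer.close();

            // "Cataloged in the past 7 days" as an inclusive range
            // over the keyword field (bounds hard-coded here; you
            // would compute them from today's date).
            IndexSearcher searcher = new IndexSearcher("/tmp/catalog-index");
            Hits hits = searcher.search(new RangeQuery(
                new Term("cataloged", "20051021"),
                new Term("cataloged", "20051028"),
                true));
            for (int i = 0; i < hits.length(); i++)
                System.out.println(hits.doc(i).get("title"));
            searcher.close();
        }
    }

The same range trick ought to drive a "new books this week" RSS feed,
too, which is one of the things we want out of this.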
Thanks for all the info!

Andrew

Ross Singer wrote:

> This is pretty similar to the project that Art Rhyno and I have been
> working on for a couple of months now. Thankfully, I just got the
> go-ahead to make it the top development priority, so hopefully we'll
> actually have something to see in the near future. Like Eric, we
> don't have any problem with (and aren't touching) any of the backend
> stuff (cataloging, acq, circ), but we have major issues with the
> public interface.
>
> Although the way we're extracting records from our catalog is a
> little different (and there are reasons for it), the way I would
> recommend getting the data out of the opac is not via Z39.50, but
> through whatever sort of marcdump utility your ILS has. You can then
> use marc4j (or something similar) to transform the MARC to XML
> (we're going to MODS, for example). We're currently just writing
> this dump to a filesystem (broken up by LCC... again, for reasons
> that don't exactly apply to this project), but I anticipate it will
> eventually go into a METS record and a Berkeley XML DB for storage.
> For indexing, we're using Lucene (Art is accessing it via Cocoon; I
> am going through PyLucene), and so far we're pretty happy with the
> results.
>
> If Lucene has issues, we'll look at Zebra (as John mentioned),
> although Zebra's indexes are enormous. The nice thing about Zebra,
> though, is that it would forgo the need for the Berkeley DB, since
> it stores the XML record itself. The built-in Z39.50 server is a
> nice bonus, as well. Our backup options would be XTF
> (http://www.cdlib.org/inside/projects/xtf/) and Xapian. Swish-e
> isn't really an option, since it can't index UTF-8.
>
> The idea then is to be able to make stronger relationships between
> our site's content... eliminate the silos. A search that brings
> back a couple of items that are in a particular subject guide would
> get a link to the subject... or at least links to the other "top"
> items from that guide (a good tie-in with MyLibrary, Eric).
> Something that's on reserve would have links to reserve policies or
> a course guide for that course or whatever.
>
> Journals would have links to the databases they are indexed in.
>
> Yes, there's some infrastructure that needs to be worked out... :)
>
> But the goal is to have something to at least see by the end of the
> year (calendar, not school).
>
> We'll see :)
>
> -Ross.
>
> On Oct 27, 2005, at 5:58 PM, Eric Lease Morgan wrote:
>
>> On Oct 27, 2005, at 2:06 PM, Andrew Nagy wrote:
>>
>>>> http://mylibrary.ockham.org/
>>>
>>> I have been thinking of ways, similar to what you have done with
>>> the Ockham project you mentioned below, to allow more modern
>>> access to our library catalog. I have begun to think about
>>> devising a way to index/harvest our entire catalog (and to let
>>> this indexing process run every so often) to allow our own custom
>>> access methods. We could then generate our own custom RSS feeds
>>> of new books, allow more efficient/enticing search interfaces,
>>> etc.
>>>
>>> Do you know of any existing software for indexing or harvesting a
>>> catalog into another datastore (SQL database, XML database,
>>> etc.)? I am sure I could fetch all of the records somehow through
>>> Z39.50 and dump them into a MySQL database, but maybe there is a
>>> better method?
>>
>> I too have thought about harvesting content from my local catalog
>> and providing new interfaces to the content, and I might go about
>> this in a number of different ways.
>>
>> 1. I might use OAI to harvest the content, cache it locally, and
>> provide services against the cache. This cache might be saved on a
>> file system, but more likely in a relational database.
>>
>> 2. I might simply dump all the MARC records from my catalog,
>> transform them into something more readable, say sets of HTML/XML
>> records, and provide services against these files.
>>
>> The weakest link in my chain would be my indexer. Relational
>> databases are notoriously ill-equipped to handle free-text
>> searching. Yes, you can implement it, and you can use various
>> database-specific features to do so, but they still won't work as
>> well as a dedicated indexer. My only experience with indexers lies
>> in things like swish-e and Plucene, and I sincerely wonder whether
>> these would be up to the task.
>>
>> Supposing I could find an indexer that was satisfactory, I would
>> then provide simple and advanced (SRU/OpenSearch) search features
>> against the index of holdings. Search results would then be
>> enhanced with features such as borrow, renew, review, put on
>> reserve, save as citation, email, "get it for me", put on hold,
>> "what's new?", view as RSS, etc. These services would require a
>> list of authorized users of the system -- a patron database.
>>
>> In short, since I would have direct access to the data, and since
>> I would have direct access to the index, I would use my skills to
>> provide services against them. For the most part, I don't mind the
>> back-end, administrative, data-entry interfaces to our various
>> systems, but I do have problems with the end-user interfaces. Let
>> me use those back-ends to create and store my data, then give me
>> unfettered access to the data, and I will provide my own end-user
>> interfaces. Another alternative is to exploit (industry-standard)
>> Web Services computing techniques against the existing integrated
>> library system. In this way you get XML data back (information
>> without presentation) and can begin to do the same things.
>>
>> --
>> Eric Lease Morgan
>> University Libraries of Notre Dame
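P.S. For anyone following along, here is a rough sketch of the
marcdump -> marc4j -> MODS step Ross describes above. It assumes the
marc4j 2.x streaming API and the Library of Congress MARC21slim2MODS
stylesheet; "catalog.mrc" is a made-up name for whatever file your
ILS's dump utility produces, and I have not tested this:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import javax.xml.transform.Source;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;
    import org.marc4j.MarcReader;
    import org.marc4j.MarcStreamReader;
    import org.marc4j.MarcWriter;
    import org.marc4j.MarcXmlWriter;
    import org.marc4j.marc.Record;

    public class MarcToMods {
        public static void main(String[] args) throws Exception {
            // A raw MARC dump from the ILS (file name is hypothetical).
            InputStream in = new FileInputStream("catalog.mrc");
            MarcReader reader = new MarcStreamReader(in);

            // Chain the LC stylesheet so each record is written out
            // as MODS rather than plain MARCXML.
            Source stylesheet = new StreamSource(
                "http://www.loc.gov/standards/marcxml/xslt/MARC21slim2MODS.xsl");
            MarcWriter writer = new MarcXmlWriter(
                new StreamResult(System.out), stylesheet);

            // Stream record by record, so an 800,000-record dump
            // never has to fit in memory.
            while (reader.hasNext()) {
                Record record = reader.next();
                writer.write(record);
            }
            writer.close();
        }
    }

From there, the MODS output could go into the XML database and be fed
to Lucene, which is the division of labor Ross describes: the database
as the store, the indexer on top.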