LISTSERV 16.5 - CODE4LIB Archives

On Jun 17, 2008, at 12:06 PM, siznax wrote:

> the Zebralist appears to be relatively inactive.
> anyone here have experience indexing MARC binaries
> with zebra?


I will try to outline here how to index (and search) MARC records  
using Zebra, but tweaking the indexing process is a bit trickier than  
I know how to do.

   1. Install yaz, zebra, and all of their friends. I have found that
the "standard" make process works pretty well, but let yaz and
zebra decide where they put their various configuration files.
Overriding the default locations is not worth the effort.

   2. Save your MARC records someplace on your file system. By
"binary" MARC records, I suppose you mean "real" MARC records -- MARC
records in communications format -- the kind of records fed to
traditional integrated library systems. This is as opposed to some
flavor of XML or the "tagged format" often used for display.

   3. Create a zebra.cfg file, and have it look something like this:

       # global paths
       profilePath: .:./etc:/usr/local/share/idzebra-2.0/tab
       modulePath: /usr/local/lib/idzebra-2.0/modules

       # turn ranking on
       rank: rank-1

       # define a database of marc records called opac
       opac.database: opac
       opac.recordtype: grs.marcxml.marc21
       attset: bib1.att
       attset: explain.att

   4. Index your MARC records with the following command. You should
see lots of great stuff sent to STDOUT.

       zebraidx -g opac update <path to MARC records>


You have now created your index. Once you get this far with indexing,  
you will want to tweak various .abs files (I think) to enhance the  
indexing process. This particular thing is not my forte. It seems  
like black magic to most of us. This is not a Zebra-specific problem;  
this is a problem with Z39.50.
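
Before running zebraidx in step 4, it can help to confirm that a file really is in communications format (ISO 2709) and not XML or a tagged display format. Below is a minimal Python sketch of such a check. It relies on only two facts about the format: each record ends with the byte 0x1D, and the first five characters of the leader give the record length. The function names are my own, and this is a sanity check, not a MARC parser.

```python
RECORD_TERMINATOR = b"\x1d"  # ISO 2709 end-of-record byte


def split_marc(data: bytes) -> list:
    """Split a buffer of concatenated ISO 2709 records on the record terminator."""
    chunks = data.split(RECORD_TERMINATOR)
    # Re-attach the terminator and skip empty or whitespace-only leftovers.
    return [chunk + RECORD_TERMINATOR for chunk in chunks if chunk.strip()]


def length_matches(record: bytes) -> bool:
    """True if leader positions 00-04 (five ASCII digits) equal the real length."""
    try:
        return int(record[:5].decode("ascii")) == len(record)
    except ValueError:
        # Not digits at all, e.g. the file is XML or tagged text.
        return False


def check_file(path: str) -> tuple:
    """Return (number of records, number whose declared length is wrong)."""
    with open(path, "rb") as handle:
        records = split_marc(handle.read())
    bad = sum(1 for record in records if not length_matches(record))
    return len(records), bad
```

Calling check_file on your file of records returns a pair; if the second number is not zero, the file is probably not what zebraidx expects for this record type.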

Next, you need to implement the client/server end of things:

   5. Start your server. This will be a Z39.50 server -- a "kewl"  
library-centric protocol that existed before the Internet got hot:

       zebrasrv localhost:9999 &

   6. Use yaz-client to search your index:

       $ yaz-client
       Z> open localhost:9999/opac
       Z> find origami
       Z> show 1
       Z> quit

Using the yaz-client almost requires a knowledge of Z39.50. Attached  
should be a Perl script that allows you to search your server in a  
bit more user-friendly way. To use it you will need to install a few  
Perl modules and then edit the constant called DATABASE.
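
To give a sense of the Z39.50 knowledge involved: by default the find command in yaz-client takes its query in PQF (prefix query format), where Bib-1 "use" attributes select the access point, so a title search is written "@attr 1=4 origami". The Python sketch below builds such one-term queries. The attribute numbers are standard Bib-1 values; the function and dictionary names are mine, and whether a given attribute actually returns hits depends on how your records were indexed.

```python
# Bib-1 "use" attribute values (attribute type 1) for a few common access points.
BIB1_USE = {"title": 4, "isbn": 7, "subject": 21, "author": 1003, "any": 1016}


def pqf(term: str, field: str = "any") -> str:
    """Build a one-term PQF query such as '@attr 1=4 origami'."""
    # Multi-word terms are quoted so they are sent as one term.
    # Terms that themselves contain double quotes are not handled here.
    quoted = f'"{term}"' if " " in term else term
    return f"@attr 1={BIB1_USE[field]} {quoted}"


if __name__ == "__main__":
    print(pqf("origami", "title"))
    print(pqf("paper folding", "subject"))
```

The printed strings can be pasted after "find" at the Z> prompt.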

Even though Z39.50 is/was "kewl" it is still pretty icky. SRU is  
better -- definitely a step in the right direction, and Zebra  
supports SRU out of the box. [1]

   7. Create an SRU configuration file named sru.cfg (the name used
in step 10), looking something like this:

      <yazgfs>
        <server>
          <config>zebra.cfg</config>
          <cql2rpn>pqf.properties</cql2rpn>
        </server>
      </yazgfs>

   8. Acquire a "better" pqf.properties file. PQF is about querying  
Z39.50 databases. It is ugly. It was designed in a non-Internet  
world. Instead of knowing that 1=4 means search the title field, you  
want to simply search the title. Attached is a "better"  
pqf.properties file, and it is "better" because it maps things like  
1=4 to Dublin Core equivalents. Save it in a directory called etc in  
the same directory as your zebra.cfg file. (Notice how the zebra.cfg  
file, above, denotes etc as being in zebra's path.)

   9. Kill your presently running Z39.50 server.

  10. Start up an SRU server:

       zebrasrv -f sru.cfg localhost:9999 &

  11. Use your HTTP client to search the SRU server. Queries will  
look like this (with carriage returns added for readability):

       http://localhost:9999/opac?
        operation=searchRetrieve&
        version=1.1&
        query=origami&
        maximumRecords=5

The result should be a stream of XML ready for XSLT processing.
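
If you would rather script the request than type it into a browser, here is a minimal Python sketch that builds the searchRetrieve URL from step 11 and pulls the hit count out of a response. The function names are mine, and the sample response is hand-written for illustration, not output captured from Zebra. The namespace is the one used by SRU 1.1 responses.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

SRW_NS = "http://www.loc.gov/zing/srw/"  # namespace of SRU 1.1 responses


def sru_url(base: str, query: str, maximum_records: int = 5) -> str:
    """Build a searchRetrieve URL like the one shown in step 11."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": query,
        "maximumRecords": maximum_records,
    }
    return base + "?" + urlencode(params)


def hit_count(response_xml: str) -> int:
    """Read numberOfRecords from a searchRetrieveResponse document."""
    root = ET.fromstring(response_xml)
    return int(root.findtext(f"{{{SRW_NS}}}numberOfRecords"))


# Hand-written sample response, for illustration only.
SAMPLE = (
    '<zs:searchRetrieveResponse xmlns:zs="http://www.loc.gov/zing/srw/">'
    "<zs:version>1.1</zs:version>"
    "<zs:numberOfRecords>3</zs:numberOfRecords>"
    "</zs:searchRetrieveResponse>"
)

if __name__ == "__main__":
    print(sru_url("http://localhost:9999/opac", "origami"))
    print(hit_count(SAMPLE))
```

Against a running server, the XML can be fetched with urllib.request.urlopen(sru_url(...)).read() and passed to hit_count in place of SAMPLE.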

All of the above is almost exactly what I did to create an index of  
MARC records harvested from the Library of Congress and the  
University of Michigan's OAI data repository (MBooks). [2] Take a  
look at the HTML source. Notice how the client in this regard is only  
one HTML file containing a form, one CSS file for style, and one XSL  
file for XML to HTML transformation.

HTH.

[1] SRU - http://www.loc.gov/standards/sru/
[2] Example SRU interface - http://infomotions.com/ii/

-- 
Eric Lease Morgan
Head, Digital Access and Information Architecture Department
Hesburgh Libraries, University of Notre Dame

(574) 631-8604