For a good time, I've written a Perl script called oai2mylibrary.pl and
indexed the result:
http://tinyurl.com/5ddza
Here is how it works:
1. Harvest OAI content from a repository.
2. Map the DC metadata to MyLibrary facets/terms and field names.
3. Save the bibliographic information + abstracts to the MyLibrary
database.
4. Write a report against the MyLibrary database in a form that an
indexer (swish-e) can index.
5. Provide access to the index.
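
Step 2 is the least obvious, so here is a minimal sketch of what the mapping might look like. It is written in Python for illustration and is not the code in oai2mylibrary.pl. The field names in FIELD_MAP, the sample record, and the shape of the resulting dictionary are all assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace prefix that ElementTree puts in front of Dublin Core tags.
DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical mapping from Dublin Core elements to local field names.
FIELD_MAP = {
    "title": "name",
    "creator": "creator",
    "description": "note",
    "identifier": "url",
}

# A made-up oai_dc record standing in for one harvested item.
SAMPLE = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Article</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:description>A short abstract.</dc:description>
  <dc:identifier>http://example.org/1</dc:identifier>
</oai_dc:dc>"""


def map_record(xml_text, terms):
    """Turn one oai_dc record into a dict of local fields plus facet/terms."""
    root = ET.fromstring(xml_text)
    record = {"terms": list(terms)}
    for child in root:
        if not child.tag.startswith(DC):
            continue  # skip anything that is not a Dublin Core element
        field = FIELD_MAP.get(child.tag[len(DC):])
        if field and child.text:
            # DC elements are repeatable, so every field holds a list.
            record.setdefault(field, []).append(child.text.strip())
    return record


record = map_record(SAMPLE, ["formats/articles", "subjects/philosophy"])
print(record)
```

Every item harvested from one set gets the same facet/term combinations, which is why the terms are passed in separately from the record itself.
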
In this particular case I created a MyLibrary facet/term combination
called formats/articles. I created another facet/term combination
called subjects/philosophy. I then selected an OAI repository that
contained philosophy articles (Cogprints), harvested the articles from
a set defined as philosophy, and saved them to the underlying MyLibrary
database using the MyLibrary OO Perl modules. I then wrote a script
exporting all of the articles and piped it on to swish-e for indexing.
Finally, I wrote a simple CGI script allowing access to the index. The
script supports a rudimentary Did You Mean? service a la Google. In
other words, if it doesn't find anything, then it will try to create a
new query that will find something. For example, search for xyzzy.
Presently there are only 640 articles in the index.
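
The Did You Mean? fallback can be sketched as follows. This is only an illustration in Python, not the CGI script itself. It assumes a toy in-memory index (a dictionary from word to document ids) in place of swish-e, and it swaps in close string matching from Python's difflib module as the way to build the new query, which may not be how the real script does it.

```python
import difflib

# Toy inverted index: word -> ids of documents containing it (made-up data).
INDEX = {
    "philosophy": {1, 2},
    "mind": {2},
    "language": {3},
}


def search(index, query):
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    return set.intersection(*[index.get(word, set()) for word in words])


def did_you_mean(index, query):
    """Replace each unknown word with its closest indexed word, if any."""
    vocabulary = list(index)
    suggestion = []
    for word in query.lower().split():
        if word in index:
            suggestion.append(word)
            continue
        close = difflib.get_close_matches(word, vocabulary, n=1)
        if not close:
            return None  # nothing similar enough, so give up
        suggestion.append(close[0])
    return " ".join(suggestion)


def search_with_fallback(index, query):
    """Search; if nothing is found, retry once with a suggested query."""
    results = search(index, query)
    if results:
        return query, results
    alternative = did_you_mean(index, query)
    if alternative is None or alternative == query.lower():
        return query, set()
    return alternative, search(index, alternative)


used_query, hits = search_with_fallback(INDEX, "philosphy")
print(used_query, sorted(hits))
```

The function returns the query it actually ran along with the results, so the interface can tell the user that a different query was substituted.
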
My next step will be to allow oai2mylibrary.pl to take an XML
(configuration) file as input. The XML will enable the user (librarian)
to define an unlimited number of OAI repositories and sets, as well as
how each of the items from the sets should be classified. My problem is
learning how to use XML::Simple to read my XML file and extract the
necessary data from the resulting (huge) hash. Another problem is to
discover a way to internationalize the scripts, making them easier for
non-English-speaking people to implement.
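
To make the configuration idea concrete, here is a minimal sketch of one possible file layout and a reader for it. It is in Python for illustration, not Perl with XML::Simple. The element and attribute names (harvest, repository, set, term, url, spec) and the example.org address are hypothetical, since no format has been settled on yet.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: any number of repositories, each with any
# number of sets, each set classified under one or more facet/terms.
CONFIG = """<harvest>
  <repository url="http://example.org/oai">
    <set spec="example-set">
      <term>formats/articles</term>
      <term>subjects/philosophy</term>
    </set>
  </repository>
</harvest>"""


def read_config(xml_text):
    """Flatten the configuration into one harvesting job per set."""
    jobs = []
    root = ET.fromstring(xml_text)
    for repository in root.findall("repository"):
        for oai_set in repository.findall("set"):
            jobs.append({
                "url": repository.get("url"),
                "set": oai_set.get("spec"),
                "terms": [t.text.strip()
                          for t in oai_set.findall("term") if t.text],
            })
    return jobs


jobs = read_config(CONFIG)
for job in jobs:
    print(job["url"], job["set"], ", ".join(job["terms"]))
```

Flattening the tree into a simple list of jobs right away keeps the rest of the script from having to walk a large nested structure. With XML::Simple, the ForceArray option helps in a similar way by keeping repeated elements as array references even when only one is present.
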
What is also cool is that since the data is in MyLibrary, I can do other
things with it such as create RSS feeds, MARC-like records, browsable
lists, or syndicate the content to other venues. Fun!
For a good time, you can also see how an incarnation of a MyLibrary
administrative interface is shaping up, here:
http://dewey.library.nd.edu/mylibrary/sandbox/cgi-admin/
--
Eric Lease Morgan
Head, Digital Access and Information Architecture Department
University Libraries of Notre Dame
(574) 631-8604