LISTSERV 16.5 - CODE4LIB Archives

To database or not to database (2 db || ~2 db), that is the question.
Put another way, I am having a difficult time deciding to what degree I
should use a database application to manage a collection of electronic
texts. Allow me to explain.

I host a collection of electronic texts called the Alex Catalogue. It
needs a facelift in terms of both aesthetics and functionality. The
collection consists of "great" public domain texts from American and
English literature as well as Western philosophy. The idea behind the
Catalogue is, if you were to read and understand all of these 500 or so
items, then you would have a pretty good understanding of Western
culture. Here are links to what I have so far:

   * http://infomotions.com/alex/
   * http://infomotions.com/alex2/


Functionality-wise, a future implementation of the Catalogue will:

* be accessible via authors, titles, a set of controlled vocabulary
terms, as well as free-text searching

* return search results that include not only authors, titles, and
links, but also paragraph-level detail much like a concordance

* search results will be sortable by author, title, date, rank,
popularity, size, etc.

* author names (the authority list) will be supplemented with
rudimentary biographical information

* controlled vocabulary terms will include things like subjects,
literary form, genre, etc.

* each document will ultimately be saved as a TEI/XML file, enabling me
to transform the file(s) into a myriad of different forms such as HTML,
"smart" HTML, plain text, PDF, PalmPilot, Rocket eBook, OEB, Newton
Paperback, MARC, MARCXML, MODS, METS, etc.

* provide a Search Inside The Book feature a la Amazon

* provide a Did You Mean feature a la Google

* allow harvesting via OAI

* allow syndication of hand-selected and randomly-selected items
through RSS

* provide a MyAlex feature for customization/personalization

* each item will be associated with one image to give the items
graphic appeal

* the entire corpus with much of its functionality will be
distributable on a CD but require no program to use -- just the CD and
the data

* items will be printable in such a way that they can be bound in a
pretty manner
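
As a rough illustration of the Did You Mean item above, here is a
minimal sketch in Python. It assumes the authority list and the
controlled vocabulary are available as a plain list of strings, and it
uses simple string similarity from the standard library, which is not
how Google does it. The function name did_you_mean and the sample terms
are hypothetical.

```python
import difflib


def did_you_mean(query, vocabulary, limit=3):
    """Suggest known terms that are close to a query that matched nothing.

    `vocabulary` is assumed to be a list of author names, titles, or
    controlled vocabulary terms (a hypothetical input).
    """
    # Compare case-insensitively, but return the original spelling.
    lookup = {term.lower(): term for term in vocabulary}
    if query.lower() in lookup:
        return []  # exact hit, nothing to suggest
    matches = difflib.get_close_matches(
        query.lower(), list(lookup), n=limit, cutoff=0.6
    )
    return [lookup[match] for match in matches]


if __name__ == "__main__":
    print(did_you_mean("Thoreu", ["Thoreau", "Emerson", "Plato"]))
```

The suggestions are only as good as the vocabulary passed in, which is
one more argument for keeping the authority list somewhere it can be
queried easily.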


To what degree do I use a database to implement these features?
Maintaining an authority list and sets of controlled vocabulary terms
almost necessitates a database application. Fine. No problem. I can
accept that. But do I create a database of the Catalogue's metadata and
then point to the TEI files? Ick! That is too fragile, and IMHO not
very elegant.
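
To make the fragility concrete, here is a minimal sketch of the
metadata-plus-pointers approach, written in Python with SQLite. The
table names, column names, and functions are hypothetical, not an
existing schema. The last function shows the problem: the database only
knows a path, so every move or rename on the filesystem has to be
detected after the fact.

```python
import os
import sqlite3


def build_catalogue(conn):
    """Create hypothetical tables: an authority list plus pointers to files."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS authors (
            id        INTEGER PRIMARY KEY,
            name      TEXT NOT NULL UNIQUE,
            biography TEXT
        );
        CREATE TABLE IF NOT EXISTS texts (
            id        INTEGER PRIMARY KEY,
            author_id INTEGER NOT NULL REFERENCES authors(id),
            title     TEXT NOT NULL,
            tei_path  TEXT NOT NULL UNIQUE
        );
    """)


def add_text(conn, author, title, tei_path):
    """Record one text; the TEI file itself stays on the filesystem."""
    conn.execute("INSERT OR IGNORE INTO authors (name) VALUES (?)", (author,))
    author_id = conn.execute(
        "SELECT id FROM authors WHERE name = ?", (author,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO texts (author_id, title, tei_path) VALUES (?, ?, ?)",
        (author_id, title, tei_path),
    )


def broken_pointers(conn):
    """Return (title, path) pairs whose TEI file no longer exists."""
    rows = conn.execute(
        "SELECT title, tei_path FROM texts ORDER BY title"
    ).fetchall()
    return [(title, path) for title, path in rows if not os.path.isfile(path)]
```

Running broken_pointers on a schedule would catch dangling records, but
it does not prevent them.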

Alternatively, I could store the entire TEI files in a database. It
is not as if the database cannot handle the file size, but then the
question is, "How do I do data-entry against the database?" Many of
these texts are a few hundred K in size, and consequently not very
amenable to CGI forms.

Yet another approach would be to create my TEI files, use the
filesystem as the database, and regularly crawl the filesystem to
create indexes of various types. I suppose I could do this using XSL
technology.
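
For comparison, here is a minimal sketch of the crawl-and-index idea,
written in Python rather than XSL. It assumes the TEI files end in .xml,
and that the first title and author elements in each file are the ones
in the header. It ignores XML namespaces so that it does not depend on
a particular TEI version. The function names and the index layout are
hypothetical.

```python
import os
import xml.etree.ElementTree as ET
from collections import defaultdict


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def first_text(root, name):
    """Return the whitespace-normalized text of the first non-empty
    element called `name`, or None if there is none."""
    for element in root.iter():
        if local_name(element.tag) == name:
            text = " ".join("".join(element.itertext()).split())
            if text:
                return text
    return None


def crawl(root_dir):
    """Walk root_dir and build an author -> [(title, path)] index."""
    by_author = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for filename in filenames:
            if not filename.lower().endswith(".xml"):
                continue
            path = os.path.join(dirpath, filename)
            try:
                root = ET.parse(path).getroot()
            except ET.ParseError:
                continue  # skip files that are not well-formed XML
            author = first_text(root, "author") or "Unknown"
            title = first_text(root, "title") or filename
            by_author[author].append((title, path))
    for entries in by_author.values():
        entries.sort()
    return dict(by_author)
```

Because the index is rebuilt from the files on every crawl, the TEI
files remain the only copy of record and the index is disposable.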

What do you think? What parts of a full-text catalog would you
implement as a database application, and what parts would you not?

--
Eric Lease Morgan
University Libraries of Notre Dame