LISTSERV 16.5 - CODE4LIB Archives

To index and display did-level EAD elements,
Or to index finding aids as a whole.
That is the question.

Seriously, this discussion surrounding the indexing and display of EAD files is extraordinarily timely, but the loose consensus on how to do it does not jibe with my experience. In short, I have been told by my archivist friends that I need to index and display each and every did-level element in my EAD files, and then provide a link to the finding aid as a whole. Let me explain.

Here at Notre Dame we are leading an effort we colloquially call the "Catholic Portal". [1] We use VUFind as our "discovery system" and thus Solr as the underlying indexer. Much of the metadata I index is MARC-based, but increasingly it is and will be EAD-based. Using VUFind to index MARC records is well-understood. Only very recently has it become truly feasible to index content other than MARC, such as EAD. A few months ago we spent time parsing EAD and stuffing it into the underlying Solr index. We took metadata from the EAD header and mapped it to Solr fields. We then free text indexed the balance. Thus a search matching anything in an EAD file returned a single result for that file, complete with EAD title, author, etc. Links to the original EAD were then provided. The process functioned, but it was not deemed good enough by the archivists in the crowd.
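
For illustration only, here is a minimal sketch of that collection-level approach in Python. It is not the code we actually ran. It assumes a DTD-style EAD 2002 file without XML namespaces, and the dictionary keys (id, title, author, fulltext) are hypothetical field names, not the VUFind/Solr schema.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up finding aid (no namespaces, for brevity).
SAMPLE_EAD = """<ead>
  <eadheader>
    <eadid>example-001</eadid>
    <filedesc>
      <titlestmt>
        <titleproper>Example Family Papers</titleproper>
        <author>Example Archives staff</author>
      </titlestmt>
    </filedesc>
  </eadheader>
  <archdesc level="collection">
    <did><unittitle>Example Family Papers</unittitle></did>
    <dsc>
      <c01 level="series">
        <did><unittitle>Correspondence</unittitle></did>
      </c01>
    </dsc>
  </archdesc>
</ead>"""


def collection_level_document(ead_xml):
    """Build ONE index document for an entire finding aid."""
    root = ET.fromstring(ead_xml)
    # Header metadata is mapped to (hypothetical) fields.
    title = root.findtext("eadheader/filedesc/titlestmt/titleproper", default="")
    author = root.findtext("eadheader/filedesc/titlestmt/author", default="")
    eadid = root.findtext("eadheader/eadid", default="")
    # Everything in the file is collapsed into one free-text blob.
    fulltext = " ".join(" ".join(root.itertext()).split())
    return {
        "id": eadid.strip(),
        "title": title.strip(),
        "author": author.strip(),
        "fulltext": fulltext,
    }


doc = collection_level_document(SAMPLE_EAD)
print(doc["id"], "|", doc["title"])
```

A search for "Correspondence" would match the fulltext field, but the hit can only say "Example Family Papers". That loss of context is exactly what the archivists objected to.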

As you know, EAD files are not structured like most MARC records. An EAD file represents an entire collection. Within that collection there may be sub-collections upon sub-collections. While the EAD's header and archdesc element may describe the collection as a whole, the sub-level and nested did elements are the real meat of the matter. Free text searches over the entire EAD that only return the over-arching metadata do not put search results in context, even if one does provide links out to the full finding aid. Instead (ideally), each and every did needs to be indexed and displayed in search results. Moreover (ideally), these search results need to be displayed in their hierarchical relationship with the balance of the EAD file.
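
To make the idea concrete, here is a rough sketch of walking every did in a finding aid while remembering where it sits in the hierarchy. It assumes a DTD-style EAD 2002 file without namespaces, components named c or c01 through c12, and plain-text unittitle elements. The record keys (title, level, ancestors) are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE_EAD = """<ead>
  <eadheader><eadid>example-001</eadid></eadheader>
  <archdesc level="collection">
    <did><unittitle>Example Family Papers</unittitle></did>
    <dsc>
      <c01 level="series">
        <did><unittitle>Correspondence</unittitle></did>
        <c02 level="file">
          <did><unittitle>Letters, 1901-1905</unittitle></did>
        </c02>
      </c01>
      <c01 level="series">
        <did><unittitle>Photographs</unittitle></did>
      </c01>
    </dsc>
  </archdesc>
</ead>"""

# Component elements are <c> or the numbered <c01> ... <c12>.
COMPONENT = re.compile(r"^c(0[1-9]|1[0-2])?$")


def walk_dids(element, ancestors=()):
    """Yield one record per did, carrying the titles of its ancestors."""
    path = ancestors
    did = element.find("did")  # direct child only, not nested dids
    if did is not None:
        title = (did.findtext("unittitle") or "").strip()
        yield {
            "title": title,
            "level": element.get("level", ""),
            "ancestors": list(ancestors),
        }
        path = ancestors + (title,)
    for child in element:
        # dsc is only a wrapper; the components nest beneath it.
        if child.tag == "dsc" or COMPONENT.match(child.tag):
            yield from walk_dids(child, path)


root = ET.fromstring(SAMPLE_EAD)
records = list(walk_dids(root.find("archdesc")))
for record in records:
    # A breadcrumb trail puts each hit in context.
    print(" > ".join(record["ancestors"] + [record["title"]]))
```

Each record could become its own entry in the index, and the ancestors list is what would let a search result be displayed in relation to the rest of the finding aid.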

We began work to implement this (ideal) solution [2], but the developer went on to a more permanent job here on campus.

Here is what I plan to do:

1. acquire EAD files from "Catholic Portal" participants
2. cache them locally
3. pre-process each EAD making sure it has an eadid element
4. pre-process each EAD making sure each did element contains
a unitid element, and if one doesn't then assign it one
5. store and index each EAD file in Archon [3]
6. parse each did from each EAD file and integrate the result
into the VUFind/Solr index along with the MARC metadata
7. use VUFind as the primary interface to the "Catholic Portal"
8. use Archon as the means for displaying and navigating EAD files
9. go to Step #1
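
Steps #3 and #4 might be sketched in Python along the following lines. This is an assumption-laden illustration, not finished code: it presumes DTD-style EAD without namespaces, the fallback eadid would have to come from somewhere (a file name, say), and the "-did-0002" numbering scheme for generated unitid values is entirely made up.

```python
import xml.etree.ElementTree as ET

SAMPLE_EAD = """<ead>
  <eadheader></eadheader>
  <archdesc level="collection">
    <did><unitid>MSS 001</unitid><unittitle>Example Family Papers</unittitle></did>
    <dsc>
      <c01><did><unittitle>Correspondence</unittitle></did></c01>
      <c01><did><unittitle>Photographs</unittitle></did></c01>
    </dsc>
  </archdesc>
</ead>"""


def ensure_identifiers(root, fallback_eadid):
    """Fill in a missing eadid and missing unitids in place.

    Returns the number of unitid values that had to be assigned.
    """
    header = root.find("eadheader")
    if header is None:
        header = ET.Element("eadheader")
        root.insert(0, header)
    eadid = header.find("eadid")
    if eadid is None:
        eadid = ET.Element("eadid")
        header.insert(0, eadid)
    if not (eadid.text or "").strip():
        eadid.text = fallback_eadid

    assigned = 0
    # Copy to a list first: the tree is modified inside the loop.
    for position, did in enumerate(list(root.iter("did")), start=1):
        unitid = did.find("unitid")
        if unitid is None:
            unitid = ET.Element("unitid")
            did.insert(0, unitid)
        if not (unitid.text or "").strip():
            # Hypothetical scheme: eadid plus the did's document-order position.
            unitid.text = f"{eadid.text.strip()}-did-{position:04d}"
            assigned += 1
    return assigned


root = ET.fromstring(SAMPLE_EAD)
assigned = ensure_identifiers(root, "example-001")
unitids = [did.findtext("unitid") for did in root.iter("did")]
print(assigned, unitids)
```

One caveat with numbering by position: the generated identifiers change if a participant later reorders or inserts components, so the cached copy would probably need to keep whatever identifiers it was given the first time.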

Actually, my plan is not very much different from everybody else's plan. I'm using Solr as my indexer but the VUFind/Solr schema instead of Blacklight's. For simplicity's sake, I'm using Archon for storing/displaying my EAD instead of Fedora. (You say tomāto. I say tomäto. [4]) The most significant difference is the level at which I am expected to index and display the EAD files. I see a whole lot of XPath queries in my future.

[1] Catholic Portal - http://www.catholicresearch.net
[2] indexing EAD - http://serials.infomotions.com/code4lib/archive/2010/201007/1957.html
[3] Archon - http://www.archon.org/
[4] (Don't ya just gotta love Unicode.)

--
Eric Lease Morgan