I have created an initial pile of RDF, mostly.

I am in the process of experimenting with linked data for archives. My goal is to use existing (EAD and MARC) metadata to create RDF/XML, and then to expose this RDF/XML using linked data principles. Once I get that far I hope to slurp up the RDF/XML into a triple store, analyse the data, and learn how the whole process could be improved. 

This is what I have done to date:

  * accumulated sets of EAD files and MARC
    records

  * identified and cached a few XSL stylesheets
    transforming EAD and MARCXML into RDF/XML

  * wrote a couple of Perl scripts that combine
    Bullet #1 and Bullet #2 to create HTML and
    RDF/XML (a sketch follows this list)

  * wrote a mod_perl module implementing
    rudimentary content negotiation

  * made the whole thing (scripts, sets of data,
    HTML, RDF/XML, etc.) available on the Web
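
To give a sense of the transformation step, here is a minimal sketch of the sort of thing the scripts in the bin directory do. It assumes XML::LibXML and XML::LibXSLT are installed, and the stylesheet and file names are purely illustrative, not necessarily the ones used here:

  #!/usr/bin/perl

  # transform a single EAD finding aid into RDF/XML; a sketch only
  use strict;
  use warnings;
  use XML::LibXML;
  use XML::LibXSLT;

  # parse a cached stylesheet and an EAD file (names are illustrative)
  my $xslt       = XML::LibXSLT->new;
  my $stylesheet = $xslt->parse_stylesheet_file( './etc/ead2rdf.xsl' );
  my $ead        = XML::LibXML->load_xml( location => './src/mshm510.xml' );

  # apply the stylesheet and save the result in the data directory
  my $results = $stylesheet->transform( $ead );
  open my $out, '>', './data/mshm510.rdf' or die "Can't write RDF: $!";
  print $out $stylesheet->output_as_bytes( $results );
  close $out;

The real scripts do this in bulk and create the HTML as well, but the pattern is the same.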

You can see the fruits of these labors at http://infomotions.com/sandbox/liam/, and there you will find a few directories:

  * bin - my Perl scripts live here as well as
    a couple of support files

  * data - full of RDF/XML files -- about 4,000
    of them

  * etc - mostly stylesheets

  * id - a placeholder for the URIs and content
    negotiation

  * lib - where the actual content negotiation
    script lives

  * pages - HTML versions of the original metadata

  * src - a cache for my original metadata

  * tmp - things of brief importance; mostly trash

My Perl scripts read the metadata, create HTML and RDF/XML, and save the result in the pages and data directories, respectively. A person can browse these directories, but browsing will be difficult because there is nothing there except cryptic file names. Selecting any of the files should return valid HTML or RDF/XML. 

Each cryptic name is the leaf of a URI prefixed with "http://infomotions.com/sandbox/liam/id/". For example, if the leaf is "mshm510", then the prefix and leaf combine to form a resolvable URI -- http://infomotions.com/sandbox/liam/id/mshm510. When a user-agent says it can accept text/html, the HTTP server redirects it to http://infomotions.com/sandbox/liam/pages/mshm510.html. If the user-agent does not request a text/html representation, then the RDF/XML version is returned instead -- http://infomotions.com/sandbox/liam/data/mshm510.rdf. This is rudimentary content negotiation, and a sketch of the idea appears below. Here are a few actionable URIs:

  * http://infomotions.com/sandbox/liam/id/4042gwbo
  * http://infomotions.com/sandbox/liam/id/httphdllocgovlocmusiceadmusmu004002
  * http://infomotions.com/sandbox/liam/id/ma117
  * http://infomotions.com/sandbox/liam/id/mshm509
  * http://infomotions.com/sandbox/liam/id/stcmarcocm11422551
  * http://infomotions.com/sandbox/liam/id/vilmarcvil_155543

For a good time, feed them to the W3C RDF Validator. 
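
And here, as promised, is a minimal sketch of the content negotiation idea -- not necessarily the module living in the lib directory. It assumes mod_perl 2, and the package name is made up:

  package LiAM::Dereference;

  # redirect /id/ URIs to HTML or RDF/XML representations; a sketch only
  use strict;
  use warnings;
  use Apache2::RequestRec ();
  use Apache2::Const -compile => qw( DECLINED REDIRECT );
  use APR::Table ();

  sub handler {

      my $r = shift;

      # the leaf is the last segment of the requested URI
      my ( $leaf ) = $r->uri =~ m{/id/([^/]+)$};
      return Apache2::Const::DECLINED unless $leaf;

      # choose a representation based on the Accept header
      my $accept   = $r->headers_in->{ 'Accept' } || '';
      my $location = $accept =~ m{text/html}
          ? "http://infomotions.com/sandbox/liam/pages/$leaf.html"
          : "http://infomotions.com/sandbox/liam/data/$leaf.rdf";

      # redirect the user-agent to the chosen representation
      $r->headers_out->set( Location => $location );
      return Apache2::Const::REDIRECT;

  }

  1;

Hooked into Apache with a PerlResponseHandler directive for the id location, something like curl with an explicit Accept header is a quick way to watch the redirection happen.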

The next step is to figure out how to handle file-not-found errors when a URI does not exist. Another thing to figure out is how to make potential robots aware of the data set. The bigger problem is simply to make the dataset more meaningful through the inclusion of more URIs in the RDF/XML as well as the use of a more consistent and standardized set of ontologies.
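
One possible approach to the file-not-found problem, offered only as an idea: before redirecting, the handler sketched above could test whether the requested representation actually exists on disk and return a 404 when it does not. The document root below is illustrative:

  # true if a representation exists for a given leaf; a sketch only
  sub representation_exists {

      my ( $leaf, $wants_html ) = @_;

      # the document root is illustrative; substitute the real one
      my $root = '/var/www/sandbox/liam';
      my $file = $wants_html
          ? "$root/pages/$leaf.html"
          : "$root/data/$leaf.rdf";

      return -e $file;

  }

The handler would then add NOT_FOUND to its list of compiled Apache2::Const constants and return it unless representation_exists( $leaf, $accept =~ m{text/html} ) is true.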

Fun with linked data?

— 
Eric Morgan