Stephen,
As the lead developer on the SobekCM open-source digital repository project, and formerly a developer for the University of Florida Libraries, I have looked at this question quite a bit and learned a fair amount over time.

I began by developing tracking systems to manage a fairly large-scale digitization shop at UF, before I was even working on the public repository side.  When I arrived (around 1999), metadata was double-keyed several times for each item during the tracking and metadata creation process.  It seemed obvious to me that we needed a tracking system, and one that would hold the metadata for each item.  This was fairly easy to do while our metadata was very homogeneous and based on simple Dublin Core.  It worked well, and the system could easily spit out ready METS (and MXF) packages.

Over time, I began to experiment with MODS and increasingly used specialized metadata schemas for different types of objects, such as herbarium or oral history materials.  I envisioned a tracking system that would hold all of this metadata relationally and present different tabs based on the material type: oral history items would have an extra tab exposing the oral history metadata, and herbarium items would have a similar special tab.  While development of this moved ahead, the entire system seemed unwieldy.  Adding a new schema was laborious, and even adding a single new field took real work.

After several years of this, we began development of the SobekCM digital repository software.  After that experience, I swore off trying to store very complex structured data relationally in the database.  (This may also have had to do with an IMLS project I worked on that proved the futility of this approach.)  I generally eschew triple-stores as the basis for library systems in favor of relational databases, on the premise that we DO actually understand the basic relationships of digital resources to collections and the sub-relationships there.  We keep the data within METS files with one or more descriptive metadata sections, and the database essentially just points to that METS file.  For searching, we use a flattened table structure with one row per item, much like a Solr/Lucene index, as well as Solr/Lucene itself.
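
To illustrate the shape of that approach (hypothetical table and column names, not the actual SobekCM schema), it is roughly:

```python
import sqlite3

# Hypothetical illustration of the "database points at the METS file" layout
# described above -- NOT the actual SobekCM schema.
conn = sqlite3.connect("repository.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS item (
    item_id     INTEGER PRIMARY KEY,
    bib_id      TEXT NOT NULL,       -- bibliographic/package identifier
    mets_path   TEXT NOT NULL,       -- pointer to the METS file on disk
    collection  TEXT                 -- the basic, well-understood relationship
);

-- One flattened row per item for searching, much like a Solr/Lucene document:
-- repeated fields are collapsed into delimited text columns rather than
-- normalized into their own tables.
CREATE TABLE IF NOT EXISTS item_search (
    item_id       INTEGER PRIMARY KEY REFERENCES item(item_id),
    title         TEXT,
    creators      TEXT,              -- e.g. 'Smith, Jane | Doe, John'
    subjects      TEXT,
    full_citation TEXT               -- everything else, for keyword search
);
""")
conn.commit()
```

The relational side stays deliberately simple: the METS file remains the record of truth, and the flattened row exists only to be searched.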

My advice is to steer clear of taking beautifully (and deeply) structured metadata from MODS, Darwin Core, VRA Core (and who knows what else) and trying to create tables and relations for it.

I think you can point some database tools at the schema and have them generate the tables for you.  Just doing that will probably dissuade you.  ;)
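
For instance, here is a rough sketch of what such a tool would hand you for just two MODS elements, titleInfo and name (hypothetical table names, and only a tiny slice of the full schema):

```python
import sqlite3

# Hypothetical sketch of what a small slice of MODS (titleInfo and name)
# becomes when fully normalized -- and MODS has many more top-level elements.
conn = sqlite3.connect("mods_relational.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS mods_record (
    record_id   INTEGER PRIMARY KEY
);

CREATE TABLE IF NOT EXISTS title_info (    -- mods:titleInfo (repeatable)
    title_info_id INTEGER PRIMARY KEY,
    record_id     INTEGER REFERENCES mods_record(record_id),
    title_type    TEXT,                    -- @type: abbreviated, translated, ...
    non_sort      TEXT,
    title         TEXT,
    sub_title     TEXT,
    part_number   TEXT,
    part_name     TEXT
);

CREATE TABLE IF NOT EXISTS name (          -- mods:name (repeatable)
    name_id     INTEGER PRIMARY KEY,
    record_id   INTEGER REFERENCES mods_record(record_id),
    name_type   TEXT                       -- personal, corporate, conference
);

CREATE TABLE IF NOT EXISTS name_part (     -- mods:namePart (repeatable per name)
    name_part_id INTEGER PRIMARY KEY,
    name_id      INTEGER REFERENCES name(name_id),
    part_type    TEXT,                     -- given, family, date, termsOfAddress
    value        TEXT
);

CREATE TABLE IF NOT EXISTS name_role (     -- mods:role/roleTerm (repeatable)
    name_role_id INTEGER PRIMARY KEY,
    name_id      INTEGER REFERENCES name(name_id),
    role_term    TEXT,
    authority    TEXT                      -- e.g. 'marcrelator'
);
""")
conn.commit()
```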

Mark V. Sullivan
CIO & Application Architect
Sobek Digital Hosting and Consulting, LLC
[log in to unmask]
352-682-9692 (mobile)


________________________________________
From: Code for Libraries <[log in to unmask]> on behalf of Stephen Schor <[log in to unmask]>
Sent: Friday, April 17, 2015 1:27 PM
To: [log in to unmask]
Subject: [CODE4LIB] Modeling a repository's objects in a relational database

Hullo.

I'm interested to hear about people's approaches for modeling repository
objects in a normalized, spec-agnostic, _relational_ way, while maintaining
the ability to cast objects as various specs (MODS, Dublin Core).

People often resort to storing an object as one specification (the text of
the MODS, for example) and then converting it to other specs using XSLT or
their favorite language, using established mappings / conversions
(http://www.loc.gov/standards/mods/mods-conversions.html).
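
A minimal sketch of that store-and-crosswalk pattern, using a hand-rolled
mapping for illustration rather than one of the LOC stylesheets, might look
like:

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
DC_NS = "http://purl.org/dc/elements/1.1/"

def mods_to_simple_dc(mods_xml: str) -> ET.Element:
    """Crosswalk a stored MODS record (a text blob) to simple Dublin Core.
    Illustrative only: covers title and name -> dc:title and dc:creator."""
    mods = ET.fromstring(mods_xml)
    dc = ET.Element("metadata")

    # mods:titleInfo/mods:title -> dc:title
    for title in mods.findall(f"{{{MODS_NS}}}titleInfo/{{{MODS_NS}}}title"):
        ET.SubElement(dc, f"{{{DC_NS}}}title").text = title.text

    # mods:name/mods:namePart -> dc:creator
    for name_part in mods.findall(f"{{{MODS_NS}}}name/{{{MODS_NS}}}namePart"):
        ET.SubElement(dc, f"{{{DC_NS}}}creator").text = name_part.text

    return dc
```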

Baking a MODS representation into a database text field can introduce
problems with queryability and remediation that I _feel_ would be hedged
by factoring the information out of the XML document and modeling it
in a relational DB.

This is an idea that's been knocking around in my head for a while.
I'd like to hear whether people have gone down this road...and I'm especially
eager to hear both success and horror stories about what kind of results
they got.

Stephen