Chris,

In general I would say keep the documents separate, and build some sort of
process that combines the relevant parts of the documents for ingestion into
MarkLogic.  However, since I have no idea how MarkLogic works, that solution
may not be an option.  We're doing quite a bit with TEI and MODS documents
in Solr, and we just have a service that generates a Solr indexing document
out of our XML.   
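
To give a rough idea of what such a service does, here is a minimal sketch
in Python.  It is not our actual code: the function name and the Solr field
names (id, title, text) are made up for illustration, and a real schema
would carry many more fields.  It only assumes the TEI is in the standard
TEI namespace.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample input, just enough TEI to exercise the function.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title>Sample Text</title>
  </titleStmt></fileDesc></teiHeader>
  <text><body><p>Hello,</p> <p>world.</p></body></text>
</TEI>"""


def tei_to_solr_doc(tei_xml, doc_id):
    """Build a Solr <add><doc> update message from a TEI string.

    The field names are assumptions, not part of any real schema.
    """
    root = ET.fromstring(tei_xml)
    title = root.findtext(
        ".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS
    )
    text_elem = root.find(".//tei:text", TEI_NS)
    # Flatten all text under <text> and collapse runs of whitespace.
    full_text = ""
    if text_elem is not None:
        full_text = " ".join("".join(text_elem.itertext()).split())

    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in (("id", doc_id), ("title", title), ("text", full_text)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


solr_xml = tei_to_solr_doc(SAMPLE_TEI, "demo-1")
print(solr_xml)
```

The point is that the source XML stays untouched and the indexing document
is a throwaway derivative you can regenerate whenever the schema changes.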

> * It seems the structure of TEI documents can be problematic since they
> follow a logical structure, by paragraphs/sections. And the structMap of all
> our METS documents, so far, are divided up by pages of text, not paragraphs.
> So the TEI structure does not fit nicely into METS the way we're using METS.

In general, yeah.  You can put page breaks (<pb/>) in TEI, but it would be a
clunky way to address sections of the document from your METS.
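
To show why it gets clunky, here is a small Python sketch that groups the
text of a TEI body by page using the <pb/> milestones.  The function name
and sample are hypothetical, and it assumes each <pb/> carries an n
attribute.  Note that a page can start or stop in the middle of a
paragraph, which is exactly the mismatch with a page-based structMap.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
PB_TAG = "{http://www.tei-c.org/ns/1.0}pb"

# Hypothetical sample: the first paragraph straddles a page break.
SAMPLE_BODY = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<p><pb n="1"/>First page text that <pb n="2"/>continues here.</p>'
    "<p>Second paragraph.</p>"
    "</body></text></TEI>"
)


def text_by_page(tei_xml):
    """Return {page number: text} by walking the body in document order."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    pages = {}
    current = None  # text before the first <pb/> is ignored

    def walk(elem):
        nonlocal current
        if elem.tag == PB_TAG:
            current = elem.get("n")
            pages.setdefault(current, [])
        elif elem.text and current is not None:
            pages[current].append(elem.text)
        for child in elem:
            walk(child)
            # Text following a child belongs to whichever page is now open.
            if child.tail and current is not None:
                pages[current].append(child.tail)

    if body is not None:
        walk(body)
    # Joining with spaces is a simplification; it can split words that
    # are interrupted by inline markup.
    return {n: " ".join(" ".join(parts).split()) for n, parts in pages.items()}


pages = text_by_page(SAMPLE_BODY)
print(pages)
```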

> * We're also concerned with not having redundant metadata in the TEI header
> and the dmdSec of the METS document. So, we're considering keeping the TEI
> header very brief and relying on the METS doc for
> descriptive/administrative/technical metadata. (We won't be deriving METS
> from TEI which is another issue.)

METS seems like a lower-level standard than the TEI header, so that makes
sense.

> * The other issue has already been raised by Liza Daly: performance. We've
> been told by one of the programmers at Mark Logic that we should embed the
> TEI docs into METS for good performance, but we have other reasons why we
> don't want to  embed the TEI (editing, maintenance, etc.). So, we are
> considering writing a script that would integrate the METS and TEI at the
> point a search is deployed.

Does MarkLogic operate directly on the XML or does it index it?  If it is
running XQuery queries or something like that, you may not see much of a
performance increase by splitting them out.  In fact I'd say that XQuery
queries are typically a lot faster when they're operating on a single
document or a collection of similar documents.  You may also want to
consider what comes after MarkLogic.
 
> * From the metadata standpoint, I want to keep the TEI docs separate and
> link out to them from the METS docs, because I'm not convinced that library
> metadata standards are stable. If we move away from using METS in the next
> 5-10 years, I think it would be easier if all the text/image files remained
> separate from the metadata. So, I'd prefer links in the fileSec of METS that
> link out to external TEI files.
> 

That makes sense, although since your TEI would all be namespaced it
wouldn't be too hard to extract it if necessary.  I would be concerned with
the future ramifications of having your objects optimized for a legacy
system.   
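
As a rough illustration of the extraction, here is a Python sketch that
pulls every TEI document back out of a METS wrapper by namespace.  It
assumes the TEI was embedded under mets:FContent/mets:xmlData and uses the
standard TEI and METS namespaces; the function name and sample are made up.

```python
import xml.etree.ElementTree as ET

TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_TAG = "{%s}TEI" % TEI_URI

# Serialize with a readable "tei:" prefix instead of an auto-generated one.
ET.register_namespace("tei", TEI_URI)

# Hypothetical sample: one TEI document embedded in a METS fileSec.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec><mets:fileGrp><mets:file ID="f1">
    <mets:FContent><mets:xmlData>
      <TEI xmlns="http://www.tei-c.org/ns/1.0">
        <text><body><p>Embedded.</p></body></text>
      </TEI>
    </mets:xmlData></mets:FContent>
  </mets:file></mets:fileGrp></mets:fileSec>
</mets:mets>"""


def extract_tei(mets_xml):
    """Return each embedded TEI document as a standalone XML string."""
    root = ET.fromstring(mets_xml)
    # iter() matches on the namespaced tag wherever it sits in the METS.
    # strip() drops the trailing whitespace that tostring() copies over.
    return [
        ET.tostring(tei, encoding="unicode").strip()
        for tei in root.iter(TEI_TAG)
    ]


tei_docs = extract_tei(SAMPLE_METS)
print(len(tei_docs))
```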

-Andy