Jakob,
I've looked briefly through your document. There are some interesting ideas in it. I remember learning about hypertext and playing with TeX and other markup languages before HTML and the World Wide Web. I agree that the current common understanding of hypertext has changed since then. I will have to read your document more thoroughly when I have time. Thank you!
Steve McDonald
[log in to unmask]
-----Original Message-----
From: Code for Libraries <[log in to unmask]> On Behalf Of Voß, Jakob
Sent: Wednesday, November 11, 2020 3:47 AM
To: [log in to unmask]
Subject: Re: [CODE4LIB] modeling data and metadata for repository ingest
Hi Steve,
Any answer to your question depends on many unknown factors and is highly opinionated. First of all, you are definitely not the first to create such a data wrangling and conversion toolkit. One kind of such system is a build system; I often use Makefiles, but if you are most familiar with Ruby you may want to have a look at Apache Buildr. Second, I'd recommend relying on existing command-line tools and using your framework only as glue. I can recommend Catmandu, but there is no universal framework for all kinds of data formats. As you are working with files, the best environment is the command line.
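To give a rough idea of what I mean by glue (just a sketch, not taken from your toolkit: the file names and the stylesheet to-json.xsl are made up), a Rakefile with two file tasks already chains existing command-line tools and only rebuilds what is out of date:

    # Rakefile: each file task names its output, its inputs, and the shell
    # command that produces the output from the inputs.
    file "records.json" => "records.xml" do
      sh "xsltproc to-json.xsl records.xml > records.json"  # made-up stylesheet
    end

    file "titles.txt" => "records.json" do
      sh "jq -r '.[].title' records.json > titles.txt"      # jq extracts one field
    end

    task :default => "titles.txt"

You wrote: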
> I am rewriting the toolkit from scratch, with a modular design. I want a consistent set of methods defined in an abstract class for a package of data (which I am calling a Tree), with subclasses defining the exact behavior of the methods for directories, zipfiles, images with imbedded metadata, etc. I'm sure this is familiar to some of you. A file or directory (or analog) within a Tree is defined as a path from the root of the Tree
I'd use different terminology, but it's the same idea:
* you have digital OBJECTS (a file or data stream with a known format...), identified by an identifier (which could be the path of a file, a URL, or an id assigned to a temporary result in your process)
* and LOCATORS to extract content from an object.
A locator can be a path expression such as XPath for XML data or a jq script for JSON, but also an arbitrary shell script if the conversion requires multiple steps and conditions. You could also call the locator a conversion, but in most cases you primarily want to extract a part instead of converting the whole object, don't you?
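To make that concrete (a minimal Ruby sketch of the idea; DigitalObject and JqLocator are names made up for illustration, not from any existing library):

    # An object: an identifier plus a way to resolve it to content.
    # Here the identifier is simply a file path.
    DigitalObject = Struct.new(:id) do
      def content
        File.read(id)
      end
    end

    # A locator that shells out to jq, so it only applies to JSON content.
    class JqLocator
      def initialize(expression)
        @expression = expression
      end

      def extract(content)
        IO.popen(["jq", "-r", @expression], "r+") do |io|
          io.write(content)
          io.close_write
          io.read
        end
      end
    end

    # Usage: pull the title field out of a JSON file.
    # JqLocator.new(".title").extract(DigitalObject.new("record.json").content)

An XPath locator or a shell-script locator would follow the same interface, just with a different extract method.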
> Does anyone have an opinion on which would be better?
I'd stick to three arguments: sourceId, locator and targetId. A storage component is needed to know how to read data from a sourceId and write data to a targetId, unless you simply use file paths.
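Spelled out in code (again just a sketch; the FileStorage class is an assumption of mine, a real storage component might talk to a repository instead):

    # The storage component resolves identifiers to content and back.
    # In the simplest case identifiers are just file paths.
    class FileStorage
      def read(id)
        File.read(id)
      end

      def write(id, data)
        File.write(id, data)
      end
    end

    # One processing step: read the source, apply the locator, write the target.
    # The locator is anything that responds to #extract, like the JqLocator above.
    def convert(source_id, locator, target_id, storage = FileStorage.new)
      storage.write(target_id, locator.extract(storage.read(source_id)))
    end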
To combine parts of multiple objects you need an additional type of document, which I'd call an edit list. These can be arbitrary scripts as well, not only Ruby or Bash, because the best language depends on which data formats you are dealing with. The only requirement for edit lists is that they should access data with a sourceId and locator as well, so you can trace data dependencies.
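Using the sketch from above, an edit list could be as small as this (the format is entirely made up; any script that records sourceId and locator for each part would do):

    # An edit list: which part of which object goes into the target, in order.
    # Because every part is referenced by a sourceId plus a locator, the data
    # dependencies of the target can be traced from the list itself.
    edit_list = [
      { source: "record.json", locator: JqLocator.new(".title") },
      { source: "extra.json",  locator: JqLocator.new(".abstract") }
    ]

    parts = edit_list.map do |entry|
      entry[:locator].extract(DigitalObject.new(entry[:source]).content)
    end

    File.write("target.txt", parts.join("\n"))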
As already said, this sure is opinionated. I've written a paper about the model: https://jakobib.github.io/hypertext2019/ In your case I'd not require content-based identifiers but allow simple file system paths or URLs.
Best wishes,
Jakob