Using Dublin Core, is there a way to express size measured in words?

I have a collection of more than 3,000 data sets. Each data set may include as many as a few thousand journal articles, a couple hundred books, or a myriad of Web pages. These data sets cover two very broad topic areas: COVID-19 and "great ideas" like love, honor, truth, justice, beauty, etc.

I am in the process of curating the collection, and I want to rigorously describe each item. Modeling the metadata in a relational database is easy, and because the data sets (by definition) are well-structured, it is almost trivial to fill the database with records. While the database will be my canonical container for the metadata, I will want to expose the metadata in a number of different ways. Examples may include: OAI-PMH, flavors of linked data (RDF/XML, JSON-LD), SPARQL, etc. Ultimately, I will have to map my metadata to something like Dublin Core, and for most metadata, the mapping is easy, especially if I exploit the terms (http://purl.org/dc/terms/) namespace, which I think used to be called "Qualified Dublin Core".

But a few characteristics are throwing me for a loop. The first is number of words. The size of a data set, measured in words, is very useful information. For example, data sets smaller than 1,000,000 words do not lend themselves to semantic indexing, and this would be good to know before a data set is downloaded. The second is number of items in the data set, which is an indicator of comprehensiveness. Finally, one data set could be a mere 10 MB in size where other data sets might come close to a gigabyte. I need/want to express extent in a number of ways.
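
To make the three measures concrete, here is a minimal Python sketch of how they might be computed for a single data set. It assumes, purely for illustration, that a data set is a directory of plain-text files; the function name measure_extent and the whitespace-based word count are hypothetical and not necessarily how the collection is actually processed.

```python
from pathlib import Path


def measure_extent(directory):
    """Return the number of items, words, and bytes in a directory of text files.

    Assumption: every regular file under the directory counts as one item,
    and a "word" is any whitespace-delimited token.
    """
    items = 0
    words = 0
    size = 0
    for path in Path(directory).rglob("*"):
        if not path.is_file():
            continue
        items += 1
        size += path.stat().st_size
        # Naive word count; undecodable bytes are skipped rather than raising.
        with open(path, encoding="utf-8", errors="ignore") as handle:
            for line in handle:
                words += len(line.split())
    return {"items": items, "words": words, "bytes": size}
```

Each of the three values returned is a candidate for its own statement of extent, and each needs its unit (items, words, bytes) stated alongside the number.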

Using Dublin Core, how can I express the size of a data set measured in number of words, number of items, or size in bytes? Here is a snippet of RDF/XML where I express size in bytes, but I am not satisfied with the result because the units are not explicitly expressed:

  <dc:format>
    <dcterms:extent>
      <rdf:value>100</rdf:value>
      <rdfs:label>100 MB (compressed)</rdfs:label>
    </dcterms:extent>
  </dc:format>

What am I missing? How can this snippet be improved? How can I apply the same technique to denote size in words or number of items? Can I do this without creating my own namespace? Attached ought to be a valid RDF/XML file with bogus values for things like creator, title, subjects, etc.

(Once I figure out how to exploit extent, I will want to learn how to exploit table of contents notes.)

--
Eric Morgan
Navari Family Center for Digital Scholarship
Hesburgh Libraries
University of Notre Dame