When exposing sets of MARC records as linked data, do you think it is better to expose them in batch (collection) files or as individual RDF serializations? To bastardize the Bard — “To batch or not to batch? That is the question.”

Suppose I am a medium-sized academic research library. Suppose my collection comprises approximately 3.5 million bibliographic records. Suppose I want to expose those records via linked data. Suppose further that this will be done by “simply” making RDF serialization files (XML, Turtle, etc.) accessible via an HTTP file system. No scripts. No programs. No triple stores. Just files on an HTTP file system coupled with content negotiation. Given these assumptions, would you:

  1. create batches of MARC records, convert them to MARCXML
     and then to RDF, and save these files to disc, or

  2. parse the batches of MARC record sets into individual
     records, convert them into MARCXML and then RDF, and
     save these files to disc?

Option #1 would require heavy lifting against large files, but the number of resulting files to save to disc would be relatively few — reasonably managed in a single directory on disc. On the other hand, individual records would not be directly accessible via their own URIs; they would only be accessible by retrieving the collection file in which they reside. Moreover, a mapping of individual URIs to collection files would need to be maintained.
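
For what it's worth, a rough sketch of Option #1 in Python might look something like the following. It assumes pymarc's MARCReader and record_to_xml for the MARC-to-MARCXML step, a made-up marcxml_to_rdf() placeholder for whatever MARCXML-to-RDF transformation is actually used, made-up directory names and URI namespace, and the 001 field as the record identifier:

  import csv
  from pathlib import Path

  from pymarc import MARCReader, record_to_xml

  BATCHES = Path("marc-batches")           # made-up directory of *.mrc batch files
  RDF_DIR = Path("rdf")                    # made-up output directory
  BASE    = "http://example.org/records/"  # made-up URI namespace

  def marcxml_to_rdf(marcxml_records):
      """Placeholder for the real MARCXML-to-RDF transformation (an XSLT, etc.)."""
      raise NotImplementedError

  RDF_DIR.mkdir(exist_ok=True)
  with open("uri-to-collection.csv", "w", newline="") as mapping:
      writer = csv.writer(mapping)
      for batch in sorted(BATCHES.glob("*.mrc")):
          marcxml_records, uris = [], []
          with open(batch, "rb") as handle:
              for record in MARCReader(handle):
                  # record_to_xml returns bytes in recent versions of pymarc
                  marcxml_records.append(record_to_xml(record).decode("utf-8"))
                  # assume the 001 control field holds the identifier
                  uris.append(BASE + record["001"].value())
          # one RDF "collection" file per batch of MARC records
          (RDF_DIR / (batch.stem + ".rdf")).write_text(marcxml_to_rdf(marcxml_records))
          # remember which collection file each URI lives in
          for uri in uris:
              writer.writerow([uri, batch.stem + ".rdf"])

The uri-to-collection.csv file is the extra mapping mentioned above, and it would need to be regenerated every time a batch is reprocessed.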

Option #2 would be easier on computing resources because processing little files is generally easier than processing bigger ones. On the other hand, the number of files generated by this option could not easily be managed without the use of a sophisticated directory structure. (It is not feasible to put 3.5 million files in a single directory.) But I would still need to create a mapping from URI to directory.
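
A comparable sketch of Option #2, with the same assumptions as above, where each record becomes its own file and its subdirectory is computed by hashing the identifier:

  import hashlib
  from pathlib import Path

  from pymarc import MARCReader, record_to_xml

  BATCHES = Path("marc-batches")   # made-up directory of *.mrc batch files
  RDF_DIR = Path("rdf")            # made-up output directory

  def marcxml_to_rdf(marcxml_record):
      """Placeholder for the real MARCXML-to-RDF transformation (an XSLT, etc.)."""
      raise NotImplementedError

  def shard(identifier):
      """Compute a two-level subdirectory (e.g. rdf/ab/cd) from an identifier."""
      digest = hashlib.md5(identifier.encode("utf-8")).hexdigest()
      return RDF_DIR / digest[0:2] / digest[2:4]

  for batch in sorted(BATCHES.glob("*.mrc")):
      with open(batch, "rb") as handle:
          for record in MARCReader(handle):
              identifier = record["001"].value()   # assume 001 holds the identifier
              directory = shard(identifier)
              directory.mkdir(parents=True, exist_ok=True)
              marcxml = record_to_xml(record).decode("utf-8")
              (directory / (identifier + ".rdf")).write_text(marcxml_to_rdf(marcxml))

With two levels of 256 subdirectories each, 3.5 million files averages out to roughly fifty files per directory, and the mapping from URI to directory can at least be recomputed from the identifier alone.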

In either case, I would probably create a bunch of site map files denoting the locations of my serializations — YAP (Yet Another Mapping).
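
Something like the following could generate them, relying only on the sitemap protocol's published limit of 50,000 URLs per file; the file names and base URL are, again, made up:

  from xml.sax.saxutils import escape

  SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
  BASE       = "http://example.org/"   # made-up site root

  def write_sitemaps(uris, limit=50000):
      """Write sitemap files of at most `limit` URLs plus an index pointing at them."""
      names = []
      for n in range(0, len(uris), limit):
          name = "sitemap-%04d.xml" % (n // limit)
          with open(name, "w") as out:
              out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
              out.write('<urlset xmlns="%s">\n' % SITEMAP_NS)
              for uri in uris[n:n + limit]:
                  out.write("  <url><loc>%s</loc></url>\n" % escape(uri))
              out.write("</urlset>\n")
          names.append(name)
      # a sitemap index tying the individual sitemap files together
      with open("sitemap-index.xml", "w") as out:
          out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
          out.write('<sitemapindex xmlns="%s">\n' % SITEMAP_NS)
          for name in names:
              out.write("  <sitemap><loc>%s%s</loc></sitemap>\n" % (BASE, name))
          out.write("</sitemapindex>\n")

The single sitemap-index.xml file could then be advertised in robots.txt.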

I’m leaning towards Option #2 because individual URIs could be resolved more easily with “simple” content negotiation.
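
No new program is required for that; a web server such as Apache can do the negotiation itself (for example, with MultiViews) when sibling files differ only by their extension. The lookup it performs amounts to something like the following, where the media types and extensions are common conventions rather than prescriptions:

  EXTENSIONS = {
      "application/rdf+xml": ".rdf",
      "text/turtle":         ".ttl",
      "application/ld+json": ".jsonld",
  }

  def negotiate(path_sans_extension, accept_header):
      """Return the file a server would send for a given Accept header."""
      for media_type, extension in EXTENSIONS.items():
          if media_type in accept_header:
              return path_sans_extension + extension
      return path_sans_extension + ".rdf"   # arbitrary default

  # negotiate("rdf/ab/cd/12345", "text/turtle") returns "rdf/ab/cd/12345.ttl"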

(Given my particular use case — archival MARC records — I don’t think I’d really have more than a few thousand items, but I’m asking the question on a large scale anyway.)

—
Eric Morgan