On Fri, Mar 5, 2010 at 12:01 PM, Houghton,Andrew <[log in to unmask]> wrote:
> Too bad I didn't attend code4lib. OCLC Research has created a version of
> MARC in JSON and will probably release FAST concepts in MARC binary,
> MARC-XML and our MARC-JSON format among other formats. I'm wondering
> whether there is some consensus that can be reached and standardized at LC's
> level, just like OCLC, RLG and LC came to consensus on MARC-XML.
> Unfortunately, I have not had the time to document the format, although it
> is fairly straightforward, and yes, we have an XSLT to convert from MARC-XML
> to MARC-JSON. Basically, the format I'm using is:
>
>
The stuff I've been doing:
http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
... is pretty much the same, except:
1. I don't explicitly split up control and data fields. There's a single
field list; an item that has two elements is a control field (tag / data); one
with four is a data field (tag / ind1 / ind2 / array_of_subfields).
2. Instead of putting a collection in a big JSON array, I use
newline-delimited JSON (basically, just stick one record on each line as a
single JSON hash). This has the advantage that it makes streaming much, much
easier, and makes some other things (e.g., grabbing the first record or
two) much cheaper for even the dumbest JSON parser. I'm not sure what the
state of JSON streaming parsers is; I know Jackson (for Java) can do it,
and Perl's JSON::XS can...kind of...but it's not great.
3. I include a "type" (MARC-JSON, MARC-HASH, whatever) and version: [major,
minor] in each record. There's already a ton of JSON floating around the
library world; labeling what the heck a structure is is just friendly :-)
MARC's structure is dumb enough that we collectively basically can't miss;
there's only so much you can do with the stuff, and a round-trip to JSON and
back is easy to implement.
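To make (1) and (3) concrete, here's a sketch of one marc-hash record as a Python dict and a round-trip through JSON. The key spellings ("type", "version", "leader", "fields") are illustrative, pulled from this thread, not a published spec:

```python
import json

# One marc-hash record: a flat "fields" list where a two-element item
# is a control field and a four-element item is a data field.
record = {
    "type": "MARC-HASH",
    "version": [1, 0],
    "leader": "01192cz  a2200301n  4500",
    "fields": [
        ["001", "fst01303409"],   # two elements: control field (tag, data)
        ["003", "OCoLC"],
        ["151", " ", " ",         # four elements: data field
         [["a", "Hawaii"], ["z", "Diamond Head"]]],
    ],
}

# The round-trip to JSON and back is lossless for this structure.
assert json.loads(json.dumps(record)) == record

# Telling control fields from data fields is just a length check.
for f in record["fields"]:
    kind = "control" if len(f) == 2 else "data"
    print(f[0], kind)
```

Note that nothing here depends on a particular JSON library; the record is just nested lists and hashes.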
I'm not super-against explicitly labeling the data elements (tag:, ind1:,
etc.), but I don't see where it's necessary unless you're planning on adding
out-of-band data to the records/fields/subfields at some point. Which might
be kinda cool (e.g., language hints on a per-subfield basis? Tokenization
hints for non-whitespace-delimited languages? URIs for unique concepts and
authorities where they exist for easy creation of RDF?)
I *am*, however, willing to push and push and push for NDJ instead of having
to deal with streaming JSON parsing, which to my limited understanding is
hard to get right and to my more qualified understanding is a pain in the
ass to work with.
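A minimal sketch of why NDJ makes streaming cheap (the record contents below are stand-ins, not real MARC data): each line is a complete JSON hash, so grabbing the first record means reading one line, not parsing the whole collection.

```python
import io
import json

# Stand-in records; in practice each would be a full marc-hash record.
records = [
    {"type": "MARC-HASH", "version": [1, 0], "fields": [["001", f"rec{i}"]]}
    for i in range(3)
]

# Writing: one record per line, no enclosing array, so a writer can
# emit records as they're produced. (With real files, open them with
# encoding="utf-8".)
ndj = "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"

# Reading: any plain JSON parser works a line at a time; the first
# record never requires touching the rest of the stream.
first_line = next(iter(io.StringIO(ndj)))
first = json.loads(first_line)
assert first["fields"][0][1] == "rec0"
```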
And anything we do should explicitly be UTF-8 only; converting from MARC-8
is a problem for the server, not the receiver.
Support for what I've been calling marc-hash (I like to decouple it from the
eventual JSON format in case the serialization preferences change, or at
least so implementations don't get stuck with a single JSON library) is
already baked into ruby-marc, and obviously implementations are dead-easy no
matter what the underlying language is.
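Since the two shapes carry the same information, converting between them really is dead-easy; here's a sketch mapping marc-hash fields into the labeled controlfield/datafield form from the quoted message (key names copied from the examples in this thread, not from any spec):

```python
def hash_to_labeled(record):
    """Convert a marc-hash field list to the labeled MARC-JSON shape."""
    out = {"leader": record["leader"], "controlfield": [], "datafield": []}
    for f in record["fields"]:
        if len(f) == 2:  # (tag, data) -> control field
            out["controlfield"].append({"tag": f[0], "data": f[1]})
        else:            # (tag, ind1, ind2, subfields) -> data field
            tag, ind1, ind2, subs = f
            out["datafield"].append({
                "tag": tag, "ind1": ind1, "ind2": ind2,
                "subfield": [{"code": c, "data": d} for c, d in subs],
            })
    return out

rec = {
    "leader": "01192cz  a2200301n  4500",
    "fields": [
        ["001", "fst01303409"],
        ["151", " ", " ", [["a", "Hawaii"], ["z", "Diamond Head"]]],
    ],
}
labeled = hash_to_labeled(rec)
assert labeled["controlfield"][0] == {"tag": "001", "data": "fst01303409"}
```

The reverse mapping is just as short, which is the point: whatever labeling we standardize on, nobody's implementation gets stuck.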
Anyone from the LoC want to get in on this?
-Bill-
> [
> ...
> ]
>
> which represents a collection of MARC records or
>
> {
> ...
> }
>
> which represents a single MARC record that takes the form:
>
> {
> leader : "01192cz a2200301n 4500",
> controlfield :
> [
> { tag : "001", data : "fst01303409" },
> { tag : "003", data : "OCoLC" },
> { tag : "005", data : "20100202194747.3" },
> { tag : "008", data : "060620nn anznnbabn || ana d" }
> ],
> datafield :
> [
> {
> tag : "040",
> ind1 : " ",
> ind2 : " ",
> subfield :
> [
> { code : "a", data : "OCoLC" },
> { code : "b", data : "eng" },
> { code : "c", data : "OCoLC" },
> { code : "d", data : "OCoLC-O" },
> { code : "f", data : "fast" },
> ]
> },
> {
> tag : "151",
> ind1 : " ",
> ind2 : " ",
> subfield :
> [
> { code : "a", data : "Hawaii" },
> { code : "z", data : "Diamond Head" }
> ]
> }
> ]
> }
>
--
Bill Dueber
Library Systems Programmer
University of Michigan Library