On Fri, Mar 5, 2010 at 12:01 PM, Houghton,Andrew wrote:

> Too bad I didn't attend code4lib. OCLC Research has created a version of
> MARC in JSON and will probably release FAST concepts in MARC binary,
> MARC-XML and our MARC-JSON format among other formats. I'm wondering
> whether there is some consensus that can be reached and standardized at
> LC's level, just like OCLC, RLG and LC came to consensus on MARC-XML.
> Unfortunately, I have not had the time to document the format, although it
> is fairly straightforward, and yes we have an XSLT to convert from
> MARC-XML to MARC-JSON. Basically the format I'm using is:
>

The stuff I've been doing:

http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/

... is pretty much the same, except:

1. I don't explicitly split up control and data fields. There's a single
   field list; an item with two elements is a control field (tag / data);
   one with four is a data field (tag / ind1 / ind2 / array_of_subfield).

2. Instead of putting a collection in a big JSON array, I use
   newline-delimited JSON (basically, just stick one record on each line
   as a single JSON hash). This makes streaming much, much easier, and
   makes some other things (e.g., grabbing the first record or two) much
   cheaper, even for the dumbest JSON parser. I'm not sure what the state
   of JSON streaming parsers is; I know Jackson (for Java) can do it, and
   Perl's JSON::XS can...kind of...but it's not great.

3. I include a "type" (MARC-JSON, MARC-HASH, whatever) and a version:
   [major, minor] in each record. There's already a ton of JSON floating
   around the library world; labeling what the heck a structure is is just
   friendly :-)

MARC's structure is dumb enough that we collectively basically can't miss;
there's only so much you can do with the stuff, and a round-trip to JSON and
back is easy to implement.

I'm not super-against explicitly labeling the data elements (tag:, ind1:, etc.)
but I don't see where it's necessary unless you're planning on adding
out-of-band data to the records/fields/subfields at some point. Which might
be kinda cool (e.g., language hints on a per-subfield basis? Tokenization
hints for non-whitespace-delimited languages? URIs for unique concepts and
authorities, where they exist, for easy creation of RDF?)

I *am*, however, willing to push and push and push for NDJ (newline-delimited
JSON) instead of having to deal with streaming JSON parsing, which to my
limited understanding is hard to get right and to my more qualified
understanding is a pain in the ass to work with.

And anything we do should explicitly be UTF-8 only; converting from MARC-8 is
a problem for the server, not the receiver.

Support for what I've been calling marc-hash (I like to decouple it from the
eventual JSON format in case the serialization preferences change, or at
least so implementations don't get stuck with a single JSON library) is
already baked into ruby-marc, and obviously implementations are dead-easy no
matter what the underlying language is.

Anyone from the LoC want to get in on this?

 -Bill-

> [
>   ...
> ]
>
> which represents a collection of MARC records or
>
> {
>   ...
> }
>
> which represents a single MARC record that takes the form:
>
> {
>   "leader" : "01192cz a2200301n 4500",
>   "controlfield" :
>   [
>     { "tag" : "001", "data" : "fst01303409" },
>     { "tag" : "003", "data" : "OCoLC" },
>     { "tag" : "005", "data" : "20100202194747.3" },
>     { "tag" : "008", "data" : "060620nn anznnbabn || ana d" }
>   ],
>   "datafield" :
>   [
>     {
>       "tag" : "040",
>       "ind1" : " ",
>       "ind2" : " ",
>       "subfield" :
>       [
>         { "code" : "a", "data" : "OCoLC" },
>         { "code" : "b", "data" : "eng" },
>         { "code" : "c", "data" : "OCoLC" },
>         { "code" : "d", "data" : "OCoLC-O" },
>         { "code" : "f", "data" : "fast" }
>       ]
>     },
>     {
>       "tag" : "151",
>       "ind1" : " ",
>       "ind2" : " ",
>       "subfield" :
>       [
>         { "code" : "a", "data" : "Hawaii" },
>         { "code" : "z", "data" : "Diamond Head" }
>       ]
>     }
>   ]
> }
>

--
Bill Dueber
Library Systems Programmer
University of Michigan Library
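
P.S. For anyone who wants to see the two layouts side by side, here is a rough Python sketch, not a reference implementation. It flattens the explicitly labeled structure quoted above into a single field list (two elements for a control field, four for a data field) and round-trips it through newline-delimited JSON. The key names "leader" and "fields", the [code, value] pair layout for subfields, the version [1, 0], and all function names are assumptions made for illustration; only "type" and "version" come from the description above.

```python
import io
import json


def to_marc_hash(record):
    """Flatten an explicitly labeled record into one positional field list."""
    # Control fields become two-element items: [tag, data].
    fields = [[cf["tag"], cf["data"]] for cf in record.get("controlfield", [])]
    # Data fields become four-element items: [tag, ind1, ind2, subfields].
    for df in record.get("datafield", []):
        # Assumed subfield layout: a list of [code, value] pairs.
        subfields = [[sf["code"], sf["data"]] for sf in df["subfield"]]
        fields.append([df["tag"], df["ind1"], df["ind2"], subfields])
    return {
        "type": "marc-hash",   # label the structure
        "version": [1, 0],     # [major, minor]; the value here is made up
        "leader": record["leader"],
        "fields": fields,
    }


def is_control_field(field):
    """Two elements means control field; four means data field."""
    return len(field) == 2


def write_ndj(records, stream):
    """Write one JSON object per line (json.dumps emits no raw newlines)."""
    for rec in records:
        stream.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_ndj(stream):
    """Lazily yield one record per non-empty line; no streaming parser needed."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


labeled = {
    "leader": "01192cz a2200301n 4500",
    "controlfield": [{"tag": "001", "data": "fst01303409"}],
    "datafield": [
        {
            "tag": "151",
            "ind1": " ",
            "ind2": " ",
            "subfield": [
                {"code": "a", "data": "Hawaii"},
                {"code": "z", "data": "Diamond Head"},
            ],
        }
    ],
}

buffer = io.StringIO()
write_ndj([to_marc_hash(labeled)], buffer)
buffer.seek(0)
first = next(read_ndj(buffer))  # grabs the first record without parsing the rest
```

Because read_ndj is a generator, pulling the first record costs one json.loads call on one line, which is the whole argument for NDJ over one big array. When writing to a real file, open it with encoding="utf-8" to keep the UTF-8-only rule.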