I actually just wrote the same exact email as Bill (although probably not
as polite -- I called the marcxml "collection" element a "contrivance that
appears nowhere in marc21"). I even wrote the "marc21 is EOR-character-delimited
files" bit. I was hoping to figure out how to use unix split to make my
point, couldn't, and then discarded my draft. But I was *right there*.

-Ross

On Fri, Mar 5, 2010 at 8:48 PM, Bill Dueber <[log in to unmask]> wrote:

> On Fri, Mar 5, 2010 at 6:25 PM, Houghton,Andrew <[log in to unmask]> wrote:
>
>> OK, I will bite, you stated:
>>
>> 1. That large datasets are a problem.
>> 2. That streaming APIs are a pain to deal with.
>> 3. That tool sets have memory constraints.
>>
>> So how do you propose to process large JSON datasets that:
>>
>> 1. Comply with the JSON specification.
>> 2. Can be read by any JavaScript/JSON processor.
>> 3. Do not require the use of a streaming API.
>> 4. Do not exceed the memory limitations of current JSON processors.
>
> What I'm proposing is that we don't process large JSON datasets; I'm
> proposing that we process smallish JSON documents one at a time by
> pulling them out of a stream based on an end-of-record character.
>
> This is basically what we already do with the MARC21 binary format --
> define a structure for a valid record, and separate multiple well-formed
> record structures with an end-of-record character. This preserves JSON
> specification adherence at the record level and uses a different scheme
> to represent collections. Obviously, MARC-XML uses a different mechanism
> to define a collection of records -- putting well-formed record
> structures inside a <collection> tag.
>
> So... I'm proposing that we define what we mean by a single MARC record
> serialized to JSON (in whatever format; I'm not very opinionated on this
> point) that preserves the order, indicators, tags, data, etc. we need to
> round-trip between MARC21 binary, MARC-XML, and MARC-JSON.
>
> And then we separate those valid records with an end-of-record
> character -- "\n".
>
> Unless I've read all this wrong, you've come to the conclusion that the
> benefit of having a JSON serialization that is valid JSON at both the
> record and the collection level outweighs the pain of having to deal
> with a streaming parser and writer. This allows a single collection to
> be treated as any other JSON document, which has obvious benefits (which
> I certainly don't mean to minimize) and all the drawbacks we've been
> talking about *ad nauseam*.
>
> I go the other way. I think the pain of dealing with a streaming API
> outweighs the benefit of having a single valid JSON structure for a
> collection, and instead I've put forward that we use a combination of
> JSON records and a well-defined end-of-record character ("\n") to
> represent a collection. I recognize that this involves providing
> special-purpose code which must call for JSON deserialization on each
> line, instead of being able to throw the whole stream/file/whatever at
> your JSON parser. I accept that because getting each line of a text
> file is something I find easy compared to dealing with streaming
> parsers.
>
> And our point of disagreement, I think, is that I believe that defining
> the collection structure in such a way that we need two steps (get a
> line; deserialize that line) and can't just call the equivalent of
> JSON.parse(stream) has benefits in ease of implementation and use that
> outweigh the loss of having both a single record and a collection of
> records be valid JSON. And you, I think, don't :-)
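>
> To make the two-step version concrete, here's a minimal sketch in
> Python. The record shape in the comments is just a placeholder -- the
> field names are assumptions for illustration, not a settled format:
>
>     import json
>     import sys
>
>     # Step 1: get a line; step 2: deserialize just that line.
>     # One JSON record per line, records separated by "\n".
>     for line in sys.stdin:
>         line = line.strip()
>         if not line:
>             continue
>         record = json.loads(line)
>         # A hypothetical record shape, preserving leader, tags,
>         # indicators, and field order:
>         # {"leader": "00714cam a2200205 a 4500",
>         #  "fields": [{"tag": "245", "ind1": "1", "ind2": "0",
>         #              "subfields": [{"a": "Some title"}]}]}
>         print(record.get("leader"))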
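>
> The MARC21 analogy works the same way, since binary MARC already ends
> every record with the record terminator byte 0x1D. Again just a sketch,
> assuming a file of raw MARC21 records (handle_record is a hypothetical
> stand-in for whatever parser you hand each record to):
>
>     # Carve a binary MARC21 stream on its record terminator byte
>     # (0x1D). Reading the whole file at once here is for brevity;
>     # a chunked read works the same way.
>     with open("records.mrc", "rb") as f:
>         for raw in f.read().split(b"\x1d"):
>             if raw.strip():
>                 handle_record(raw + b"\x1d")  # handle_record: hypothetical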
>
> I'm going to bow out of this now, unless I've got some part of our
> positions wrong, to let any others that care (which may number zero)
> chime in.
>
> -Bill-
>
> --
> Bill Dueber
> Library Systems Programmer
> University of Michigan Library