On Fri, Mar 5, 2010 at 6:25 PM, Houghton,Andrew <[log in to unmask]> wrote:
> OK, I will bite, you stated:
> 1. That large datasets are a problem.
> 2. That streaming APIs are a pain to deal with.
> 3. That tool sets have memory constraints.
> So how do you propose to process large JSON datasets that:
> 1. Comply with the JSON specification.
> 2. Do not require the use of a streaming API.
> 3. Do not exceed the memory limitations of current JSON processors.
What I'm proposing is that we don't process large JSON datasets; I'm
proposing that we process smallish JSON documents one at a time by pulling
them out of a stream based on an end-of-record character.
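To make that concrete, here's the reading side as a minimal Python sketch
(purely illustrative -- the filename and what you'd do with each record are
assumptions on my part):

    import json

    # One JSON-serialized record per line; only the current record
    # is ever held in memory, no matter how big the file is.
    with open("records.ndj", encoding="utf-8") as f:
        for line in f:
            if line.strip():                # skip blank lines defensively
                record = json.loads(line)   # each line is a complete JSON document
                # ... process one record at a time ...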
This is basically what we use for MARC21 binary format -- have a defined
structure for a valid record, and separate multiple well-formed record
structures with an end-of-record character. This preserves JSON
specification adherence at the record level and uses a different scheme to
represent collections. Obviously, MARC-XML uses a different mechanism to
define a collection of records -- putting well-formed record structures
inside a <collection> tag.
So... I'm proposing that we define what we mean by a single MARC record
serialized
to JSON (in whatever format; I'm not very opinionated on this point) that
preserves the order, indicators, tags, data, etc. we need to round-trip
between marc21binary, marc-xml, and marc-json.
And then separate those valid records with an end-of-record character.
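Just to show the shape of the thing, a collection on disk might then look
like this (the record layout below is strictly illustrative -- again, I'm
not opinionated about the actual serialization):

    {"leader":"00101nam a22000618a 4500","fields":[{"001":"000000001"}]}
    {"leader":"00101nam a22000618a 4500","fields":[{"001":"000000002"}]}
    {"leader":"00101nam a22000618a 4500","fields":[{"001":"000000003"}]}

Each line is valid JSON on its own; the file as a whole deliberately is not
a single JSON document.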
Unless I've read all this wrong, you've come to the conclusion that the
benefit of having a JSON serialization that is valid JSON at both the record
and collection level outweighs the pain of having to deal with a streaming
parser and writer. This allows a single collection to be treated as any
other JSON document, which has obvious benefits (which I certainly don't
mean to minimize) and all the drawbacks we've been talking about *ad
nauseam*.
I go the other way. I think the pain of dealing with a streaming API
outweighs the benefits of having a single valid JSON structure for a
collection, and instead have put forward that we use a combination of JSON
records and a well-defined end-of-record character ("\n") to represent a
collection. I recognize that this involves providing special-purpose code
that must run JSON deserialization on each line, instead of being able to
throw the whole stream/file/whatever at your JSON parser. I accept
that because getting each line of a text file is something I find easy
compared to dealing with streaming parsers.
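The writing side is just as simple, and it's safe: any JSON serializer
escapes a literal newline inside a string as \n, so a serialized record can
never contain a raw end-of-record character. Another illustrative Python
sketch, same caveats as above:

    import json

    def write_collection(records, path):
        # json.dumps escapes embedded newlines, so each record is
        # guaranteed to occupy exactly one line in the output file.
        with open(path, "w", encoding="utf-8") as f:
            for record in records:
                f.write(json.dumps(record))
                f.write("\n")   # the end-of-record character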
And our point of disagreement, I think, is that I believe that defining the
collection structure in such a way that we need two steps (get a line;
deserialize that line) and can't just call the equivalent of
JSON.parse(stream) has benefits in ease of implementation and use that
outweigh the loss of having both a single record and a collection of records
be valid JSON. And you, I think, don't :-)
I'm going to bow out of this now, unless I've got some part of our positions
wrong, to let any others that care (which may number zero) chime in.
Library Systems Programmer
University of Michigan Library