On Fri, Mar 5, 2010 at 6:25 PM, Houghton,Andrew <[log in to unmask]> wrote:

> OK, I will bite, you stated:
>
> 1. That large datasets are a problem.
> 2. That streaming APIs are a pain to deal with.
> 3. That tool sets have memory constraints.
>
> So how do you propose to process large JSON datasets that:
>
> 1. Comply with the JSON specification.
> 2. Can be read by any JavaScript/JSON processor.
> 3. Do not require the use of streaming API.
> 4. Do not exceed the memory limitations of current JSON processors.
>
>
What I'm proposing is that we don't process large JSON datasets; I'm
proposing that we process smallish JSON documents one at a time by pulling
them out of a stream based on an end-of-record character.
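
Concretely, the reading side would be something like the following (just a
sketch -- the file name and the per-record work are made up):

    import json

    count = 0
    with open("records.json", encoding="utf-8") as stream:
        for line in stream:                 # one record per line
            if line.strip():
                record = json.loads(line)   # each line is itself valid JSON
                count += 1                  # do real per-record work here
    print(count, "records read")

Memory use stays flat no matter how many records are in the file, because
only one record is ever parsed at a time.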

This is basically what we use for MARC21 binary format -- have a defined
structure for a valid record, and separate multiple well-formed record
structures with an end-of-record character. This preserves JSON
specification adherence at the record level and uses a different scheme to
represent collections. Obviously, MARC-XML uses a different mechanism to
define a collection of records -- putting well-formed record structures
inside a <collection> tag.

So... I'm proposing we define what we mean by a single MARC record serialized
to JSON (in whatever format; I'm not very opinionated on this point) that
preserves the order, indicators, tags, data, etc. we need to round-trip
between MARC21 binary, MARC-XML, and MARC-JSON.
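
Just to make that concrete, a single record might look something like this --
purely illustrative, since the actual field names and layout are exactly
what's up for discussion:

    {"leader": "00714cam a2200205 a 4500",
     "fields": [
       {"001": "12345"},
       {"245": {"ind1": "1", "ind2": "0",
                "subfields": [{"a": "Example title /"}, {"c": "by Somebody."}]}}
     ]}

Collapsed onto one line, that's one record; a collection is just a bunch of
those lines.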

And then separate those valid records with an end-of-record character --
"\n".

Unless I've read all this wrong, you've come to the conclusion that the
benefit of having a JSON serialization that is valid JSON at both the record
and collection level outweighs the pain of having to deal with a streaming
parser and writer.  This allows a single collection to be treated as any
other JSON document, which has obvious benefits (which I certainly don't
mean to minimize) and all the drawbacks we've been talking about *ad nauseam*.

I go the other way. I think the pain of dealing with a streaming API
outweighs the benefits of having a single valid JSON structure for a
collection, and instead have put forward that we use a combination of JSON
records and a well-defined end-of-record character ("\n") to represent a
collection.  I recognize that this involves providing special-purpose code
which must call JSON deserialization on each line, instead of being able
to throw the whole stream/file/whatever at your JSON parser. I accept
that because getting each line of a text file is something I find easy
compared to dealing with streaming parsers.
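
For what it's worth, the writing side is just as simple (again, only a sketch,
with stand-in record structures), and since JSON encoders escape any literal
newlines inside strings, "\n" is safe as a record separator:

    import json

    records = [
        {"leader": "...", "fields": []},    # stand-ins for real MARC records
        {"leader": "...", "fields": []},
    ]

    with open("collection.json", "w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")   # one valid JSON document per line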

And our point of disagreement, I think, is that I believe that defining the
collection structure in such a way that we need two steps (get a line;
deserialize that line) and can't just call the equivalent of
JSON.parse(stream) has benefits in ease of implementation and use that
outweigh the loss of having both a single record and a collection of records
be valid JSON. And you, I think, don't :-)

I'm going to bow out of this now, unless I've got some part of our positions
wrong, to let any others that care (which may number zero) chime in.

 -Bill-

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library