I actually just wrote the same exact email as Bill (although probably
not as polite -- I called the marcxml "collection" element a
"contrivance that appears nowhere in marc21").  I even wrote the
"marc21 is EOR character delimited files" bit.  I was hoping to figure
out how to use unix split to make my point, couldn't, and then
discarded my draft.

But I was *right there*.

-Ross.

On Fri, Mar 5, 2010 at 8:48 PM, Bill Dueber <[log in to unmask]> wrote:
> On Fri, Mar 5, 2010 at 6:25 PM, Houghton,Andrew <[log in to unmask]> wrote:
>
>> OK, I will bite, you stated:
>>
>> 1. That large datasets are a problem.
>> 2. That streaming APIs are a pain to deal with.
>> 3. That tool sets have memory constraints.
>>
>> So how do you propose to process large JSON datasets that:
>>
>> 1. Comply with the JSON specification.
>> 2. Can be read by any JavaScript/JSON processor.
>> 3. Do not require the use of streaming API.
>> 4. Do not exceed the memory limitations of current JSON processors.
>>
>>
> What I'm proposing is that we don't process large JSON datasets; I'm
> proposing that we process smallish JSON documents one at a time by pulling
> them out of a stream based on an end-of-record character.
>
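> To make that concrete, here's a rough sketch (Python, untested; the file
> name and field names are made up) of what reading such a stream looks like --
> no streaming parser, just an ordinary parse of one smallish document per line:
>
>     import json
>
>     # Each line is a complete JSON document; "\n" is the end-of-record character.
>     with open("records.json", "r", encoding="utf-8") as f:
>         for line in f:
>             if not line.strip():
>                 continue
>             record = json.loads(line)  # plain, non-streaming parse of one record
>             # ... work with one record's worth of data, then move on ...
>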
> This is basically what we use for MARC21 binary format -- have a defined
> structure for a valid record, and separate multiple well-formed record
> structures with an end-of-record character. This preserves JSON
> specification adherence at the record level and uses a different scheme to
> represent collections. Obviously, MARC-XML uses a different mechanism to
> define a collection of records -- putting well-formed record structures
> inside a <collection> tag.
>
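> For comparison, here's the same trick against MARC21 binary, where the
> end-of-record character is the 0x1D record terminator (again a rough,
> untested sketch that ignores encoding and malformed-record handling):
>
>     with open("records.mrc", "rb") as f:
>         data = f.read()
>
>     for raw in data.split(b"\x1d"):      # split the file on the record terminator
>         if not raw.strip():
>             continue
>         record = raw + b"\x1d"           # put the terminator back for downstream parsers
>         # hand `record` to whatever MARC parser you normally use
>
> The JSON-per-line idea is exactly this, with "\n" in place of 0x1D.
>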
> So... I'm proposing we define what we mean by a single MARC record serialized
> to JSON (in whatever format; I'm not very opinionated on this point) that
> preserves the order, indicators, tags, data, etc. we need to round-trip
> between marc21binary, marc-xml, and marc-json.
>
> And then separate those valid records with an end-of-record character --
> "\n".
>
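> Just to make the shape of the thing concrete -- and this is purely
> illustrative, since the actual field layout is exactly what's up for
> discussion -- a two-record "collection" would be two lines, each line a
> complete JSON document:
>
>     {"leader": "00714cam a2200205 a 4500", "fields": [{"tag": "001", "data": "12345"}, {"tag": "245", "ind1": "1", "ind2": "0", "subfields": [{"a": "A made-up title"}]}]}
>     {"leader": "00623cam a2200193 a 4500", "fields": [{"tag": "001", "data": "67890"}, {"tag": "245", "ind1": "0", "ind2": "0", "subfields": [{"a": "Another made-up title"}]}]}
>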
> Unless I've read all this wrong, you've come to the conclusion that the
> benefit of having a JSON serialization that is valid JSON at both the record
> and collection level outweighs the pain of having to deal with a streaming
> parser and writer.  This allows a single collection to be treated as any
> other JSON document, which has obvious benefits (which I certainly don't
> mean to minimize) and all the drawbacks we've been talking about *ad
> nauseam*.
>
> I go the other way. I think the pain of dealing with a streaming API
> outweighs the benefits of having a single valid JSON structure for a
> collection, and instead have put forward that we use a combination of JSON
> records and a well-defined end-of-record character ("\n") to represent a
> collection.  I recognize that this involves providing special-purpose code
> which must call JSON deserialization on each line, instead of being able
> to throw the whole stream/file/whatever at your JSON parser. I accept
> that because getting each line of a text file is something I find easy
> compared to dealing with streaming parsers.
>
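> For what it's worth, the writer side is the same two-step story in reverse,
> and (rough sketch again) it's also why "\n" is a safe choice: a standard
> JSON serializer escapes any newlines inside string values, so a serialized
> record can never contain a literal line break, and line == record holds:
>
>     import json
>
>     def write_collection(records, out):
>         # serialize each record on its own, then emit the end-of-record character
>         for rec in records:
>             out.write(json.dumps(rec))   # json.dumps escapes "\n" inside strings
>             out.write("\n")
>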
> And our point of disagreement, I think, is that I believe that defining the
> collection structure in such a way that we need two steps (get a line;
> deserialize that line) and can't just call the equivalent of
> JSON.parse(stream) has benefits in ease of implementation and use that
> outweigh the loss of having both a single record and a collection of records
> be valid JSON. And you, I think, don't :-)
>
> I'm going to bow out of this now, unless I've got some part of our positions
> wrong, to let any others that care (which may number zero) chime in.
>
>  -Bill-
>
> --
> Bill Dueber
> Library Systems Programmer
> University of Michigan Library
>