> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Bill Dueber
> Sent: Friday, March 05, 2010 08:48 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> On Fri, Mar 5, 2010 at 6:25 PM, Houghton,Andrew <[log in to unmask]>
> wrote:
> 
> > OK, I will bite, you stated:
> >
> > 1. That large datasets are a problem.
> > 2. That streaming APIs are a pain to deal with.
> > 3. That tool sets have memory constraints.
> >
> > So how do you propose to process large JSON datasets that:
> >
> > 1. Comply with the JSON specification.
> > 2. Can be read by any JavaScript/JSON processor.
> > 3. Do not require the use of streaming API.
> > 4. Do not exceed the memory limitations of current JSON processors.
> >
> >
> What I'm proposing is that we don't process large JSON datasets; I'm
> proposing that we process smallish JSON documents one at a time by
> pulling
> them out of a stream based on an end-of-record character.
> 
> This is basically what we use for MARC21 binary format -- have a
> defined
> structure for a valid record, and separate multiple well-formed record
> structures with an end-of-record character. This preserves JSON
> specification adherence at the record level and uses a different scheme
> to represent collections. Obviously, MARC-XML uses a different 
> mechanism to define a collection of records -- putting well-formed 
> record structures inside a <collection> tag.
> 
> So... I'm proposing we define what we mean by a single MARC record
> serialized to JSON (in whatever format; I'm not very opinionated 
> on this point) that preserves the order, indicators, tags, data, 
> etc. we need to round-trip between marc21binary, marc-xml, and 
> marc-json.
> 
> And then separate those valid records with an end-of-record character
> -- "\n".

OK, what I see here are divergent use cases and a willingness in the library community to break existing Web standards.  This is how the library community makes its data harder to use: library-centric protocols and standards put up additional barriers for the people and organizations trying to enter its market.

If I were to try to sell this idea to the Web community at large, and tell them that when they send an HTTP request with an Accept: application/json header to our services, we will respond with a 200 HTTP status and deliver malformed JSON, I would be immediately impaled with multiple arrows and daggers :(  Not to mention that OCLC would be disparaged by a certain crowd in their blogs as idiots who cannot follow standards.

OCLC's goal is to use and conform to Web standards so that library data is easier to use for people and organizations outside the library community; otherwise libraries and their data will become irrelevant.  The JSON serialization is a standard, and the Web community expects that an HTTP request with an Accept: application/json header will get back JSON that conforms to that standard.  JSON's main use case is AJAX, where you are not supposed to be sending megabytes of data across the wire.

Your proposal is asking me to break a widely deployed Web standard, one used by AJAX frameworks to access millions (OK, many) of Web sites.

> Unless I've read all this wrong, you've come to the conclusion that the
> benefit of having a JSON serialization that is valid JSON at both the
> record and collection level outweighs the pain of having to deal with
> a streaming parser and writer.  This allows a single collection to be
> treated as any other JSON document, which has obvious benefits (which 
> I certainly don't mean to minimize) and all the drawbacks we've been 
> talking about *ad nauseam*.

The goal is to adhere to existing Web standards, and your underlying assumption is that you can or will be retrieving large datasets through an AJAX scenario.  As I pointed out, this is really an API design issue: given the way AJAX works, you should never design an API in that manner.  With a well-designed API, that assumption does not hold, you will never be put into a scenario requiring JSON streaming, and the argument from this point of view is moot.

But for argument's sake, let's say you could retrieve a line-delimited list of JSON objects.  You can no longer use any existing AJAX framework to get that JSON back, since it's malformed.  You could drop down to the framework's XMLHTTP object to retrieve the line-delimited list, but that still doesn't help, because the XMLHTTP object keeps the entire response in memory.

So when our service sends the user agent 100MB of line-delimited JSON objects, the XMLHTTP object is going to slurp the entire 100MB HTTP response into memory.  That will exceed the memory limits of the JSON/JavaScript processor, or of the browser controlling the XMLHTTP object, and the application will never get the chance to process the records one line at a time.
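To make the memory problem concrete, here is a minimal TypeScript sketch of the only consumption pattern XMLHTTP allows; the /records endpoint and the process() handler are hypothetical.  Note that responseText already holds the complete response body before the first record is ever parsed:

    declare function process(record: unknown): void;  // hypothetical handler

    const xhr = new XMLHttpRequest();
    xhr.open("GET", "/records", true);  // hypothetical endpoint
    xhr.onload = () => {
      // responseText is already the ENTIRE (e.g., 100MB) body in memory;
      // splitting it only adds more string copies on top of that.
      for (const line of xhr.responseText.split("\n")) {
        if (line.length === 0) continue;   // skip the trailing newline
        process(JSON.parse(line));         // one record per line
      }
    };
    xhr.send();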

In addition, I wouldn't be surprised if whatever programming libraries or frameworks you use to read a line from the stream have trouble with lines longer than several thousand characters, a limit that MARC-21 records serialized into JSON could easily exceed.

> I go the other way. I think the pain of dealing with a streaming
> API outweighs the benefits of having a single valid JSON structure for 
> a collection, and instead have put forward that we use a combination 
> of JSON records and a well-defined end-of-record character ("\n") to 
> represent a collection.  I recognize that this involves providing 
> special-purpose code which must call for JSON-deserialization on each 
> line, instead of being able to throw the whole stream/file/whatever 
> at your json parser. I accept that because getting each line of a
> text file is something I find easy compared to dealing with streaming
> parsers.

What I see here are divergent use cases (a rough sketch of each follows the list):

Use case #1: retrieve a single MARC-21 format record serialized as an object according to the JSON specification.

Use case #2: retrieve a collection of MARC-21 format records serialized as an array according to the JSON specification.

Use case #3: retrieve a collection of MARC-21 format records, serializing each record as an object according to the JSON specification, with the restrictions that all whitespace tokens are converted to spaces and that each JSON object is terminated by a newline.
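Here is a rough TypeScript sketch of the three shapes; the leader/fields structure below is a placeholder, not our actual MARC-JSON field layout:

    // Placeholder record shape, not the actual MARC-JSON specification.
    const record = { leader: "00000nam a2200000 a 4500", fields: [] };

    // Use case #1: a single record; valid JSON on its own.
    const single = JSON.stringify(record);

    // Use case #2: a collection as a JSON array; also valid JSON.
    const asArray = JSON.stringify([record, record]);

    // Use case #3: newline-delimited objects. Each line is valid JSON,
    // but the stream as a whole is not.
    const asLines = JSON.stringify(record) + "\n" + JSON.stringify(record) + "\n";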

Personally, I have some minor issues with use case #3 in that it requires the entire serialization to be on one line.  Programming libraries and frameworks often have trouble when line lengths exceed certain buffer sizes.  In addition, collapsing each record onto a single line makes the stream difficult for humans to read when things eventually do go wrong and need human intervention.  Alternatives to single-line serialization would be to terminate each serialized object with a VT (vertical tab), FF (form feed), or a double newline.

Another issue with use case #3 is that it is primarily a file format, meant to be read by library-centric tool chains that feed the individual objects to a JSON/JavaScript processor.  In AJAX scenarios, use case #3 works no differently from, and provides no advantage over, use case #2, because both are limited by the memory constraints of the JSON/JavaScript processor: if you can keep use case #2 in memory, you can keep use case #3 in memory.  A further disadvantage of use case #3 is that existing AJAX frameworks cannot deserialize it, so each application must build its own infrastructure to deserialize the line-delimited JSON objects.

Use cases #2 and #3 diverge because of standards-compliance expectations, so the question becomes: how can use case #3 be made standards compliant?  It seems to me that use case #3 defines a different media type than use cases #1 and #2, whose media type is defined by the JSON specification.  One way to fix this is to say that use cases #1 and #2 conform to the media type application/json, while use case #3 conforms to a new media type, say application/marc+json.  The new application/marc+json media type becomes a library-centric standard without breaking a widely deployed Web standard.
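On the wire, the client would then ask for the serialization it wants explicitly.  A sketch of that negotiation, again assuming a hypothetical /records endpoint and the proposed (not yet registered) media type:

    const xhr = new XMLHttpRequest();
    xhr.open("GET", "/records", true);  // hypothetical endpoint
    // Request the line-delimited serialization explicitly, so that
    // Accept: application/json can keep returning well-formed JSON.
    xhr.setRequestHeader("Accept", "application/marc+json");
    xhr.onload = () => {
      const type = xhr.getResponseHeader("Content-Type") || "";
      if (type.indexOf("application/marc+json") === 0) {
        // caller knows to split on "\n" and parse each line separately
      }
    };
    xhr.send();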

Given the above discussion, use cases #1 and #2 are already covered by our MARC-JSON serialization format and are standards compliant; no changes to our existing specification are required.  Our MARC-JSON serialization of an object (a MARC-21 record) could be used in use case #3 with the restriction, under your current proposal, that the only whitespace tokens in serialized objects are spaces.  Use case #3 can then be satisfied by a separate specification that defines a new media type and a suggested file extension, e.g., application/marc+json and .mrj, parallel to application/marc and .mrc as defined by RFC 2220.


Andy.