LISTSERV 16.5 - CODE4LIB Archives

On 3/6/10 6:59 PM, Houghton,Andrew wrote:
>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>> Bill Dueber
>> Sent: Saturday, March 06, 2010 05:11 PM
>> To: [log in to unmask]
>> Subject: Re: [CODE4LIB] Q: XML2JSON converter
>>
>> Anyway, hopefully, it won't be a huge surprise that I don't disagree
>> with any of the quote above in general; I would assert, though, that
>> application/json and application/marc+json should both return JSON
>> (in the same way that text/xml, application/xml, and
>> application/marc+xml can all be expected to return XML).
>> Newline-delimited json is starting to crop up in a few places
>> (e.g. couchdb) and should probably have its own mime type
>> and associated extension. So I would say something like:
>>
>> application/json -- return json (obviously)
>> application/marc+json  -- return json
>> application/marc+ndj  -- return newline-delimited json
>>      
> This sounds like consensus on how to deal with newline-delimited JSON in a standards based manner.
>
> I'm not familiar with CouchDB, but I am using MongoDB which is similar.  I'll have to dig into how they deal with this newline-delimited JSON.  Can you provide any references to get me started?
>    
Rather than using a newline-delimited format (which, taken as a whole,
is not a valid JSON document), why not use the JSON array format, with
or without newlines? Something like:

[{"key":"value"}, {"key":"value"}]

You could include newline delimiters after the "," if you needed to
make pre-parsing easier (in a streaming context), but you may be able to
get away with just looking for the next "," or "]" after each valid JSON
object.

That would allow the entire stream, if desired, to be saved to disk and
read in as a single JSON document, or the same API to serve smaller JSON
collections in a standard JSON way.
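
A minimal sketch of that streaming idea in Python, assuming the payload is
already available as a string. `iter_array_items` is a hypothetical helper
name, not part of any library; it relies only on the standard library's
`json.JSONDecoder.raw_decode`, and it is deliberately lenient about
separators:

```python
import json

def iter_array_items(text):
    """Yield each element of a top-level JSON array, one at a time."""
    decoder = json.JSONDecoder()
    pos = 0
    # Skip leading whitespace, then require the opening "[".
    while text[pos].isspace():
        pos += 1
    if text[pos] != "[":
        raise ValueError("expected a JSON array")
    pos += 1
    while True:
        # Skip whitespace (including newlines) and "," separators.
        while text[pos].isspace() or text[pos] == ",":
            pos += 1
        if text[pos] == "]":
            return
        # Decode exactly one JSON value starting at pos.
        item, pos = decoder.raw_decode(text, pos)
        yield item

stream = '[{"key": "value"},\n {"key": "other"}]'
records = list(iter_array_items(stream))
```

The same `stream` string can also be handed to `json.loads` in one go, which
is the point of keeping the array wrapper.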

CouchDB uses this array notation when returning multiple document 
revisions in one request. CouchDB also offers a slightly more annotated 
structure (which might be useful with streaming as well):

{
   "total_rows": 2,
   "offset": 0,
   "rows":[{"key":"value"}, {"key":"value"}]
}

"rows" here plays the same role as the above array-based format, but
provides an initial row count for the consumer to use (if it wants) for
knowing what's ahead. The "offset" key is specific to CouchDB, but
similar application-specific information could be stored in the "header"
of the JSON object using this method.
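
A rough sketch of producing and consuming that envelope in Python.
`wrap_rows` is a hypothetical helper, and defaulting `total_rows` to the
number of rows in the response is a simplifying assumption made here, not
a statement about how CouchDB computes it:

```python
import json

def wrap_rows(rows, total_rows=None, offset=0):
    """Wrap a list of records in a header-plus-rows envelope."""
    if total_rows is None:
        # Assumption for this sketch: everything fits in one response.
        total_rows = len(rows)
    return {"total_rows": total_rows, "offset": offset, "rows": rows}

body = json.dumps(wrap_rows([{"key": "value"}, {"key": "value"}]))

# A consumer can read the header first, then walk the rows.
parsed = json.loads(body)
expected = parsed["total_rows"]
received = len(parsed["rows"])
```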
>> In all cases, we should agree on a standard record serialization,
>> though, and the pure-json returns should include something that
>> indicates what the heck it is (hopefully a URI that can act as a
>> distinct "namespace"-type identifier, including a version in it).
>>      
> I agree that our MARC-JSON serialization needs some "namespace" identifier in it and it occurred to me that the way it is handling indicators, e.g., ind1 and ind2 properties, might be better handled as an array to accommodate IFLA's MARC-XML-ish where they can have from 1-9 indicator values.
>
> BTW, our MARC-JSON content is specified in Unicode not MARC-8, per the JSON standard, which means you need to use \uXXXX notation to specify characters in strings, not sure I made that clear in earlier posts.  A downside to the current ECMA 262 specification is that it doesn't support \U00XXXXXX, as Python does, for the extended characters.  Hopefully that will get rectified in a future ECMA 262 specification.
>
>    
>> The question for me, I think, is whether within this community,  anyone
>> who provides one of these types (application/marc+json and
>> application/marc+ndj) should automatically be expected to provide both.
>> I don't have an answer for that.
>>      
As far as mime-type declarations go in general, I'd recommend avoiding
any format-specific mime types, sticking to application/json, and
providing document-level hints (if needed) for the content type. If you
do find a need for the special-case mime types, I'd recommend still
responding to Accept: application/json whenever possible--for the sake
of standards. :)
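
A minimal sketch of that fallback in Python. `choose_content_type` is a
hypothetical function, application/marc+json is only the type proposed
earlier in this thread (not a registered one), and q-values are ignored
for brevity:

```python
def choose_content_type(accept_header):
    """Pick a response type for a hypothetical MARC-in-JSON endpoint."""
    supported = ["application/marc+json", "application/json"]
    if not accept_header.strip():
        # No Accept header: any type is acceptable, so use plain JSON.
        return "application/json"
    # Split into media ranges and drop parameters such as ";q=0.9".
    requested = [part.split(";")[0].strip().lower()
                 for part in accept_header.split(",")]
    for media_type in requested:
        if media_type in supported:
            return media_type
        if media_type in ("*/*", "application/*"):
            # Wildcards get the plain, standard type.
            return "application/json"
    return None  # caller would answer 406 Not Acceptable

plain = choose_content_type("text/html, application/json;q=0.9")
special = choose_content_type("application/marc+json")
```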

All told, I'm just glad to see this discussion happening. I'll be happy
to provide some CouchDB test cases (replication, etc.) if that's of
interest to anyone.

Thanks,
Benjamin
> I think this issue gets into familiar territory when dealing with RDF formats.  Let's see, there is N3, NT, XML, Turtle, etc.  Do you need to provide all of them?  No, but it's nice of the server to at least provide NT or Turtle and XML.  Ultimately it's up to the server.  But the only difference between use cases #2 and #3 is whether the output is wrapped in an array, so it's probably easy for the server to produce both.
>
> Depending on how much time I get next week I'll talk with the developer network folks to see what I need to do to put a specification under their infrastructure.  Looks like from my schedule it's going to be another week of hell :(
>
>
> Andy.
>