> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Benjamin Young
> Sent: Monday, March 08, 2010 09:32 AM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] Q: XML2JSON converter
> 
> Rather than using a newline-delimited format (the whole of which would
> not together be considered a valid JSON object) why not use the JSON
> array format with or without new lines? Something like:
> 
> [{"key":"value"}, {"key":"value"}]
> 
> You could include new line delimiters after the "," if you needed to
> make pre-parsing easier (in a streaming context), but may be able to
> get away with just looking for the next "," or "]" after each valid
> JSON object.
> 
> That would allow the entire stream, if desired, to be saved to disk and
> read in as a single JSON object, or the same API to serve smaller JSON
> collections in a JSON standard way.

I think we have just come full circle again.  There appear to be two distinct use cases when dealing with MARC collections.  The first conforms to the ECMA 262 JSON subset, which is what you described above:

[ { "key" : "value" }, { "key" : "value" } ]

Its media type should be specified as application/json.

The second use case, which Bill Dueber and I discussed, is a newline-delimited format in which the JSON array specifiers are omitted and the objects are written one per line, with no commas separating them.  The misunderstanding between Bill and me was that I thought this "malformed" JSON was being sent as media type application/json, which is not what he was proposing.  This newline-delimited JSON appears to be an import/export format in both CouchDB and MongoDB.
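
To make the difference between the two use cases concrete, here is a minimal Python sketch of how a consumer might read each form.  The function names and the sample records are illustrative assumptions, not part of anything proposed in this thread:

```python
import json

def read_json_array(text):
    """Parse a standard JSON array of objects (the application/json case)."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array")
    return records

def read_newline_delimited(text):
    """Parse one JSON object per line, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Hypothetical sample data in each form.
array_form = '[ { "key" : "value" }, { "key" : "value" } ]'
line_form = '{ "key" : "value" }\n{ "key" : "value" }\n'

# Both readers recover the same list of records.
assert read_json_array(array_form) == read_newline_delimited(line_form)
```

Note that handing the newline-delimited text to json.loads as a whole fails, which is why it should not be labeled application/json.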

In the FAST work I'm doing I'm probably going to take an alternate approach to generating our 10,000 MARC record collection files for download.  The approach I'm going to take is to create valid JSON but make it easier for the CouchDB and MongoDB folks to import the collection of records.  The format will be:

[
{ "key" : "value" }
,
{ "key" : "value" }
]

The objects will be one per line, but the array specifiers and the comma delimiters between objects will each appear on a separate line.  This would allow the CouchDB and MongoDB folks to run a simple sed script on the file before import:

sed -e '/^.$/D' file.json > file.txt

or, if they are reading the data as a raw text file, they can just ignore all lines that start with an opening bracket, a comma, or a closing bracket, or alternatively process only lines starting with an opening brace.
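
As a rough sketch of both sides of this format in Python: one function writes the records so that the whole file is still a valid JSON array, and another reads it back by processing only the lines that start with an opening brace.  The names write_hybrid and read_objects and the sample records are hypothetical, chosen only for illustration:

```python
import io
import json

def write_hybrid(records, out):
    """Write a valid JSON array with each object on its own line and the
    brackets and commas on lines of their own."""
    out.write("[\n")
    for i, record in enumerate(records):
        if i:
            out.write(",\n")
        # json.dumps (without indent) emits no raw newlines, so each
        # record stays on a single line.
        out.write(json.dumps(record) + "\n")
    out.write("]\n")

def read_objects(lines):
    """Yield the parsed object from every line that starts with '{',
    ignoring the bracket and comma lines."""
    for line in lines:
        if line.startswith("{"):
            yield json.loads(line)

# Hypothetical sample collection.
records = [{"key": "value"}, {"key": "value"}]
buf = io.StringIO()
write_hybrid(records, buf)
text = buf.getvalue()

# The file parses as ordinary JSON and also line by line.
assert json.loads(text) == records
assert list(read_objects(text.splitlines())) == records
```

The line-by-line reader never needs to hold the whole collection in memory, which is the main attraction for a 10,000 record file.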

However, this doesn't mean that I'm backing away from pursuing a separate media type for the library community that defines a specific MARC JSON serialization encoded as a single line.

I see multiple steps here, the first being a consensus on serializing MARC (ISO 2709) in JSON.  That begins with me documenting it so people can throw some darts at it.  I don't think what we are proposing is controversial, but it's beneficial to have a variety of perspectives as input.


Andy.