I was a strong proponent of NDJ at one point, but I've grown less strident
and more weary since then.
Brad Baxter has a good overview of some options[1]. I'm assuming it's a
given we'd all prefer to work with valid JSON files if the pain-point can
be brought down far enough.
A couple years have passed since we first talked about this stuff, and the
state of JSON pull-parsers is better than it once was:
* yajl[2] is a super-fast C library for parsing JSON that supports stream
parsing. Bindings for ruby, node, python, and perl are linked to off the
home page. I found one PHP binding[3] on github which is broken/abandoned,
and no other pull-parser for PHP that I can find. Sadly, the ruby wrapper
doesn't actually expose the callbacks necessary for pull-parsing, although
there is a pull request[4] and at least one other option[5].
* Perl's JSON::XS supports incremental parsing.
* The Jackson Java library[6] is excellent and has an easy-to-use
pull-parser. There are a few simplistic efforts to wrap it for JRuby/Jython
use as well.
Pull-parsing is ugly, but no longer astoundingly difficult or slow, with
the possible exception of PHP. And output is simple enough.
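
For a feel of what streaming a JSON array looks like without any of the
libraries above, here is a rough Python sketch. It is not an event-level
pull-parser like yajl or Jackson; it is a stand-in that uses the standard
library's json.JSONDecoder.raw_decode to decode one array element at a time
from a growing buffer. The function name iter_array_items and the chunk_size
parameter are made up for illustration.

```python
import io
import json

def iter_array_items(stream, chunk_size=65536):
    """Yield the elements of a top-level JSON array without loading it whole."""
    decoder = json.JSONDecoder()
    buf = ""
    eof = False
    started = False     # have we consumed the opening '['?
    need_comma = False  # must a ',' come before the next element?
    while True:
        buf = buf.lstrip()
        if buf and not started:
            if buf[0] != "[":
                raise ValueError("expected a JSON array")
            buf, started = buf[1:], True
            continue
        if buf and started:
            if buf[0] == "]":
                return
            if need_comma:
                if buf[0] != ",":
                    raise ValueError("expected ',' or ']'")
                buf, need_comma = buf[1:], False
                continue
            try:
                item, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                if eof:
                    raise  # no more data coming, so this is a real error
            else:
                # Trust the parse only if a delimiter follows it (or input is
                # done); otherwise "12" might be the first half of "123".
                if eof or (end < len(buf) and buf[end] in " \t\r\n,]"):
                    yield item
                    buf, need_comma = buf[end:], True
                    continue
        if eof:
            raise ValueError("unexpected end of input")
        chunk = stream.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True
```

Something like `for record in iter_array_items(open("records.json")): ...`
would then hold roughly one chunk plus one record in memory at a time. A
partial record gets re-parsed each time a new chunk arrives, so this is only
reasonable when records are small relative to chunk_size. It is also slightly
lenient: a trailing comma before the closing ']' is accepted.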
As much as it makes me shudder, I think we're probably better off trying to
do pull parsers and have a marc-in-json document be a valid JSON array.
We could easily adopt a *convention* of, essentially, one-record-per-line,
but wrap it in '[]' to make it valid JSON. That would allow folks with a
pull-parser to write a real streaming reader, and folks without to "cheat"
(ditch the leading and trailing [], and read the rest as
one-record-per-line) until such a time as they can start using a more
full-featured json parser.
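
To make the convention concrete, here is a minimal Python sketch of both
sides: a writer that emits a valid JSON array with one record per line, and
the "cheat" reader that ignores the bracket lines and parses everything else
line by line. The function names and the exact layout ('[' and ']' on their
own lines, a trailing comma after every record but the last) are my
assumptions; nothing above pins those details down.

```python
import io
import json

def write_records(records, out):
    """Write records as a valid JSON array, one record per line."""
    out.write("[\n")
    first = True
    for rec in records:
        if not first:
            out.write(",\n")
        # json.dumps emits no raw newlines by default (they are escaped
        # inside strings), so each record stays on a single line.
        out.write(json.dumps(rec))
        first = False
    out.write("\n]\n")

def read_records_by_line(inp):
    """The "cheat": skip the bracket lines, drop the trailing comma,
    and parse each remaining line as one record."""
    for line in inp:
        line = line.strip()
        if line in ("", "[", "]"):
            continue
        yield json.loads(line.rstrip(","))
```

The same file can be handed to any ordinary JSON parser, which is the whole
point. The cheat reader only works if every writer sticks to the
one-record-per-line layout, so it is a convention rather than something a
JSON parser would enforce.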
1. http://en.wikipedia.org/wiki/User:Baxter.brad/Drafts/JSON_Document_Streaming_Proposal
2. http://lloyd.github.com/yajl/
3. https://github.com/sfalvo/php-yajl
4. https://github.com/brianmario/yajl-ruby/pull/50
5. http://dgraham.github.com/json-stream/
6. http://wiki.fasterxml.com/JacksonHome
On Thu, Dec 1, 2011 at 12:56 PM, Michael B. Klein <[log in to unmask]> wrote:
> +1 to marc-in-json
> +1 to newline-delimited records
> +1 to read support
> +1 to edsu, rsinger, BillDueber, gmcharlt, and the other module maintainers
>
> On Thu, Dec 1, 2011 at 9:31 AM, Keith Jenkins <[log in to unmask]> wrote:
>
> > On Thu, Dec 1, 2011 at 11:56 AM, Gabriel Farrell <[log in to unmask]>
> > wrote:
> > > I suspect newline-delimited will win this race.
> > Yes. Everyone please cast a vote for newline-delimited JSON.
> >
> > Is there any consensus on the appropriate mime type for ndj?
> >
> > Keith
> >
>
--
Bill Dueber
Library Systems Programmer
University of Michigan Library