> There's been some talk in code4lib about using MongoDB to store MARC
> records in some kind of JSON format. I'd like to know if you have
> experimented with indexing those documents in MongoDB. From my limited
> exposure to MongoDB, it seems difficult, unless MongoDB supports some
> kind of "custom indexing" functionality.

First things first: it depends on what kind of "indexing" you're looking to do. I haven't worked with CouchDB (yet), but I have with MongoDB, and although it's a great (and fast) data store, it has the same "basic" style of indexing as SQL databases.  That is, you can do exact matches, some simple regexes (usually left-anchored), and then of course there is all the power of map/reduce (Mongo does map/reduce as well as Couch).

Doing funkier full-text indexing is one of the priorities for upcoming MongoDB development, as I understand it.  In the interim, it might be worth having a look at ElasticSearch (http://www.elasticsearch.com/). It's based on Lucene and has its own DSL to support fuzzy querying.  I've been playing with it and it seems like a smart NoSQL implementation, albeit subtly different from Mongo or Couch.

>    { "fields" : [ ["001", "001 value"], ... ] }
> 
> or this
> 
>    { "controlfield" : [ { "tag" : "001", "data" : "fst01312614" }, ... ] }
> 
> How would you specify field 001 to MongoDB?

I think you would do this using dot notation, e.g.  db.records.find( { "controlfield.tag" : "001" } )

But I don't know enough about MARC-in-JSON to say exactly.  Have a look at:

http://www.mongodb.org/display/DOCS/Dot+Notation+%28Reaching+into+Objects%29
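To make the dot-notation idea concrete, here is a minimal sketch in plain Python. It does not talk to MongoDB at all: resolve() and matches() are hypothetical helpers that only imitate, roughly, how a dotted path such as "controlfield.tag" reaches into an array of sub-documents. The 003 field is made up for illustration. One thing it does show is that a query like this selects whole records that contain an 001 field; it does not by itself hand back that field's value.

```python
# Hypothetical MARC-in-JSON record using the "controlfield" shape quoted above.
# The 003 entry is invented for illustration only.
record = {
    "controlfield": [
        {"tag": "001", "data": "fst01312614"},
        {"tag": "003", "data": "OCoLC"},
    ]
}


def resolve(doc, path):
    """Return every value reachable at a dotted path, descending into lists.

    Simplified imitation of dot notation, not MongoDB's implementation.
    """
    values = [doc]
    for key in path.split("."):
        next_values = []
        for value in values:
            # Fan out across list elements so "a.b" reaches into each item of a.
            items = value if isinstance(value, list) else [value]
            for item in items:
                if isinstance(item, dict) and key in item:
                    next_values.append(item[key])
        values = next_values
    return values


def matches(doc, query):
    """True if, for every dotted path in the query, some resolved value equals the target."""
    return all(target in resolve(doc, path) for path, target in query.items())


# Same query document as in the shell example above.
query = {"controlfield.tag": "001"}
print(matches(record, query))
print(resolve(record, "controlfield.data"))
```

The other layout quoted above, { "fields" : [ ["001", "001 value"], ... ] }, uses positional arrays rather than named keys, so there is no "tag" key for a dotted path to name; that is one practical argument for the "controlfield" shape if querying by tag matters.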

> It would be nice to have some kind of custom indexing, where one could
> provide an index name and separately a JavaScript function specifying
> how to obtain the keys's values for that index.
> 
> Any suggestions? Do other document oriented databases offer a better
> solution for this?

My understanding is that indexes, in MongoDB at least, operate much like they do in an SQL RDBMS: they are used to pre-hash field values for performance, rather than having to be explicitly defined before a field can be queried.  That is, I *believe* that if you don't explicitly do an ensureIndex() on a field, you can still query it, but it'll be slower.  But I may be wrong.
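As an analogy only, the difference between querying with and without an index can be sketched in plain Python. Everything here is hypothetical: the records are invented, and a dict merely stands in for an index, with no claim about the data structure MongoDB uses internally. The point is just that both lookups return the same answer, and the index changes how much work it takes to get there.

```python
# Hypothetical records; only the "id" field matters for this illustration.
records = [{"id": f"rec{n}", "title": f"Title {n}"} for n in range(1000)]


def find_by_scan(records, value):
    """Without an index: look at every record and keep the ones that match."""
    return [r for r in records if r["id"] == value]


def build_index(records, field):
    """Precompute a mapping from each field value to the records that hold it."""
    index = {}
    for r in records:
        index.setdefault(r[field], []).append(r)
    return index


def find_by_index(index, value):
    """With an index: one lookup instead of a pass over every record."""
    return index.get(value, [])


id_index = build_index(records, "id")

# Both approaches return the same records; only the cost differs.
print(find_by_scan(records, "rec42") == find_by_index(id_index, "rec42"))
```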

> BTW, I fed MongoDB with the example MARC records in [2] and [3], and
> it choked on them. Both are missing some commas :-)
> 
> [1] http://www.mongodb.org/display/DOCS/Indexes
> [2] http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
> [3] http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11

Not to start a flame war, but from my point of view, it seems rather strange for us to go through all this learning of new technology only to stuff MARC into it.  That's not to say it can't be done, or there aren't valid use cases for doing such a thing, but just that it seems like an odd juxtaposition.

I realize this is a bit at odds with my evangelizing at C4LN on "merging old and new", but really, being limited to the MARC data model with all the flexibility of NoSQL seems kind of like having a Ferrari and then setting the speed limiter at 50km/h.  Fun to drive, I _suppose_.

MJ