Hi Fernando,

I have started experimenting with MARC in MongoDB. I have imported ~6 million MARC records (authority and bibliographic) into MongoDB. The steps I took:

1) The source was MARCXML that I created with the XC OAI Toolkit.
2) I wrote an XSLT file that creates MARC-JSON from MARCXML. I followed the MARC-JSON draft, not Bill Dueber's MARC-HASH:
   http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11
   The conversion is not 100% perfect, but of the 6 million records only 20 were converted with errors, which is a good enough error rate for a home-made project.
3) Imported the files.
4) Indexed the files.

Lessons learned:
- The import is much quicker than any other part of the workflow. The 6 million records were imported in about 30 minutes, while indexing took 3 hours.
- count() is a very slow method for complex queries, even after intensive indexing, but iterating over the results is quicker.
- There is no way to index parts of strings (e.g. to split the leader or the 006/007/008 fields).
- Full-text search is not very quick.
- Before indexing, the size of the index was 9 GB; after full indexing it was 28 GB (I should note that on a 32-bit operating system the maximum size of a Mongo index is 2 GB).

Conclusions:
- The MARC-JSON format is good for data exchange, but it is not precise enough for searching, since (MARC heritage) distinct pieces of information are combined into single fields (Leader, 008, etc.). We should split them into smaller chunks of information before indexing.
- I should learn more about the possibilities of MongoDB.

I can give you more technical details if you are interested.

Péter
eXtensible Catalog

----- Original Message -----
From: "Fernando Gómez"
Sent: Thursday, May 13, 2010 2:59 PM
Subject: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

> There's been some talk in code4lib about using MongoDB to store MARC
> records in some kind of JSON format. I'd like to know if you have
> experimented with indexing those documents in MongoDB.
> From my limited exposure to MongoDB, it seems difficult, unless MongoDB
> supports some kind of "custom indexing" functionality.
>
> According to the MongoDB docs [1], "you can create an index by calling
> the ensureIndex() function, and providing a document that specifies
> one or more keys to index." Examples of this are:
>
>   db.things.ensureIndex({"city": 1})
>   db.things.ensureIndex({"address.city": 1})
>
> That is, you specify the keys giving a path from the root of the
> document to the data element you are interested in. Such a path acts
> both as the index's name, and as an specification of how to get the
> keys's values.
>
> In the case of two proposed MARC-JSON formats [2, 3], I can't see such
> "path". For example, say you want an index on field 001. Simplifying,
> the JSON docs would look like this
>
>   { "fields" : [ ["001", "001 value"], ... ] }
>
> or this
>
>   { "controlfield" : [ { "tag" : "001", "data" : "fst01312614" }, ... ] }
>
> How would you specify field 001 to MongoDB?
>
> It would be nice to have some kind of custom indexing, where one could
> provide an index name and separately a JavaScript function specifying
> how to obtain the keys's values for that index.
>
> Any suggestions? Do other document oriented databases offer a better
> solution for this?
>
>
> BTW, I fed MongoDB with the example MARC records in [2] and [3], and
> it choked on them. Both are missing some commas :-)
>
>
> [1] http://www.mongodb.org/display/DOCS/Indexes
> [2] http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
> [3] http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11
>
>
> --
> Fernando Gómez
> Biblioteca "Antonio Monteiro"
> INMABB (Conicet / Universidad Nacional del Sur)
> Av. Alem 1253
> B8000CPB Bahía Blanca, Argentina
> Tel. +54 (291) 459 5116
> http://inmabb.criba.edu.ar/
>
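
Two points in this thread lend themselves to a small sketch: the missing "path" to field 001, and the suggestion to split the Leader and 008 into smaller chunks before indexing. What follows is only an illustration, not the conversion actually used for the 6 million records. It assumes the second record shape quoted above ({ "controlfield" : [ { "tag" : ..., "data" : ... } ] }) plus a top-level "leader" string, which is an assumption. The function name reshapeForIndexing, the output key names, and the sample values are all hypothetical. The character positions are the MARC 21 bibliographic ones (Leader/05-07; 008/00-05, 07-10, 15-17, 35-37); authority records lay out the 008 differently.

```javascript
// Hypothetical helper: key control fields by tag and split the Leader and
// 008 into named chunks, so that each piece of information gets its own
// dotted path (e.g. "control.001", "f008.language").
function reshapeForIndexing(record) {
  const out = { control: {}, leaderParts: {}, f008: {} };

  // Key each control field by its tag, e.g. out.control["001"].
  for (const field of record.controlfield || []) {
    out.control[field.tag] = field.data;
  }

  // Leader positions 05, 06 and 07 (MARC 21).
  const leader = record.leader || "";
  if (leader.length >= 8) {
    out.leaderParts = {
      recordStatus: leader.charAt(5),
      typeOfRecord: leader.charAt(6),
      bibLevel: leader.charAt(7),
    };
  }

  // 008 positions shared by all bibliographic material types (MARC 21).
  const f008 = out.control["008"] || "";
  if (f008.length >= 38) {
    out.f008 = {
      dateEntered: f008.substring(0, 6),     // 00-05
      date1: f008.substring(7, 11),          // 07-10
      place: f008.substring(15, 18).trim(),  // 15-17
      language: f008.substring(35, 38),      // 35-37
    };
  }
  return out;
}

// Made-up 008 value, assembled piece by piece so the positions are explicit.
const sample008 =
  "100513" +        // 00-05 date entered on file
  "s" +             // 06    type of date
  "2010" +          // 07-10 date 1
  "    " +          // 11-14 date 2
  "hu " +           // 15-17 place of publication
  " ".repeat(17) +  // 18-34 material-specific positions
  "eng" +           // 35-37 language
  " " +             // 38    modified record
  "d";              // 39    cataloging source

// Made-up sample record for illustration only.
const sample = {
  leader: "00000cam a2200000 a 4500",
  controlfield: [
    { tag: "001", data: "fst01312614" },
    { tag: "008", data: sample008 },
  ],
};

const doc = reshapeForIndexing(sample);
console.log(JSON.stringify(doc, null, 2));
```

With documents reshaped this way, an index could in principle be declared on a dotted path such as "control.001" or "f008.language" with the ensureIndex() call quoted above. How that behaves on a collection of this size has not been tested here.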