Hi Fernando,

I have started experimenting with MARC in MongoDB. I have imported ~6 million
MARC records (auth and bib) into MongoDB. The steps I took:

1) The source was MARCXML that I created with the XC OAI Toolkit.
2) I created an XSLT file that produces MARC-JSON from MARCXML.
   I followed the MARC-JSON draft
   (http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11) and not
   Bill Dueber's MARC HASH. The conversion is not 100% perfect, but of the
   6 million records only 20 were converted with errors, which is a good
   enough error rate for a home-made project.
3) I imported the files.
4) I indexed the files.
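
For illustration only (this is not the code used for the import): the
question in the original message was how to give MongoDB a path to field 001
when the control fields sit in an array. One possible approach, sketched
below, is to copy the control fields into a tag-keyed sub-document before
import, so that a dotted path exists. The helper name, the "cf" key and the
"records" collection name are assumptions, not part of the MARC-JSON draft.

```javascript
// Hypothetical helper: copies control fields into a tag-keyed sub-document
// so that a dotted path such as "cf.001" exists for indexing.
// Note: if a tag repeats (006 and 007 may), the last occurrence wins here.
function addControlFieldLookup(record) {
  const cf = {};
  for (const field of record.controlfield || []) {
    cf[field.tag] = field.data;
  }
  // Return a shallow copy; the original "controlfield" array is kept as-is.
  return Object.assign({}, record, { cf: cf });
}

const doc = addControlFieldLookup({
  controlfield: [{ tag: "001", data: "fst01312614" }]
});
console.log(doc.cf["001"]);

// In the mongo shell the index could then be requested with:
//   db.records.ensureIndex({ "cf.001": 1 })
```

The trade-off is some duplicated data in every document in exchange for a
simple path that ensureIndex() can use.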

Lessons learned:
- The import is much quicker than any other part of the workflow. The
  6 million records were imported in about 30 minutes, while indexing took
  3 hours.
- count() is a very slow method for complex queries, even after intensive
  indexing, but iterating over the results is quicker.
- There is no way to index parts of strings (e.g. splitting the leader or
  the 006/007/008 fields).
- Full text search is not very quick.
- Before indexing the size of the index was 9 GB; after full indexing it was
  28 GB. (I should note that on a 32-bit operating system the max size of
  a mongo index is 2 GB.)

Conclusions:
- The MARC-JSON format is good for data exchange, but it is not precise
  enough for searching, since (a MARC heritage) distinct pieces of
  information are combined into single fields (Leader, 008 etc.). We should
  split them into smaller chunks of information before indexing.
- I should learn more about the possibilities of MongoDB.
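
A minimal sketch of the splitting mentioned in the first conclusion, assuming
MARC 21 bibliographic records (the 008 layout of authority records is
different and would need its own function). The character positions follow
the MARC 21 bibliographic format; the function names, key names and the
sample values are hypothetical.

```javascript
// Split the 24-character leader into separately indexable values.
function splitLeader(leader) {
  return {
    recordStatus: leader.charAt(5),        // Leader/05
    typeOfRecord: leader.charAt(6),        // Leader/06
    bibliographicLevel: leader.charAt(7),  // Leader/07
    encodingLevel: leader.charAt(17)       // Leader/17
  };
}

// Split the positions of the 40-character 008 field that are shared by
// all bibliographic material types (18-34 are material-specific).
function split008(value) {
  return {
    dateEntered: value.substring(0, 6),          // 008/00-05
    typeOfDate: value.charAt(6),                 // 008/06
    date1: value.substring(7, 11),               // 008/07-10
    date2: value.substring(11, 15),              // 008/11-14
    placeOfPublication: value.substring(15, 18), // 008/15-17
    language: value.substring(35, 38)            // 008/35-37
  };
}

// Made-up sample values, built piece by piece to keep the positions visible.
const sampleLeader = "00714cam a2200205 a 4500";
const sample008 = "850423" + "s" + "1985" + "    " + "nyu" +
                  " ".repeat(17) + "eng" + " " + "d";

console.log(splitLeader(sampleLeader));
console.log(split008(sample008));
```

The resulting objects could be stored next to the original fields, so each
value gets its own key and therefore its own path for ensureIndex().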

I can give you more technical details if you are interested.

Péter
eXtensible Catalog


----- Original Message ----- 
From: "Fernando Gómez" <[log in to unmask]>
To: <[log in to unmask]>
Sent: Thursday, May 13, 2010 2:59 PM
Subject: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?


> There's been some talk in code4lib about using MongoDB to store MARC
> records in some kind of JSON format. I'd like to know if you have
> experimented with indexing those documents in MongoDB. From my limited
> exposure to MongoDB, it seems difficult, unless MongoDB supports some
> kind of "custom indexing" functionality.
>
> According to the MongoDB docs [1], "you can create an index by calling
> the ensureIndex() function, and providing a document that specifies
> one or more keys to index." Examples of this are:
>
>    db.things.ensureIndex({"city": 1})
>    db.things.ensureIndex({"address.city": 1})
>
> That is, you specify the keys giving a path from the root of the
> document to the data element you are interested in. Such a path acts
> both as the index's name, and as a specification of how to get the
> keys' values.
>
> In the case of two proposed MARC-JSON formats [2, 3], I can't see such
> "path". For example, say you want an index on field 001. Simplifying,
> the JSON docs would look like this
>
>    { "fields" : [ ["001", "001 value"], ... ] }
>
> or this
>
>    { "controlfield" : [ { "tag" : "001", "data" : "fst01312614" }, ... ] }
>
> How would you specify field 001 to MongoDB?
>
> It would be nice to have some kind of custom indexing, where one could
> provide an index name and separately a JavaScript function specifying
> how to obtain the keys' values for that index.
>
> Any suggestions? Do other document oriented databases offer a better
> solution for this?
>
>
> BTW, I fed MongoDB with the example MARC records in [2] and [3], and
> it choked on them. Both are missing some commas :-)
>
>
> [1] http://www.mongodb.org/display/DOCS/Indexes
> [2] http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
> [3] http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11
>
>
> -- 
> Fernando Gómez
> Biblioteca "Antonio Monteiro"
> INMABB (Conicet / Universidad Nacional del Sur)
> Av. Alem 1253
> B8000CPB Bahía Blanca, Argentina
> Tel. +54 (291) 459 5116
> http://inmabb.criba.edu.ar/
>