I was going to echo Eric Hatcher's recommendation of Solr and SolrMarc,
since I'm the creator of SolrMarc.
It does provide many of the same tools as are described in the toolset
you linked to, but it is designed to write to Solr rather than to a SQL
style database. Solr may or may not be more suitable for your needs
than a SQL database. However, I decided to download the data to see
whether SolrMarc could handle it. I started with the MARCXML.gz data,
ungzipped it to get a .XML file, but the resulting file causes SolrMarc
to blow chunks. Either I'm missing something or there is something way
wrong with that data. The gzipped binary MARC file works fine, though.
Creating a SolrMarc script to extract the 700 fields, plus a bash script
to cluster and count them, and sort by frequency took about 20 minutes.
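
For anyone who would rather not use bash, the cluster-and-count step can be sketched in a few lines of Python. This is only an illustration, not the script described above: it assumes the 700 fields have already been extracted to plain text, one heading per line, and the function name and sample headings are made up for the example. The result is comparable to piping the lines through `sort | uniq -c | sort -rn`.

```python
from collections import Counter


def count_headings(lines):
    """Cluster identical headings and return (heading, count) pairs,
    most frequent first. Assumes one extracted 700 heading per line."""
    counts = Counter()
    for line in lines:
        heading = line.strip()  # ignore stray whitespace around the heading
        if heading:             # skip blank lines
            counts[heading] += 1
    # Sort by descending count; break ties alphabetically for stable output.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample input standing in for the extracted 700 fields.
sample = [
    "Smith, John",
    "Doe, Jane",
    "Smith, John",
    "",
    "Doe, Jane",
    "Smith, John ",
]

for heading, n in count_headings(sample):
    print(f"{n:7d} {heading}")
```

Since only the distinct headings and their counts are held in memory, this should cope with a few hundred thousand records, though I have not timed it against the full data set.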
On 11/3/2014 3:00 PM, Stuart Yeates wrote:
> Thank you to all who responded with software suggestions. https://github.com/ubleipzig/marctools is looking like the most promising candidate so far. The more I read through the recommendations, the more it dawned on me that I don't want to have to configure yet another Java toolchain (yes I know, that may be personal bias).
> Thank you to all who responded about the challenges of authority control in such collections. I'm aware of these issues. The current project is about marshalling resources so that editors can make informed decisions, rather than automating the creation of articles. Because there is human judgement involved in the last step, I can afford to take a few authority control 'risks'.
> From: Code for Libraries<[log in to unmask]> on behalf of raffaele messuti<[log in to unmask]>
> Sent: Monday, 3 November 2014 11:39 p.m.
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] MARC reporting engine
> Stuart Yeates wrote:
>> Do any of these have built-in indexing? 800k records isn't going to fit in memory and if building my own MARC indexer is 'relatively straightforward' then you're a better coder than I am.
> you could try marcdb from marctools
>  https://github.com/ubleipzig/marctools#marcdb
>  https://github.com/ubleipzig/marctools