The MARC XML seems to be an archive within an archive: I had to gunzip the download to get innzmetadata.xml, rename that to innzmetadata.xml.gz, and gunzip it again to get the actual XML.
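
In case it saves anyone the rename dance, here is a minimal Python sketch of the same idea: keep stripping gzip layers until the file no longer starts with the gzip magic bytes. The function names and the max_layers limit are my own invention for illustration, and I have only assumed (not checked) that nothing other than gzip is wrapped around the XML.

```python
import gzip
import os
import shutil

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file at path starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def gunzip_fully(src_path, dest_path, max_layers=5):
    """Strip nested gzip layers from src_path and write the payload to dest_path.

    Returns the number of layers removed. src_path itself is left untouched.
    """
    layers = 0
    current = src_path
    while is_gzipped(current):
        if layers == max_layers:
            if current != src_path:
                os.remove(current)
            raise ValueError("too many nested gzip layers")
        tmp = f"{dest_path}.tmp{layers}"
        # Stream through a temporary file so large dumps never sit in memory.
        with gzip.open(current, "rb") as fin, open(tmp, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        if current != src_path:
            os.remove(current)  # drop the previous intermediate layer
        current = tmp
        layers += 1
    if current == src_path:
        shutil.copyfile(src_path, dest_path)  # was never gzipped at all
    else:
        os.replace(current, dest_path)
    return layers
```

Calling something like gunzip_fully("innzmetadata.xml.gz", "innzmetadata.xml") should then report 2 layers for the file as I found it (file names here are illustrative).
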
Owen
Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [log in to unmask]
Telephone: 0121 288 6936
> On 3 Nov 2014, at 22:38, Robert Haschart <[log in to unmask]> wrote:
>
> I was going to echo Eric Hatcher's recommendation of Solr and SolrMarc, since I'm the creator of SolrMarc.
> It does provide many of the same tools as are described in the toolset you linked to, but it is designed to write to Solr rather than to a SQL-style database. Solr may or may not be more suitable for your needs than a SQL database. However, I decided to download the data to see whether SolrMarc could handle it. I started with the MARCXML.gz data, ungzipped it to get a .XML file, but the resulting file causes SolrMarc to blow chunks. Either I'm missing something or there is something way wrong with that data. The gzipped binary MARC file works fine with the SolrMarc tools.
>
> Creating a SolrMarc script to extract the 700 fields, plus a bash script to cluster, count, and sort them by frequency, took about 20 minutes.
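
For anyone without SolrMarc to hand, a rough standard-library equivalent against the (fully decompressed) MARCXML might look like the sketch below. This is not Bob's script: it streams records with iterparse, counts only 700 $a, and "clusters" by exact match on the stripped string, with no normalisation of dates or punctuation. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag):
    """Strip any XML namespace, so namespaced and bare MARCXML both match."""
    return tag.rsplit("}", 1)[-1]


def count_700a(source):
    """Count 700 $a values in a MARCXML file (path or binary file object)."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "record":
            continue
        for field in elem.iter():
            if local(field.tag) != "datafield" or field.get("tag") != "700":
                continue
            for sub in field:
                if local(sub.tag) == "subfield" and sub.get("code") == "a" and sub.text:
                    counts[sub.text.strip()] += 1
        elem.clear()  # release this record's fields before reading the next
    return counts
```

count_700a(path).most_common(50) then gives the most frequent names first.
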
>
> -Bob Haschart
>
>
> On 11/3/2014 3:00 PM, Stuart Yeates wrote:
>> Thank you to all who responded with software suggestions. https://github.com/ubleipzig/marctools is looking like the most promising candidate so far. The more I read through the recommendations, the more it dawned on me that I don't want to have to configure yet another Java toolchain (yes, I know, that may be personal bias).
>>
>> Thank you to all who responded about the challenges of authority control in such collections. I'm aware of these issues. The current project is about marshalling resources so that editors can make informed decisions, rather than automating the creation of articles. Because human judgement is involved in the last step, I can afford to take a few authority control 'risks'.
>>
>> cheers
>> stuart
>>
>> --
>> I have a new phone number: 04 463 5692
>>
>> ________________________________________
>> From: Code for Libraries<[log in to unmask]> on behalf of raffaele messuti<[log in to unmask]>
>> Sent: Monday, 3 November 2014 11:39 p.m.
>> To: [log in to unmask]
>> Subject: Re: [CODE4LIB] MARC reporting engine
>>
>> Stuart Yeates wrote:
>>> Do any of these have built-in indexing? 800k records isn't going to fit in memory and if building my own MARC indexer is 'relatively straightforward' then you're a better coder than I am.
>> you could try marcdb[1] from marctools[2]
>>
>> [1] https://github.com/ubleipzig/marctools#marcdb
>> [2] https://github.com/ubleipzig/marctools
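
On the indexing question, the basic idea is less daunting than it sounds. The sketch below is not marcdb's actual implementation, just an illustration under some assumptions: the input is well-formed binary MARC (ISO 2709) with nothing between records, and the 001 control number is a good enough key. It records each record's byte offset and length in SQLite, so a single record can be fetched later without loading the whole file. The table and function names are hypothetical.

```python
import sqlite3

FIELD_TERMINATOR = b"\x1e"


def control_number(record):
    """Return the 001 value of one raw ISO 2709 record, or None if absent."""
    base = int(record[12:17])        # leader/12-16: base address of data
    directory = record[24:base - 1]  # 12-byte entries; skip the final terminator
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        if entry[:3] == b"001":
            length = int(entry[3:7])
            start = int(entry[7:12])
            field = record[base + start:base + start + length]
            return field.rstrip(FIELD_TERMINATOR).decode("utf-8", "replace")
    return None


def build_index(marc_path, db_path):
    """Store (id, offset, length) for every record; return the record count."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS records (id TEXT, offset INTEGER, length INTEGER)"
    )
    count = 0
    with open(marc_path, "rb") as fh:
        while True:
            offset = fh.tell()
            head = fh.read(5)        # leader/00-04: total record length
            if len(head) < 5:
                break
            length = int(head)
            record = head + fh.read(length - 5)
            db.execute(
                "INSERT INTO records VALUES (?, ?, ?)",
                (control_number(record), offset, length),
            )
            count += 1
    db.execute("CREATE INDEX IF NOT EXISTS records_id ON records (id)")
    db.commit()
    db.close()
    return count


def fetch(marc_path, db_path, record_id):
    """Return the raw bytes of the record with the given 001, or None."""
    db = sqlite3.connect(db_path)
    row = db.execute(
        "SELECT offset, length FROM records WHERE id = ?", (record_id,)
    ).fetchone()
    db.close()
    if row is None:
        return None
    with open(marc_path, "rb") as fh:
        fh.seek(row[0])
        return fh.read(row[1])
```

Only one record is held in memory at a time, so 800k records should not be a problem for memory, though I have not timed it on the real data.
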
>>
>>
>> --
>> raffaele