Hi Stuart,
Author of marctools[1] here – if you have any feature requests that
would help with your processing, please don't hesitate to open an
issue on GitHub.
Originally, we wrote marctools to convert MARC to JSON and then
index[2] the output into Elasticsearch[3] for random access, querying
and analysis.
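
The conversion step itself is simple; here is a minimal pymarc sketch
of it (the file names are placeholders, and esbulk[2] would take the
resulting line-delimited JSON from there):

    from pymarc import MARCReader

    # read binary MARC, emit one JSON record per line for bulk indexing
    with open('records.mrc', 'rb') as fh, open('records.ldj', 'w') as out:
        for record in MARCReader(fh):
            out.write(record.as_json() + '\n')
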
Best,
Martin
----
[1] https://github.com/ubleipzig/marctools
[2] https://github.com/miku/esbulk
[3] http://www.elasticsearch.org/
On Tue, Nov 4, 2014 at 12:43 AM, Jonathan Rochkind <[log in to unmask]> wrote:
> Hm. You don't need to keep all 800k records in memory; you just need
> to keep the data you need, right? I'd keep a hash keyed by authorized
> heading, with the values I need there.
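>
> As a rough pymarc sketch (the file name and the choice of 100/700
> fields are just guesses at what you'd want), counting how often each
> personal-name heading appears and printing a wiki-markup list:
>
>     from collections import Counter
>     from pymarc import MARCReader
>
>     counts = Counter()
>     with open('innz.mrc', 'rb') as fh:
>         for record in MARCReader(fh):
>             for field in record.get_fields('100', '700'):
>                 counts[field.format_field()] += 1
>
>     # wikimedia-style bullet list, most-referenced headings first
>     for heading, n in counts.most_common():
>         print('* [[%s]] (%d)' % (heading, n))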
>
> I don't think you'll have trouble keeping such a hash in memory for a
> batch process run manually once in a while -- modern OSes do a good
> job with virtual memory, making it invisible (if slower) when you use
> more memory than you physically have, if it even comes to that, which
> it may not.
>
> If you do, you could keep the data you need in the data store of your
> choice, such as a local DBM database. Ruby, Python, and Perl all make
> that pretty painless: you get a hash-like data structure which is
> actually stored on disk rather than in memory, but which you access
> more or less the same way as an in-memory hash.
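>
> In Python, for example, the stdlib dbm module gives you exactly that.
> A small sketch (the file name and sample headings are made up; dbm
> keys and values must be bytes):
>
>     import dbm
>
>     # a disk-backed hash; 'c' creates the file if it doesn't exist
>     with dbm.open('headings.db', 'c') as db:
>         for heading in ['Smith, John', 'Doe, Jane', 'Smith, John']:
>             key = heading.encode('utf-8')
>             n = int(db[key]) + 1 if key in db else 1
>             db[key] = str(n).encode('utf-8')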
>
> But, yes, it will require some programming, for sure.
>
> A "MARC Indexer" can mean many things, and I'm not sure you need one here,
> but as it happens I have built something you could describe as a "MARC
> Indexer", and I guess it wasn't exactly straightforward, it's true. I'm not
> sure it's of any use to you here for your use case, but you can check it out
> at https://github.com/traject-project/traject
>
>
> On 11/2/14 9:29 PM, Stuart Yeates wrote:
>>
>> Do any of these have built-in indexing? 800k records isn't going to
>> fit in memory, and if building my own MARC indexer is 'relatively
>> straightforward' then you're a better coder than I am.
>>
>> cheers stuart
>>
>> --
>> I have a new phone number: 04 463 5692
>>
>> ________________________________________
>> From: Code for Libraries <[log in to unmask]> on behalf of Jonathan
>> Rochkind <[log in to unmask]>
>> Sent: Monday, 3 November 2014 1:24 p.m.
>> To: [log in to unmask]
>> Subject: Re: [CODE4LIB] MARC reporting engine
>>
>> If you are, can become, or know a programmer, that would be
>> relatively straightforward in any programming language, using the
>> open source MARC processing library for that language (ruby-marc,
>> pymarc, perl marc, whatever).
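>>
>> For instance, with pymarc the basic read loop is only a few lines (a
>> sketch; the file name is made up):
>>
>>     from pymarc import MARCReader
>>
>>     with open('innz.mrc', 'rb') as fh:
>>         for record in MARCReader(fh):
>>             print(record['245'])  # the title field, as a quick check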
>>
>> Although you might find more trouble than you expect around
>> authorities, which may be less standardized in your corpus than you
>> would like.
>>
>> ________________________________________
>> From: Code for Libraries [[log in to unmask]] on behalf of Stuart
>> Yeates [[log in to unmask]]
>> Sent: Sunday, November 02, 2014 5:48 PM
>> To: [log in to unmask]
>> Subject: [CODE4LIB] MARC reporting engine
>>
>> I have ~800,000 MARC records from an indexing service
>> (http://natlib.govt.nz/about-us/open-data/innz-metadata CC-BY). I am
>> trying to generate:
>>
>> (a) a list of person authorities (and sundry metadata), sorted by how
>> many times they're referenced, in wikimedia syntax
>>
>> (b) a view of a person authority, with all the records that
>> reference it, processed into a Wikipedia stub biography
>>
>> I have established that this is too much data to process in XSLT or
>> with multi-line regexps in vi. What other MARC engines are out
>> there?
>>
>> The two options I'm aware of are learning multi-line processing in
>> sed or learning enough Koha to write reports in whatever its
>> reporting engine is.
>>
>> Any advice?
>>
>> cheers stuart
>> --
>> I have a new phone number: 04 463 5692
>>
>>
>