Hm. You don't need to keep all 800k records in memory; you just need to
keep the data you actually need, right? I'd keep a hash keyed by
authorized heading, with the values I need stored under each key.
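
To make that concrete, here is a rough sketch in Python. It assumes you
have already used your MARC library to reduce each record to a record id
plus a list of authorized headings; that extraction step is not shown.
The function names and the sample data are made up for illustration.

```python
from collections import defaultdict

def tally_headings(records):
    """Build {heading: [record_id, ...]} from (record_id, headings) pairs."""
    index = defaultdict(list)
    for record_id, headings in records:
        # set() so a heading repeated within one record is counted once
        for heading in set(headings):
            index[heading].append(record_id)
    return index

def by_reference_count(index):
    """Return (heading, count) pairs, most-referenced first, ties by name."""
    return sorted(((h, len(ids)) for h, ids in index.items()),
                  key=lambda pair: (-pair[1], pair[0]))

# Made-up sample data standing in for headings pulled out of MARC records.
sample = [
    ("rec1", ["Doe, Jane", "Roe, Richard"]),
    ("rec2", ["Doe, Jane"]),
    ("rec3", ["Doe, Jane", "Smith, Alex"]),
]

index = tally_headings(sample)
ranking = by_reference_count(index)
for heading, count in ranking:
    print(count, heading)
```

Only the headings and record ids are held in memory, not the records
themselves, which is the point: the hash is far smaller than the corpus.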
I don't think you'll have trouble keeping such a hash in memory for a
batch process run manually once in a while -- modern OSes do a great job
of making virtual memory invisible (but slower) when you use more
memory than you physically have, if it comes to that, which it may not.
If you do run into trouble, you could keep the data you need in the data
store of your choice, such as a local DBM database. Ruby, Python, and
Perl will all let you do that pretty painlessly: you get a hash-like
data structure that is actually stored on disk rather than in memory,
but which you access more or less the same way as an in-memory hash.
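
As one possible illustration, Python's standard library has a `shelve`
module that sits on top of DBM and behaves like a dict stored on disk.
This is only a sketch: the file location, the `add_reference` helper and
the sample headings are made up for the example.

```python
import os
import shelve
import tempfile

def add_reference(store, heading, record_id):
    """Append record_id to the list stored under heading."""
    ids = store.get(heading, [])
    ids.append(record_id)
    store[heading] = ids  # reassign so the updated list is written to disk

# Hypothetical location; a temp directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "headings")

# First pass: fill the on-disk hash.
with shelve.open(path) as store:
    add_reference(store, "Doe, Jane", "rec1")
    add_reference(store, "Doe, Jane", "rec2")
    add_reference(store, "Roe, Richard", "rec2")

# Later (even in a different run): reopen it and read it like a dict.
with shelve.open(path) as store:
    counts = {heading: len(ids) for heading, ids in store.items()}

print(counts)
```

Note the reassignment in `add_reference`: by default a shelf only saves a
value when you assign it to a key, so mutating a list in place would not
be persisted.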
But, yes, it will require some programming, for sure.
A "MARC Indexer" can mean many things, and I'm not sure you need one
here, but as it happens I have built something you could describe as a
"MARC Indexer", and I guess it wasn't exactly straightforward, it's
true. I'm not sure it's of any use to you here for your use case, but
you can check it out at https://github.com/traject-project/traject
On 11/2/14 9:29 PM, Stuart Yeates wrote:
> Do any of these have built-in indexing? 800k records isn't going to
> fit in memory and if building my own MARC indexer is 'relatively
> straightforward' then you're a better coder than I am.
>
> cheers stuart
>
> -- I have a new phone number: 04 463 5692
>
> ________________________________________
> From: Code for Libraries <[log in to unmask]> on behalf of Jonathan Rochkind <[log in to unmask]>
> Sent: Monday, 3 November 2014 1:24 p.m.
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] MARC reporting engine
>
> If you are, can become, or know, a programmer, that would be
> relatively straightforward in any programming language using the open
> source MARC processing library for that language. (ruby marc, pymarc,
> perl marc, whatever).
>
> Although you might find more trouble than you expect around
> authorities, with them being less standardized in your corpus than
> you might like.
>
> ________________________________________
> From: Code for Libraries [[log in to unmask]] on behalf of Stuart Yeates [[log in to unmask]]
> Sent: Sunday, November 02, 2014 5:48 PM
> To: [log in to unmask]
> Subject: [CODE4LIB] MARC reporting engine
>
> I have ~800,000 MARC records from an indexing service
> (http://natlib.govt.nz/about-us/open-data/innz-metadata CC-BY). I am
> trying to generate:
>
> (a) a list of person authorities (and sundry metadata), sorted by how
> many times they're referenced, in wikimedia syntax
>
> (b) a view of a person authority, with all the records by which
> they're referenced, processed into a wikipedia stub biography
>
> I have established that this is too much data to process in XSLT or
> multi-line regexps in vi. What other MARC engines are there out
> there?
>
> The two options I'm aware of are learning multi-line processing in
> sed or learning enough koha to write reports in whatever their
> reporting engine is.
>
> Any advice?
>
> cheers stuart -- I have a new phone number: 04 463 5692
>
>