I'd also consider using a document DB (e.g. MongoDB) with the MARC-in-JSON
format for this.
You could run JSONPath-style queries or map/reduce to get your answers.
MongoDB performs best when the working set fits in memory, but I think you'll
be fine since you don't need immediate answers.
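
To make that concrete, here's a rough, untested pymongo sketch. It assumes
you've already loaded the file into a local MongoDB as one MARC-in-JSON
document per record; the "marc"/"bibs" names and the search term are just
placeholders:

# Rough, untested sketch: one MARC-in-JSON document per record, loaded into
# a local MongoDB ("marc" / "bibs" are placeholder names).
from pymongo import MongoClient

bibs = MongoClient()["marc"]["bibs"]

# MARC-in-JSON keeps fields as a list of {"tag": {...}} objects; Mongo's dot
# notation walks through lists, so "fields.245.subfields.a" reaches every
# 245 $a in a record.
hits = bibs.find({"fields.245.subfields.a": {"$regex": "climate", "$options": "i"}})
for rec in hits.limit(10):
    print(rec.get("leader"), rec["fields"][0])
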
-Ross.
On Wednesday, February 27, 2013, Andy Kohler wrote:
> I agree with Terry: use a database. Since you're doing multiple queries,
> invest the time up front to import your data into a queryable format, with
> indexes, instead of repeatedly building comparison files.
>
> But of course, it depends... dealing with large amounts of data efficiently
> is often best done with lots of memory. But if you can run MySQL and the
> lengthy up-front parsing/loading/indexing of the records is acceptable, go
> for it.
>
> For what it's worth, I have done something similar for many years, where I
> build a database with all of our MARC records, parsed down to the subfield
> level. It's great for queries like "find me all the records with XYZ in
> one subfield and ABC in another" or "find all of the duplicate OCLC
> numbers". It's not so great if you need to output the original field in a
> report (though it can be rebuilt from the subfields).
>
> Here's the Oracle table I use:
> CREATE TABLE bib_subfield
> ( record_id INT NOT NULL
> , field_seq INT NOT NULL
> , subfield_seq INT NOT NULL
> , indicators CHAR(2) NULL
> , tag CHAR(4) NOT NULL
> , subfield NVARCHAR2(4000) NULL
> )
> ;
>
> Our MARC data is Unicode, thus the NVARCHAR2. Super-long subfields like
> some 5xx notes do get truncated, but that's a tiny fraction of a percent of
> the data lost, a fair tradeoff for our needs.
>
> field_seq and subfield_seq are numbers tracking the ordinal position of
> each field within the record, and each subfield within a field, for those
> occasional queries wanting data from the "first" 650 field, or subfields
> which aren't in the "correct" order per catalogers. You may not need that
> level of detail.
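>
> As a concrete example, the duplicate-OCLC question is just a GROUP BY on
> that table. Here's an untested sketch using an in-memory sqlite3 copy, so
> you can try the query without Oracle; it assumes the OCLC numbers were
> loaded from the 035 field, so adjust for wherever yours live:
>
> import sqlite3
>
> db = sqlite3.connect(":memory:")
> # Simplified copy of the bib_subfield table above (Oracle types collapsed
> # to sqlite's).
> db.execute("""CREATE TABLE bib_subfield
>               ( record_id INT, field_seq INT, subfield_seq INT,
>                 indicators CHAR(2), tag CHAR(4), subfield TEXT )""")
> # Toy data: two records sharing an OCLC number, one unique.
> db.executemany("INSERT INTO bib_subfield VALUES (?,?,?,?,?,?)",
>                [(1, 2, 1, "  ", "035", "(OCoLC)12345"),
>                 (2, 2, 1, "  ", "035", "(OCoLC)12345"),
>                 (3, 2, 1, "  ", "035", "(OCoLC)99999")])
> dupes = db.execute("""SELECT subfield, COUNT(DISTINCT record_id) AS recs
>                         FROM bib_subfield
>                        WHERE tag = '035'
>                        GROUP BY subfield
>                       HAVING COUNT(DISTINCT record_id) > 1""").fetchall()
> print(dupes)   # -> [('(OCoLC)12345', 2)]
>
> The "first 650" sort of question works the same way, just with
> MIN(field_seq).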
>
> Another (completely unrelated) possible solution, depending on your needs:
> run the records through solrmarc and do your queries via Solr?
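>
> If you go that route, the queries themselves become one-liners. An untested
> pysolr sketch; the core URL and field names depend entirely on your
> solrmarc index configuration, so treat them as placeholders:
>
> import pysolr
>
> solr = pysolr.Solr("http://localhost:8983/solr/biblio")   # placeholder core
> results = solr.search("oclc_num:12345", rows=10)          # placeholder field
> for doc in results:
>     print(doc.get("id"), doc.get("title"))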
>
> Good luck... let us know what you eventually decide to do.
>
> --Andy
>
> On Wed, Feb 27, 2013 at 9:53 AM, Reese, Terry
> <[log in to unmask]> wrote:
>
> > Kyle -- if this were me -- I'd break the file into a database. You have a
> > lot of different options, but the last time I had to do something like
> > this I broke the data into 10 tables: a control table with a primary key
> > and OCLC number, then a table for the 0xx fields, one for the 1xx, 2xx,
> > etc., each including the OCLC number and the key they relate to. You can
> > actually do this with MarcEdit (if you have MySQL installed), but on a
> > laptop I'm not going to guarantee speed with the process. Plus, the
> > process to generate the SQL data will be significant -- it might take 15
> > hours to build the database, but then you'd have it, could create indexes
> > on it, and could use it to prep the files for later work.
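> >
> > Roughly the shape I mean, as an untested sketch: pymarc and sqlite3 are
> > standing in for MarcEdit/MySQL here, and reading the OCLC number straight
> > out of the 001 (plus the file names) are assumptions you'd adjust for
> > your records.
> >
> > import sqlite3
> > from pymarc import MARCReader
> >
> > db = sqlite3.connect("bibs.db")                  # placeholder file name
> > db.execute("CREATE TABLE control (rec_key INTEGER PRIMARY KEY, oclc_num TEXT)")
> > for block in range(10):                          # fields_0xx ... fields_9xx
> >     db.execute(f"CREATE TABLE fields_{block}xx "
> >                "(rec_key INT, oclc_num TEXT, tag TEXT, value TEXT)")
> >
> > with open("records.mrc", "rb") as fh:            # placeholder file name
> >     for rec_key, record in enumerate(MARCReader(fh), start=1):
> >         f001 = record.get_fields("001")
> >         oclc = f001[0].value() if f001 else None
> >         db.execute("INSERT INTO control VALUES (?, ?)", (rec_key, oclc))
> >         for field in record.fields:
> >             if not field.tag[:1].isdigit():      # skip non-numeric local tags
> >                 continue
> >             # each field flattened to a single string; split it down to
> >             # subfields later if you need that level of detail
> >             db.execute(f"INSERT INTO fields_{field.tag[0]}xx VALUES (?,?,?,?)",
> >                        (rec_key, oclc, field.tag, field.value()))
> > db.commit()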
> >
> > --TR
> >
>