I agree with Terry: use a database. Since you're doing multiple queries,
invest the time up front to import your data in a queryable format, with
indexes, instead of repeatedly building comparison files.
But of course, it depends... dealing with large amounts of data efficiently
is often best done with lots of memory. But if you can run mysql and the
lengthy up-front parsing/loading/indexing of the records is acceptable, go
For what it's worth, I have done something similar for many years, where I
build a database with all of our MARC records, parsed down to the subfield
level. It's great for queries like "find me all the records with XYZ in
one subfield and ABC in another" or "find all of the duplicate OCLC
numbers". It's not so great if you need to output the original field in a
report (though it can be rebuilt from the subfields).
Here's the Oracle table I use:
CREATE TABLE bib_subfield
( record_id INT NOT NULL
, field_seq INT NOT NULL
, subfield_seq INT NOT NULL
, indicators CHAR(2) NULL
, tag CHAR(4) NOT NULL
, subfield NVARCHAR2(4000) NULL
Our MARC data is Unicode, thus the NVARCHAR. Super-long subfields like
some 5xx notes do get truncated but that's a tiny fraction of a percentage
of data lost, a fair tradeoff for our needs.
field_seq and subfield_seq are numbers tracking the ordinal position of
each field within the record, and each subfield within a field, for those
occasional queries wanting data from the "first" 650 field, or subfields
which aren't in the "correct" order per catalogers. You may not need that
level of detail.
Another, completely unrelated, possible solution depending on your needs:
run the records through solrmarc and do your queries via solr?
Good luck... let us know what you eventually decide to do.
On Wed, Feb 27, 2013 at 9:53 AM, Reese, Terry
<[log in to unmask]>wrote:
> Kyle -- if this was me -- I'd break the file into a database. You have a
> lot of different options, but the last time I had to do something like this
> -- I broke the data into 10 tables -- a control table with a primary key
> and oclc number, a table for 0xx fields, a table for 1xx, 2xx, etc.
> including OCLC number and key that they relate too. You can actually do
> this with MarcEdit (if you have mysql installed) -- but on a laptop -- I'm
> not going to guarantee speed with the process. Plus, the process to
> generate the SQL data will be significant. It might take 15 hours to
> generate the database, but then you'd have it and could create indexes on
> it. But you could use it to create the database and then prep the files
> for later work.