Not sure how many people work with MARC Communications format, but I just
discovered a trick that looks useful. I was seeing how fast we can scan
through a file of bib records and wanted to see how fast grep could do it.
Unfortunately, grep does everything on the basis of lines. But translating
the end-of-record marker into a newline makes grep happy to find/count
records containing a regular expression. Even better, because it outputs
'lines' which now correspond to bib records, it can output a file that our
MARC software can use (it doesn't depend on the 0x1D e-o-r value). wc -l is
now happy to count the number of records in such a file, and it seems as
though there must be other Unix tools that would be useful.
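
For anyone who wants to try it, here is a minimal sketch of the idea. The
file names and the three placeholder "records" are made up for illustration
(they are not real MARC, just text ended by 0x1D, octal 035), and tr is my
assumption for the translation step; any byte-for-byte translator would do.

```shell
# Build a tiny stand-in file: three placeholder "records", each terminated
# by the end-of-record byte 0x1D (octal 035). Real MARC records carry a
# leader and directory; these are only for illustration.
printf 'rec one Hickey\035rec two Smith\035rec three Hickey\035' > sample.mrc

# Translate each end-of-record byte into a newline so that line-oriented
# tools see one record per line, then count the records.
total=$(tr '\035' '\n' < sample.mrc | wc -l | tr -d ' ')

# Count the records that contain a regular expression.
matches=$(tr '\035' '\n' < sample.mrc | grep -c 'Hickey')

# Write the matching records to a file, one record per line.
tr '\035' '\n' < sample.mrc | grep 'Hickey' > hits.txt

echo "$total records, $matches matches"   # prints "3 records, 2 matches"
```

Note that hits.txt ends each record with a newline rather than 0x1D, so it is
only usable by software that doesn't insist on the 0x1D terminator. This also
assumes no record already contains a newline.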
This will work even with Unicode MARC, since UTF-8 encodes every character
above 127 using only bytes above 127, so a 0x1D byte can never be part of a
multi-byte character.
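
A quick way to convince yourself of the UTF-8 point, again only a sketch with
made-up data: two placeholder records containing accented characters are
split on 0x1D and then joined again, and the result is compared with the
original byte for byte. Setting LC_ALL=C is my assumption here, to keep tr
working on raw bytes regardless of the local locale settings.

```shell
# Two placeholder "records" containing multi-byte UTF-8 characters
# (e-acute = 0xC3 0xA9, i-diaeresis = 0xC3 0xAF), each terminated by 0x1D.
printf 'caf\303\251\035na\303\257ve\035' > utf8.mrc

# Split on the end-of-record byte and count the resulting lines.
split_count=$(LC_ALL=C tr '\035' '\n' < utf8.mrc | wc -l | tr -d ' ')

# Translate to newlines and back again, then compare with the original.
# If a multi-byte character had been disturbed, cmp would report a difference.
if LC_ALL=C tr '\035' '\n' < utf8.mrc | LC_ALL=C tr '\n' '\035' | cmp -s - utf8.mrc
then
    roundtrip=ok
else
    roundtrip=changed
fi

echo "$split_count records, round trip: $roundtrip"
```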
Maybe others are doing this (or is everyone using XML?), but it's new to us
here. Maybe this would even work with MARC-XML if you restricted linefeeds
to the end of record.
On my workstation, grep can plow through 50 million Unicode MARC-21 records
in less than 15 minutes. The best time our C software can do is more than
half an hour and our Python code could take several hours.
--Thomas Hickey, Chief Scientist, OCLC
--614.764.6000
--mailto:[log in to unmask]
--http://errol.oclc.org/laf/n82-54463.html