LISTSERV 16.5 - CODE4LIB Archives

Not sure how many people work with MARC Communications format, but I just
discovered a trick that looks useful.  I was seeing how fast we can scan
through a file of bib records and wanted to see how fast grep could do it.
Unfortunately, grep does everything on the basis of lines.  But translating
the end-of-record marker into a newline makes grep happy to find/count
records containing a regular expression.  Even better, because it outputs
'lines' which now correspond to bib records, it can output a file that our
MARC software can use (it doesn't depend on the 0x1D e-o-r value).  wc -l is
now happy to count the number of records in such a file, and it seems as
though there must be other Unix tools that would be useful.
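
A minimal sketch of the idea, for anyone who wants to try it.  The three
"records" are made-up placeholders rather than real MARC (no leader or
directory), the variable names are only illustrative, and it assumes no
record contains a newline byte of its own.  Putting the 0x1D back at the end
is optional and only matters for tools that insist on it.

```shell
# Build a tiny stand-in for a MARC file: three placeholder "records", each
# ended by the 0x1D record terminator (octal 035).
sample=$(mktemp)
printf 'rec1 Moby Dick\035rec2 Walden\035rec3 Moby Dick notes\035' > "$sample"

# Turn every record terminator into a newline so each record is one line.
# LC_ALL=C makes tr and grep work on raw bytes.
lines=$(mktemp)
LC_ALL=C tr '\035' '\n' < "$sample" > "$lines"

# wc -l now counts records, and grep -c counts records matching a regex.
total=$(wc -l < "$lines" | tr -d ' ')
hits=$(LC_ALL=C grep -c 'Moby' "$lines")

# Keep only the matching records and put the 0x1D terminators back.
matches=$(mktemp)
LC_ALL=C grep 'Moby' "$lines" | LC_ALL=C tr '\n' '\035' > "$matches"
restored=$(LC_ALL=C tr -cd '\035' < "$matches" | wc -c | tr -d ' ')

echo "records: $total, matching: $hits, terminators restored: $restored"
```

On a real file the whole thing collapses to a single pipeline: tr on the
input file, piped into grep -c with the pattern of interest.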



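This will work even with Unicode MARC, since UTF-8 leaves the 0-127 byte
values unchanged and encodes everything else with bytes above 127, so 0x1D
never turns up inside a multi-byte character.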
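
A quick way to convince yourself of the UTF-8 point, again only as a sketch
with invented placeholder records ("café" and "naïve" written out as their
UTF-8 bytes) and illustrative variable names:

```shell
# Two placeholder records with non-ASCII text encoded as UTF-8:
# "café" (é = bytes C3 A9) and "naïve" (ï = bytes C3 AF), each ended by 0x1D.
utf8_sample=$(mktemp)
printf 'caf\303\251\035na\303\257ve\035' > "$utf8_sample"

# Every byte of a multi-byte UTF-8 sequence is 0x80 or above, so the only
# 0x1D bytes in the file are the two record terminators.
terminators=$(LC_ALL=C tr -cd '\035' < "$utf8_sample" | wc -c | tr -d ' ')

# Splitting on 0x1D therefore leaves the multi-byte characters intact, and
# grep can still match them byte for byte.
utf8_hits=$(LC_ALL=C tr '\035' '\n' < "$utf8_sample" \
  | LC_ALL=C grep -c "$(printf 'caf\303\251')")

echo "terminators: $terminators, records containing the UTF-8 word: $utf8_hits"
```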



Maybe others are doing this (or is everyone using XML?), but it's new to us
here.  Maybe this would even work with MARC-XML if you restricted linefeeds
to the end of record.



On my workstation, grep can plow through 50 million Unicode MARC-21 records
in less than 15 minutes.  The best time our C software can do is more than
half an hour and our Python code could take several hours.



--Thomas Hickey, Chief Scientist, OCLC

--614.764.6000

--http://errol.oclc.org/laf/n82-54463.html