Hi Code4Lib folks,

I was wondering if anyone has experience cleaning up OCR text.  In
particular, I am trying to figure out how to deal with the line
breaks that OCR inserts.  I am trying to parse a bibliography with
regex.  I think I've figured out which queries I need to run to break
it up into a tab-delimited text file, but I noticed that the OCR did
the classic thing of inserting line breaks wherever they physically
fall on the page.  This is a problem because it splits each
annotation across several lines rather than leaving it as one block
that I can load into a database.  Does anyone who has worked with OCR
text have a suggested way to clean up those line breaks without going
through 300+ pages by hand?  Any thoughts would be welcome.
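
To make the problem concrete, here is a rough Python sketch of the kind
of cleanup I mean. It is only an illustration under assumptions that may
not hold for the real file: that entries are separated by blank lines,
and that a hyphen at the end of a line is a word split by the page
layout. The function name unwrap_ocr_lines is made up for this example.

```python
import re


def unwrap_ocr_lines(text):
    """Join hard-wrapped lines inside each entry, one entry per output line.

    Assumes (hypothetically) that blank lines separate entries.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Split wherever one or more blank lines occur.
    blocks = re.split(r"\n\s*\n", text)

    cleaned = []
    for block in blocks:
        # Rejoin words hyphenated across a line break ("anno-\ntation").
        # Caveat: this also merges real compounds such as "well-\nknown".
        block = re.sub(r"(\w)-\n(\w)", r"\1\2", block)
        # Turn every remaining line break into a single space.
        block = re.sub(r"\s*\n\s*", " ", block).strip()
        if block:
            cleaned.append(block)

    return "\n".join(cleaned)


sample = (
    "Smith, J. A long anno-\n"
    "tation that wraps\n"
    "across lines.\n"
    "\n"
    "Jones, K. Second\n"
    "entry."
)
print(unwrap_ocr_lines(sample))
```

If the OCR output has no blank lines between entries, the split step
would need a different cue, such as a pattern that matches the start of
each citation.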

Matt Sherman