Hi Code4Lib folks,
I was wondering if anyone has experience cleaning up OCR text. In
particular, I am trying to figure out how to deal with the hard line
breaks that OCR introduces. I am trying to parse a bibliography with
regex. I think I've figured out which patterns I need to run to break
it up into a tab-delimited text file, but the text does the classic
OCR thing of keeping a line break wherever one physically falls on
the page. This is a problem because it splits each annotation across
several lines rather than leaving it as one block that I can load
into a database. So I am wondering whether anyone who has worked with
OCR text has a suggested way to clean up those line breaks without
doing 300+ pages by hand. Any thoughts would be welcome.
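
To make the question concrete, here is a rough Python sketch of the
kind of cleanup I have in mind. It rests on assumptions I have not
verified against the whole file: that entries are separated by at
least one blank line, and that a hyphen at the end of a line marks a
word split by the page layout. The function name unwrap_ocr_lines is
just something I made up for illustration.

```python
import re


def unwrap_ocr_lines(text: str) -> str:
    """Join hard-wrapped OCR lines so each entry sits on one line.

    Assumption: entries are separated by one or more blank lines.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Rejoin words hyphenated across a line break,
    # e.g. "bibli-\nography" becomes "bibliography".
    # Assumption: a letter, a hyphen, a newline, then a lowercase
    # letter means a layout split, not a real hyphenated compound.
    text = re.sub(r"(?<=[A-Za-z])-\n(?=[a-z])", "", text)

    # Treat runs of blank lines as the boundary between entries.
    blocks = re.split(r"\n\s*\n", text)

    # Inside each entry, collapse every run of whitespace
    # (including the unwanted line breaks) to a single space.
    entries = [" ".join(block.split()) for block in blocks]

    # One entry per line, skipping blocks that were only whitespace.
    return "\n".join(entry for entry in entries if entry)


if __name__ == "__main__":
    sample = (
        "Smith, J. A bibli-\n"
        "ography of things.\n"
        "An annotation that\n"
        "wraps.\n"
        "\n"
        "Jones, K. Another\n"
        "entry.\n"
    )
    print(unwrap_ocr_lines(sample))
```

The hyphen rule will also glue together genuine compounds that happen
to break at the end of a line ("well-" / "known" becomes "wellknown"),
and if the OCR output has no blank lines between entries the split
would need some other cue, such as a line that starts with an author
name pattern.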
Matt Sherman