Matt,

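There are probably a dozen ways to do this, but it would be really helpful to know what operating system you are on. For example, if you are using Linux, you can run the file through sed. Because sed reads its input one line at a time, a plain 's/\n//' never sees the newline, so the lines have to be gathered into the pattern space first (GNU sed syntax):
  sed ':a;N;$!ba;s/\n/ /g' <OCR_FILE> > <STRIPPED_OCR_FILE>
Note that this joins every line in the file, including the breaks between entries.
See http://stackoverflow.com/a/800644/2896744 for more info.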
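
If the breaks between bibliography entries need to survive, a short Python script is another option that works on any operating system. The sketch below is only an illustration, not a tested recipe for your file: it assumes that entries are separated by at least one blank line in the OCR output, and that a hyphen at the very end of a line marks a word split by the page layout. The function name unwrap_ocr_text is made up for this example.

```python
import re


def unwrap_ocr_text(text):
    """Join hard-wrapped lines inside each paragraph, keeping one line per paragraph.

    Assumption: paragraphs (here, bibliography entries) are separated by
    one or more blank lines in the OCR output.
    """
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Split wherever there is at least one blank (or whitespace-only) line.
    paragraphs = re.split(r"\n(?:[ \t]*\n)+", text)

    cleaned = []
    for para in paragraphs:
        # Assumption: a hyphen directly before a line break is a layout split,
        # so "anno-\ntation" becomes "annotation". Genuine compounds that
        # happen to break at the hyphen will lose it.
        para = re.sub(r"(\w)-\n[ \t]*(\w)", r"\1\2", para)
        # Collapse the remaining line breaks and runs of spaces to single spaces.
        para = " ".join(para.split())
        if para:
            cleaned.append(para)

    # One paragraph per output line.
    return "\n".join(cleaned)


if __name__ == "__main__":
    sample = (
        "Smith, J. A study of anno-\n"
        "tation in early\n"
        "printed books.\n"
        "\n"
        "Jones, K. Another\n"
        "entry."
    )
    print(unwrap_ocr_text(sample))
```

Each entry ends up on its own line, which should make the later regex pass into a tab-delimited file simpler. If the OCR output has no blank lines between entries, the split pattern would need to key on something else, such as the start of an author name.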
________________________________________
From: Code for Libraries [[log in to unmask]] On Behalf Of Matt Sherman [[log in to unmask]]
Sent: Monday, August 03, 2015 10:29 PM
To: [log in to unmask]
Subject: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text

Hi Code4Lib folks,

I was wondering if anyone had some experience cleaning up OCR text.
Particularly I am trying to figure out how I can deal with the random
line breaks that come from OCR.  I am trying to parse out a
bibliography with regex.  I think I've figured out which queries I
need to run to break it up so I can make it into a tab-delimited text
file but I noticed that the text does the classic thing of OCR
inserting line breaks where they physically are on the page.  This
will obviously be a bit of an issue since it would break the
annotation into a bunch of lines rather than leaving it one block so I
can manipulate it into a database.  So I am wondering if anyone who
has worked with OCR text before has a suggested way to clean up those
line breaks without doing 300+ pages by hand?  Any thoughts would be
welcome.

Matt Sherman