A Word document does funny things to the text since it is often not plain text underneath (try opening the .doc in a plain text editor; you may find HTML or binary formatting data rather than just the text). I would try to get the plain ASCII text instead, and then install Cygwin, which contains sed and a bunch of other useful Unix/Linux commands.
See http://stackoverflow.com/a/127567/2896744 for more info.
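For the line-break problem itself, here is one possible approach once the plain text is in hand. It is only a sketch and has not been run against the actual OCR output. It treats blank lines as the boundaries between bibliography entries and joins every line in between with a single space. That assumes the entries really are separated by blank lines, which may not hold for this OCR. The function name unwrap_ocr and the file names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: join hard-wrapped lines inside each paragraph.
# Assumption: paragraphs (bibliography entries) are separated by blank lines.
unwrap_ocr() {
    # Strip Windows carriage returns first so blank lines are truly empty,
    # then let awk read one blank-line-delimited paragraph per record
    # (RS = "") and replace each embedded newline with a single space.
    tr -d '\r' |
        awk 'BEGIN { RS = ""; ORS = "\n\n" }
             { gsub(/[ \t]*\n[ \t]*/, " "); print }'
}

# Example usage (placeholder file names):
# unwrap_ocr < ocr.txt > ocr_unwrapped.txt
```

Both tr and awk should be available in a default Cygwin install. Words that the OCR hyphenated across a line break are not rejoined by this sketch and would still need their own pass.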
From: Code for Libraries On Behalf Of Matt Sherman
Sent: Tuesday, August 04, 2015 9:09 AM
Subject: Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
I am on Windows machines, so I don't have quite the easy access to
that useful command. Someone had earlier put the OCR in a doc file so
I've been playing with that more than with the raw PDF OCR.
On Tue, Aug 4, 2015 at 8:19 AM, Scancella, John wrote:
> There are probably a dozen ways to do this, but it would be really helpful to know what operating system you are on. For example, if you are using Linux, you can run it through sed using
> cat <OCR_FILE> | sed ':a;N;$!ba;s/\n/ /g' >> <STRIPPED_OCR_FILE>
> see http://stackoverflow.com/a/800644/2896744 for more info
> From: Code for Libraries On Behalf Of Matt Sherman
> Sent: Monday, August 03, 2015 10:29 PM
> Subject: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
> Hi Code4Lib folks,
> I was wondering if anyone had some experience cleaning up OCR text.
> Particularly I am trying to figure out how I can deal with the random
> line breaks that come from OCR. I am trying to parse out a
> bibliography with regex. I think I've figured out which queries I
> need to run to break it up so I can make it into a tab delimited text
> file but I noticed that the text does the classic thing of OCR
> inserting line breaks where they physically are on the page. This
> will obviously be a bit of an issue since it would break the
> annotation into a bunch of lines rather than leaving it one block so I
> can manipulate it into a database. So I am wondering if anyone who
> has worked with OCR text before has a suggested way to clean up those
> line breaks without doing 300+ pages by hand? Any thoughts would be
> Matt Sherman