Thanks everyone for your ideas and suggestions. There are many things I am
going to take a look at here and perhaps this is a good time for me to
learn some regular expressions.
I also want to respond regarding my desire to clean up the formatting of
the OCR data (line breaks, junk characters, spacing, etc.). In our current
web platform for digital objects I input the OCR text into a field (either
manually or by batch import). Having clean formatting without line breaks
or extra characters will make the data in that field more portable. This
data may be exported, harvested, and/or eventually migrated. I figured that
getting the extra stuff out now would save some headaches later. Having it
look nice to humans is a plus.
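
To illustrate the kind of cleanup described above, here is a minimal sketch in Python using the standard `re` module. It is only an illustration, not a settled solution: the function name `clean_ocr_text`, the set of characters treated as "junk", and the rule for rejoining words hyphenated across line breaks are all assumptions that would need tuning against real OCR output.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Flatten OCR text into a single clean line (illustrative sketch only)."""
    # Rejoin words split by a hyphen at a line break ("jum-\nped" -> "jumped").
    # Assumption: every end-of-line hyphen is a soft hyphen. Genuine compounds
    # such as "well-\nknown" would be wrongly merged into "wellknown".
    text = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)

    # Drop characters outside an assumed whitelist of letters, digits,
    # whitespace, and common punctuation. Adjust the set to suit the collection.
    text = re.sub(r"[^\w\s.,;:!?'\"()\-&/$%]", "", text)

    # Collapse every run of whitespace, including line breaks, to one space.
    text = re.sub(r"\s+", " ", text)

    return text.strip()


if __name__ == "__main__":
    sample = "The quick brown fox jum-\nped over the   lazy\ndog. ~~|\n"
    print(clean_ocr_text(sample))
```

Running the cleanup before loading text into the field keeps the stored value free of layout artifacts, so later exports or migrations do not have to repeat the work. The original OCR files should be kept in case the rules need revising.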
Thanks again! I will share the solution I implement when I get there.
Multnomah County Library
[log in to unmask]
On Mon, Nov 24, 2014 at 8:43 AM, Kyle Banerjee <[log in to unmask]> wrote:
> > As for formatting, this one is harder. But instead of trying to solve
> > that, I wonder if you're sure it's worth doing. If you're only using the
> > OCR to drive search of the scanned page images, why does it matter if
> > there are some unnecessary line breaks in your OCR text?
> For simple keyword searches, it wouldn't. However if phrase or entity
> extraction is an issue, it would be beneficial to remove them. Regex
> strikes me as a quick and easy way to accomplish this on a large number of