> As for formatting, this one is harder. But instead of trying to solve
> that, I wonder if you're sure it's worth doing. If you're only using the
> OCR to drive search of the scanned page images, why does it matter if there
> are some unnecessary line breaks in your OCR text?
For simple keyword searches, it wouldn't. However, if phrase or entity
extraction matters, it would be beneficial to remove them. Regex
strikes me as a quick and easy way to accomplish this on a large number of