LISTSERV 16.5 - CODE4LIB Archives

It sounds like there are two sorts of things you need to clean up:

a) OCR errors

b) Formatting (like unnecessary line breaks)

For the former, I understand that Adobe Acrobat and ABBYY FineReader 
have built-in spellchecking tools.  PrimeOCR, an expensive OCR 
package, has a related package called PrimeVerify that does this.

If you don't have any of these, you could simply open the OCR output in 
a text editor with spellchecking to look for things to fix.  You could 
even copy and paste into Microsoft Word and use its spellchecker; you'd 
probably need to correct the source file in parallel as you review the 
copy in Word.
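
If opening every file by hand is too much, the same idea can be roughed 
out as a script.  This is only a sketch, assuming Python; the function 
name suspect_words and the known_words set are mine, and you would have 
to supply your own word list (including performer and composer names). 
It lists the words in a text that aren't in that list, so staff only 
look at the likely OCR errors.

```python
import re

def suspect_words(text, known_words):
    """Return lowercase words from text that are not in known_words.

    known_words is a set of lowercase words you trust (an assumption:
    you bring your own list). Each suspect is reported once, in the
    order it first appears.
    """
    suspects = []
    for word in re.findall(r"[A-Za-z']+", text):
        key = word.lower().strip("'")
        if key and key not in known_words and key not in suspects:
            suspects.append(key)
    return suspects
```

For example, suspect_words("The orchestia played", {"the", "played"}) 
reports only "orchestia".  It flags words; it does not correct them.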

As for formatting, this one is harder.  But instead of trying to solve 
that, I wonder if you're sure it's worth doing.  If you're only using 
the OCR to drive search of the scanned page images, why does it matter 
if there are some unnecessary line breaks in your OCR text?
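
If it does turn out to matter, here is a minimal sketch of a batch 
cleanup, again assuming Python.  The names unwrap_lines and clean_folder 
are hypothetical, and so are the assumptions that the files are UTF-8 
.txt files in one folder and that a blank line marks a real paragraph 
break.  It joins the lines inside each paragraph, rejoins words split 
by a hyphen at a line end, and writes cleaned copies to a separate 
folder so the originals stay untouched.

```python
import re
from pathlib import Path

def unwrap_lines(text):
    """Join line-wrapped text into paragraphs separated by one blank line."""
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words hyphenated across a line break: "sym-\nphony" -> "symphony".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Assumption: one or more blank lines separate real paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse every run of whitespace inside a paragraph to a single space.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)

def clean_folder(src_dir, dst_dir):
    """Write an unwrapped copy of every .txt file in src_dir to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        (dst / path.name).write_text(unwrap_lines(text), encoding="utf-8")
```

Two cautions: a performer list in a program has line breaks that mean 
something, and this would flatten them into one line; and the hyphen 
rule will also glue together genuinely hyphenated words that happen to 
break at a line end.  Try it on a few files before running it on all of 
them.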

Kevin

On 11/22/14 12:44 PM, scott bacon wrote:
> Erica,
>
> You may find what you need from OpenRefine: http://openrefine.org/
>
>
>
> On Fri, Nov 21, 2014 at 5:15 PM, Erica FINDLEY <[log in to unmask]> wrote:
>
>> Greetings,
>>
>> I am working on a project to digitize concert programs. These are the type
>> of programs you get when attending a musical concert that list performers
>> and details about the concert.
>>
>> Since these items are text heavy we have decided to use OCR software to
>> output a text file that will enable full text searching in our platform.
>>
>> These text files are for the most part accurate, but often have unnecessary
>> line breaks and pockets of extra characters and/or incorrect
>> capitalization. I would like to pretty them up a little bit if possible.
>>
>> I am wondering if there is a script I can use on multiple files to clean
>> these type of things up. I don't want to have the digitization staff
>> manually edit each text file or have to open each one to run a macro in a
>> text editor.
>>
>> I have been searching online and so far haven't found anything that will
>> work for my situation.
>>
>> thanks in advance,
>>
>> *Erica Findley*
>> Cataloging/Metadata Librarian
>> Multnomah County Library
>> Phone: 503.988.5466
>> [log in to unmask]
>> www.multcolib.org
>>