LISTSERV 16.5 - CODE4LIB Archives

Hi Erica,

We are working on a similar project, converting concert performances
from the past 20 years for our School of Music. Though we use simple
OCR for PDFs (to support full-text searching), we are selectively
cleaning up the OCR for metadata purposes. That is, we take the first
page of each PDF, extract the text, and convert that text to titles
and dates. We use simple regular expressions to remove line breaks and
extra whitespace.
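
A minimal sketch of this kind of regular-expression cleanup, run over a
whole folder so nobody has to open files one at a time. It assumes
Python 3 and a folder of UTF-8 .txt OCR output; the function names and
folder names are hypothetical, and treating a blank line as a paragraph
break is an assumption that may not suit every program layout:

```python
import re
from pathlib import Path


def clean_ocr_text(text: str) -> str:
    """Remove stray line breaks and extra whitespace from OCR text."""
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: one or more blank lines mark a real paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, collapse every run of whitespace (including
    # single line breaks) into one space.
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


def clean_folder(src_dir, dest_dir) -> int:
    """Clean every .txt file in src_dir, writing results to dest_dir.

    Returns the number of files processed. Originals are left untouched.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(src_dir).glob("*.txt")):
        raw = path.read_text(encoding="utf-8", errors="replace")
        (dest / path.name).write_text(clean_ocr_text(raw), encoding="utf-8")
        count += 1
    return count
```

Calling clean_folder("ocr_raw", "ocr_clean") (hypothetical folder
names) would write cleaned copies to a separate folder, so the results
can be spot-checked against the originals before anything is replaced.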

Here are our working guidelines: http://bit.ly/1v0c7w2
Perhaps something there could be of help to you.

Best of luck with your project!

kind regards,
Monica

Quoting Kevin Hawkins <[log in to unmask]>:

> It sounds like there are two sorts of things you need to clean up:
>
> a) OCR errors
>
> b) Formatting (like unnecessary line breaks)
>
> For the former, I understand that Adobe Acrobat and ABBYY FineReader
> have spellchecking tools built in.  PrimeOCR, an expensive OCR
> package, has a related package called PrimeVerify that does this.
>
> If you don't have any of these, you could simply open the OCR output  
> in a text editor with spellchecking to look for things to fix.  You  
> could even copy and paste into Microsoft Word and use its  
> spellchecker; you'd probably need to correct the source file in  
> parallel to scanning it in Word.
>
> As for formatting, this one is harder.  But instead of trying to  
> solve that, I wonder if you're sure it's worth doing.  If you're  
> only using the OCR to drive search of the scanned page images, why  
> does it matter if there are some unnecessary line breaks in your OCR  
> text?
>
> Kevin
>
> On 11/22/14 12:44 PM, scott bacon wrote:
>> Erica,
>>
>> You may find what you need from OpenRefine: http://openrefine.org/
>>
>>
>>
>> On Fri, Nov 21, 2014 at 5:15 PM, Erica FINDLEY <[log in to unmask]> wrote:
>>
>>> Greetings,
>>>
>>> I am working on a project to digitize concert programs. These are the type
>>> of programs you get when attending a musical concert that list performers
>>> and details about the concert.
>>>
>>> Since these items are text heavy we have decided to use OCR software to
>>> output a text file that will enable full text searching in our platform.
>>>
>>> These text files are for the most part accurate, but often have unnecessary
>>> line breaks and pockets of extra characters and/or incorrect
>>> capitalization. I would like to pretty them up a little bit if possible.
>>>
>>> I am wondering if there is a script I can use on multiple files to clean
>>> these types of things up. I don't want to have the digitization staff
>>> manually edit each text file or have to open each one to run a macro in a
>>> text editor.
>>>
>>> I have been searching online and so far haven't found anything that will
>>> work for my situation.
>>>
>>> thanks in advance,
>>>
>>> *Erica Findley*
>>> Cataloging/Metadata Librarian
>>> Multnomah County Library
>>> Phone: 503.988.5466
>>> [log in to unmask]
>>> www.multcolib.org
>>>


Digital Curation Coordinator
Digital Scholarship Services
Fondren Library, Rice University