In our testing, OCR effectiveness depends heavily on the era of the type and on how clean the original is, since artifacts trip up recognition. A 1940s typewritten carbon copy is not going to do nearly as well as a clean printed document in Times New Roman.

We ran tests on some art history theses, mostly from the late 20th century, comparing Adobe Acrobat, ABBYY FineReader, ScandAll PRO, and Google’s Tesseract. Acrobat, ABBYY, and ScandAll gave nearly identical results (Tesseract was a joke). ScandAll was useful because we were batch scanning on a Fujitsu scanner and it could scan, create the PDF, and OCR all at once. ABBYY Recognition Server can also automate the process once you have the scans made.
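
For anyone who wants to put numbers on a comparison like this, one common measure is character error rate: the edit distance between a hand-corrected transcription and the OCR output, divided by the length of the transcription. The sketch below is only an illustration, not necessarily how the tests above were scored. The function names are hypothetical and the sample strings are made up.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def character_error_rate(ground_truth: str, ocr_text: str) -> float:
    """Edit distance divided by the length of the corrected transcription."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_text) / len(ground_truth)


if __name__ == "__main__":
    # Made-up sample strings, purely for illustration.
    truth = "art history thesis"
    sample_outputs = {
        "engine_a": "art history thesis",
        "engine_b": "arl hislory thesls",
    }
    for name, text in sample_outputs.items():
        print(name, character_error_rate(truth, text))
```

Scoring the same handful of pages from each package against one corrected transcription keeps the comparison fair, since every engine is measured on identical input.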

We also tested scanning in RGB versus black and white, with and without post-processing. We got much better results scanning in color and batch converting to greyscale, which left far fewer artifacts to trip up the OCR. We scanned at 400 dpi but also ran OCR on 72 dpi images and got similar results.
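
To show what the color-to-greyscale step does to each pixel, here is a minimal sketch in plain Python. It assumes the widely used ITU-R BT.601 luma weights (0.299, 0.587, 0.114). The software used in the tests above may weight the channels differently, and the function name is hypothetical. In practice you would use an imaging tool for the batch conversion rather than looping over pixels like this.

```python
def to_greyscale(pixels):
    """Convert rows of (R, G, B) tuples into rows of 0-255 grey values.

    Uses integer BT.601 luma weights (an assumption, see above) and
    rounds to the nearest whole grey level.
    """
    return [
        [(299 * r + 587 * g + 114 * b + 500) // 1000 for (r, g, b) in row]
        for row in pixels
    ]


if __name__ == "__main__":
    # A tiny 2x2 "scan": white, black, pure red, pure green.
    scan = [
        [(255, 255, 255), (0, 0, 0)],
        [(255, 0, 0), (0, 255, 0)],
    ]
    print(to_greyscale(scan))
```

A color scan keeps three channels of information per pixel, so the conversion can blend them into a smooth grey value instead of the scanner making a hard black-or-white decision at capture time.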


TL;DR: Acrobat works just as well as ABBYY and ScandAll PRO. Tesseract doesn’t.


 

-- 
Kerry Bannen, Digitization Workflow Technician
Digital Production Center
University of North Carolina at Chapel Hill
(919) 962-1334

On 7/19/17, 1:13 PM, "Code for Libraries on behalf of Will Martin" wrote:

    All,
    
    What are you all using for OCR software?  How well does it work for you?
    Do you find that you need to scan at a particular resolution to get optimal
    OCR results, or do you find yourself doing post-processing on the images
    before OCR'ing them?  What have your experiences been like?
    
    In the past, we've just used the built-in OCR in Adobe Acrobat Pro.  But 
    we're looking at doing a bunch more digitization than we have before, 
    and I just want to take stock of what's out there and see if that's an 
    acceptable solution or if there's something else we should consider.
    
    Thanks!
    
    Will Martin
    
    Head of Digital Initiatives, Systems & Services
    Chester Fritz Library
    University of North Dakota