Out of the box, it's hard to top ABBYY, but Tesseract is well worth investigating, especially if you are dealing with a large quantity of consistent images. The Tesseract community has created a very useful wiki [1], with especially good guidance on improving the quality of images that need to be OCRed [2], and there is some new neural-network-based plumbing that has great potential [3]. Tesseract also lets you do your own font training. I work with a non-profit called OurDigitalWorld that needed Inuktitut support for a publication called "Inuit Today", and we were able to create the supporting files to do the processing. You can use the same approach for special symbols in text (musical notation, etc.). If you combine Tesseract with other open source tools like ImageMagick (to prep images), Olena (to segment column-heavy media like newspapers), and Hadoop (if you are working with thousands or millions of pages), it can do a lot of heavy lifting.
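
As a rough sketch of the ImageMagick-plus-Tesseract combination, something like the following Python could prep a page and then OCR it. It assumes the `convert` (ImageMagick 6; newer installs may use `magick` instead) and `tesseract` executables are on the PATH. The function names, the `_clean.tif` naming, and the choice of grayscale plus deskew are illustrative assumptions, not a recommended recipe; the wiki page on improving quality [2] is the better guide to which cleanup steps actually help.

```python
import subprocess
from pathlib import Path


def prep_command(src, dst):
    """Build an ImageMagick command that converts a scan to grayscale and deskews it.

    Assumption: the IM6-style `convert` binary is available; the cleanup
    steps here are only an example.
    """
    return ["convert", str(src), "-colorspace", "Gray", "-deskew", "40%", str(dst)]


def ocr_command(image, out_base, lang="eng"):
    """Build a Tesseract command; Tesseract appends `.txt` to `out_base` itself."""
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_page(src, workdir, lang="eng"):
    """Clean one page image, OCR it, and return the path of the text output."""
    src, workdir = Path(src), Path(workdir)
    cleaned = workdir / (src.stem + "_clean.tif")  # hypothetical naming scheme
    out_base = workdir / src.stem
    subprocess.run(prep_command(src, cleaned), check=True)
    subprocess.run(ocr_command(cleaned, out_base, lang), check=True)
    return Path(str(out_base) + ".txt")


if __name__ == "__main__":
    import sys

    # Usage: python ocr_page.py scan.jpg output_dir [lang]
    if len(sys.argv) >= 3:
        print(ocr_page(*sys.argv[1:4]))
```

The `lang` argument is where a custom-trained language file would be named, provided the matching traineddata file is installed where Tesseract can find it.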

art
---
1. https://github.com/tesseract-ocr/tesseract/wiki
2. https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
3. https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM

-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Will Martin
Sent: Wednesday, July 19, 2017 1:14 PM
To: [log in to unmask]
Subject: [CODE4LIB] OCR software

All,

What are you all using for OCR software?  How well does it work for you?  Do you find that you need to scan at a particular resolution to get optimal OCR results, or do you find yourself doing post-processing on the images before OCR'ing them?  What have your experiences been like?

In the past, we've just used the built-in OCR in Adobe Acrobat Pro.  But we're looking at doing a bunch more digitization than we have before, and I just want to take stock of what's out there and see if that's an acceptable solution or if there's something else we should consider.

Thanks!

Will Martin

Head of Digital Initiatives, Systems & Services
Chester Fritz Library
University of North Dakota