I'm working on scanning some documents in a collection and then performing
OCR on the documents. Thus far, I've used Adobe Acrobat Pro's OCR function
with some success, but the machines I'm working on are fairly old Pentium 4
Dell boxes. This makes opening 600 DPI scans painful and performing OCR an
entirely valid excuse for a long coffee break.
As you might expect, I'm looking for a way to speed up this process at the
OCR end of things, since the scanning can only move so quickly. I'm
wondering if any of you have experience with any open OCR solutions such as
Tesseract-OCR <http://code.google.com/p/tesseract-ocr/> or
OCRopus <http://code.google.com/p/ocropus/>.
At a glance, Tesseract seems to be further along in development. Any other
suggestions on how best to approach this sort of task would be appreciated
if you've done similar work.
I'm planning to evaluate one or both of these on my own Ubuntu Server, as
much for my own interest as for the project's or the organization's
benefit. Since I'm an unpaid part-time intern and the only one working on
this project, I'm willing to learn to do things the hard way so they're
easier in the long run.
Thanks for any suggestions or advice you may be able to offer.
--
~Andrew M. Kelly
MLIS Degree Candidate, Simmons GSLIS 2011
Archives & Librarianship Intern, Boston University: African Presidential
Archive & Research Center
Evening Library Assistant, Bay State College
twitter: @a_m_kelly