Ocropus actually uses Tesseract as its OCR engine (with the idea that
eventually you'll be able to plug other engines in), and adds the layout
analysis component to it. I've been using it to OCR old manual
typewriter pages and I've found it surprisingly good for that purpose.
It uses the hOCR standard for its output, which takes a little getting
used to (it's HTML with lots of positional markup), but it's easy to
convert to XML for further processing. My scripts call ImageMagick to
generate smaller images (300 dpi, grayscale) to feed into Ocropus.
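
For anyone who wants to try the same workflow, here is a rough sketch of
the two steps, written in Python. It is not the actual script described
above, only an illustration under some assumptions: the ImageMagick
`convert` options `-colorspace Gray` and `-resample 300` stand in for
whatever the real scripts pass (note that `-resample` relies on the scan
carrying resolution metadata), the hOCR is assumed to mark lines as
elements with class `ocr_line` and a `bbox x0 y0 x1 y1` entry in the
`title` attribute, and the output element names `page` and `line` are
made up for the example.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


def build_convert_command(src, dst, dpi=300):
    """Return an ImageMagick argument list for a grayscale, resampled copy.

    The list can be handed to subprocess.run(cmd, check=True).
    Assumes the source image records its scan resolution.
    """
    return ["convert", src, "-colorspace", "Gray", "-resample", str(dpi), dst]


class HocrLines(HTMLParser):
    """Collect (bbox, text) pairs for every ocr_line element."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0      # nesting depth inside the current line; 0 = outside
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return           # void tags never get a matching end tag
        attrs = dict(attrs)
        if self._depth:
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            match = BBOX.search(attrs.get("title") or "")
            self._bbox = tuple(int(g) for g in match.groups()) if match else None
            self._text = []
            self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._text).split())
            self.lines.append((self._bbox, text))

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


def hocr_to_xml(hocr):
    """Convert hOCR text into a flat <page><line bbox="...">...</line></page>."""
    parser = HocrLines()
    parser.feed(hocr)
    parser.close()
    root = ET.Element("page")    # hypothetical element names
    for bbox, text in parser.lines:
        line = ET.SubElement(root, "line")
        if bbox:
            line.set("bbox", " ".join(str(n) for n in bbox))
        line.text = text
    return ET.tostring(root, encoding="unicode")
```

Keeping the bounding box as an attribute preserves the positional
information from the hOCR, so later processing can still tell where each
line sat on the page.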

Peter



-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
Andy Kelly
Sent: Wednesday, July 28, 2010 9:47 AM
To: [log in to unmask]
Subject: [CODE4LIB] Free/Open OCR solutions?

I'm working on scanning some documents in a collection and then
performing OCR on the documents. Thus far, I've used Adobe Acrobat Pro's
OCR function with some success, but the machines I'm working on are
fairly old Pentium 4 Dell boxes, which makes opening 600 DPI scans
painful and performing OCR an entirely valid excuse for a long coffee
break.

As you might expect, I'm looking for a way to speed up this process at
the OCR end of things, since the scanning can only move so quickly. I'm
wondering if any of you have experience with any open OCR solutions such
as Tesseract-OCR <http://code.google.com/p/tesseract-ocr/> or
ocropus <http://code.google.com/p/ocropus/>. At a glance, Tesseract
seems to be further along in development. Any other suggestions on how
best to approach this sort of task would be appreciated if you've done
similar work.

I've got my own Ubuntu Server I'm planning on evaluating one or both of
these on, as much for my own interest as the project's or the
organization's. Since I'm an unpaid part-time intern and the only one
who's working on this project, I'm willing to learn to do things the
hard way so they're easier in the long run.

Thanks for any suggestions or advice you may be able to offer.

-- 
~Andrew M. Kelly
MLIS Degree Candidate, Simmons GSLIS 2011
Archives & Librarianship Intern, Boston University: African Presidential
Archive & Research Center
Evening Library Assistant, Bay State College
twitter: @a_m_kelly