It's not exactly what you're looking for, but Microsoft Office comes
with a scriptable OCR engine that works on TIFFs. I use it to extract
text from yearbooks we are scanning so people can search for names and
such. While I wouldn't put it on par with ABBYY, it does a pretty decent
job. I wrote a simple script in VBScript that scans all the TIFF files
in a folder and, for each image, exports a .txt file with the same name
containing all of the text it finds. If you want it, let me know and
I'll send it your way.
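For anyone who just wants the shape of that batch job, here is a rough sketch in Python rather than the VBScript mentioned above. It is not the actual script. The names `ocr_folder` and `ocr_image` are made up for illustration, and `ocr_image` stands in for whatever OCR engine you have available (the Office one or anything else); nothing here calls a real OCR API.

```python
from pathlib import Path
from typing import Callable, List


def ocr_folder(folder: str, ocr_image: Callable[[Path], str]) -> List[Path]:
    """Write one .txt file per TIFF in `folder`, named after the image.

    `ocr_image` is a hypothetical callable supplied by the caller: it
    takes the path of one image and returns the recognized text.
    """
    written = []
    for image in sorted(Path(folder).iterdir()):
        # Only process TIFF images; skip everything else in the folder.
        if image.suffix.lower() not in {".tif", ".tiff"}:
            continue
        text = ocr_image(image)
        # page1.tif -> page1.txt, next to the source image.
        out = image.with_suffix(".txt")
        out.write_text(text, encoding="utf-8")
        written.append(out)
    return written
```

Called as `ocr_folder("scans", my_ocr_function)`, it returns the list of text files it wrote, which makes it easy to feed them to an indexer afterwards.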
Mike Beccaria
Systems Librarian
Head of Digital Initiatives
Paul Smith's College
518.327.6376
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
James Tuttle
Sent: Friday, October 17, 2008 7:57 AM
To: [log in to unmask]
Subject: [CODE4LIB] OCR PDFs
I wonder if any of you might have experience with creating text PDFs
from TIFFs. I've been using tiffcp to stitch TIFFs together into a
single image and then using tiff2pdf to generate PDFs from the single
TIFF. I've had to pass this image-based PDF to someone with Acrobat to
use its batch processing facility to OCR the text and save a text-based
PDF. I wonder if anyone has suggestions for software I can integrate
into the script (Python on Linux) I'm using.
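The tiffcp and tiff2pdf steps described above might look roughly like this when driven from Python. This is a sketch under assumptions, not the script in question: the function names and the `stem` parameter are invented for illustration, and it assumes the libtiff tools `tiffcp` and `tiff2pdf` are on the PATH. The PDF it produces is still image-only; the OCR step is not covered here.

```python
import subprocess
from typing import List, Sequence


def build_commands(pages: Sequence[str], stem: str) -> List[List[str]]:
    """Return the two command lines: combine the TIFFs, then make a PDF.

    `stem` is a hypothetical base name used for both output files.
    """
    combined = stem + ".tif"
    pdf = stem + ".pdf"
    return [
        # tiffcp takes the input files first and the output file last.
        ["tiffcp", *pages, combined],
        # tiff2pdf writes to stdout unless -o names an output file.
        ["tiff2pdf", "-o", pdf, combined],
    ]


def run_pipeline(pages: Sequence[str], stem: str) -> str:
    """Run both steps in order, stopping at the first failure."""
    for cmd in build_commands(pages, stem):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)
    return stem + ".pdf"
```

Keeping the command construction separate from the execution means the command lines can be checked without libtiff installed, and an OCR step can be slotted in later as one more command.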
Thanks,
James
--
James Tuttle
Digital Repository Librarian
NCSU Libraries, Box 7111
North Carolina State University
Raleigh, NC 27695-7111
(919)513-0651 Phone
(919)515-3031 Fax