On Dec 9, 2014, at 8:25 AM, Kyle Banerjee <[log in to unmask]> wrote:

> I've just started a project that involves harvesting large numbers of
> scanned PDFs and extracting information from the text of the OCR output.
> The process I've started with -- use ImageMagick to convert to TIFF and
> Tesseract to pull out the OCR -- is more system intensive than I hoped it
> would be.

I’m not quite sure I understand the question, but if all you want to do is pull the text out of an already OCR’ed PDF file, then I have found both Tika and pdftotext to be useful tools. [1, 2] Here’s a Perl script that takes a PDF as input and uses Tika to output the OCR’ed text:

  #!/usr/bin/perl

  # configure
  use constant TIKA => 'java -jar tika.jar -T ';

  # require
  use strict;
  use warnings;

  # initialize; make sure we were given a file that exists
  my $file = $ARGV[ 0 ];
  die "Usage: $0 <pdf-file>\n" unless defined $file and -e $file;

  # do the work; Tika writes the extracted text to STDOUT,
  # and system returns an exit status, so there is nothing to print
  my $cmd = TIKA . "'$file'";
  system $cmd;

  # done
  exit;

Tika can run in a server mode, making it more efficient for extracting the text from multiple files.
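
For example, after starting the server with something like 'java -jar tika-server.jar' (which, by default, listens on port 9998), a little Perl along the following lines ought to PUT a PDF to the server and get plain text back. This is only a sketch; it assumes LWP::UserAgent is installed and the server is running locally:

  #!/usr/bin/perl

  # require
  use strict;
  use warnings;
  use LWP::UserAgent;

  # sanity check
  my $file = $ARGV[ 0 ];
  die "Usage: $0 <pdf-file>\n" unless defined $file and -e $file;

  # slurp the PDF
  open my $handle, '<:raw', $file or die "Can't open $file: $!\n";
  my $pdf = do { local $/; <$handle> };
  close $handle;

  # PUT the PDF to the (assumed local) Tika server; ask for plain text
  my $agent    = LWP::UserAgent->new;
  my $response = $agent->put(
      'http://localhost:9998/tika',
      'Accept'  => 'text/plain',
      'Content' => $pdf
  );

  # output the extracted text, or complain
  die $response->status_line, "\n" unless $response->is_success;
  print $response->decoded_content;

  # done
  exit;

This way the Java virtual machine only gets started once, no matter how many files get processed.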

On the other hand, if you need to do the OCR itself, then employing Tesseract is probably the way to go. 
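
If it helps, here is a rough sketch of the convert-then-Tesseract pipeline you describe, wrapped in Perl. It assumes ImageMagick's convert and the tesseract binary are both on the path; the 300 dpi density is simply a common choice for OCR, not a magic number:

  #!/usr/bin/perl

  # require
  use strict;
  use warnings;

  # sanity check
  my $pdf = $ARGV[ 0 ];
  die "Usage: $0 <pdf-file>\n" unless defined $pdf and -e $pdf;

  # rasterize the PDF to a (multi-page) TIFF
  system( "convert -density 300 '$pdf' pages.tiff" ) == 0
    or die "convert failed: $?\n";

  # do the OCR; tesseract writes its output to pages.txt
  system( "tesseract pages.tiff pages" ) == 0
    or die "tesseract failed: $?\n";

  # output the result
  open my $handle, '<', 'pages.txt' or die "Can't open pages.txt: $!\n";
  print while <$handle>;
  close $handle;

  # done
  exit;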

[1] Tika - http://tika.apache.org
[2] pdftotext - http://www.foolabs.com/xpdf/download.html

—
ELM