PDFBox (and hence Tika) gets worse the more recent the version of the PDF
format you feed it. One fun trick it can do is get a tad confused and think
there are control characters in extracted metadata fields. Great fun when
those characters are then inserted into an XML CMIS response. (Why do I
seem to do most of my debugging with Wireshark?)
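A minimal sketch of a cleanup step for that problem. The class and method
names are mine, not anything from Tika or PDFBox; it just drops the code
points the XML 1.0 spec forbids before the metadata goes into a response:

```java
// Strip characters that are illegal in XML 1.0 from extracted metadata
// before embedding it in an XML (e.g. CMIS) response.
public class XmlSanitizer {
    // XML 1.0 allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD,
    // and #x10000-#x10FFFF; everything else gets dropped.
    public static String stripIllegalXmlChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            boolean ok = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (ok) out.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "Title\u0000 with\u0007 stray\u001B controls";
        System.out.println(stripIllegalXmlChars(dirty));
    }
}
```

Belt-and-braces, but cheaper than finding the bad bytes in a packet capture.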
Usually they're not that ill-behaved, but PDFBox really needs to be run out
of process.
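One way to isolate a flaky converter is to shell out to it with a timeout,
so a hung or crashing parse can't take the JVM down with it. A sketch,
assuming a `pdftotext`-style command-line tool; the wrapper below is my
own, and the demo invokes `echo` so it runs anywhere:

```java
import java.util.concurrent.TimeUnit;

public class OutOfProcess {
    // e.g. runWithTimeout(60, "pdftotext", "-enc", "UTF-8", "in.pdf", "out.txt")
    public static int runWithTimeout(long seconds, String... cmd)
            throws Exception {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (!p.waitFor(seconds, TimeUnit.SECONDS)) {
            p.destroyForcibly();  // kill the stuck converter
            throw new RuntimeException("timed out: " + String.join(" ", cmd));
        }
        return p.exitValue();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a real converter invocation.
        System.out.println(runWithTimeout(10, "echo", "hello"));
    }
}
```

A stuck process gets killed instead of wedging your extraction pipeline.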
There's always http://open.xerox.com/Services/PDF-to-XML (which comes from
Xerox's European research arm, and which is also xpdf-based).
SourceForge project page at http://sourceforge.net/projects/pdf2xml/
One option for licensed software is Adobe's PDF Library SDK
<http://www.adobe.com/devnet/pdf/library.html>, which may be available at an
educational discount.
Another option is to use the ABBYY FineReader SDK
<http://www.abbyy.com/ocr_sdk_linux/overview/>.
Annoyingly, the Linux version is one release behind the Windows SDK (which
has improved support for multi-core processing of a single document). Since
Owen's problem is embarrassingly parallel, multi-core tuning isn't as
useful as being able to run on a local cluster or regional grid. ABBYY
software tends to be a little pricey, but the results are usually very good.
ABBYY's OCR code seems to be a bit better than Adobe's if you're dealing
with non-searchable PDFs.
Kofax probably has a really good product in this space, but I expect the
academic discount brings the price down to only half your first-born child.
BTW, has anybody ever tried using something like MrBayes to see how well it
can create phylogeny trees for multiple versions of documents?
Or would that distract from compute time needed to process lamprey
#960,345? [Someone ought to produce graphs showing projected future storage
needs with and without short-read lamprey fragments.]
Simon
On Tue, Jun 21, 2011 at 3:16 PM, Bill Janssen <[log in to unmask]> wrote:
> Boheemen, Peter van <[log in to unmask]> wrote:
>
> > The most used open source software for this (and many other mime
> > types) is tika: http://tika.apache.org/
>
> While I'm sure it's widely used, it's also relatively immature. For
> PDF, it just punts to PDFBox (which is also relatively immature).
>
> The most widely used commercial package for extracting text from PDF,
> which does an excellent job, is probably TET, from pdflib.com. TET has
> lots of plug-ins for various contexts.
>
> Bill
>
> > ________________________________________
> > From: Code for Libraries [[log in to unmask]] on behalf of Bill Janssen [
> [log in to unmask]]
> > Sent: Tuesday, June 21, 2011 19:19
> > To: [log in to unmask]
> > Subject: Re: [CODE4LIB] PDF->text extraction
> >
> > Owen Stephens <[log in to unmask]> wrote:
> >
> > > The CORE project at The Open University in the UK is doing some work on
> finding similarity between papers in institutional repositories (see
> http://core-project.kmi.open.ac.uk/ for more info). The first step in the
> process is extracting text from the (mainly) pdf documents harvested from
> repositories
> > >
> > > We've tried iText but had issues with quality
> > > We moved to PDFBox but are having performance issues
> > >
> > > Any other suggestions/experience?
> >
> > UpLib uses xpdf's pdftotext, which works well. There's also code in
> > UpLib to find similarities between papers :-).
> >
> > Bill
>