The most widely used open source software for this (and many other MIME types) is Apache Tika.
From: Code for Libraries [[log in to unmask]] on behalf of Bill Janssen [[log in to unmask]]
Sent: Tuesday, 21 June 2011 19:19
To: [log in to unmask]
Subject: Re: [CODE4LIB] PDF->text extraction

Owen Stephens <[log in to unmask]> wrote:

> The CORE project at The Open University in the UK is doing some work on finding similarity between papers in institutional repositories (see for more info). The first step in the process is extracting text from the (mainly) PDF documents harvested from repositories.
> We've tried iText but had issues with quality.
> We moved to PDFBox but are having performance issues.
> Any other suggestions/experience?

UpLib uses xpdf's pdftotext, which works well.  There's also code in
UpLib to find similarities between papers :-).
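
For anyone who wants to try pdftotext from a script, here is a minimal
sketch in Python. It is not how UpLib calls it; the function names
(build_pdftotext_command, extract_text) and the 120-second timeout are
my own assumptions for illustration. It only assumes a pdftotext binary
(from xpdf or Poppler) is on your PATH.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, layout=False):
    """Build the argument list for one pdftotext run that writes to stdout.

    Hypothetical helper, not part of xpdf or UpLib.
    """
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # -layout asks pdftotext to preserve the physical page layout,
        # which can help with tables but may hurt multi-column papers.
        cmd.append("-layout")
    # A text-file argument of "-" sends the extracted text to stdout.
    cmd += [str(pdf_path), "-"]
    return cmd


def extract_text(pdf_path, layout=False, timeout=120):
    """Return the text of one PDF; raise if pdftotext is missing or fails."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,        # raises CalledProcessError on a non-zero exit
        timeout=timeout,   # assumed limit so one bad PDF cannot stall a batch
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("paper.pdf") should give you the text as one
string. Running one process per file also makes it easy to skip or
retry a PDF that fails or times out when working through a large
harvest.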