The most widely used open source software for this (and many other MIME types) is Tika: http://tika.apache.org/

________________________________________
From: Code for Libraries on behalf of Bill Janssen
Sent: Tuesday, 21 June 2011 19:19
Subject: Re: [CODE4LIB] PDF->text extraction

Owen Stephens wrote:

> The CORE project at The Open University in the UK is doing some work on finding similarity between papers in institutional repositories (see http://core-project.kmi.open.ac.uk/ for more info). The first step in the process is extracting text from the (mainly) pdf documents harvested from repositories
>
> We've tried iText but had issues with quality
> We moved to PDFBox but are having performance issues
>
> Any other suggestions/experience?

UpLib uses xpdf's pdftotext, which works well. There's also code in UpLib to find similarities between papers :-).

Bill
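
For anyone who wants to try the pdftotext route mentioned in this thread, here is a rough Python sketch of calling it from a harvesting script. It is only an illustration, not how UpLib or CORE actually do it: the function names `pdftotext_command` and `extract_text` are made up for this example, and it assumes the `pdftotext` binary (from xpdf or Poppler) is installed and on the PATH.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, layout=False):
    """Build the pdftotext argument list (hypothetical helper for this sketch)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]  # ask for UTF-8 output
    if layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    # A text-file argument of "-" sends the extracted text to stdout.
    cmd += [str(pdf_path), "-"]
    return cmd


def extract_text(pdf_path, layout=False, timeout=60):
    """Run pdftotext on one PDF and return its text as a string.

    Assumes pdftotext is installed; raises RuntimeError otherwise.
    The timeout guards against a single bad PDF stalling a batch run.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit
        timeout=timeout,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Something like `text = extract_text("paper.pdf")` would then give the plain text for one harvested document. Running the extractor as a separate process per file also means a crash on one malformed PDF does not take down the whole harvest.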