I'm using Docsplit (http://documentcloud.github.com/docsplit/) because of its Ruby bindings. It falls back to OCR when it fails to extract the text directly, but it also requires you to install a bunch of other (open source) software. The results have seemed fine to me so far.

Best,
Andreas

On 21.06.2011 16:23, Owen Stephens wrote:
> The CORE project at The Open University in the UK is doing some work on finding similarity between papers in institutional repositories (see http://core-project.kmi.open.ac.uk/ for more info). The first step in the process is extracting text from the (mainly) PDF documents harvested from repositories.
>
> We've tried iText but had issues with quality.
> We moved to PDFBox but are having performance issues.
>
> Any other suggestions/experience?
>
> Thanks,
>
> Owen
>
> Owen Stephens
> Owen Stephens Consulting
> Web: http://www.ostephens.com
> Telephone: 0121 288 6936