Have you tried Aperture (http://aperture.sourceforge.net/)? It's a Java library for extracting content from various document formats, including PDF. It comes with command-line scripts that allow you to use it as a stand-alone utility. If performance is your main concern, this may not be the best option, since it's a heavier-duty tool than a simple PDF-only text extractor... but if you want to expand the number of formats you support, it's worth a look.

- Demian

> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Owen Stephens
> Sent: Tuesday, June 21, 2011 10:24 AM
> To: [log in to unmask]
> Subject: [CODE4LIB] PDF->text extraction
>
> The CORE project at The Open University in the UK is doing some work on
> finding similarity between papers in institutional repositories (see
> http://core-project.kmi.open.ac.uk/ for more info). The first step in
> the process is extracting text from the (mainly) pdf documents
> harvested from repositories
>
> We've tried iText but had issues with quality
> We moved to PDFBox but are having performance issues
>
> Any other suggestions/experience?
>
> Thanks,
>
> Owen
>
> Owen Stephens
> Owen Stephens Consulting
> Web: http://www.ostephens.com
> Email: [log in to unmask]
> Telephone: 0121 288 6936