The CORE project at The Open University in the UK is working on finding similarity between papers in institutional repositories (see http://core-project.kmi.open.ac.uk/ for more info).

The first step in the process is extracting text from the (mainly) PDF documents harvested from repositories.

We've tried iText but had issues with quality.
We moved to PDFBox but are having performance issues.

Any other suggestions/experience?

Thanks,

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Telephone: 0121 288 6936