I'm using Docsplit (http://documentcloud.github.com/docsplit/), mainly
because of its Ruby bindings. It falls back to OCR when it cannot extract
the text directly, but it also requires you to install several other
(open source) tools. The results have looked fine to me so far.
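
In case it helps, here is a rough sketch of how I would call it from Ruby. It assumes the docsplit gem and its external dependencies are installed. The two helper methods and the example paths are made up for illustration; only Docsplit.extract_text and its :output and :ocr options come from the Docsplit documentation, so please check them against the version you install.

```ruby
# Sketch only: assumes the docsplit gem and its dependencies are installed.
begin
  require 'docsplit'
rescue LoadError
  # Gem not available; extract_text below raises if it is called.
end

# Hypothetical helper (not part of Docsplit): builds the options hash
# that is passed on to Docsplit.extract_text.
def docsplit_text_options(output_dir, ocr: nil)
  options = { :output => output_dir }
  # Leaving :ocr unset keeps Docsplit's default behaviour (direct
  # extraction, OCR as a fallback); true or false forces OCR on or off.
  options[:ocr] = ocr unless ocr.nil?
  options
end

# Hypothetical wrapper: writes the extracted text for each PDF
# into output_dir.
def extract_text(pdf_paths, output_dir, ocr: nil)
  raise 'docsplit gem is not available' unless defined?(Docsplit)
  Docsplit.extract_text(pdf_paths, docsplit_text_options(output_dir, ocr: ocr))
end
```

For a directory of harvested files this would be called as something like extract_text(Dir['papers/*.pdf'], 'out/text'). Passing ocr: false skips OCR entirely, which may be worth trying if speed matters more than coverage of scanned documents.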
On 21.06.2011 16:23, Owen Stephens wrote:
> The CORE project at The Open University in the UK is doing some work on finding similarity between papers in institutional repositories (see http://core-project.kmi.open.ac.uk/ for more info). The first step in the process is extracting text from the (mainly) PDF documents harvested from repositories.
> We've tried iText but had issues with quality.
> We moved to PDFBox but are having performance issues.
> Any other suggestions/experience?
> Owen Stephens
> Owen Stephens Consulting
> Web: http://www.ostephens.com
> Telephone: 0121 288 6936