Have you tried Aperture (http://aperture.sourceforge.net/)? It's a Java library for extracting content from various document formats, including PDF, and it comes with command-line scripts that let you run it as a stand-alone utility. If performance is your main concern, it may not be the best option, since it's a heavier-duty tool than a simple PDF-only text extractor. But if you want to expand the number of formats you support, it's worth a look.
- Demian
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Owen Stephens
> Sent: Tuesday, June 21, 2011 10:24 AM
> To: [log in to unmask]
> Subject: [CODE4LIB] PDF->text extraction
>
> The CORE project at The Open University in the UK is doing some work on
> finding similarity between papers in institutional repositories (see
> http://core-project.kmi.open.ac.uk/ for more info). The first step in
> the process is extracting text from the (mainly) PDF documents
> harvested from repositories.
>
> We've tried iText but had issues with quality.
> We moved to PDFBox but are having performance issues.
>
> Any other suggestions/experience?
>
> Thanks,
>
> Owen
>
> Owen Stephens
> Owen Stephens Consulting
> Web: http://www.ostephens.com
> Email: [log in to unmask]
> Telephone: 0121 288 6936