Have you tried Aperture (http://aperture.sourceforge.net/)? It's a Java library for extracting content from various document formats, including PDF. It comes with command-line scripts that let you use it as a stand-alone utility. If performance is your main concern, this may not be the best option, since it's a heavier-duty tool than a simple PDF-only text extractor. But if you want to expand the number of formats you support, it's worth a look.
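
Since the trade-off here is speed versus breadth of formats, it may help to time each candidate on a sample of your own documents before committing. Below is a rough sketch of such a harness, not a tested recipe. The function names (time_extractor, cli_extractor) are made up for illustration, and it assumes the tool under test takes a file path as its last argument and writes the extracted text to standard output, which you would need to confirm for whichever script you try.

```python
import subprocess
import time
from typing import Callable, Dict, Iterable, Sequence


def time_extractor(extract: Callable[[str], str], paths: Iterable[str]) -> Dict[str, float]:
    """Run `extract` on each path; report documents, characters, failures, seconds."""
    results = {"docs": 0, "chars": 0, "failed": 0, "seconds": 0.0}
    start = time.perf_counter()
    for path in paths:
        try:
            text = extract(path)
        except Exception:
            # Count the failure and keep going so one bad PDF does not end the run.
            results["failed"] += 1
            continue
        results["docs"] += 1
        results["chars"] += len(text)
    results["seconds"] = time.perf_counter() - start
    return results


def cli_extractor(command: Sequence[str]) -> Callable[[str], str]:
    """Wrap a command-line tool as an extractor function.

    Assumption: the tool accepts the input path as its final argument and
    prints the extracted text to stdout. Adjust for the tool you evaluate.
    """
    def extract(path: str) -> str:
        done = subprocess.run(
            list(command) + [path],
            capture_output=True,
            text=True,
            check=True,  # a non-zero exit status is treated as a failed extraction
        )
        return done.stdout

    return extract
```

You would call it as time_extractor(cli_extractor(["your-extractor-command"]), list_of_pdf_paths), where the command name is a placeholder, and compare seconds and failed across tools. Process start-up time is included in the measurement, which penalizes JVM-based command-line tools on small files, so an in-process comparison may give a different picture.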

- Demian

> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Owen Stephens
> Sent: Tuesday, June 21, 2011 10:24 AM
> To: [log in to unmask]
> Subject: [CODE4LIB] PDF->text extraction
> 
> The CORE project at The Open University in the UK is doing some work on
> finding similarity between papers in institutional repositories (see
> http://core-project.kmi.open.ac.uk/ for more info).  The first step in
> the process is extracting text from the (mainly) PDF documents
> harvested from repositories.
> 
> We've tried iText but had issues with quality.
> We moved to PDFBox but are having performance issues.
> 
> Any other suggestions/experience?
> 
> Thanks,
> 
> Owen
> 
> Owen Stephens
> Owen Stephens Consulting
> Web: http://www.ostephens.com
> Email: [log in to unmask]
> Telephone: 0121 288 6936