It may depend on the format of the PDF, but I've used the 'pdftoxml' function in the ScraperWiki Python module to extract text data from PDFs in the past. There is a write-up (not by me) at http://schoolofdata.org/2013/08/16/scraping-pdfs-with-python-and-the-scraperwiki-module/, and an example of how I've used it at https://github.com/ostephens/british_library_directory_of_library_codes/blob/master/scraper.py
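
In case it helps, a minimal sketch of that approach (assuming Python 3 with the scraperwiki and lxml libraries installed, the pdftohtml utility available on the system, and a made-up URL standing in for the actual PDF):

import requests
import scraperwiki
import lxml.etree

# Hypothetical URL -- substitute the real location of the scanned bibliography
pdfdata = requests.get("http://example.com/bibliography.pdf").content

# pdftoxml shells out to pdftohtml and returns the page content as XML,
# with each fragment of text in a <text> element carrying position
# attributes (top, left, width, height)
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.etree.fromstring(xmldata.encode("utf-8"))

for page in root.iter("page"):
    for fragment in page.iter("text"):
        # The position attributes can help you spot where one bibliography
        # entry ends and the next begins
        print(fragment.attrib.get("top"), "".join(fragment.itertext()))

From there, a few regular expressions over the extracted strings might be enough to split each entry into fields before writing them out as XML or loading them into a database.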

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [log in to unmask]
Telephone: 0121 288 6936

> On 18 Jun 2015, at 17:02, Matt Sherman <[log in to unmask]> wrote:
> 
> Hi Code4Libbers,
> 
> I am working with a colleague on a side project which involves some scanned
> bibliographies and making them more web searchable/sortable/browse-able.
> While I am quite familiar with the metadata and organization aspects we
> need, I am at a bit of a loss on how to automate the process of putting
> the bibliography in a more structured format so that we can avoid going
> through hundreds of pages by hand.  I am pretty sure regular expressions
> are needed, but I have not had an instance where I need to automate
> extracting data from one file type (PDF OCR or text extracted to Word doc)
> and place it into another (either a database or an XML file) with some
> enrichment.  I would appreciate any suggestions for approaches or tools to
> look into.  Thanks for any help/thoughts people can give.
> 
> Matt Sherman