It may depend on the format of the PDF, but I’ve used the ScraperWiki Python module’s ‘pdftoxml’ function to extract text data from PDFs in the past. There is a write-up (not by me) at http://schoolofdata.org/2013/08/16/scraping-pdfs-with-python-and-the-scraperwiki-module/, and an example of how I’ve used it at https://github.com/ostephens/british_library_directory_of_library_codes/blob/master/scraper.py
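
As a rough starting point, here is a minimal sketch of that approach. It assumes the scraperwiki module is installed and that poppler’s pdftohtml (which pdftoxml calls under the hood) is on your PATH; “bibliography.pdf” is just a placeholder filename:

import scraperwiki
import lxml.etree

# "bibliography.pdf" is a placeholder - substitute your own scanned file
with open("bibliography.pdf", "rb") as f:
    pdfdata = f.read()

# pdftoxml shells out to poppler's pdftohtml and returns XML with one
# <text> element per line of the PDF, including position attributes
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.etree.fromstring(xmldata.encode("utf-8"))

for page in root.findall("page"):
    for line in page.findall("text"):
        # itertext() flattens any inline <b>/<i> markup within the line
        text = "".join(line.itertext())
        print(line.get("top"), line.get("left"), text)

The top/left coordinates are the useful part: they let you distinguish hanging indents, headings and page furniture, which helps when deciding where one bibliography entry ends and the next begins.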
Owen
Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [log in to unmask]
Telephone: 0121 288 6936
> On 18 Jun 2015, at 17:02, Matt Sherman <[log in to unmask]> wrote:
>
> Hi Code4Libbers,
>
> I am working with a colleague on a side project that involves taking some
> scanned bibliographies and making them more web searchable/sortable/browse-able.
> While I am quite familiar with the metadata and organization aspects we
> need, I am at a bit of a loss on how to automate the process of putting
> the bibliography into a more structured format so that we can avoid going
> through hundreds of pages by hand. I am pretty sure regular expressions
> are needed, but I have not had an instance where I needed to automate
> extracting data from one file type (PDF OCR or text extracted to a Word
> doc) and placing it into another (either a database or an XML file) with
> some enrichment. I would appreciate any suggestions for approaches or
> tools to look into. Thanks for any help/thoughts people can give.
>
> Matt Sherman