It may depend on the format of the PDF, but I’ve used the ScraperWiki Python module’s ‘pdftoxml’ function to extract text data from PDFs in the past. There is a write-up (not by me) at http://schoolofdata.org/2013/08/16/scraping-pdfs-with-python-and-the-scraperwiki-module/, and an example of how I’ve used it at https://github.com/ostephens/british_library_directory_of_library_codes/blob/master/scraper.py

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [log in to unmask]
Telephone: 0121 288 6936

> On 18 Jun 2015, at 17:02, Matt Sherman <[log in to unmask]> wrote:
>
> Hi Code4Libbers,
>
> I am working with a colleague on a side project which involves some scanned
> bibliographies and making them more web searchable/sortable/browse-able.
> While I am quite familiar with the metadata and organization aspects we
> need, I am at a bit of a loss on how to automate the process of putting
> the bibliography into a more structured format so that we can avoid going
> through hundreds of pages by hand. I am pretty sure regular expressions
> are needed, but I have not had an instance where I needed to automate
> extracting data from one file type (PDF OCR, or text extracted to a Word doc)
> and place it into another (either a database or an XML file) with some
> enrichment. I would appreciate any suggestions for approaches or tools to
> look into. Thanks for any help/thoughts people can give.
>
> Matt Sherman
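
Once you have plain text out of the PDF, the regex-to-structured-data step Matt describes can be sketched roughly like this. This is only an illustrative sketch using the Python standard library: the sample entries and the pattern are made-up assumptions, and any real bibliography would need its own pattern tuned to how the entries are actually laid out (and to whatever OCR noise is present).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample of OCR-extracted bibliography text; a real file
# would be read from disk after PDF text extraction.
text = """Smith, J. (1998). A History of Libraries. London: Example Press.
Jones, A. (2001). Cataloguing in Practice. Oxford: Sample Books."""

# A deliberately simple pattern for "Author (Year). Title." style entries.
entry_re = re.compile(
    r"^(?P<author>[^(]+)\((?P<year>\d{4})\)\.\s*(?P<title>[^.]+)\.",
    re.MULTILINE,
)

# Build an XML document with one <entry> element per matched citation.
root = ET.Element("bibliography")
for m in entry_re.finditer(text):
    entry = ET.SubElement(root, "entry")
    ET.SubElement(entry, "author").text = m.group("author").strip().rstrip(",")
    ET.SubElement(entry, "year").text = m.group("year")
    ET.SubElement(entry, "title").text = m.group("title").strip()

print(ET.tostring(root, encoding="unicode"))
```

The same loop could just as easily write rows into a database instead of XML; the hard part in practice is usually iterating on the pattern until it copes with the inconsistencies in the scanned source.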