I am working with a colleague on a side project that involves taking some
scanned bibliographies and making them more searchable, sortable, and
browsable on the web.
I am quite familiar with the metadata and organization aspects we need,
but I am at a bit of a loss on how to automate putting the bibliography
into a more structured format so that we can avoid going through hundreds
of pages by hand. I am pretty sure regular expressions are needed, but I
have not previously had to automate extracting data from one file type
(PDF OCR output, or text extracted to a Word doc) and placing it into
another (either a database or an XML file) with some enrichment. I would
appreciate any suggestions for approaches or tools to look into. Thanks
for any help or thoughts people can give.
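
To make the task concrete, here is a rough sketch in Python of the kind of
step I imagine: match one citation line with a regular expression and write
the captured fields out as XML. Everything specific in it is an assumption
for illustration only: the citation layout ("Author. Title. Place:
Publisher, Year."), the invented sample entry, the element names, and the
hypothetical names ENTRY_RE and entry_to_xml. Real OCR output will be
messier and will need more patterns plus manual review of whatever fails
to match.

```python
import re
import xml.etree.ElementTree as ET

# Assumed layout (not taken from our real data):
#   "Author. Title. Place: Publisher, Year."
ENTRY_RE = re.compile(
    r"^(?P<author>[^.]+)\.\s+"
    r"(?P<title>[^.]+)\.\s+"
    r"(?P<place>[^:]+):\s+"
    r"(?P<publisher>[^,]+),\s+"
    r"(?P<year>\d{4})\.?$"
)


def entry_to_xml(line):
    """Return an <entry> element for one citation line, or None if the
    line does not fit the assumed layout."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None  # leave unmatched lines for manual review
    entry = ET.Element("entry")
    # One child element per named group in the pattern.
    for field, value in match.groupdict().items():
        ET.SubElement(entry, field).text = value.strip()
    return entry


# Invented sample entry, for illustration only.
sample = "Smith, John. A History of Maps. London: Example Press, 1923."
element = entry_to_xml(sample)
print(ET.tostring(element, encoding="unicode"))
```

Lines that return None could be collected in a separate file, so only the
entries that do not fit the pattern have to be handled by hand.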