That is a pretty good summation of it, yes. I appreciate the suggestions.
This is a bit of a new realm for me, and while I know what I want it to do
and the structure I want to put it in, the conversion process has been
eluding me, so thanks for giving me some tools to look into.

On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan wrote:

> On Jun 18, 2015, at 12:02 PM, Matt Sherman wrote:
>
> > I am working with a colleague on a side project which involves some
> > scanned bibliographies and making them more web
> > searchable/sortable/browse-able. While I am quite familiar with the
> > metadata and organization aspects we need, I am at a bit of a loss on
> > how to automate the process of putting the bibliography in a more
> > structured format so that we can avoid going through hundreds of
> > pages by hand. I am pretty sure regular expressions are needed, but I
> > have not had an instance where I need to automate extracting data
> > from one file type (PDF OCR or text extracted to Word doc) and
> > placing it into another (either a database or an XML file) with some
> > enrichment. I would appreciate any suggestions for approaches or
> > tools to look into. Thanks for any help/thoughts people can give.
>
> If I understand your question correctly, then you have two problems to
> address: 1) converting PDF, Word, etc. files into plain text, and 2)
> marking up the result (which is a bibliography) into structured data.
> Correct?
>
> If so, then if your PDF documents have already been OCRed, or if you
> have other files, then you can probably feed them to TIKA to quickly
> and easily extract the underlying plain text. [1] I wrote a brain-dead
> shell script to run TIKA in server mode and then convert Word (.docx)
> files. [2]
>
> When it comes to marking up the result into structured data, well,
> good luck. I think such an application is something Library Land has
> sought for a long time. "Can you say Holy Grail?"
>
> [1] Tika - https://tika.apache.org
> [2] brain-dead script - https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
>
> --
> Eric
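
For the second problem (marking up the plain text), here is a minimal sketch of the regular-expression approach in Python. It is an illustration only, not something from Eric's script. It assumes each entry sits on one line in a made-up layout, "Authors. (Year). Title. Publisher."; the pattern, the element names, and the helpers parse_entry and to_xml are all hypothetical and would need adapting to what the OCR text really looks like.

```python
import re
import xml.etree.ElementTree as ET

# Assumed (hypothetical) entry layout: "Authors. (Year). Title. Publisher."
# Real bibliographies vary a lot, so expect to adjust this pattern.
ENTRY_RE = re.compile(
    r"^(?P<authors>.+?)\.\s+"
    r"\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<publisher>.+?)\.?$"
)


def parse_entry(line):
    """Return a dict of fields, or None if the line does not fit the pattern."""
    match = ENTRY_RE.match(line.strip())
    return match.groupdict() if match else None


def to_xml(lines):
    """Build a <bibliography> element; collect lines that did not match."""
    root = ET.Element("bibliography")
    unmatched = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = parse_entry(line)
        if fields is None:
            unmatched.append(line)  # keep for review by hand
            continue
        entry = ET.SubElement(root, "entry")
        for name, value in fields.items():
            ET.SubElement(entry, name).text = value
    return root, unmatched


if __name__ == "__main__":
    sample = [
        "Smith, John. (1999). A History of Things. Example Press.",
        "garbled OCR line",
    ]
    root, unmatched = to_xml(sample)
    print(ET.tostring(root, encoding="unicode"))
    print("needs manual review:", unmatched)
```

Lines that do not match are set aside rather than guessed at, so only the leftovers would need to be checked by hand.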