See also http://wiki.tei-c.org/index.php/Heuristics , which discusses this
problem more broadly conceived. I've just added a link to the archives of
this very discussion. --Kevin

On 6/18/15 12:52 PM, Matt Sherman wrote:
> The hope is to take these bibliographies and put them into more of a web
> searchable/sortable format for researchers to make use of. My colleague
> was taking some inspiration from the Marlowe Bibliography
> (https://marlowebibliography.org/), though we are hoping to get a bit
> more robust with the bibliography we are working on. The important first
> step is to be able to parse the existing OCRed bibliography scans we
> have into a database; a custom XML format is possible, but a database
> will probably be easier to append to and expand down the road.
>
> On Thu, Jun 18, 2015 at 1:11 PM, Kyle Banerjee <[log in to unmask]> wrote:
>
>> How you want to preprocess and structure the data depends on what you
>> hope to achieve. Can you say more about what you want the end product
>> to look like?
>>
>> kyle
>>
>> On Thu, Jun 18, 2015 at 10:08 AM, Matt Sherman <[log in to unmask]> wrote:
>>
>>> That is a pretty good summation of it, yes. I appreciate the
>>> suggestions. This is a bit of a new realm for me, and while I know
>>> what I want it to do and the structure I want to put it in, the
>>> conversion process has been eluding me, so thanks for giving me some
>>> tools to look into.
>>>
>>> On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan <[log in to unmask]> wrote:
>>>
>>>> On Jun 18, 2015, at 12:02 PM, Matt Sherman <[log in to unmask]> wrote:
>>>>
>>>>> I am working with a colleague on a side project which involves some
>>>>> scanned bibliographies and making them more web
>>>>> searchable/sortable/browse-able. While I am quite familiar with the
>>>>> metadata and organization aspects we need, I am at a bit of a loss
>>>>> on how to automate the process of putting the bibliography into a
>>>>> more structured format so that we can avoid going through hundreds
>>>>> of pages by hand. I am pretty sure regular expressions are needed,
>>>>> but I have not had an instance where I needed to automate
>>>>> extracting data from one file type (PDF OCR, or text extracted to a
>>>>> Word doc) and place it into another (either a database or an XML
>>>>> file) with some enrichment. I would appreciate any suggestions for
>>>>> approaches or tools to look into. Thanks for any help/thoughts
>>>>> people can give.
>>>>
>>>> If I understand your question correctly, then you have two problems
>>>> to address: 1) converting PDF, Word, etc. files into plain text, and
>>>> 2) marking up the result (which is a bibliography) into structured
>>>> data. Correct?
>>>>
>>>> If so, and if your PDF documents have already been OCRed (or if you
>>>> have other files), then you can probably feed them to Tika to
>>>> quickly and easily extract the underlying plain text. [1] I wrote a
>>>> brain-dead shell script to run Tika in server mode and then convert
>>>> Word (.docx) files. [2]
>>>>
>>>> When it comes to marking up the result into structured data, well,
>>>> good luck. I think such an application is something Library Land has
>>>> sought for a long time. "Can you say Holy Grail?"
>>>>
>>>> [1] Tika - https://tika.apache.org
>>>> [2] brain-dead script -
>>>> https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
>>>>
>>>> --
>>>> Eric
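[Editor's note: Eric's first step, feeding files to Tika running in server mode, can be sketched in Python rather than shell. This assumes a tika-server instance is already running on its default port (9998); the function names here are illustrative, not taken from his gist.]

```python
import urllib.request

# Default endpoint of a locally running Apache Tika server.
TIKA_URL = "http://localhost:9998/tika"

def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking the Tika server for plain text."""
    req = urllib.request.Request(url, data=data, method="PUT")
    # Asking for text/plain makes Tika return the extracted text only.
    req.add_header("Accept", "text/plain")
    return req

def extract_text(path: str) -> str:
    """Send one file (PDF, .docx, etc.) to Tika and return its plain text."""
    with open(path, "rb") as f:
        req = build_tika_request(f.read())
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

A batch run would just loop `extract_text` over the scanned files and write each result to a `.txt` file for the markup step that follows.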
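[Editor's note: for Eric's second problem, marking the extracted text up into structured data, the regular-expression approach Matt mentions might look like the sketch below. The entry shape ("Surname, First. Title. Year.") and the field names are assumptions made for illustration; real OCRed bibliographies vary widely and will need more forgiving patterns and manual review.]

```python
import re

# Toy entry shape assumed for this sketch: "Surname, First. Title. Year."
ENTRY = re.compile(
    r"(?P<author>[^.]+)\.\s+"   # author: everything up to the first period
    r"(?P<title>.+?)\.\s+"      # title: lazily matched up to the next period
    r"(?P<year>\d{4})\.$"       # year: four digits ending the entry
)

def parse_entry(line):
    """Return author/title/year fields, or None if the line doesn't match."""
    m = ENTRY.match(line.strip())
    return m.groupdict() if m else None

def to_xml(fields):
    """Serialize one parsed entry as a simple custom-XML element."""
    inner = "".join(f"<{k}>{v}</{k}>" for k, v in fields.items())
    return f"<entry>{inner}</entry>"
```

The same `fields` dict could just as easily feed an `INSERT` statement, which matches Matt's point that a database is easier to append to than a fixed XML file.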