The hope is to take these bibliographies and put them into a more
web-searchable/sortable format that researchers can make use of. My
colleague was taking some inspiration from the Marlowe Bibliography
(https://marlowebibliography.org/), though we are hoping to possibly get a
bit more robust with the bibliography we are working on. The important
first step is to be able to parse the existing OCRed bibliography scans we
have into a database, possibly a custom XML format, but a database will
probably be easier to append to and expand down the road.

On Thu, Jun 18, 2015 at 1:11 PM, Kyle Banerjee wrote:

> How you want to preprocess and structure the data depends on what you hope
> to achieve. Can you say more about what you want the end product to look
> like?
>
> kyle
>
> On Thu, Jun 18, 2015 at 10:08 AM, Matt Sherman wrote:
>
> > That is a pretty good summation of it, yes. I appreciate the
> > suggestions; this is a bit of a new realm for me, and while I know what
> > I want it to do and the structure I want to put it in, the conversion
> > process has been eluding me, so thanks for giving me some tools to look
> > into.
> >
> > On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan wrote:
> >
> > > On Jun 18, 2015, at 12:02 PM, Matt Sherman wrote:
> > >
> > > > I am working with a colleague on a side project which involves some
> > > > scanned bibliographies and making them more web
> > > > searchable/sortable/browse-able. While I am quite familiar with the
> > > > metadata and organization aspects we need, I am at a bit of a loss
> > > > on how to automate the process of putting the bibliography in a
> > > > more structured format so that we can avoid going through hundreds
> > > > of pages by hand. I am pretty sure regular expressions are needed,
> > > > but I have not had an instance where I need to automate extracting
> > > > data from one file type (PDF OCR or text extracted to Word doc) and
> > > > placing it into another (either a database or an XML file) with
> > > > some enrichment. I would appreciate any suggestions for approaches
> > > > or tools to look into. Thanks for any help/thoughts people can
> > > > give.
> > >
> > > If I understand your question correctly, then you have two problems
> > > to address: 1) converting PDF, Word, etc. files into plain text, and
> > > 2) marking up the result (which is a bibliography) into structured
> > > data. Correct?
> > >
> > > If so, then if your PDF documents have already been OCRed, or if you
> > > have other files, then you can probably feed them to Tika to quickly
> > > and easily extract the underlying plain text. [1] I wrote a
> > > brain-dead shell script to run Tika in server mode and then convert
> > > Word (.docx) files. [2]
> > >
> > > When it comes to marking up the result into structured data, well,
> > > good luck. I think such an application is something Library Land has
> > > sought for a long time. "Can you say Holy Grail?"
> > >
> > > [1] Tika - https://tika.apache.org
> > > [2] brain-dead script -
> > > https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
> > >
> > > --
> > > Eric
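
For the regular-expression-to-database step discussed in this thread, a
minimal sketch in Python might look like the following. Everything specific
in it is an assumption made for illustration, not something taken from the
actual scans: the entry layout ("Author. Title. Place: Publisher, Year."),
the sample lines (invented), the table name "entries", and the function
names parse_entries and store. Real OCR output will be messier, so lines
that do not match are set aside for review by hand rather than dropped.

```python
import re
import sqlite3

# ASSUMPTION: one entry per line, laid out as
#   "Author. Title. Place: Publisher, Year."
# Adjust the pattern to whatever the real bibliography uses.
ENTRY_RE = re.compile(
    r"^(?P<author>.+?)\.\s+"      # author: up to the first period
    r"(?P<title>.+?)\.\s+"        # title: up to the next period
    r"(?P<place>[^:.]+):\s*"      # place of publication: up to the colon
    r"(?P<publisher>.+?),\s*"     # publisher: up to the comma before the year
    r"(?P<year>\d{4})\.?$"        # four-digit year, optional final period
)

# Invented sample lines, including one that mimics a garbled OCR line.
SAMPLE = [
    "Doe, Jane. An Example Title. London: Example Press, 1901.",
    "Roe, Richard. Another Sample Work. New York: Sample House, 1923.",
    "Smith J (1899) garbled 0CR line",
]


def parse_entries(lines):
    """Split lines into parsed records and lines needing manual review."""
    records, unmatched = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        match = ENTRY_RE.match(text)
        if match:
            record = match.groupdict()
            record["year"] = int(record["year"])
            records.append(record)
        else:
            unmatched.append(text)
    return records, unmatched


def store(conn, records):
    """Create the (hypothetical) entries table and insert parsed records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "id INTEGER PRIMARY KEY, author TEXT, title TEXT, "
        "place TEXT, publisher TEXT, year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO entries (author, title, place, publisher, year) "
        "VALUES (:author, :title, :place, :publisher, :year)",
        records,
    )
    conn.commit()


if __name__ == "__main__":
    records, unmatched = parse_entries(SAMPLE)
    conn = sqlite3.connect(":memory:")  # use a file path for a real database
    store(conn, records)
    for row in conn.execute("SELECT author, title, year FROM entries ORDER BY year"):
        print(row)
    print("needs manual review:", unmatched)
```

A single pattern like this will trip on author initials ("Doe, J. R.") and
on titles that contain periods, so in practice it would probably grow into
several patterns tried in turn, with the unmatched list showing how much
hand work remains.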