Thanks, that is interesting since we can export from the PDFs, and while
the OCR text is a little messy it is in decent shape. I'll certainly look
into that.

On Thu, Jun 18, 2015 at 3:13 PM, Gordon, Bonnie wrote:

> We're actually also working on getting a bibliography from a Word Doc
> to a more structured format. We're using regular expressions in
> LibreOffice Writer to mark up the citations, then insert tabs between
> the elements, and then copy into a spreadsheet (similar to what's
> described in
> http://programminghistorian.org/lessons/understanding-regular-expressions).
> However, our bibliography was originally a Word Doc, not OCRed text.
> This method is pretty reliant on consistent formatting, though, so
> messy OCR could complicate things. Another thing to note is that it's
> easiest when you know what format the citation is for (e.g., a book or
> article), since that impacts how the citation is structured. I'd be
> happy to provide a sample citation in each step of the process.
>
> All the best,
> Bonnie
>
> On 6/18/15, 1:52 PM, "Matt Sherman" wrote:
>
> > The hope is to take these bibliographies and put them into more of a
> > web searchable/sortable format for researchers to make use of. My
> > colleague was taking some inspiration from the Marlowe Bibliography
> > (https://marlowebibliography.org/), though we are hoping to possibly
> > get a bit more robust with the bibliography we are working on. The
> > important first step is to be able to parse the existing OCRed
> > bibliography scans we have into a database, possibly a custom XML
> > format, but a database will probably be easier to append and expand
> > down the road.
> >
> > On Thu, Jun 18, 2015 at 1:11 PM, Kyle Banerjee wrote:
> >
> >> How you want to preprocess and structure the data depends on what
> >> you hope to achieve. Can you say more about what you want the end
> >> product to look like?
> >>
> >> kyle
> >>
> >> On Thu, Jun 18, 2015 at 10:08 AM, Matt Sherman wrote:
> >>
> >> > That is a pretty good summation of it, yes. I appreciate the
> >> > suggestions. This is a bit of a new realm for me, and while I know
> >> > what I want it to do and the structure I want to put it in, the
> >> > conversion process has been eluding me, so thanks for giving me
> >> > some tools to look into.
> >> >
> >> > On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan wrote:
> >> >
> >> > > On Jun 18, 2015, at 12:02 PM, Matt Sherman wrote:
> >> > >
> >> > > > I am working with a colleague on a side project which involves
> >> > > > some scanned bibliographies and making them more web
> >> > > > searchable/sortable/browse-able. While I am quite familiar
> >> > > > with the metadata and organization aspects we need, I am at a
> >> > > > bit of a loss on how to automate the process of putting the
> >> > > > bibliography in a more structured format so that we can avoid
> >> > > > going through hundreds of pages by hand. I am pretty sure
> >> > > > regular expressions are needed, but I have not had an instance
> >> > > > where I need to automate extracting data from one file type
> >> > > > (PDF OCR or text extracted to Word doc) and placing it into
> >> > > > another (either a database or an XML file) with some
> >> > > > enrichment. I would appreciate any suggestions for approaches
> >> > > > or tools to look into. Thanks for any help/thoughts people can
> >> > > > give.
> >> > >
> >> > > If I understand your question correctly, then you have two
> >> > > problems to address: 1) converting PDF, Word, etc. files into
> >> > > plain text, and 2) marking up the result (which is a
> >> > > bibliography) into structured data. Correct?
> >> > >
> >> > > If so, then if your PDF documents have already been OCRed, or if
> >> > > you have other files, then you can probably feed them to TIKA to
> >> > > quickly and easily extract the underlying plain text. [1] I
> >> > > wrote a brain-dead shell script to run TIKA in server mode and
> >> > > then convert Word (.docx) files. [2]
> >> > >
> >> > > When it comes to marking up the result into structured data,
> >> > > well, good luck. I think such an application is something
> >> > > Library Land has sought for a long time. "Can you say Holy
> >> > > Grail?"
> >> > >
> >> > > [1] Tika - https://tika.apache.org
> >> > > [2] brain-dead script -
> >> > > https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
> >> > >
> >> > > --
> >> > > Eric
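
For anyone who would rather script the regular-expression step discussed
in this thread than do it in LibreOffice Writer, here is a minimal sketch
in Python. It is not the method anyone above actually used. It assumes
one citation per line, and it assumes book citations shaped like
"Author. Title. Place: Publisher, Year." The pattern, the field names,
the function names, and the sample citation are all invented for
illustration. Real OCR output will need a looser pattern, and articles
would need a separate one.

```python
import re

# Assumed field order for the spreadsheet columns (hypothetical names).
FIELDS = ("author", "title", "place", "publisher", "year")

# Hypothetical pattern for book citations shaped like:
#   Author. Title. Place: Publisher, Year.
BOOK_PATTERN = re.compile(
    r"^(?P<author>.+?)\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<place>[^:.]+):\s+"
    r"(?P<publisher>.+?),\s+"
    r"(?P<year>\d{4})\.?$"
)


def parse_citation(line):
    """Return a dict of citation fields, or None if the line does not fit."""
    # Collapse the stray runs of whitespace that OCR tends to leave behind.
    cleaned = re.sub(r"\s+", " ", line).strip()
    match = BOOK_PATTERN.match(cleaned)
    return match.groupdict() if match else None


def to_tsv_row(fields):
    """Join the parsed fields with tabs so the row pastes into a spreadsheet."""
    return "\t".join(fields[name] for name in FIELDS)


if __name__ == "__main__":
    # Made-up sample citation, not taken from any real bibliography.
    sample = "Doe, Jane.  An Example Title. Springfield:  Example Press, 1999."
    parsed = parse_citation(sample)
    if parsed is None:
        print("No match; set this line aside for manual review")
    else:
        print(to_tsv_row(parsed))
```

Lines that do not match come back as None rather than as a partial
guess, so they can be collected and fixed by hand. That matters with
messy OCR, where consistent formatting cannot be taken for granted.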