Thanks, that is interesting since we can export from the PDFs, and while
the OCR text is a little messy it is in decent shape.  I'll certainly look
into that.

On Thu, Jun 18, 2015 at 3:13 PM, Gordon, Bonnie <[log in to unmask]>
wrote:

> We're actually also working on getting a bibliography from a Word Doc to a
> more structured format. We're using regular expressions in LibreOffice
> Writer to mark up the citations, then insert tabs between the elements,
> and then copy into a spreadsheet (similar to what's described in
> http://programminghistorian.org/lessons/understanding-regular-expressions
> ).
>  However, our bibliography was originally a Word Doc, not OCRed text. This
> method is pretty reliant on consistent formatting, though, so messy OCR
> could complicate things. Another thing to note is that it's easiest when
> you know what format the citation is for (e.g., a book or article), since
> that impacts how the citation is structured.  I'd be happy to provide a
> sample citation in each step of the process.
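
A rough Python sketch of the same idea (mark up a citation with a regular
expression, then put tabs between the elements so it pastes into a
spreadsheet). This is not the LibreOffice workflow itself. The citation
shape "Author. Title. City: Publisher, Year." and the names BOOK_PATTERN
and citation_to_tsv are assumptions made up for illustration, and the
sample citation is invented.

```python
import re

# Assumed (hypothetical) book-citation shape:
#   Author. Title. City: Publisher, Year.
# Other citation types (articles, chapters) would need their own patterns.
BOOK_PATTERN = re.compile(
    r"^(?P<author>.+?)\.\s+"      # author, up to the first period
    r"(?P<title>.+?)\.\s+"        # title, up to the next period
    r"(?P<place>[^:.]+):\s+"      # place of publication, before the colon
    r"(?P<publisher>[^,]+),\s+"   # publisher, before the comma
    r"(?P<year>\d{4})\.?\s*$"     # four-digit year, optional final period
)


def citation_to_tsv(line):
    """Return the citation as tab-separated fields, or None if it does not match."""
    match = BOOK_PATTERN.match(line.strip())
    if match is None:
        # Inconsistent formatting or messy OCR: leave it for manual review.
        return None
    fields = ("author", "title", "place", "publisher", "year")
    return "\t".join(match.group(name).strip() for name in fields)


# Invented sample citation, not a real work.
sample = "Doe, Jane. An Example Title. Springfield: Example Press, 1999."
print(citation_to_tsv(sample))
```

Lines that return None can be collected separately, which makes it easy to
see how much of the OCR text the pattern fails on before writing more
patterns.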
>
> All the best,
> Bonnie
>
>
>
> On 6/18/15, 1:52 PM, "Matt Sherman" <[log in to unmask]> wrote:
>
> >The hope is to take these bibliographies and put them into more of a web
> >searchable/sortable format for researchers to make use of.  My
> >colleague was taking some inspiration from the Marlowe Bibliography (
> >https://marlowebibliography.org/), though we are hoping to possibly get a
> >bit more robust with the bibliography we are working on.  The important
> >first step is to be able to parse the existing OCRed bibliography scans we
> >have into a database, possibly a custom XML format, but a database will
> >probably be easier to append to and expand down the road.
> >
> >On Thu, Jun 18, 2015 at 1:11 PM, Kyle Banerjee <[log in to unmask]>
> >wrote:
> >
> >> How you want to preprocess and structure the data depends on what you
> >> hope to achieve. Can you say more about what you want the end product
> >> to look like?
> >>
> >> kyle
> >>
> >> On Thu, Jun 18, 2015 at 10:08 AM, Matt Sherman <[log in to unmask]>
> >> wrote:
> >>
> >> > That is a pretty good summation of it, yes.  I appreciate the
> >> > suggestions. This is a bit of a new realm for me, and while I know
> >> > what I want it to do and the structure I want to put it in, the
> >> > conversion process has been eluding me, so thanks for giving me some
> >> > tools to look into.
> >> >
> >> > On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan <[log in to unmask]>
> >> > wrote:
> >> >
> >> > > On Jun 18, 2015, at 12:02 PM, Matt Sherman <[log in to unmask]>
> >> > > wrote:
> >> > >
> >> > > > I am working with a colleague on a side project which involves some
> >> > > > scanned bibliographies and making them more web
> >> > > > searchable/sortable/browse-able. While I am quite familiar with the
> >> > > > metadata and organization aspects we need, I am at a bit of a loss
> >> > > > on how to automate the process of putting the bibliography in a more
> >> > > > structured format so that we can avoid going through hundreds of
> >> > > > pages by hand.  I am pretty sure regular expressions are needed, but
> >> > > > I have not had an instance where I need to automate extracting data
> >> > > > from one file type (PDF OCR or text extracted to Word doc) and
> >> > > > placing it into another (either a database or an XML file) with some
> >> > > > enrichment.  I would appreciate any suggestions for approaches or
> >> > > > tools to look into.  Thanks for any help/thoughts people can give.
> >> > >
> >> > >
> >> > > If I understand your question correctly, then you have two problems
> >> > > to address: 1) converting PDF, Word, etc. files into plain text, and
> >> > > 2) marking up the result (which is a bibliography) into structured
> >> > > data. Correct?
> >> > >
> >> > > If so, then if your PDF documents have already been OCRed, or if you
> >> > > have other files, then you can probably feed them to TIKA to quickly
> >> > > and easily extract the underlying plain text. [1] I wrote a brain-dead
> >> > > shell script to run TIKA in server mode and then convert Word (.docx)
> >> > > files. [2]
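
A minimal Python sketch of the server-mode step, for anyone who would
rather not use a shell script. It is not the script linked in [2]. It
assumes a Tika server is already running locally on its default port
(9998) and that the server returns UTF-8 text; the names TIKA_URL,
build_tika_request, and extract_text are made up for illustration.

```python
import urllib.request

# Assumption: a Tika server was started separately and listens here.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data, url=TIKA_URL):
    """Build a PUT request that asks the Tika server for plain text."""
    return urllib.request.Request(
        url,
        data=data,                         # raw bytes of the document
        method="PUT",
        headers={"Accept": "text/plain"},  # ask for plain text back
    )


def extract_text(path, url=TIKA_URL):
    """Send one file (e.g. a .docx or an OCRed PDF) and return its text."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), url)
    with urllib.request.urlopen(request) as response:
        # Assumption: the response body is UTF-8 encoded.
        return response.read().decode("utf-8", errors="replace")
```

Calling extract_text("bibliography.docx") (a hypothetical file name) should
return the document's plain text, which can then be split into individual
citations for the markup step.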
> >> > >
> >> > > When it comes to marking up the result into structured data, well,
> >> > > good luck. I think such an application is something Library Land has
> >> > > sought for a long time. "Can you say Holy Grail?"
> >> > >
> >> > > [1] Tika - https://tika.apache.org
> >> > > [2] brain-dead script -
> >> > > https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
> >> > >
> >> > > --
> >> > > Eric
> >> > >
> >> >
> >>
>