LISTSERV 16.5 - CODE4LIB Archives

The hope is to take these bibliographies and put them into a more web
searchable/sortable format that researchers can make use of.  My
colleague was taking some inspiration from the Marlowe Bibliography (
https://marlowebibliography.org/), though we are hoping to get a
bit more robust with the bibliography we are working on.  The important
first step is to be able to parse the existing OCRed bibliography scans we
have into a database.  A custom XML format is possible, but a database will
probably be easier to append to and expand down the road.
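
A minimal sketch of that first step, in Python, might look like the
following.  It assumes things the thread does not state: that entries in
the OCRed text are separated by blank lines, and that each one follows a
simple "Author. Title. Place: Publisher, Year." layout.  The pattern,
the table name, the column names, and the sample text are all
hypothetical and would need adjusting to the real scans.  Anything the
pattern cannot match is set aside for hand review rather than guessed at.

```python
import re
import sqlite3

# Hypothetical entry layout: "Surname, Forename. Title. Place: Publisher, Year."
# Real bibliographies vary widely, so treat this pattern as a starting point only.
ENTRY_RE = re.compile(
    r"^(?P<author>[^.]+?)\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<place>[^:.]+):\s+"
    r"(?P<publisher>[^,]+),\s+"
    r"(?P<year>\d{4})\.?$"
)

# Made-up sample text standing in for OCR output.
SAMPLE = """Doe, Jane. An Example Title. London:
Example Press, 1999.

Smudged l1ne the OCR could not read
"""


def parse_entries(text):
    """Split text on blank lines; return (parsed records, unparsed entries)."""
    parsed, unparsed = [], []
    for block in re.split(r"\n\s*\n", text.strip()):
        entry = " ".join(block.split())  # rejoin lines that were hard-wrapped
        if not entry:
            continue
        match = ENTRY_RE.match(entry)
        if match:
            record = match.groupdict()
            record["year"] = int(record["year"])
            parsed.append(record)
        else:
            unparsed.append(entry)  # keep for manual review
    return parsed, unparsed


def load_into_db(conn, records):
    """Create the (hypothetical) entries table if needed and insert the records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries "
        "(author TEXT, title TEXT, place TEXT, publisher TEXT, year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO entries VALUES (:author, :title, :place, :publisher, :year)",
        records,
    )
    conn.commit()


if __name__ == "__main__":
    records, leftovers = parse_entries(SAMPLE)
    connection = sqlite3.connect(":memory:")  # use a file path to keep the data
    load_into_db(connection, records)
    print(len(records), "parsed;", len(leftovers), "left for manual review")
```

Keeping the unmatched entries in a separate list gives a running measure
of how much of the bibliography the pattern actually covers.  Note that
this pattern will misread authors given with initials (e.g. "Doe, J. A."),
which is one reason to expect several rounds of refinement.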

On Thu, Jun 18, 2015 at 1:11 PM, Kyle Banerjee <[log in to unmask]>
wrote:

> How you want to preprocess and structure the data depends on what you hope
> to achieve. Can you say more about what you want the end product to look
> like?
>
> kyle
>
> On Thu, Jun 18, 2015 at 10:08 AM, Matt Sherman <[log in to unmask]>
> wrote:
>
> > That is a pretty good summation of it, yes.  I appreciate the suggestions.
> > This is a bit of a new realm for me, and while I know what I want it to do
> > and the structure I want to put it in, the conversion process has been
> > eluding me, so thanks for giving me some tools to look into.
> >
> > On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan <[log in to unmask]>
> > wrote:
> >
> > > On Jun 18, 2015, at 12:02 PM, Matt Sherman <[log in to unmask]>
> > > wrote:
> > >
> > > > I am working with a colleague on a side project which involves some
> > > > scanned bibliographies and making them more web
> > > > searchable/sortable/browse-able.  While I am quite familiar with the
> > > > metadata and organization aspects we need, I am at a bit of a loss on
> > > > how to automate the process of putting the bibliography in a more
> > > > structured format so that we can avoid going through hundreds of pages
> > > > by hand.  I am pretty sure regular expressions are needed, but I have
> > > > not had an instance where I needed to automate extracting data from
> > > > one file type (PDF OCR or text extracted to a Word doc) and placing it
> > > > into another (either a database or an XML file) with some enrichment.
> > > > I would appreciate any suggestions for approaches or tools to look
> > > > into.  Thanks for any help/thoughts people can give.
> > >
> > >
> > > If I understand your question correctly, then you have two problems to
> > > address: 1) converting PDF, Word, etc. files into plain text, and 2)
> > > marking up the result (which is a bibliography) into structured data.
> > > Correct?
> > >
> > > If so, then if your PDF documents have already been OCRed, or if you
> > > have other files, then you can probably feed them to TIKA to quickly
> > > and easily extract the underlying plain text. [1] I wrote a brain-dead
> > > shell script to run TIKA in server mode and then convert Word (.docx)
> > > files. [2]
> > >
> > > When it comes to marking up the result into structured data, well, good
> > > luck. I think such an application is something Library Land has sought
> > > for a long time. "Can you say Holy Grail?"
> > >
> > > [1] Tika - https://tika.apache.org
> > > [2] brain-dead script -
> > > https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
> > >
> > > —
> > > Eric
> > >
> >
>