How you want to preprocess and structure the data depends on what you hope
to achieve. Can you say more about what you want the end product to look
like?
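
In the meantime, here is a minimal Python sketch of the regular-expression
route, assuming (and this is only an assumption, since I have not seen your
data) that each entry sits on one line and looks roughly like
"Author. Title. Place: Publisher, Year." The pattern, the field names, and
the parse_entry/to_xml helpers are all hypothetical and would need tuning
against your real OCR output.

```python
import re
import xml.etree.ElementTree as ET

# Assumed entry layout (hypothetical): "Author. Title. Place: Publisher, Year."
ENTRY = re.compile(
    r"^(?P<author>.+?)\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<place>[^:]+):\s+"
    r"(?P<publisher>.+?),\s+"
    r"(?P<year>\d{4})\.$"
)

def parse_entry(line):
    """Return a dict of fields, or None if the line does not fit the pattern."""
    match = ENTRY.match(line.strip())
    return match.groupdict() if match else None

def to_xml(records):
    """Serialize parsed records as a simple, made-up XML structure."""
    root = ET.Element("bibliography")
    for record in records:
        entry = ET.SubElement(root, "entry")
        for field, value in record.items():
            ET.SubElement(entry, field).text = value
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    lines = [
        "Smith, John. A History of Maps. London: Example Press, 1987.",
        "garbled OCR line with no structure",
    ]
    parsed = [parse_entry(line) for line in lines]
    good = [p for p in parsed if p is not None]
    print(to_xml(good))
    # Lines that did not match are left for review by hand.
    print([line for line, p in zip(lines, parsed) if p is None])
```

Entries that do not match come back as None, so they can be set aside and
checked by hand rather than silently mangled. Multiple author initials,
entries that wrap across lines, and OCR noise will all break a pattern this
simple.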

kyle

On Thu, Jun 18, 2015 at 10:08 AM, Matt Sherman wrote:

> That is a pretty good summation of it, yes.  I appreciate the suggestions.
> This is a bit of a new realm for me, and while I know what I want it to do
> and the structure I want to put it in, the conversion process has been
> eluding me, so thanks for giving me some tools to look into.
>
> On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan wrote:
>
> > On Jun 18, 2015, at 12:02 PM, Matt Sherman wrote:
> >
> > > I am working with a colleague on a side project which involves some
> > > scanned bibliographies and making them more web
> > > searchable/sortable/browse-able.  While I am quite familiar with the
> > > metadata and organization aspects we need, I am at a bit of a loss on
> > > how to automate the process of putting the bibliography in a more
> > > structured format so that we can avoid going through hundreds of pages
> > > by hand.  I am pretty sure regular expressions are needed, but I have
> > > not had an instance where I need to automate extracting data from one
> > > file type (PDF OCR or text extracted to Word doc) and placing it into
> > > another (either a database or an XML file) with some enrichment.  I
> > > would appreciate any suggestions for approaches or tools to look into.
> > > Thanks for any help/thoughts people can give.
> >
> >
> > If I understand your question correctly, then you have two problems to
> > address: 1) converting PDF, Word, etc. files into plain text, and 2)
> > marking up the result (which is a bibliography) into structured data.
> > Correct?
> >
> > If so, then if your PDF documents have already been OCRed, or if you have
> > other files, then you can probably feed them to Tika to quickly and
> > easily extract the underlying plain text. [1] I wrote a brain-dead shell
> > script to run Tika in server mode and then convert Word (.docx) files. [2]
> >
> > When it comes to marking up the result into structured data, well, good
> > luck. I think such an application is something Library Land has sought
> > for a long time. "Can you say Holy Grail?"
> >
> > [1] Tika - https://tika.apache.org
> > [2] brain-dead script -
> > https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
> >
> > —
> > Eric
> >
>