That is a pretty good summation of it, yes. I appreciate the suggestions.
This is a bit of a new realm for me, and while I know what I want it to do
and the structure I want to put it in, the conversion process has been
eluding me, so thanks for giving me some tools to look into.
On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan wrote:
> On Jun 18, 2015, at 12:02 PM, Matt Sherman wrote:
>
> > I am working with a colleague on a side project which involves some scanned
> > bibliographies and making them more web searchable/sortable/browse-able.
> > I am quite familiar with the metadata and organization aspects we
> > need, but I am at a bit of a loss on how to automate the process of putting
> > the bibliography in a more structured format so that we can avoid going
> > through hundreds of pages by hand. I am pretty sure regular expressions
> > are needed, but I have not had an instance where I need to automate
> > extracting data from one file type (PDF OCR or text extracted to Word doc)
> > and placing it into another (either a database or an XML file) with some
> > enrichment. I would appreciate any suggestions for approaches or tools to
> > look into. Thanks for any help/thoughts people can give.
>
>
> If I understand your question correctly, then you have two problems to
> address: 1) converting PDF, Word, etc. files into plain text, and 2)
> marking up the result (which is a bibliography) into structured data.
> Correct?
>
> If so, then if your PDF documents have already been OCRed, or if you have
> other files, then you can probably feed them to TIKA to quickly and easily
> extract the underlying plain text. [1] I wrote a brain-dead shell script to
> run TIKA in server mode and then convert Word (.docx) files. [2]
>
> When it comes to marking up the result into structured data, well, good
> luck. I think such an application is something Library Land sought for a
> long time. "Can you say Holy Grail?"
>
> [1] Tika - https://tika.apache.org
> [2] brain-dead script -
> https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
>
> —
> Eric
>
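
For the second problem discussed above (marking up the extracted text), a minimal sketch of the regular-expression approach in Python might look like the following. It assumes the plain text has already been extracted, for example with Tika, and that each entry sits on one line in a hypothetical "Author. Title. Place: Publisher, Year." layout. The pattern, the function names (parse_entries, to_xml), and the XML element names are illustrative assumptions and do not come from the thread. Real OCR output will be messier, so lines that do not match are collected for manual review rather than silently dropped.

```python
import re
import xml.etree.ElementTree as ET

# Assumed (hypothetical) entry layout, one entry per line:
#   Author. Title. Place: Publisher, Year.
# Fields are assumed not to contain periods, so initials such as
# "Smith, J." would need a more forgiving pattern.
ENTRY = re.compile(
    r"^(?P<author>[^.]+)\.\s+"
    r"(?P<title>[^.]+)\.\s+"
    r"(?P<place>[^:.]+):\s+"
    r"(?P<publisher>[^,]+),\s+"
    r"(?P<year>\d{4})\.?\s*$"
)


def parse_entries(text):
    """Split text into lines and return (records, unmatched_lines)."""
    records, unmatched = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        match = ENTRY.match(line)
        if match:
            records.append(match.groupdict())
        else:
            unmatched.append(line)  # keep for review by hand
    return records, unmatched


def to_xml(records):
    """Serialize parsed records as a simple, made-up XML structure."""
    root = ET.Element("bibliography")
    for record in records:
        entry = ET.SubElement(root, "entry")
        for field, value in record.items():
            ET.SubElement(entry, field).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "Austen, Jane. Pride and Prejudice. London: Egerton, 1813.\n"
        "th1s l1ne was mangled by OCR\n"
    )
    parsed, leftovers = parse_entries(sample)
    print(to_xml(parsed))
    print("Needs manual review:", leftovers)
```

Keeping the unmatched lines in a separate list makes it possible to refine the pattern iteratively: run it, inspect what fell through, adjust, and repeat until the leftover pile is small enough to fix by hand.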