We're also working on getting a bibliography from a Word doc into a
more structured format. We're using regular expressions in LibreOffice
Writer to mark up the citations, then insert tabs between the elements,
and then copy the result into a spreadsheet (similar to what's described in
http://programminghistorian.org/lessons/understanding-regular-expressions).
Our bibliography was originally a Word doc, though, not OCRed text. This
method relies heavily on consistent formatting, so messy OCR could
complicate things. Another thing to note is that it's easiest when you
know what type of source each citation is for (e.g., a book or article),
since that affects how the citation is structured. I'd be happy to provide
a sample citation at each step of the process.
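
As a rough illustration of the mark-up-and-tab step, here is a minimal
Python sketch. It is not the LibreOffice workflow itself. The citation
layout ("Author. Title. Place: Publisher, Year."), the sample citation,
and the names BOOK_PATTERN and to_tsv_row are all assumptions made up for
this example; a real bibliography would need its own pattern for each
citation type.

```python
import re

# Hypothetical pattern for a book citation laid out as
# "Author. Title. Place: Publisher, Year."
# Real bibliographies will need a different pattern per citation type.
BOOK_PATTERN = re.compile(
    r"^(?P<author>.+?)\.\s+"      # author, up to the first ". "
    r"(?P<title>.+?)\.\s+"        # title, up to the next ". "
    r"(?P<place>[^:.]+):\s+"      # place of publication, before the colon
    r"(?P<publisher>.+),\s+"      # publisher, before the final comma
    r"(?P<year>\d{4})\.?$"        # four-digit year, optional trailing period
)


def to_tsv_row(citation):
    """Return the citation's elements joined by tabs, or None if the
    citation does not follow the assumed book layout."""
    match = BOOK_PATTERN.match(citation.strip())
    if match is None:
        return None  # inconsistent formatting or messy OCR: review by hand
    return "\t".join(match.group("author", "title", "place", "publisher", "year"))


if __name__ == "__main__":
    # Made-up sample citation, for illustration only.
    sample = "Smith, John. A History of Printing. London: Example Press, 1999."
    print(to_tsv_row(sample))
```

Rows produced this way can be pasted into a spreadsheet, with each
element landing in its own column. Citations that return None are the
ones whose formatting is too inconsistent for the pattern and would
need to be checked by hand.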
All the best,
Bonnie
On 6/18/15, 1:52 PM, "Matt Sherman" wrote:
>The hope is to take these bibliographies and put them into a more web
>searchable/sortable format so researchers can make use of them. My
>colleague was taking some inspiration from the Marlowe Bibliography (
>https://marlowebibliography.org/), though we are hoping to get a
>bit more robust with the bibliography we are working on. The important
>first step is to be able to parse the existing OCRed bibliography scans we
>have into a database, possibly a custom XML format, but a database will
>probably be easier to append to and expand down the road.
>
>On Thu, Jun 18, 2015 at 1:11 PM, Kyle Banerjee wrote:
>
>> How you want to preprocess and structure the data depends on what you
>> hope to achieve. Can you say more about what you want the end product to
>> look like?
>>
>> kyle
>>
>> On Thu, Jun 18, 2015 at 10:08 AM, Matt Sherman wrote:
>>
>> > That is a pretty good summation of it, yes. I appreciate the
>> > suggestions. This is a bit of a new realm for me, and while I know
>> > what I want it to do and the structure I want to put it in, the
>> > conversion process has been eluding me, so thanks for giving me some
>> > tools to look into.
>> >
>> > On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan wrote:
>> >
>> > > On Jun 18, 2015, at 12:02 PM, Matt Sherman wrote:
>> > >
>> > > > I am working with a colleague on a side project that involves some
>> > > > scanned bibliographies and making them more web
>> > > > searchable/sortable/browse-able. While I am quite familiar with the
>> > > > metadata and organization aspects we need, I am at a bit of a loss
>> > > > on how to automate the process of putting the bibliography into a
>> > > > more structured format so that we can avoid going through hundreds
>> > > > of pages by hand. I am pretty sure regular expressions are needed,
>> > > > but I have not had an instance where I need to automate extracting
>> > > > data from one file type (PDF OCR or text extracted to a Word doc)
>> > > > and placing it into another (either a database or an XML file) with
>> > > > some enrichment. I would appreciate any suggestions for approaches
>> > > > or tools to look into. Thanks for any help/thoughts people can give.
>> > >
>> > >
>> > > If I understand your question correctly, then you have two problems
>> > > to address: 1) converting PDF, Word, etc. files into plain text, and
>> > > 2) marking up the result (which is a bibliography) into structured
>> > > data. Correct?
>> > >
>> > > If so, then if your PDF documents have already been OCRed, or if you
>> > > have other files, then you can probably feed them to Tika to quickly
>> > > and easily extract the underlying plain text. [1] I wrote a
>> > > brain-dead shell script to run Tika in server mode and then convert
>> > > Word (.docx) files. [2]
>> > >
>> > > When it comes to marking up the result into structured data, well,
>> > > good luck. I think such an application is something Library Land has
>> > > sought for a long time. "Can you say Holy Grail?"
>> > >
>> > > [1] Tika - https://tika.apache.org
>> > > [2] brain-dead script -
>> > > https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
>> > >
>> > >
>> > > Eric
>> > >
>> >
>>