See also http://wiki.tei-c.org/index.php/Heuristics , which discusses this
problem more broadly conceived. I've just added a link to the archives of
this very discussion. --Kevin

On 6/18/15 12:52 PM, Matt Sherman wrote:
> The hope is to take these bibliographies and put them into more of a web
> searchable/sortable format for researchers to make use of. My colleague
> was taking some inspiration from the Marlowe Bibliography
> (https://marlowebibliography.org/), though we are hoping to get a bit
> more robust with the bibliography we are working on. The important first
> step is to be able to parse the existing OCRed bibliography scans we
> have into a database; a custom XML format is possible, but a database
> will probably be easier to append to and expand down the road.
>
> On Thu, Jun 18, 2015 at 1:11 PM, Kyle Banerjee <[log in to unmask]> wrote:
>
>> How you want to preprocess and structure the data depends on what you
>> hope to achieve. Can you say more about what you want the end product
>> to look like?
>>
>> kyle
>>
>> On Thu, Jun 18, 2015 at 10:08 AM, Matt Sherman <[log in to unmask]> wrote:
>>
>>> That is a pretty good summation of it, yes. I appreciate the
>>> suggestions. This is a bit of a new realm for me, and while I know
>>> what I want it to do and the structure I want to put it in, the
>>> conversion process has been eluding me, so thanks for giving me some
>>> tools to look into.
>>>
>>> On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan <[log in to unmask]> wrote:
>>>
>>>> On Jun 18, 2015, at 12:02 PM, Matt Sherman <[log in to unmask]> wrote:
>>>>
>>>>> I am working with a colleague on a side project which involves some
>>>>> scanned bibliographies and making them more web
>>>>> searchable/sortable/browse-able. While I am quite familiar with the
>>>>> metadata and organization aspects we need, I am at a bit of a loss
>>>>> on how to automate the process of putting the bibliography into a
>>>>> more structured format so that we can avoid going through hundreds
>>>>> of pages by hand. I am pretty sure regular expressions are needed,
>>>>> but I have not had an instance where I needed to automate
>>>>> extracting data from one file type (PDF OCR, or text extracted to a
>>>>> Word doc) and place it into another (either a database or an XML
>>>>> file) with some enrichment. I would appreciate any suggestions for
>>>>> approaches or tools to look into. Thanks for any help/thoughts
>>>>> people can give.
>>>>
>>>> If I understand your question correctly, then you have two problems
>>>> to address: 1) converting PDF, Word, etc. files into plain text, and
>>>> 2) marking up the result (which is a bibliography) into structured
>>>> data. Correct?
>>>>
>>>> If so, and if your PDF documents have already been OCRed (or if you
>>>> have other files), then you can probably feed them to Tika to
>>>> quickly and easily extract the underlying plain text. [1] I wrote a
>>>> brain-dead shell script to run Tika in server mode and then convert
>>>> Word (.docx) files. [2]
>>>>
>>>> When it comes to marking up the result into structured data, well,
>>>> good luck. I think such an application is something Library Land has
>>>> sought for a long time. "Can you say Holy Grail?"
>>>>
>>>> [1] Tika - https://tika.apache.org
>>>> [2] brain-dead script -
>>>> https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
>>>>
>>>> --
>>>> Eric
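[Editor's note: Eric's first step, feeding files to Tika running in server mode, can be sketched in Python rather than shell. This assumes a tika-server instance is already running on its default port (9998); the function names here are illustrative, not taken from his gist.]

```python
import urllib.request

# Default endpoint of a locally running Apache Tika server.
TIKA_URL = "http://localhost:9998/tika"

def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking the Tika server for plain text."""
    req = urllib.request.Request(url, data=data, method="PUT")
    # Asking for text/plain makes Tika return the extracted text only.
    req.add_header("Accept", "text/plain")
    return req

def extract_text(path: str) -> str:
    """Send one file (PDF, .docx, etc.) to Tika and return its plain text."""
    with open(path, "rb") as f:
        req = build_tika_request(f.read())
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

A batch run would just loop `extract_text` over the scanned files and write each result to a `.txt` file for the markup step that follows.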
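[Editor's note: for Eric's second problem, marking the extracted text up into structured data, the regular-expression approach Matt mentions might look like the sketch below. The entry shape ("Surname, First. Title. Year.") and the field names are assumptions made for illustration; real OCRed bibliographies vary widely and will need more forgiving patterns and manual review.]

```python
import re

# Toy entry shape assumed for this sketch: "Surname, First. Title. Year."
ENTRY = re.compile(
    r"(?P<author>[^.]+)\.\s+"   # author: everything up to the first period
    r"(?P<title>.+?)\.\s+"      # title: lazily matched up to the next period
    r"(?P<year>\d{4})\.$"       # year: four digits ending the entry
)

def parse_entry(line):
    """Return author/title/year fields, or None if the line doesn't match."""
    m = ENTRY.match(line.strip())
    return m.groupdict() if m else None

def to_xml(fields):
    """Serialize one parsed entry as a simple custom-XML element."""
    inner = "".join(f"<{k}>{v}</{k}>" for k, v in fields.items())
    return f"<entry>{inner}</entry>"
```

The same `fields` dict could just as easily feed an `INSERT` statement, which matches Matt's point that a database is easier to append to than a fixed XML file.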