Hi Mark,

I suspect the tool will only be able to handle select languages, and I very
much doubt you could develop a tool to handle non-LCG (Latin, Cyrillic,
Greek) text.

For a fully internationalised tool, you would have to ignore all text
layers in a PDF and run all PDFs through OCR to generate text.

Then you'd need to apply very sophisticated word boundary identification
routines.
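
To illustrate why word boundaries are hard, here is a minimal Python sketch
(my own illustration, not anything taken from PDF2TXT). The helper name
naive_words is made up, and the sample strings are arbitrary. A simple
"run of word characters" rule works for space-delimited scripts but returns
an unspaced Chinese phrase as one undivided token:

```python
import re

def naive_words(text):
    """Split text into runs of Unicode word characters (a naive boundary rule)."""
    return re.findall(r"\w+", text)

# Space-delimited Latin text splits into the words a reader would expect.
latin = naive_words("Very cool tool, thank you!")

# Chinese is written without spaces, so the same rule yields a single token
# even though the phrase ("hello world") contains more than one word.
chinese = naive_words("你好世界")

print(latin)
print(chinese)
```

Segmenting the second case properly would need a dictionary or a statistical
model for that language, which is the sophistication I mean.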

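On the dash question in the quoted message below: one plausible explanation,
and it is only a guess since I have not seen the tool's code, is an encoding
mismatch. A plain hyphen is ASCII and survives any decoding, while an en dash
stored as a Windows-1252 byte is not valid UTF-8. A small Python sketch under
that assumption:

```python
# Assumption: the PDF text arrives as Windows-1252 bytes but is read as UTF-8.
raw = "2007 \u2013 2012".encode("cp1252")  # the en dash becomes the single byte 0x96

# Decoding as UTF-8 fails on 0x96; errors="replace" substitutes U+FFFD.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the encoding that was actually used restores the en dash.
right = raw.decode("cp1252")

# An ASCII hyphen is identical in both encodings, so it round-trips cleanly.
hyphen = "2007-2012".encode("cp1252").decode("utf-8")

print(repr(wrong))
print(repr(right))
print(repr(hyphen))
```
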
A.
On 12/10/2013 9:40 AM, "Mark Pernotto" <[log in to unmask]> wrote:

> Very cool tool, thank you!
>
> Putting my devil's advocate hat on, it doesn't parse foreign documents well
> (I got it to break!).  I also got inconsistent results feeding it PDF files
> with tables embedded (but haven't been able to figure out what it is about
> them it doesn't like).
>
> Just from a curiosity standpoint, what encoding is being utilized?  I know
> nothing about Perl.  It seemed to have no problem parsing a dash (-) if it
> was up against another character (2007-2012), but barfs when it's by itself
> (2007 – 2012). I'm only referring to 'extracted text' mode.
>
> If it helps, I can send along *most* of my test PDF files used.
>
> Thank you!
> .m
>
> On Fri, Oct 11, 2013 at 10:58 AM, Eric Lease Morgan <[log in to unmask]>
> wrote:
>
> > On Oct 11, 2013, at 1:49 PM, Matthew Sherman <[log in to unmask]>
> > wrote:
> >
> > >> For a limited period of time I am making publicly available a
> > >> Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8
> > >
> > > Very slick, good work.  I can see where this tool can be very helpful.
> > > It does have some issues with some characters, but this is rather common
> > > with most systems.
> >
> > Again, thank you for the support. Yes, there are some escaping issues to
> > be resolved. "Release early. Release often." I need help with the graphic
> > design in general.
> >
> > Here's an enhancement I thought of:
> >
> >   1. allow readers to authenticate
> >   2. allow readers to upload documents
> >   3. documents get saved in readers' cache
> >   4. allow interface to list documents in the cache
> >   5. provide text mining services against reader-selected documents
> >   6. go to Step #1
> >
> > It would also be cool if I could figure out how to finish the
> > installation of Tesseract to enable OCRing. [1]
> >
> > [1] OCRing -
> > http://serials.infomotions.com/code4lib/archive/2013/201303/1554.html
> >
> > --
> > Eric Morgan
> >
>