LISTSERV 16.5 - CODE4LIB Archives

Very cool tool, thank you!

Putting my devil's advocate hat on, it doesn't parse foreign documents well
(I got it to break!).  I also got inconsistent results feeding it PDF files
with tables embedded (but haven't been able to figure out what it is about
them it doesn't like).

Just from a curiosity standpoint, what encoding is being utilized?  I know
nothing about Perl.  It seemed to have no problem parsing a dash (-) if it
was up against another character (2007-2012), but barfs when it's by itself
(2007 – 2012). I'm only referring to 'extracted text' mode.
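
One possible explanation, offered as a guess rather than a diagnosis of
PDF2TXT itself: a dash set off by spaces is often typeset as an en dash
(U+2013) rather than an ASCII hyphen. If the extracted bytes are in a
single-byte encoding such as Windows-1252 but are then read as UTF-8, the
hyphen survives and the en dash does not. The minimal sketch below shows
that failure mode; the choice of cp1252 and the helper name `roundtrip` are
assumptions for illustration, not anything known about the tool.

```python
# Assumption: the standalone dash is an en dash (U+2013), and the text is
# written out as Windows-1252 bytes but later read back as UTF-8.
HYPHEN = "2007-2012"
EN_DASH = "2007 \u2013 2012"


def roundtrip(text, written_as="cp1252", read_as="utf-8"):
    """Encode text with one codec, then decode it with a different one.

    Bytes that are invalid in the reading codec are replaced with U+FFFD,
    the Unicode replacement character.
    """
    return text.encode(written_as).decode(read_as, errors="replace")


if __name__ == "__main__":
    # The ASCII hyphen is the same byte in both encodings, so it survives.
    print(repr(roundtrip(HYPHEN)))
    # The en dash is the single byte 0x96 in cp1252, which is not valid
    # UTF-8 on its own, so it is replaced.
    print(repr(roundtrip(EN_DASH)))
```

If this is what is happening, decoding with the encoding the text was
actually written in (or normalizing dashes before output) would avoid the
problem.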

If it helps, I can send along *most* of my test PDF files used.

Thank you!
.m





On Fri, Oct 11, 2013 at 10:58 AM, Eric Lease Morgan <[log in to unmask]> wrote:

> On Oct 11, 2013, at 1:49 PM, Matthew Sherman <[log in to unmask]>
> wrote:
>
> >> For a limited period of time I am making publicly available a Web-based
> >> program called PDF2TXT -- http://bit.ly/1bJRyh8
> >
> > Very slick, good work.  I can see where this tool can be very helpful.  It
> > does have some issues with some characters, but this is rather common with
> > most systems.
>
> Again, thank you for the support. Yes, there are some escaping issues to
> be resolved. "Release early. Release often." I need help with the graphic
> design in general.
>
> Here's an enhancement I thought of:
>
>   1. allow readers to authenticate
>   2. allow readers to upload documents
>   3. documents get saved in readers' cache
>   4. allow interface to list documents in the cache
>   5. provide text mining services against reader-selected documents
>   6. go to Step #1
>
> It would also be cool if I could figure out how to finish the installation
> of Tesseract to enable OCRing. [1]
>
> [1] OCRing -
> http://serials.infomotions.com/code4lib/archive/2013/201303/1554.html
>
> --
> Eric Morgan
>