LISTSERV 16.5 - CODE4LIB Archives

  Yong Tang writes

> I was recently thrown into a file dumpster

  That sounds really painful.

> The original text format was lost.

  Extracting text from PDF is difficult. I'd try pdftohtml:

http://sourceforge.net/projects/pdftohtml/

  I have used it in the past. Once it has run, use XML::LibXML's HTML
  parser to read the resulting HTML (if any) into Perl.
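
  A minimal sketch of that two-step pipeline, written in Python rather
  than Perl, with the standard library's html.parser standing in for
  XML::LibXML's HTML parser. It assumes pdftohtml is installed and on
  the PATH. The names pdf_to_html, html_to_text and TextCollector are
  hypothetical helpers made up for this illustration, and the output
  will still need cleanup for most real PDFs.

```python
import subprocess
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text of an HTML string, one chunk per line."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def pdf_to_html(pdf_path):
    """Run pdftohtml and return its HTML output as a string.

    -stdout writes to standard output, -noframes produces a single
    document instead of a frameset, and -i ignores images.
    """
    result = subprocess.run(
        ["pdftohtml", "-stdout", "-noframes", "-i", pdf_path],
        capture_output=True,
        check=True,
    )
    # Assumption: the output is close enough to UTF-8 to decode leniently.
    return result.stdout.decode("utf-8", errors="replace")
```

  Calling html_to_text(pdf_to_html("some.pdf")) would chain the two
  steps; "some.pdf" is a placeholder path. The "(if any)" caveat still
  applies: a scanned PDF with no text layer yields HTML with little or
  no text in it.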

> Maybe I am heading in a wrong direction for this project?

  The direction seems right, but the task is tough. PDF is where text
  goes to die.

  Cheers,

  Thomas Krichel                    http://openlib.org/home/krichel
                                      http://authorprofile.org/pkr1
                                               skype: thomaskrichel