Yong Tang writes
> I was recently thrown into a file dumpster
That sounds really painful.
> The original text format was lost.
Extracting text from PDF is difficult. I'd try pdftohtml
http://sourceforge.net/projects/pdftohtml/
which I have used in the past. Then use XML::LibXML's HTML parser to
read the resulting HTML (if any) into Perl.
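
A rough sketch of the second step, in case it helps. It assumes pdftohtml has already produced some HTML; the sample string and the helper name text_chunks_from_html are made up for illustration, and real pdftohtml output will look different. It only shows XML::LibXML's HTML parser in recovery mode pulling the text nodes out of the body.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: return the non-empty text chunks found in the
# <body> of an HTML string, with runs of whitespace collapsed.
sub text_chunks_from_html {
    my ($html) = @_;

    # recover => 2 tells libxml2 to keep going on malformed markup
    # without printing warnings; converter output is rarely clean.
    my $doc = XML::LibXML->load_html(string => $html, recover => 2);

    my @chunks;
    for my $node ($doc->findnodes('//body//text()')) {
        my $text = $node->textContent;
        $text =~ s/\s+/ /g;         # collapse whitespace runs
        $text =~ s/^\s+|\s+$//g;    # trim both ends
        push @chunks, $text if length $text;
    }
    return @chunks;
}

# Made-up sample standing in for converter output.
my $sample = <<'HTML';
<html><head><title>ignored</title></head><body>
<p> First   line </p>
Second line<br>Third line
</body></html>
HTML

print "$_\n" for text_chunks_from_html($sample);
```

For a file on disk, load_html also accepts location => $filename in place of string => $html. Expect to tune the XPath expression once you see what the converter actually emits for your PDFs.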
> Maybe I am heading in a wrong direction for this project?
The direction seems right, but the task is tough. PDF is where text
goes to die.
Cheers,
Thomas Krichel http://openlib.org/home/krichel
http://authorprofile.org/pkr1
skype: thomaskrichel