Yong Tang writes

> I was recently thrown into a file dumpster

That sounds really painful.

> The original text format was lost.

Extracting text from PDF is difficult. I'd try to use pdftohtml

  http://sourceforge.net/projects/pdftohtml/

I have used that in the past. Then use XML::LibXML's HTML parser
to read the resulting HTML (if any) into Perl.

> Maybe I am heading in a wrong direction for this project?

Direction seems right but the task is tough. PDF is where text
goes to die.

Cheers,

Thomas Krichel
http://openlib.org/home/krichel
http://authorprofile.org/pkr1
skype: thomaskrichel