It's not clear what is going into the folders at the end: the original
PDFs or the extracted text? If it's the former, then preserving the
formatting doesn't seem like a concern. Extract the text with whatever
tool works, keep a reference to the original file, do your
categorization on the text, then move the original into the proper folder.
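
For what it's worth, here is a rough sketch of that workflow. It is in
Python rather than Perl, only because that is quicker for me to write, and
it rests on a few assumptions: that the `pdftotext` command-line tool
(from Poppler or Xpdf) is installed, and that a keyword lookup is enough to
pick a folder. The function names, keywords, and folder names are all made
up for illustration; your real categorization rules will differ.

```python
import shutil
import subprocess
from pathlib import Path


def extract_with_pdftotext(pdf_path):
    """Return the text of one PDF. Assumes the `pdftotext` CLI is on PATH."""
    # "-layout" asks pdftotext to keep the physical layout of the page;
    # "-" as the output file sends the text to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def categorize(text):
    """Hypothetical rule: choose a folder name from keywords in the text."""
    lowered = text.lower()
    if "invoice" in lowered:
        return "invoices"
    if "minutes" in lowered:
        return "meetings"
    return "unsorted"


def sort_pdfs(src_dir, dest_dir, extract=extract_with_pdftotext, classify=categorize):
    """Move each *.pdf in src_dir into dest_dir/<category>/; return name -> category."""
    moved = {}
    # Note: "*.pdf" is case-sensitive on most Unix systems.
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        text = extract(pdf)  # the text is only used to decide the category
        target = Path(dest_dir) / classify(text)
        target.mkdir(parents=True, exist_ok=True)
        shutil.move(str(pdf), str(target / pdf.name))  # the original PDF gets filed
        moved[pdf.name] = target.name
    return moved
```

Calling something like `sort_pdfs("dumpster", "sorted")` would process the
whole directory in one pass, which also covers the batch-conversion
question. Since the extracted text is thrown away once the folder is
chosen, how pretty it looks matters much less.
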
On Tue, Aug 7, 2012 at 1:23 AM, Yong Tang wrote:
> I am a full-time information science student and a part-time LAMP server
> administrator. I was recently thrown into a file dumpster containing
> hundreds of old PDF files. My job is to clean the dumpster up by putting
> the right files into the right folders. I am having some difficulty writing
> a Perl script to get the job done. I would appreciate it if you could help.
> First of all, what tool or tools do you use to manipulate PDF files directly in
> a script? I tried some Perl modules such as CAM::PDF and PDF::API2. The
> results were not pretty: the original text formatting was lost.
> I regret that I did not take an XML class last semester, because my
> intuition is that the best way to do this job is to save the PDFs as XML
> and then work on the XML with a script. Instead, I have to save the PDFs as
> plain text. I found that PDFedit and Adobe Acrobat X Pro were good because both
> of them kept the original text formatting after the conversion. However, I have no
> idea how to use them to save multiple PDFs as plain text at once. I
> googled for answers but had no luck. Does anybody know how to do it?
> I am new to text processing. Maybe I am heading in the wrong direction for
> this project? Any input is appreciated.
> Yong Tang
> A student