LISTSERV 16.5 - CODE4LIB Archives

Yong,

I would: (1)  Extract the text from each PDF, keeping the text tied to
the file name.  (2)  Work with the text to categorize the documents.  (3)
Sort the file names by where I wanted the files to go.  (4)  Then use a
script to move the files with the specified names to where I wanted them to go.

This assumes the file names are unique, which they must be if all the files
currently sit in one folder.
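
As a rough illustration of steps (2) through (4), here is a minimal Python
sketch. It is not the method I used on my own project. The keyword rules,
the category names, and the function names are all hypothetical, and it
assumes the text of each PDF is already in a dictionary keyed by file name.

```python
import shutil
from pathlib import Path

# Hypothetical keyword rules: destination folder name -> words that signal it.
RULES = {
    "invoices": ["invoice", "amount due"],
    "minutes": ["minutes", "agenda"],
}


def categorize(text, rules=RULES, default="unsorted"):
    """Step 2: return the first category whose keywords appear in the text."""
    lowered = text.lower()
    for category, keywords in rules.items():
        if any(word in lowered for word in keywords):
            return category
    return default


def plan_moves(texts, rules=RULES):
    """Step 3: map each file name to the folder it should go to."""
    return {name: categorize(text, rules) for name, text in texts.items()}


def move_files(plan, source_dir, dest_root):
    """Step 4: move each named file into dest_root/<category>/."""
    source_dir, dest_root = Path(source_dir), Path(dest_root)
    for name, category in plan.items():
        target_dir = dest_root / category
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(source_dir / name), str(target_dir / name))
```

Printing or saving the result of plan_moves() and checking it by eye before
calling move_files() is a cheap safeguard, since the moves are hard to undo.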

For a project I did using automated indexing for PDFs, I used A-PDF to
Excel Extractor ( http://www.a-pdf.com/to-excel/download.htm ) to put the
text of PDFs into Excel, then used Visual Basic in Excel to work the text.
Some descriptions of that project are here
http://fsulawrc.com/FACsearchengine.html and here
http://www.randtke.com/presentations/NASIG.html .  I didn't have a coding
background, so I was limited to tools with a short learning curve.  Your
total number of files is important in choosing a strategy.  I had 30,000 PDFs
to index.  If I had had only 500, manually looking at each one would probably
have been more efficient.
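
For the text extraction step, a scripted alternative to the Excel route
above might look like the sketch below. It swaps in a different tool from
the one I used: it assumes the pdftotext command-line program from Poppler
is installed and on the PATH, and uses its -layout option, which tries to
preserve the physical layout of the page. The function names are made up
for illustration, and the glob only matches a lowercase .pdf extension.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path):
    """Build the pdftotext call; -layout tries to keep the page layout."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def extract_all(pdf_dir, out_dir):
    """Convert every .pdf in pdf_dir to text, keyed by the PDF's file name.

    Assumes pdftotext is installed and that file names are unique, so each
    PDF gets its own .txt file in out_dir.
    """
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    texts = {}
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        txt = out_dir / (pdf.stem + ".txt")
        # check=True raises an error if pdftotext fails on a file.
        subprocess.run(pdftotext_command(pdf, txt), check=True)
        texts[pdf.name] = txt.read_text(encoding="utf-8", errors="replace")
    return texts
```

The returned dictionary keeps each document's text tied to its file name,
which is what the later categorizing and moving steps need.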

-Wilhelmina Randtke


On Tue, Aug 7, 2012 at 12:23 AM, Yong Tang wrote:

> Hi,
>
> I am a full-time information science student and a part-time LAMP server
> administrator. I was recently thrown into a file dumpster containing
> hundreds of old PDF files. My job is to clean the dumpster up by putting
> the right files into the right folders.  I am having some difficulty writing
> a Perl script to get the job done. I would appreciate it if you could help.
>
> First of all, what tool or tools do you use to manipulate PDF files directly
> in a script? I tried some Perl modules such as CAM::PDF and PDF::API2. The
> results were not pretty: the original text formatting was lost.
>
> I regret that I did not take an XML class last semester, because my
> intuition is that the best way to do this job is to convert the PDFs to
> XML and then work on the XML with a script. Instead, I have to save the
> PDFs as plain text. I found that PDFedit and Adobe Acrobat X Pro were good
> because both of them kept the original text formatting after the conversion.
> However, I have no idea how to use them to save multiple PDFs as plain
> text at once.  I googled for answers but had no luck.  Does anybody know
> how to do it?
>
> I am new to text processing. Maybe I am heading in the wrong direction for
> this project? Any input is appreciated.
>
> Yong Tang
> A student
>