Hello,

The best tool I've found for PDF text extraction is Apache Tika (
http://tika.apache.org/ ). It's written in Java, but there's a pre-built
runnable jar you can use to extract text.

You could just write a simple script to find all the PDF files and run
each one through Tika, but that would probably be very slow, since you'd
be starting and stopping Java for every file. A quicker way is to start
the Tika server and feed it files, like so:

$ java -jar tika-app-1.2.jar --text -s 2000
(This starts the server on port 2000. If you want XML, just use --xml
instead of --text.)
$ find . -name "*.pdf" -exec sh -c \
    'nc localhost 2000 < "$1" > "${1}_tika.txt"' sh {} \;
(This finds all the files named *.pdf, submits each one to the Tika
server on localhost:2000, and writes the output to a file with _tika.txt
appended to the original name. So pdf_dumpster/1/1.pdf will get a text
file named pdf_dumpster/1/1.pdf_tika.txt. Passing the file name as $1,
rather than pasting {} into the quoted command, keeps names containing
quotes or spaces from breaking the shell command.)
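
If you expect to run this more than once (say the server dies halfway
through), a small wrapper that skips files already converted might save
some time. This is only a rough sketch: the function name
tika_extract_all and the TIKA_HOST / TIKA_PORT variables are made up for
illustration, and it assumes the server was started with "--text -s 2000"
as above and that no file name contains a newline.

```shell
# Send every PDF under a directory to a running Tika server and save the
# extracted text next to it as <name>.pdf_tika.txt.
# tika_extract_all, TIKA_HOST and TIKA_PORT are hypothetical names.
tika_extract_all() {
    dir=${1:-.}
    find "$dir" -type f -name '*.pdf' | while IFS= read -r pdf; do
        out="${pdf}_tika.txt"
        # Skip PDFs that already have non-empty output from an earlier run.
        [ -s "$out" ] && continue
        # nc reads the PDF on stdin and writes the server's reply to $out.
        nc "${TIKA_HOST:-localhost}" "${TIKA_PORT:-2000}" < "$pdf" > "$out" \
            || echo "failed: $pdf" >&2
    done
}
```

You'd call it as "tika_extract_all pdf_dumpster" from a second terminal
while the server is running.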

The text output will be good but not perfect. Footnotes, for example,
might look a little funny.

This will work if there's text embedded in the PDF. If there isn't, you'll
have to use OCR to generate the text. For that, I'd recommend Docsplit
(https://github.com/documentcloud/docsplit), which is a great utility from
DocumentCloud. Their documentation is pretty good, so have a look if you
need OCR text generated.
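
One way to find out which PDFs have no embedded text is to look for
output files that came back empty or whitespace-only. A minimal sketch,
assuming the <name>.pdf_tika.txt naming from the find command above (the
function name needs_ocr is just something I made up):

```shell
# Print the path of every PDF whose extracted text file contains no
# non-whitespace characters, i.e. a likely candidate for OCR.
# needs_ocr is a hypothetical helper name.
needs_ocr() {
    find "${1:-.}" -type f -name '*.pdf_tika.txt' | while IFS= read -r txt; do
        # grep -q succeeds as soon as it sees one non-whitespace character.
        if ! grep -q '[^[:space:]]' "$txt"; then
            # Strip the _tika.txt suffix to recover the PDF's path.
            printf '%s\n' "${txt%_tika.txt}"
        fi
    done
}
```

Running "needs_ocr pdf_dumpster > ocr_list.txt" would give you a list of
files to hand to the OCR step. Scanned PDFs with a stray bit of embedded
text will slip through, so treat the list as a starting point.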


b,chris.




On Tue, Aug 7, 2012 at 7:23 AM, Yong Tang wrote:

> Hi,
>
> I am a full-time information science student and a part-time LAMP server
> administrator. I was recently thrown into a file dumpster containing
> hundreds of old PDF files. My job is to clean the dumpster up by putting
> the right files into the right folders. I am facing some difficulties
> writing a Perl script to get the job done. I would appreciate it if you
> could help.
>
> First of all, what tool or tools do you use to manipulate PDF files
> directly in a script? I tried some Perl modules such as CAM::PDF and
> PDF::API2. The results were not pretty. The original text formatting was
> lost.
>
> I regret that I did not take an XML class last semester, because my
> intuition is that the best way to do this job is to convert the PDFs to
> XML and then work on the XML with a script. Instead, I have to convert
> the PDFs to plain text. I found that PDFedit and Adobe Acrobat X Pro were
> good because both of them kept the original text formatting after the
> conversion. However, I have no idea how to use them to convert multiple
> PDFs to plain text at once. I googled for answers but had no luck. Does
> anybody know how to do it?
>
> I am new to text processing. Maybe I am heading in the wrong direction
> for this project? Any input is appreciated.
>
> Yong Tang
> A student
>