LISTSERV 16.5 - CODE4LIB Archives

Your question reminded me that I wanted to do something similar for a
pile of PDFs from our institutional repository. So I made a small script
in Python that can do this, using the Python module from:
http://pybrary.net/pyPdf/

Put the following in a script (mind the indentation):
---8<------
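import os
import sys

import pyPdf

# Start from the path given on the command line, else the current directory.
if len(sys.argv) > 1:
  PATH = sys.argv[1]
else:
  PATH = '.'

for dirpath, dirnames, filenames in os.walk(PATH):
  for filename in filenames:
    # Build the path outside the try block so it is always defined
    # when the error message is written.
    filename_path = os.path.join(dirpath, filename)
    try:
      # The constructor parses the file and raises if it cannot be read
      # as a PDF; the with block closes the file either way.
      with open(filename_path, "rb") as pdf_file:
        pyPdf.PdfFileReader(pdf_file)
    except Exception as e:
      sys.stderr.write('%s :: %s\n' % (filename_path, e))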
---8<---------

If you run it without arguments, it starts from the current directory;
if you specify a path, it starts from there instead. Either way it walks
down the tree and checks each file found. If pyPdf cannot read a file as
a PDF, the script barfs on stderr with the file's path and the error.
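
If installing pyPdf is not an option, a rough first pass can be sketched
with the standard library alone. This is a different and much weaker
check than the script above: it only looks for the "%PDF-" marker near
the start of each file and does not parse anything, so a truncated or
damaged PDF will still pass. The names looks_like_pdf and
find_suspect_files, and the 1024-byte window, are assumptions made for
this illustration.

```python
import os


def looks_like_pdf(path, window=1024):
    """Cheap check: does '%PDF-' occur in the first `window` bytes?

    The window size is an assumption; it only inspects the header and
    does not validate the rest of the file.
    """
    with open(path, "rb") as handle:
        return b"%PDF-" in handle.read(window)


def find_suspect_files(root, check=looks_like_pdf):
    """Walk `root` and return (path, reason) for each file failing `check`."""
    suspects = []
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            try:
                if not check(path):
                    suspects.append((path, "no %PDF- header"))
            except Exception as e:
                # Unreadable files are reported rather than aborting the walk.
                suspects.append((path, str(e)))
    return suspects
```

Calling find_suspect_files('.') returns a list of (path, reason) pairs
that can be written to stderr in the same "path :: reason" form as the
script above. The check argument is there so that a stricter,
parser-based test can be swapped in without touching the directory walk.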

Have fun.

Etienne Posthumus
TU Delft Library   -  Digital Product Development
t: +31 (0) 15 27 81 949
skype:  eposthumus
http://www.library.tudelft.nl/
Prometheusplein 1, 2628 ZC, Delft, Netherlands