I have been having fun recently indexing PDF files.

For the past six months or so I have been keeping the articles I've  
read in a pile, and I was rather amazed at the size of the pile. It  
was about a foot tall. When I read these articles I "actively" read  
them -- meaning, I write, scribble, highlight, and annotate the text  
with my own special notation denoting names, keywords, definitions,  
citations, quotations, list items, examples, etc. This active reading  
process: 1) improves my comprehension, and 2) makes it easier to  
review the articles later and pick out the ideas I thought were  
salient. Being the librarian I am, I thought it might be cool ("kewl")  
to make the articles into a collection. Thus, the beginnings of  
Highlights & Annotations: A Value-Added Reading List.

The techno-weenie process for creating and maintaining the content is  
something this community might find interesting:

  1. Print article and read it actively.

  2. Convert the printed article into a PDF
     file -- complete with embedded OCR --
     with my handy-dandy ScanSnap scanner. [1]

  3. Use MyLibrary to create metadata (author,
     title, date published, date read, note,
     keywords, facet/term combinations, local
     and remote URLs, etc.) describing the
     article. [2]

  4. Save the PDF to my file system.

  5. Use pdftotext to extract the OCRed text
     from the PDF and index it along with
     the MyLibrary metadata using Solr. [3, 4]

  6. Provide a searchable/browsable user
     interface to the collection through a
     mod_perl module. [5, 6]
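
For the curious, steps 5 and 6 might be sketched roughly as follows. This is a minimal illustration in Python, not the mod_perl code referenced above. The Solr URL, the field names (id, keywords, fulltext), and all function names are assumptions made up for the example, and the real field names would have to match whatever the Solr schema defines. The sketch shells out to pdftotext, wraps the text and metadata in a Solr XML add message, and builds a search URL.

```python
import subprocess
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: a local Solr instance; the real URL is not given in the post.
SOLR_URL = "http://localhost:8983/solr"


def extract_text(pdf_path):
    """Run pdftotext and return the OCRed text ("-" sends output to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    text = result.stdout.decode("utf-8", errors="replace")
    # pdftotext separates pages with form feeds, which are not legal in XML 1.0.
    return text.replace("\f", "\n")


def build_add_xml(metadata, fulltext):
    """Build a Solr <add> message from metadata fields plus the full text."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in list(metadata.items()) + [("fulltext", fulltext)]:
        # A list value (e.g. keywords) becomes one <field> per item.
        values = value if isinstance(value, list) else [value]
        for item in values:
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(item)
    return ET.tostring(add, encoding="unicode")


def post_to_solr(xml_message):
    """POST the message to Solr's update handler and commit the change."""
    request = urllib.request.Request(
        SOLR_URL + "/update?commit=true",
        data=xml_message.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def build_select_url(query, rows=10):
    """Build a search URL against Solr's select handler."""
    params = urllib.parse.urlencode({"q": query, "rows": rows})
    return SOLR_URL + "/select?" + params
```

Indexing one article would then be something like post_to_solr(build_add_xml(metadata, extract_text("article.pdf"))), where metadata is a dictionary of the values exported from MyLibrary and "article.pdf" is a placeholder file name.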

Software is never done, and if it were then it would be called  
hardware. Accordingly, I know there are some things I need to do  
before I can truly deem the system version 1.0. At the same time my  
excitement is overflowing and I thought I'd share some geekdom with my  
fellow hackers. Fun with PDF files and open source software.

[1] ScanSnap -
[2] MyLibrary screen dump -
[3] pdftotext -
[4] Solr -
[5] module source code -
[6] user interface -

Eric Lease Morgan
Head, Digital Access and Information Architecture Department
Hesburgh Libraries, University of Notre Dame

(574) 631-8604