LISTSERV 16.5 - CODE4LIB Archives

Hi all,

I have an interesting assessment issue with some recently digitized
newspapers that I wondered if anyone could shed some light on. We sent a
batch of 19th-century newspapers off to a vendor knowing they weren't in
great shape, and now we have to decide whether the resultant images (TIFFs)
are usable or we should be looking for alternative copies and/or microfilm.

A lot of the images are in decent shape, but the first few pages of each
issue are heavily creased and generally missing a smallish piece from the
center of the page where the folds met. I'm looking for a way to
programmatically identify how much text is missing/unusable for each page.
We haven't run OCR yet; part of this assessment is to figure out whether we
should bother sending these items out for OCR and METS/ALTO creation, but I
suspect we could run a quick-and-dirty in-house OCR if that would help.
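
To make the question concrete, here is a rough sketch of the kind of per-page
measurement I have in mind. It rests on assumptions that may not hold for our
scans: that a missing piece shows up as near-black scanner background, and
that the page has already been loaded as rows of 8-bit grayscale values (the
TIFF-reading step is left out). The function name, the threshold of 40, and
the "middle third" region are all hypothetical choices for illustration.

```python
def missing_fraction(pixels, threshold=40, center_only=True):
    """Estimate the fraction of a page region that is missing.

    pixels: list of rows, each a list of grayscale values (0 = black, 255 = white).
    threshold: values at or below this count as "missing" (assumed dark backing).
    center_only: if True, look only at the middle third of the page in both
                 directions, where the folds are assumed to meet.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0

    if center_only:
        top, bottom = height // 3, height - height // 3
        left, right = width // 3, width - width // 3
    else:
        top, bottom, left, right = 0, height, 0, width

    total = 0
    dark = 0
    for row in pixels[top:bottom]:
        for value in row[left:right]:
            total += 1
            if value <= threshold:
                dark += 1

    # Guard against empty input rather than dividing by zero.
    return dark / total if total else 0.0


# Tiny synthetic "page": all white, with one dark pixel in the center.
page = [[255] * 9 for _ in range(9)]
page[4][4] = 0
print(missing_fraction(page))                     # share of the central region
print(missing_fraction(page, center_only=False))  # share of the whole page
```

A measure like this would only capture the hole itself, not text made
unreadable by creasing, and dense ink or illustrations could be miscounted as
missing, which is part of why I'm asking what others have done.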

We can go through the images by hand and try to measure and/or count, but
if anyone's worked on something like this or has thoughts, I'd love to hear
them!

Thanks,
Christine

-- 
Christine Mayo
Digital Production Librarian
Thomas P. O'Neill, Jr. Library
Boston College
140 Commonwealth Avenue
Chestnut Hill, MA 02467