On Sat, Apr 27, 2013 at 9:37 PM, Andrew Hankinson wrote:

> As someone who works on document recognition, I have to disagree. You
> should always keep an uncompressed original around, since you can never
> recover it without (often expensive) re-imaging. JPEG, or any other type of
> lossy compression, introduces artifacts that don't look "too bad" by the
> human eye, but have a significant effect on the quality of OCR. You can
> never recover this after you have discarded your originals.
> Big files are clunky to work with, which is why you should have an
> automated way of producing surrogate, compressed copies for general use,
> but like any archivist will tell you, a photocopy is not a replacement for
> the original.

All true, but keeping "just in case" copies of uncompressed files around
has significant disadvantages unless you have the resources to deal with
them. Any archivist will tell you they need the uncompressed files.
However, many of them don't have the disk space, bandwidth, staff
resources, etc. to deal with these files, and wind up doing things that are
far more dangerous like just having files sitting around on cheap external

Every choice people make is about loss. Equipment, optics, lighting, you
name it. But for some reason, the instant we're talking about bits of data
on a disk, people plan as though capacity were unlimited, when most archives
are severely under-resourced.

If you only have to deal with a few small projects, keeping uncompressed
images is no big deal. But let's suppose you have a million pages or more
-- this introduces a completely different cost structure that permanently
affects what resources you'll have for other projects in the future.
Objectives and available resources need to drive decisions unless we
believe that the best plan is to do what we'd do in an ideal world until
resources run out.
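
To put rough numbers on the million-page case, here is a back-of-the-envelope
sketch. Every figure in it (letter-size pages, 400 dpi, 24-bit color, a 10:1
lossy compression ratio) is an assumption picked for illustration, not a
measurement from any real project, so substitute your own values.

```python
# Rough storage estimate. All inputs are illustrative assumptions.

def page_bytes(width_in, height_in, dpi, bytes_per_pixel):
    """Bytes needed for one uncompressed raster page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


def collection_terabytes(pages, bytes_per_page):
    """Total size in decimal terabytes (1 TB = 10**12 bytes)."""
    return pages * bytes_per_page / 1e12


PAGES = 1_000_000            # assumed collection size
COMPRESSION_RATIO = 10       # assumed lossy ratio, varies widely in practice

raw_page = page_bytes(8.5, 11, 400, 3)   # letter size, 400 dpi, 24-bit color
raw_tb = collection_terabytes(PAGES, raw_page)
compressed_tb = raw_tb / COMPRESSION_RATIO

print(f"uncompressed page: {raw_page / 1e6:.1f} MB")
print(f"uncompressed collection: {raw_tb:.1f} TB")
print(f"compressed collection:   {compressed_tb:.1f} TB")
```

Under these assumptions a single copy of the uncompressed set is about 45 TB,
before backups or off-site replicas, versus roughly 4.5 TB compressed.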