Karen Coyle wrote:
> I know that 98% is impressive, but I always like to remember that with
> an average of 2000 characters per page that means 40 potential errors
> per book page. Just to give us some perspective on the level of
> cleanup that will be needed for books being digitized today.
The "good" news from the perspective of searching is that a reasonable
percentage of those errors will affect terms that are either rarely used
in searching or are repeated correctly in the vicinity.
The bad news: phrase search is compromised. Screen readers for the
visually impaired are compromised. Relevance that depends on term
clustering is compromised.
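For a rough sense of scale, here is a back-of-the-envelope sketch in
Python. It reproduces the 40-errors-per-page figure quoted above and
shows how a 98% per-character rate compounds over words and phrases.
It assumes errors hit characters independently and that a word averages
5 characters; both are simplifications of mine, not measurements, and
the function names are purely illustrative.

```python
def expected_errors_per_page(chars_per_page: int, char_accuracy: float) -> float:
    """Expected number of wrong characters on one page."""
    return chars_per_page * (1.0 - char_accuracy)


def span_ok_probability(span_chars: int, char_accuracy: float) -> float:
    """Chance that a run of characters is recognized with no errors.

    Assumes each character is right or wrong independently of its
    neighbours, which real OCR errors only approximate.
    """
    return char_accuracy ** span_chars


if __name__ == "__main__":
    accuracy = 0.98      # per-character accuracy from the quoted message
    page_chars = 2000    # average characters per page from the quoted message
    word_chars = 5       # assumed average word length (illustrative)

    print(f"errors per page: {expected_errors_per_page(page_chars, accuracy):.0f}")
    print(f"one word intact: {span_ok_probability(word_chars, accuracy):.3f}")
    print(f"three-word phrase intact: {span_ok_probability(3 * word_chars, accuracy):.3f}")
```

Under those assumptions a single word survives about 90% of the time,
but a three-word phrase survives only about 74% of the time, which is
why phrase search suffers far more than single-term search.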
If we had to correct it all: a) it would never get done and b) it would
be better than some of the originals which are rife with typographic errors.
Walter
so still regrets the Swedish Chef OCR of most microfilm newspaper projects