Karen Coyle wrote:
> I know that 98% is impressive, but I always like to remember that with 
> an average of 2000 characters per page that means 40 potential errors 
> per book page. Just to give us some perspective on the level of 
> cleanup that will be needed for books being digitized today.
The "good" news from the perspective of searching is that a reasonable 
percentage of those errors will affect terms that are either rarely used 
in searching or are repeated correctly in the vicinity. 

The bad news: phrase search is compromised. Screen readers for the 
visually impaired are compromised. Relevance that depends on term 
clustering is compromised.
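
For a rough sense of scale, here is a back-of-the-envelope sketch in Python. It reproduces Karen's 40-errors-per-page figure and estimates how often a search term or phrase of a given length survives OCR intact. The function names are made up for illustration, and the model assumes every character fails independently at the same rate. That is a simplification: real OCR errors tend to cluster on bad scans and difficult typefaces, so treat the numbers as ballpark only.

```python
def expected_errors(chars_per_page: int, char_accuracy: float) -> float:
    """Expected number of wrong characters on a page."""
    return chars_per_page * (1.0 - char_accuracy)


def clean_probability(length: int, char_accuracy: float) -> float:
    """Chance that a run of `length` characters has no errors,
    ASSUMING each character fails independently (a simplification)."""
    return char_accuracy ** length


# Karen's figure: 2000 characters per page at 98% accuracy.
print(f"errors per page:      {expected_errors(2000, 0.98):.0f}")

# A single 6-letter search term versus a 20-character phrase.
print(f"6-char term intact:   {clean_probability(6, 0.98):.1%}")
print(f"20-char phrase intact: {clean_probability(20, 0.98):.1%}")
```

Under these assumptions a single short term comes through clean most of the time, while a 20-character phrase is damaged roughly one time in three, which is why phrase search suffers far more than single-term search.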

If we had to correct it all: a) it would never get done and b) it would 
be better than some of the originals which are rife with typographic errors.

Walter
  so still regrets the Swedish Chef OCR of most microfilm newspaper projects