LISTSERV 16.5 - CODE4LIB Archives

On February 2, Walter Lewis wrote:

> The "good" news from the perspective of searching is that a 
> reasonable percentage of those errors will affect terms that are 
> either rarely used in searching or are repeated correctly in the 
> vicinity.

This is why OCR should be done by a search engine company (such as 
Google), which has statistics on what real people really search 
for, and can improve the OCR process as it goes.  Software 
development companies such as ABBYY or OmniPage never get that kind 
of feedback from actual users.  They only represent a fraction of 
the entire feedback loop.  None of my experience of scanning old 
Swedish and Danish books with ABBYY FineReader ever got back to 
ABBYY; they never asked for any of that feedback.

I have no idea to what degree Google Book Search does this right, 
but by controlling the entire scan-search loop they have one 
less excuse for failing.


-- 
  Lars Aronsson
  Aronsson Datateknik - http://aronsson.se