On February 2, Walter Lewis wrote:
> The "good" news from the perspective of searching is that a
> reasonable percentage of those errors will affect terms that are
> either rarely used in searching or are repeated correctly in the
> vicinity.
This is why OCR should be done by a search engine company (such as
Google), which has statistics on what real people really search
for, and can improve the OCR process as it goes. Software
development companies such as ABBYY or OmniPage never get that kind
of feedback from actual users. They only represent a fraction of
the entire feedback loop. None of my experience of scanning old
Swedish and Danish books with ABBYY FineReader ever got back to
ABBYY; they never asked for any of that feedback.
I have no idea to what degree Google Book Search does this right,
but by controlling the entire scan-search loop they have one
less excuse for failing.
--
Lars Aronsson
Aronsson Datateknik - http://aronsson.se