LISTSERV 16.5 - CODE4LIB Archives

We use ABBYY FineReader for things that will need correction (yearbooks
where the text was handwritten, for example), and Acrobat for things that
we're not willing to spend the time correcting. FineReader is good if you
really want the OCR perfectly formatted, as it can handle tables, charts,
vertical text and such. It would be especially useful if you're
planning to provide access to people with disabilities, where the
quality of formatting matters.

On Jul 20, 2017 8:25 AM, "Mark Watkins" <[log in to unmask]> wrote:

> I have recently released a book-club-related app called Bookship
> (www.bookshipapp.com), which features the ability to scan a page of text
> from a book so users can post quotes. So my use case is people taking
> pictures of pages with their phones and OCR-ing them.
>
> I extensively tested Tesseract (an open source project at this point, not
> a formal Google product, I don't think), and compared it to the Google
> Cloud Vision API's OCR product (https://cloud.google.com/vision/). For my
> use case, the Google Cloud API blew away Tesseract. Tesseract really
> struggled with images that weren't perfectly vertical/horizontal and had
> difficulty dealing with the top and bottom of images (i.e. if a line got
> cut in half by the picture, Tesseract produced a few lines of gibberish at
> the top). The Google Cloud API seems to be nearly flawless at all of that.
> It was also an order of magnitude faster, and it provides additional
> features (entity extraction, objectionable content detection, etc.).
>
> Of course, Tesseract is free and the Google product requires licensing,
> although it provides a limited amount of usage (1000/month, I think) for
> free.
>
> And of course these results may be due to my use case or to my incorrect
> setup somehow.
>
> Your Mileage May Vary :)
>
> Mark
>