LISTSERV 16.5 - CODE4LIB Archives

Ben beat me to the punch in mentioning the iDigBio hackathon OCR project and his own project for handwriting transcription, so I'll add a few other things. First, I'll soon be prototyping a RESTful API for OCR using Tesseract, so if anyone is interested in providing input or contributing code, please ping me. I'll be writing it in Python but have not decided which API framework, if any, I'll use, so suggestions are welcome. The "short" list that needs to get shorter is Flask, CherryPy, and (on the heavier side) various RESTful solutions within Django, such as Piston. I'll start on this when my plate clears a little, hopefully in a month or less.
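
To make the idea concrete, here is a rough sketch of what such a service could look like. It is not the planned prototype: it uses only the Python standard library (wsgiref) instead of any of the frameworks above, and it assumes the tesseract command-line program is installed and on the PATH. The /ocr path, port 8000, and the function names are made up for illustration.

```python
import subprocess
import tempfile
from wsgiref.simple_server import make_server

PLAIN = [("Content-Type", "text/plain; charset=utf-8")]


def run_tesseract(image_bytes):
    """OCR raw image bytes by shelling out to the tesseract CLI.

    Assumption: a tesseract build that accepts "stdout" as the output
    base, so the recognized text is printed instead of written to a file.
    """
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(image_bytes)
        tmp.flush()
        result = subprocess.run(
            ["tesseract", tmp.name, "stdout"],
            capture_output=True,
            check=True,
        )
    return result.stdout.decode("utf-8", errors="replace")


def make_app(ocr=run_tesseract):
    """Build a WSGI app; the OCR function is injectable so it can be swapped."""

    def app(environ, start_response):
        # Only POST /ocr is handled (hypothetical endpoint name).
        if (environ.get("REQUEST_METHOD") != "POST"
                or environ.get("PATH_INFO") != "/ocr"):
            start_response("404 Not Found", PLAIN)
            return [b"POST an image to /ocr\n"]

        # The request body is taken to be the raw image file.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        image_bytes = environ["wsgi.input"].read(length)
        if not image_bytes:
            start_response("400 Bad Request", PLAIN)
            return [b"empty request body\n"]

        try:
            text = ocr(image_bytes)
        except (OSError, subprocess.CalledProcessError):
            start_response("500 Internal Server Error", PLAIN)
            return [b"OCR failed\n"]

        # Return whatever text came back, however dirty.
        start_response("200 OK", PLAIN)
        return [text.encode("utf-8")]

    return app


if __name__ == "__main__":
    make_server("localhost", 8000, make_app()).serve_forever()
```

With the server running, something like `curl --data-binary @page.png http://localhost:8000/ocr` should return the recognized text. A real service would also need upload size limits, multi-file handling, and probably a job queue, none of which are sketched here.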

Michael Giddens has written a simple web service for Tesseract (see http://www.silverbiology.com/blog/2011/03/10/amazon-ec2-tesseract-ocr-thank-you/). You'd have to provide the hardware, but he's provided the code. I have not used it myself, but it looks very straightforward.

Lastly, I'd like to plug iDigBio (https://www.idigbio.org) and the Augmenting OCR working group (https://www.idigbio.org/wiki/index.php/IDigBio_Working_Groups) a bit more. The biocollections community is up against this text transcription/OCR bottleneck, and we are hoping to develop stronger ties with other communities that have similar problems. This is one reason we scheduled the first iDigBio hackathon during the 2013 iConference here in Fort Worth: so we could introduce our challenges to the information and library science communities. I look forward to continuing the discussion, and hopefully we'll collaborate and converge on solutions that have broad impact.

Jason

On Mar 12, 2013, at 10:00 PM, CODE4LIB automatic digest system wrote:

Date:    Tue, 12 Mar 2013 11:57:06 -0400
From:    Eric Lease Morgan
Subject: web-based ocr

Does anybody here know of a Web-based OCR program or Web service?

Many people want to do OCR against digitized texts. We all know of various OCR applications (Adobe Acrobat, ABBYY FineReader, Google's Tesseract, etc.), but they are not necessarily Web-based. As a service to my university, I thought it might be cool (or "kewl") to support an image-to-text application. Go to a Web form. Submit one or more image files. Have OCR done against them, no matter how dirty the output. Return plain text. As a bonus, the application would support a RESTful API.

Does anybody know of something like this that exists already?


Jason Best
Biodiversity Informatician
Botanical Research Institute of Texas
1700 University Drive
Fort Worth, Texas 76107

817-332-4441 ext. 230
http://www.brit.org