Hi Matt,

I'm going to have someone on my end contact you directly, but for the other
code4lib-ers out there who are interested: it's a simple question with a
complicated answer. It depends on the corpus you have to work with. You
also have to prep your files carefully. The leading solution is a program
called Sakhr, but you have to spend time training it. Tesseract and ABBYY
work, too, but their accuracy depends on a variety of factors.
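
Since accuracy varies so much by corpus, one way to compare tools is to
hand-transcribe a few sample pages and score each program's output against
them. Here is a minimal sketch in Python of a character error rate (edit
distance divided by reference length). The function names are my own for
illustration and are not part of any OCR package, and it assumes you already
have the OCR output as plain text.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance normalized by the length of the hand-checked reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)
```

Running this on the same sample pages for each tool gives a rough, comparable
number. It compares code points literally, so you would probably want to
normalize the text first (diacritics, ligatures, whitespace) before trusting
the scores.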

Best wishes,


On Fri, May 4, 2018 at 5:56 PM, Matt Sherman wrote:

> Hi all,
> I was hoping someone could point me to some programs that might be
> helpful.  I am helping a scholar plan a large-scale digitization of his
> collection of Arabic books so he can work abroad, and I need to find out the
> best way to scan and OCR them.  I know generally how to look into the
> scanning of the books (though if anyone knows some good services that
> aren't too expensive, let me know); the bigger question is how well we can
> OCR them.  Does anyone have advice on how to run OCR on non-Roman character
> texts?  Particularly, in this case, Arabic.  Any insights would be helpful
> as we put this plan together so we can develop this project and its budget
> appropriately.  Thanks for any information you folks can provide.
> Matt Sherman

Carol Kassel
Senior Manager, Digital Library Infrastructure
NYU Digital Library Technology Services
(212) 992-9246