Hi Matt,
I'm going to have someone on my end contact you directly, but for the other
code4lib-ers out there who are interested: it's a simple question with a
complicated answer. It depends on the corpus you have to work with. You
also have to prep your files carefully. The leading solution is a program
called Sakhr, but you have to spend time training it. Tesseract and ABBYY
work, too, but their accuracy depends on a variety of factors.
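
For anyone who wants to try Tesseract on a few test pages first, here is a minimal sketch in Python. It assumes the Tesseract command-line tool and its Arabic language data ("ara") are installed. The file names, the helper names, and the page segmentation mode (6, a single uniform block of text) are illustrative assumptions, not a recommended configuration.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="ara", psm=6):
    """Build the argument list for one Tesseract CLI run.

    lang="ara" selects the Arabic language data; psm=6 is an assumed
    page segmentation mode and may need tuning for a given corpus.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang, "--psm", str(psm)]


def ocr_page(image, output_base, lang="ara"):
    """Run Tesseract on one scanned page and return the path of the text file."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    # Tesseract writes plain-text output to <output_base>.txt by default.
    return Path(f"{output_base}.txt")
```

Calling ocr_page("page_001.tif", "page_001") on a hypothetical scan would write page_001.txt. Comparing that output by hand against a few pages is one way to judge accuracy on your own corpus before you commit to a tool.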
Best wishes,
Carol
On Fri, May 4, 2018 at 5:56 PM, Matt Sherman wrote:
> Hi all,
>
> I was hoping someone could point me to some programs that might be
> helpful. I am helping a scholar plan a large-scale digitization of his
> collection of Arabic books so he can work abroad, and I need to find out the
> best way to scan and OCR them. I know generally how to look into the
> scanning of the books (though if anyone knows some good services that
> aren't too expensive, let me know); the bigger question is how well we can
> OCR them. Does anyone have advice on how to run OCR on non-Roman character
> texts, particularly Arabic in this case? Any insights would be helpful
> as we put this plan together so we can develop this project and its budget
> appropriately. Thanks for any information you folks can provide.
>
> Matt Sherman
>
--
Carol Kassel
Senior Manager, Digital Library Infrastructure
NYU Digital Library Technology Services
(212) 992-9246
dlib.nyu.edu