LISTSERV 16.5 - CODE4LIB Archives

On Sep 2, 2021, at 4:07 PM, Kimberly Kennedy wrote:

> I was wondering if anyone has created a script or tool to compare the words
> in a text file to a dictionary? I'm looking for a way to quantify the
> quality of OCR output. I've heard that counting the number of words that
> are in the dictionary is a good quick and dirty way to do this, but I would
> like to be able to run this script on larger batches of text files so I can
> compare OCR engines (not count words manually).
> 
> Let me know if you have any existing tools or thoughts about how to go
> about this!
> 
> --
> Kimberly Kennedy
> Digital Production Coordinator
> Northeastern University Library


While I have not explicitly written such a script, the algorithm is simple:

  count and tabulate the tokens in the OCR output (each distinct token and the number of times it appears); in Python such a table is, ironically, called a "dictionary". You may (or may not) want to normalize the tokens first by lower-casing them, removing numbers, removing punctuation, etc.

  create a simple list (array) of all the words in your dictionary

  set counter = 0

  for each distinct OCR'ed token

    if token is in dictionary, then set counter = counter + number of times token appears in OCR

  set percentage = 100 * counter / total number of OCR'ed tokens

  done

If the percentage is 100, then every OCR'ed token was found in the dictionary, and the OCR was most likely very good (a misread that happens to form another real word will still slip through). If the percentage is less than 50, then your OCR is more inaccurate than not.  :(  Given two plain text files (the OCR'ed text and the list of dictionary words), this algorithm can easily be implemented in just about any programming language in fewer than a couple dozen lines.
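
The steps above can be sketched in Python roughly as follows. This is only a minimal sketch, not a tested tool: the function names (tokenize, dictionary_score) are made up for illustration, and the normalization (lower-casing, treating digits and punctuation as separators) is just one of several reasonable choices.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and return its runs of letters.

    Digits and punctuation act as separators; an apostrophe is kept
    only when it sits between letters (e.g. "don't").
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def dictionary_score(ocr_text, dictionary_words):
    """Return the percentage (0-100) of OCR tokens found in the dictionary."""
    # Tabulate the tokens: token -> number of times it appears.
    counts = Counter(tokenize(ocr_text))
    # A set makes each "is the token in the dictionary?" test fast.
    lexicon = {word.lower() for word in dictionary_words}

    total = sum(counts.values())
    if total == 0:
        return 0.0  # no tokens at all; avoid dividing by zero

    # Add up the occurrences of every distinct token that is in the dictionary.
    found = sum(n for token, n in counts.items() if token in lexicon)
    return 100.0 * found / total
```

To compare OCR engines, one could read each engine's output file into a string, call dictionary_score() on it with the same word list, and compare the resulting percentages across the batch.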

Actually, the hardest part would be getting the dictionary words. 
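
One possible starting point, assuming you have (or can find) a plain text word list with one word per line: many Unix-like systems ship such a file at /usr/share/dict/words, though it may cover historical or specialized vocabulary poorly. The function name load_wordlist below is hypothetical, and the sketch assumes the file is UTF-8 encoded.

```python
def load_wordlist(path):
    """Read a one-word-per-line text file into a set of lower-cased words."""
    words = set()
    # Assumption: the word list is UTF-8 encoded plain text.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            word = line.strip().lower()
            if word:  # skip blank lines
                words.add(word)
    return words
```

For example, load_wordlist("/usr/share/dict/words") would return the system word list as a set, if that file exists on your machine.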

Fun with text mining. 

--
Eric Morgan
University of Notre Dame