On Feb 25, 2013, at 8:12 AM, Seth van Hooland wrote:
> You want to automate the discovery of people, place names and events within a large corpus of unstructured documents or metadata (e.g. description field)? Then you might want to use the Named-Entity Recognition (NER) extension for OpenRefine that has been developed by Multimedia Lab (ELIS — Ghent University / iMinds) and MasTIC (Université Libre de Bruxelles).
Yes, named-entity recognition (NER) is fun.
About a year ago I used a different application to do NER against about 100 digitized files. From my blog posting:
named-entity extraction – There was a desire to list the
underlying names, places, and organizations from each text. These
things can put a text into a context for the reader. Are there a
lot of Irish names? Is there a preponderance of place names from
the United States? To accomplish this task and assist in
answering these sorts of questions, a Perl script was written
around the Stanford Named Entity Recognizer. This script
(txt2ner.pl) extracts the entities, looks them up in DBpedia, and
saves metadata (abstracts, URLs to images, as well as latitudes &
longitudes) describing the entities to a locally defined XML file
for later processing. (See an example.) A CGI script (ner.cgi)
was then written to provide a reader-interface to these files.
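For anybody who wants to try the same sort of thing, here is a rough sketch of the middle of that pipeline in Python. It is not a port of txt2ner.pl. It assumes the text has already been run through Stanford NER with its inlineXML output format, so entities arrive wrapped in tags like <PERSON> and <LOCATION>. The function names, the sample sentence, and the XML element names are all made up for illustration, and the DBpedia URI is only a guess built from the entity name, not the result of an actual lookup.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import quote

# Assumption: input is Stanford NER inlineXML-style text, where entities
# are wrapped in tags such as <PERSON>...</PERSON>.
TAGGED = re.compile(r"<(PERSON|LOCATION|ORGANIZATION)>(.+?)</\1>")


def extract_entities(tagged_text):
    """Count each (type, name) pair found in the tagged text."""
    return Counter(
        (m.group(1), m.group(2).strip()) for m in TAGGED.finditer(tagged_text)
    )


def dbpedia_uri(name):
    """Guess a DBpedia resource URI from a name (no network lookup is done)."""
    return "http://dbpedia.org/resource/" + quote(name.replace(" ", "_"))


def entities_to_xml(counts):
    """Serialize the counts as XML; element names here are hypothetical."""
    root = ET.Element("entities")
    for (kind, name), n in sorted(counts.items()):
        entity = ET.SubElement(root, "entity", type=kind, count=str(n))
        ET.SubElement(entity, "name").text = name
        ET.SubElement(entity, "uri").text = dbpedia_uri(name)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Made-up sample sentence, not taken from the corpus.
    sample = (
        "<PERSON>Jane Doe</PERSON> sailed from <LOCATION>Ireland</LOCATION> "
        "to the <LOCATION>United States</LOCATION>. "
        "<LOCATION>Ireland</LOCATION> was far behind."
    )
    print(entities_to_xml(extract_entities(sample)))
```

A real version would fetch each guessed URI (or query DBpedia) to pull back the abstract, image URL, and latitude & longitude, and would need to cope with names that do not map cleanly onto a resource. That is probably one reason data like this ends up less clean than one would hope.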
Once I "NER'ed" the files and saved the corresponding linked data, I was able to create a tablet-based interface that lets the reader see how the words are used in context, read a blurb from Wikipedia, and map places via Google Maps. For example, see some extracts from a book called An Adventure With The Apaches, though the data is not as clean as I would hope. The whole thing was part of a project we called the Catholic Youth Literature Project.
The ELIS software looks pretty interesting. 
Fun with distant reading and NER.
 blog posting - http://blogs.nd.edu/emorgan/2012/03/cyl/
 Stanford NER - http://nlp.stanford.edu/software/CRF-NER.shtml
 txt2ner.pl - http://dh.crc.nd.edu/sandbox/cyl/bin/txt2ner.pl
 intermediate XML file - http://dh.crc.nd.edu/sandbox/cyl/corpus/advicetoirishgir00cusa.ner
 CGI script - http://dh.crc.nd.edu/sandbox/cyl/bin/ner-cgi.pl
 Adventure - http://dh.crc.nd.edu/sandbox/cyl/catalog/details/adventurewithapa00ferriala.html
 Catholic Youth Literature - http://dh.crc.nd.edu/sandbox/cyl/catalog/
 ELIS - http://freeyourmetadata.org/named-entity-extraction/
Eric Lease Morgan
University of Notre Dame