On Feb 25, 2013, at 8:12 AM, Seth van Hooland wrote:

> You want to automate the discovery of people, place names and events within a large corpus of unstructured documents or metadata (e.g. description field)? Then you might want to use the Named-Entity Recognition (NER) extension for OpenRefine that has been developed by Multimedia Lab (ELIS - Ghent University / iMinds) and MasTIC (Université Libre de Bruxelles).

Yes, named-entity recognition (NER) is fun. About a year ago I used a different application to do NER against about 100 digitized files. From my blog posting [0]:

  named-entity extraction – There was a desire to list the underlying names, places, and organizations from each text. These things can put a text into a context for the reader. Are there a lot of Irish names? Is there a preponderance of place names from the United States? To accomplish this task and assist in answering these sorts of questions, a Perl script was written around the Stanford Named Entity Recognizer. [1] This script (txt2ner.pl [2]) extracts the entities, looks them up in DBpedia, and saves metadata (abstracts, URLs to images, as well as latitudes & longitudes) describing the entities to a locally defined XML file for later processing. (See an example. [3]) A CGI script (ner.cgi [4]) was then written to provide a reader-interface to these files.

Once I "NER'ed" the files and saved the corresponding linked data, I was able to create a tablet-based interface that lets the reader see how the words are used in context, read a blurb from Wikipedia, and map places via Google Maps. For example, see some extracts from a book called An Adventure With The Apaches [5], although the data is not as clean as I would hope. The whole thing was a part of a project we called the Catholic Youth Literature Project. [6]

The ELIS software looks pretty interesting. [7]

Fun with distant reading and NER.
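
The extract-and-save step described above can be sketched in a few lines. This is a minimal illustration in Python, not the txt2ner.pl script itself: it assumes the tagger's output is plain text in a token/TAG form (for example Dublin/LOCATION, with O marking non-entities), and the function names, the `lookup` callable standing in for a DBpedia query, and the XML element names are all hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import groupby


def parse_slash_tags(tagged_text):
    """Turn 'token/TAG' text into a list of (entity, tag) pairs.

    Assumption: tokens are whitespace-separated and 'O' marks non-entities.
    Adjacent tokens that share a tag are merged into one entity.
    """
    pairs = [tok.rsplit("/", 1) for tok in tagged_text.split() if "/" in tok]
    entities = []
    for tag, group in groupby(pairs, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((" ".join(word for word, _ in group), tag))
    return entities


def entities_to_xml(entities, lookup=None):
    """Serialize entities to a made-up XML layout (not the project's format).

    `lookup` is a hypothetical callable that returns a dict of extra
    metadata (abstract, image URL, coordinates) for an entity name.
    """
    root = ET.Element("entities")
    for name, tag in entities:
        node = ET.SubElement(root, "entity", type=tag)
        ET.SubElement(node, "name").text = name
        metadata = lookup(name) if lookup else {}
        for key, value in metadata.items():
            ET.SubElement(node, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Illustrative input only; not taken from the corpus.
tagged = "Mary/PERSON Smith/PERSON sailed/O from/O Dublin/LOCATION to/O New/LOCATION York/LOCATION ./O"
entities = parse_slash_tags(tagged)
# entities == [("Mary Smith", "PERSON"), ("Dublin", "LOCATION"), ("New York", "LOCATION")]
xml_text = entities_to_xml(entities)
```

Merging adjacent tokens by tag is the simplest possible grouping, so two different names that happen to sit side by side would be joined into one entity. That is one of the ways the resulting data can end up less clean than hoped.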

[0] blog posting - http://blogs.nd.edu/emorgan/2012/03/cyl/
[1] Stanford NER - http://nlp.stanford.edu/software/CRF-NER.shtml
[2] txt2ner.pl - http://dh.crc.nd.edu/sandbox/cyl/bin/txt2ner.pl
[3] intermediate XML file - http://dh.crc.nd.edu/sandbox/cyl/corpus/advicetoirishgir00cusa.ner
[4] CGI script - http://dh.crc.nd.edu/sandbox/cyl/bin/ner-cgi.pl
[5] Adventure - http://dh.crc.nd.edu/sandbox/cyl/catalog/details/adventurewithapa00ferriala.html
[6] Catholic Youth Literature - http://dh.crc.nd.edu/sandbox/cyl/catalog/
[7] ELIS - http://freeyourmetadata.org/named-entity-extraction/

--
Eric Lease Morgan
University of Notre Dame
574/631-8604