On Feb 25, 2013, at 8:12 AM, Seth van Hooland wrote:
> You want to automate the discovery of people, place names and events within a large corpus of unstructured documents or metadata (e.g. description field)? Then you might want to use the Named-Entity Recognition (NER) extension for OpenRefine that has been developed by Multimedia Lab (ELIS — Ghent University / iMinds) and MasTIC (Université Libre de Bruxelles).
Yes, named-entity recognition (NER) is fun.
About a year ago I used a different application to do NER against about 100 digitized files. From my blog posting [0]:
named-entity extraction – There was a desire to list the
underlying names, places, and organizations from each text. These
things can put a text into a context for the reader. Are there a
lot of Irish names? Is there a preponderance of place names from
the United States? To accomplish this task and assist in
answering these sorts of questions, a Perl script was written
around the Stanford Named Entity Recognizer. [1] This script
(txt2ner.pl [2]) extracts the entities, looks them up in DBpedia, and
saves metadata (abstracts, URLs to images, as well as latitudes &
longitudes) describing the entities to a locally defined XML file
for later processing. (See an example. [3]) A CGI script (ner.cgi [4])
was then written to provide a reader-interface to these files.
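
The extract, look up, and save steps described above can be sketched roughly as follows. This is a minimal illustration in Python, not the txt2ner.pl script itself. It assumes the recognizer's output is in the inline "word/TAG" style, that a DBpedia resource URI can be formed by replacing spaces with underscores, and that any richer metadata (abstracts, images, coordinates) comes from a caller-supplied lookup function. The function names and the XML element names are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import groupby
from urllib.parse import quote


def parse_slash_tags(tagged):
    """Group consecutive tokens that share a non-O tag into (name, type) pairs."""
    # Assumes each token looks like "word/TAG"; tokens without a slash are skipped.
    tokens = [tok.rsplit("/", 1) for tok in tagged.split()]
    tokens = [t for t in tokens if len(t) == 2]
    entities = []
    for tag, group in groupby(tokens, key=lambda t: t[1]):
        if tag != "O":
            entities.append((" ".join(word for word, _ in group), tag))
    return entities


def dbpedia_uri(name):
    """Build a candidate DBpedia resource URI (the resource may not exist)."""
    return "http://dbpedia.org/resource/" + quote(name.replace(" ", "_"))


def entities_to_xml(entities, lookup=None):
    """Serialize entities to XML; element names here are illustrative only."""
    root = ET.Element("entities")
    for name, tag in entities:
        node = ET.SubElement(root, "entity", type=tag)
        ET.SubElement(node, "name").text = name
        ET.SubElement(node, "uri").text = dbpedia_uri(name)
        # lookup is a hypothetical hook returning extra metadata as a dict,
        # e.g. an abstract or a latitude and longitude fetched elsewhere.
        extra = lookup(name) if lookup else {}
        for key, value in extra.items():
            ET.SubElement(node, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    tagged = "Father/O Brown/PERSON went/O to/O New/LOCATION York/LOCATION ./O"
    found = parse_slash_tags(tagged)
    print(found)  # [('Brown', 'PERSON'), ('New York', 'LOCATION')]
    print(entities_to_xml(found))
```

Keeping the lookup as a pluggable function means the XML can be written even when the remote service is unavailable, and the intermediate file can be enriched later.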
Once I "NER'ed" the files and saved the corresponding linked data, I was able to create a tablet-based interface that lets the reader see how the words are used in context, read a blurb from Wikipedia, and map places via Google Maps. For example, see some extracts from a book called An Adventure With The Apaches [5], although the data is not as clean as I would hope. The whole thing was part of a project we called the Catholic Youth Literature Project. [6]
The ELIS software looks pretty interesting. [7]
Fun with distant reading and NER.
[0] blog posting - http://blogs.nd.edu/emorgan/2012/03/cyl/
[1] Stanford NER - http://nlp.stanford.edu/software/CRF-NER.shtml
[2] txt2ner.pl - http://dh.crc.nd.edu/sandbox/cyl/bin/txt2ner.pl
[3] intermediate XML file - http://dh.crc.nd.edu/sandbox/cyl/corpus/advicetoirishgir00cusa.ner
[4] CGI script - http://dh.crc.nd.edu/sandbox/cyl/bin/ner-cgi.pl
[5] Adventure - http://dh.crc.nd.edu/sandbox/cyl/catalog/details/adventurewithapa00ferriala.html
[6] Catholic Youth Literature - http://dh.crc.nd.edu/sandbox/cyl/catalog/
[7] ELIS - http://freeyourmetadata.org/named-entity-extraction/
--
Eric Lease Morgan
University of Notre Dame
574/631-8604