LISTSERV 16.5 - CODE4LIB Archives

On Feb 25, 2013, at 8:12 AM, Seth van Hooland wrote:

> You want to automate the discovery of people, place names and events within a large corpus of unstructured documents or metadata (e.g. description field)? Then you might want to use the Named-Entity Recognition (NER) extension for OpenRefine that has been developed by Multimedia Lab (ELIS — Ghent University / iMinds) and MasTIC (Université Libre de Bruxelles).


Yes, named-entity recognition (NER) is fun. 

About a year ago I used a different application to do NER against about 100 digitized files. From my blog posting [0]:

  named-entity extraction – There was a desire to list the
  underlying names, places, and organizations from each text. These
  things can put a text into a context for the reader. Are there a
  lot of Irish names? Is there a preponderance of place names from
  the United States? To accomplish this task and assist in
  answering these sorts of questions, a Perl script was written
  around the Stanford Named Entity Recognizer. [1] This script
  (txt2ner.pl [2]) extracts the entities, looks them up in DBpedia, and
  saves metadata (abstracts, URLs to images, as well as latitudes &
  longitudes) describing the entities to a locally defined XML file
  for later processing. (See an example. [3]) A CGI script (ner.cgi [4])
  was then written to provide a reader-interface to these files.
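
The extract-then-describe workflow in that passage can be sketched roughly as follows. This is only an illustration in Python, not the Perl of txt2ner.pl. It assumes the recognizer's output has already been captured as "word/TAG" tokens (the slash-tag style Stanford NER can emit). The function names and the XML element names are hypothetical and do not reflect the schema of the .ner files. Building a DBpedia resource URI from the entity name is likewise an assumption. No network lookup is made here.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import quote


def parse_slash_tags(tagged):
    """Merge consecutive tokens sharing a non-O tag into (text, tag) entities."""
    entities = []
    words, current = [], None
    for token in tagged.split():
        word, sep, tag = token.rpartition("/")
        if not sep:
            # Token carries no tag at all; treat it as outside any entity.
            word, tag = token, "O"
        if tag == current and tag != "O":
            words.append(word)
            continue
        if words:
            entities.append((" ".join(words), current))
        words, current = ([word], tag) if tag != "O" else ([], None)
    if words:
        entities.append((" ".join(words), current))
    return entities


def dbpedia_uri(name):
    """Guess a DBpedia resource URI for a name (assumption: spaces become underscores)."""
    return "http://dbpedia.org/resource/" + quote(name.replace(" ", "_"))


def entities_to_xml(entities):
    """Serialize entity counts to a small XML document (hypothetical element names)."""
    root = ET.Element("entities")
    for (text, tag), count in Counter(entities).most_common():
        node = ET.SubElement(root, "entity", type=tag, count=str(count))
        ET.SubElement(node, "name").text = text
        ET.SubElement(node, "uri").text = dbpedia_uri(text)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = ("Eric/PERSON Morgan/PERSON works/O at/O Notre/ORGANIZATION "
              "Dame/ORGANIZATION in/O Indiana/LOCATION ./O")
    print(entities_to_xml(parse_slash_tags(sample)))
```

One limitation of this sketch is that two distinct entities of the same type standing side by side are merged into one, because slash tags alone do not mark where one entity ends and the next begins.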

Once I "NER'ed" the files and saved the corresponding linked data, I was able to create a tablet-based interface that lets the reader not only see how the words are used in context, but also read a blurb from Wikipedia and map places via Google Maps. For example, see some extracts from a book called An Adventure with the Apaches [5], though the data is not as clean as I would hope. The whole thing was part of a project we called the Catholic Youth Literature Project. [6]
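
Showing how a word is used in context is essentially a concordance, or keyword-in-context (KWIC) listing. The following is a minimal sketch of that idea in Python. It is not the code behind the CGI script, and the function name and the width parameter are made up for illustration.

```python
import re


def kwic(text, term, width=30):
    """Return one line per match of term, with up to width characters of context on each side."""
    # Collapse runs of whitespace so line breaks in the source do not split the context.
    text = " ".join(text.split())
    lines = []
    for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the matched terms line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


if __name__ == "__main__":
    sample = "The Apaches rode north. Later the apaches returned to camp."
    for line in kwic(sample, "Apaches", width=20):
        print(line)
```

Matching is case-insensitive and purely textual, so feeding it the entity names found by the recognizer would list every passage in which a given person or place is mentioned.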

The ELIS software looks pretty interesting. [7]

Fun with distant reading and NER.


[0] blog posting - http://blogs.nd.edu/emorgan/2012/03/cyl/
[1] Stanford NER - http://nlp.stanford.edu/software/CRF-NER.shtml
[2] txt2ner.pl - http://dh.crc.nd.edu/sandbox/cyl/bin/txt2ner.pl
[3] intermediate XML file - http://dh.crc.nd.edu/sandbox/cyl/corpus/advicetoirishgir00cusa.ner
[4] CGI script - http://dh.crc.nd.edu/sandbox/cyl/bin/ner-cgi.pl
[5] Adventure - http://dh.crc.nd.edu/sandbox/cyl/catalog/details/adventurewithapa00ferriala.html
[6] Catholic Youth Literature - http://dh.crc.nd.edu/sandbox/cyl/catalog/
[7] ELIS - http://freeyourmetadata.org/named-entity-extraction/

--
Eric Lease Morgan
University of Notre Dame

574/631-8604