LISTSERV 16.5 - CODE4LIB Archives

On Apr 8, 2016, at 5:13 PM, Jenn C wrote:

> I worked on a text mining project last semester where I had a bunch of
> magazines with text that was totally unstructured (from IA). I would have
> really liked to know how to work entity matching into such a project. Are
> there text mining projects out there that demonstrate doing this?

If I understand your question correctly, then the Stanford Named Entity Recognizer (NER) library/application may be one solution. [1]

Given text as input, a named entity recognition library/application returns a list of named entities: the names of people, places, and things. The things can be all sorts of stuff, such as organizations, dates, times, and monetary amounts. Stanford’s NER is really a Java library, but it has a command-line interface. Feed it a text, and you get back an XML stream. The stream contains elements, and each element is expected to be some sort of entity. Be forewarned: for the best performance, it is necessary to “train” the library/application. Frankly, I’ve never done that, and consequently, I guess I’ve never been optimal.* You also might want to read the chapter on extracting information from text in the Python Natural Language Toolkit (NLTK) book. [2] It gives a pretty good overview of the subject.
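To make the "XML stream" idea a bit more concrete, here is a rough Python sketch of what you might do with the tagger's output once you have it. It assumes the output is inline XML, where each entity is wrapped in an upper-case tag such as <PERSON> or <LOCATION>, and that tags are never nested. The sample text and the function names (extract_entities, count_entities) are made up for illustration; they are not part of Stanford NER or NLTK.

```python
import re
from collections import Counter

# Matches <LABEL>text</LABEL> pairs, e.g. <PERSON>Jane Austen</PERSON>.
# Assumption: labels are upper-case and entity tags are never nested.
ENTITY_PATTERN = re.compile(r"<([A-Z_]+)>(.*?)</\1>", re.DOTALL)


def extract_entities(tagged_text):
    """Return (label, entity) tuples in the order they appear."""
    return [
        # Collapse line breaks and runs of spaces inside an entity.
        (label, " ".join(text.split()))
        for label, text in ENTITY_PATTERN.findall(tagged_text)
    ]


def count_entities(tagged_text):
    """Tally how often each (label, entity) pair occurs."""
    return Counter(extract_entities(tagged_text))


if __name__ == "__main__":
    # Hypothetical tagger output, invented for this example.
    sample = (
        "<PERSON>Jane Austen</PERSON> lived in <LOCATION>Bath</LOCATION>. "
        "Later, <PERSON>Jane Austen</PERSON> moved to "
        "<LOCATION>Chawton</LOCATION>."
    )
    for (label, entity), n in count_entities(sample).most_common():
        print(label, entity, n, sep="\t")
```

Run over each magazine issue, a tally like this gives you a crude index of who and what is mentioned where, which is often enough to start matching entities across documents.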

[1] NER - http://nlp.stanford.edu/software/CRF-NER.shtml
[2] NLTK chapter - http://www.nltk.org/book/ch07.html

* Story of my life.

--
Eric Lease Morgan