I've been thinking about how to use named entities as metadata. How might I do so intelligently? 

Given the full text of a document, it is possible to extract named entities from it. Named entities include things such as the names of people, names of organizations, names of places, etc. These things can be quite informative when it comes to describing "aboutness".

Here at Notre Dame we have a collection of digitized Catholic pamphlets, and my named entity extractor (a Python library called spaCy) can loop through each sentence in a pamphlet and output named entities. Each pamphlet can have many named entities, and each entity can be repeated many times. Neither the extractor nor the content is perfect. For example, the extractor might call a particular entity a person when in reality it is an organization or a place. After all, spaCy works off a model created through a machine learning process, and consequently spaCy does not always guess things correctly. On the other hand, the pamphlets have been OCRed, so the extracted entities are sometimes "misspelled".
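In code, the extraction step might look something like the following. This is only a minimal sketch: it assumes spaCy's small English model (en_core_web_sm) has been downloaded, and the pamphlet's file name is hypothetical.

  import spacy

  # Load spaCy's small English model; assumes it has been downloaded
  # with "python -m spacy download en_core_web_sm".
  nlp = spacy.load("en_core_web_sm")

  # The file name is hypothetical; a real script would loop over the collection.
  with open("pamphlet-0001.txt") as handle:
      document = nlp(handle.read())

  # Loop through each sentence and output the entities labeled as persons.
  # Each (file, name) pair becomes a row of data, so names repeat many times.
  for sentence in document.sents:
      for entity in sentence.ents:
          if entity.label_ == "PERSON":
              print("pamphlet-0001.txt", entity.text, sep="\t")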

The pamphlet collection includes close to 1,500 documents totaling 97 MB of plain text. After extracting only the names of persons from the collection, I have close to 0.5 million names (rows of data). Again, each name can be listed more than once for each document.

My question is now two-fold. First, how do I go about normalizing ("cleaning") my names? I could use OpenRefine to normalize things, but unfortunately, OpenRefine does not seem to scale very well when it comes to 0.5 million rows of data. OpenRefine's coolest solution for normalizing is its set of clustering functions, and I believe I can rather easily implement a version of the Levenshtein algorithm (which underlies one of those clustering functions) in any number of computer languages, including Python or SQL. Using Levenshtein distances I can then fix the various misspellings.
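Here is one way that might work in Python. It is only a sketch: the edit-distance threshold of 2 and the sample names are assumptions, and a real pass would read the half-million extracted rows.

  from collections import Counter

  def levenshtein(a, b):
      """Return the minimum number of single-character edits turning a into b."""
      if len(a) < len(b):
          a, b = b, a
      previous = list(range(len(b) + 1))
      for i, char_a in enumerate(a, start=1):
          current = [i]
          for j, char_b in enumerate(b, start=1):
              insertion = current[j - 1] + 1
              deletion = previous[j] + 1
              substitution = previous[j - 1] + (char_a != char_b)
              current.append(min(insertion, deletion, substitution))
          previous = current
      return previous[-1]

  # Sample counts of raw names; real counts would come from the extracted rows.
  counts = Counter({
      "John Henry Newman": 12,
      "John Henry Newmam": 2,
      "Jhn Henry Newman": 1,
      "Thomas Aquinas": 7,
  })

  # Fold each name into the most frequent form within a small edit distance.
  canonical = {}
  for name, _ in counts.most_common():
      for form in set(canonical.values()):
          if levenshtein(name, form) <= 2:  # close enough; treat as a variant
              canonical[name] = form
              break
      else:
          canonical[name] = name  # a new canonical form

  print(canonical)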

Second, assuming my entities have been normalized, which ones do I actually include as metadata? I could simply deduplicate the entities associated with a given file and then add them all. This results in a whole lot of names, and just because a name is mentioned once does not necessarily justify its inclusion as metadata. I could then say, "If a name is mentioned more than once, then it is justified for inclusion", but this policy breaks down if a document is really long; a long document can mention a name a few times and still not be "about" that name.
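To make the naive policy concrete, here is what it looks like in Python; the sample names and the cutoff of one mention are illustrations, not real data.

  from collections import Counter

  # Names extracted from a single (hypothetical) pamphlet, duplicates included.
  names = ["John Henry Newman", "John Henry Newman", "Thomas Aquinas"]

  # Keep a name only if it is mentioned more than once. This is exactly the
  # policy that breaks down on long documents.
  counts = Counter(names)
  metadata = [name for name, count in counts.items() if count > 1]
  print(metadata)  # ['John Henry Newman']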

Instead, I think I need to implement some sort of weighting system. "Given all the entities extracted from a set of documents, only include those names which are (statistically) significant." I suppose I could implement some version of TF/IDF to derive weight and significance. Hmmmm...
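A back-of-the-envelope version of that idea, applied to entity counts instead of words, might look like this. The corpus, the counts, and the cutoff are all assumptions; the point is only to show how term frequency times inverse document frequency would separate the significant names from the incidental ones.

  import math
  from collections import Counter

  # Hypothetical per-document counts of normalized person names.
  corpus = {
      "pamphlet-0001.txt": Counter({"John Henry Newman": 12, "Thomas Aquinas": 1}),
      "pamphlet-0002.txt": Counter({"Thomas Aquinas": 8}),
      "pamphlet-0003.txt": Counter({"John Henry Newman": 1, "Ignatius Loyola": 5}),
  }

  def tfidf(name, counts, corpus):
      """Term frequency in one document times inverse document frequency."""
      tf = counts[name] / sum(counts.values())
      df = sum(1 for doc in corpus.values() if name in doc)
      idf = math.log(len(corpus) / df)
      return tf * idf

  THRESHOLD = 0.1  # an arbitrary cutoff; tuning it is the real work

  # Keep a name as metadata only when its score stands out in the collection.
  for filename, counts in corpus.items():
      significant = [name for name in counts
                     if tfidf(name, counts, corpus) >= THRESHOLD]
      print(filename, significant)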

Have you extracted named entities from full text and then included them in your metadata? If so, then how? What characteristics does an entity have to have to justify its inclusion as metadata? Inquiring minds would like to know.

--
Eric Morgan