LISTSERV 16.5 - CODE4LIB Archives

On May 17, 2011, at 11:22 AM, Eric Lease Morgan wrote:

>> What are some of the ways to best insert Linked Data endpoints into an
>> XML file?... Given a name -- say, Plato or Thoreau -- how would one go about
>> identifying good endpoints?
> 
> When and if I do this work, I think I will use DBpedia and their lookup service. [1] Here's how:
> 
>  * do named-entity recognition (NER) against my documents
>  * for each name, place or organization element in the resulting XML
>    o query DBpedia for URIs via their lookup service
>    o add 1 or more of the resulting URIs as attributes
>      of the name, place, or organization element
>  * end for
> 
> Once done I could use the enhanced XML file as the raw source for providing cool (and "kewl") services against the text -- word clouds, definitions, geo-locations, images, abstracts, find similar, purchase, print, do concordance against, etc.


I've made some progress towards enriching my documents with Linked Data endpoints.

Using the Stanford NER, I am able to create a rudimentary XML stream where the names, places, and organizations are marked up. [1] I then modify the XML to include tallies of the entities as well as the most significant links from DBpedia. Finally, I output the resulting XML to STDOUT. This process works for any plain text. See txt2ner.pl. [2] I've created about six .ner files. [3]
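
For readers who want to see the shape of the process, here is a minimal sketch in Python (not the Perl of txt2ner.pl, and not a copy of its logic). It assumes the NER step has already produced XML with name, place, and organization elements; those element names, the "uri" attribute, the "tallies" element, and the stub_lookup function are all hypothetical. The lookup is passed in as a function so a real call to the DBpedia lookup service could be swapped in later.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed element names; the real .ner files may use different markup.
ENTITY_TAGS = ("name", "place", "organization")


def enrich(xml_text, lookup, max_uris=1):
    """Tally entity elements and attach up to max_uris URIs to each one.

    lookup is any callable that takes a label and returns a list of URIs.
    """
    root = ET.fromstring(xml_text)
    tallies = Counter()
    cache = {}  # look each distinct label up only once

    for element in root.iter():
        if element.tag not in ENTITY_TAGS:
            continue
        label = " ".join((element.text or "").split())
        if not label:
            continue
        tallies[(element.tag, label)] += 1
        if label not in cache:
            cache[label] = list(lookup(label))[:max_uris]
        if cache[label]:
            # Hypothetical attribute name; multiple URIs are space-separated.
            element.set("uri", " ".join(cache[label]))

    # Append the tallies, most frequent entity first.
    summary = ET.SubElement(root, "tallies")
    for (tag, label), count in tallies.most_common():
        ET.SubElement(summary, "entity", type=tag, label=label, count=str(count))

    return ET.tostring(root, encoding="unicode")


def stub_lookup(label):
    """Stand-in for a lookup service; knows a single label, for illustration."""
    known = {"Plato": ["http://dbpedia.org/resource/Plato"]}
    return known.get(label, [])


SAMPLE = (
    "<doc><p><name>Plato</name> taught in <place>Athens</place>; "
    "<name>Plato</name> wrote dialogues.</p></doc>"
)

if __name__ == "__main__":
    print(enrich(SAMPLE, stub_lookup))
```

Caching by label keeps the number of lookups down when the same entity appears many times, which matters once the stub is replaced by a network call.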

The idea is then to allow the reader to: 1) read the document, 2) see at a glance what named entities exist in the document, and 3) do things with the named entities. I started writing such an interface for desktop browsers, but the real goal is to create one for tablet devices. [4, 5] I got a bit stymied on both.

In the end I hope to allow the person to select a named entity, automatically retrieve the content of the Linked Data endpoint, and return a palette of choices allowing the reader to see a map, display a picture, get a definition, find related items, purchase the item, print the item, etc. As alluded to previously in this thread, one of the bigger challenges will be disambiguation. I see a crowdsourced solution in my future.
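
The palette idea might be sketched along these lines, again in Python and only as an illustration. It assumes the description of an entity has already been retrieved and parsed into the RDF/JSON shape (subject URI, then predicate URI, then a list of objects with "value" and optionally "lang"); the retrieval itself is left out. The palette and first_value functions, the choice names, and the sample data are hypothetical, and the sample values are made up for the example.

```python
# Predicates commonly found in DBpedia descriptions.
ABSTRACT = "http://dbpedia.org/ontology/abstract"
DEPICTION = "http://xmlns.com/foaf/0.1/depiction"
LAT = "http://www.w3.org/2003/01/geo/wgs84_pos#lat"
LONG = "http://www.w3.org/2003/01/geo/wgs84_pos#long"


def first_value(properties, predicate, lang=None):
    """Return the first object value for predicate, optionally by language."""
    for obj in properties.get(predicate, []):
        if lang is None or obj.get("lang") == lang:
            return obj.get("value")
    return None


def palette(uri, description):
    """Build the choices that can be offered for one entity.

    Only choices backed by data in the description are included.
    """
    properties = description.get(uri, {})
    choices = {}

    abstract = first_value(properties, ABSTRACT, lang="en")
    if abstract:
        choices["definition"] = abstract

    picture = first_value(properties, DEPICTION)
    if picture:
        choices["picture"] = picture

    lat = first_value(properties, LAT)
    lon = first_value(properties, LONG)
    if lat is not None and lon is not None:
        choices["map"] = (float(lat), float(lon))

    return choices


# Illustrative data only; not a real response from any service.
SAMPLE_URI = "http://dbpedia.org/resource/Athens"
SAMPLE_DESCRIPTION = {
    SAMPLE_URI: {
        ABSTRACT: [
            {"type": "literal", "value": "Athen ist eine Stadt.", "lang": "de"},
            {"type": "literal", "value": "Athens is a city.", "lang": "en"},
        ],
        LAT: [{"type": "literal", "value": "37.98"}],
        LONG: [{"type": "literal", "value": "23.72"}],
    }
}

if __name__ == "__main__":
    for choice, value in palette(SAMPLE_URI, SAMPLE_DESCRIPTION).items():
        print(choice, "->", value)
```

An entity with no coordinates simply gets no "map" choice, so the interface only offers what the data can support.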

I want to thank the XML4Lib community for helping me out with some -- of what I thought was -- gnarly XPath syntax. The group was VERY responsive and really accurate. "Thank you!"


[1] NER - http://bit.ly/e0SnA6
[2] txt2ner.pl - http://bit.ly/jQRjRH
[3] .ner files - http://bit.ly/lJ8wKU
[4] desktop interface - http://bit.ly/k4U6SZ
[5] tablet interface - http://bit.ly/kueBm9

-- 
Eric Lease Morgan
University of Notre Dame

Great Books Survey -- http://bit.ly/auPD9Q