LISTSERV 16.5 - CODE4LIB Archives

In addition to the approaches you note, it might be worth investigating
this tool, which came up in a thread on this list just a few days ago:

http://wikipedia-miner.sourceforge.net/


I don't think anybody has done enough with this yet to be sure what will
work best, so you're going to have to experiment and let us know.

VIAF/OCLC services are presumably using some sort of statistical
analysis/text mining approach under the hood. wikipedia-miner uses
such approaches too, but it also gives you the code as open source if
you're curious exactly what it's doing. I suspect statistical approaches
like the ones wikipedia-miner uses are likely to be more productive than
pure "parsing" approaches that consider only one record at a time in
isolation. But writing your own statistical analysis algorithms is
probably more work than you want, especially when wikipedia-miner and/or
VIAF/OCLC services already exist.

If you don't do statistical analysis of the corpus, and do end up
actually trying to search Wikipedia directly, then I suspect dbpedia
is a much more convenient endpoint than trying to screen-scrape
Wikipedia's HTML. That's pretty much what dbpedia is for.
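
As a rough sketch of what the dbpedia route might look like in Python
(untested against the live service, so treat it as a starting point
only): build a SPARQL query from the name and birth year in the
catalogue record, then map any dbpedia resource you get back onto a
Wikipedia URL. The endpoint URL, the choice of rdfs:label and
dbo:birthDate, and the function names build_query, run_query and
wikipedia_url are all my assumptions here, not anything dbpedia
prescribes.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public DBpedia SPARQL endpoint; it may move or be rate-limited.
ENDPOINT = "https://dbpedia.org/sparql"


def build_query(name, birth_year=None, limit=10):
    """Build a SPARQL query for people whose English label equals `name`."""
    # Escape backslashes and double quotes so the name is a safe string literal.
    safe = name.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "PREFIX dbo: <http://dbpedia.org/ontology/>",
        "SELECT DISTINCT ?person ?birth WHERE {",
        f'  ?person rdfs:label "{safe}"@en ;',
        "          a dbo:Person .",
        "  OPTIONAL { ?person dbo:birthDate ?birth }",
    ]
    if birth_year is not None:
        # Keep candidates with no birth date at all, or one starting with the year.
        lines.append(
            f'  FILTER (!BOUND(?birth) || STRSTARTS(STR(?birth), "{int(birth_year)}"))'
        )
    lines.append("}")
    lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


def run_query(query, timeout=5):
    """Send the query to the endpoint and return the matching resource URIs."""
    params = urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(f"{ENDPOINT}?{params}", timeout=timeout) as resp:
        data = json.load(resp)
    return [row["person"]["value"] for row in data["results"]["bindings"]]


def wikipedia_url(resource_uri):
    """Map a dbpedia resource URI to the English Wikipedia page of the same name."""
    prefix = "http://dbpedia.org/resource/"
    if not resource_uri.startswith(prefix):
        raise ValueError(f"not a dbpedia resource URI: {resource_uri}")
    return "https://en.wikipedia.org/wiki/" + resource_uri[len(prefix):]
```

Something like run_query(build_query("George Orwell", birth_year=1903))
would then give you candidate URIs to pass through wikipedia_url. An
exact label match is the crudest possible test, so you'd still need
some fallback for variant spellings of the name.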

But these are all just my guesses, not informed by any work I've done.

Jonathan


On 5/19/2011 5:40 AM, graham wrote:
> I need to be able to take author data from a catalogue record and use it
> to look up the author on Wikipedia on the fly. So I may have birth date
> and possibly year of death in addition to (one spelling of) the name,
> the title of one book the author wrote etc.
>
> I know there are various efforts in progress that will improve the
> current situation, but as things stand at the moment what is the best*
> way to do this?
>
> 1. query wikipedia for as much as possible, parse and select the best
> fitting result
>
> 2. go via dbpedia/freebase and work back from there
>
> 3. use VIAF and/or OCLC services
>
> 4. Other?
>
> (I have no experience of 2-4 yet :-(
>
>
> Thanks
> Graham
> * 'best' being constrained by:
> - need to do this in real-time
> - need to avoid dependence on services which may be taken away
> or charged for
> - being able to justify to librarians as reasonably accurate :-)
>