In addition to the approaches you note, it might be worth investigating
this tool that came up in a thread just a few days ago on this list:
I don't think anybody has done enough with this yet to be sure what will
work best. You're going to have to experiment and let us know.
VIAF/OCLC services are presumably using some sort of statistical
analysis/text mining approach under the hood. wikipedia-miner uses such
approaches too, but also gives you the code as open source if you're
curious exactly what it's doing. I suspect statistical approaches like
those wikipedia-miner uses are likely to be more productive than pure
"parsing" approaches that consider only one record at a time in
isolation. But writing your own statistical analysis algorithms is
probably more work than you want, especially when wikipedia-miner and/or
VIAF/OCLC services already exist.
If you don't do statistical analysis of the corpus, and do end up
actually trying to search Wikipedia directly -- then I suspect dbpedia
is a much more convenient endpoint than trying to screen-scrape
Wikipedia's HTML. That's pretty much what dbpedia is for.
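
To make the dbpedia route a bit more concrete, here is a minimal sketch
of building a SPARQL lookup from a name and a birth year. Treat it as a
guess, not something run against the live service: the endpoint URL
(https://dbpedia.org/sparql), the "format" request parameter, the choice
of dbo:Person, rdfs:label, dbo:birthDate and foaf:isPrimaryTopicOf, and
the helper names build_author_query and build_request_url are all my
assumptions. It also assumes the name is already in the form Wikipedia
uses ("Jane Austen", not "Austen, Jane").

```python
from urllib.parse import urlencode

# Assumed public DBpedia SPARQL endpoint; it may move or change.
DBPEDIA_SPARQL = "https://dbpedia.org/sparql"


def build_author_query(name, birth_year=None, lang="en", limit=5):
    """Build a SPARQL query for people whose label exactly matches `name`.

    Hypothetical helper: the property choices are assumptions about how
    DBpedia models people, not a confirmed recipe.
    """
    # Escape backslashes and double quotes so the name is a valid literal.
    safe_name = name.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "PREFIX dbo: <http://dbpedia.org/ontology/>",
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>",
        "SELECT DISTINCT ?person ?page ?birth WHERE {",
        "  ?person a dbo:Person ;",
        f'          rdfs:label "{safe_name}"@{lang} .',
        "  OPTIONAL { ?person foaf:isPrimaryTopicOf ?page }",
        "  OPTIONAL { ?person dbo:birthDate ?birth }",
    ]
    if birth_year is not None:
        # Keep only candidates whose birth date starts with the given year.
        year = f"{int(birth_year):04d}-"
        lines.append(
            f'  FILTER (BOUND(?birth) && STRSTARTS(STR(?birth), "{year}"))'
        )
    lines.append("}")
    lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


def build_request_url(query):
    """Return a GET URL asking the endpoint for JSON results (assumed param)."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return DBPEDIA_SPARQL + "?" + urlencode(params)


if __name__ == "__main__":
    query = build_author_query("Jane Austen", birth_year=1775)
    print(query)
    print(build_request_url(query))
```

The URL could then be fetched with urllib.request.urlopen and the JSON
bindings compared against whatever else the catalogue record has (death
year, a title). An exact label match will miss variant spellings, so
some fuzzier matching step would probably still be needed on top.
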
But these are all just my guesses, not informed by any work I've done.
On 5/19/2011 5:40 AM, graham wrote:
> I need to be able to take author data from a catalogue record and use it
> to look up the author on Wikipedia on the fly. So I may have birth date
> and possibly year of death in addition to (one spelling of) the name,
> the title of one book the author wrote etc.
> I know there are various efforts in progress that will improve the
> current situation, but as things stand at the moment what is the best*
> way to do this?
> 1. query wikipedia for as much as possible, parse and select the best
> fitting result
> 2. go via dbpedia/freebase and work back from there
> 3. use VIAF and/or OCLC services
> 4. Other?
> (I have no experience of 2-4 yet :-(
> * 'best' being constrained by:
> - need to do this in real-time
> - need to avoid dependence on services which may be taken away
> or charged for
> - being able to justify to librarians as reasonably accurate :-)