In addition to the approaches you note, it might be worth investigating this tool, which came up in a thread on this list just a few days ago:

http://wikipedia-miner.sourceforge.net/

I don't think anybody's done enough with this yet to be sure what will work best; you're going to have to experiment and let us know.

VIAF/OCLC services are presumably using some sort of statistical analysis/text mining approaches under the hood; wikipedia-miner uses such approaches too, but gives you the code as open source if you're curious exactly what it's doing.

I suspect statistical approaches like the ones wikipedia-miner uses are likely to be more productive than pure "parsing" approaches that consider only one record at a time in isolation. But writing your own statistical analysis algorithms is probably more work than you want, especially when wikipedia-miner and/or VIAF/OCLC services already exist.

If you don't do statistical analysis of the corpus, and do end up actually trying to search Wikipedia directly -- then I suspect DBpedia is a much more convenient endpoint than trying to screen-scrape Wikipedia's HTML. That's pretty much what DBpedia is for.

But these are all just my guesses, not informed by any work I've done.

Jonathan

On 5/19/2011 5:40 AM, graham wrote:
> I need to be able to take author data from a catalogue record and use it
> to look up the author on Wikipedia on the fly. So I may have birth date
> and possibly year of death in addition to (one spelling of) the name,
> the title of one book the author wrote etc.
>
> I know there are various efforts in progress that will improve the
> current situation, but as things stand at the moment what is the best*
> way to do this?
>
> 1. query wikipedia for as much as possible, parse and select the best
> fitting result
>
> 2. go via dbpedia/freebase and work back from there
>
> 3. use VIAF and/or OCLC services
>
> 4. Other?
>
> (I have no experience of 2-4 yet :-(
>
>
> Thanks
> Graham
> * 'best' being constrained by:
> - need to do this in real-time
> - need to avoid dependence on services which may be taken away
> or charged for
> - being able to justify to librarians as reasonably accurate :-)
>
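
P.S. To make the dbpedia suggestion a bit more concrete, here is a minimal, untested-against-the-live-service sketch of how a name plus birth year from a catalogue record might be turned into a SPARQL query. Everything specific in it is my assumption rather than something I've verified for this use case: the endpoint URL, the `format` request parameter, the choice of `dbo:Person`, `rdfs:label`, `dbo:birthDate` and `foaf:isPrimaryTopicOf`, and the helper names `build_author_query` and `build_request_url`, which are hypothetical. It only builds the query and the request URL; it does not fetch anything.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumption: the public DBpedia SPARQL endpoint; check that it is still current.
DBPEDIA_SPARQL = "https://dbpedia.org/sparql"


def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for use in a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_author_query(name: str, birth_year: Optional[int] = None,
                       limit: int = 5) -> str:
    """Build a query for people whose English label exactly equals `name`.

    Exact label matching is a simplification: a record holding only one
    spelling of the name will miss authors labelled differently.
    """
    lines = [
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "PREFIX dbo: <http://dbpedia.org/ontology/>",
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>",
        "SELECT DISTINCT ?person ?page ?birth WHERE {",
        "  ?person a dbo:Person ;",
        f'          rdfs:label "{escape_literal(name)}"@en ;',
        "          foaf:isPrimaryTopicOf ?page .",
        "  OPTIONAL { ?person dbo:birthDate ?birth }",
    ]
    if birth_year is not None:
        # Keep only candidates whose birth date string starts with the year.
        # Candidates with no recorded birth date are dropped by this filter.
        lines.append(f'  FILTER(STRSTARTS(STR(?birth), "{int(birth_year)}"))')
    lines.append("}")
    lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


def build_request_url(query: str) -> str:
    """Return a GET URL for the query, asking for JSON results (assumed parameter)."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{DBPEDIA_SPARQL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_author_query("Iain Banks", birth_year=1954))
```

Each `?page` binding returned should be the Wikipedia article URL for a candidate, so picking the "best fitting result" still falls to you; the birth-year filter just narrows the candidates before that step.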