Hi all:

I agree with Kyle. My research unit works on this as one research focus
(not in production), and good datasets and good solutions are both hard to
find.

The closest datasets I can think of are those of the Web People Search
task (e.g. WePS-3, http://nlp.uned.es/weps/), but these are not limited to
academics who write papers.

Another choice is to use DBLP -- Michael Ley and his team at Universität
Trier have made it a goal of their largely computer science bibliography
to be able to disambiguate namesakes. I know the XML dumps used to include
this data, but the only pointer I can find right now is
http://www.informatik.uni-trier.de/~ley/db/about/mebi.html

Hope that helps!

Cheers,

Min

--
Min-Yen KAN (Dr) :: Associate Professor ::
National University of Singapore :: NUS School of Computing, AS6 05-12,
13 Computing Drive Singapore 117417 :: 65-6516 1885(DID) ::
65-6779 4580 (Fax) :: www.comp.nus.edu.sg/~kanmy (W)

On Wed, Jul 10, 2013 at 12:22 AM, Kyle Banerjee wrote:

> Author disambiguation is a tough one -- I don't think you'll find any
> unique identifier and ORCID is not a viable method at this time. Email is
> not a good identifier because authors change affiliations, are sometimes
> known by more than one email at a single institution, and because this info
> is not always available depending on where you get your data from.
>
> Using some combination of names, email, affiliation, coauthors, journal,
> topic, etc., you can probably improve accuracy, but it's going to be messy.
>
> What is the use case you are trying to address? For example, the best
> method may be very different if you're trying to disambiguate authors from
> a single institution than if you're trying to solve a generic problem over
> a huge corpus of data.
>
> ISSN is also not super clean as a unique identifier even if it is very
> useful -- single titles can have multiple ISSNs for different versions or
> title changes that might not be perceived as different by people.
>
> kyle
>
>
> On Tue, Jul 9, 2013 at 8:32 AM, Paul Albert wrote:
>
>> I am exploring methods for author disambiguation, and I would like to have
>> access to one or more well-disambiguated data sets containing:
>> – a unique author identifier (email address, institutional identifier)
>> – a unique article identifier (PMID, DOI, etc.)
>> – a unique journal identifier (ISSN)
>>
>> Definition for "well-disambiguated" – for a given set of authors, you know
>> the identity of their journal articles to a precision and recall of greater
>> than 90-95%.
>>
>> Any ideas?
>>
>> thanks,
>> Paul
>>
>>
>> Paul Albert
>> Project Manager, VIVO
>> Weill Cornell Medical Library
>> 646.962.2551
>>