I agree with Kyle. Author disambiguation is one focus of my research
unit (research, not production), and it is tough to find good datasets
and good solutions. The closest "dataset" I can think of is the data from
the Web People Search Task (e.g. WePS-3 http://nlp.uned.es/weps/), but it
is not limited to academics who write papers.
Another option is DBLP -- Michael Ley and his team at Universität Trier
have made it a goal of their (largely computer science) bibliography to
disambiguate namesakes. I know DBLP used to include this data in its XML
dumps, but I can't seem to find a
pointer to it right now aside from
Hope that helps!
Min-Yen KAN (Dr) :: Associate Professor :: National University of
Singapore :: NUS School of Computing, AS6 05-12, 13 Computing Drive
Singapore 117417 :: 65-6516 1885(DID) :: 65-6779 4580 (Fax) ::
www.comp.nus.edu.sg/~kanmy (W)
On Wed, Jul 10, 2013 at 12:22 AM, Kyle Banerjee wrote:
> Author disambiguation is a tough one -- I don't think you'll find any
> unique identifier and ORCID is not a viable method at this time. Email is
> not a good identifier because authors change affiliations, are sometimes
> known by more than one email at a single institution, and because this info
> is not always available depending on where you get your data from.
> Using some combination of names, email, affiliation, coauthors, journal,
> topic, etc, you can probably improve accuracy, but it's going to be messy.
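
To make the "combination of names, email, affiliation, coauthors, journal"
idea concrete, here is a minimal Python sketch of a pairwise score between
two author mentions. Everything in it is an assumption for illustration:
the AuthorRecord fields, the "Surname, Given" name format, and the weights
are hypothetical and would need tuning against labeled data. Missing
evidence (for example, no email) simply contributes nothing to the score.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorRecord:
    """One author mention on one paper (field names are illustrative)."""
    name: str                      # assumed format: "Surname, Given"
    email: str = ""
    affiliation: str = ""
    coauthors: set = field(default_factory=set)
    journal: str = ""              # e.g. an ISSN


def names_compatible(a: str, b: str) -> bool:
    """Same surname and non-conflicting first initial; a coarse filter."""
    def parts(name):
        surname, _, given = name.partition(",")
        return surname.strip().lower(), given.strip().lower()[:1]
    (s1, i1), (s2, i2) = parts(a), parts(b)
    return s1 == s2 and (not i1 or not i2 or i1 == i2)


def jaccard(x: set, y: set) -> float:
    """Overlap of two sets; 0.0 when both are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


# Hypothetical weights; in practice they would be tuned on labeled data.
WEIGHTS = {"email": 0.4, "affiliation": 0.2, "coauthors": 0.3, "journal": 0.1}


def same_author_score(a: AuthorRecord, b: AuthorRecord) -> float:
    """Score in [0, 1]; 0.0 if the names cannot belong to one person."""
    if not names_compatible(a.name, b.name):
        return 0.0
    score = 0.0
    if a.email and a.email.lower() == b.email.lower():
        score += WEIGHTS["email"]
    if a.affiliation and a.affiliation.lower() == b.affiliation.lower():
        score += WEIGHTS["affiliation"]
    score += WEIGHTS["coauthors"] * jaccard(a.coauthors, b.coauthors)
    if a.journal and a.journal == b.journal:
        score += WEIGHTS["journal"]
    return score
```

Pairs scoring above some threshold could then be merged into one author
cluster. The right threshold, like the weights, depends on the use case.
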
> What is the use case you are trying to address? For example, the best
> method may be very different if you're trying to disambiguate authors from
> a single institution than if you're trying to solve a generic problem over
> a huge corpus of data.
> ISSN is also not super clean as a unique identifier even if it is very
> useful -- single titles can have multiple ISSNs for different versions or
> title changes that might not be perceived as different by people.
> On Tue, Jul 9, 2013 at 8:32 AM, Paul Albert wrote:
>> I am exploring methods for author disambiguation, and I would like to have
>> access to one or more well-disambiguated data sets containing:
>> – a unique author identifier (email address, institutional identifier)
>> – a unique article identifier (PMID, DOI, etc.)
>> – a unique journal identifier (ISSN)
>> Definition for "well-disambiguated" – for a given set of authors, you know
>> the identity of their journal articles to a precision and recall of greater
>> than 90-95%.
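
One way to read the "well-disambiguated" criterion is per author: compare
the set of articles a data set attributes to an author with the set that
author actually wrote. The Python sketch below assumes that reading; the
function name and the article IDs are hypothetical.

```python
def precision_recall(predicted: set, actual: set) -> tuple:
    """Per-author precision and recall over article identifiers.

    predicted: article IDs the data set attributes to the author
    actual:    article IDs the author really wrote
    """
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


# Made-up identifiers, for illustration only.
predicted = {"pmid:1", "pmid:2", "pmid:3", "pmid:4"}
actual = {"pmid:1", "pmid:2", "pmid:3", "pmid:5"}
p, r = precision_recall(predicted, actual)
print(p, r)
```

Here three of the four attributed articles are correct and three of the
four true articles are found, so both values are 0.75, which would fall
short of the 90-95% bar.
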
>> Any ideas?
>> Paul Albert
>> Project Manager, VIVO
>> Weill Cornell Medical Library