Thanks for all the ideas!
I'm exploring a few of them this morning.
Patrick Galligan
Rockefeller Archive Center
Assistant Digital Archivist
914-366-6386
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Ted Lawless
Sent: Monday, September 29, 2014 9:43 AM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Reconciling corporate names?
OCLC's FAST contains corporate names derived from LCSH.
http://www.oclc.org/research/activities/fast.html
I wrote a simple proxy to the FAST API that can be used as a reconciliation endpoint in OpenRefine.
https://github.com/lawlesst/fast-reconcile
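If you'd rather script against a reconciliation endpoint than click through OpenRefine, the request is just a JSON object of keyed queries. A minimal Python sketch, assuming the service follows the usual OpenRefine reconciliation convention (a `queries` form parameter). The helper name and the endpoint URL below are hypothetical, so check the repository's README for the real path:

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint: replace with wherever your reconciliation service runs.
ENDPOINT = "http://localhost:5000/reconcile"


def build_reconcile_body(names, limit=3):
    """Return a form-encoded body for a batch of reconciliation queries.

    Assumes the OpenRefine reconciliation convention: a "queries" parameter
    holding a JSON object that maps arbitrary keys (q0, q1, ...) to queries.
    """
    queries = {
        f"q{i}": {"query": name, "limit": limit}
        for i, name in enumerate(names)
    }
    return urlencode({"queries": json.dumps(queries)})


if __name__ == "__main__":
    body = build_reconcile_body(["Rockefeller Foundation", "Ford Foundation"])
    print(body)  # POST this to ENDPOINT as application/x-www-form-urlencoded
```

Sending several names per request keeps the number of round trips down, which matters at 40,000 names.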
Ted
On Mon, Sep 29, 2014 at 9:37 AM, Simon Brown <[log in to unmask]> wrote:
> You could always web scrape, or download and then search the LCNAF
> with some script that looks like:
>
> #Build query for webscraping
> query = paste("http://id.loc.gov/search/?q=",
>   URLencode("corporate name here", reserved = TRUE),
>   "&q=cs%3Ahttp%3A%2F%2Fid.loc.gov%2Fauthorities%2Fnames", sep = "")
>
> #Make the call
> result = readLines(query)
>
> #Find the lines containing "Corporate Name"
> lines = grep("Corporate Name", result)
>
> #Alternatively use approximate string matching on the downloaded LCNAF data
> query <- agrep("corporate name here", LCNAF_data_here)
>
> #Parse for whatever info you want
> ...
>
> My native programming language is R so I hope the functions like
> paste, readLines, grep, and URLencode are generic enough for other
> languages to have some kind of similar thing. This can just be
> wrapped up into a for loop:
> for(i in 1:40000){...}
>
> Web scraping the results one name at a time would be SLOW, and
> obviously using an API is the way to go, but it didn't look like the
> OCLC LCNAF API handled corporate names. However, it sounds like
> someone in the previous message found a workaround. Best of luck! -Simon
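For anyone not working in R, the same steps in Python look roughly like this. It's only a sketch: the search URL and the `cs:` filter are copied from the R snippet above, `difflib` stands in for `agrep` (it uses a similarity ratio, not the same algorithm), and the function names and the tiny `lcnaf_names` list are made up for illustration.

```python
import difflib
from urllib.parse import quote

# Search URL and scheme filter taken from the R snippet above.
SEARCH_URL = "http://id.loc.gov/search/?q="
NAMES_FILTER = "&q=cs%3Ahttp%3A%2F%2Fid.loc.gov%2Fauthorities%2Fnames"


def build_query(name):
    """Build the id.loc.gov search URL for one name (hypothetical helper)."""
    # safe="" percent-encodes reserved characters such as "&" and "/" too.
    return SEARCH_URL + quote(name, safe="") + NAMES_FILTER


def fuzzy_match(name, authorized_names, cutoff=0.8):
    """Approximate matching against a locally downloaded list of names."""
    return difflib.get_close_matches(name, authorized_names, n=3, cutoff=cutoff)


if __name__ == "__main__":
    # Placeholder data; in practice this would be loaded from an LCNAF download.
    lcnaf_names = ["Rockefeller Foundation", "Ford Foundation"]
    for name in ["Rockefeller Foundation", "Ford Fundation"]:
        print(build_query(name))
        print(fuzzy_match(name, lcnaf_names))
```

Matching against a local download avoids one web request per name, which is where the scraping approach gets slow.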
>
> On Mon, Sep 29, 2014 at 8:45 AM, Matt Carruthers <[log in to unmask]> wrote:
>
>> Hi Patrick,
>>
>> Over the last few weeks I've been doing something very similar. I
>> was able to figure out a process that works using OpenRefine. It
>> works by searching the VIAF API first, limiting results to anything
>> that is a corporate name and has an LC source authority. OpenRefine
>> then extracts the LCCN and puts that through the LCNAF API that OCLC
>> has to get the name. I had to use VIAF for the initial name search
>> because for some reason the LCNAF API doesn't really handle corporate
>> names as search terms very well, but works with the LCCN just fine
>> (there is the possibility that I'm just doing something wrong, and if
>> that's the case, anyone on the list can feel free to correct me). In
>> the end, you get the LC name authority that corresponds to your
>> search term and a link to the authority on the LC Authorities website.
>>
>> Anyway, the process is fairly simple to run (just prepare an Excel
>> spreadsheet and paste JSON commands into OpenRefine). The only
>> reservation is that I don't think it will run all 40,000 of your
>> names at once. I've been using it to run 300-400 names at a time.
>> That said, I'd be happy to share what I did with you if you'd like to
>> try it out. I have some instructions written up in a Word doc, and
>> the JSON script is in a text file, so just email me off list and I can send them to you.
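Splitting the 40,000 names into runs of a few hundred is easy to script if you'd rather not cut up the spreadsheet by hand. A small Python sketch; the batch size of 400 just follows the 300-400 figure above, and the function name is made up:

```python
def batches(names, size=400):
    """Yield consecutive slices of `names`, each at most `size` items long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(names), size):
        yield names[start:start + size]


if __name__ == "__main__":
    # Stand-in for the real list of corporate names.
    names = [f"Corporate name {i}" for i in range(40000)]
    chunks = list(batches(names))
    print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

Each chunk can then be written out to its own spreadsheet or fed to whatever does the lookups.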
>>
>> Matt
>>
>> Matt Carruthers
>> Metadata Projects Librarian
>> University of Michigan
>> 734-615-5047
>> [log in to unmask]
>>
>> On Fri, Sep 26, 2014 at 7:03 PM, Karen Hanson <[log in to unmask]> wrote:
>>
>> > I found the WorldCat Identities API useful for an institution name
>> > disambiguation project that I worked on a few years ago, though my
>> > goal wasn't to confirm whether names mapped to LCNAF. The API
>> > response includes an LCCN, and you can set it to fuzzy or exact
>> > matching, but you would need to write a script to pass each term
>> > in and process the results:
>> >
>> >
>> > http://oclc.org/developer/develop/web-services/worldcat-identities.en.html
>> >
>> > I also can't speak to whether all LC Name Authorities are
>> > represented, so there may be a chance of some false negatives.
>> >
>> > OCLC has another API, but not sure if it covers corporate names:
>> > https://platform.worldcat.org/api-explorer/LCNAF
>> >
>> > I suspect there are others on the list that know more about the
>> > inner workings of these APIs if this might be an option for you...
>> > :)
>> >
>> > Karen
>> >
>> > -----Original Message-----
>> > From: Code for Libraries [mailto:[log in to unmask]] On
>> > Behalf Of Ethan Gruber
>> > Sent: Friday, September 26, 2014 3:54 PM
>> > To: [log in to unmask]
>> > Subject: Re: [CODE4LIB] Reconciling corporate names?
>> >
>> > I would check with the developers of SNAC (
>> > http://socialarchive.iath.virginia.edu/), as they've spent a lot of
>> > time developing named entity recognition scripts for personal and
>> > corporate names. They might have something you can reuse.
>> >
>> > Ethan
>> >
>> > On Fri, Sep 26, 2014 at 3:47 PM, Galligan, Patrick <[log in to unmask]> wrote:
>> >
>> > > I'm looking to reconcile about 40,000 corporate names against
>> > > LCNAF to see whether they are authorized strings or not, but I'm
>> > > drawing a blank about how to get it done.
>> > >
>> > > I've used http://freeyourmetadata.org/ for reconciling subject
>> > > headings before, but I can't get it to work for LCNAF. Has anyone
>> > > had any experience in a project like this? I'd love to hear some
>> > > ideas for automatically dealing with a large data set like this,
>> > > one we did not create and whose naming practices we do not know.
>> > >
>> > > Thanks!
>> > >
>> > > -Patrick Galligan
>> > >
>> >
>>
>
>
>
> --
> Simon Brown
> [log in to unmask]
> simoncharlesbrown (Skype)
> 831.440.7466 (Phone)
>
> *Following our will and wind we may just go where no one's been --
> MJK*