LISTSERV 16.5 - CODE4LIB Archives

Thanks for all the ideas!

I'm exploring a few of them this morning.

Patrick Galligan
Rockefeller Archive Center
Assistant Digital Archivist
914-366-6386

-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Ted Lawless
Sent: Monday, September 29, 2014 9:43 AM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Reconciling corporate names?

OCLC's FAST contains corporate names derived from LCSH.

http://www.oclc.org/research/activities/fast.html

I wrote a simple proxy to the FAST API that can be used as a reconciliation endpoint in OpenRefine.

https://github.com/lawlesst/fast-reconcile

Ted

On Mon, Sep 29, 2014 at 9:37 AM, Simon Brown <[log in to unmask]> wrote:
> You could always web scrape, or download and then search the LCNAF 
> with some script that looks like:
>
> # Build the query for web scraping. sep = "" keeps paste() from
> # inserting spaces into the URL; reserved = TRUE also encodes "&".
> query = paste("http://id.loc.gov/search/?q=",
>               URLencode("corporate name here", reserved = TRUE),
>               "&q=cs%3Ahttp%3A%2F%2Fid.loc.gov%2Fauthorities%2Fnames",
>               sep = "")
>
> # Make the call
> result = readLines(query)
>
> # Find the lines containing "Corporate Name"
> lines = grep("Corporate Name", result)
>
> # Alternatively, use approximate string matching on the downloaded
> # LCNAF data
> matches <- agrep("corporate name here", LCNAF_data_here)
>
> # Parse for whatever info you want
> ...
>
> My native programming language is R so I hope the functions like 
> paste, readLines, grep, and URLencode are generic enough for other 
> languages to have some kind of similar thing.  This can just be 
> wrapped up into a for loop:
> for(i in 1:40000){...}
>
> Web scraping the results of one name at a time would be SLOW, and
> obviously using an API is the way to go, but it didn't look like the
> OCLC LCNAF API handled corporate names.  However, it sounds like
> someone in the previous message found a workaround.  Best of luck! -Simon
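
A minimal sketch of the offline variant Simon mentions, running approximate matching over a whole list of names. It assumes the LCNAF labels have already been downloaded and loaded into a character vector. `lcnaf_labels`, `local_names`, and `reconcile_name` are hypothetical names made up for illustration, and the sample labels are placeholders rather than real extract data.

```r
# Hypothetical stand-in for labels extracted from a downloaded LCNAF file.
lcnaf_labels <- c("Rockefeller Foundation",
                  "Ford Foundation",
                  "Carnegie Corporation of New York")

# Hypothetical local names to check against the labels.
local_names <- c("Rockefeller Foundation",
                 "Ford Foundaton",        # one-letter typo
                 "Acme Widget Company")

# For one name, report whether it matches a label exactly and list any
# approximate candidates found by agrep().
reconcile_name <- function(name, labels, max_distance = 0.1) {
  exact <- name %in% labels
  # value = TRUE returns the matching labels instead of their indices.
  candidates <- agrep(name, labels, max.distance = max_distance,
                      ignore.case = TRUE, value = TRUE, fixed = TRUE)
  list(name = name, authorized = exact, candidates = candidates)
}

# Apply the check to every name (lapply stands in for the for loop).
results <- lapply(local_names, reconcile_name, labels = lcnaf_labels)

for (r in results) {
  cat(r$name, "| authorized:", r$authorized,
      "| candidates:", paste(r$candidates, collapse = "; "), "\n")
}
```

Because agrep() looks for approximate matches anywhere within each label, a short name can also match inside a longer heading, so the candidates would probably still need a manual review.
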
>
> On Mon, Sep 29, 2014 at 8:45 AM, Matt Carruthers <[log in to unmask]> wrote:
>
>> Hi Patrick,
>>
>> Over the last few weeks I've been doing something very similar.  I 
>> was able to figure out a process that works using OpenRefine.  It 
>> works by searching the VIAF API first, limiting results to anything 
>> that is a corporate name and has an LC source authority.  OpenRefine 
>> then extracts the LCCN and puts that through the LCNAF API that OCLC 
>> has to get the name.  I had to use VIAF for the initial name search 
>> because for some reason the LCNAF API doesn't really handle corporate 
>> names as search terms very well, but works with the LCCN just fine 
>> (there is the possibility that I'm just doing something wrong, and if 
>> that's the case, anyone on the list can feel free to correct me).  In 
>> the end, you get the LC name authority that corresponds to your 
>> search term and a link to the authority on the LC Authorities website.
>>
>> Anyway, the process is fairly simple to run (just prepare an Excel 
>> spreadsheet and paste JSON commands into OpenRefine).  The only 
>> reservation is that I don't think it will run all 40,000 of your 
>> names at once.  I've been using it to run 300-400 names at a time.  
>> That said, I'd be happy to share what I did with you if you'd like to 
>> try it out.  I have some instructions written up in a Word doc, and 
>> the JSON script is in a text file, so just email me off list and I can send them to you.
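
The batching Matt describes could be scripted rather than done by hand. A rough sketch in R (the language used elsewhere in this thread), assuming the names sit in a character vector and that each batch is written to its own CSV file for import. `all_names`, `batch_size`, and the output file names are hypothetical, and the generated names are placeholders for the real list.

```r
# Hypothetical input: 40,000 placeholder names standing in for the real list.
all_names <- sprintf("Corporate name %05d", 1:40000)

batch_size <- 400  # upper end of the 300-400 range mentioned above

# Give each name a batch number, then split into a list of character vectors.
batch_id <- ceiling(seq_along(all_names) / batch_size)
batches <- split(all_names, batch_id)

cat("Number of batches:", length(batches), "\n")

# Write each batch to its own CSV file in a temporary directory.
out_dir <- file.path(tempdir(), "name_batches")
dir.create(out_dir, showWarnings = FALSE)
for (i in seq_along(batches)) {
  out_file <- file.path(out_dir, sprintf("batch_%03d.csv", i))
  write.csv(data.frame(name = batches[[i]]), out_file, row.names = FALSE)
}
```

Keeping the batch number in the file name makes it easier to see which batches have already been run if a session is interrupted.
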
>>
>> Matt
>>
>> Matt Carruthers
>> Metadata Projects Librarian
>> University of Michigan
>> 734-615-5047
>> [log in to unmask]
>>
>> On Fri, Sep 26, 2014 at 7:03 PM, Karen Hanson <[log in to unmask]> wrote:
>>
>> > I found the WorldCat Identities API useful for an institution name
>> > disambiguation project that I worked on a few years ago, though my
>> > goal wasn't to confirm whether names mapped to LCNAF.  The API
>> > response includes an LCCN, and you can set it to fuzzy or exact
>> > matching, but you would need to write a script to pass each term in
>> > and process the results:
>> >
>> > http://oclc.org/developer/develop/web-services/worldcat-identities.en.html
>> >
>> > I also can't speak to whether all LC Name Authorities are 
>> > represented, so there may be a chance of some false negatives.
>> >
>> > OCLC has another API, but not sure if it covers corporate names:
>> > https://platform.worldcat.org/api-explorer/LCNAF
>> >
>> > I suspect there are others on the list that know more about the 
>> > inner workings of these APIs if this might be an option for you... 
>> > :)
>> >
>> > Karen
>> >
>> > -----Original Message-----
>> > From: Code for Libraries [mailto:[log in to unmask]] On 
>> > Behalf Of Ethan Gruber
>> > Sent: Friday, September 26, 2014 3:54 PM
>> > To: [log in to unmask]
>> > Subject: Re: [CODE4LIB] Reconciling corporate names?
>> >
>> > I would check with the developers of SNAC
>> > (http://socialarchive.iath.virginia.edu/), as they've spent a lot of 
>> > time developing named entity recognition scripts for personal and 
>> > corporate names. They might have something you can reuse.
>> >
>> > Ethan
>> >
>> > On Fri, Sep 26, 2014 at 3:47 PM, Galligan, Patrick <[log in to unmask]> wrote:
>> >
>> > > I'm looking to reconcile about 40,000 corporate names against 
>> > > LCNAF to see whether they are authorized strings or not, but I'm 
>> > > drawing a blank about how to get it done.
>> > >
>> > > I've used http://freeyourmetadata.org/ for reconciling subject 
>> > > headings before, but I can't get it to work for LCNAF. Has anyone 
>> > > had any experience in a project like this? I'd love to hear some 
>> > > ideas for automatically dealing with a large data set like this,
>> > > which we did not create and whose naming conventions we do not know.
>> > >
>> > > Thanks!
>> > >
>> > > -Patrick Galligan
>> > >
>> >
>>
>
>
>
> --
> Simon Brown
> [log in to unmask]
> simoncharlesbrown (Skype)
> 831.440.7466 (Phone)
>
> *Following our will and wind we may just go where no one's been -- 
> MJK*