Thanks for all the ideas! I'm exploring a few of them this morning.

Patrick Galligan
Rockefeller Archive Center
Assistant Digital Archivist
914-366-6386

-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Ted Lawless
Sent: Monday, September 29, 2014 9:43 AM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Reconciling corporate names?

OCLC's FAST contains corporate names derived from LCSH.
http://www.oclc.org/research/activities/fast.html

I wrote a simple proxy to the FAST API that can be used as a reconciliation endpoint in OpenRefine.
https://github.com/lawlesst/fast-reconcile

Ted

On Mon, Sep 29, 2014 at 9:37 AM, Simon Brown <[log in to unmask]> wrote:

> You could always web scrape, or download and then search the LCNAF
> with some script that looks like:
>
> #Build query for webscraping (sep = "" keeps paste from adding spaces to the URL)
> query = paste("http://id.loc.gov/search/?q=", URLencode("corporate name here"),
>               "&q=cs%3Ahttp%3A%2F%2Fid.loc.gov%2Fauthorities%2Fnames", sep = "")
>
> #Make the call
> result = readLines(query)
>
> #Find the lines containing "Corporate Name"
> lines = grep("Corporate Name", result)
>
> #Alternatively use approximate string matching on the downloaded LCNAF data
> query <- agrep("corporate name here", LCNAF_data_here)
>
> #Parse for whatever info you want
> ...
>
> My native programming language is R so I hope the functions like
> paste, readLines, grep, and URLencode are generic enough for other
> languages to have some kind of similar thing. This can just be
> wrapped up into a for loop:
> for(i in 1:40000){...}
>
> Web scraping the results of one name at a time would be SLOW and
> obviously using an API is the way to go but it didn't look like the
> OCLC LCNAF API handled Corporate Name. However, it sounds like in the
> previous message someone found a workaround. Best of luck!
> -Simon
>
> On Mon, Sep 29, 2014 at 8:45 AM, Matt Carruthers <[log in to unmask]> wrote:
>
>> Hi Patrick,
>>
>> Over the last few weeks I've been doing something very similar. I
>> was able to figure out a process that works using OpenRefine. It
>> works by searching the VIAF API first, limiting results to anything
>> that is a corporate name and has an LC source authority. OpenRefine
>> then extracts the LCCN and puts that through the LCNAF API that OCLC
>> has to get the name. I had to use VIAF for the initial name search
>> because for some reason the LCNAF API doesn't really handle corporate
>> names as search terms very well, but works with the LCCN just fine
>> (there is the possibility that I'm just doing something wrong, and if
>> that's the case, anyone on the list can feel free to correct me). In
>> the end, you get the LC name authority that corresponds to your
>> search term and a link to the authority on the LC Authorities website.
>>
>> Anyway, the process is fairly simple to run (just prepare an Excel
>> spreadsheet and paste JSON commands into OpenRefine). The only
>> reservation is that I don't think it will run all 40,000 of your
>> names at once. I've been using it to run 300-400 names at a time.
>> That said, I'd be happy to share what I did with you if you'd like to
>> try it out. I have some instructions written up in a Word doc, and
>> the JSON script is in a text file, so just email me off list and I
>> can send them to you.
>>
>> Matt
>>
>> Matt Carruthers
>> Metadata Projects Librarian
>> University of Michigan
>> 734-615-5047
>> [log in to unmask]
>>
>> On Fri, Sep 26, 2014 at 7:03 PM, Karen Hanson <[log in to unmask]> wrote:
>>
>> > I found the WorldCat Identities API useful for an institution name
>> > disambiguation project that I worked on a few years ago, though my
>> > goal wasn't to confirm whether names mapped to LCNAF.
>> > The API response includes an LCCN, and you can set it to fuzzy or
>> > exact matching, but you would need to write a script to pass each
>> > term in and process the results:
>> >
>> > http://oclc.org/developer/develop/web-services/worldcat-identities.en.html
>> >
>> > I also can't speak to whether all LC Name Authorities are
>> > represented, so there may be a chance of some false negatives.
>> >
>> > OCLC has another API, but I'm not sure if it covers corporate names:
>> > https://platform.worldcat.org/api-explorer/LCNAF
>> >
>> > I suspect there are others on the list that know more about the
>> > inner workings of these APIs if this might be an option for you...
>> > :)
>> >
>> > Karen
>> >
>> > -----Original Message-----
>> > From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Ethan Gruber
>> > Sent: Friday, September 26, 2014 3:54 PM
>> > To: [log in to unmask]
>> > Subject: Re: [CODE4LIB] Reconciling corporate names?
>> >
>> > I would check with the developers of SNAC
>> > (http://socialarchive.iath.virginia.edu/), as they've spent a lot of
>> > time developing named entity recognition scripts for personal and
>> > corporate names. They might have something you can reuse.
>> >
>> > Ethan
>> >
>> > On Fri, Sep 26, 2014 at 3:47 PM, Galligan, Patrick <[log in to unmask]> wrote:
>> >
>> > > I'm looking to reconcile about 40,000 corporate names against
>> > > LCNAF to see whether they are authorized strings or not, but I'm
>> > > drawing a blank about how to get it done.
>> > >
>> > > I've used http://freeyourmetadata.org/ for reconciling subject
>> > > headings before, but I can't get it to work for LCNAF. Has anyone
>> > > had any experience in a project like this? I'd love to hear some
>> > > ideas for automatically dealing with a large data set like this
>> > > that we did not create and do not know how the names were created.
>> > >
>> > > Thanks!
>> > >
>> > > -Patrick Galligan
>> > >
>> >
>
> --
> Simon Brown
> [log in to unmask]
> simoncharlesbrown (Skype)
> 831.440.7466 (Phone)
>
> *Following our will and wind we may just go where no one's been -- MJK*
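
For anyone who would rather not use R, here is a rough Python sketch of the two ideas in Simon's message: building the id.loc.gov search URL for one name, and approximate matching against a local list of headings. It is only an illustration under some assumptions. The function names, the `cutoff` value, and the idea that the authorized headings have already been pulled out of a downloaded LCNAF file into a plain list are all hypothetical and not something anyone on this thread described. The URL pattern is simply copied from the quoted R snippet.

```python
import difflib
from urllib.parse import quote

# Search URL pieces copied from the R snippet quoted in this thread.
LOC_SEARCH = "http://id.loc.gov/search/?q="
NAMES_FILTER = "&q=cs%3Ahttp%3A%2F%2Fid.loc.gov%2Fauthorities%2Fnames"


def build_query(name):
    """Return the id.loc.gov search URL for one candidate name."""
    return LOC_SEARCH + quote(name.strip()) + NAMES_FILTER


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def reconcile(candidates, authorized, cutoff=0.9):
    """Map each candidate name to its closest authorized heading, or None.

    `authorized` is a hypothetical list of headings already extracted from
    a local copy of LCNAF; `cutoff` is an arbitrary similarity threshold
    (0 to 1) that would need tuning against real data.
    """
    lookup = {normalize(heading): heading for heading in authorized}
    keys = list(lookup)
    results = {}
    for name in candidates:
        hits = difflib.get_close_matches(normalize(name), keys, n=1, cutoff=cutoff)
        results[name] = lookup[hits[0]] if hits else None
    return results
```

Calling `reconcile(my_names, lcnaf_headings)` returns a dictionary keyed by the original strings, so names that come back as `None` can be set aside for manual review. Note that `difflib.get_close_matches` compares each candidate against every heading, so for 40,000 names and a full authority file it would probably be slow without some kind of blocking or indexing first.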