Hi Sergio,
As Debra mentions, it's a mess -- what you're basically asking for is an
authority file for a field that's not authorized.
The publisher field is transcribed from what's on the piece so there will
be variations even before you consider splits, merges, acquisitions, name
changes, etc., which introduce significant philosophical complications even
before the syntactical ones come into play. I do not expect that consulting
the various authority files for corporate name variations is going to be
that helpful even if many publishers are listed there nor do I think the
publisher identifier within the ISBN will be that helpful because of the
realities of how they are actually assigned.
Depending on how many records you have and how many variations of the same
publishers you have, it may or may not be feasible to use some kind of
normalization in combination with manual review to group things.
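
A minimal sketch of that kind of normalization, in Python, purely as an
illustration. It is loosely similar in spirit to the fingerprint key-collision
clustering in OpenRefine, not a reproduction of it. The names fingerprint,
cluster, and NOISE_WORDS are hypothetical, and the noise-word list is an
assumption you would have to tune against your own records. Anything it groups
still needs manual review, since a shared key does not prove two strings are
the same publisher.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical noise words to ignore when comparing names; tune for your data.
NOISE_WORDS = {"inc", "ltd", "llc", "co", "corp", "the", "and"}


def fingerprint(name: str) -> str:
    """Reduce a transcribed publisher string to a crude comparison key."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, then turn every run of non-alphanumerics into a single space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", unaccented.lower())
    # Drop noise words, de-duplicate tokens, and sort so word order is ignored.
    tokens = {t for t in cleaned.split() if t not in NOISE_WORDS}
    return " ".join(sorted(tokens))


def cluster(names):
    """Group names by key; return only groups with more than one spelling."""
    groups = defaultdict(set)
    for name in names:
        groups[fingerprint(name)].add(name)
    return {key: sorted(variants) for key, variants in groups.items()
            if len(variants) > 1}
```

Calling cluster() on the distinct values of the publisher column gives a list
of candidate groups to review by hand. It does nothing for splits, mergers, or
name changes, which are the harder problem.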
Good luck. While I see the value in your project, I wouldn't want to touch
it with a 10 foot pole.
kyle
On Thu, Sep 17, 2020 at 8:24 AM Debra Shapiro <
[log in to unmask]> wrote:
> Publishers' names and other corporate names are problematic because they
> change all the time! LoC/BIBFRAME Kevin Ford did some work on this last
> year -
>
> It’s called BF providers -
>
> This is a piece of a report for ALA Midwinter 2019 -
> https://cdn.ymaws.com/www.musiclibraryassoc.org/resource/resmgr/BCC_ALA_Reports/2019_ALA-Midwinter_MAC.pdf
>
> Kevin Ford, LCU
> Update on LC’s work with BIBFRAME and streamlining LC’s BF dataset What LC
> has done since ALA Annual: continued pilot work, refined conversion (on
> Github); collaborations with SINOPIA group, and authorities group to
> extract metadata from id.loc.gov; BF Editor updates (cloning works and
> instances, better interaction with database and editor); trying to reduce
> verbosity in RDF and trying to reduce blank nodes (anonymous resources in
> RDF)
>
> Re blank nodes, resources identified with blank nodes lack URIs that
> can be shared easily. They’re unavoidable in RDF, are written into the
> spec for RDF.
>
> Part of the processing. Should everything have URIs (“URIs are
> commitments”)? Kevin Ford’s current bugaboo. Results in a lot of duplication;
> less efficient scaling.
>
> Example from providers in BF: Blank nodes for “United States” and
> “Columbia Pictures Home Entertainment” strings. They worked with an
> experimental Provider file. A data analysis showed that out of ca 15
> million records, only 1.2M had unique strings. Out of 1.2M
> providers they came up with ca 800K providers after parsing agents in
> ID.LOC, loaded into ID.LOC, larger than many other files there.
>
> The test file can be accessed at:
>
> http://id.loc.gov/search/?q=memberOf:http://id.loc.gov/bfentities/providers/collection_Providers
> (For an example of clustering and reducing blank nodes:
> http://id.loc.gov/bfentities/providers/4599ff4baa77b72ddd0b65a9972c8b15.html
> )
> These are NOT MEANT TO BE AUTHORITY RECORDS
>
> > On Sep 17, 2020, at 8:14 AM, [log in to unmask] wrote:
> >
> > Hello,
> >
> > I am trying with openrefine to fix all the different versions of
> > publishers' names we have in our records. I would like to reconcile, but
> > I have not found yet a reconciliation service that knows most or all of
> > the name variants,
> >
> > any ideas?
> >
> > Thank you in advance
>
> [log in to unmask]
> Debra Shapiro
> The iSchool at UW-Madison
> Helen C. White Hall, Rm. 4282
> 600 N. Park St.
> Madison WI 53706
> 608 262 9195
> mobile 608 712 6368
> https://ischool.wisc.edu/blog/staff/shapiro-debra/
> pronouns she | her | hers
>
>