Sorry if this is not helpful, but could you just strip commas and search
Wikidata for matching entities? Unquoted queries for "lorde audre" and
"audre lorde" appear to yield identical results. If you are trying to match
the entity's name exactly, then obviously this approach does not help.

Without using an LLM to select from a list of entities, I imagine I'd want
to: query on the name form minus commas, filter results to human entities*,
filter for entities with matching terms in the label (lots of possible
adjustments here), and then present the user with a list of matches to
choose from.

* Questions that might need to be asked at this juncture: are there
fictional characters in the data set? Are fictional characters instances of
"human" in Wikidata? I see "Oliver Twist" is only a "fictional human", so
perhaps not.
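
The steps above could be sketched roughly as follows. This is only a sketch under assumptions: the function names and the shape of the candidate dicts ("id", "label", "instance_of") are made up for illustration, and fetching each candidate's label and "instance of" values from Wikidata is left out. The search URL uses the MediaWiki full-text search (action=query, list=search), and Q5 is the Wikidata item for "human".

```python
import re
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"
HUMAN_QID = "Q5"  # Wikidata item for "human"


def search_url(name, limit=20):
    """Build a full-text search URL for the name with commas stripped."""
    query = " ".join(name.replace(",", " ").split())
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def name_terms(name):
    """Lowercased alphabetic tokens; punctuation and digits are ignored."""
    return set(re.findall(r"[^\W\d_]+", name.lower()))


def filter_candidates(name, candidates):
    """Keep candidates that are humans and whose label has every name term.

    `candidates` is an assumed structure: a list of dicts with "id",
    "label" and "instance_of" (a list of QIDs) already looked up.
    """
    wanted = name_terms(name)
    return [
        c for c in candidates
        if HUMAN_QID in c.get("instance_of", [])
        and wanted <= name_terms(c.get("label", ""))
    ]
```

Because only Q5 passes the filter, anything typed solely as "fictional human" drops out, which may or may not be what you want. Whatever survives the filter would then go to the user as the list to choose from.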

Best,
Eric Phetteplace
Systems Librarian
California College of the Arts
libraries.cca.edu

On Mon, May 4, 2026 at 1:46 PM Stuart A. Yeates <[log in to unmask]> wrote:
> I've got many pages like
> https://id.loc.gov/authorities/names/n2001028682.html (stored in WARC
> files)
>
> I've got names.madsrdf.xml.gz which is all the names in madsrdf, but it's
> disaggregated rather than in the format exampled in
> https://www.loc.gov/standards/mads/rdf/ so it's not really amenable to
> processing in XSL. I'd prefer not to spin up a triple store and reasoner of
> any kind.
>
> I suspect that what I need is the MARCXML, which I'm familiar with
> manipulating with XSL and has all the subfields I need explicitly marked.
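
For what it's worth, here is a minimal sketch of why the MARCXML helps: the 100 field keeps the name and the dates in separate subfields, so there is less to guess from punctuation. It is in Python rather than XSL, the function name is made up, and the sample record in the usage note is hand-written for illustration, not copied from LoC.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARCXML ("MARC21 slim") records.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def heading_subfields(record_xml):
    """Return (code, text) pairs for the subfields of the first 100 field."""
    root = ET.fromstring(record_xml)
    field = root.find(".//marc:datafield[@tag='100']", NS)
    if field is None:
        return []  # not a personal-name heading
    return [
        (sf.get("code"), (sf.text or "").strip())
        for sf in field.findall("marc:subfield", NS)
    ]


# Hand-written sample record, for illustration only.
SAMPLE = """
<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Mudge, Lewis Seymour,</subfield>
    <subfield code="d">1868-1945</subfield>
  </datafield>
</record>
"""
```

Calling heading_subfields(SAMPLE) gives the name in subfield a and the dates in subfield d as separate pairs.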
>
> As I work, I've been documenting the differences I find between LoC and
> wikidata, on the understanding that bridging LCCNs and wikidata is unlikely
> to be the work of a single person, see
>
> https://www.wikidata.org/wiki/User:Stuartyeates/Wikidata_-_LoC_ontological_mismatches
>
> cheers
> stuart
> --
> ...let us be heard from red core to black sky
>
>
> On Tue, 5 May 2026 at 07:40, Michael Monaco <
> [log in to unmask]> wrote:
>
> > As Kevin mentioned, there are in fact many possible patterns for names to
> > appear in, so it's probably not possible to un-invert all the names in
> > the NAF with a single RegEx.
> >
> > You mention that you've downloaded the records in bulk -- what format are
> > the records in? Could you provide some examples?
> >
> > Thanks,
> >
> > Mike Monaco
> > Head, Technical Services & Coordinator, Cataloging Services
> > Associate Professor of Bibliography
> > University Libraries Technical Services
> > 261B Bierce Library
> > The University of Akron
> > Akron, Ohio 44325-1712
> > He/him/his
> > Office: 330-972-2446
> > [log in to unmask]
> > ORCID: 0000-0001-7244-5154
> > https://www.uakron.edu/libraries
> >
> >
> > -----Original Message-----
> > From: Code for Libraries <[log in to unmask]> On Behalf Of Stuart A.
> > Yeates
> > Sent: Monday, May 4, 2026 3:07 PM
> > To: [log in to unmask]
> > Subject: Re: [CODE4LIB] Regexp for rewriting LoC LCCN authorised personal
> > names
> >
> >
> > As it happens, I have already downloaded the records in bulk. What I need
> > is a regexp to parse the "quoted text"
> >
> > cheers
> > stuart
> >
> > --
> > ...let us be heard from red core to black sky
> >
> >
> > On Tue, 5 May 2026 at 06:33, Trail, Nate <[log in to unmask]> wrote:
> >
> > > Stuart,
> > >
> > > You could download the entire Names file in "nt" serialization, then
> > > there's one line for each name you can filter on:
> > >
> > >
> > > <http://id.loc.gov/authorities/names/nr2001046558> <http://www.loc.gov/mads/rdf/v1#authoritativeLabel> "Smith, Jim, 1940 October 17-" .
> > >
> > > Then you can do what you want with the quoted text.
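
A minimal sketch of that filtering step, assuming one triple per line as in the "nt" file. The function name is made up, and only the \" and \\ escapes are undone; \uXXXX escapes are left as they are.

```python
import re

LABEL_PRED = "http://www.loc.gov/mads/rdf/v1#authoritativeLabel"

# One N-Triples line: <subject> <authoritativeLabel> "literal" .
LABEL_LINE = re.compile(
    r'^<(?P<subject>[^>]+)>\s+<' + re.escape(LABEL_PRED) + r'>\s+'
    r'"(?P<label>(?:[^"\\]|\\.)*)"(?:@[A-Za-z0-9-]+|\^\^<[^>]+>)?\s*\.\s*$'
)


def parse_label_line(line):
    """Return (lccn, label) for an authoritativeLabel triple, else None."""
    m = LABEL_LINE.match(line.strip())
    if m is None:
        return None
    lccn = m.group("subject").rsplit("/", 1)[-1]
    # Undo only the two simplest escapes; \uXXXX is not handled here.
    label = re.sub(r'\\(["\\])', r"\1", m.group("label"))
    return lccn, label
```

Running this over each line of the download and keeping the non-None results gives (identifier, quoted text) pairs to work on.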
> > >
> > > Saves bandwidth for you and us.
> > >
> > > https://id.loc.gov/download/
> > >
> > > Good luck,
> > >
> > > Nate
> > >
> > >
> > > -----------------------------------------
> > > Nate Trail
> > > Network Development & MARC Standards Office LCSG/DPS/ABA/NDMSO Library
> > > of Congress Washington DC 20540
> > >
> > >
> > > -----Original Message-----
> > > From: Code for Libraries <[log in to unmask]> On Behalf Of Kevin
> > > Hawkins
> > > Sent: Monday, May 04, 2026 2:08 PM
> > > To: [log in to unmask]
> > > Subject: Re: [CODE4LIB] Regexp for rewriting LoC LCCN authorised
> > > personal names
> > >
> > > Hello Stuart,
> > >
> > > Do you mean that you want to convert LCNAF personal names from this
> > > sort of order:
> > >
> > > Mudge, Lewis Seymour, 1868-1945
> > >
> > > to something like this:
> > >
> > > Lewis Seymour Mudge, 1868-1945
> > >
> > > ? But then also deal with authorized forms containing no commas,
> > > forms with more than two commas, and occasional use of parentheses.
> > > So, as you know, it gets complicated.
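
To make the complication concrete, here is a sketch that handles only the simplest pattern ("Surname, Forenames" with optional trailing dates) and deliberately returns None for everything else: no comma, more than two commas, parentheses, or a single name followed by dates. The function name is made up, and this is not a description of the full range of LCNAF forms.

```python
import re

# Surname, Forenames[, dates] with no parentheses and no extra commas.
SIMPLE = re.compile(
    r"""
    ^(?P<surname>[^,()\d]+),\s*          # inverted surname
    (?P<forenames>[^,()\d]+?)            # forenames, no digits allowed
    (?:,\s*(?P<dates>[^,()]*\d[^,()]*))? # optional part containing a digit
    \s*$
    """,
    re.VERBOSE,
)


def uninvert(heading):
    """Return the heading in direct order, or None if it is not simple."""
    m = SIMPLE.match(heading)
    if m is None:
        return None  # leave anything unusual for a human or another rule
    name = f"{m['forenames'].strip()} {m['surname'].strip()}"
    if m["dates"]:
        return f"{name}, {m['dates'].strip()}"
    return name
```

With this, uninvert("Mudge, Lewis Seymour, 1868-1945") gives "Lewis Seymour Mudge, 1868-1945", and the headings it refuses can be counted to see how much of the file the simple case actually covers.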
> > >
> > > I wonder if a different approach might make more sense here:
> > >
> > > 1. Query the inverted LCNAF form at
> > > https://id.loc.gov/
> > >
> > > 2. Retrieve the URI, extracting the identifier (beginning with "n")
> > >
> > > 3. Query Wikidata using this identifier.
> > >
> > > 4. Retrieve Wikidata's form of the name, which is not inverted.
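
Steps 2 and 3 could look something like this. It is a sketch that only builds the query and makes no network calls. P244 is the Wikidata property for the Library of Congress authority ID; the function names and the identifier pattern (an "n", optional further letters, then digits) are assumptions for illustration.

```python
import re
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
LCCN_IN_URI = re.compile(r"/authorities/names/(n[a-z]*\d+)")


def lccn_from_uri(uri):
    """Pull the name authority identifier out of an id.loc.gov URI."""
    m = LCCN_IN_URI.search(uri)
    return m.group(1) if m else None


def sparql_for_lccn(lccn):
    """SPARQL asking Wikidata for the item carrying this P244 value."""
    if not re.fullmatch(r"n[a-z]*\d+", lccn):
        raise ValueError(f"unexpected identifier: {lccn!r}")
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f'  ?item wdt:P244 "{lccn}" .\n'
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


def query_url(lccn):
    """GET URL for the Wikidata Query Service, asking for JSON results."""
    params = {"query": sparql_for_lccn(lccn), "format": "json"}
    return SPARQL_ENDPOINT + "?" + urlencode(params)
```

The ?itemLabel in the result is then Wikidata's own, non-inverted form of the name (step 4).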
> > >
> > > --Kevin
> > >
> > > On 5/3/26 1:25 PM, Stuart A. Yeates wrote:
> > > > Does anyone know of somewhere that describes LCCN authorised
> > > > personal names as regexps? I want to be able to rewrite them at scale
> > > > to 'normal' order.
> > > >
> > > > AI appears to be actively undermining the functionality of search
> > > > engines.
> > > >
> > > > cheers
> > > > stuart
> > > > --
> > > > ...let us be heard from red core to black sky
> > >
> >
>