MARC XML and MADS XML are listed at the bottom of each Name page under "Alternate Formats". Since you are using XSL, those should work much better for you than a scraped HTML page in a WARC file.
If you know the LCCN, you can fetch the single record in the serialization you like:
https://id.loc.gov/authorities/names/n2001028682.madsxml.xml
https://id.loc.gov/authorities/names/n2001028682.marcxml.xml
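As a rough sketch (Python, standard library only), the URL for a given LCCN and serialization could be built like this. The function name and the suffix table are made up for illustration; only the suffixes that appear in this thread are listed, and stripping blanks from the LCCN is an assumption about how your identifiers are stored.

```python
# Suffixes seen in this thread; id.loc.gov may offer other serializations.
SUFFIXES = {
    "html": ".html",
    "madsxml": ".madsxml.xml",
    "marcxml": ".marcxml.xml",
}


def name_authority_url(lccn: str, fmt: str = "marcxml") -> str:
    """Build the id.loc.gov URL for one name authority record.

    Hypothetical helper: it only formats a string and does no fetching.
    """
    # Assumption: LCCNs may be stored with blanks ("n 2001028682"); drop them.
    lccn = lccn.replace(" ", "")
    return f"https://id.loc.gov/authorities/names/{lccn}{SUFFIXES[fmt]}"


print(name_authority_url("n2001028682", "marcxml"))
print(name_authority_url("n2001028682", "madsxml"))
```

If you do fetch records this way, pausing between requests seems sensible; the bulk download is still the kinder option for large sets.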
Nate
-----Original Message-----
From: Code for Libraries <[log in to unmask]> On Behalf Of Stuart A. Yeates
Sent: Monday, May 04, 2026 4:45 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Regexp for rewriting LoC LCCN authorised personal names
CAUTION: This email message has been received from an external source. Please use caution when opening attachments, or clicking on links.
I've got many pages like
https://id.loc.gov/authorities/names/n2001028682.html (stored in WARC
files)
I've got names.madsrdf.xml.gz, which is all the names in MADS/RDF, but it's disaggregated rather than in the format shown in https://www.loc.gov/standards/mads/rdf/, so it's not really amenable to processing in XSL. I'd prefer not to spin up a triple store and reasoner of any kind.
I suspect that what I need is the MARCXML, which I'm familiar with manipulating with XSL and has all the subfields I need explicitly marked.
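To illustrate why explicitly marked subfields help, here is a minimal sketch in Python rather than XSL. The sample record is hand-written for illustration using a name quoted elsewhere in this thread, not copied from id.loc.gov, and both function names are hypothetical. It reads the 100 field and un-inverts only the simplest "Surname, Forenames" shape of subfield a; names with no comma are returned unchanged, and anything more complex would need extra rules.

```python
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hand-written sample, not an actual LoC record.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Mudge, Lewis Seymour,</subfield>
    <subfield code="d">1868-1945</subfield>
  </datafield>
</record>"""


def personal_name_subfields(marcxml):
    """Return {code: text} for the first 100 field (repeated codes overwrite)."""
    root = ET.fromstring(marcxml)
    field = root.find(".//marc:datafield[@tag='100']", NS)
    if field is None:
        return {}
    return {
        sf.get("code"): (sf.text or "").strip()
        for sf in field.findall("marc:subfield", NS)
    }


def uninvert(subfields):
    """Turn 'Surname, Forenames' in subfield a into 'Forenames Surname'."""
    name = subfields.get("a", "").rstrip(", ")
    if "," not in name:
        return name  # single-part names are left as they are
    surname, _, forenames = name.partition(",")
    return f"{forenames.strip()} {surname.strip()}"


fields = personal_name_subfields(SAMPLE)
print(uninvert(fields))  # Lewis Seymour Mudge
```

Because the dates sit in their own subfield, they never have to be separated from the name with a regexp.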
As I work, I've been documenting the differences I find between LoC and Wikidata, on the understanding that bridging LCCNs and Wikidata is unlikely to be the work of a single person; see https://www.wikidata.org/wiki/User:Stuartyeates/Wikidata_-_LoC_ontological_mismatches
cheers
stuart
--
...let us be heard from red core to black sky
On Tue, 5 May 2026 at 07:40, Michael Monaco < [log in to unmask]> wrote:
> As Kevin mentioned, there are in fact many possible patterns for names
> to appear in, so it's probably not possible to un-invert all the names
> in the NAF with a single RegEx.
>
> You mention that you've downloaded the records in bulk -- what format
> are the records in? Could you provide some examples?
>
> Thanks,
>
> Mike Monaco
> Head, Technical Services & Coordinator, Cataloging Services Associate
> Professor of Bibliography University Libraries Technical Services 261B
> Bierce Library The University of Akron Akron, Ohio 44325-1712
> He/him/his
> Office: 330-972-2446
> [log in to unmask]
> ORCID: 0000-0001-7244-5154
> https://www.uakron.edu/libraries
>
>
> -----Original Message-----
> From: Code for Libraries <[log in to unmask]> On Behalf Of Stuart A.
> Yeates
> Sent: Monday, May 4, 2026 3:07 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] Regexp for rewriting LoC LCCN authorised
> personal names
>
> CAUTION:This email originated from outside of The University of Akron.
> Use caution when opening attachments, clicking links or responding to
> requests for information.
>
>
>
> As it happens, I have already downloaded the records in bulk. What I
> need is a regexp to parse the "quoted text"
>
> cheers
> stuart
>
> --
> ...let us be heard from red core to black sky
>
>
> On Tue, 5 May 2026 at 06:33, Trail, Nate <[log in to unmask]> wrote:
>
> > Stuart,
> >
> > You could download the entire Names file in "nt" serialization, then
> > there's one line for each name you can filter on:
> >
> >
> > <http://id.loc.gov/authorities/names/nr2001046558> <http://www.loc.gov/mads/rdf/v1#authoritativeLabel> "Smith, Jim, 1940 October 17-" .
> >
> > Then you can do what you want with the quoted text.
> >
> > Saves bandwidth for you and us.
> >
> > https://id.loc.gov/download/
> >
> > Good luck,
> >
> > Nate
> >
> >
> > -----------------------------------------
> > Nate Trail
> > Network Development & MARC Standards Office LCSG/DPS/ABA/NDMSO
> > Library of Congress Washington DC 20540
> >
> >
> > -----Original Message-----
> > From: Code for Libraries <[log in to unmask]> On Behalf Of
> > Kevin Hawkins
> > Sent: Monday, May 04, 2026 2:08 PM
> > To: [log in to unmask]
> > Subject: Re: [CODE4LIB] Regexp for rewriting LoC LCCN authorised
> > personal names
> >
> > Hello Stuart,
> >
> > Do you mean that you want to convert LCNAF personal names from this
> > sort of order:
> >
> > Mudge, Lewis Seymour, 1868-1945
> >
> > to something like this:
> >
> > Lewis Seymour Mudge, 1868-1945
> >
> > ? But then also deal with authorized forms containing no commas,
> > forms with more than two commas, and occasional use of parentheses.
> > So, as you know, it gets complicated.
> >
> > I wonder if a different approach might make more sense here:
> >
> > 1. Query the inverted LCNAF form at
> > https://id.loc.gov/
> >
> > 2. Retrieve the URI, extracting the identifier (beginning with "n")
> >
> > 3. Query Wikidata using this identifier.
> >
> > 4. Retrieve Wikidata's form of the name, which is not inverted.
> >
> > --Kevin
> >
> > On 5/3/26 1:25 PM, Stuart A. Yeates wrote:
> > > Does anyone know of somewhere that describes LCCN authorised
> > > personal names as regexps? I want to be able to rewrite them at
> > > scale to 'normal' order.
> > >
> > > AI appears to be actively undermining the functionality of search
> > > engines.
> > >
> > > cheers
> > > stuart
> > > --
> > > ...let us be heard from red core to black sky
> >
>