Interesting. Our XML is being provided by ProQuest. We're cancelling our subscription to their Historical NYT database and they are offering up XML of the articles for 1851-1938.
Buddy Pennington
Head of Electronic Resources & Systems
University Libraries
University of Missouri - Kansas City
(he/him/his)
-----Original Message-----
From: Code for Libraries <[log in to unmask]> On Behalf Of Custer, Mark
Sent: Thursday, December 17, 2020 2:46 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Web app to search XML files
Exactly my thoughts, as well, Buddy.
I was going to recommend BaseX, at least as a first step to investigate the full corpus (https://docs.basex.org/wiki/Getting_Started). It indexes XML documents very quickly (and in their entirety, which is important whether you want to use those documents as is or transform them to something else). You can do that from the command line or, without even taking the time to learn its commands, you can use the GUI to 1) index and 2) get an overview of your new database properties, including an exhaustive summary of attributes, elements, path structure, etc.
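For anyone who wants a rough version of that overview before installing anything, here is a minimal sketch in Python (standard library only) that counts element and attribute paths across a set of XML files. It is not how BaseX builds its summary, just an approximation of the same idea; the function name `summarize_paths` and the sample record are made up for illustration and are not taken from the ProQuest data.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_paths(xml_sources):
    """Count how often each element path and attribute path occurs.

    xml_sources: iterable of file names or binary file-like objects.
    """
    counts = Counter()
    for source in xml_sources:
        stack = []  # tags from the root down to the current element
        for event, elem in ET.iterparse(source, events=("start", "end")):
            if event == "start":
                stack.append(elem.tag)
                path = "/" + "/".join(stack)
                counts[path] += 1
                # Attributes are already available on the "start" event.
                for attr in elem.attrib:
                    counts[path + "/@" + attr] += 1
            else:
                stack.pop()
                elem.clear()  # release content we no longer need
    return counts


# Hypothetical sample record, only to show the shape of the output.
sample = io.BytesIO(
    b'<record id="1"><title>Example</title><date>1851-09-18</date></record>'
)
for path, n in sorted(summarize_paths([sample]).items()):
    print(n, path)
```

To run it over a real directory, passing something like `pathlib.Path("data").glob("*.xml")` as `xml_sources` should work, since `iterparse` accepts file paths as well as open files.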
That said, I'm now curious about the NYT XML dataset in general. Can you provide a link to more info about it?
I just did a very quick bit of searching and found this interesting blog post, https://open.blogs.nytimes.com/2016/07/26/the-future-of-the-past-modernizing-the-new-york-times-archive, which I believe describes how that dataset (or a similar one) was handled internally: the XML was converted into JSON, along with HTML documents that stood in for missing XML docs. Given that, I expect it's not the type of XML that you'll need to retain as XML, but getting a full view of the entire forest is probably the only way to know for sure.
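As a rough illustration of what an XML-to-JSON conversion involves (and not a description of what the NYT actually did), here is a small Python sketch using only the standard library. The function name `element_to_dict`, the output layout, and the sample record are my own assumptions.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Recursively convert an element into a plain, JSON-friendly dict."""
    node = {"tag": elem.tag}
    if elem.attrib:
        node["attributes"] = dict(elem.attrib)
    text = (elem.text or "").strip()
    if text:
        node["text"] = text
    children = [element_to_dict(child) for child in elem]
    if children:
        node["children"] = children
    return node


# Hypothetical record; the real element names will differ.
record = ET.fromstring(
    '<record id="1"><title>Example headline</title></record>'
)
print(json.dumps(element_to_dict(record), indent=2))
```

Note that this sketch drops "tail" text, i.e. text that follows a child element inside mixed content. If the articles turn out to contain inline markup in the body text, that is exactly the kind of XML you would want to keep as XML.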
Mark
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Pennington, Buddy D.
Sent: Thursday, 17 December, 2020 3:31 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Web app to search XML files
Yes, lots of excellent suggestions from folks. I was actually looking at BaseX earlier today as a tool to review the XML once we have it.
Thanks!
Buddy Pennington
Head of Electronic Resources & Systems
University Libraries
University of Missouri - Kansas City
(he/him/his)
-----Original Message-----
From: Code for Libraries <[log in to unmask]> On Behalf Of David Mayo
Sent: Thursday, December 17, 2020 2:15 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Web app to search XML files
A lot of good suggestions; if you're looking for fast turnaround without having to decompose and shift the data, it might be worth looking at dedicated XML databases like eXist-db and BaseX:
http://exist-db.org/exist/apps/homepage/index.html
https://basex.org/
IIRC, eXist-db has built-in functionality for building applications; even if you don't go that way, I've found these databases very useful for analyzing XML corpora before running other software to transform them.
- Dave Mayo (He/Him)
Software Dev @ Harvard LTS
On Thu, Dec 17, 2020 at 2:53 PM Stuart A. Yeates <[log in to unmask]> wrote:
> There's XML and XML.
>
> I suggest that you enquire about the exact format that you're going to
> be receiving and ask around for systems that support it out of the
> box.
>
> cheers
> stuart
>
>
> --
> ...let us be heard from red core to black sky
>
> On Fri, 18 Dec 2020 at 07:37, Pennington, Buddy D.
> <[log in to unmask]>
> wrote:
> >
> > Hi all,
> >
> > We're purchasing an XML dataset for the historical NY Times and I am
> > curious about any suggestions to quickly build a web app to search and
> > display those records for end users.
> >
> > Buddy Pennington
> > Head of Electronic Resources & Systems
> > University Libraries
> > University of Missouri - Kansas City
> > (he/him/his)
>