I spent a little time dealing with that set of huge XML files and wrote a crude Java StAX (Streaming API for XML) parser that constructs objects as it passes through the file and dumps them into a database. It currently ignores most of the content and just captures a few fields (by name and partial path) as it hits them, but it is easy to extend and has the advantage of not having to load those enormous files into memory all at once. Once the information (or a subset of it) is in a database, more functionality can be built on top of it.
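To give a rough idea of the approach, here is a minimal sketch (not the attached code) of a StAX pass that tracks the current element path and captures a few fields by partial path. The element names (Subject, Preferred_Term/Term_Text, Place_Type) and the field keys are placeholders I made up for illustration, so check them against the actual Getty files; the records are collected in a list here where real code would write them to a database.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxSketch {
    // Assumed (hypothetical) element that wraps one record.
    static final String RECORD = "Subject";

    // Assumed (hypothetical) partial paths to capture, mapped to field names.
    static final Map<String, String> WANTED = Map.of(
            "Preferred_Term/Term_Text", "name",
            "Place_Type", "placeType");

    // Streams through the XML once, keeping only the wanted fields per record.
    static List<Map<String, String>> parse(Reader in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory.createXMLStreamReader(in);

        List<Map<String, String>> records = new ArrayList<>();
        Deque<String> path = new ArrayDeque<>();   // root first, current element last
        StringBuilder text = new StringBuilder();  // text of the innermost open element
        Map<String, String> current = null;        // record being built, if any

        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    path.addLast(reader.getLocalName());
                    text.setLength(0);
                    if (RECORD.equals(reader.getLocalName())) {
                        current = new LinkedHashMap<>();
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                    text.append(reader.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if (current != null) {
                        String fullPath = String.join("/", path);
                        for (Map.Entry<String, String> w : WANTED.entrySet()) {
                            if (fullPath.endsWith("/" + w.getKey())) {
                                // Keep the first value seen for each field.
                                current.putIfAbsent(w.getValue(), text.toString().trim());
                            }
                        }
                        if (RECORD.equals(reader.getLocalName())) {
                            records.add(current);  // real code would write to a database here
                            current = null;
                        }
                    }
                    path.removeLast();
                    text.setLength(0);
                    break;
                default:
                    break;
            }
        }
        reader.close();
        return records;
    }

    // Convenience wrapper for small in-memory samples.
    static List<Map<String, String>> parseString(String xml) {
        try {
            return parse(new StringReader(xml));
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
    }
}
```

For the real files you would pass a buffered file reader to parse() rather than a string; memory use stays roughly constant because only the current record is held at any time.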
Fortunately, the database model at the time was designed around terms having any number of broader or narrower terms... unfortunately, it wasn't really designed to present such a large hierarchy in a reasonable way.
I've attached the current incarnation of that code, which stuffs some fields and a Zthes record (using a Castor representation of the Zthes schema) into a Lucene index (as opposed to the original database, since this is simpler). It could pretty easily be adapted to include only terms of a certain type (potentially excluding hundreds of thousands of rivers and streams) and maybe even run in a reasonable amount of time. I've commented out all the portions that require Lucene or Castor, so it depends only on a StAX implementation, and you could plug in whatever database or output format you desire.
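As a rough illustration of the type filtering (a sketch only, not part of the attached code), assume each parsed record is a map with a hypothetical "placeType" key. The type labels below are examples I picked, not a vetted list of TGN place types.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class TypeFilterSketch {
    // Assumed (hypothetical) set of place types worth keeping.
    static final Set<String> WANTED_TYPES =
            Set.of("inhabited place", "county", "state", "nation");

    // True if the record carries one of the wanted place types.
    static boolean keep(Map<String, String> record) {
        String type = record.get("placeType");
        return type != null && WANTED_TYPES.contains(type);
    }

    // Drops everything else (rivers, streams, records with no type).
    static List<Map<String, String>> filter(List<Map<String, String>> records) {
        return records.stream()
                .filter(TypeFilterSketch::keep)
                .collect(Collectors.toList());
    }
}
```

In a streaming parser the keep() check would more likely be applied to each record as it is completed, so unwanted records never reach the index or database at all.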
As for the higher semantic and usage issues... we haven't really addressed those yet.
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Dwiggins David
Sent: Wednesday, February 25, 2009 10:28 AM
To: [log in to unmask]
Subject: [CODE4LIB] Working with Getty vocabularies
Is there anyone out there with experience processing the raw data files for the Getty vocabularies (particularly TGN)?
We're adopting AAT and TGN as the primary vocabularies for our new shared cataloging system for our museum, library and archival collections. I'm presently trying to come up with some scripts to automate matching of places in existing databases to places in the TGN taxonomy. But I'm finding that the Getty data files are very complex, and I haven't yet figured out a foolproof method to do this. I'm curious if anyone else has traveled this road before, and if so whether you might be able to share some tips or code snippets.
Since most of our place names are going to be in the US, my gut feeling has been to first try to extract a list of places in the US and dump things like state, county, etc. into discrete database fields that I can match against. But I find myself a bit flummoxed by the polyhierarchical nature of the data (where one place can belong to multiple higher level places).
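To make the polyhierarchy problem concrete, here is a minimal sketch of one possible way to flatten it, assuming the parent links and place types have already been extracted into two in-memory maps (the IDs, type labels, and map layout are all hypothetical). It walks every parent of a place, not just the first, and collects all ancestors of a given type, so a place with two parents may yield more than one candidate county or state.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PolyhierarchySketch {
    // Breadth-first walk over ALL parents of startId, returning the ancestors
    // whose type equals wantedType. The seen set guards against visiting a
    // shared ancestor twice (and against accidental cycles in the data).
    static Set<String> ancestorsOfType(String startId, String wantedType,
                                       Map<String, List<String>> parents,
                                       Map<String, String> typeOf) {
        Set<String> found = new LinkedHashSet<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startId);
        seen.add(startId);
        while (!queue.isEmpty()) {
            String id = queue.remove();
            for (String parent : parents.getOrDefault(id, List.of())) {
                if (seen.add(parent)) {
                    if (wantedType.equals(typeOf.get(parent))) {
                        found.add(parent);
                    }
                    queue.add(parent);
                }
            }
        }
        return found;
    }
}
```

Calling this once per target type ("county", "state", and so on) would give the candidate values for each discrete field; what to do when a field ends up with more than one candidate is still a policy decision.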
Another issue is the wide variety of place types in use in the taxonomy. England, for example, is a country, but the United States is a nation. This makes sense to a degree, but it also makes it a bit hard to figure out which term to match when you're trying to automate matching against data where the creators were less discerning about this sort of fine distinction.
I feel like I'm surely not the first person to tackle this, and would love to exchange notes...
Systems Librarian/Archivist, Historic New England
141 Cambridge Street, Boston, MA 02114
(617) 227-3956 x 242
http://www.historicnewengland.org/
Visit http://www.LymanEstate.org for information on renting the historic Lyman Estate for your next event - a very special place for very special occasions.