Yes -- it is something I created out of thin air.
It was originally designed for catalogers who wanted a visual display to
duplicate the print, and getting adequate performance for interactive
search, retrieval, and rendering on the computers/browsers of the time meant
I had to include all the formatting.
To bust it up, split on [log in to unmask] That will give you individual records. The
labels tell you the role of each piece of information. For example,
'Assigned code(s):\n '
is followed by newline-delimited codes for the rest of the field, and
'\n USE '
indicates a SEE reference, while any line that does not contain newlines
simply contains a single code. I realize it sounds nuts, but there aren't
that many variations so it's not as bad as it looks.
Since you just want pairs, you might want to load values that have codes
into a dictionary so when you encounter a SEE reference, you can create a
key value pair. The issue with ignoring alternate names is that there are a
number of nonintuitive connections that people wouldn't be able to make.
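
A rough sketch of that approach in Python, for illustration only. The real
record delimiter is masked in this message, so it is passed in as a
parameter. The record shapes are guesses from the description above: a name
followed by 'Assigned code(s):' and one code per line, a name followed by
' USE ' and the preferred name, or a single line whose last
whitespace-separated token is the code. The function names are hypothetical.

```python
def parse_records(text, delimiter):
    """Split the file and sort records into coded names and SEE references.

    Returns (codes_by_name, see_refs). The record layout handled here is an
    assumption based on the description in this thread, not a spec.
    """
    codes_by_name = {}   # name -> list of codes
    see_refs = {}        # alternate name -> preferred name
    for record in text.split(delimiter):
        record = record.strip()
        if not record:
            continue
        if "Assigned code(s):\n" in record:
            # Everything after the label is one code per line.
            name, _, rest = record.partition("Assigned code(s):\n")
            codes = [line.strip() for line in rest.splitlines() if line.strip()]
            codes_by_name[name.strip()] = codes
        elif "\n USE " in record:
            # SEE reference: points at the name that actually carries the code.
            name, _, target = record.partition("\n USE ")
            see_refs[name.strip()] = target.strip()
        elif "\n" not in record:
            # Assumed single-line shape: "<name> <code>".
            parts = record.rsplit(None, 1)
            if len(parts) == 2:
                codes_by_name[parts[0]] = [parts[1]]
    return codes_by_name, see_refs


def code_label_pairs(codes_by_name):
    """Invert name -> codes into code -> label (first name seen wins)."""
    pairs = {}
    for name, codes in codes_by_name.items():
        for code in codes:
            pairs.setdefault(code, name)
    return pairs


def resolve_see_refs(codes_by_name, see_refs):
    """Map each alternate name to the codes of the name it points at."""
    return {
        alt: codes_by_name[target]
        for alt, target in see_refs.items()
        if target in codes_by_name
    }
```

Calling code_label_pairs(parse_records(text, delimiter)[0]) gives the plain
code-to-label lookup; resolve_see_refs is only needed if you decide to keep
the alternate names after all.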
On Wed, Jun 22, 2011 at 3:11 PM, Jonathan Rochkind <[log in to unmask]> wrote:
> PS: Kyle, that's your own version? That's... sort of kind of machine
> readable. Well, not really. I can't figure out quite what's going on there,
> literals, separated by newlines, or sometimes (but sometimes not) with
> "Assigned code:" strings, etc.
> That's in fact a little bit harder to parse than what I'm doing against
> LC. I'm running CSS selectors against the HTML; I'm not having any
> difficulty parsing, the problem is that the format can change without
> notice. But yours seems harder to parse to me, am I missing something?
> In the end, all I need is a list of pairs, code to label. I'll be looking
> up from code, so I don't even care about "alternate labels", really.
> On 6/22/2011 5:57 PM, Kyle Banerjee wrote:
> I went through a process similar to what you describe sometime back for a
> tool I made (i.e. I could find no easily downloadable info). You can
> download something that will be easier to parse from
> It's probably not 100% accurate as I haven't downloaded for quite a while.
> But catalogers have me correct errors they discover and there are about 800
> unique visitors per day so I assume they notice most things.
> It would be nice if this kind of data could be provided in a straightforward
> On Wed, Jun 22, 2011 at 2:44 PM, Jonathan Rochkind <[log in to unmask]> wrote:
> Can anyone remind me if there's a machine readable copy of the MARC
> geographic codes available at any persistent URL?
> They're in HTML at http://www.loc.gov/marc/geoareas/gacs_code.html. I actually had a script that automatically downloaded from there and
> "scraped" the HTML -- but sometime since I wrote the script, the HTML
> structure on the page changed and it broke.
> (I kind of thought that was unlikely since that HTML page itself was
> machine generated -- but I guess they changed the software that generated
> it. Certainly I knew that scraping HTML was a bad thing to rely on... which
> is why I hope LC provides this in some format less likely to change?)
Digital Services Program Manager
Orbis Cascade Alliance
[log in to unmask] / 503.877.9773