Thanks for all the information and discussion.
I don't think I'm familiar enough with Authority file formats to completely
comprehend - but I certainly understand the issues around the question of
'place' vs 'histo-geo-political entity'. Some of this makes me worry about
the immediate applicability of the LC Authority files in the Linked Data
space - someone said to me recently 'SKOS is just a way of avoiding dealing
with the real semantics' :)
Anyway - putting that to one side, the simplest approach for me at the
moment seems to be to look only at authorised LCSH as represented on id.loc.gov.
Picking up on Andy's first response:
On Thu, Apr 7, 2011 at 3:46 PM, Houghton,Andrew wrote:
> After having done numerous matching and mapping projects, there are some
> issues that you will face with your strategy, assuming I understand it
> correctly. Trying to match a heading starting at the left most subfield and
> working forward will not necessarily produce correct results when matching
> against the LCSH authority file. Using your example:
>
>
>
> 650 _0 $a Education $z England $x Finance
>
>
>
> is a good example of why processing the heading starting at the left will
> not necessarily produce the correct results. Assuming I understand your
> proposal you would first search for:
>
>
>
> 150 __ $a Education
>
>
>
> and find the heading with LCCN sh85040989. Next you would look for:
>
>
>
> 181 __ $z England
>
>
>
> and you would NOT find this heading in LCSH.
>
OK - ignoring the question of where the best place to look for this is - I
can live with not matching it for now. Later (perhaps when I understand it
better, or when these headings are added to id.loc.gov) we can revisit this.
> The second issue using your example is that you want to find the “longest”
> matching heading. While the pieces parts are there, so is the enumerated
> authority heading:
>
>
>
> 150 __ $a Education $z England
>
>
>
> as LCCN sh2008102746. So your heading is actually composed of the
> enumerated headings:
>
>
>
> sh2008102746 150 __ $a Education $z England
>
> sh2002007885 180 __ $x Finance
>
>
>
> and not the separate headings:
>
>
>
> sh85040989 150 __ $a Education
>
> n82068148 150 __ $a England
>
> sh2002007885 180 __ $x Finance
>
>
>
> Although one could argue that either analysis is correct depending upon
> what you are trying to accomplish.
>
>
>
What I'm interested in is representing the data as RDF/Linked Data in a way
that opens up the best opportunities for both understanding and querying the
data. Unfortunately at the moment there isn't a good way of representing
LCSH directly in RDF (the MADS work may help I guess but to be honest at the
moment I see that as overly complex - but that's another discussion).
What I can do is make statements that an item is 'about' a subject (probably
using dc:subject) and then point at an id.loc.gov URI. However, if I only
express individual headings:

    Education
    England (natch)
    Finance

then obviously I lose the context of the full heading - so I also want to
look for:

    Education--England--Finance (which I won't find on id.loc.gov as not
    authorised)

At this point I could stop, but my feeling is that it is useful to also look
for other combinations of the terms:

    Education--England (not authorised)
    Education--Finance (authorised! http://id.loc.gov/authorities/sh85041008)

My theory is that as long as I stick to combinations that start with a
topical term I'm not going to make startlingly inaccurate statements?
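To make the 'about' statements concrete, here is a minimal sketch of what I have in mind, written out as N-Triples. The item URI is made up for illustration, and I am assuming dc:subject means the Dublin Core elements 1.1 property; the subject URI is the id.loc.gov one above.

```python
# dc:subject, assuming the Dublin Core elements 1.1 namespace.
DC_SUBJECT = "http://purl.org/dc/elements/1.1/subject"


def subject_triples(item_uri, subject_uris):
    """Return one N-Triples line per 'item is about subject' statement."""
    return [f"<{item_uri}> <{DC_SUBJECT}> <{uri}> ." for uri in subject_uris]


# Hypothetical item URI; a real one would come from my own records.
triples = subject_triples(
    "http://example.org/item/1",
    ["http://id.loc.gov/authorities/sh85041008"],
)
print("\n".join(triples))
```

Each authorised combination I find would simply add another line of the same shape for the same item.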
> The matching algorithm I have used in the past contains two routines. The
> first f(a) will accept a heading as a parameter, scrub the heading, e.g.,
> remove unnecessary subfield like $0, $3, $6, $8, etc. and do any other
> pre-processing necessary on the heading, then call the second function f(b).
> The f(b) function accepts a heading as a parameter and recursively calls
> itself until it builds up the list of LCCNs that comprise the heading. It
> first looks for the given heading; when it doesn’t find it, it removes the
> **last** subfield and recursively calls itself, otherwise it appends the
> found LCCN to the returned list and exits. This strategy will find the
> longest match.
>
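To check my understanding of the two routines, here is a rough sketch of how I read them. It is my interpretation, not Andy's actual code: the AUTHORITY dict is a toy stand-in for the authority file (using only the LCCNs quoted above), the function names are mine, and I am assuming that once the longest leading match is found the routine carries on with the leftover subfields. I have also used a loop to drop trailing subfields, with recursion only for the remainder.

```python
# Subfields to scrub before matching, as listed in the description above.
SKIP_SUBFIELDS = {"0", "3", "6", "8"}

# Toy stand-in for the authority file: subfield sequence -> LCCN.
AUTHORITY = {
    (("a", "Education"),): "sh85040989",
    (("a", "Education"), ("z", "England")): "sh2008102746",
    (("x", "Finance"),): "sh2002007885",
}


def scrub(heading):
    """f(a): drop control subfields such as $0, $3, $6, $8."""
    return [(code, value) for code, value in heading if code not in SKIP_SUBFIELDS]


def longest_matches(subfields, index):
    """f(b): find the longest leading run of subfields with an authority
    record, then continue with whatever subfields are left over."""
    if not subfields:
        return []
    # Try the whole heading first, then drop the last subfield each time.
    for end in range(len(subfields), 0, -1):
        lccn = index.get(tuple(subfields[:end]))
        if lccn is not None:
            return [lccn] + longest_matches(subfields[end:], index)
    # Not even the first subfield matches: record a gap and move on.
    return [None] + longest_matches(subfields[1:], index)


heading = [("a", "Education"), ("z", "England"), ("x", "Finance"), ("0", "ignored")]
print(longest_matches(scrub(heading), AUTHORITY))
```

With this toy index the example heading resolves to sh2008102746 followed by sh2002007885, which is the first of the two analyses above.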
Unless I've misunderstood this, this strategy would not find
'Education--Finance'? Instead I need to remove each *subdivision* in turn
(no matter where it appears in the heading order) and try all possible
combinations checking each for a match on id.loc.gov. Again, I can do this
without worrying about possible invalid headings, as these wouldn't have
been authorised anyway...
I can check the number of variations around this but I guess that in my
limited set of records (only 30k) there will be a relatively small number of
possible patterns to check.
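As a sketch of what I mean by trying all combinations: keep the leading topical term fixed, and generate every subset of the remaining subdivisions in their original order. The function name is mine, and the actual check of each candidate against id.loc.gov is left out, since I haven't settled on how to do that lookup.

```python
from itertools import combinations


def candidate_headings(parts):
    """Return every heading string that keeps the first (topical) term and
    any subset of the later subdivisions, longest candidates first."""
    main, subdivisions = parts[0], parts[1:]
    candidates = []
    for size in range(len(subdivisions), -1, -1):
        # combinations() preserves the original order of the subdivisions.
        for combo in combinations(subdivisions, size):
            candidates.append("--".join((main,) + combo))
    return candidates


print(candidate_headings(["Education", "England", "Finance"]))
```

For the example heading this gives Education--England--Finance, Education--England, Education--Finance and Education. A heading with n subdivisions produces 2^n candidates, so the number of lookups stays small for typical headings.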
Does that make sense?