Thanks for all the information and discussion.

I don't think I'm familiar enough with Authority file formats to comprehend all of it - but I certainly understand the issues around the question of 'place' vs 'histo-geo-political entity'. Some of this makes me worry about the immediate applicability of the LC Authority files in the Linked Data space - someone said to me recently 'SKOS is just a way of avoiding dealing with the real semantics' :)

Anyway - putting that to one side, the simplest approach for me at the moment seems to be to look only at authorised LCSH as represented on id.loc.gov. Picking up on Andy's first response:

On Thu, Apr 7, 2011 at 3:46 PM, Houghton,Andrew wrote:

> After having done numerous matching and mapping projects, there are some
> issues that you will face with your strategy, assuming I understand it
> correctly. Trying to match a heading starting at the left most subfield and
> working forward will not necessarily produce correct results when matching
> against the LCSH authority file. Using your example:
>
> 650 _0 $a Education $z England $x Finance
>
> is a good example of why processing the heading starting at the left will
> not necessarily produce the correct results. Assuming I understand your
> proposal you would first search for:
>
> 150 __ $a Education
>
> and find the heading with LCCN sh85040989. Next you would look for:
>
> 181 __ $z England
>
> and you would NOT find this heading in LCSH.

OK - ignoring the question of where the best place to look for this is - I can live with not matching it for now. Later (perhaps when I understand it better, or when these headings are added to id.loc.gov) we can revisit this.

> The second issue using your example is that you want to find the “longest”
> matching heading. While the pieces parts are there, so is the enumerated
> authority heading:
>
> 150 __ $a Education $z England
>
> as LCCN sh2008102746.
> So your heading is actually composed of the enumerated headings:
>
> sh2008102746 150 __ $a Education $z England
> sh2002007885 180 __ $x Finance
>
> and not the separate headings:
>
> sh85040989 150 __ $a Education
> n82068148 150 __ $a England
> sh2002007885 180 __ $x Finance
>
> Although one could argue that either analysis is correct depending upon
> what you are trying to accomplish.

What I'm interested in is representing the data as RDF/Linked Data in a way that opens up the best opportunities for both understanding and querying the data. Unfortunately at the moment there isn't a good way of representing LCSH directly in RDF (the MADS work may help I guess, but to be honest at the moment I see that as overly complex - but that's another discussion).

What I can do is make statements that an item is 'about' a subject (probably using dc:subject) and then point at an id.loc.gov URI. However, if I only express individual headings:

Education
England (natch)
Finance

then obviously I lose the context of the full heading - so I also want to look for:

Education--England--Finance (which I won't find on id.loc.gov as not authorised)

At this point I could stop, but my feeling is that it is useful to also look for other combinations of the terms:

Education--England (not authorised)
Education--Finance (authorised! http://id.loc.gov/authorities/sh85041008)

My theory is that as long as I stick to combinations that start with a topical term I'm not going to make startlingly inaccurate statements?

> The matching algorithm I have used in the past contains two routines. The
> first f(a) will accept a heading as a parameter, scrub the heading, e.g.,
> remove unnecessary subfields like $0, $3, $6, $8, etc. and do any other
> pre-processing necessary on the heading, then call the second function f(b).
> The f(b) function accepts a heading as a parameter and recursively calls
> itself until it builds up the list of LCCNs that comprise the heading.
> It first looks for the given heading; when it doesn’t find it, it removes
> the **last** subfield and recursively calls itself, otherwise it appends
> the found LCCN to the returned list and exits. This strategy will find the
> longest match.

Unless I've misunderstood this, this strategy would not find 'Education--Finance'? Instead I need to remove each *subdivision* in turn (no matter where it appears in the heading order) and try all possible combinations, checking each for a match on id.loc.gov. Again, I can do this without worrying about possible invalid headings, as these wouldn't have been authorised anyway...

I can check the number of variations around this, but I guess that in my limited set of records (only 30k) there will be a relatively small number of possible patterns to check.

Does that make sense?
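
To make the 'try all combinations' idea a bit more concrete, here is a rough Python sketch. It is only an illustration under some assumptions: the function names (`candidate_headings`, `match_heading`) are made up for this example, and the `AUTHORISED` dictionary is a hypothetical stand-in for a real check against id.loc.gov, filled with just the two identifiers mentioned in this thread. It keeps the leading topical term fixed and tries every subset of the subdivisions in their original order.

```python
from itertools import combinations

def candidate_headings(subfields):
    """Yield heading strings that keep the leading term and any subset
    of the subdivisions, preserving their original order."""
    lead, subdivisions = subfields[0], subfields[1:]
    # Longest candidates first, so the full heading is tried before shorter ones.
    for size in range(len(subdivisions), -1, -1):
        for subset in combinations(subdivisions, size):
            yield "--".join((lead,) + subset)

# Hypothetical stand-in for a lookup on id.loc.gov (assumption for this sketch);
# the two identifiers are the ones quoted in this thread.
AUTHORISED = {
    "Education": "sh85040989",
    "Education--Finance": "sh85041008",
}

def match_heading(subfields, authorised=AUTHORISED):
    """Return (label, identifier) pairs for every candidate that is authorised."""
    return [(label, authorised[label])
            for label in candidate_headings(subfields)
            if label in authorised]

matches = match_heading(["Education", "England", "Finance"])
for label, lccn in matches:
    print(label, lccn)
```

A heading with n subdivisions gives 2^n candidates under this approach, so for typical headings with two or three subdivisions the number of lookups per heading stays small.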