Great input...thanks for the discussion. A few notes on what we're
seeing on our end:
a) We've just finished testing a content-based relevance ranking method
with users; it works well for academic users, and will probably work
better with some tweaking of field weights. I'll have more specific
results in a month or two.
b) We experimented with "boosting" some documents using holdings data.
OCLC gave us both WorldCat-wide and UC-wide numbers to work with.
David's observation is right on...the mix in WorldCat skews things for
academic libraries, so we rejected that and used UC system-wide numbers
in our experiments.
And I do agree with Jonathan's excellent summary...summing the scores is
not the right approach. And yet...I wish I could explain why it seems as
though the clustering can tell us something.
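
A minimal sketch of the three group-scoring options raised in this
thread (highest score, sum, log of the sum), run on Jonathan's example
of four manifestations at .7 versus one at .9. The function name, the
method labels, and the use of log(1 + sum) rather than a bare log (so
that small sums don't go negative) are assumptions for illustration,
not anyone's actual implementation.

```python
import math


def group_score(scores, method="max"):
    """Combine per-record relevancy scores for one FRBR work set.

    `scores` holds one relevancy score per record in the set.
    The method names here are illustrative only.
    """
    if not scores:
        raise ValueError("a work set needs at least one record score")
    if method == "max":
        # The work is as relevant as its most relevant record.
        return max(scores)
    if method == "sum":
        # Rewards work sets that simply contain many records.
        return sum(scores)
    if method == "logsum":
        # Assumption: log(1 + sum), which stays non-negative.
        return math.log1p(sum(scores))
    raise ValueError(f"unknown method: {method}")


# Jonathan's example: four manifestations at .7 vs. one at .9.
work_a = [0.7, 0.7, 0.7, 0.7]
work_b = [0.9]

for method in ("max", "sum", "logsum"):
    a = group_score(work_a, method)
    b = group_score(work_b, method)
    first = "A" if a > b else "B"
    print(f"{method}: A={a:.3f} B={b:.3f} -> work {first} ranks first")
```

With these numbers the maximum puts work B first, while the sum and
the log of the sum both put work A first. Taken by itself the log
cannot reorder groups relative to the plain sum, since it is
monotonic; it only matters once the group score is combined with
something else, such as a holdings boost.
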
--Colleen
David Walker wrote:
>The only tricky thing about this with WorldCat, though, is that you have
>such a large mix of libraries.
>
>In my own searching on WorldCat, I've noticed that a fair amount of
>fiction and non-scholarly works appear near the top of results because
>the public libraries are skewing the holdings of those titles.
>
>Not a bad thing in itself, if that's what I'm looking for, but our
>students are looking for scholarly works (and still learning to
>distinguish scholarly from not), so it would be nice in our particular
>context to limit only to academic libraries that own the title.
>
>--Dave
>
>=========================
>David Walker
>Web Development Librarian
>Library, Cal State San Marcos
>760-750-4379
>http://public.csusm.edu/dwalker
>=========================
>
>
>
>
>
>-----Original Message-----
>From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>Hickey,Thom
>Sent: Monday, April 10, 2006 12:52 PM
>To: [log in to unmask]
>Subject: Re: [CODE4LIB] Question re: ranking and FRBR
>
>I'd agree with this.
>
>Actually, though, 'relevancy' ranking based on where terms occur in the
>record and how many times they occur is of minor help compared to some
>sort of popularity score. WorldCat holdings work fairly well for that,
>as should circulation data. The primary example of this sort of ranking
>is the web search engines where ranking is based primarily on word
>proximity and links.
>
>--Th
>
>
>-----Original Message-----
>From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>Jonathan Rochkind
>Sent: Monday, April 10, 2006 3:16 PM
>To: [log in to unmask]
>Subject: Re: [CODE4LIB] Question re: ranking and FRBR
>
>When you are ranking on number of holdings like OCLC is, a straight
>sum makes sense to me---the sum of all libraries holding copies of
>any manifestation of the FRBR work is indeed the sum of the holdings
>for all the records in the FRBR work set. Of course.
>
>If you're doing relevancy rankings instead, though, a straight sum
>makes less sense. A relevancy ranking isn't really amenable to being
>summed. The sum of the relevancy rankings for various
>manifestations/expressions is probably not a valid indicator of
>how relevant the work is to the user, right? And if you did it this
>way, it would tend to make the most _voluminous_ work always come out
>first as the most 'relevant', which isn't quite right. This isn't
>quite the same problem as OCLC's having the Bible come out on
>top: since OCLC is ranking by holdings, it's exactly right to have
>the Bible come out on top. The Bible is indeed surely one of the
>(#1?) most held works, so it's quite right for it to be on top. But
>the Bible isn't always going to be the most relevant result for a
>user, just because it's the most voluminous! Summing is going to
>mess up your relevancy rankings.
>
>Just using the maximum relevancy ranking from the work set seems
>acceptable to me: the work's relevancy to the user is indicated by
>the most relevant manifestation in the set. There might be a better
>way to do it. (Is a work with four manifestations, each with a
>relevancy ranking of .7, more relevant than a work with just one
>manifestation with a ranking of .9? It probably isn't, actually; I
>think just taking the maximum should work fine. But it may depend on
>the relevancy algorithm.) I don't think I'm enough of a
>mathematician to understand the point of the log of the sum, though,
>hmm.
>
>--Jonathan
>
>At 2:38 PM -0400 4/10/06, Hickey,Thom wrote:
>
>
>>We're doing straight sums of the holdings of all the manifestations in
>>the work. It's hard for me to see the need to discount holdings in
>>multiple manifestations. It does mean that 'bible' tends to come to the
>>top for many searches, but that's about the only work-set I see coming
>>up unexpectedly to the top.
>>
>>If we had circulation data we'd certainly factor that in (or maybe just
>>use it if it was comprehensive enough).
>>
>>--Th
>>
>>
>>-----Original Message-----
>>From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>>Colleen Whitney
>>Sent: Monday, April 10, 2006 2:04 PM
>>To: [log in to unmask]
>>Subject: Re: [CODE4LIB] Question re: ranking and FRBR
>>
>>Thanks...is it just a straight sum, Thom?
>>
>>--C
>>
>>Hickey,Thom wrote:
>>
>>
>>
>>>Here at OCLC we're ranking based on the holdings of all the records in
>>>the retrieved work set. Seems to work pretty well.
>>>
>>>--Th
>>>
>>>-----Original Message-----
>>>From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>>>Colleen Whitney
>>>Sent: Monday, April 10, 2006 1:06 PM
>>>To: [log in to unmask]
>>>Subject: [CODE4LIB] Question re: ranking and FRBR
>>>
>>>Hello all,
>>>
>>>Here's a question for anyone who has been thinking about or working
>>>with FRBR for creating record groupings for display. (Perhaps others
>>>have already discussed or addressed this...in which case I'd be happy
>>>to have a pointer to resources that are already out there.)
>>>
>>>In a retrieval environment that presents ranked results (ranked by
>>>record content, optionally boosted by circulation and/or holdings),
>>>how could/should FRBR-like record groupings be factored into ranking?
>>>Several approaches have been discussed here:
>>> - Rank the results using the score from the highest-scoring record
>>>in a group
>>> - Use the sum of scores of documents in a group (this seems to me to
>>>place too much weight on the group)
>>> - Use the log of the sum of the scores of documents in a group
>>>
>>>I'd be very interested in knowing whether others have already been
>>>thinking about this....
>>>
>>>Regards,
>>>
>>>--Colleen Whitney
>>>
>>>
>>>
>>>