LISTSERV 16.5 - CODE4LIB Archives

A visitor here yesterday made an observation relevant to this
discussion.  We were looking at the results of a search for "Don
Quixote" in a yet-to-be-released version of FictionFinder.  The results
were ranked by the number of libraries holding each 'work'.  Here's an
abbreviated version of the results list:

1. Don Quixote  / Cervantes Saavedra, Miguel de
2. History of the Adventures of Joseph Andrews  / Fielding, Henry
3. Morgenlandfahrt  / Hesse, Hermann
4. The Ingenious Gentleman Don Quixote de la Mancha  / Cervantes
   Saavedra, Miguel de
5. The Adventures of Don Quixote  / Cervantes Saavedra, Miguel de
6. The First Part of the Delightful History of the Most Ingenious Knight
   Don Quixote of the Mancha  / Cervantes Saavedra, Miguel de

Because of some title variations, not all the Don Quixotes are brought
together.  The visitor's point, though, was that #2 by Fielding really
shouldn't be ranked higher than #4, #5, and #6, which seem more closely
related to the "Don Quixote" search than Fielding's work is (even though
Joseph Andrews is related to Don Quixote).

Of course this might be just the right ordering for someone, but in
general an ordering that takes into account where the search terms
occurred in the records, in addition to how popular the works are,
should work better than one that ignores that information.
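
A minimal sketch of such a combined ordering, in Python.  Everything in
it is an assumption made for illustration: the field names, the field
weights, the log damping of the holdings count, and the holdings numbers
themselves are invented and are not taken from FictionFinder or WorldCat.

```python
import math

# Hypothetical field weights: a match in the title counts for more than a
# match in the author or a note field. Illustrative values, not tuned.
FIELD_WEIGHTS = {"title": 3.0, "author": 2.0, "notes": 1.0}


def match_score(query, record):
    """Sum over fields of weight * fraction of query terms found in the field."""
    terms = query.lower().split()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = record.get(field, "").lower().split()
        hits = sum(1 for term in terms if term in words)
        score += weight * hits / len(terms)
    return score


def combined_score(query, record):
    """Term placement multiplied by a log-damped holdings count."""
    return match_score(query, record) * math.log1p(record["holdings"])


def rank(query, records):
    """Return records sorted by combined score, best first."""
    return sorted(records, key=lambda r: combined_score(query, r), reverse=True)


# Made-up records and holdings counts, for illustration only.
records = [
    {"title": "Don Quixote", "author": "Cervantes Saavedra, Miguel de",
     "notes": "", "holdings": 9000},
    {"title": "History of the Adventures of Joseph Andrews",
     "author": "Fielding, Henry",
     "notes": "in imitation of the author of Don Quixote", "holdings": 4000},
    {"title": "The Adventures of Don Quixote",
     "author": "Cervantes Saavedra, Miguel de",
     "notes": "", "holdings": 1500},
]

ranked = rank("Don Quixote", records)
for record in ranked:
    print(record["title"])
```

With these invented numbers, sorting by holdings alone would put Joseph
Andrews second; the combined score moves it below both Don Quixote
titles because its only match is in the lower-weighted notes field.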

--Th


-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
Keith Jenkins
Sent: Tuesday, April 11, 2006 2:49 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] Question re: ranking and FRBR

A very interesting discussion here... so I'll support its funding with
my own two cents.

I'd argue that search relevance is a product of two factors:
  A. The overall popularity of an item
  B. The appropriateness to a given query

Both are approximate measures with their own difficulties, but a good
search usually needs to focus on both (unless B is so restrictive that
we don't need A).

B is always going to be constrained, to various degrees, by the limited
nature of the user's input--usually just a couple of words.  If a user
isn't very specific, then it is indeed quite difficult to determine
what would be most relevant to that user.  That's where A can really
help to sort a large number of results (although B can also help with
sorting).  I think Thom makes a good point here:

On 4/10/06, Hickey,Thom <[log in to unmask]> wrote:
> Actually, though, 'relevancy' ranking based on where terms occur in the
> record and how many times they occur is of minor help compared to some
> sort of popularity score.  WorldCat holdings work fairly well for that,
> as should circulation data.

In fact, it was this sort of "popularity score" logic that originally
enabled Google to provide a search engine far better than what was
possible using just term placement and frequency metrics for each
document.  Word frequency is probably useless for our short
bibliographic records that are often cataloged at differing levels of
completeness.  But I think it could still be useful to give more
weight to the title and primary author of a book.

The basic mechanism of Google's PageRank algorithm is this: a link
from page X to page Y is a vote by X for Y, and the number of votes
for Y determines the power of Y's vote for other pages.  We could
apply this to FRBR records, if we think of every FRBR relationship as
a two-way link.  In this way, all the items link to the
manifestations, which link to the expressions, which link to the
works.  All manner of derivative works would also be linked to the
original works.  So the most highly-related works get ranked the
highest.  (For the algorithmically-minded, I found the article "XRANK:
Ranked Keyword Search over XML Documents" helpful in understanding how
the PageRank algorithm can be applied to other situations:
http://www.cs.cornell.edu/~cbotev/XRank.pdf )  It would be interesting
to see how such an approach compares to a simple tally of "number of
versions".

-Keith