Arash - you might not want to use a straight dump of WorldCat catalog
records, at least not without the associated holdings information.*

There are a lot of quasi-duplicate records that are sufficiently broken
that the WorldCat de-duplication algorithm refuses to merge them. These
records will usually only be used by a handful of institutions; the better
records will tend to have more associated holdings. The holdings count
should therefore be used to weight the strength of association between
class numbers and features.
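
A minimal sketch of that weighting, in Java to match the snippet further
down the thread. Everything here is hypothetical: the BibRecord fields, the
class name, and the choice to use the raw institution count as the weight
are assumptions for illustration, not anything WorldCat provides.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from WorldCat or an OCLC API.
final class HoldingsWeightedCounts {

    // One bib record: its DDC class number, the features extracted from it,
    // and the number of distinct institutions holding it.
    static final class BibRecord {
        final String ddc;
        final List<String> features;
        final int institutions;

        BibRecord(String ddc, List<String> features, int institutions) {
            this.ddc = ddc;
            this.features = features;
            this.institutions = institutions;
        }
    }

    // For each (class number, feature) pair, sum the institution counts of
    // the records in which the pair occurs. Using the raw count as the
    // weight is an assumption; a damped weight may suit some classifiers.
    static Map<String, Map<String, Double>> weigh(List<BibRecord> records) {
        Map<String, Map<String, Double>> weights = new HashMap<>();
        for (BibRecord r : records) {
            Map<String, Double> perFeature =
                    weights.computeIfAbsent(r.ddc, k -> new HashMap<>());
            for (String feature : r.features) {
                perFeature.merge(feature, (double) r.institutions, Double::sum);
            }
        }
        return weights;
    }
}
```

With this scheme a quasi-duplicate held by two institutions contributes
almost nothing next to a well-maintained record held by hundreds, which is
the effect described above.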

Also, since classification/categorization is something that is usually
considered to be a property of works, rather than manifestations, one might
get better results by using Work sets for training.

I would suggest, er, contacting  Thom Hickey.

Simon

* Well, not precisely holdings - you just need the number of distinct
institutions with at least one copy.  I call them 'hasings'.
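
As a rough illustration of that count, assuming holdings arrive as
(record id, institution symbol) rows with one row per copy. That layout,
the class name and the method name are all made up for the example:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; the row layout {recordId, institutionSymbol} is assumed.
final class DistinctHolders {

    // Count, per record, the number of distinct institutions that hold at
    // least one copy. Extra copies at the same institution are ignored.
    static Map<String, Integer> count(List<String[]> holdings) {
        Map<String, Set<String>> holders = new HashMap<>();
        for (String[] row : holdings) {
            holders.computeIfAbsent(row[0], k -> new HashSet<>()).add(row[1]);
        }
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : holders.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }
}
```

A record with three copies at one library and one copy at another gets a
count of 2, not 4.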

On Sat, May 19, 2012 at 8:42 PM, Roy Tennant <[log in to unmask]> wrote:

> Arash,
> Yes, we have made WorldCat available to researchers under a special
> license agreement. I suggest contacting Thom Hickey<[log in to unmask]>
> about such an arrangement. Thanks,
> Roy
>
> On Fri, May 18, 2012 at 3:46 AM, Arash.Joorabchi <[log in to unmask]>
> wrote:
> > Dear Karen,
> >
> > I am conducting a research experiment on automatic text classification
> and I am trying to retrieve top matching bib records (which include DDC
> fields) for a set of keyphrases extracted from a given document. So, I
> suppose this is a rather exceptional use case. In fact, the right approach
> for this experiment is to process the full dump of WorldCat database
> directly rather than sending a limited number of queries via the API.
> >
> > I read here:
> > http://dltj.org/article/worldcat-lld-may-become-available-under-odc-by/
> > that WorldCat might become available as open linked data in future,
> which would solve my problem and help similar text mining projects.
> However, I wonder if it is currently available to researchers under a
> research/non-commercial use license agreement.
> >
> > Regards,
> > Arash
> >
> > -----Original Message-----
> > From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Karen Coombs
> > Sent: 17 May 2012 08:37
> > To: [log in to unmask]
> > Subject: Re: [CODE4LIB] WorldCat SRU queries - elimination of records
> without a DDC no from the result set
> >
> > I forwarded this thread to the Product Manager for the WorldCat Search
> > API. She responded back that unfortunately this query is not possible
> > using the API at this time.
> >
> > FYI, the SRU interface to WorldCat Search API doesn't currently
> > support any scan type searches either.
> >
> > Is there a particular use case you're trying to support? Knowing that
> > would help us document this as a possible enhancement.
> >
> > Karen
> >
> > Karen Coombs
> > Senior Product Analyst
> > Web Services
> > OCLC
> > [log in to unmask]
> >
> > On Wed, May 16, 2012 at 9:49 PM, Arash.Joorabchi <[log in to unmask]>
> wrote:
> >> Hi Andy,
> >>
> >>
> >>
> >> I am an SRU newbie myself, so I don't know how this could be achieved
> >> using scan operations and could not find much info on the SRU website
> >> (http://www.loc.gov/standards/sru/).
> >>
> >> As for the wildcards, according to this guide:
> >>
> >> http://www.oclc.org/support/documentation/worldcat/searching/refcard/searchworldcatquickreference.pdf
> >> the symbols should be preceded by at least 3 characters, and therefore
> >> clauses like:
> >>
> >>
> >>
> >> ... AND srw.dd=*
> >>
> >> ... AND srw.dd=?.*
> >>
> >> ... AND srw/dd=###.*
> >>
> >> ... AND srw/dd=?3.*
> >>
> >>
> >>
> >>
> >>
> >> do not work and result in the following error:
> >>
> >> Diagnostics
> >>
> >> Identifier:     info:srw/diagnostic/1/9
> >> Meaning:
> >> Details:
> >> Message:        Not enough chars in truncated term:Truncated words too short(9)
> >>
> >>
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Arash
> >>
> >>
> >>
> >> ________________________________
> >>
> >> From: Houghton,Andrew [mailto:[log in to unmask]]
> >> Sent: 16 May 2012 11:58
> >> To: Arash.Joorabchi
> >> Subject: Re: [CODE4LIB] WorldCat SRU queries - elimination of records
> >> without a DDC no from the result set
> >>
> >>
> >>
> >> I'm not an SRU guru, but is it possible to do a scan and look for
> >> postings of zero?
> >>
> >>
> >>
> >> Andy.
> >>
> >> On May 16, 2012, at 6:39, "Arash.Joorabchi" <[log in to unmask]>
> >> wrote:
> >>
> >>        Hi mark,
> >>
> >>        Srw.dd=* does not work either:
> >>
> >>        Identifier:     info:srw/diagnostic/1/27
> >>        Meaning:
> >>        Details:        srw.dd
> >>        Message:        The index [srw.dd] did not include a searchable
> >> value
> >>
> >>        I suppose the only option left is to retrieve everything and
> >> filter the results on the client side.
> >>
> >>        Thanks for your quick reply.
> >>        Arash
> >>
> >>
> >>        -----Original Message-----
> >>        From: Code for Libraries [mailto:[log in to unmask]] On
> >> Behalf Of Mike Taylor
> >>        Sent: 16 May 2012 10:43
> >>        To: [log in to unmask]
> >>        Subject: Re: [CODE4LIB] WorldCat SRU queries - elimination of
> >> records without a DDC no from the result set
> >>
> >>        There is no standard way in CQL to express "field X is not empty".
> >>        Depending on implementations, NOT srw.dd="" might work (but
> >>        evidently doesn't in this case).  Another possibility is srw.dd=*,
> >>        but again that may or may not work, and might be appallingly
> >>        inefficient if it does.  NOT srw.dd=null will definitely not work:
> >>        "null" is not a special word in CQL.
> >>
> >>        -- Mike.
> >>
> >>
> >>        On 16 May 2012 10:32, Arash.Joorabchi <[log in to unmask]>
> >> wrote:
> >>        >  Hi all,
> >>        >
> >>        > I am sending SRU queries to the WorldCat in the following
> >> form:
> >>        >
> >>        >
> >>        >   String host = "http://worldcat.org/webservices/catalog/search/";
> >>        >   String query = "sru?query=srw.kw=\"" + keyword + "\""
> >>        >           + " AND srw.ln exact \"eng\""
> >>        >           + " AND srw.mt all \"bks\""
> >>        >           + " AND srw.nt=\"" + keyword + "\""
> >>        >           + "&servicelevel=full"
> >>        >           + "&maximumRecords=100"
> >>        >           + "&sortKeys=relevance,,0"
> >>        >           + "&wskey=[wskey]";
> >>        >
> >>        > And it is working fine, however I'd like to limit the results
> >>        > to those records that have a DDC number assigned to them, but
> >>        > I don't know what's the right way to specify this limit in the
> >>        > query.
> >>        >
> >>        >  NOT srw.dd=""
> >>        >  NOT srw.dd=null
> >>        >
> >>        > Neither of the above works.
> >>        >
> >>        >
> >>        > Thanks,
> >>        > Arash
> >>        >
> >>