Thank you Roy and Simon for the info.
As for your second point, I suppose one advantage of using the WorldCat
API at this experimental stage is that the returned bib records are
already FRBR-ized.
Ross - Thanks for the link to the Open Library data dump. The WorldCat
collection is two orders of magnitude larger than Open Library, which
makes a significant difference given the skewness and sparsity of bib
records classified according to library taxonomies such as DDC and LCC
(for more info, see:
http://cdm15003.contentdm.oclc.org/cdm/singleitem/collection/p267701coll27/id/277/rec/28).
Thanks,
Arash
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Simon Spero
Sent: 22 May 2012 19:47
To: [log in to unmask]
Subject: Re: [CODE4LIB] WorldCat SRU queries - elimination of records without a DDC no from the result set
Arash - you might not want to use a straight dump of WorldCat catalog
records, at least not without the associated holdings information.*
There are a lot of quasi-duplicate records that are sufficiently broken
that the WorldCat de-duplication algorithm refuses to merge them. These
records will usually only be used by a handful of institutions; the
better records will tend to have more associated holdings. The holdings
count should be used to weight the strength of association between class
numbers and features.
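
A minimal sketch of the weighting idea above, not anyone's actual
implementation. It assumes each training record has already been reduced
to a class number, a list of features, and the number of distinct holding
institutions. The BibRecord class and the associate method are
hypothetical names made up for illustration, and using the raw count as
the weight (rather than, say, a damped value) is also an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HoldingsWeighting {

    /** Hypothetical minimal view of a bib record; not a WorldCat schema. */
    static final class BibRecord {
        final String classNumber;     // e.g. a DDC number such as "025.04"
        final List<String> features;  // e.g. keyphrases taken from the record
        final int holdingLibraries;   // distinct institutions with at least one copy

        BibRecord(String classNumber, List<String> features, int holdingLibraries) {
            this.classNumber = classNumber;
            this.features = features;
            this.holdingLibraries = holdingLibraries;
        }
    }

    /**
     * For every (feature, class number) pair, sums the holdings counts of the
     * records containing both, so widely held records dominate the association.
     */
    static Map<String, Map<String, Double>> associate(List<BibRecord> records) {
        Map<String, Map<String, Double>> weights = new HashMap<>();
        for (BibRecord record : records) {
            for (String feature : record.features) {
                weights.computeIfAbsent(feature, k -> new HashMap<>())
                       .merge(record.classNumber, (double) record.holdingLibraries, Double::sum);
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        // Made-up sample data: one widely held record and one poorly held near-duplicate.
        List<BibRecord> records = Arrays.asList(
                new BibRecord("025.04", Arrays.asList("information retrieval"), 120),
                new BibRecord("004", Arrays.asList("information retrieval"), 2));
        System.out.println(associate(records));
    }
}
```

With weights like these, a stray class number on a record held by two
institutions contributes far less than the number on the widely held one.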
Also, since classification/categorization is something that is usually
considered to be a property of works, rather than manifestations, one
might get better results by using Work sets for training.
I would suggest, er, contacting Thom Hickey.
Simon
* Well, not precisely holdings - you just need the number of distinct
institutions with at least one copy. I call them 'hasings'.
On Sat, May 19, 2012 at 8:42 PM, Roy Tennant <[log in to unmask]> wrote:
> Arash,
> Yes, we have made WorldCat available to researchers under a special
> license agreement. I suggest contacting Thom Hickey<[log in to unmask]>
> about such an arrangement. Thanks,
> Roy
>
> On Fri, May 18, 2012 at 3:46 AM, Arash.Joorabchi <[log in to unmask]> wrote:
> > Dear Karen,
> >
> > I am conducting a research experiment on automatic text classification
> > and I am trying to retrieve the top matching bib records (which include
> > DDC fields) for a set of keyphrases extracted from a given document. So, I
> > suppose this is a rather exceptional use case. In fact, the right approach
> > for this experiment is to process the full dump of the WorldCat database
> > directly rather than sending a limited number of queries via the API.
> >
> > I read here:
> > http://dltj.org/article/worldcat-lld-may-become-available-under-odc-by/
> > that WorldCat might become available as open linked data in future,
> > which would solve my problem and help similar text mining projects.
> > However, I wonder if it is currently available to researchers under a
> > research/non-commercial use license agreement.
> >
> > Regards,
> > Arash
> >
> > -----Original Message-----
> > From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Karen Coombs
> > Sent: 17 May 2012 08:37
> > To: [log in to unmask]
> > Subject: Re: [CODE4LIB] WorldCat SRU queries - elimination of records without a DDC no from the result set
> >
> > I forwarded this thread to the Product Manager for the WorldCat Search
> > API. She responded that unfortunately this query is not possible
> > using the API at this time.
> >
> > FYI, the SRU interface to the WorldCat Search API doesn't currently
> > support any scan type searches either.
> >
> > Is there a particular use case you're trying to support? Knowing that
> > would help us document this as a possible enhancement.
> >
> > Karen
> >
> > Karen Coombs
> > Senior Product Analyst
> > Web Services
> > OCLC
> > [log in to unmask]
> >
> > On Wed, May 16, 2012 at 9:49 PM, Arash.Joorabchi <[log in to unmask]> wrote:
> >> Hi Andy,
> >>
> >>
> >>
> >> I am an SRU newbie myself, so I don't know how this could be achieved
> >> using scan operations, and I could not find much info on the SRU website
> >> (http://www.loc.gov/standards/sru/).
> >>
> >> As for the wildcards, according to this guide:
> >> http://www.oclc.org/support/documentation/worldcat/searching/refcard/searchworldcatquickreference.pdf
> >> the symbols should be preceded by at least 3 characters, and therefore
> >> clauses like:
> >>
> >>
> >>
> >> ... AND srw.dd=*
> >>
> >> ... AND srw.dd=?.*
> >>
> >> ... AND srw.dd=###.*
> >>
> >> ... AND srw.dd=?3.*
> >>
> >> do not work and result in the following error:
> >>
> >> Diagnostics
> >>
> >> Identifier: info:srw/diagnostic/1/9
> >> Meaning:
> >> Details:
> >> Message: Not enough chars in truncated term:Truncated words too short(9)
> >>
> >>
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Arash
> >>
> >>
> >>
> >> ________________________________
> >>
> >> From: Houghton,Andrew [mailto:[log in to unmask]]
> >> Sent: 16 May 2012 11:58
> >> To: Arash.Joorabchi
> >> Subject: Re: [CODE4LIB] WorldCat SRU queries - elimination of records without a DDC no from the result set
> >>
> >>
> >>
> >> I'm not an SRU guru, but is it possible to do a scan and look for a
> >> postings count of zero?
> >>
> >>
> >>
> >> Andy.
> >>
> >> On May 16, 2012, at 6:39, "Arash.Joorabchi" <[log in to unmask]>
> >> wrote:
> >>
> >> Hi Mike,
> >>
> >> srw.dd=* does not work either:
> >>
> >> Identifier: info:srw/diagnostic/1/27
> >> Meaning:
> >> Details: srw.dd
> >> Message: The index [srw.dd] did not include a searchable value
> >> I suppose the only option left is to retrieve everything and
> >> filter the results on the client side.
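
A minimal sketch of the client-side filtering mentioned above. It assumes
the SRU response carries MARCXML records (elements in the MARC 21 slim
namespace) and that a record's DDC number, when present, is in field 082
subfield a. The DdcFilter class and its method names are made up for
illustration; the real response layout should be checked before relying
on this.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class DdcFilter {
    static final String MARC_NS = "http://www.loc.gov/MARC21/slim";

    /** Returns the first non-empty 082 $a value of a MARCXML record, or null if there is none. */
    static String firstDdc(Element record) {
        NodeList fields = record.getElementsByTagNameNS(MARC_NS, "datafield");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            if (!"082".equals(field.getAttribute("tag"))) {
                continue;
            }
            NodeList subfields = field.getElementsByTagNameNS(MARC_NS, "subfield");
            for (int j = 0; j < subfields.getLength(); j++) {
                Element subfield = (Element) subfields.item(j);
                String value = subfield.getTextContent().trim();
                if ("a".equals(subfield.getAttribute("code")) && !value.isEmpty()) {
                    return value;
                }
            }
        }
        return null;
    }

    /** Parses a response body and returns the DDC numbers of the records that have one. */
    static List<String> ddcNumbers(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getElementsByTagNameNS
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList records = doc.getElementsByTagNameNS(MARC_NS, "record");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < records.getLength(); i++) {
                String ddc = firstDdc((Element) records.item(i));
                if (ddc != null) {
                    result.add(ddc); // records without an 082 $a are skipped
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse response as XML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample input, not real WorldCat output.
        String xml = "<collection xmlns=\"" + MARC_NS + "\">"
                + "<record><datafield tag=\"082\" ind1=\"0\" ind2=\"4\">"
                + "<subfield code=\"a\">025.04</subfield></datafield></record>"
                + "<record><datafield tag=\"245\" ind1=\"0\" ind2=\"0\">"
                + "<subfield code=\"a\">A title without a class number</subfield></datafield></record>"
                + "</collection>";
        System.out.println(ddcNumbers(xml));
    }
}
```

The obvious cost is that every page of results has to be fetched and
parsed, including the records that are then thrown away.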
> >>
> >> Thanks for your quick reply.
> >> Arash
> >>
> >>
> >> -----Original Message-----
> >> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Mike Taylor
> >> Sent: 16 May 2012 10:43
> >> To: [log in to unmask]
> >> Subject: Re: [CODE4LIB] WorldCat SRU queries - elimination of records without a DDC no from the result set
> >>
> >> There is no standard way in CQL to express "field X is not empty".
> >> Depending on implementations, NOT srw.dd="" might work (but evidently
> >> doesn't in this case). Another possibility is srw.dd=*, but again
> >> that may or may not work, and might be appallingly inefficient if it
> >> does. NOT srw.dd=null will definitely not work: "null" is not a
> >> special word in CQL.
> >>
> >> -- Mike.
> >>
> >>
> >> On 16 May 2012 10:32, Arash.Joorabchi <[log in to unmask]> wrote:
> >> > Hi all,
> >> >
> >> > I am sending SRU queries to WorldCat in the following form:
> >> >
> >> > String host = "http://worldcat.org/webservices/catalog/search/";
> >> > String query = "sru?query=srw.kw=\"" + keyword + "\""
> >> >         + " AND srw.ln exact \"eng\""
> >> >         + " AND srw.mt all \"bks\""
> >> >         + " AND srw.nt=\"" + keyword + "\""
> >> >         + "&servicelevel=full"
> >> >         + "&maximumRecords=100"
> >> >         + "&sortKeys=relevance,,0"
> >> >         + "&wskey=[wskey]";
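
The request above puts spaces and double quotes straight into the URL. A
minimal sketch of the same request with the CQL query percent-encoded,
using only the parameters already shown in this message. The
SruUrlBuilder class and its method names are made up for illustration,
and it is an assumption that the service accepts '+' for spaces, which is
what java.net.URLEncoder produces.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class SruUrlBuilder {
    static final String HOST = "http://worldcat.org/webservices/catalog/search/";

    /** Percent-encodes one query-string value as UTF-8 (spaces become '+'). */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Builds the same request as the original snippet, with the CQL query and key encoded. */
    static String buildUrl(String keyword, String wskey) {
        String cql = "srw.kw=\"" + keyword + "\""
                + " AND srw.ln exact \"eng\""
                + " AND srw.mt all \"bks\""
                + " AND srw.nt=\"" + keyword + "\"";
        return HOST + "sru?query=" + encode(cql)
                + "&servicelevel=full"
                + "&maximumRecords=100"
                + "&sortKeys=relevance,,0"
                + "&wskey=" + encode(wskey);
    }

    public static void main(String[] args) {
        // "YOUR_WSKEY" is a placeholder, not a real key.
        System.out.println(buildUrl("text mining", "YOUR_WSKEY"));
    }
}
```

A keyword that itself contains a double quote would still need escaping
at the CQL level before it is encoded; the sketch does not handle that.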
> >> >
> >> > And it is working fine; however, I'd like to limit the results to those
> >> > records that have a DDC number assigned to them, but I don't know the
> >> > right way to specify this limit in the query.
> >> >
> >> > NOT srw.dd=""
> >> > NOT srw.dd=null
> >> >
> >> > Neither of the above works.
> >> >
> >> >
> >> > Thanks,
> >> > Arash
> >> >
> >>
> >> ________________________________
> >>
> >> No virus found in this message.
> >> Checked by AVG - www.avg.com
> >> Version: 2012.0.2176 / Virus Database: 2425/5001 - Release Date:
> >> 05/15/12