Hi Karen,

Those who want to play with this data in their own triplestore may be
interested in my post about doing that myself: Putting WorldCat Data Into
A Triple Store <http://dataliberate.com/2012/08/putting-worldcat-data-into-a-triple-store/>

As the release of this data is experimental, and [with the help of
conversations like this] the way we are transforming the data is evolving,
I do not expect to be able to 'publish official documentation' as it will
change over time.  My hope, post vacation time, is to get someone to share
some of the steps they went through in this process, maybe in a blog post.

I am intrigued by your identification of punctuation differences; it seems
as if one of the outputs has been through an extra cleanup step.  I will
find out.
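
To illustrate the kind of cleanup step that could account for this, here is
a minimal Python sketch.  It is an assumption, not the actual processing: the
function name normalize_place and the set of trailing characters it strips
are hypothetical, chosen only to match the two examples quoted below.

```python
import re

# Assumed set of trailing ISBD-style separators: whitespace, colon,
# semicolon, comma. A final period is kept, since "N.Y." needs it.
_TRAILING = re.compile(r"[\s:;,]+$")

def normalize_place(name: str) -> str:
    """Strip trailing ISBD-style punctuation from a place-name string."""
    return _TRAILING.sub("", name.strip())

# (value in the triples file, value in the RDFa on the Web page)
pairs = [
    ("New York", "New York :"),
    ("Garden City, N.Y.", "Garden City, N.Y.,"),
]
for file_value, web_value in pairs:
    # Compare the two forms once the trailing punctuation is removed.
    print(normalize_place(file_value) == normalize_place(web_value))
```

With a step like this applied to one output and not the other, the two
would differ only in the trailing punctuation, which is what the examples
show.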

On the creation of multiple identifiers for each instance of a place name -
this is a symptom of the way the experimental data is created using what
are called blank nodes.  Ideally we would have minted a URI for each unique
place and linked all references to it.  Unfortunately, this was not easily
achievable, as part of the experiment, on top of production WorldCat.
Solving issues such as this is on the agenda as our work in this area
evolves.
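
For anyone experimenting with the download locally, the idea of one URI per
unique place can be approximated after the fact.  The Python sketch below is
only an illustration under stated assumptions: the example.org namespace and
the function mint_place_uris are hypothetical, and matching on the name
string alone will wrongly merge distinct places that share a name.

```python
import re
from urllib.parse import quote

# Matches N-Triples lines of the form:
#   _:label <http://schema.org/name> "Some Place" .
NAME_TRIPLE = re.compile(
    r'^(_:\S+)\s+<http://schema\.org/name>\s+"([^"]*)"\s*\.\s*$'
)

# Hypothetical namespace for the minted URIs (an assumption).
BASE = "http://example.org/place/"

def mint_place_uris(lines):
    """Map each blank node label to a URI shared by all nodes with that name."""
    mapping = {}
    for line in lines:
        match = NAME_TRIPLE.match(line.strip())
        if match:
            bnode, name = match.groups()
            # Same name string -> same URI, so duplicates collapse.
            mapping[bnode] = BASE + quote(name)
    return mapping

sample = [
    '_:AX2dX44931a01X3aX138a139ed19X3aXX2dX600d <http://schema.org/name> "Garden City, N.Y." .',
    '_:AX2dX44a4d9f9X3aX138a132b9d9X3aXX2dX4efe <http://schema.org/name> "Garden City, N.Y." .',
]
uris = mint_place_uris(sample)
print(len(uris), "blank nodes,", len(set(uris.values())), "minted URI(s)")
```

The resulting mapping could then be used to rewrite every triple that
mentions one of those blank nodes, so that all references point at the
shared URI.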

Keep the comments coming - they are very helpful.

~Richard.

On 23 August 2012 00:56, Karen Coyle <[log in to unmask]> wrote:

> On 8/22/12 2:56 PM, Richard Wallis wrote:
>
>> Hi Karen,
>>
>> I was not ignoring your previous question about where, in MARC terms, data
>> was coming from.  I need to talk with someone who was in the core of the
>> processing that produces the data.  Unfortunately I am currently being
>> thwarted by vacations.
>>
>
> Richard, I understand, and apologize if I appeared to be pushing too hard.
> In my own experience, requests for documentation are met with groans,
> especially by folks who'd rather be "doing something useful," like writing
> code. Unfortunately, it really helps to explain what you've done.
>
> I think I've solved the question of where the place of publication comes
> from: 260 $a. The difference between the Web version and the triples
> version is in punctuation. I'm still looking at examples, but it's a slog
> since I'm re-creating records in the triples file with my minimal knowledge
> of "grep" -- a hammer, but the best darned hammer there is. Here are some
> examples:
>
> #1
> File:
>
> <http://www.worldcat.org/oclc/43836713>
> <http://purl.org/library/placeOfPublication>
> _:AX2dX40d4c600X3aX138a12b56f9X3aXX2dX49b9
> _:AX2dX40d4c600X3aX138a12b56f9X3aXX2dX49b9 <http://schema.org/name>
> "New York"
>
>
> Web: (Using RDFa It Firefox plugin [1])
> <http://www.worldcat.org/oclc/43836713>
> a schema:Book;
>       library:placeOfPublication [ a schema:Place;
>          schema:name "New York :"@en ];
>
> #2
>
> File:
> _:A52eb8ca1X3aX138a1313c61X3aXX2dX7536 <http://schema.org/name>
> "Garden City, N.Y." .
> _:A52eb8ca1X3aX138a1313c61X3aXX2dX7536
> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Place> .
> <http://www.worldcat.org/oclc/524483>
> <http://purl.org/library/placeOfPublication>
> _:A52eb8ca1X3aX138a1313c61X3aXX2dX7536 .
>
> Web:
> <http://www.worldcat.org/oclc/524483>
> a schema:Book;
>     library:holdingsCount "803"@en;
>     library:oclcnum "524483"@en;
>     library:placeOfPublication [ a schema:Place;
>             schema:name "Garden City, N.Y.,"@en ];
>
> Another piece of information is that each instance of a place of
> publication string is given a new identity:
>
> _:AX2dX44931a01X3aX138a139ed19X3aXX2dX600d <http://schema.org/name>
> "Garden City, N.Y." .
> _:AX2dX44a4d9f9X3aX138a132b9d9X3aXX2dX4efe <http://schema.org/name>
> "Garden City, N.Y." .
> _:AX2dX44a4d9f9X3aX138a1378e1dX3aXX2dX1d8d <http://schema.org/name>
> "Garden City, N.Y." .
> _:AX2dX45b46946X3aX138a139141cX3aXX2dX7073 <http://schema.org/name>
> "Garden City, N.Y." .
> _:AX2dX4a6da202X3aX138a1387049X3aXX2dX7bcc <http://schema.org/name>
> "Garden City, N.Y." .
> _:AX2dX4b32d4b9X3aX138a1316a9bX3aXX2dX5f92 <http://schema.org/name>
> "Garden City, N.Y." .
> _:AX2dX4b5c4da3X3aX138a135d400X3aXX2dX515e <http://schema.org/name>
> "Garden City, N.Y." .
> _:AX2dX4b93edacX3aX138a1314f3eX3aXX2dX58e9 <http://schema.org/name>
> "Garden City, N.Y." .
> _:AX2dX4c810b47X3aX138a134150cX3aXX2dX5b77 <http://schema.org/name>
> "Garden City, N.Y." .
> _:AX2dX4f8be47aX3aX138a12b4eb9X3aXX2dX1677 <http://schema.org/name>
> "Garden City, N.Y." .
> _:AX2dX4f8be47aX3aX138a12b4eb9X3aXX2dX23e1 <http://schema.org/name>
> "Garden City, N.Y." .
> _:AX2dX52ad903aX3aX138a12d336bX3aXX2dX5389 <http://schema.org/name>
> "Garden City, N.Y." .
>
>
> Where punctuation doesn't cloud the picture, these could eventually be
> linked to:
>   http://id.loc.gov/authorities/names/n50068040.html
> and:
>   http://www.geonames.org/5118226/garden-city.html
>
> and in that way could have a shared identity.
>
> kc
>
> p.s. Richard and I are on a list with someone who has loaded the triples
> into a database. I will ask if we can announce it here, and will also try
> to figure out how to use the SPARQL endpoint and provide some examples, if
> that is ok with the dc:creator of the database.
>
> [1] javascript:location.href='http://www.w3.org/2012/pyRdfa/extract?format=turtle&uri='+escape(location.href)
>
>
>> In the meantime, can you let me have a few examples of where you are
>> seeing discrepancies between the download triples and the RDFa embedded
>> in WorldCat.org pages?
>>
>> ~Richard.
>>
>> On 22 August 2012 19:08, Karen Coyle <[log in to unmask]> wrote:
>>
>>> Richard, I've run into yet another area where documentation would be
>>> helpful. There are differences between the schema.org/RDFa that is
>>> embedded in WorldCat data and the exported WorldCat triples in the file.
>>> One of those differences happens to be the source of the place of
>>> publication, if I am reading it right. So, again, a request for
>>> documentation on the fields included and their MARC source.
>>>
>>> Thanks,
>>>
>>> kc
>>>
>>> On 8/17/12 8:38 AM, Richard Wallis wrote:
>>>
>>>> In case you missed the press release earlier this week.
>>>>
>>>> You can now download a significant number of RDF triples describing the
>>>> most highly held 1.2 million resources in WorldCat.  Licensed under
>>>> ODC-BY.
>>>>
>>>> I've posted more details on my blog:
>>>> http://dataliberate.com/2012/08/get-yourself-a-linked-data-piece-of-worldcat-to-play-with/
>>>>
>>>> ~Richard.
>>>>
>>> --
>>> Karen Coyle
>>> [log in to unmask] http://kcoyle.net
>>> ph: 1-510-540-7596
>>> m: 1-510-435-8234
>>> skype: kcoylenet
>>>
>>>
>>
>>
> --
> Karen Coyle
> [log in to unmask] http://kcoyle.net
> ph: 1-510-540-7596
> m: 1-510-435-8234
> skype: kcoylenet
>



-- 
Richard Wallis
Founder, Data Liberate
http://dataliberate.com
Tel: +44 (0)7767 886 005

Linkedin: http://www.linkedin.com/in/richardwallis
Skype: richard.wallis1
Twitter: @rjw
IM: [log in to unmask]