LISTSERV 16.5 - CODE4LIB Archives

Hi Rob, the conversation continues below...

On Thu, Jan 23, 2014 at 7:01 PM, Robert Sanderson wrote:

> Hi Jon,
>
> To present the other side of the argument so that others on the list can
> make an informed decision...
>

Thanks for reminding me that this is an academic panel discussion in front
of an audience, rather than a conversation.

>
> On Thu, Jan 23, 2014 at 4:22 PM, Jon Phipps wrote:
>
> > I've developed a quite strong opinion that vocabulary developers should
> > not _ever_ think that they can understand the semantics of a vocabulary
> > resource by 'reading' the URI.
>
>
> 100% Agreed. Good documentation is essential for any ontology, and it has
> to be read to understand the semantics. You cannot just look at
> oa:hasTarget, out of context, and have any idea what it refers to.
>
> However if that URI is readable it makes developers' lives much easier in a
> lot of situations, and it has no additional cost. Opaque URIs for
> predicates are the digital equivalent of thumbing your nose at the people
> you should be courting -- the people who will actually use your ontology in
> any practical sense.  It says: We don't care about you enough to make your
> life one step easier by having something that's memorable. You will always
> have to go back to the ontology every time and reread this documentation,
> over and over and over again.
>

What you suggest is that an identifier (e.g. @azaroth42 or ORCID:
0000-0003-4441-6852 <https://orcid.org/0000-0003-4441-6852>) should always
be readable as a convenience to the developer. RDA does provide a 'readable
in the language of the reader' URI specifically as a convenience to the
developer, a feature that I lobbied for. It's just not the /canonical/ URI,
because it's an identifier of a property, not the property itself, and that
property is independent of the language used to label it.

It's the difference between Metadata Management Associates, PO Box 282,
Jacksonville, NY 14854, USA (for people) and 14854-0282 (a perfectly
functional, complete address in the USA namespace). The latter is precisely
the same identifier of that box for machines, and ultimately for the
postmaster, who doesn't care whose name is on the box numbered 282. The
postmaster only needs to know that highly memorable name when someone uses
the convenience of not bothering to look up the box number and just sends
mail addressed to us at 14854, or even just Jacksonville. And no, I don't
want to start a URL vs. URI/URN/IRI discussion.

>
> > Do you have some expectation that in order
> > for the data to be useful your relational or object database identifiers
> > must be readable?
>
>
> Identifiers for objects, no. The table names and field names? Yes. How many
> DBAs do you know that create tables with opaque identifiers for the column
> names?  How many XML schemas do you know that use opaque identifiers for
> the element names?
>
> My count is 0 from many many many instances.  And the reason is the same as
> having readable predicate URIs -- so that when you look at the table,
> schema, ontology, triple or what have you, there is some mnemonic value
> from the name to its intent.
>

Our experience obviously differs in this regard. I've seen many, many
databases that have relatively opaque column identifiers that were
relabeled in the query to suit the audience for the query. I've seen many
French databases, with French content, intended for a French audience,
designed by French developers, that had French 'column headers'.
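
A minimal sketch of that kind of relabeling, in Python with the standard
sqlite3 module. The table name t_0417 and the column names c1 and c2 are
invented for illustration and don't come from any real database:

```python
import sqlite3

# Hypothetical table with opaque column identifiers (names invented here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_0417 (c1 TEXT, c2 TEXT)")
conn.execute("INSERT INTO t_0417 VALUES (?, ?)", ("Les Misérables", "Victor Hugo"))

# The same opaque columns, relabeled per audience with AS aliases.
QUERIES = {
    "en": "SELECT c1 AS title, c2 AS author FROM t_0417",
    "fr": "SELECT c1 AS titre, c2 AS auteur FROM t_0417",
}

def headers(lang):
    """Return the column labels a reader of `lang` would see."""
    cursor = conn.execute(QUERIES[lang])
    return [col[0] for col in cursor.description]

print(headers("en"))
print(headers("fr"))
```

The stored identifiers never change; only the labels shown to each audience
do.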

The point here is that the identifiers /identify/ a property that exists
independent of the language of the data being used to describe a resource.
If RDA _had_ to pick a single language to satisfy your requirement for a
single readable identifier, which one? To assume that the one language
should be English says to the non-english speaking world "We don't care
about you enough to make your
life one step easier by having something that's memorable"


>
> > By whom, and in English? This to me is a frankly colonial
> > assumption of the dominance of English in the world of metadata.
>
>
> In the world of computing in general. "for" "if" "while" ... all English.
> While there are Turing-complete languages out there, the ones that don't
> have real-world language constructions are toys, like Whitespace for
> example.  Even the lolcats programming language is more usable than
> Whitespace.
>
> Again, it's a cost/value consideration.  There are many people who will
> understand English, and when developers program, they're surrounded by it.
> If your intended audience is primarily people who speak French, then you
> would be entirely justified in using URIs with labels from French. Or
> Chinese, though the IRI expansion would be more of a pain :)
>
>
>
Despite the fact that developers are surrounded by English, I've worked with
many highly skilled developers who didn't speak or read English and who
relied on documentation and meetings in their own language. What RDA is
trying to
convey is the specific bibliographic knowledge, admittedly limited by the
cultural context of the North American and European bibliographic
communities, that can be broken down into classes of things and some
properties of those things. An English URI is often nearly as opaque as a
numeric URI to a non-English-speaking programmer and immediately
communicates an Anglo-American bias.

RDA's intended audience, as is the case with everything intended to
function in the global web of data, is the entire world in every language.
Identifying a thing using a cultural and language specific word or phrase
instantly biases the general understanding of that thing. And RDA is trying
very hard to avoid that a priori cultural bias as much as possible.


> > The proper
> > understanding of the semantics, although still relatively minimal, is
> > from the definition, not the URI.
>
>
> Yes. Any short cuts to *understanding* rather than *remembering* are to be
> avoided.
>
>
>
> > Our coining and inclusion of multilingual
> > (eventually) lexical URIs based on the label is a concession to
> > developers who feel that they can't effectively 'use' the vocabularies
> > unless they can read the URIs.
>
>
> So in my opinion, as is everything in the mail of course, this is even
> worse. Now instead of 1600 properties, you have 1600 * (number of languages
> +1) properties. And you're going to see them appearing in uses of the
> ontology. Either stick with your opaque identifiers or pick a language for
> the readable ones, and best practice would be English, but doing both is a
> disaster in the making.
>
>
Best practice is not ever English, for the non-English-speaking world.


>
> > I grant that writing ad
> > hoc SPARQL queries with opaque URIs can be intensely frustrating, but the
> > vocabularies aren't designed specifically to support that incredibly
> > narrow use case.
>
>
> Writing queries is something developers have to do to work with data.  More
> importantly, writing code that builds the triples in the first place is
> something that developers have to do. And they have to get it right ...
> which they likely won't do first time. There will be typos. That P1523235
> might be written into the code as P1533235 ... an impossible-to-spot typo.
> dc:title vs dc:titel ... a bit easier to spot, no?
>

A machine trying to resolve a misspelled, non-existent URI is a much
better spell-checker than any developer will ever be. The problem here is
that if RDA truly wants to be multilingual, and avoid the cultural bias of
English identifiers, then it either has to provide multiple lexical
identifiers, or provide a lookup service, like many providers of resources
identified by opaque identifiers.
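
A rough sketch of the machine-as-spell-checker idea, in Python. It compares
the property names used in code against the set a vocabulary declares. The
DECLARED set is only a stand-in for what a real lookup would return (in
practice you would dereference the URI or load the published vocabulary).
rdaa:P50209 appears elsewhere in this thread; rdaa:P50210 and the typo forms
are assumed for illustration:

```python
# Stand-in for the properties a vocabulary actually declares.
# rdaa:P50209 appears in this thread; rdaa:P50210 is assumed for illustration.
DECLARED = {"rdaa:P50209", "rdaa:P50210", "dc:title"}

def undeclared(used):
    """Return the used property names that the vocabulary does not declare."""
    return sorted(set(used) - DECLARED)

# One transposed-digit typo and one misspelled label, both hypothetical.
used_in_code = ["rdaa:P50209", "rdaa:P50290", "dc:titel", "dc:title"]
print(undeclared(used_in_code))
```

An opaque typo and a readable typo are caught the same way here; neither
depends on a developer spotting it by eye.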


>
> So the consequence is that the quality of the uses of your ontology will go
> down.  If there were 16 fields, maybe there'd be a chance of getting it
> right. But 1600, with 5 digit identifiers, is asking for trouble.


> Compare MARC fields. We all love our 245$a, I know, but dc:title is a lot
> easier to recall. Now imagine those fields are (seemingly) random 5 digit
> codes without significant structure. And that there's 1600 of them. And
> you're asking the developer to use a graph structure that's likely
> unfamiliar to them.
>

Just to clarify:

You (and others who think like you in the audience) would be fine with:
rdaa:addresseeOf a rdf:Property ;
    owl:sameAs rdaa:P50209 .

but not:
rdaa:P50209 a rdf:Property ;
    owl:sameAs rdaa:addresseeOf .

Both say precisely the same thing about the same resource. You also hold
that coining dozens or hundreds of lexical identifiers for the same thing,
just to make life easier for developers, is a bad thing, and that best
practice would be to coin a single, readable-in-English URI.

I'm afraid that I won't ever agree with that perspective, when producing
data for global distribution and consumption.

I'm personally not entirely happy with hundreds of sameAs lexical URIs. An
alternative would be a lookup service that, given a label, returns the
canonical URI. But I think that's more of an inconvenience to the developer
than the simple ability to use a memorable URI, based on a label in their
language, and have it resolve (permanently) to a canonical, opaque URI when
accessed by a machine: "Use 'em all, and let the machines figure it out."
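
A minimal sketch of letting the machines figure it out, in Python. ALIASES is
a hypothetical table mapping lexical URIs to the canonical opaque one.
rdaa:addresseeOf and rdaa:P50209 are from the example earlier in this
message; the French form rdaa:destinataireDe and the ex: resources are
invented for illustration and are not actual published URIs:

```python
# Hypothetical alias table: lexical URI -> canonical opaque URI.
ALIASES = {
    "rdaa:addresseeOf": "rdaa:P50209",     # pair taken from this thread
    "rdaa:destinataireDe": "rdaa:P50209",  # invented French form, illustration only
}

def canonical(uri):
    """Resolve a lexical alias to its canonical URI; other URIs pass through."""
    return ALIASES.get(uri, uri)

def canonicalize(triples):
    """Rewrite the predicate of each (subject, predicate, object) triple."""
    return [(s, canonical(p), o) for (s, p, o) in triples]

# Three developers, three spellings of the same property.
data = [
    ("ex:personA", "rdaa:addresseeOf", "ex:letter1"),
    ("ex:personB", "rdaa:destinataireDe", "ex:letter2"),
    ("ex:personC", "rdaa:P50209", "ex:letter3"),
]
print({p for (_, p, _) in canonicalize(data)})
```

Whatever form a developer types, the data ends up keyed on the one canonical
identifier.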


> All in my opinion, and all debatable. I hope that your choice goes well for
> you,


I'd like to repeat: although I agree with that choice, and I'm
defending it here, it wasn't my choice. Not at all. And the concerns you
express were well-aired and very carefully considered before the choice was
made.


> but would like other people to think about it carefully before
> following suit.
>

Me too! :-)

Jon
...who now has to go deal with the consequences of an ill-considered
decision to deploy an unfamiliar nginx server, on a tight deadline, instead
of my happy buddy Apache