LISTSERV 16.5 - CODE4LIB Archives

(Sorry for a previous empty message)

Hi Jon,

On Fri, Jan 24, 2014 at 7:56 AM, Jon Phipps <[log in to unmask]> wrote:

> Hi Rob, the conversation continues below...
>
> On Thu, Jan 23, 2014 at 7:01 PM, Robert Sanderson <[log in to unmask]> wrote:
> > To present the other side of the argument so that others on the list can
> > make an informed decision...
> Thanks for reminding me that this is an academic panel discussion in front
> of an audience, rather than a conversation.
>

Heh :) I just meant that I wasn't trying to convince you to change, just
that I wanted to voice my concerns.
(But, yes, touché!)


> On Thu, Jan 23, 2014 at 4:22 PM, Jon Phipps <[log in to unmask]> wrote:
> >
> > However if that URI is readable it makes developers' lives much easier in a
> > lot of situations, and it has no additional cost. Opaque URIs for
> > predicates are the digital equivalent of thumbing your nose at the people
> > you should be courting
>
> What you suggest is that an identifier (e.g. @azaroth42 or ORCID:
> 0000-0003-4441-6852 <https://orcid.org/0000-0003-4441-6852>) should always
> be readable as a convenience to the developer.


Those are identifiers for objects or entities, not predicates.  As I said,
I'm happy for entities to have opaque URIs.  Where we disagree is whether you
can carry over that same rationale to predicates/properties/relationships.


> RDA does provide a 'readable
> in the language of the reader' uri specifically as a convenience to the
> developer. A feature that I lobbied for. It's just not the /canonical/ URI,
> because it's an identifier of a property, not the property itself, and that
> property is independent of the language used to label it.
>

So this, IMO, is where the trouble starts.  People /will/ use those
convenience URIs. And that will make for a nightmare in terms of
interoperability (see below).



> It's the difference between Metadata Management Associates, PO Box 282,
> Jacksonville, NY 14854, USA (for people) and 14854-0282 (a perfectly
> functional complete address in the USA namespace), which is precisely the
> same identifier of that box for machines


Which is also an entity, not a predicate. I almost said "property" there,
which would be amusingly incorrect.



> > > Do you have some expectation that in order
> > > for the data to be useful your relational or object database identifiers
> > > must be readable?
> >
> > Identifiers for objects, no. The table names and field names? Yes. How many
> > DBAs do you know that create tables with opaque identifiers for the column
> > names?  How many XML schemas do you know that use opaque identifiers for
> > the element names?
> >
> > My count is 0 from many many many instances.  And the reason is the same as
> > having readable predicate URIs -- so that when you look at the table,
> > schema, ontology, triple or what have you, there is some mnemonic value
> > from the name to its intent.
>
> Our experience obviously differs in this regard. I've seen many, many
> databases that have relatively opaque column identifiers that were
> relabeled in the query to suit the audience for the query. I've seen many
> French databases, with French content, intended for a French audience,
> designed by French developers, that had French 'column headers'.
>

Yes, but French column headers are not opaque. How many schemas have
completely opaque, non-linguistic column headers, element names, etc?
I'm not talking "relatively opaque", I mean "P12345" or similar. I didn't
count MARC in my 0, which is strictly true as it's not XML or a relational
table, but you could say 1 to be fair.

Yes, sometimes they're PrpCtr or similar, but that's at least somewhat
readable (Property Counter, perhaps?) compared to a UUID or random integer.


> The point here is that the identifiers /identify/ a property that exists
> independent of the language of the data being used to describe a resource.
> If RDA _had_ to pick a single language to satisfy your requirement for a
> single readable identifier, which one? To assume that the one language
> should be English says to the non-English speaking world "We don't care
> about you enough to make your
> life one step easier by having something that's memorable"
>

My problem is not with the idea that properties exist independently of
language, it's the side effect of not picking a language to use.  If you
had to pick one, then you should pick one.  If you want to make a political
stand, don't pick English. But at least pick one, and only one.

Not caring about the non-English speaking world (or the non-French speaking
world, if you pick French) is at least caring about some people, rather than
no one.


> Despite the fact that developers are surrounded by English I've worked with
> many highly skilled developers who didn't speak or read English. Who relied
> on documentation and meetings in their own language.


Likewise, though admittedly primarily speakers of European languages rather
than Asian ones.  However, even if someone doesn't speak English (or Italian,
or French, or German), a language-based construct is more memorable than a
completely opaque one.


> An English URI is often nearly as opaque as a
> numeric URI to a non-English-speaking programmer and immediately
> communicates an Anglo-American bias.
>

"often nearly"? :)   That sounds almost like you're saying there are times
when a linguistic URI is still okay.



> RDA's intended audience, as is the case with everything intended to
> function in the global web of data, is the entire world in every language.
> Identifying a thing using a cultural and language specific word or phrase
> instantly biases the general understanding of that thing. And RDA is trying
> very hard to avoid that a priori cultural bias as much as possible.
>

Which is admirable, certainly, but ultimately damaging.  A 100% politically
correct but unused vocabulary doesn't really help anyone.


> > > I grant that writing ad
> > > hoc SPARQL queries with opaque URIs can be intensely frustrating, but the
> > > vocabularies aren't designed specifically to support that incredibly narrow
> > > use case.
> >
> > Writing queries is something developers have to do to work with data.  More
> > importantly, writing code that builds the triples in the first place is
> > something that developers have to do. And they have to get it right ...
> > which they likely won't do first time. There will be typos. That P1523235
> > might be written into the code as P1533235 ... an impossible to spot typo.
> > dc:title vs dc:titel ... a bit easier to spot, no?
>
> A machine trying to resolve a mis-spelled, non-existent URI is a much
> better spell-checker than any developer will ever be.


Non-existent, sure. But the chances are high that a typo will collide with
another identifier that does exist, and you'll end up silently using the
wrong property, e.g. assigning subjects of street addresses.

Combine that with # rather than / URIs and you have to parse the response to
determine whether or not the predicate exists, and that the fragment isn't
just "Introduction" or "References".  Secondly, a machine can equally easily
determine that .../title does exist when .../titel does not, so I fail to see
how opaque identifiers are any better.
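
To make the collision case concrete, here is a minimal sketch. The two
vocabularies and their meanings are made up for illustration (neither
P-number is claimed to be a real RDA property), and `check` is a hypothetical
stand-in for "does this term resolve?".

```python
# Hypothetical vocabularies: identifiers and meanings are invented for this sketch.
OPAQUE = {"P1523235": "title", "P1533235": "street address"}
READABLE = {"title": "title", "streetAddress": "street address"}


def check(term, vocabulary):
    """Return the meaning of term, or None if the vocabulary does not define it."""
    return vocabulary.get(term)


# A one-digit slip in an opaque identifier can land on another valid property,
# so an existence check passes and the wrong meaning is used silently.
print(check("P1533235", OPAQUE))

# The equivalent slip in a readable name is simply undefined, so the check fails.
print(check("titel", READABLE))
```

Both kinds of typo are caught when the mistyped term does not exist; only the
opaque one risks resolving to something else that does.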



> Just to clarify:
> You (and others who think like you in the audience) would be fine with:
> rdaa:addresseeOf a rdf:Property
>     owl:sameAs rdaa:P50209
> but not:
> rdaa:P50209 a rdf:Property
>     owl:sameAs rdaa:addresseeOf
>

No. Either make your political standpoint and stick with rdaa:P50209, OR
use a memorable URI like rdaa:addresseeOf, but do not do both.




> And that
> dozens or hundreds of lexical identifiers for the same thing, just to make
> life easier for developers is a bad thing. And that best practice would be
> to coin a single, readable-in-English URI.
>

Yes. Or non-English, but the rest of the RDF (and computing) world has
picked English.


> I'm afraid that I won't ever agree with that perspective, when producing
> data for global distribution and consumption.
>

And hence my opening line :)



> I'm personally not entirely happy with hundreds of sameAs lexical URIs.


I think you meant hundreds /of thousands/ of, right?
1600 * (number of languages in the world +1) ?
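
As a rough sketch of that arithmetic: 1600 is the property count used above,
while the language counts passed in are purely illustrative assumptions, not
figures from RDA. `alias_count` is a hypothetical helper.

```python
def alias_count(properties: int, languages: int) -> int:
    """One lexical URI per property per language, plus the canonical opaque URI."""
    return properties * (languages + 1)


# Illustrative language counts only.
for languages in (1, 10, 100):
    print(languages, alias_count(1600, languages))
```

Under these assumptions it takes on the order of a hundred languages to reach
hundreds of thousands of URIs.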


> An alternative would be a lookup service that given a label returned the
> canonical URI. But I think that's more of an inconvenience to the developer
> than the simple ability to use a memorable URI, based on a label in their
> language, and have it resolve (permanently) to a canonical, opaque URI when
> accessed by a machine: "Use 'em all, and let the machines figure it out."
>

Let the developers write code to have the machines figure it out.  And let
the server at your end deal with sustained lookups all day, every day.  The
W3C has to throttle requests against their DTDs, which is one lookup per
instance.  You're suggesting that EVERY triple require a dereference.

A lookup table, even locally, of hundreds of thousands of URI mappings is
not something anyone wants to deal with. Even, I'll bet, Malay developers
who don't speak English.
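
What "let the machines figure it out" means for the consumer can be sketched
roughly as follows. This is only an illustration under assumptions: an
in-memory table stands in for the server round trip, the single alias is the
rdaa:addresseeOf / rdaa:P50209 pair from the example above, and `dereference`,
`canonical` and the `ex:` resources are hypothetical names.

```python
from functools import lru_cache

# Hypothetical alias table; a real consumer would have to fetch or ship this.
ALIASES = {"rdaa:addresseeOf": "rdaa:P50209"}

lookups = 0  # counts simulated round trips to the vocabulary server


def dereference(uri):
    """Stand-in for an HTTP dereference that returns the canonical URI."""
    global lookups
    lookups += 1
    return ALIASES.get(uri, uri)


@lru_cache(maxsize=None)
def canonical(uri):
    """Cache results so each distinct predicate is dereferenced only once."""
    return dereference(uri)


triples = [
    ("ex:a", "rdaa:addresseeOf", "ex:b"),
    ("ex:c", "rdaa:addresseeOf", "ex:d"),
    ("ex:e", "rdaa:P50209", "ex:f"),
]

# Every predicate must be normalised before data from two sources can be compared.
normalised = [(s, canonical(p), o) for s, p, o in triples]
print(normalised, lookups)
```

Without the cache, every triple costs a lookup; with it, every consumer still
has to write and maintain this layer before the data is usable.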


> > All in my opinion, and all debatable. I hope that your choice goes well for
> > you,
>
> I'd like to repeat: just because I agree with that choice, and I'm
> defending it here, it wasn't my choice. Not at all. And the concerns you
> express were well-aired and very carefully considered before the choice was
> made.
>

And yours :)


> > but would like other people to think about it carefully before
> > following suit.
> >
> Me too! :-)
> Jon
> ...who now has to go deal with the consequences of an ill-considered
> decision to deploy an unfamiliar nginx server, on a tight deadline, instead
> of my happy buddy Apache
>

Best of luck! :)

Rob