LISTSERV 16.5 - CODE4LIB Archives

Hi,

I summarized my thoughts about identifiers for data formats in a blog 
posting: http://jakoblog.de/2009/05/10/who-identifies-the-identifiers/

In short, it’s not a technology issue but a commitment issue, and the 
problem of identifying the right identifiers for data formats can be 
reduced to two fundamental rules of thumb:

1. reuse: don’t create new identifiers for things that already have one.

2. document: if you have to create an identifier, describe its referent 
as openly, clearly, and in as much detail as possible to make it reusable.

A format should be described with a schema (XML Schema, OWL etc.) or at 
least a standard. Usually this schema already has a namespace or similar 
identifier that can be used for the whole format.

For instance, MODS Version 3 (currently 3.0, 3.1, 3.2, 3.3) has the XML 
Namespace http://www.loc.gov/mods/v3 so this is the best identifier to 
identify MODS. If you need to identify a specific version, you 
should *first* check whether such identifiers already exist, *second* push 
the publisher (LOC) to assign official URIs for MODS versions if these do 
not already exist, or *third* create and document specific URIs and make 
sure that everyone knows about these identifiers. At the moment there are:

MODS Version 3     http://www.loc.gov/mods/v3
MODS Version 3.0   info:srw/schema/1/mods-v3.0
MODS Version 3.1   info:srw/schema/1/mods-v3.1
MODS Version 3.2   info:srw/schema/1/mods-v3.2
                   info:ofi/fmt:xml:xsd:mods
MODS Version 3.3   info:srw/schema/1/mods-v3.3

The SRU Schemas registry links the "info:srw/schema/1/mods-v3*" 
identifiers to their XML Schemas, which is very little documentation, but 
it at least links to http://www.loc.gov/mods/v3 in some way.
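
To illustrate the reuse rule, a client could keep the identifiers listed 
above in a small lookup table and resolve whichever one it encounters to 
the same format. This is only a sketch in Python: the name 
`identify_mods` and the table layout are made up for this example, and 
the table contains nothing beyond the identifiers listed above.

```python
# Known identifiers for MODS, taken from the list above.
# A version of None means "MODS 3, no specific minor version".
MODS_IDENTIFIERS = {
    "http://www.loc.gov/mods/v3": None,
    "info:srw/schema/1/mods-v3.0": "3.0",
    "info:srw/schema/1/mods-v3.1": "3.1",
    "info:srw/schema/1/mods-v3.2": "3.2",
    # Listed next to version 3.2 above; treated as 3.2 here (assumption).
    "info:ofi/fmt:xml:xsd:mods": "3.2",
    "info:srw/schema/1/mods-v3.3": "3.3",
}

# The XML namespace serves as the identifier for the format as a whole.
MODS_NAMESPACE = "http://www.loc.gov/mods/v3"


def identify_mods(identifier):
    """Return (namespace, version) for a known MODS identifier, else None."""
    if identifier not in MODS_IDENTIFIERS:
        return None
    return (MODS_NAMESPACE, MODS_IDENTIFIERS[identifier])
```

An unknown identifier yields None instead of a guess, so a new 
identifier has to be added to the table (and documented) explicitly.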

Ross wrote:

> First, and most importantly, how do we reconcile these different
> identifiers for the same thing?  Can we come up with some agreement on
> which ones we should really use?

Use the one that is documented best.

> Secondly, and this gets to the reason why any of this was brought up
> in the first place, how can we coordinate these identifiers more
> effectively and efficiently to reuse among various specs and
> protocols, but not:
>
> 1) be tied to a particular community
> 2) require some laborious and lengthy submission and review process to
> just say "hey, here's my FOAF available via UnAPI"

The identifier for FOAF is http://xmlns.com/foaf/0.1/. Forget about 
identifiers that are not URIs. OAI-PMH at least includes a mechanism to 
map metadataPrefixes to official URIs but this mechanism is not always 
used. If unAPI lacks a way to map a local name to a global URI, we 
should fix unAPI so that it tells us:

<?xml version="1.0" encoding="UTF-8"?>
<formats xmlns="http://unapi.info/">
   <format name="foaf" uri="http://xmlns.com/foaf/0.1/"/>
</formats>
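
A client could then read the mapping with a few lines of code. The 
following is a minimal sketch in Python using only the standard library; 
the "uri" attribute is the extension proposed above, not something 
existing unAPI servers can be assumed to send, and the function name 
`format_uris` is made up for this example.

```python
import xml.etree.ElementTree as ET

# The response proposed above. The "uri" attribute is the suggested
# extension and is an assumption, not part of deployed unAPI servers.
RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<formats xmlns="http://unapi.info/">
   <format name="foaf" uri="http://xmlns.com/foaf/0.1/"/>
</formats>"""

# ElementTree writes namespaced tag names as "{namespace}localname".
UNAPI_NS = "{http://unapi.info/}"


def format_uris(xml_text):
    """Map each local format name to its global URI (None if missing)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {
        fmt.get("name"): fmt.get("uri")
        for fmt in root.iter(UNAPI_NS + "format")
    }
```

Calling `format_uris(RESPONSE)` gives a dictionary from the local name 
"foaf" to the FOAF namespace URI. A format without a "uri" attribute 
maps to None, which shows where a global identifier is still missing.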

unAPI should be revised and specified more strictly to become an RFC 
anyway. Yes, this requires a laborious and lengthy submission and review 
process but there is no such thing as a free lunch.

> 3) be so lax that it throws all hope of authority out the window

Reuse existing authorities and document better to create authority.

> I would expect the various communities to still maintain their own
> registries of "approved" data formats (well, OpenURL and SRU, anyway
> -- it's not as appropriate to UnAPI or Jangle).

There should be a distinction between descriptive registries that only 
list identifiers and formats that are defined elsewhere and 
authoritative registries that define new identifiers and formats. The 
number of authoritatively defined identifiers should be small for a 
given API because the identifier should be defined by the creator 
of the format rather than by a user of the format. If the creator does not 
support usable identifiers, it is better to talk to them than to create 
something in parallel.
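
The distinction can be made explicit in a registry's data model. The 
sketch below (Python) records for each entry who defined the identifier 
and splits a registry's entries accordingly. The class, the function, 
and the "defined_by" labels are illustrative assumptions, not taken from 
any existing registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    identifier: str   # identifier listed by the registry
    defined_by: str   # who defined the identifier (illustrative label)


def split_registry(entries, registry_owner):
    """Split entries into (authoritative, descriptive) for one registry."""
    authoritative = [e for e in entries if e.defined_by == registry_owner]
    descriptive = [e for e in entries if e.defined_by != registry_owner]
    return authoritative, descriptive


# Example entries; the owner labels are assumptions for this sketch.
ENTRIES = [
    Entry("http://xmlns.com/foaf/0.1/", "FOAF"),
    Entry("http://www.loc.gov/mods/v3", "LOC"),
    Entry("info:srw/schema/1/mods-v3.3", "SRU"),
]
```

For a registry owned by "SRU", only the info:srw entry is authoritative 
here; the other two are merely listed and remain defined elsewhere.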

Greetings,
Jakob

-- 
Jakob Voß, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de