This post veers nearer to something I was going to add as an FYI, so here goes...
FYI: NISO has recently started a working group to study best practices for discovery services. The ODI (Open Discovery Initiative) working group hopes to look at exactly this issue (how a content provider should tell a content requester what it can "have"), among others (how to convey commercial restrictions, and how to produce statistics meaningful to providers, discovery services, and consumers of the discovery service), and to produce guidelines on procedures, formats, etc. for this.
This is a new working group and it doesn't expect any deliverables until Q3 of 2012, so it is a bit late to help Owen, but anyone who is interested may want to follow NISO's progress from time to time. Look at www.niso.org and find the ODI working group. If you're really interested, contact the group to offer your thoughts. Many of you may also be contacted by a survey asking for your thoughts as part of the process anyway. Just like the long reach of OCLC, there is no escaping NISO.
> -----Original Message-----
> From: Code for Libraries On Behalf Of Joe Hourcle
> Sent: Friday, February 24, 2012 10:20 AM
> Subject: Re: [CODE4LIB] "Repositories", OAI-PMH and web crawling
> On Feb 24, 2012, at 9:25 AM, Kyle Banerjee wrote:
> >> One of the questions this raises is what we are/aren't allowed to do
> >> in terms of harvesting full-text. While I realise we could get into
> >> legal stuff here, at the moment we want to put that question to one
> >> side. Instead we want to consider what Google, and other search
> >> engines, do, the mechanisms available to control this, and what we
> >> do, and the equivalent mechanisms - our starting point is that we
> >> don't feel we should be at a disadvantage to a web search engine in
> >> our harvesting and use of repository records.
> >> Of course, Google and other crawlers can crawl the bits of the
> >> repository that are on the open web, and 'good' crawlers will obey
> >> the contents of robots.txt. We use OAI-PMH, and while we often see
> >> (usually general and sometimes contradictory) statements about
> >> what we can/can't do with the
> >> contents of a repository (or a specific record), it feels like there
> >> isn't a nice simple mechanism for a repository to say "don't harvest this bit".
> > I would argue there is -- the whole point of OAI-PMH is to make stuff
> > available for harvesting. If someone goes to the trouble of making
> > things available via a protocol that exists only to make things
> > harvestable and then doesn't want it harvested, you can dismiss them
> > as being totally mental.
> I see it like the people who request that their pages not be cached elsewhere -- they want to make
> their object 'discoverable', but they want to control the access to those objects -- so it's one thing
> for a search engine to get a copy, but they don't want that search engine being an agent to distribute
> copies to others.
> E.g., all of the journal publishers who charge access fees -- they want people to find that they have a
> copy of that article that you're interested in ... but they want to collect their $35 for you to read it.
> In the case of scientific data, the problem is that to make stuff discoverable, we often have to
> perform some lossy transformation to fit some metadata standard, and those standards rarely have
> mechanisms for describing error (accuracy, precision, etc.). You can do some science with the catalog
> records, but it's going to introduce some bias into your results, so you're typically better off
> getting the data from the archive. (and sometimes, they have nice clean catalogs in FITS, VOTable,
> CDF, NetCDF, HDF or whatever their discipline's preferred data format is)
> Also, I don't know if things have changed in the last year, but I seem to remember someone mentioning
> at last year's RDAP (Research Data Access & Preservation) summit that Google had coordinated with some
> libraries for feeds from their catalogs, but was only interested in books, not other objects.
> I don't know how other search engines might use data from OAI-PMH, or if they'd filter it because they
> didn't consider it to be information they cared about.