> One of the questions this raises is what we are/aren't allowed to do in
> terms of harvesting full-text. While I realise we could get into legal
> stuff here, at the moment we want to put that question to one side. Instead
> we want to consider what Google, and other search engines, do, the
> mechanisms available to control this, and what we do, and the equivalent
> mechanisms - our starting point is that we don't feel we should be at a
> disadvantage to a web search engine in our harvesting and use of repository
> records.
> Of course, Google and other crawlers can crawl the bits of the repository
> that are on the open web, and 'good' crawlers will obey the contents of
> robots.txt
> We use OAI-PMH, and while we often see (usually general and sometimes
> contradictory) statements about what we can/can't do with the contents of a
> repository (or a specific record), it feels like there isn't a nice simple
> mechanism for a repository to say "don't harvest this bit".

I would argue there is -- the whole point of OAI-PMH is to make stuff
available for harvesting. If someone goes to the trouble of making things
available via a protocol that exists only to make things harvestable and
then doesn't want it harvested, you can dismiss them as being totally
confused.

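To make that concrete: the closest thing a repository record offers today is a free-text rights statement in its metadata, which is exactly the kind of "general and sometimes contradictory" signal described above. A minimal sketch, parsing a made-up oai_dc record (the record content and title are invented for illustration), shows that dc:rights is human-readable prose, not a machine-actionable "don't harvest" flag:

```python
# Parse a hypothetical OAI-PMH oai_dc record and pull out its rights
# statement. The record below is fabricated for illustration only.
import xml.etree.ElementTree as ET

record = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An Example Thesis</dc:title>
      <dc:rights>Metadata may be reused; full text for personal study only.</dc:rights>
    </oai_dc:dc>
  </metadata>
</record>"""

root = ET.fromstring(record)
# dc:rights is just prose -- a harvester can read it, but cannot act on it
# without a human interpreting what "personal study only" permits.
for rights in root.iter("{http://purl.org/dc/elements/1.1/}rights"):
    print(rights.text)
```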
OAI-PMH runs on top of HTTP, so anything in robots.txt already applies --
i.e. if they want you to crawl metadata only but not download the objects
themselves (because they don't want to deal with the load or bandwidth
charges), they can indicate this to all crawlers there.
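A quick sketch of how that split would work in practice, using Python's standard robots.txt parser. The paths, hostname, and user-agent string are hypothetical -- the point is just that a repository can allow its OAI-PMH endpoint while disallowing the object store:

```python
# Hypothetical robots.txt that permits metadata harvesting via OAI-PMH
# but disallows bulk download of the full-text objects.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Allow: /oai
Disallow: /files/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The OAI-PMH endpoint is fair game for a well-behaved harvester...
print(rp.can_fetch("my-harvester",
                   "https://repo.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc"))
# ...but the objects themselves are off limits.
print(rp.can_fetch("my-harvester",
                   "https://repo.example.org/files/thesis.pdf"))
```

Any crawler that honors robots.txt -- Google's or ours -- gets the same answer, which is the "equivalent mechanism" the original question was after.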


Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
[log in to unmask] / 503.999.9787