Kyle said: "If someone goes to the trouble of making things available via a protocol that exists only to make things harvestable and
then doesn't want it harvested, you can dismiss them ..."
True - but that's essentially what Southampton's configuration seems to say.
Thomas said: "The M in PMH still stands for Metadata, right? So opening an OAI-PMH server implicitly says you're willing to share metadata. I can certainly sympathize with sites wanting to do that but not necessarily wanting to offer anything more than "normal" end-user access to full text."
This is a fair point, but I've yet to see a robots.txt file that makes this distinction. That is, in general Google is not being told to stay away from the PDFs while being explicitly permitted to crawl the metadata, whatever the OAI-PMH situation.
Kyle said: "OAI-PMH runs on top of HTTP, so anything robots.txt already applies -- i.e. if they want you to crawl metadata only but not download the objects themselves because they don't want to deal with the load or bandwidth charges, this should be indicated for all crawlers."
OK - this suggests a way forward for me. I don't think we can regard robots.txt as applying across the board to OAI-PMH (in the Southampton example, the OAI-PMH endpoint itself is disallowed by robots.txt). But given a resource identifier in the metadata, it seems to make sense to use robots.txt (and I guess potentially X-Robots-Tag, assuming most of the resources are not simple HTML) to see whether a web crawler is permitted to crawl it, and so make the right decision about what we do.
That sounds vaguely sensible (although I'm still left thinking maybe we should just use a web crawler and ignore OAI-PMH, but I guess this way we maybe get the best of both worlds).
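As a rough sketch of that check, assuming Python and its standard urllib.robotparser module: the function names, the example.org host, the /cgi/oai2 path and the robots.txt content below are all made up for illustration and not taken from any real repository. The X-Robots-Tag handling is deliberately simplistic (it only looks for "noindex" or "none" and ignores per-bot prefixes).

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def robots_url_for(resource_url):
    """Return the robots.txt URL for the host that serves resource_url."""
    parts = urlsplit(resource_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def robots_allows(resource_url, robots_txt, user_agent="*"):
    """Check resource_url against the text of an already-fetched robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, resource_url)


def x_robots_allows(header_value):
    """Treat 'noindex' or 'none' in an X-Robots-Tag value as a refusal.

    Assumption: a missing header means no restriction was stated.
    """
    if not header_value:
        return True
    directives = {d.strip().lower() for d in header_value.split(",")}
    return not ({"noindex", "none"} & directives)


# Hypothetical robots.txt: the OAI-PMH endpoint is disallowed, the PDFs are not.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /cgi/oai2
"""

if __name__ == "__main__":
    # Identifier as it might appear in a harvested metadata record (made up).
    pdf = "http://repository.example.org/123/1/paper.pdf"
    print(robots_url_for(pdf))
    print(robots_allows(pdf, EXAMPLE_ROBOTS))
    print(x_robots_allows("noindex, nofollow"))
```

In practice the robots.txt text and the X-Robots-Tag header would come from real HTTP requests against the repository; they are passed in as plain strings here so the decision logic can be tried without touching the network.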
Thanks again (and of course further thoughts welcome)
Owen Stephens Consulting
Telephone: 0121 288 6936
On 24 Feb 2012, at 14:45, Thomas Dowling wrote:
> On 02/24/2012 09:25 AM, Kyle Banerjee wrote:
>>> We use OAI-PMH, and while we often see (usually general and sometimes
>>> contradictory) statements about what we can/can't do with the contents of a
>>> repository (or a specific record), it feels like there isn't a nice simple
>>> mechanism for a repository to say "don't harvest this bit".
>> I would argue there is -- the whole point of OAI-PMH is to make stuff
>> available for harvesting. If someone goes to the trouble of making things
>> available via a protocol that exists only to make things harvestable and
>> then doesn't want it harvested, you can dismiss them as being totally
> The M in PMH still stands for Metadata, right? So opening an OAI-PMH
> server implicitly says you're willing to share metadata. I can certainly
> sympathize with sites wanting to do that but not necessarily wanting to
> offer anything more than "normal" end-user access to full text.
> That said, in a world with unfriendly bots, the repository should still be
> making informed choices about controlling full text crawlers (robots.txt,
> meta tags, HTTP cache directives, etc etc.).
> Thomas Dowling