Sorry.. late to the discussion...

Isn't this a little apples and oranges?

Surely robots.txt exists because many static resources are served directly
from a tree structured filesystem?

(Nearly) all OAI requests are responded to by specific service applications
which are perfectly capable of deciding, on a resource-by-resource basis,
whether an anonymous user should or should not see a given resource. As has
been said, why would you list a resource in OAI if you didn't think
**someone** would find it useful? If you want to take something out of
circulation, you mark it deleted so that clients connecting for updates know
it should be removed.
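
As a rough illustration of the "mark it deleted" point, here is a minimal
Python sketch of how a harvesting client might pick out deleted records from
an OAI-PMH response. The sample response and the identifiers in it are made
up for the example; only the namespace and the status="deleted" header
attribute come from the OAI-PMH 2.0 protocol.

```python
import xml.etree.ElementTree as ET

OAI = "http://www.openarchives.org/OAI/2.0/"

# Hypothetical response body, invented for illustration only.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header status="deleted">
      <identifier>oai:example.org:2999999</identifier>
      <datestamp>2012-02-20</datestamp>
    </header>
    <header>
      <identifier>oai:example.org:42</identifier>
      <datestamp>2012-02-21</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>
"""

def deleted_identifiers(response_xml):
    """Return the identifiers of all headers marked status="deleted"."""
    root = ET.fromstring(response_xml.strip())
    return [
        header.findtext("{%s}identifier" % OAI)
        for header in root.iter("{%s}header" % OAI)
        if header.get("status") == "deleted"
    ]

if __name__ == "__main__":
    # The client would drop these items from its local copy.
    print(deleted_identifiers(SAMPLE))
```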

OAI isn't about fully enumerating a tree on every visit to see what's new,
it's about a short and efficient visit to say "What, if anything, has
changed since I was last here?". I don't want to have to walk an entire
repository of 3 million items to discover item 2999999 was deleted. I want
a message to say "Oh, item 2999999 was removed on X".
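
To make the "what has changed since I was last here" visit concrete, here is
a small Python sketch that builds a selective-harvest request using the
OAI-PMH "from" argument. The base URL and the date are placeholders, and the
function name is invented for the example; whether deletions actually show
up in the response depends on the repository's declared deletedRecord
policy.

```python
from urllib.parse import urlencode

def incremental_request(base_url, last_harvest, metadata_prefix="oai_dc"):
    """Build a ListRecords URL asking only for changes since last_harvest.

    last_harvest is a UTC date string such as "2012-02-01", in the
    granularity the repository supports.
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": last_harvest,
    }
    return base_url + "?" + urlencode(params)

if __name__ == "__main__":
    # Placeholder endpoint, not a real repository.
    print(incremental_request("http://repository.example.org/oai", "2012-02-01"))
```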

Not sure my thinking is entirely clear on this, but I don't see the two
things as being that similar (apart from the word "harvest" being used in
the context).

P.S. We never got that beer Owen!

e


Ian Ibbotson
Director
Knowledge Integration Ltd
35 Paradise Street, Sheffield. S3 8PZ
T: 0114 273 8271
M: 07968 794 630
W: http://www.k-int.com



On Fri, Feb 24, 2012 at 4:31 PM, Owen Stephens <[log in to unmask]> wrote:

> Thanks both...
>
> Kyle said: "If someone goes to the trouble of making things available via
> a protocol that exists only to make things harvestable and
> then doesn't want it harvested, you can dismiss them ..."
>
> True - but that's essentially what Southampton's configuration seems to
> say.
>
> Thomas said: "The M in PMH still stands for Metadata, right?  So opening
> an OAI-PMH server implicitly says you're willing to share metadata.  I can
> certainly sympathize with sites wanting to do that but not necessarily
> wanting to offer anything more than "normal" end-user access to full text."
>
> This is a fair point - but I've yet to see an example of a robots.txt file
> that makes this distinction - that is, in general Google is not being told
> to not crawl and cache pdfs, while being granted explicit permission to
> crawl the metadata, no matter what the OAI-PMH situation.
>
> Kyle said: "OAI-PMH runs on top of HTTP, so anything robots.txt already
> applies -- i.e. if they want you to crawl metadata only but not download
> the objects themselves because they don't want to deal with the load or
> bandwidth charges, this should be indicated for all crawlers."
>
> OK - this suggests a way forward for me. Although I don't think we can
> regard robots.txt applying across the board for OAI-PMH (as in the
> Southampton example, the OAI-PMH endpoint is disallowed by robots.txt), it
> seems to make sense that given a resource identifier in the metadata we
> could use robots.txt (and I guess potentially x-robots-tag, assuming most
> of the resources are not simple html) to see whether a web crawler is
> permitted to crawl it, and so make the right decision about what we do.
>
> That sounds vaguely sensible (although I'm still left thinking, maybe we
> should just use a web crawler and ignore OAI-PMH but I guess this way we
> maybe get the best of both worlds).
>
> Thanks again (and of course further thoughts welcome)
>
> Owen
>
> Owen Stephens
> Owen Stephens Consulting
> Web: http://www.ostephens.com
> Email: [log in to unmask]
> Telephone: 0121 288 6936
>
> On 24 Feb 2012, at 14:45, Thomas Dowling wrote:
>
> > On 02/24/2012 09:25 AM, Kyle Banerjee wrote:
> >
> >>> We use OAI-PMH, and while we often see (usually general and sometimes
> >>> contradictory) statements about what we can/can't do with the contents
> >>> of a repository (or a specific record), it feels like there isn't a
> >>> nice simple mechanism for a repository to say "don't harvest this bit".
> >>>
> >>
> >> I would argue there is -- the whole point of OAI-PMH is to make stuff
> >> available for harvesting. If someone goes to the trouble of making
> >> things available via a protocol that exists only to make things
> >> harvestable and then doesn't want it harvested, you can dismiss them as
> >> being totally mental.
> >
> > The M in PMH still stands for Metadata, right?  So opening an OAI-PMH
> > server implicitly says you're willing to share metadata.  I can certainly
> > sympathize with sites wanting to do that but not necessarily wanting to
> > offer anything more than "normal" end-user access to full text.
> >
> > That said, in a world with unfriendly bots, the repository should still
> > be making informed choices about controlling full text crawlers
> > (robots.txt, meta tags, HTTP cache directives, etc etc.).
> >
> >
> > --
> > Thomas Dowling
> > [log in to unmask]
>