Grr.. :1,$s/work/word/g (I blame FRBR ;))

e

On Fri, Feb 24, 2012 at 4:52 PM, Ian Ibbotson <[log in to unmask]> wrote:

> Sorry.. late to the discussion...
>
> Isn't this a little apples and oranges?
>
> Surely robots.txt exists because many static resources are served directly
> from a tree structured filesystem?
>
> (Nearly) all OAI requests are responded to by specific service
> applications which are perfectly capable of deciding, on a resource by
> resource basis if an anonymous user should or should not see a given
> resource. As has been said, why would you list a resource in OAI if you
> didn't think **someone** would find it useful. If you want to take
> something out of circulation, you mark it deleted so that clients
> connecting for updates know it should be removed.
>
> OAI isn't about fully enumerating a tree on every visit to see what's new,
> it's about a short and efficient visit to say "What, if anything, has
> changed since I was last here". I don't want to have to walk an entire
> repository of 3 million items to discover item 2999999 was deleted.. I want
> a message to say "Oh, item 2999999 was removed on X".
>
> Not sure my thinking is entirely clear on this, but I don't see the 2
> things as being that similar. (Apart from the work harvest being used in
> the context).
>
> P.S. We never got that beer Owen!
>
> e
>
>
> Ian Ibbotson
> Director
> Knowledge Integration Ltd
> 35 Paradise Street, Sheffield. S3 8PZ
> T: 0114 273 8271
> M: 07968 794 630
> W: http://www.k-int.com
>
>
>
> On Fri, Feb 24, 2012 at 4:31 PM, Owen Stephens <[log in to unmask]> wrote:
>
>> Thanks both...
>>
>> Kyle said: "If someone goes to the trouble of making things available via
>> a protocol that exists only to make things harvestable and
>> then doesn't want it harvested, you can dismiss them ..."
>>
>> True - but that's essentially what Southampton's configuration seems to
>> say.
>>
>> Thomas said: "The M in PMH still stands for Metadata, right?
>> So opening an OAI-PMH server implicitly says you're willing to share
>> metadata. I can certainly sympathize with sites wanting to do that but
>> not necessarily wanting to offer anything more than "normal" end-user
>> access to full text."
>>
>> This is a fair point - but I've yet to see an example of a robots.txt
>> file that makes this distinction - that is, in general Google is not being
>> told to not crawl and cache pdfs, while being granted explicit permission
>> to crawl the metadata, no matter what the OAI-PMH situation.
>>
>> Kyle said: "OAI-PMH runs on top of HTTP, so anything robots.txt already
>> applies -- i.e. if they want you to crawl metadata only but not download
>> the objects themselves because they don't want to deal with the load or
>> bandwidth charges, this should be indicated for all crawlers."
>>
>> OK - this suggests a way forward for me. Although I don't think we can
>> regard robots.txt applying across the board for OAI-PMH (as in the
>> Southampton example, the OAI-PMH endpoint is disallowed by robots.txt), it
>> seems to make sense that given a resource identifier in the metadata we
>> could use robots.txt (and I guess potentially x-robots-tag, assuming most
>> of the resources are not simple html) to see whether a web crawler is
>> permitted to crawl it, and so make the right decision about what we do.
>>
>> That sounds vaguely sensible (although I'm still left thinking, maybe we
>> should just use a web crawler and ignore OAI-PMH but I guess this way we
>> maybe get the best of both worlds).
>>
>> Thanks again (and of course further thoughts welcome)
>>
>> Owen
>>
>> Owen Stephens
>> Owen Stephens Consulting
>> Web: http://www.ostephens.com
>> Email: [log in to unmask]
>> Telephone: 0121 288 6936
>>
>> On 24 Feb 2012, at 14:45, Thomas Dowling wrote:
>>
>> > On 02/24/2012 09:25 AM, Kyle Banerjee wrote:
>> >
>> >>> We use OAI-PMH, and while we often see (usually general and sometimes
>> >>> contradictory) statements about what we can/can't do with the contents
>> >>> of a repository (or a specific record), it feels like there isn't a nice
>> >>> simple mechanism for a repository to say "don't harvest this bit".
>> >>>
>> >>
>> >> I would argue there is -- the whole point of OAI-PMH is to make stuff
>> >> available for harvesting. If someone goes to the trouble of making things
>> >> available via a protocol that exists only to make things harvestable and
>> >> then doesn't want it harvested, you can dismiss them as being totally
>> >> mental.
>> >
>> > The M in PMH still stands for Metadata, right? So opening an OAI-PMH
>> > server implicitly says you're willing to share metadata. I can certainly
>> > sympathize with sites wanting to do that but not necessarily wanting to
>> > offer anything more than "normal" end-user access to full text.
>> >
>> > That said, in a world with unfriendly bots, the repository should still be
>> > making informed choices about controlling full text crawlers (robots.txt,
>> > meta tags, HTTP cache directives, etc etc.).
>> >
>> >
>> > --
>> > Thomas Dowling
>> > [log in to unmask]
>> >
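
For anyone wanting to try the approach Owen describes (take a resource identifier found in harvested metadata and ask whether robots.txt would let a crawler fetch it), here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the host `repo.example.org`, the user agent name `example-harvester` and the helper names are all made up for illustration; they are not taken from Southampton or any other real repository. It only covers robots.txt, not x-robots-tag headers, which would need an actual HTTP request per resource.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary repository host (an assumption,
# not copied from any real site): the OAI-PMH endpoint and the full-text
# area are disallowed, everything else is open.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi/oai2
Disallow: /fulltext/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched by some other means."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, resource_url: str) -> bool:
    """Return True if robots.txt permits user_agent to fetch resource_url."""
    return parser.can_fetch(user_agent, resource_url)


if __name__ == "__main__":
    parser = build_parser(SAMPLE_ROBOTS_TXT)
    # Identifiers of the kind that might appear in a harvested record (made up).
    for url in (
        "http://repo.example.org/record/123",
        "http://repo.example.org/fulltext/123.pdf",
    ):
        print(url, may_fetch(parser, "example-harvester", url))
```

In a real harvester the robots.txt text would be fetched once per host and cached, so that checking each identifier does not cost an extra request.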