LISTSERV 16.5 - CODE4LIB Archives

Grr..

:1,$s/work/word/g

(I blame FRBR ;))

e


On Fri, Feb 24, 2012 at 4:52 PM, Ian Ibbotson <[log in to unmask]> wrote:

> Sorry.. late to the discussion...
>
> Isn't this a little apples and oranges?
>
> Surely robots.txt exists because many static resources are served directly
> from a tree structured filesystem?
>
> (Nearly) all OAI requests are responded to by specific service
> applications which are perfectly capable of deciding, on a resource by
> resource basis if an anonymous user should or should not see a given
> resource. As has been said, why would you list a resource in OAI if you
> didn't think **someone** would find it useful. If you want to take
> something out of circulation, you mark it deleted so that clients
> connecting for updates know it should be removed.
>
> OAI isn't about fully enumerating a tree on every visit to see what's new,
> it's about a short and efficient visit to say "What, if anything, has
> changed since I was last here". I don't want to have to walk an entire
> repository of 3 million items to discover item 2999999 was deleted.. I want
> a message to say "Oh, item 2999999 was removed on X".
>
> Not sure my thinking is entirely clear on this, but I don't see the 2
> things as being that similar. (Apart from the work harvest being used in
> the context).
>
> P.S. We never got that beer Owen!
>
> e
>
>
> Ian Ibbotson
> Director
> Knowledge Integration Ltd
> 35 Paradise Street, Sheffield. S3 8PZ
> T: 0114 273 8271
> M: 07968 794 630
> W: http://www.k-int.com
>
>
>
> On Fri, Feb 24, 2012 at 4:31 PM, Owen Stephens <[log in to unmask]> wrote:
>
>> Thanks both...
>>
>> Kyle said: "If someone goes to the trouble of making things available via
>> a protocol that exists only to make things harvestable and
>> then doesn't want it harvested, you can dismiss them ..."
>>
>> True - but that's essentially what Southampton's configuration seems to
>> say.
>>
>> Thomas said: "The M in PMH still stands for Metadata, right?  So opening
>> an OAI-PMH server implicitly says you're willing to share metadata.  I can
>> certainly sympathize with sites wanting to do that but not necessarily
>> wanting to offer anything more than "normal" end-user access to full text."
>>
>> This is a fair point - but I've yet to see an example of a robots.txt
>> file that makes this distinction - that is, in general Google is not being
>> told to not crawl and cache pdfs, while being granted explicit permission
>> to crawl the metadata, no matter what the OAI-PMH situation.
>>
>> Kyle said: "OAI-PMH runs on top of HTTP, so anything robots.txt already
>> applies -- i.e. if they want you to crawl metadata only but not download
>> the objects themselves because they don't want to deal with the load or
>> bandwidth charges, this should be indicated for all crawlers."
>>
>> OK - this suggests a way forward for me. Although I don't think we can
>> regard robots.txt as applying across the board for OAI-PMH (as in the
>> Southampton example, the OAI-PMH endpoint is disallowed by robots.txt), it
>> seems to make sense that given a resource identifier in the metadata we
>> could use robots.txt (and I guess potentially x-robots-tag, assuming most
>> of the resources are not simple html) to see whether a web crawler is
>> permitted to crawl it, and so make the right decision about what we do.
>>
>> That sounds vaguely sensible (although I'm still left thinking, maybe we
>> should just use a web crawler and ignore OAI-PMH but I guess this way we
>> maybe get the best of both worlds).
>>
>> Thanks again (and of course further thoughts welcome)
>>
>> Owen
>>
>> Owen Stephens
>> Owen Stephens Consulting
>> Web: http://www.ostephens.com
>> Email: [log in to unmask]
>> Telephone: 0121 288 6936
>>
>> On 24 Feb 2012, at 14:45, Thomas Dowling wrote:
>>
>> > On 02/24/2012 09:25 AM, Kyle Banerjee wrote:
>> >
>> >>> We use OAI-PMH, and while we often see (usually general and sometimes
>> >>> contradictory) statements about what we can/can't do with the contents of a
>> >>> repository (or a specific record), it feels like there isn't a nice simple
>> >>> mechanism for a repository to say "don't harvest this bit".
>> >>>
>> >>
>> >> I would argue there is -- the whole point of OAI-PMH is to make stuff
>> >> available for harvesting. If someone goes to the trouble of making things
>> >> available via a protocol that exists only to make things harvestable and
>> >> then doesn't want it harvested, you can dismiss them as being totally
>> >> mental.
>> >
>> > The M in PMH still stands for Metadata, right?  So opening an OAI-PMH
>> > server implicitly says you're willing to share metadata.  I can certainly
>> > sympathize with sites wanting to do that but not necessarily wanting to
>> > offer anything more than "normal" end-user access to full text.
>> >
>> > That said, in a world with unfriendly bots, the repository should still be
>> > making informed choices about controlling full text crawlers (robots.txt,
>> > meta tags, HTTP cache directives, etc etc.).
>> >
>> >
>> > --
>> > Thomas Dowling
>> > [log in to unmask]
>>
>
>
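
For anyone wanting to try the approach Owen describes in the quoted message
(take a resource identifier from the harvested metadata and ask whether
robots.txt would let a web crawler fetch it), here is a minimal sketch using
Python's standard urllib.robotparser. The host name, the robots.txt rules, the
"example-harvester" user agent and both helper function names are made up for
illustration; nothing here reflects Southampton's or any other repository's
actual configuration. It only covers robots.txt and does not look at
X-Robots-Tag response headers.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def robots_url_for(resource_url: str) -> str:
    """Return the robots.txt URL for the host that serves resource_url."""
    parts = urlsplit(resource_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def may_fetch(robots_txt: str, user_agent: str, resource_url: str) -> bool:
    """Check a resource URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, resource_url)


# Hypothetical rules: record pages are open, full-text files are not.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /fulltext/
"""

if __name__ == "__main__":
    # Identifiers as they might appear in a harvested record (assumed URLs).
    for url in (
        "http://repository.example.org/record/123",
        "http://repository.example.org/fulltext/123.pdf",
    ):
        allowed = may_fetch(EXAMPLE_ROBOTS, "example-harvester", url)
        print(robots_url_for(url), url, "allowed" if allowed else "disallowed")
```

In a real harvester you would fetch the file at robots_url_for(url) once per
host and cache it, rather than using an inline string as this sketch does.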