On 09/02/2009 09:55 AM, Houghton,Andrew wrote:
>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>> David Fiander
>> Sent: Wednesday, September 02, 2009 9:32 AM
>> To: [log in to unmask]
>> Subject: Re: [CODE4LIB] FW: PURL Server Update 2
>>
>> If Millennium is acting like a robot in its
>> monthly maintenance processes, then it should be checking robots.txt.
> 
> User agents are *not required* to check robots.txt...


That's probably why David said "should".  Or, "SHOULD".  :-)
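
For reference, honoring robots.txt takes only a few lines in most
languages.  A minimal sketch in current Python, using the standard
library's robotparser (the host and user-agent name here are made up
for illustration, not anything Millennium actually uses):

    import urllib.robotparser

    # Fetch and parse the target site's robots.txt before checking links.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.org/robots.txt")
    rp.read()

    # Ask whether this (hypothetical) user-agent may fetch a given URL.
    url = "http://example.org/journals/current/"
    if rp.can_fetch("ExampleLinkChecker", url):
        print("allowed to check", url)
    else:
        print("robots.txt disallows", url)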

The III crawler has been a pain for years, and Innovative has shown no interest
in cleaning it up.  It not only ignores robots.txt, but it hits target servers
just as fast and hard as it can.  If you have a lot of links that a lot of III
catalogs check, its behavior is indistinguishable from a DoS attack.  (I know
because our journals server often used to crash around 2:00am on the first of
the month...)
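
By contrast, a polite link checker paces its requests.  A rough sketch,
assuming a fixed one-second delay between requests (an arbitrary choice
for illustration, not any standard):

    import time
    import urllib.error
    import urllib.request

    def check_links(urls, delay=1.0):
        """HEAD-check each URL, pausing between requests to spread the load."""
        for url in urls:
            req = urllib.request.Request(url, method="HEAD")
            try:
                with urllib.request.urlopen(req) as resp:
                    print(url, resp.status)
            except urllib.error.URLError as err:
                print(url, "failed:", err.reason)
            time.sleep(delay)  # one request per second, not as fast as possible

    check_links(["http://example.org/journals/v1/"])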

> 
> However, if you have publicly accessible URIs, it is highly unlikely
> that you would restrict access to those URIs in your robots.txt.
> It kind of defeats the purpose of the URIs being *public*.


True.  I think it's reasonable at this point to ask crawlers to support
both robots.txt and sitemap.xml, the latter telling them which pages are
public and how often to check them.
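
For what it's worth, sitemap.xml already carries a per-URL <changefreq>
hint, which is exactly the "how often to check" signal a link checker
needs.  A sketch of reading those hints (the sitemap URL is a
placeholder):

    import urllib.request
    import xml.etree.ElementTree as ET

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    # Read the sitemap and report each URL's suggested revisit frequency.
    with urllib.request.urlopen("http://example.org/sitemap.xml") as resp:
        tree = ET.parse(resp)

    for url in tree.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        freq = url.findtext("sm:changefreq", default="(none)", namespaces=NS)
        print(loc, freq)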


-- 
Thomas Dowling
[log in to unmask]