> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Thomas Dowling
> Sent: Wednesday, September 02, 2009 10:25 AM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] FW: PURL Server Update 2
> The III crawler has been a pain for years and Innovative has shown no
> interest in cleaning it up.  It not only ignores robots.txt, but it
> hits target servers just as fast and hard as it can.  If you have a
> lot of links that a lot of III catalogs check, its behavior is
> indistinguishable from a DOS attack.  (I know because our journals
> server often used to crash about 2:00am on the first of the month...)

I see that I didn't fully make the connection to the point I was 
making, which is that there are hardware solutions to these issues 
rather than relying on robots.txt or sitemap.xml.  If a user agent 
is a problem, then the network folks should configure the router to 
ignore that user agent, or to throttle the number of requests it is 
allowed to make to the server.
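The throttling I have in mind works like a token bucket keyed on the
user agent string.  A minimal sketch in Python (illustrative only; in
practice the router or load balancer enforces this, not application
code, and the class and method names here are my own invention):

```python
import time
from collections import defaultdict


class UserAgentThrottle:
    """Token-bucket request throttle keyed by User-Agent string.

    Hypothetical sketch of router-style rate limiting: each user
    agent gets its own bucket that refills at `rate_per_sec` tokens
    per second, up to `burst` tokens.
    """

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec    # tokens added per second
        self.burst = burst          # maximum bucket size
        # bucket state: (tokens remaining, last refill timestamp)
        self.buckets = defaultdict(lambda: (burst, time.monotonic()))

    def allow(self, user_agent):
        """Return True if this request is within the agent's limit."""
        tokens, last = self.buckets[user_agent]
        now = time.monotonic()
        # refill based on elapsed time, capped at the burst size
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[user_agent] = (tokens - 1, now)
            return True
        self.buckets[user_agent] = (tokens, now)
        return False
```

A crawler that ignores robots.txt and hammers the server simply runs
its bucket dry and gets its extra requests dropped, while well-behaved
agents are unaffected.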

In the case you point to, with III hitting the server as fast as it 
can and looking to the network like a DOS attack that crashed the 
server, two things stand out: 1) the router hadn't been set up to 
impose throttling limits on user agents, and 2) the server probably 
wasn't part of a load-balanced server farm.
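On the second point, load balancing just means spreading incoming
requests across a pool of servers so no single machine takes the full
brunt of a crawler.  A minimal round-robin sketch (backend names are
made up for illustration):

```python
import itertools


class RoundRobinBalancer:
    """Cycle requests across a pool of backend servers.

    Hypothetical sketch: a real deployment would use a hardware
    load balancer or something like HAProxy, with health checks.
    """

    def __init__(self, backends):
        # itertools.cycle yields backends in order, repeating forever
        self._cycle = itertools.cycle(backends)

    def next_backend(self):
        """Return the backend that should serve the next request."""
        return next(self._cycle)
```

Even a burst from an aggressive crawler is then divided across the
farm instead of landing on one journals server at 2:00am.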

In the case of GPO, they mentioned, or implied, that they were
having contention issues with user agents hitting the server
while they were trying to restore the data.  This contention could
be mitigated by imposing lower throttling limits on user agents in 
the router until the data is restored, and then raising the limits 
back to whatever their prescribed SLA (service level agreement) 
specifies.
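That is, the rate ceiling is just a setting you turn down during the
restore and turn back up afterwards.  A trivial sketch (the class and
the numbers are hypothetical, not anything GPO actually runs):

```python
class AdjustableRateLimit:
    """A request-rate ceiling that can be lowered for maintenance
    and restored to the SLA value afterwards (illustrative only)."""

    def __init__(self, sla_rate):
        self.sla_rate = sla_rate        # normal contracted rate
        self.current_rate = sla_rate    # rate enforced right now

    def enter_maintenance(self, reduced_rate):
        """Throttle user agents harder while data is being restored."""
        self.current_rate = reduced_rate

    def exit_maintenance(self):
        """Restore the normal SLA rate once the restore finishes."""
        self.current_rate = self.sla_rate
```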

You really don't need a document on the server to tell user agents 
what to do.  You can, and should, impose a network policy on user 
agents, which in my opinion is a far better solution.