This is just a reflection on the earlier name resolution incident. I
find it remarkable how much goes into solving a problem and, as a
corollary, how much impact a simple problem can have. Just my braindump
as a relatively novice sysadmin.
Here's the chain of events:
- This morning at 9 a.m., our web server choked; I saw Apache had hit
its MaxClients limit
- After poking around the various daemons and looking at logs, I figured
out that the services themselves were all running correctly
- I somehow narrowed it down to the script that pings the OCLC chat
availability service: each request hung for 20+ seconds before finally
timing out, *despite* the fact that I thought it was set up with a
2-second timeout (I don't remember how I got it down to that; see the
timeout sketch after this list)
- I shut that down temporarily and disabled our chat function, which got
the server back to normal.
- I browsed the service manually, which worked, then tried two different
techniques in the PHP code (file_get_contents() and curl), both of which failed.
- I went to Brooklyn to do some vigilante digitization and have lunch
with my boss
- I got back to the office, saw nothing had changed, and started digging
deeper into the curl request
- I found the name resolution error, which blew my mind
- I tried resolving the hostname multiple ways and, failing that, came
here (a quick resolution check is also sketched below)
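For my own notes, here is roughly how I'd now cap the request time on
both approaches. This is only a sketch: the URL is a placeholder, not
the real OCLC endpoint, the 2 seconds matches what I thought we had
configured, and as far as I can tell neither method fully bounds a hung
DNS lookup, which seems to be exactly what bit us.

<?php
// Sketch only; $url stands in for the real OCLC availability endpoint.
$url = 'http://chat.vendor.example/availability';

// file_get_contents(): the stream-context timeout caps the socket read,
// but a hanging DNS lookup happens before that and (as far as I can
// tell) is not bounded by it.
$ctx = stream_context_create(array(
    'http' => array('timeout' => 2),
));
$body = @file_get_contents($url, false, $ctx);

// curl: CURLOPT_CONNECTTIMEOUT covers the connect phase (including name
// resolution) and CURLOPT_TIMEOUT caps the whole request. Supposedly,
// libcurl builds that use the blocking resolver can still overshoot
// these limits while resolving.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($ch, CURLOPT_TIMEOUT, 2);
$body = curl_exec($ch);
if ($body === false) {
    error_log('chat availability check failed: ' . curl_error($ch));
}
curl_close($ch);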
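And here is the guard I wish the script had had: check that the vendor
hostname still resolves before pinging it, and just mark chat
unavailable if it doesn't. Again only a sketch with a made-up hostname;
gethostbyname() is itself a blocking lookup, so this only helps when the
resolver fails quickly rather than hangs.

<?php
// Sketch: fail fast if the vendor hostname stops resolving, so the
// availability ping never gets a chance to tie up Apache children.
$host = 'chat.vendor.example'; // placeholder, not the real OCLC hostname

// gethostbyname() returns the unmodified hostname when resolution fails.
$ip = gethostbyname($host);
if ($ip === $host) {
    error_log("DNS lookup failed for $host; skipping chat availability check");
    return; // or mark chat as unavailable and carry on
}

// checkdnsrr() is another quick sanity check for an A record.
if (!checkdnsrr($host, 'A')) {
    error_log("No A record found for $host");
}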
Thanks to all who contributed ideas... amazing how one change to a
vendor DNS server can lead to our web server DoS'ing itself. More
networking knowledge... must get more networking knowledge...
--
Yitzchak Schaffer
Systems Manager
Touro College Libraries
212.742.8770 ext. 2432
http://www.tourolib.org/
Access Problems? Contact [log in to unmask]