When we monitor a site, we usually think of looking through logs or a GUI front end like Munin to see how the server is doing. But if the site itself is down, how do you access this data? Well, you could figure out some way to do this (a common method is to pipe the logs to another location so you have a chance to diagnose things when they go wrong).
But most of the time you just want to be alerted and to run basic remote checks on failure to narrow down the possibilities. The most common things to check remotely are, in my experience:
1. Is the webserver responding?
2. Is the response ok or garbage?
3. Is the connection to the internet broken somewhere along the path?
4. Is the DNS system broken?
You can add more, but these seem to be the most useful things to look at remotely.
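To make those four checks concrete, here is a minimal sketch in Python using only the standard library. It is not what basicstate runs, just one way such checks could look. The function names, the `marker` string (some text you expect on a healthy page) and the timeouts are all my own assumptions, and the path check simply shells out to `traceroute`, which is assumed to be installed.

```python
import socket
import subprocess
import urllib.error
import urllib.request


def check_dns(host, resolver=socket.getaddrinfo):
    """Check 4: does the hostname resolve at all?"""
    try:
        resolver(host, 80)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def response_looks_ok(status, body, marker):
    """Check 2: a 200 whose body contains text we expect on a healthy page."""
    return status == 200 and marker in body


def check_http(url, marker, timeout=10):
    """Checks 1 and 2: returns (responding, ok)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return True, response_looks_ok(resp.status, body, marker)
    except urllib.error.HTTPError:
        # The server answered, but with an error status.
        return True, False
    except (urllib.error.URLError, OSError):
        # No usable answer: refused, timed out, unresolvable and so on.
        return False, False


def trace_path(host, timeout=60):
    """Check 3: capture traceroute output for a human to read later."""
    try:
        done = subprocess.run(
            ["traceroute", host],
            capture_output=True, text=True, timeout=timeout,
        )
        return done.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "traceroute unavailable: %s" % exc
```

You would call something like `check_http("http://www.example.com/", "Welcome")` from the monitoring box, and only bother with `check_dns` and `trace_path` when the first check says the server is not responding.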
So, how do we implement it?
To do the job, you can roll out a heavyweight Nagios or whatever. Or you can go the lightweight route and script up something customized - that's easy too, but also invariably ugly. I like lightweight stuff, but something pretty is better for future maintenance once I leave Srijan.
Fortunately we are spared such ugliness because there is a free web service (free-as-a-service) with a nice GUI to do it all for you. So monitoring becomes an SEP (somebody else's problem), the only catch being that they gather data from the server diagnostics over time.
The service is called basicstate and is pretty slick. Plus they had responsive and clear customer support for some questions I had. There are two problems with this service, though.
1. It doesn't (yet) offer SMS alerts on the server being down for Indian users (the SMS gateways they provide only seem to work sporadically, and when they do work, the delivery delay is often 24 hours).
2. Even if an SMS alert did arrive reporting a problem, by the time the alert is actually noticed the server may be back up and reachable again.
So effectively we have a false positive - though I should point out it is not really a false positive, just a result of excess sensitivity for our purpose. Using the basicstate web interface, we can check if the fault persists for 15 minutes or more, which will filter out most of the false positives. But we can't check for persistence over any window shorter than 15 minutes (using the basicstate interface).
It is not such a big deal, because most people don't need that short a persistence check, but I decided it would be a useful feature to have for diagnostic reasons, especially for our more critical sites. We can solve both these problems (no SMS service and too long a gap between detection and follow-up diagnostics) with procmail and an SMS delivery specialist.
Webtext is an SMS delivery specialist, and like basicstate they have a slick interface and responsive staff. They aren't free, though. But they are pretty low-priced for a low-volume SMS service (I'm thinking of 100 SMSs a year). (Aside, on choosing outfits that are small, like basicstate and webtext: my experience has given me a lot of faith in the classic Avis principle of smaller outfits trying harder, and therefore often having better products.) Where were we? Oh, yeah, I was covering how I was fixing some problems with basicstate, namely by adding tools from webtext (the SMS service) and procmail.
Procmail is an old mail filter tool whose capabilities I've grown to respect. I can't say I like it much - its recipe language is a bit strange, and you sometimes have to do some head scratching about odd quirks to use it (De Morgan's theorem to implement or-like behaviour with ands and negatives is a particular favourite of mine that I wheel out of the recycle bin of my memory most times I use it). But yeah, it gets my Respect (note the capitalization) because I've seen it handle crazy loads and remain standing when all other contenders get bogged down and croak and flap their arms around uselessly. Fortunately there is a lot of excellent material on the net that covers how to do procmail recipes (including the De Morgan trick and its ilk). Nancy McGough's procmail quick start is a great start (http://www.ii.com/internet/robots/procmail/qs/).
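If the De Morgan trick is rusty for you too, here it is in a few lines of Python rather than in procmail's recipe syntax. The idea is that when all you have are conditions that get ANDed together and a way to negate them, "A or B" can still be written as "not (not A and not B)". The function names and the subject keywords below are made up purely for illustration.

```python
def either(a, b):
    """a OR b, built from AND and NOT only (De Morgan)."""
    return not ((not a) and (not b))


def is_alert_mail(subject):
    # Hypothetical subject keywords, purely for illustration.
    return either("down" in subject, "timeout" in subject)
```

So `is_alert_mail` matches a subject containing either keyword, even though the only combining operator used is `and`.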
Joining the pieces
So, how do you combine all these beasts?
1. get your mail from basicstate handled by procmail
2. in procmail, during the filtering, run a diagnostic immediately on the site that is having a problem (webserver, DNS, traceroute)
3. still in procmail (if the above diagnostic showed a problem) wait a little, say 3 minutes or whatever you decide, then run another diagnostic.
(i) If there is *still* a problem, then send an SMS to the sysadmin's cellphone via webtext, and also a copy of the diagnostic results to the sysadmin email account.
(ii) If there is no longer a problem, forget it.
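The steps above can be sketched as a small Python function that procmail could pipe the alert mail to. This is a sketch under assumptions, not my actual recipe: `diagnose`, `notify_sms` and `notify_email` are hypothetical callables you would supply yourself (the real webtext API is not shown here), and the 180 second delay is just the "3 minutes or whatever you decide" from step 3.

```python
import time


def handle_alert(site, diagnose, notify_sms, notify_email,
                 delay_seconds=180, sleep=time.sleep):
    """Confirm a reported fault before alerting anyone.

    diagnose(site) must return (problem_found, report_text).
    Returns True if an alert was sent, False otherwise.
    """
    # Step 2: run a diagnostic as soon as the alert mail arrives.
    problem, _ = diagnose(site)
    if not problem:
        return False

    # Step 3: give a short blip time to clear, then look again.
    sleep(delay_seconds)
    problem, report = diagnose(site)
    if not problem:
        # (ii) It recovered on its own, so forget it.
        return False

    # (i) Still broken: SMS the sysadmin and mail the diagnostic results.
    notify_sms("%s still failing after %d s" % (site, delay_seconds))
    notify_email(report)
    return True
```

Passing `sleep` and the notifiers in as arguments keeps the waiting and the SMS sending out of the logic, so you can try the whole flow with fakes before pointing it at a real phone.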
There you are - a lightweight remote monitoring system with SMS alerts and fine tuning to the drop level for false positives, and diagnostics implemented the way you like. How splendid. It means you don't end up red-eyed and bleary and sharpening your axe so you can slaughter your colleagues the morning after. It's working pretty well.
Most failures are network connectivity failures; one of them seems to be a webserver overload failure ("seems" because I have yet to examine the issue in detail, since the junior sysadmin should have got off his pimply ass some time ago to have a look at it, but hasn't).
Anyway, this simple monitoring system has probably saved me a lot of eyeballing of logs and graphs and lets me be pretty much the first to know when something is wrong - which is as it should be. PJ