
Analysis of a screw up

By Team Srijan Jul 7, 2008

Slick professionalism 

One of our clients wanted a shutdown scheduled one night at 10pm. This was a suitable job for the "at" command, and I set it up. Then I noticed the time on the system was out by several minutes. (This had nothing to do with what I'd just set up with the "at" command; it was simply hardware clock drift. The system had been running non-stop for over a year, a record we were proud of.)

I reckoned I should fix it rather than have the client wait a few extra minutes past 10pm, twiddling his thumbs. I'm a professional, don'tcha know? There are tools to handle time drift (adjtime, ntpdate, rdate, hwclock), but since this was a production server and I did not want to change anything else on it, I decided to be cautious and use the closer-to-the-bare-metal "date" command. Slick professionalism from me, indeed.

Slick stupidity 

So, I typed in the time according to what I thought was the correct syntax using date. But the moment I pressed the enter key, the system suddenly thought it was about 5 years into the future. No problem, I thought, I'll just read the man page for date and fix it. But, a few seconds later, the system (having gone through the "at" spool in its usual once-a-minute run) decided the "at" shutdown job I had scheduled for later that night was about 5 years overdue. And to my horror sent me a shutdown message.

All my smug professionalism circled the drain a few times along with the debris of messages from services that neatly shut themselves down. The system sank in front of my eyes as my fingers reflexively and futilely played a tune of despair in the key of ctrl-c. I don't remember much of the tune, but the lyrics went something like: Arghhhh. Arggggh. Arggghhhh.

You may sniff at my melodrama, and I would normally agree that a simple reboot would have sorted things out, so it would not have been a big deal. But (1) the machine was remote, (2) it belonged to an important client and (3), worst of all, I'd just configured the machine to boot up next time in an unreachable state (for a new network).

Talk about cooking your own goose. Calling up the client to explain what had happened and how to fix it was kind of sucky. 

Post mortem: the tech error 

I went through the man pages of date to see where the technical part of my error was. It turns out that the date-setting strings for BSD/Unix and Linux differ. For reference, this example date:

    Sat Jul  5 12:42:00 IST 2008

is set in Linux (month, day, hour, minute, year) by:

    date 0705124208

and in Unix/BSD (year, month, day, hour, minute) by:

    date 0807051242

Who'd have thunk it?

(Tip: the Linux version also takes this format, with a four-digit year, if you want to set the seconds field:

    date 070512422008.58
    Sat Jul  5 12:42:58 IST 2008

)

And the moral of that is: don't let the date command spank you on the ass on the way out.
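To make the difference between the two layouts concrete, here is a small Python sketch that builds both argument strings from the same timestamp. The helper names (gnu_date_arg, bsd_date_arg, read_as_gnu) are made up for this illustration, and the format strings simply mirror the examples above rather than covering every form the man pages allow. The last call shows what happens when a BSD-style string is read with the Linux layout; it is only an illustration, not a reconstruction of what was actually typed that night.

```python
from datetime import datetime

# Assumed layouts, taken from the examples in the post:
#   Linux:    month day hour minute year       -> 0705124208
#   Linux:    same with 4-digit year + seconds -> 070512422008.58
#   Unix/BSD: year month day hour minute       -> 0807051242
GNU_SHORT = "%m%d%H%M%y"
GNU_LONG = "%m%d%H%M%Y.%S"
BSD_SHORT = "%y%m%d%H%M"


def gnu_date_arg(when: datetime, with_seconds: bool = False) -> str:
    """Build the argument for the Linux-style date command."""
    return when.strftime(GNU_LONG if with_seconds else GNU_SHORT)


def bsd_date_arg(when: datetime) -> str:
    """Build the argument for the Unix/BSD-style date command."""
    return when.strftime(BSD_SHORT)


def read_as_gnu(arg: str) -> datetime:
    """Interpret a 10-digit string the way the Linux layout would."""
    return datetime.strptime(arg, GNU_SHORT)


target = datetime(2008, 7, 5, 12, 42, 58)
print(gnu_date_arg(target))                     # 0705124208
print(gnu_date_arg(target, with_seconds=True))  # 070512422008.58
print(bsd_date_arg(target))                     # 0807051242

# The same digits mean something very different under the other layout.
print(read_as_gnu(bsd_date_arg(target)))        # 2042-08-07 05:12:00
```

Running the strings through something like this on a local machine first is cheap, and nothing here touches the system clock.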

Post mortem: the process error

Still, this aspect of the error is just the technical part, and is just one of those things a sysadmin can run into. A more important aspect is the process part of the error. What was the process that allowed such a technical error to get through?

When the Nobel Laureate physicist Feynman was called in to help troubleshoot why the Challenger space shuttle blew up, he pretty quickly cut through the technobabble and showed very convincingly with a tabletop experiment how an O-ring had failed due to the cold. He explained that, yes, the O-ring was the direct cause of the disaster. But more generally, he emphasised that there was a failure of management culture that allowed the technical failure to happen. He spent most of the report explaining that this was the real failure. While not exactly analogous, where I went wrong was running the date command on a production system when I should have tested it on a local system first.

Definitely a process issue gone wrong. Everyone is tempted to take shortcuts to save time, and most of the time we get away with it. But on a production system you need to make sure everything you do is correct. So the moral of that is: test even the most mundane things thoroughly before running things on a remote production system.
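One cheap form such a test could take is a dry run that compares the proposed time with the current one and with whatever is waiting in the job queue. The Python sketch below is only an illustration under stated assumptions: clock_change_risks, the one-hour threshold and the job dictionary are all hypothetical, and on a real system the pending jobs would have to come from somewhere like the "at" queue.

```python
from datetime import datetime, timedelta


def clock_change_risks(now, proposed, pending_jobs, max_jump=timedelta(hours=1)):
    """Return warnings for a proposed clock change (hypothetical helper).

    pending_jobs maps a job label to its scheduled run time.
    """
    warnings = []
    # A drift correction should be small; a large jump suggests a typo.
    if abs(proposed - now) > max_jump:
        warnings.append(f"clock would jump by {proposed - now}")
    for label, run_at in sorted(pending_jobs.items()):
        # The job is still in the future now, but would already be
        # due (or overdue) once the clock is set to the proposed time.
        if now < run_at <= proposed:
            warnings.append(f"job '{label}' would become due immediately")
    return warnings


now = datetime(2008, 7, 5, 12, 42)
jobs = {"shutdown": datetime(2008, 7, 5, 22, 0)}  # made-up queue contents

# A few minutes of drift correction raises no warnings.
print(clock_change_risks(now, now + timedelta(minutes=4), jobs))

# A multi-year jump trips both checks.
for line in clock_change_risks(now, datetime(2013, 7, 5, 12, 42), jobs):
    print(line)
```

The point is not this particular script but the habit: have something other than the production clock tell you what the command is about to do.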

Yes, it can take 10 times as long as a result of that, and it may seem hard to justify it. But this way you save yourself getting your ass spanked by the door on your way out and toppling headfirst into that inviting pile of manure as you head off on to the road to salvation. Pardon my mixed metaphors. PJ
