This morning, I came to work to find a server dying a not-so-slow death. The machine in question serves as a firewall, filtering bridge, name server, and DHCP server for several networks in the Computer and Network System Administration (CNSA) program.
The CNSA program doesn't have any money right now, so it gets machines that have been rotated out of normal production use. The ailing server was once part of an ECE cluster, and is a Pentium 4 in a bargain-bin 1U rack case.
One of the joys of "inexpensive" (okay, cheap) 1U rack cases is inexpensive (cheap), tiny cooling fans. These fans, with blades around one inch in diameter, must spin fast and hard to move enough air to cool the CPU, disks, power supply, and anything else in the case that consumes power. When they fail, they do so catastrophically.
Log reports clued me in to the problem: the disks were failing. I went down to the ad-hoc CNSA server room and was welcomed by a loud buzzing. I first looked at the console (yep, one of the RAID-1 members was dead) and shut down the machine. I pulled it from the rack and powered it back on with the cover removed. It sounded like an old mechanical fire alarm buzzer, and shook like a concrete vibrator.
I found the offending fan, unplugged it (hrmph, there are plenty more in there), and powered the machine back up with a new disk in place of the failed one. And we all lived happily ever... hang on a second, what's this? The second disk is failing now too?
After a few moments of head-scratching, I formulated a theory. It goes like this: the fan failed in a way that caused vibration with an amplitude great enough to either corrupt file systems (disk heads not tracking properly) or actually crash the disk heads into the platters. Headline: FAN-OF-DEATH KILLS DISKS, SERVER, AREA SYSTEM ADMINISTRATOR. Anyway, I think my theory is more likely than two disks failing independently at the same time after a year or so of operation.
So, today's lesson is all about unexpected failure modes, and the surprising effects of a catastrophic cooling fan failure. Happy computing!