Destroying a server with ignorance
#46 Henry, Friday, 05 February 2010 9:56 PM (Category: Work)
(Tags: hylafax raid)

I had a server at work lock up on me. It's an older Fedora installation with no patches applied, and it handles our outgoing faxes. We send a thousand faxes a day, and I still wonder why these customers haven't switched to email and what specifically they prefer about faxes.

Something happened to the server. It managed to send an alarm to me, and when I went to look at it, I could not ssh in. Not the usual "cannot connect" or "connection refused", but something more obscure like "cannot obtain resources to establish this connection". I can't remember the exact wording, as I was hyped up by this time and trying to get things working quickly. We looked at the console and it was just repeatedly scrolling errors about journalling and a partition. We couldn't get a root login, couldn't get access to it at all. I hit the switch and powered it down.

When I powered it back up, I got into single user mode. If I see a journalling error, that usually means that something is hosed in the filesystem. I usually run fsck to clean up the filesystems, and that's what I attempted to do. /dev/sdc had three partitions and I ran fsck on each of them. Two were clean, one had a whole mess of errors. After the cleanup, I rebooted. That was a complete failure. Kernel panic of a major sort.

Kernel panic

This was a turnkey system with Hylafax installed. My boss discovered we still had email support on it. He asked me to write to them and tell them what happened. I did. They reported back that my running fsck on the partitions was the thing that completely hosed it. It's using LVM on hardware RAID, and I totally screwed one of the partitions. Our only recourse was to reinstall the OS and restore from backup.

This doesn't explain the original problem. But I am quite willing to put my hand up and say "Yes, I totally screwed the fax server". I'm never sure what to do with these RAID things when the filesystem goes bad. I am going to have to do some more study on this and learn what needs to be done to properly handle RAID.
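As a starting point for that study, one check that could have been made before reaching for fsck is asking what each partition actually holds. The following is a minimal sketch, not what was run on the fax server. It assumes `blkid` is available in single user mode and relies on the type strings it reports, such as `LVM2_member` for an LVM physical volume. The function names `classify` and `probe` are made up for illustration. When a partition is an LVM member, the filesystem lives on a logical volume inside it, so the raw partition is not the thing to check.

```python
import subprocess

# Types blkid reports for containers, not for mountable filesystems.
# A partition of one of these types should not be handed to fsck directly.
CONTAINER_TYPES = {"LVM2_member", "linux_raid_member", "crypto_LUKS", "swap"}

# Filesystem types handled by the ext2/ext3/ext4 checker.
EXT_TYPES = {"ext2", "ext3", "ext4"}


def classify(blkid_type: str) -> str:
    """Sort a blkid TYPE string into a rough category (hypothetical helper)."""
    t = blkid_type.strip()
    if not t:
        return "unknown"      # blkid found no signature it recognises
    if t in CONTAINER_TYPES:
        return "container"    # look inside (e.g. at the logical volumes) instead
    if t in EXT_TYPES:
        return "ext"          # a plain ext filesystem sits on this device
    return "other"            # some other filesystem; needs its own checker


def probe(device: str) -> str:
    """Ask blkid for the TYPE of a device; returns '' if nothing is reported."""
    result = subprocess.run(
        ["blkid", "-o", "value", "-s", "TYPE", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # The device names below are assumptions based on the three partitions
    # mentioned in the post; adjust for the real layout.
    for dev in ("/dev/sdc1", "/dev/sdc2", "/dev/sdc3"):
        print(dev, classify(probe(dev)))
```

A partition that comes back as "container" is a cue to stop and look at the volume layout first, rather than proof of what went wrong here.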

But anyway, our network guy installed CentOS on it, and the support company put Hylafax back on and configured it, and then I checked everything out of Subversion and installed our software on top of it again. Lots of little irritating things had changed. CentOS did not install the libxml2 headers by default, Subversion wasn't installed, and I couldn't even find strace. Looks like CentOS is not a development-style distribution. Nonetheless, our network guy got all these things installed for me, and I was able to complete the installation. There were some changes necessary. To accommodate a changed directory layout, I modified our main daemon and introduced new config items and then applied them. It all worked, and faxes started flowing again after 9 hours of fax outage. It took an hour to send the 800 backlogged faxes. Then I got to go home, in the snow.
