The longest day

Oy vey, what a long day yesterday was.

On Wednesday at about 4pm one of the servers crashed. It's one of two Dell PowerEdge 1850s we have, and they both have this issue where they'll suddenly have a RAID timeout and throw the drives into read-only mode. This, as you can guess, makes things tricky, especially as no error messages can be written to the logs. In fact, it was only because I happened to be sitting watching one of the machines when it did this that I know it's a RAID timeout. Otherwise you just get “Input/Output errors” or seg faults when running most commands. A hard reboot at this point is generally sufficient to get things back (sometimes with a quick fsck).
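For what it's worth, the read-only state itself can be spotted from the mount table even when nothing reaches the logs. Below is a minimal sketch, assuming a Linux box that exposes `/proc/mounts` in the usual format; the function name `read_only_mounts` and the `SAMPLE` data are made up for illustration, and this is not something we actually had running at the time.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains a bare 'ro'."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        # Split on commas so 'errors=remount-ro' is not mistaken for 'ro'.
        if "ro" in options.split(","):
            result.append(mount_point)
    return result


# Hypothetical sample in /proc/mounts format, used when the real file is absent.
SAMPLE = """\
/dev/sda1 / ext3 ro,errors=remount-ro 0 0
/dev/sda2 /var ext3 rw,relatime 0 0
proc /proc proc rw,nosuid 0 0
"""

if __name__ == "__main__":
    try:
        with open("/proc/mounts") as f:
            text = f.read()
    except OSError:
        text = SAMPLE
    print(read_only_mounts(text))
```

Since a read-only disk can't record the result locally, a check like this would presumably have to report over the network or be run from another machine. Filesystems that are read-only on purpose will show up too, so the output needs filtering against what is expected.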

Dell - albeit going on my original description, from before I saw the RAID timeouts - were unable to help, but eventually we found they had a BIOS flash for the RAID that fixes “[i]ncreased memory timing margins to address the following potential symptoms: System hangs during operating system installs or other heavy I/O, Windows Blue Screens referencing ‘Kernel Data Inpage Errors’ or Linux Megaraid timeouts/errors.”

We'd arranged with the colo host to install it on Thursday.

So, with the crash on Wednesday, a reboot was requested and duly provided. However, the machine did not reboot and further investigation showed it to be dead. No video, no POST - not even BIOS error beeps. Fortunately, we have 24x7 support from Dell and I went down to the colo to be by the machine whilst I called them.

Over the phone, having gone through the basics such as disconnecting the drives, they decided it might be the motherboard and said they'd send over a new one. Our support contract is a 4-hour one and, it being 8pm and not wanting to hang around the colo until the parts and engineer arrived, I decided to take the server home and get them to come there. In hindsight, a very good decision.

The new mobo arrived at about 10pm and the engineer called at midnight - he was a little lost as Dell (who contract their engineers from Siemens apparently) had misspelled my address. He was unfortunately on the other side of the Blackwall tunnel and they were doing some work on that, so he eventually arrived around 1.30am and replaced the motherboard in the machine.

Sadly, the motherboard wasn't the problem and we remained video, POST and BIOS beep free. The engineer, who was a nice Russian chap, thought it was possibly the “riser” though Dell support suggested, as they had to me, that there wasn't a sufficient kink in the cable connection between the front plate and the backplane. In any event a new riser and, just in case, a power supply were ordered. They would arrive in 2-4 hours so the engineer went home and I went back to watching DVDs and trying not to fall asleep.

I sort of failed because I was woken by the delivery guy phoning to ask where my address was, since Dell had misspelled it. This was around 5am I think, so I called the engineer and told him the good news. He got over by about 6am and fitted the new riser and, lo and behold, we had video and POST again...

Except we no longer had a RAID controller. Which is bad. After re-installing the original motherboard to check if that fixed it, the engineer surmised the DDR memory on the riser had gone bad and taken out the original riser in the process. With the new riser and the old memory we could boot again, but the RAID controller wasn't available. Another order was put in for new memory and the engineer went home and off-shift.

The third delivery of the night arrived at 9am. A replacement engineer arrived at midday. The new memory worked and, after re-seating one of the drives, things were booting. The replacement engineer left and I only had to get the remaining disk errors fixed and the server running again.

By 3pm I'd cleaned up the system, checked all the databases and run the few processes that hadn't completed the day before because the crash had happened in the middle of them. By 5pm, it was back in the colo and by about 8pm the network had been reconfigured and it was up again.
