The Friday Taro Event

From Nuclear Physics Group Documentation Pages

Hello Matt and Steve,

We had a rather serious event which we need to discuss as the "sysadmins" for the NPG group.

On Friday I tried to access Taro, only to get booted right out of my ssh session by an error in the /etc/profile script. Rather surprised, I tried to log in as root from einstein, and the same thing happened. Accessing the taro drives from einstein, I found I could not write to my directory on /net/data/taro/maurik, even though I own the directory. Rather strange. This got me a bit upset, since I had to leave for home in 45 minutes for an event at my children's school.

I went down to the server room, 202, and found it about as hot as I have ever seen a room with computers in it. Must have been 86 F or more. Both air conditioners were off. The over-temperature alarms of several systems (at least the one on Taro, I am sure, since it also has a light) were wailing. Taro could not execute /sbin/shutdown, so I just turned it off. I also turned off pepper, tomato, lentil, gourd, jalapeno and all the Paulis. I made sure to leave einstein and rontgen running. It took a while with the door open and the air conditioners on full blast to cool off the room, then I rebooted pepper, taro and jalapeno.

Taro came back with a degraded RAID, disk 9 failure. I did not have time to do anything about this drive. Pepper came back OK, Jalapeno came back OK, as did Lentil and Gourd. I did not try to restart Tomato (case is open !??) or Okra?? or any of the Paulis. I wanted a reduced heat load since the room was still too warm.

When I came in today, Monday, I checked up on the systems and found Taro down with a kernel panic. This is potentially a very serious issue. It may be that we damaged a CPU or motherboard due to the extended period of over-temperature. The logs indicate that Taro stopped functioning on Friday at 1am; however, it was probably already overheating before that. Disk #9 from Taro was now totally dead. I replaced it with one of the WD 500 drives from Lentil (which is 200MB larger than needed, but will do). This disk was empty. I think it is labelled NPG-Daily-30.

What to do:

Top Priority:

a) Matt: Insulate the f*&%#!g heating pipes that are heating the room as we are trying to cool it. You can go to the hardware store in town and get pipe insulation. It would also be nice if you could see if there is a way to tell the air conditioners to start themselves up again after a power failure.

b) Steve: We need some way to check our systems better for relatively common problems such as temperature too high, disk about to fail (use SMART), disk failed. If we cannot get a system with "sensors" working I will feel the need to buy one of those hideously expensive environmental monitoring boxes you hook up to an internet port and monitor that. Part of monitoring is of course that humans (as in you guys) make sure the monitoring is still running and not alarming. If possible the monitoring system needs to send out an email (to all 3 of us, plus Lorenzo and Hovanes) for any critical event, and then someone needs to follow up really soon when an event occurs.

Note: See what it is on Gourd that checks the drives. There is something there that sends emails to the shared folder "hardware-events" for various SMART alerts.
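
As a starting point for (b), here is a minimal sketch of a check that could run from cron on each machine. It is only an illustration under several assumptions: that the lm-sensors `sensors` command and smartmontools' `smartctl -H` are installed and working, that a mail server listens on localhost, and that 30 C (about the 86 F seen in the room) is a sensible limit. The addresses, the device list and all function names are hypothetical, and this is not whatever is currently running on Gourd.

```python
import re
import smtplib
import socket
import subprocess
from email.message import EmailMessage

TEMP_LIMIT_C = 30.0                      # assumed limit; 86 F is 30 C
RECIPIENTS = ["npg-admins@example.org"]  # hypothetical address
DEVICES = ["/dev/sda"]                   # hypothetical device list


def parse_sensor_temps(text):
    """Return {label: degrees C} for lines like 'Core 0:  +45.0 deg C ...'."""
    temps = {}
    for line in text.splitlines():
        m = re.match(r"^([^:]+):\s*\+?(-?\d+(?:\.\d+)?)\s*\u00b0?C", line)
        if m:
            temps[m.group(1).strip()] = float(m.group(2))
    return temps


def smart_failed(text):
    """True unless the `smartctl -H` health line reports PASSED or OK."""
    for line in text.splitlines():
        if "overall-health" in line or "SMART Health Status" in line:
            return not ("PASSED" in line or line.rstrip().endswith("OK"))
    # No health line at all: treated as a problem here (an assumption).
    return True


def collect_alerts(sensor_text, smart_texts, limit_c=TEMP_LIMIT_C):
    """Build alert strings from sensor output and per-device SMART output."""
    alerts = []
    for label, temp in parse_sensor_temps(sensor_text).items():
        if temp > limit_c:
            alerts.append(f"temperature {label} = {temp:.1f} C (limit {limit_c:.1f} C)")
    for device, text in smart_texts.items():
        if smart_failed(text):
            alerts.append(f"SMART health check did not pass on {device}")
    return alerts


def build_message(alerts, host):
    """Wrap the alerts in a plain-text email."""
    msg = EmailMessage()
    msg["Subject"] = f"[{host}] {len(alerts)} hardware alert(s)"
    msg["From"] = f"monitor@{host}"
    msg["To"] = ", ".join(RECIPIENTS)
    msg.set_content("\n".join(alerts))
    return msg


def run_once():
    """Run the real commands once and mail any alerts (smartctl needs root)."""
    sensors = subprocess.run(["sensors"], capture_output=True, text=True).stdout
    smart = {
        dev: subprocess.run(["smartctl", "-H", dev],
                            capture_output=True, text=True).stdout
        for dev in DEVICES
    }
    alerts = collect_alerts(sensors, smart)
    if alerts:
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(build_message(alerts, socket.gethostname()))
```

Calling `run_once()` from a cron job every few minutes would cover the "temperature too high" and "disk about to fail" cases. It does not cover the machine itself being dead, so a second check from another host (for example einstein) would still be needed, along with a human looking at the mail.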

Lower priorities:

c) Get the currently down nodes up again. Leave the Paulis off until Jochen confirms he really needs them, since these systems put out an awful lot of heat.

d) Um, there must be something I am forgetting here....

Best, Maurik