Completed in November/December 2007

From Nuclear Physics Group Documentation Pages
Revision as of 15:56, 7 January 2008 by Minuti (talk | contribs)
  • Environmental monitoring. See The Friday Taro Event for more.
    • For sensors, we need to get them working on one reliable machine first so that we get decent notification. blackbody has a CPU temperature sensor that works. The steps in "lm_sensors FAQ - adding more sensors" seem to work for bb, but some settings seem to be off. Sensors is now working on both Taro and Pepper. We should now tie this into the Cacti readout scheme. (Volunteers?)
    • We should also look into dedicated temperature monitoring devices. A nice one has been ordered and should be here before Christmas. It can be read out over SNMP, so we can get Okra/Cacti to read and log it.
  • To fix the RAID degradation, a disk has been transferred from lentil to taro. Now lentil is one short. Shouldn't really be a problem, considering how many other spare empty drives we have in lentil. Well, once we figure out why lentil gives us errors in its backup script...
  • Taro and Pepper iptables configurations: There seems to be a problem in the iptables configuration, which caused connectivity issues on taro.
  • Taro now has one ethernet port connected to the backend network. Outside connectivity is over the VLAN.
  • Swap jalapeno power supply. No need to schedule downtime, considering it's always down. Has this been done? It's been up all week. The Epsilon power supply we bought for taro way back at the beginning of the summer is now in jalapeno.
  • The weather might be getting too cold for running two air conditioners. The top one has been having some issues. Yesterday I came in and it had left the floor wet. Today it had collected a large amount of ice and started to flash its lights and beep, so I turned it off. The other day I came in and both were coated in ice and the room was stifling. I defrosted them with the door wide open and fans on high. Any chance we can start leaving the window open without running environmental risks to the machines? We'd have to disconnect the top machine to do that, and that's a heavy-duty job, so probably not. See The Friday Taro Event. Whoops. Looks like it happened by itself.
  • Copied the output of getent passwd on roentgen into its /etc/passwd (ditto for group and shadow), and commented out the NIS dependencies in roentgen's /etc/nsswitch.conf. Now roentgen should be free of the shackles of NIS. While doing the transfer, I noticed that my account didn't have a shadow entry! I copied the entry from einstein to roentgen, and voilà, I could log into roentgen. Good riddance, NIS!
  • Coincidentally, roentgen's UPS started flipping out right after I finished the above. It ran out of power, so I quickly replaced the old UPS with the new one that was being used by Matt's workstation.
  • Temporarily mounted that random box fan that's been in the hall all summer into the old AC's spot. The maintenance guy's note implies that we'll be getting a fan to actually fit the hole, and Maurik said we might get a freestanding AC unit to pump cold air straight onto the machines.
  • Sent a request to order a replacement battery for Roentgen's UPS to Michelle Walker.
  • Something went very wrong with tomato: I changed grub.conf to boot at runlevel 3 (normal, excludes X) and rebooted, but there was a kernel panic mentioning not being able to find vg_tomato. I rebooted again at the default runlevel and got the same error. I booted with the install disk, and at first it didn't seem to see any LVM anywhere. OK, so it does see some LVM on sdd4. It also lists /dev/md1 as a "foreign" filesystem; presumably this is where the LVM should be residing. The kernel panic has a message saying that only 7/8 devices could be found for RAID5 device md1. What are the odds of taro and tomato losing a disk at the same time? Actually, it says 7/8 failed. This doesn't seem possible... Is the RAID card unplugged or something? I was able to make a new RAID, so maybe it was a software error?
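
Since both taro and tomato ran into degraded RAID arrays, a small check that could run from cron might help catch this earlier. The following is only a minimal sketch, assuming the arrays are Linux software RAID (md) and therefore show up in /proc/mdstat. The function name degraded_arrays and the sample text are hypothetical, made up for illustration; the sample is not output from any of our machines.

```python
import re

# Hypothetical sample in the usual /proc/mdstat layout (NOT taken from tomato or taro).
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 sdh1[7] sdg1[6] sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1]
      2930303488 blocks level 5, 64k chunk, algorithm 2 [8/7] [_UUUUUUU]
md0 : active raid1 sdb2[1] sda2[0]
      104320 blocks [2/2] [UU]
unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return (name, expected, active) for every md array missing members."""
    results = []
    current = None  # name of the array whose status line we are waiting for
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line carries "[expected/active]", e.g. "[8/7]".
        status = re.search(r"\[(\d+)/(\d+)\]", line)
        if status and current is not None:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                results.append((current, expected, active))
            current = None
    return results


if __name__ == "__main__":
    # On a real machine, read the live file instead of the sample:
    #   with open("/proc/mdstat") as f: text = f.read()
    for name, expected, active in degraded_arrays(SAMPLE_MDSTAT):
        print("%s is degraded: %d of %d devices active" % (name, active, expected))
```

Hooking the result into mail or the Cacti readout is left open here; the sketch only shows how the degraded state could be detected.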