Hardware Issues History
Lentil's bad hair day.
I noticed in the morning on Saturday, 12/19 that backups had not run, and Lentil was down. Upon closer inspection it appeared that Lentil had gone down for a reboot sometime the previous night but had failed to reboot. It was getting as far as loading the kernel, but would return an error and fail without booting. This was the error:
request_module: runaway loop modprobe binfmt_464c
It appears that Red Hat had pushed some automatic updates, including a kernel update. To my embarrassment, we must have overlooked Lentil after the recent Taro kernel update fiasco; we have since double-checked every system's update settings to make sure this won't happen to any other systems. In any case, there were no other kernels to choose from in the grub menu, and I couldn't find a decent way via the grub console to see which other kernels were installed in /boot (you can't ls from grub, apparently), so I tried to boot a live CD in order to mount the system drive and edit grub.conf to allow me to boot the previous kernel. Much to my dismay, the CD drive appears to have failed during my attempt to boot a live system: I got an error loading the live system, and then the drive wouldn't eject. I'll investigate whether the drive is really dead and consider possible replacement options eventually, but the CD drive isn't critical. With that option unavailable, I opted to take out Lentil's system drive and connect it to another machine in order to mount it and edit the grub configuration.
At this point Josh came in to assist, and we connected Lentil's drive to another machine to access it that way. Lo and behold, when I plugged the drive into Feynman, like magic it loaded the kernel with no errors and booted the system. Given the nature of the error above, I at first suspected Red Hat had done something really stupid, like pushing a 64-bit kernel update to a 32-bit system: Feynman has a 64-bit processor and was able to boot every single kernel installed on Lentil's drive (there were two others installed but not included in the grub menu; I added them while the drive was booted on Feynman), yet none would boot on the Lentil hardware.
Our next step was to reinstall the kernel package from an RPM that we knew for certain was the correct architecture. This ended with the same results as before: the kernel boots on Feynman, but not on Lentil. At this point we had reached a bit of an impasse. We were considering reinstalling the OS, but first, since the system was down already anyway, we decided to go ahead and swap Lentil's HD enclosure, which had a bad bay, for the good one from the old Tomato system. I didn't expect this to solve the issue we were having, but I thought it was a good way to make the best of a bad situation. After the hardware was installed, I suggested we start the system up just to make sure it would POST, to confirm we had installed the hardware correctly. Not only did it POST, it proceeded to load the kernel and boot the system with no issues. The backups were still in place, no data was lost, and everything seemed to be in order. We added the new blank 1TB drives to the system as the next two backup drives, and so far it appears to be functioning normally.
The only remaining issue is that the system drive is giving some SMART errors. It most likely will need to be replaced in the near future.
The einstein incident.
Because we wanted hotswap on einstein and had not yet confirmed that the new card supported it, I thought it would be a good idea to pull a blank, unmounted hard drive and watch the dmesg output. I was wrong. It turns out the new card uses the sata_mv driver, which doesn't support hotswap, so everything hooked up to the card got trashed. The system was unworkable, so I rebooted, only to find kernel panics no matter which kernel we chose. A bit of investigation showed that the kernels all sat at very large inode numbers, which may or may not have had something to do with it.
Given our previous experiences with einstein dying with similar symptoms, Maurik decided that we would reinstall.
We pulled out the root/mail drive and set it aside as a backup, and we did the same with one of the root/mail/home drives. We cleared the old root partition from the remaining root/mail/home drive and installed onto that. On a successful install, I upgraded everything to RHEL5.3 immediately. Once that was set, I tested mounting the mail and home partitions as degraded raid arrays, which worked fine. I copied the SSH keys, /root stuff, rsync-backup.conf, autofs files, exports, certs, and sudoers, manually checking each one for integrity. Then I set the mail and home mount points in fstab.
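For reference, mounting one of those arrays degraded looks roughly like this; the device and mount-point names here are examples, not necessarily the ones used that night:

```shell
# Assemble the mail array from its one surviving member; --run starts it
# even though it is degraded. /dev/md1 and /dev/sdb2 are example names.
mdadm --assemble --run /dev/md1 /dev/sdb2
# Mount it read-only first to look it over before trusting it.
mount -o ro /dev/md1 /mnt/mail
```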
Once the basics were set up, I worked on ldap. I first installed openldap, openldap-clients, and openldap-servers, as well as the perl package perl-LDAP. Once that was set, I copied over all the ldap files from the backups, including /etc/ldap.conf and /etc/openldap. To get the ldap database, I mounted the old root partition from the root/mail drive and chrooted into it. From there, I ran "slapcat -l dump.ldif" to dump the database. I then exited the chroot environment and ran "slapadd -l dump.ldif" to add the database to the new setup. Checked it out, and everything looked good.
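Put together, the database transfer amounts to something like this; the /mnt/oldroot mount point and the device name are examples:

```shell
# Mount the old root partition and dump the directory from inside a chroot,
# so slapcat reads the old server's own config and database files.
mount /dev/sdc1 /mnt/oldroot          # example device name
chroot /mnt/oldroot slapcat -l /dump.ldif
# Load the dump into the fresh server; slapd must not be running while
# slapadd writes, and the new files need to be owned by the ldap user.
service ldap stop
slapadd -l /mnt/oldroot/dump.ldif
chown -R ldap:ldap /var/lib/ldap
service ldap start
```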
I next set up named, thinking it would be more or less necessary for mail operation. I copied over all the old DNS information and checked it by hand to make sure it was okay, and from what I could gather it looked fine. The setup is quite a mess and needs work. The service appeared to work at first, but on a reboot the system hung trying to start named. To get around this, I booted into interactive startup and selected everything except named, which got me back to a working environment. I changed named to not start at boot and started it manually. This startup issue has not yet been resolved.
Next, I set up mail. I started with postfix, copying over all the /etc/postfix configuration, and again checking it. Once postfix was installed, I moved on to dovecot, so that I would be able to check the mail system's behaviour. After installing and setting up dovecot, I started postfix and dovecot to try out the mail. I saw complaints that spamd wasn't running, so I stopped dovecot and postfix and installed and configured spamassassin. I restarted the mail services, saw spamd user errors, stopped mail, fixed the user number mismatch, and turned mail back on. I was able to access my mail, and the logs looked like mail was coming in.

I noticed spamd spawned a lot of child processes and was taking 100% CPU. Sending a test mail from my gmail account resulted in neither a bounce nor a successful delivery. I did a STUPID thing and assumed that it was catching up on all the mail it hadn't processed, so I went home for some sleep, it being about 10pm. When I woke up, I checked my gmail and saw no bounce, and checked my einstein mail and saw no message. Something was definitely wrong. Looking through the logs more closely, I saw a few complaints about the dovecot sieve plugin missing. I installed it and saw no more errors, but still not the expected behaviour.

Once I came in, I took the time to explain the mail system's workings to Josh, in detail. In doing so, I explained how sendmail can be either sendmail.sendmail or sendmail.postfix, depending on which MTA is in use. Shortly thereafter, I realized I had never set the SYSTEM MTA, but only the DOVECOT MTA. Checking /etc/alternatives/mta showed sendmail.sendmail. I removed sendmail with yum and used the "alternatives" program to set postfix as the proper MTA. Restarting the mail system made everything work.
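On RHEL the system MTA is managed through alternatives, so the actual fix comes down to two commands:

```shell
# See which sendmail binary the system-wide MTA symlink points at.
alternatives --display mta
# After removing the sendmail package, point the MTA alternative at
# postfix's sendmail-compatible wrapper.
alternatives --set mta /usr/sbin/sendmail.postfix
```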
The bad part:
From the moment I had postfix running with sendmail at the same time, the two thrashed back and forth, bouncing and discarding messages as the two MTAs collided. We figure 18 hours of mail got bounced and maybe dropped, but there's not much we can do to find out exactly what, if anything, dropped. The maillog is now over 300 MB, making it quite the task to look through and find drop events. I tried to use "postqueue -f" to flush the queue and deliver all messages, but it didn't do much of anything.
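Grepping a log that size for delivery outcomes is still feasible; a sketch, assuming the stock RHEL maillog path and postfix's per-delivery "status=" log lines:

```shell
# MAILLOG: path to the postfix log; /var/log/maillog on stock RHEL.
MAILLOG=${MAILLOG:-/var/log/maillog}
# Tally the delivery outcomes postfix logged (sent / bounced / deferred).
grep -oh 'status=[a-z]*' "$MAILLOG" 2>/dev/null | sort | uniq -c | sort -rn
# Pull the lines for outright bounces, queue IDs included.
grep 'status=bounced' "$MAILLOG" 2>/dev/null | tail
```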
- The memory controller on lentil's motherboard seems to have failed. We replaced the motherboard with an ASUS P5QL-CM, and the system now starts fine.
- The downside of this board is that the onboard gigabit lan is poorly supported in RHEL5, and they aren't too keen to fix it: https://bugzilla.redhat.com/show_bug.cgi?id=432171 . We're getting around this by using the kmod package the CentOS people recommended: http://wiki.centos.org/AdditionalResources/HardwareList/RealTekRTL8111b?action=show&redirect=HardwareList%2FRealTekRTL8111b#head-74817fc80992bd9d6819c23a716e7426778bfcdc .
- RAIDS: A hard drive in gourd's RAID went bad and the array was labeled as "degraded." Around the same time, various login issues appeared, such as long wait times between username and password prompting, and password authentication failing for users. When the drive was replaced, not only was the array fixable, but these other issues went away.
- SMP, Power: Taro was only able to run the single-processor kernel, despite having two processors. When taro's power supply was replaced with a better one capable of delivering more power on the necessary lines, taro was able to run the symmetric multiprocessing kernel. The old power supply was most likely damaged in a power event; it may also have been too low a rating to begin with.
- SMP: Gourd was running on just one CPU, despite having two. This was solved by simply telling GRUB to boot the symmetric multiprocessing Linux kernel, which was not the default. Booting gives a warning that RHEL4 Desktop doesn't support more than one CPU. It seems to work with both active, however. We'll need to investigate it further when we begin the distant transition to RHEL5 so that all CPUs work fully, warning-free.
- Odd Behavior: We noticed Perl scripts on Lentil failing in mysterious ways, which led to our finding a /usr/bin/perl of size 0. After checking the logs, we decided that it wasn't done intentionally, and reinstalled Perl. Later, we couldn't boot lentil. Investigation led us to a /boot/grub/grub.conf of size 0. We decided to run
  find -size 0 -print
and got a large list of files of size 0. We decided that lentil's installation drive must be going bad, and took it as an opportunity to install RedHat EL5 on another drive.
- Odd Behavior: ennui could download some webpages but not others, among other inconsistent network behaviors. It turned out that its network card isn't very good, and the MTU had to be set to 1460, rather than the default. This also had to be done to hobo.
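For the record, the MTU change looks like this (eth0 is an example interface name):

```shell
# Lower the MTU for the running session.
ip link set dev eth0 mtu 1460
# Persist it across reboots in the RHEL interface config.
echo 'MTU=1460' >> /etc/sysconfig/network-scripts/ifcfg-eth0
```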
- Lentil: (2/24/2009) There have been a lot of kernel panics on lentil lately, and the reason is failed inodes on the drive. The reason this was not discovered sooner is that Lentil's backup scripts mount the drives manually; to add to this, the drives are not automounted at bootup. So when Lentil is rebooted, the drives are not mounted and never checked, and the kernel crashes hard every time the backup scripts try to mount the drives.
- The fixes are:
- After booting Lentil, run these commands:
- e2fsck <device> (checks ext2 filesystems)
- fsck.ext3 <device> (checks ext3 filesystems; it is an alias for e2fsck)