Difference between revisions of "Hardware Issues History"

From Nuclear Physics Group Documentation Pages
Revision as of 13:30, 22 May 2009

5/18/2009

The einstein incident. Because we wanted hotswap on einstein and had not yet confirmed that the new card worked, I thought it would be a good idea to pull a blank, unmounted hard drive and watch the dmesg output. I was wrong. It turns out the new card uses the sata_mv driver, which doesn't support hotswap, so everything hooked up to the card got trashed. The system was unworkable, so I rebooted, only to find that every installed kernel panicked, no matter which one we chose. A bit of investigation showed that the kernel files all sat at very large inode numbers, which may or may not have had something to do with it.

Given our previous experiences with einstein dying with similar symptoms, Maurik decided that we would reinstall.

Reinstallation:

We pulled out the root/mail drive and set it aside as a backup, and we did the same with one of the root/mail/home drives. We cleared the old root partition from the remaining root/mail/home drive and installed onto that. Once the install succeeded, I upgraded everything to RHEL5.3 immediately. Once that was set, I tested mounting the mail and home partitions as degraded RAID arrays, which worked fine. I copied the SSH keys, /root stuff, rsync-backup.conf, autofs files, exports, certs, and sudoers, manually checking each one for integrity. Then I set the mail and home mount points in fstab.

LDAP:

Once the basics were set up, I worked on LDAP. I first installed openldap, openldap-clients, and openldap-servers, as well as the perl package perl-LDAP. Once that was set, I copied over all the LDAP files from the backups, including /etc/ldap.conf and /etc/openldap. To get the LDAP database, I mounted the old root partition from the root/mail drive and chrooted into it. From there, I ran "slapcat -l dump.ldif" to dump the database. I then exited the chroot environment and ran "slapadd -l dump.ldif" to load the database into the new setup. I checked it out, and everything looked good.
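The dump-and-reload step above can be sketched as a small shell helper. This is only a sketch under assumptions: the mount point of the old root and the function names run and migrate_ldap are hypothetical, and the dump is written to the top of the old root so that the same file can be read from outside the chroot. With DRY_RUN=1 (the default here) the commands are printed rather than executed.

```shell
# Sketch: dump the LDAP database from the old root and load it on the new system.
# The old-root mount point and the helper names are hypothetical.

# Print the command when DRY_RUN=1 (default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

migrate_ldap() {
    old_root=$1   # where the old root partition is mounted
    dump=$2       # file name of the LDIF dump, e.g. dump.ldif
    # Inside the chroot, slapcat reads the old configuration and database.
    run chroot "$old_root" slapcat -l "/$dump"
    # Outside the chroot, the same file lives under the old root's mount point.
    run slapadd -l "$old_root/$dump"
}
```

Calling migrate_ldap /mnt/oldroot dump.ldif prints the two commands for review; setting DRY_RUN=0 would run them for real, which is best done as root with slapd stopped.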

NAMED:

I next set up named, thinking that it would be somewhat necessary for mail to work. I copied over all the old DNS information and checked it by hand, and from what I could gather it looked fine. The setup is quite a mess and needs work. Once I set it up, the service appeared to work, but on a reboot the system hung trying to start named. To get around this, I booted into interactive startup and selected everything except named. This got me to a working environment again. I changed named to not start at boot and started it manually. This startup issue has not yet been resolved.

Mail:

Next, I set up mail. I started with postfix, copying over all the /etc/postfix configurations and again checking them. Once postfix was installed, I moved on to dovecot, so that I would be able to check the mail system's behaviour. After installing and setting up dovecot, I ran postfix/dovecot to try out the mail. I saw complaints of spamd not running, so I killed dovecot/postfix and installed/configured spamassassin. I restarted the mail, saw spamd user errors, stopped mail, fixed the user number mismatch, and turned mail back on. I was able to access my mail, and the logs looked like mail was coming in.

I noticed spamd spawned a lot of child processes and was taking 100% CPU. Sending a test mail from my gmail account resulted in neither a bounce nor a successful delivery. I did a STUPID thing and assumed that it was catching up on all the mail it hadn't processed, so I went home for some sleep, seeing it was about 10pm. When I woke up, I checked my gmail and saw no bounce, and checked my einstein mail and saw no message. Something was definitely wrong. I looked through the logs more closely and saw a few complaints about the dovecot sieve plugin missing. I installed it and saw no more errors, but still not the expected behaviour.

Once I came in, I took the time to explain to Josh the mail system's workings, in detail. In doing so, I explained how sendmail can be either sendmail.sendmail or sendmail.postfix, depending on which MTA is in use. Shortly thereafter, I realized I had never set the SYSTEM MTA, but instead only set the DOVECOT MTA. Checking /etc/alternatives/mta showed sendmail.sendmail. I removed sendmail with yum and used the "alternatives" program to set postfix as the proper MTA. Restarting the mail system caused everything to work.
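A quick check of the system MTA would have caught this much earlier. The sketch below reads the symlink that the alternatives system maintains and reports whether it points at the postfix wrapper. The function names current_mta and mta_is_postfix are hypothetical, and the link path is a parameter only so the check can be tried against a test symlink.

```shell
# Print the target of the mta alternative (defaults to /etc/alternatives/mta).
current_mta() {
    link=${1:-/etc/alternatives/mta}
    readlink "$link"
}

# Succeed only if the system MTA resolves to the postfix sendmail wrapper.
mta_is_postfix() {
    case "$(current_mta "$1")" in
        */sendmail.postfix) return 0 ;;
        *) return 1 ;;
    esac
}
```

If mta_is_postfix fails, something like "alternatives --set mta /usr/sbin/sendmail.postfix" should switch it, assuming postfix is installed in the usual RHEL location.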

The bad part:

From the moment postfix was running alongside sendmail, the two MTAs thrashed back and forth, bouncing and discarding messages as they collided. We figure 18 hours of mail got bounced and maybe dropped, but there's not much we can do to find out exactly what, if anything, was dropped. The maillog is now over 300 MB, making it quite the task to look through and find drop events. I tried to use "postqueue -f" to flush the queue and deliver all messages, but it didn't do much of anything.
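One way to get a first overview of a log that size is to tally the delivery outcomes Postfix records as status=... on each delivery line. This is a rough sketch, not a full audit: the function name count_statuses is hypothetical, the log path (typically /var/log/maillog on RHEL) is passed in as an argument, and messages handled by sendmail would not show up in this count.

```shell
# Count Postfix delivery outcomes (status=sent, status=bounced, ...) in a log file.
# Prints one line per outcome: the status followed by its count.
count_statuses() {
    grep -o 'status=[a-z]*' "$1" | sort | uniq -c | awk '{ print $2, $1 }'
}
```

Running count_statuses /var/log/maillog gives the totals; a follow-up grep for status=bounced would then list the affected recipients.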


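3/2009

  • The memory controller on lentil's motherboard seems to have failed. We replaced the motherboard with an ASUS P5QL-CM, and the system starts fine now.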

6/2007

  • RAIDS: A hard drive in gourd's RAID went bad and the array was labeled as "degraded." Around the same time, various login issues appeared, such as long wait times between username and password prompting, and password authentication failing for users. When the drive was replaced, not only was the array fixable, but these other issues went away.
  • SMP, Power: Taro was only able to run the single-processor kernel, despite having two processors. When taro's power supply was replaced with a better one capable of delivering more power on the necessary lines, taro was able to run the symmetric multiprocessing kernel. The old power supply was most likely damaged in a power event. Its rating may have been too low to begin with.
  • SMP: Gourd was running on just one CPU, despite having two. This was solved by simply telling GRUB to boot the symmetric multiprocessing Linux kernel, which was not the default. Booting gives a warning that RHEL4 Desktop doesn't support more than one CPU. It seems to work with both active, however. We'll need to investigate it further when we begin the distant transition to RHEL5 so that all CPUs work fully, warning-free.
  • Odd Behavior: We noticed Perl scripts on Lentil failing in mysterious ways, which led to our finding a /usr/bin/perl of size 0. After checking the logs, we decided that it wasn't done intentionally, and reinstalled Perl. Later, we couldn't boot lentil. Investigation led us to a /boot/grub/grub.conf of size 0. We decided to do a find -size 0 -print and got a large list of files of size 0. We decided that lentil's installation drive must be going bad, and took it as an opportunity to install RedHat EL5 on another drive.
  • Odd Behavior: ennui could download some webpages but not others, among other inconsistent network behaviors. It turned out that its network card isn't very good, and the MTU had to be set to 1460, rather than the default. This also had to be done to hobo.
  • Lentil: (2/24/2009) There have been a lot of kernel panics on lentil lately, caused by failed inodes on the drive. This went undiscovered because lentil's backup scripts mount the drives manually and the drives are not automounted at bootup. So when lentil is rebooted, the drives are never mounted and never checked, and the kernel crashes hard every time the backup scripts try to mount them.
    • The fix:
      • After booting lentil, run this command:
        • e2fsck <device> (checks both ext2 and ext3 filesystems)
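The check on lentil's backup drives could be wrapped so that it only runs on a device that is not currently mounted. The following is a minimal sketch under assumptions: the function names is_mounted and check_if_unmounted are hypothetical, the mounts table defaults to /proc/mounts, and with DRY_RUN=1 (the default here) the e2fsck command is only printed.

```shell
# Succeed if the device appears in the first column of the mounts table.
is_mounted() {
    dev=$1
    mounts=${2:-/proc/mounts}
    awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$mounts"
}

# Check a filesystem only when it is not mounted.
# With DRY_RUN=1 (default) the e2fsck command is printed instead of run.
check_if_unmounted() {
    dev=$1
    mounts=${2:-/proc/mounts}
    if is_mounted "$dev" "$mounts"; then
        echo "skip $dev: mounted"
        return 1
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: e2fsck -p $dev"
    else
        # -p lets e2fsck repair safe problems without prompting.
        e2fsck -p "$dev"
    fi
}
```

Calling check_if_unmounted on each backup device at the top of the backup scripts, before their mount step, would mean the drives get checked even though they are never automounted.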