Completed in February 2008

From Nuclear Physics Group Documentation Pages

Pumpkin Setup and Xen

Pumpkin is now stable, and the virtual hosts Corn, Compton and Fermi are up and running. Another VM, Landau, is "parked" but fully working, and can take over for another machine if needed. Corn is not stable! Most likely because it should NOT run para-virtualized.

New problem (now solved; it was a config error): I set up the virtual hosts Corn (32-bit RHEL5) and Fermi (64-bit RHEL4), and both went fine. I tried installing Compton with 32-bit RHEL4, but the installer kept crashing partway through the install. Most annoying. Instead, I installed Compton "directly" from a backup. This worked (hurray!) EXCEPT that the system occasionally stops responding to the ssh session. No clue what is going on here.

Note on RHEL4: A 32-bit RHEL4 system needs to run fully virtualized (qemu). It runs plenty fast that way if you use a real disk and not a file. Also note that you can give a VM a partition instead of a whole disk, but this is NOT recommended.
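As a rough illustration of the note above, a fully virtualized (HVM) guest backed by a real disk could be described with an xm-style config file along these lines. Xen's xm config files use Python syntax. The guest name, memory size, disk device, bridge name, and the hvmloader and qemu-dm paths are all assumptions for illustration; they are not taken from Pumpkin's actual configuration.

```python
# Hypothetical xm config sketch for a fully virtualized 32-bit RHEL4 guest.
# Every value below is an assumed example, not Pumpkin's real setup.

name = "rhel4-guest"          # hypothetical guest name
builder = "hvm"               # fully virtualized, not para-virtualized
memory = 1024                 # MB of RAM; assumed value
vcpus = 1

# Paths as typically found on a 64-bit RHEL5 dom0; check your own install.
kernel = "/usr/lib/xen/boot/hvmloader"
device_model = "/usr/lib64/xen/bin/qemu-dm"

# "phy:" hands the guest a real block device instead of a file image.
# /dev/sdb is an assumed whole disk; a partition would also work but
# is not recommended per the note above.
disk = ["phy:/dev/sdb,hda,w"]

vif = ["type=ioemu, bridge=xenbr0"]   # emulated NIC on an assumed bridge
boot = "c"                            # boot from the first hard disk
vnc = 1                               # graphical console over VNC
```

A file-backed guest would use a `file:` or `tap:aio:` prefix in the `disk` entry instead of `phy:`; the note above suggests the `phy:` form is the one that performs well.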

The non-answer: It cannot be done! The Red Hat Magazine article "Xen Guest for Red Hat Enterprise Linux 4" describes how to turn a fully virtualized RHEL4 host into a para-virtualized guest running under an RHEL5 dom0. But its notes state "don't mix and match x86_64 with i686 hosts and vice versa". So we will run Corn fully virtual, which DOES work.

The einstein event

  • Einstein is now on new hardware and running stable.
  • Not yet sure exactly what happened, but on the 7th, starting around 2 AM, root@einstein started receiving cron job errors from the machines, saying "no route to host," "domain not bound," etc. At 6:54 AM, mdadm sent a message warning about a degraded array on /dev/md1. At 7:41 AM, both /dev/md0 and /dev/md1 were marked as degraded. At 2:09 PM, einstein sent itself the daily logwatch email, which mentioned a number of lost connections from postfix, and no mdadm messages. From then until 4 AM on the 8th, all machines sent cron errors. At 4 AM, gourd listed pages of logwatch errors indicating that LDAP was down, while roentgen had a large number of named retry limit errors. At 8 AM, einstein sent mdadm warnings that /dev/md0 and /dev/md1 were still degraded. Einstein's logwatch email at 9:30 AM showed LDAP errors in almost every category, as well as ACPI kernel errors. Amavis seems to have been working at that time. At 10:12 AM, clamav on einstein sent email about DNS errors.
  • Presently, einstein appears to be working fully. It would seem that the degraded array messed up the system more and more as time went on, until everything failed to work properly. This leads me to believe that einstein's hardware, with the exception of the hard drives, is good. I'll set up a number of stress tests to run on old einstein and tomato to try to determine the faulty hardware. Good idea to test tomato; it's similar hardware and I'm still suspicious of it after that RAID problem.
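Since the first hard evidence in the timeline above was mdadm's degraded-array warning, a small check of the md status text could be part of any follow-up testing. The sketch below is only an illustration: the function name `degraded_arrays` and the sample text are hypothetical, and the sample merely imitates the usual /proc/mdstat layout rather than reproducing einstein's real output.

```python
import re

# Sample text imitating the usual /proc/mdstat layout (assumed, not real data).
SAMPLE = """\
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0](F)
      104320 blocks [2/1] [_U]

md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose member map shows a missing disk.

    In the status map, 'U' marks a working member and '_' a missing one,
    so a map such as [_U] means the array is degraded.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)   # start of a new array section
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status:
            if "_" in status.group(1):
                degraded.append(current)
            current = None              # status line seen; wait for next array
    return degraded


print(degraded_arrays(SAMPLE))
```

On a real host the same function could be given the contents of /proc/mdstat; mdadm's own monitoring remains the primary alert, and this is just a quick cross-check.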


Weather

The fan was turned off and the new air conditioner turned on. Room temperatures have been in the 20-22 C range, which is acceptable. I would like another airco (a working one) in the hole where the fan is now. That setup would be a lot safer, especially when it snows. The airco to put in that hole could come from Lorenzo's office.