Difference between revisions of "Upgrading Endeavour"

From Nuclear Physics Group Documentation Pages
 
(48 intermediate revisions by the same user not shown)
= Upgrade Nodes to Centos 7 =

* Starting with Node2, which would not boot anyway since it kernel panicked.
** Use USB boot stick.
** Configure "Compute Node" with legacy libraries, and NFS.
** Edit /etc/sysconfig/network-scripts/ifcfg-enp5s0f0 to set address 10.0.0.2, netmask=255.255.0.0, gateway=10.0.0.100.
** Start up network.
** Copy the dir /etc/ssh for ssh-keys.
** Set up /etc/yum.conf to use 10.0.0.100 as proxy.
** Set up /etc/resolv.conf to use our name servers.
** Install emacs.
** Set the hostname in /etc/hostname
** Copy (node10) /etc/hosts
** Copy/set /etc/resolv.conf
** Set up LDAP and SSSD (see [[Upgrading to Centos 7]])

= Install Torque =

* Follow: [http://docs.adaptivecomputing.com/torque/6-1-1/adminGuide/help Torque Admin Guide]
* On gourd, using /net/data/Torque
** Get code: wget http://wpfilebase.s3.amazonaws.com/torque/torque-6.1.1.1.tar.gz
** Pre-reqs: yum install -y libtool openssl-devel libxml2-devel boost-devel gcc gcc-c++
** Install hwloc: yum install -y hwloc
** ./configure --with-default-server=endeavour
** make -j 8
** make install
** make packages
** Copy the old /var/spool/torque back.
** systemctl enable pbs_server.service and systemctl start pbs_server.service
** Copy torque-package-mom-linux-x86_64.sh to slave node
** Install with: ./torque-package-mom-linux-x86_64.sh --install
** Start mom: systemctl enable pbs_mom.service && systemctl start pbs_mom.service

= Upgrade main system =

Started with a sideways migration to Centos 5. This worked except for the infiniband packages, which were skipped:
* Bump Endeavour over to CENTOS 5: see http://wiki.centos.org/HowTos/MigrationGuide
** Execute: rpm -ivh http://mirror.centos.org/centos/5/os/i386/CentOS/centos-release-5-11.el5.centos.i386.rpm http://mirror.centos.org/centos/5/os/i386/CentOS/centos-release-notes-5.11-0.i386.rpm http://mirror.centos.org/centos/5/os/x86_64/CentOS/centos-release-5-11.el5.centos.x86_64.rpm http://mirror.centos.org/centos/5/os/x86_64/CentOS/centos-release-notes-5.11-0.x86_64.rpm http://mirror.centos.org/centos/5/os/x86_64/CentOS/redhat-logos-4.9.99-11.el5.centos.noarch.rpm
** Execute: yum update --skip-broken
** Remove the broken packages by hand. Unfortunately, this took down the system :-)
*** rpm -e --allmatches ibsim ibutils ibutils-libs infiniband-diags libibcommon libibcommon-devel libibcommon-static libibmad libibmad-devel libibmad-static libibumad libibumad-devel libibumad-static opensm opensm-devel opensm-libs opensm-static perftest srptools mvapich_gcc mvapich2_gcc

New system @ Centos 6.6:
* Reconfigured RAID. All the 2TB drives are now in slots 1-9 and configured as a 14TB RAID6.
** Slots 10, 11, 12 will be a hot-spare and 2x passthrough.
*** The passthrough slots are for: slot 11 -- can contain the Home directory drive when Gourd is being upgraded; slot 12 -- OldSys, a 1TB drive with the old Centos 5.5 system.
** There are 3 volumes on the RAID: "system" ~100GB, "system2" ~100GB, "data1"
** Remaining 12 slots will be filled with new high-density drives for another RAID6
* Restarted web server.
* Set up fail2ban

= Installing new RPMs on Nodes =

* The Centos DVDs are installed at /net/data/node10/RHEL/Centos-6.6
* This dir is added to the c6-media in /etc/yum.repos.d/CentOS-Media.repo
* Install packages with: yum --disablerepo \* --enablerepo c6-media install package_name

More recently, Endeavour is now a proxy server and yum on the nodes is set up to make use of the proxy. Thus "yum update" simply works.

= Upgrading Endeavour nodes =

Upgrade to Centos 6 started March 17, 2015 with node2:
* Reboot Node2 from a USB key with Centos6 distribution installed. Chose "minimal install"
** Note: Should have added scp, i.e. the openssh-client stuff. Added this "by hand" from endeavour with: cat openssh-client-... | ssh node2 "cat - > openssh-client.rpm" and then installing that rpm.
* SSH into the system
* Copy the Centos ISO to node2 with scp. Mount on /mnt/centos
* Install packages with: yum --disablerepo \* --enablerepo c6-media install
* The list of previously installed packages is in ~root/new_packages.txt, with the distribution and package version already stripped. From this list, the packages were filtered into "installed" and "available" with yum. From the resulting list of "available", only the x86_64 and noarch packages were installed.
* A number of config tweaks needed.
* Nodes @ Centos 6.6
** nodes: 2, 3, 7, 11, 13

== Done ==

* Initial installation of system
* Update the ethernet configuration (eth0) to 10.0.0.2 and onboot=yes.
* Install all previously installed packages
* Set up ssh for passwordless entry. See: http://itg.chem.indiana.edu/inc/wiki/software/openssh/189.html or http://en.wikibooks.org/wiki/OpenSSH/Cookbook/Host-based_Authentication
** Note: Endeavour identifies as master.farm.physics.unh.edu, and so this needs to be in the ssh_known_hosts file and in the shosts.equiv file and .shosts file.
* Configure authentication: see [[SSSD]]
* Synchronise clock with endeavour: rdate -s endeavour && hwclock --systohc
* Configure automount: Frankly, it is a mystery why it works, since the setup seems incomplete, but hey, I won't complain.
* Configure Infiniband
** See: http://pkg-ofed.alioth.debian.org/howto/infiniband-howto-4.html includes a number of tests. All passed.
** See: https://pkg-ofed.alioth.debian.org/howto/infiniband-howto-5.html -- tested OK.
** See: https://access.redhat.com/solutions/301643
** Note: The ib0 does not seem to come up automatically? Needs to be checked.
* Disk cloning: Run /sbin/Node_Clone.sh 2 7 from node2 to clone from node2 to node7, when both disks are in node2. Currently node2 and node3 are set up for cloning.
* Set up Splunk
* PBS scheduler
** Still need to set up more queues.

== Node Cloning Recipe -- Adapted for Centos 7 -- ==

'''Caveat Emptor''' -- do not follow these steps blindly. These are notes from the procedure in 2015.

<code>
=== Prepare drive: ===
# If there is no label, or a new drive, or you did "dd if=/dev/zero of=/dev/sdb bs=1M count=1000" to wipe it:
  parted /dev/sdb mklabel msdos
# else remove the old partitions
# create new partitions:
  parted /dev/sdb mkpart primary 1049kb 251MB
  parted /dev/sdb mkpart primary 251MB 100%
  parted /dev/sdb toggle 1 boot
  parted /dev/sdb toggle 2 lvm
  # parted /dev/sdb set 1 bios_grub on # Needed if we chose gpt label, but grub did not like that.
=== Make the Boot Drive: ===
  mkfs.ext4 /dev/sdb1
  e2label /dev/sdb1 boot
  test -r /mnt/tmp_boot || mkdir -p /mnt/tmp_boot
  mount /dev/sdb1 /mnt/tmp_boot
  rsync  -avxHAX --numeric-ids /boot/  /mnt/tmp_boot
# Check to make sure it is all there.
# Rename node stuff -- This is Centos 6 only.
  cd /mnt/tmp_boot/grub
  sed 's/node2/node3/g' grub.conf  > tmp.conf && mv grub.conf grub.conf.orig && mv tmp.conf grub.conf
  sed 's/node2/node3/g' menu.lst  > tmp.lst && mv menu.lst menu.lst.orig && mv tmp.lst menu.lst
  cd /
  umount /dev/sdb1
# Create Logical volume sets:
# If this does not work, it can be that 'old' lvm partitions are recognized. Use ls -l /dev/mapper to find out which and then destroy them. You can delete all the links, and then finally use 'dmsetup remove <diskname>' to get rid of them.
  pvcreate /dev/sdb2
  vgcreate centos_node3 /dev/sdb2
  lvcreate -L 16G  -n swap  centos_node3
  lvcreate -l 100%FREE  -n root  centos_node3
  mkswap -L swap /dev/centos_node3/swap
  mkfs.ext4 -L root /dev/centos_node3/root
=== Copy data: ===
  cd /
  test -r /mnt/tmp_root || mkdir -p /mnt/tmp_root
  mount /dev/centos_node3/root /mnt/tmp_root
  mkdir  /mnt/tmp_root/proc /mnt/tmp_root/dev /mnt/tmp_root/run /mnt/tmp_root/tmp
  chmod 1777 /mnt/tmp_root/tmp
  chmod  555 /mnt/tmp_root/proc
  mkdir  /mnt/tmp_root/net  /mnt/tmp_root/srv  /mnt/tmp_root/sys /mnt/tmp_root/boot /mnt/tmp_root/data
  rsync  -avxHAX --numeric-ids bin etc lib lib64 media opt root sbin usr var /mnt/tmp_root

=== Prep BOOT for grubbing ===
  mount /dev/sdb1 /mnt/tmp_root/boot
  mount -o bind /dev /mnt/tmp_root/dev
  mount -o bind /proc /mnt/tmp_root/proc
  mount -o bind /sys  /mnt/tmp_root/sys
  mount -o bind /run  /mnt/tmp_root/run
  chroot /mnt/tmp_root

=== Grub drive ===
# You may need to run this to reinstall the grub stuff: yum reinstall grub2-tools
  dracut -f                                # Create new initramfs.
  grub2-install /dev/sdb
  grub2-mkconfig -o /boot/grub2/grub.cfg
* If you get an error: "WARNING: Failed to connect to lvmetad. Falling back to device scanning." then /run was not properly bound to /mnt/tmp_root/run, so lvm is not working properly.
* If you get "grub2-install: warning: this GPT partition label contains no BIOS Boot Partition; embedding won't be possible." then you set a "boot" partition, but not a "bios_grub" partition. Execute: "parted /dev/sdb set 1 bios_grub on"
# Exit chroot
  exit

=== Fixup /etc ===
  cd /mnt/tmp_root/etc
# Fix up fstab, sysconfig/network, sysconfig/network-scripts
  sed 's/node2/node3/g;' fstab > tmp.tmp && mv fstab fstab.orig && mv tmp.tmp fstab
  cd sysconfig
  cd network-scripts
  sed 's/10\.0\.0\.2/10.0.0.3/g;/HWADDR/d;/UUID/d' ifcfg-eth0 > tmp.tmp && mv ifcfg-eth0 xx_ifcfg-eth0.orig && mv tmp.tmp ifcfg-enp5s0f0
  sed 's/10\.1\.0\.2/10.1.0.3/g;/HWADDR/d;/UUID/d' ifcfg-ib0 > tmp.tmp && mv ifcfg-ib0 xx_ifcfg-ib0.orig && mv tmp.tmp ifcfg-ib0

=== Closeout ===
  cd /
  umount /mnt/tmp_root/boot
  umount /mnt/tmp_root/dev
  umount /mnt/tmp_root/proc
  umount /mnt/tmp_root/sys
  umount /mnt/tmp_root/run
  umount /mnt/tmp_root
  echo "ALL DONE - Shut down the mother node, take out the 2nd hard drive, put it in the destination node and boot that node. Repeat for next node."
</code>

 
== To Do ==

* Configure MPI
** Later. I'm not sure anyone is using this right now.
* Reconfigure Ganglia
 

Latest revision as of 01:53, 16 January 2018

Upgrade Nodes to Centos 7

  • Starting with Node2, which would not boot anyway since it kernel panicked.
    • Use USB boot stick.
    • Configure "Compute Node" with legacy libraries, and NFS.
    • Edit /etc/sysconfig/network-scripts/ifcfg-enp5s0f0 to set address 10.0.0.2, netmask=255.255.0.0, gateway=10.0.0.100.
    • Start up network.
    • Copy the dir /etc/ssh for ssh-keys.
    • Set up /etc/yum.conf to use 10.0.0.100 as proxy.
    • Set up /etc/resolv.conf to use our name servers.
    • Install emacs.
    • Set the hostname in /etc/hostname
    • Copy (node10) /etc/hosts
    • Copy/set /etc/resolv.conf
    • Set up LDAP and SSSD (see Upgrading to Centos 7)
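
A minimal sketch of the static network file from the list above, written as a small shell helper. The function name render_ifcfg is made up for illustration, and BOOTPROTO=none and ONBOOT=yes are assumptions; the notes only give the device name, address, netmask and gateway.

```shell
#!/bin/sh
# Print a static ifcfg file for a compute node.
# Usage: render_ifcfg DEVICE IPADDR NETMASK GATEWAY
# render_ifcfg is a hypothetical helper, not part of any installed tool.
render_ifcfg() {
    cat <<EOF
DEVICE=$1
BOOTPROTO=none
ONBOOT=yes
IPADDR=$2
NETMASK=$3
GATEWAY=$4
EOF
}

# Example: the values used for node2 in the notes.
render_ifcfg enp5s0f0 10.0.0.2 255.255.0.0 10.0.0.100
```

Redirect the output to /etc/sysconfig/network-scripts/ifcfg-enp5s0f0 once it looks right.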


Install Torque

  • Follow: Torque Admin Guide
  • On gourd, using /net/data/Torque
    • Get code: wget http://wpfilebase.s3.amazonaws.com/torque/torque-6.1.1.1.tar.gz
    • Pre-reqs: yum install -y libtool openssl-devel libxml2-devel boost-devel gcc gcc-c++
    • Install hwloc: yum install -y hwloc
    • ./configure --with-default-server=endeavour
    • make -j 8
    • make install
    • make packages
    • Copy the old /var/spool/torque back.
    • systemctl enable pbs_server.service and systemctl start pbs_server.service
    • Copy torque-package-mom-linux-x86_64.sh to slave node
    • Install with: ./torque-package-mom-linux-x86_64.sh --install
    • Start mom: systemctl enable pbs_mom.service && systemctl start pbs_mom.service
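
The server-side build steps above can be collected into one script. This is only a sketch: the run wrapper and the DRY_RUN variable are made up for illustration, and by default the script only prints the commands. Copying the old /var/spool/torque back is left out because the backup location is not given in these notes.

```shell
#!/bin/sh
# Dry-run wrapper (hypothetical helper): print each command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Build and start the Torque server; stops at the first failing step.
# Assumes the current directory is the unpacked torque source tree.
build_torque() {
    run yum install -y libtool openssl-devel libxml2-devel boost-devel gcc gcc-c++ hwloc &&
    run ./configure --with-default-server=endeavour &&
    run make -j 8 &&
    run make install &&
    run make packages &&
    run systemctl enable pbs_server.service &&
    run systemctl start pbs_server.service
}

build_torque
```

Run it once as-is to review the command list, then with DRY_RUN=0 as root to execute it.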

Upgrade main system

Started with a sideways migration to Centos 5. This worked except for the infiniband packages, which were skipped:

New system @ Centos 6.6:

  • Reconfigured RAID. All the 2TB drives are now in slots 1-9 and configured as a 14TB RAID6.
    • Slots 10, 11, 12 will be a hot-spare and 2x passthrough.
      • The passthrough slots are for: slot 11 -- can contain the Home directory drive when Gourd is being upgraded; slot 12 -- OldSys, a 1TB drive with the old Centos 5.5 system.
    • There are 3 volumes on the RAID: "system" ~ 100GB, "system2" ~100GB, "data1"
    • Remaining 12 slots will be filled with high density new drives for another RAID6
  • Restarted web server.
  • Setup fail2ban

Installing new RPMs on Nodes

  • The Centos DVDs are installed at /net/data/node10/RHEL/Centos-6.6
  • This dir is added to the c6-media in /etc/yum.repos.d/CentOS-Media.repo
  • Install packages with: yum --disablerepo \* --enablerepo c6-media install package_name

More recently, Endeavour is now a proxy server and yum on the nodes is set up to make use of the proxy. Thus "yum update" simply works.

Upgrading Endeavour nodes

Upgrade to Centos 6 started March 17, 2015 with node2:

  • Reboot Node2 from a USB key with Centos6 distribution installed. Chose "minimal install"
    • Note: Should have added scp, i.e openssh-client stuff. Added this "by hand" by using from endeavour: cat openssh-client-... | ssh node2 "cat - > openssh-client.rpm" and then installing that rpm.
  • SSH into the system
  • Copy the Centos ISO to node2 with scp. Mount on /mnt/centos
  • Install packages with: yum --disablerepo \* --enablerepo c6-media install
  • The list of previously installed packages is in ~root/new_packages.txt, with the distribution and package version already stripped. From this list, the packages were filtered into "installed" and "available" with yum. From the resulting list of "available", only the x86_64 and noarch packages were installed.
  • A number of config tweaks needed.
  • Nodes @ Centos 6.6
    • nodes: 2,3,7, 11, 13
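
The filtering of the old package list could look roughly like the sketch below. It is not the procedure that was actually used: it checks the RPM database with rpm -q instead of yum, the function names are made up, and it only separates "installed" from "missing". Whether a missing package is available from the repo still has to be checked with yum.

```shell
#!/bin/sh
# is_installed NAME: succeeds if the RPM database has the package.
# Assumption: rpm -q is used here; the notes used yum for this step.
is_installed() {
    rpm -q "$1" >/dev/null 2>&1
}

# classify_packages FILE: FILE holds one package name per line.
# Prints "installed NAME" or "missing NAME" for each non-empty line.
classify_packages() {
    while IFS= read -r pkg; do
        [ -n "$pkg" ] || continue
        if is_installed "$pkg"; then
            echo "installed $pkg"
        else
            echo "missing $pkg"
        fi
    done < "$1"
}
```

For example, classify_packages ~root/new_packages.txt | awk '$1 == "missing" {print $2}' gives the names that still need to be installed.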

Done

Node Cloning Recipe -- Adapted for Centos 7 --

Caveat Emptor -- do not follow these steps blindly. These are notes from the procedure in 2015.

Prepare drive:

 # If there is no label, or a new drive, or you did "dd if=/dev/zero of=/dev/sdb bs=1M count=1000" to wipe it:
 parted /dev/sdb mklabel msdos
 # else remove the old partitions
 # create new partitions:
 parted /dev/sdb mkpart primary 1049kb 251MB
 parted /dev/sdb mkpart primary 251MB 100%
 parted /dev/sdb toggle 1 boot
 parted /dev/sdb toggle 2 lvm
 # parted /dev/sdb set 1 bios_grub on # Needed if we chose gpt label, but grub did not like that.

Make the Boot Drive:

 mkfs.ext4 /dev/sdb1
 e2label /dev/sdb1 boot
 test -r /mnt/tmp_boot || mkdir -p /mnt/tmp_boot
 mount /dev/sdb1 /mnt/tmp_boot
 rsync  -avxHAX --numeric-ids /boot/  /mnt/tmp_boot
 # Check to make sure it is all there.
 # Rename node stuff -- This is Centos 6 only.
 cd /mnt/tmp_boot/grub
 sed 's/node2/node3/g' grub.conf  > tmp.conf && mv grub.conf grub.conf.orig && mv tmp.conf grub.conf
 sed 's/node2/node3/g' menu.lst  > tmp.lst && mv menu.lst menu.lst.orig && mv tmp.lst menu.lst
 cd /
 umount /dev/sdb1
 # Create Logical volume sets:
 # If this does not work, it can be that 'old' lvm partitions are recognized. Use ls -l /dev/mapper to find out which and then destroy them. You can delete all the links, and then finally use 'dmsetup remove <diskname>' to get rid of them.
 pvcreate /dev/sdb2     
 vgcreate centos_node3 /dev/sdb2
 lvcreate -L 16G  -n swap  centos_node3
 lvcreate -l 100%FREE  -n root  centos_node3
 mkswap -L swap /dev/centos_node3/swap 
 mkfs.ext4 -L root /dev/centos_node3/root
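
The volume group name changes with every target node, so it can help to generate the LVM commands instead of editing them by hand. A sketch, with the hypothetical helper lvm_plan; it assumes the LVM partition is partition 2 of the given disk, as in the steps above, and it only prints the commands.

```shell
#!/bin/sh
# lvm_plan NODE DISK: print the LVM commands for the clone disk of node NODE.
# Nothing is executed here; review the output before running it.
# lvm_plan is a hypothetical helper; partition 2 of DISK is assumed to be the LVM one.
lvm_plan() {
    node=$1
    disk=$2
    vg="centos_node${node}"
    cat <<EOF
pvcreate ${disk}2
vgcreate ${vg} ${disk}2
lvcreate -L 16G -n swap ${vg}
lvcreate -l 100%FREE -n root ${vg}
mkswap -L swap /dev/${vg}/swap
mkfs.ext4 -L root /dev/${vg}/root
EOF
}

lvm_plan 3 /dev/sdb
```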

Copy data:

 cd /
 test -r /mnt/tmp_root || mkdir -p /mnt/tmp_root
 mount /dev/centos_node3/root /mnt/tmp_root
 mkdir  /mnt/tmp_root/proc /mnt/tmp_root/dev /mnt/tmp_root/run /mnt/tmp_root/tmp
 chmod 1777 /mnt/tmp_root/tmp
 chmod  555 /mnt/tmp_root/proc
 mkdir   /mnt/tmp_root/net  /mnt/tmp_root/srv  /mnt/tmp_root/sys /mnt/tmp_root/boot /mnt/tmp_root/data
 rsync  -avxHAX --numeric-ids bin etc lib lib64 media opt root sbin usr var /mnt/tmp_root

Prep BOOT for grubbing

 mount /dev/sdb1 /mnt/tmp_root/boot
 mount -o bind /dev /mnt/tmp_root/dev
 mount -o bind /proc /mnt/tmp_root/proc
 mount -o bind /sys  /mnt/tmp_root/sys
 mount -o bind /run  /mnt/tmp_root/run
 chroot /mnt/tmp_root

Grub drive

 # You may need to run this to reinstall the grub stuff: yum reinstall grub2-tools
 dracut -f                                # Create new initramfs.
 grub2-install /dev/sdb
 grub2-mkconfig -o /boot/grub2/grub.cfg
  • If you get an error: "WARNING: Failed to connect to lvmetad. Falling back to device scanning." then /run was not properly bound to /mnt/tmp_root/run so lvm is not working properly
  • If you get "grub2-install: warning: this GPT partition label contains no BIOS Boot Partition; embedding won't be possible." then you set a "boot" partition, but not a "bios_grub" partition. Execute: "parted /dev/sdb set 1 bios_grub on"
 # Exit chroot
 exit

Fixup /etc

 cd /mnt/tmp_root/etc
 # Fix up fstab, sysconfig/network, sysconfig/network-scripts
 sed 's/node2/node3/g;' fstab > tmp.tmp && mv fstab fstab.orig && mv tmp.tmp fstab
 cd sysconfig
 cd network-scripts
 sed 's/10\.0\.0\.2/10.0.0.3/g;/HWADDR/d;/UUID/d' ifcfg-eth0 > tmp.tmp && mv ifcfg-eth0 xx_ifcfg-eth0.orig && mv tmp.tmp ifcfg-enp5s0f0
 sed 's/10\.1\.0\.2/10.1.0.3/g;/HWADDR/d;/UUID/d' ifcfg-ib0 > tmp.tmp && mv ifcfg-ib0 xx_ifcfg-ib0.orig && mv tmp.tmp ifcfg-ib0
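
A slightly stricter variant of the address rewrite, as a sketch. The helper name retarget_ip is made up. It assumes the ifcfg values are unquoted (IPADDR=10.0.0.2, not IPADDR="10.0.0.2") and only replaces a value that is exactly the old address, so something like 10.0.0.200 is left alone.

```shell
#!/bin/sh
# retarget_ip OLD NEW FILE: print FILE with values equal to OLD replaced by NEW
# and with the HWADDR and UUID lines dropped. FILE itself is not modified.
# retarget_ip is a hypothetical helper; unquoted KEY=value lines are assumed.
retarget_ip() {
    # Escape the dots so they match literally instead of any character.
    old_re=$(printf '%s' "$1" | sed 's/\./\\./g')
    sed -e "s/=${old_re}\$/=$2/" -e '/^HWADDR=/d' -e '/^UUID=/d' "$3"
}
```

Usage would be along the lines of: retarget_ip 10.0.0.2 10.0.0.3 ifcfg-eth0 > ifcfg-enp5s0f0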

Closeout

 cd /
 umount /mnt/tmp_root/boot
 umount /mnt/tmp_root/dev
 umount /mnt/tmp_root/proc
 umount /mnt/tmp_root/sys
 umount /mnt/tmp_root/run
 umount /mnt/tmp_root
 echo "ALL DONE - Shut down the mother node, take out the 2nd hard drive, put it in the destination node and boot that node. Repeat for next node."
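
The bind mounts from "Prep BOOT for grubbing" and the matching umounts above have to stay in sync, so one option is to drive both from loops. A sketch only: run, DRY_RUN and the two function names are made up for illustration, and by default the commands are printed, not executed. The /boot mount and the final umount of the root are left as separate manual steps.

```shell
#!/bin/sh
# Dry-run wrapper (hypothetical helper): print each command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# bind_chroot_mounts ROOT: bind /dev, /proc, /sys and /run into ROOT.
bind_chroot_mounts() {
    for fs in dev proc sys run; do
        run mount -o bind "/$fs" "$1/$fs" || return 1
    done
}

# unbind_chroot_mounts ROOT: undo the bind mounts in reverse order.
unbind_chroot_mounts() {
    for fs in run sys proc dev; do
        run umount "$1/$fs" || return 1
    done
}

bind_chroot_mounts /mnt/tmp_root
unbind_chroot_mounts /mnt/tmp_root
```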
       

To Do

  • Configure MPI
    • Later. I'm not sure anyone is using this right now.
  • Reconfigure Ganglia