Recent Config Changes

From Nuclear Physics Group Documentation Pages
Revision as of 00:48, 17 February 2019 by Maurik (talk | contribs) (→‎2018)
Reverse Chronological Order.

2019

  • 2019/02/16 -- Change JupyterHub setup to use sudospawner.
  • 2019/02/16 -- yum update on Taro, Gourd, Endeavour, Pumpkin and node2. Farm is busy, wait with other nodes.
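
The rolling update in the 2019/02/16 entry can be scripted as a loop over hosts. This is only a minimal sketch, not the procedure actually used: the function name update_hosts, the DRY_RUN switch, and root ssh access to each host are assumptions. Only the host names and "yum update" come from the log.

```shell
#!/bin/sh
# Run "yum -y update" on each host given as an argument, one at a time.
# ASSUMPTION: root can ssh to every host (not stated in the log).
# With DRY_RUN=1 the commands are printed instead of executed.
update_hosts() {
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh root@$host yum -y update"
        else
            # Keep going if one host fails, but say so on stderr.
            ssh "root@$host" yum -y update || echo "update failed on $host" >&2
        fi
    done
}
```

A cautious first pass would be "DRY_RUN=1; update_hosts taro gourd endeavour pumpkin node2" to review the commands before running them for real. The busy farm nodes can be added to the argument list later.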

2018

  • 2018/11/13 -- postfix config update. Permit access to postfix from specific systems (Jlab), even if there is no reverse DNS lookup.
  • 2018/11/13 -- *this* wiki: login was not working - needed to install php-ldap module.
  • 2018/10/12 -- Install JupyterHub on Endeavour.
  • 2018/10/01 -- Update the splunk setup on all relevant systems. Gourd is now the aggregator.
  • 2018/09/xx -- Installed JupyterHub on Roentgen, to see if this could serve Python notebooks to students. Added one student (angus).
  • 2018/09/18 -- Added rules to dovecot.conf and postfix.conf in /etc/fail2ban/filter.d to block more spamming and troublesome IP addresses.
  • 2018/09/06 -- New Roentgen has Anaconda3 in /usr/local/anaconda3 to serve up JupyterHub. Seems to work, but is in testing phase. Needs systemctl script.
  • 2018/09/06 -- Upgraded Roentgen to Centos 7. Trickier than expected, but seems to be a go. Switchover today at 7pm.
  • 2018/08/02 -- Run "package-cleanup -y --oldkernels --count=1" on all nodes to get rid of old kernels.
  • 2018/07/14 -- After Lentil seemed all OK, the backup still did not run. Issue: Too many directories, so the disk appears full when it is not, and no new directory could be created. Solution: create a 2016 and 2017 directory and move all those backups into those directories.
  • 2018/07/14 -- Lentil did not make backups since at least 7/7. Problem was link to backend network. Also: added disk: NPG daily 54, a 2TB Seagate drive. I marked the slots that do not seem to work.
  • 2018/07/14 -- Endeavour disk 9 kept running hot. Pulled the drive and now RAID is rebuilding on hot spare.
  • 2018/04/30 -- Noticed that Gourd has an IPMI card, but it is not connected to an ethernet port????? This is a SuperMicro AOC-SIM1U card.
  • 2018/04/30 -- While working on Gourd, we noticed fan noise. Two of the 40x40x28 Sunon fans made noise. These were taken out and replacements are on order.
  • 2018/04/30 -- Gourd issue was that one of the software RAID drives died. At first it looked like *both* were a goner and we would be in trouble. Turns out the other drive was no longer a pass-through. This may be an issue with the controller, since after a complete power off one of the pass-through drives recovered. A new 2TB drive in slot5 and the raid is recovered.
  • 2018/04/27 -- Gourd died. First, the /dev/md# drives became read only, and a reboot just hung the system.
  • 2018/04/xx -- The main UPS died, just died. I got new batteries for it thinking that would fix the problem, and spent a lot of time going back and forth with APC, who could not help. In the end I opened it up, and found that no power came in because the AC plug had melted internally. New plug, now all is OK. :-) Saved $1200.
  • 2018/04/01 -- Do not autostart the Centos pool on Gourd. It creates a dependency on Endeavour's drives and slows down the boot if Endeavour is not reachable. It also depends on automount already working, which it will not be, since LDAP is hosted on Einstein, which is a VM on Gourd.
  • 2018/03/30 -- The UPS powering the Gourd, Pumpkin, Taro, Lentil rack is old. The batteries are dying, so much so that it would not power up. *Needed:* Check if new batteries will fix it.
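
The fix in the 2018/07/14 entry (moving old backups into per-year directories so the top level holds fewer entries) could be sketched as below. The log does not give the backup directory names, so the YYYY-MM-DD naming pattern, the function name archive_backups_by_year, and its arguments are all assumptions for illustration.

```shell
#!/bin/sh
# Move dated backup directories for one year into a per-year subdirectory.
# ASSUMPTION: backups sit directly under $1 and are named YYYY-MM-DD;
# the real naming scheme on Lentil is not given in the log.
archive_backups_by_year() {
    root="$1"
    year="$2"
    mkdir -p "$root/$year" || return 1
    for d in "$root/$year"-??-??; do
        # Skip the literal pattern when the glob matches nothing,
        # and anything that is not a directory.
        [ -d "$d" ] || continue
        mv "$d" "$root/$year/"
    done
}
```

Because the move stays on the same filesystem, it is a rename and does not copy any data. Directories from other years are left where they are.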

2017

  • 2017/08/28 -- Mediawiki config - point LDAP to "einstein" and not "einstein.farm.physics.unh.edu", else it does not accept the TLS certificate.
  • 2017/08/28 -- Replaced drive #5 in Endeavour. It kept giving "overheating" errors, though the drive feels cool. Probably bad temp sensor.
  • 2017/08/10 -- Copy contents of /var/www from Roentgen to /net/data/www - Later we mount /net/data/www on /var/www for roentgen - readying the webserver duplication.
  • 2017/08/10 -- Export /kvm /www on gourd, and place in auto mount map. ==> /net/data/kvm /net/data/www
  • 2017/08/10 -- Move /kvm on pumpkin to /dev/md123 a RAID1 drive, and export over NFS.
  • 2017/08/10 -- Named replicated on einstein and pepper (both slaves to jalapeno)
  • 2017/08/10 -- Export pumpkin:/scratch and let it be automounted to /net/data/scratch
  • 2017/08/10 -- LDAP replication started on Pepper (VM)
  • 2017/08/09 -- Cleanup the /mail on Gourd, delete dirs for people no longer here. Also cleanup the number of groups in LDAP.
  • 2017/08/08 -- Towards upgrading the mail system. chmod 0600 /var/spool/mail/* See: Dovecot ChgrpNoPerm
  • 2017/08/08 -- Upgrade Taro to Centos7. Review of the iptable rules.
  • 2017/08/07 -- Jalapeno authenticates users against Pepper for testing.
  • 2017/08/02 -- Upgrade Jalapeno to Centos7. Cleaned up version of the named.conf. Jalapeño is now root login only. See jalapeno
  • 2017/08/02 -- Issues with named on Jalapeño. It does not do forwarding correctly. Issue was the IP address for UNH network.
  • 2017/08/01 -- Has it really been that long since anything was done? Yikes.
  • 2017/07/04 -- Install iptables-netgroups on Jalapeño and the new Einstein.
  • 2017/07/04 -- Install Fail2Ban properly on Gourd, new Einstein and Jalapeño.
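
The 2017/08/08 mail spool change ("chmod 0600 /var/spool/mail/*") can be made a little more careful by touching only regular files whose mode is not already 0600. This is a sketch under that assumption. The function name tighten_mail_spool is hypothetical, and -maxdepth is a GNU/BSD find extension rather than POSIX.

```shell
#!/bin/sh
# Set mode 0600 on every regular file directly inside a mail spool directory.
# Defaults to /var/spool/mail, the path named in the log entry.
tighten_mail_spool() {
    spool="${1:-/var/spool/mail}"
    # -maxdepth 1: do not descend into subdirectories.
    # ! -perm 600: skip files that already have exactly mode 0600.
    find "$spool" -maxdepth 1 -type f ! -perm 600 -exec chmod 600 {} +
}
```

Running the same find command with -print in place of the -exec clause lists what would change without modifying anything.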

2016

  • 2016/11/29 -- Extended power down over Thanksgiving break caused big "foo bar" on our main server: Gourd.
    • System would not boot and hang on "systemctl emergency" barf.
    • When logging in as root in the emergency setup, it was clear from /proc/mdstat that the RAIDs were renamed. However, after each mdadm --assemble command, the *%$#! systemctl would reboot.
    • Solution was not so obvious:
      1. Reboot with the OLD kernel and all /dev/md* commented out of /etc/fstab.
      2. Reset the /dev/md* names using "mdadm --manage /dev/md125 --stop", followed by reassembly: "mdadm --assemble /dev/md0 /dev/sdd1 /dev/sdf1" etc. (3 times)
      3. Rebuild the initramfs for the *NEW* kernel: "dracut --force /boot/initramfs-(new kernel number).img (new kernel number) -M"
      4. Separate issue was that nfs-server did not start up as expected. "systemctl enable nfs-server"
      5. Reboot - then mop up. Make sure Einstein starts up.
      6. In addition: Einstein VM and Roentgen VM and Jalapeno VM and Corn VM are now set to restart automatically when Gourd/Pumpkin reboot.
  • 2016/06/16 -- The sshd mod from yesterday had the side effect that backups no longer worked. Reason: "lentil.farm.physics.unh.edu" in authorized_keys could not be resolved, since DNS lookups are no longer done. Problem solved by adding 10.0.0.250 to the permitted systems in authorized_keys.
  • 2016/06/15 -- On each of the nodes, point to 10.0.0.100 in ntp.conf and start ntpd. Now they all (mostly) agree on what time it is.
  • 2016/06/15 -- On RHEL 6 or 7 systems, ssh became really slow. The reason was reverse dns lookups, which aren't really needed. I set "UseDNS no" in the sshd_config files and now ssh is fast again.
  • 2016/05/24 -- Move the Mail Home KVM RAIDS from Endeavour back to Gourd. See: Move Mail RAID
  • 2016/05/04 -- Turn off outside access to DNS on Jalapeno. Jalapeno will now only listen to 10.0.0.* for DNS requests.
  • 2016/05/04 -- Sideways migrate Taro to Centos 5. See: Sideways Migration from RH to Centos
  • 2016/03/20 -- Setup "epel" on RH7 Pumpkin with: "rpm -Uvh http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm"
  • 2016/03/10 -- Used proxy to update all software on all nodes, except 10 and 11, which were down. Rebooted all nodes < 10; 4 and 7 did not return.
  • 2016/03/10 -- Installed squid proxy server on endeavour
  • 2016/03/10 -- Added rafopar to the domain_admins
  • 2016/02/27 -- Lentil started running cron jobs twice (probably since the update). Issue is both cron and anacron running /etc/cron.daily. Disabled anacron from running those jobs.
  • 2016/02/25 -- Jalapeno was used in a DDOS attack. Our DNS (bind) setup was too open, allowing "recursion". Closed it way down to "peers".
  • 2016/02/20 -- Updated all our systems in response to libc vulnerability. Note: only RH6+7 are affected, but we updated the RH5 systems as well. Some, but not all, were also rebooted.
  • 2016/01/05 -- Rebooted Pumpkin on a new MB, Intel i7 CPU, 16 x 4 TB WD drives + 2x WD750 for system in RAID0. System installed is Centos 7
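
The 2016/06/15 sshd change (setting "UseDNS no") can be applied the same way on every machine with a small helper. This is a sketch only: the function name set_usedns_no is hypothetical, it takes the config file path as an argument so it can be tried on a copy first, and it assumes a sed that supports -i and -E (GNU sed on RHEL/Centos does).

```shell
#!/bin/sh
# Ensure an sshd_config-style file contains "UseDNS no".
# An existing UseDNS line (commented out or not) is rewritten in place;
# if there is none, the setting is appended at the end of the file.
set_usedns_no() {
    conf="$1"
    if grep -Eq '^[[:space:]]*#?[[:space:]]*UseDNS([[:space:]]|$)' "$conf"; then
        # Keep a .bak copy of the original next to the edited file.
        sed -i.bak -E 's/^[[:space:]]*#?[[:space:]]*UseDNS([[:space:]].*)?$/UseDNS no/' "$conf"
    else
        printf 'UseDNS no\n' >> "$conf"
    fi
}
```

After "set_usedns_no /etc/ssh/sshd_config", sshd has to be restarted for the change to take effect. Note that appending at the end of the file may put the setting inside a trailing Match block if the file has one, so check the result by hand. As the 2016/06/16 entry shows, any host names used in authorized_keys stop resolving once reverse lookups are off, so those entries need IP addresses instead.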

2015

  • 2015/08/31 -- Setup Gourd's new networking for "bridge".
  • 2015/08/28 -- installed the Maui scheduler as the default scheduler. It runs under "maurik" at this point.
  • 2015/08/28 -- Patched the Maui scheduler (code in /data1/System/maui-3.3.1 ) to take command line options properly (i.e. -d)
  • 2015/08/28 -- Fixed LOG issue on the pbs_server. Code is in /data1/System/torque.git file src/server/node_manager.c
  • 2015/08/28 -- Added a list to Einstein fail2ban rules to exclude "lost connection after AUTH from".
  • 2015/08/26 -- uninstall postfix, install sendmail on Endeavour. Postfix is too complicated, all we need is a mail forwarder.
  • 2015/08/26 -- Reinstall Splunk on Taro -- license overruns made it useless.
  • 2015/08/11 -- Started upgrade of Gourd to Centos 7, since it wasn't booting anymore anyway. Ran into trouble with the ethernet driver.
  • 2015/08/11 -- Roentgen aka Nuclear is running again on Endeavour. It is on OpenServer Net.
  • 2015/08/11 -- Einstein now on the OpenServer net. NOTE: They STILL have a tendency to block port 25 (incoming email), so an exception is made of this machine.
  • 2015/06/09 -- Got Torque (Portable Batch System) to run again on Endeavour - Yohoo!
  • 2015/06/09 -- Fixed mail not arriving, since port 25 was blocked again.
  • 2015/06/08 -- Fixed ldap connection on Lentil: MUST connect to einstein.farm.physics.unh.edu
  • 2015/06/08 -- Fixed mounting issues on Centos 6 systems (corn, lentil, endeavour, nodes)
  • 2015/06/08 -- Einstein and Jalapeno moved from Gourd to Endeavour.
  • 2015/06/08 -- Properly started sssd on Corn and Jalapeno.
  • 2015/06/08 -- Moved Corn from Gourd to Endeavour.
  • 2015/06/08 -- Bridged all 3 interfaces on Endeavour. br0 = farm, br1= server net, br2 = unh
  • 2015/05/26 -- Reset root passwords.
  • 2015/05/26 -- Einstein mounts /mail from npghome:/mail which now is hosted on Endeavour.
  • 2015/05/26 -- Moved the /home and /mail drives from Gourd to Endeavour, reconstitute as RAID1 on slot 23 (0:0:0:4)=/dev/sde# and slot 24 (0:0:0:5)=/dev/sdf# -- Set backup to backup /home and /mail on Endeavour.
  • 2015/05/19 -- Cleaned up some old user home dirs. Anyone who did not have a login now also does not have a /home or /mail drive. Some users set to /bin/false to be removed later.
  • 2015/05/19 -- Inserted a 1TB drive in Gourd and set it up to mirror /home /mail and /kvm as it is supposed to be RAID1. No clue what happened with the original mirror disk.
  • 2015/05/19 -- Tried to get backup emails to work again on Lentil. We'll see.
  • 2015/05/14 -- Virtualization stopped on Lentil. Not needed on backup system.
  • 2015/05/14 -- Splunk running on Taro, Endeavour, Gourd, Einstein, Roentgen, Lentil
  • 2015/04/15 -- Fail2ban running on Endeavour.
  • 2015/04/15 -- Endeavour web server resurrected. Ganglia still needed.
  • 2015/04/14 -- Splunk restarted on Taro, setup on Einstein (forwarding to Taro).
  • 2015/04/09 -- Roentgen and Nuclear moved to Taro and Open Server net. MySQL DB copied over in final state. (so now you are looking at the new Roentgen.)
  • 2015/04/08 -- Node cloning in progress. See How to Clone a Node
  • 2015/04/08 -- Move to Open Server Net in progress. See Move To Open Server Net
  • 2015/04/03 -- Taro is now on Server-Open network with new IP 132.177.180.86. Endeavour is on 132.177.180.225.
  • 2015/04/02 -- Endeavour upgrade documented at Upgrading Endeavour -- Node 2 is nearly done.
  • 2015/04/02 -- The backups are running but still not sending mail. It seems NO ONE IS LOOKING AT THIS! (that is a good way to piss me off.)
  • 2015/03/12 -- MailMan E-mail is not working properly. Changed backup cron job (/etc/cron.daily/0rsync_backup) to send email directly.
  • 2015/03/08 -- Gourd CentOS 6 Migration
  • 2015/03/02 -- SpamAssassin and E-mail -- misconfigured postfix did not run spam properly.
  • 2015/03/01 -- SpamAssassin Updated documentation how to filter spam better.
  • 2015/02/15 -- taro updated the globus toolkit for data transfers to/from Jlab.
  • 2015/02/03 -- lentil Changed the network-scripts. Lentil was still trying to use the VLAN, while it was directly connected to the UNH network. (Shame on us!) Also, configured sssd to contact einstein.farm.physics.unh.edu instead of einstein.unh.edu.
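
The 2015/05/19 cleanup left some users with their shell set to /bin/false, "to be removed later". A filter like the one below could list them again when that time comes. It is a sketch, not what was actually run: the function name users_with_shell is hypothetical, and it reads passwd-format lines on standard input.

```shell
#!/bin/sh
# Print the login names whose shell (7th passwd field) equals the given shell.
# Reads passwd(5)-format lines on stdin.
users_with_shell() {
    awk -F: -v shell="$1" '$7 == shell { print $1 }'
}
```

On a machine that resolves users through LDAP, "getent passwd | users_with_shell /bin/false" should include the LDAP accounts as well as the local ones. Some system accounts may also use /bin/false, so the list needs a manual check before anything is deleted.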

2014

  • 2014/11/29 -- einstein Dovecot.conf:45 disable_plaintext_auth = yes
  • 2014/11/29 -- einstein Changed the NTP servers to ns1.unh.edu, ns2.unh.edu, nic.unh.edu, since these actually work!
  • 2014/11/04 -- einstein taro Iptables have a new, long, blacklist.
  • 2014/11/04 -- einstein Changed Postfix authentication module from smtpd to dovecot. This fixes the issue with postfix claiming authentication methods which don't actually work.
  • 2014/10/05 -- corn and jalapeno Change: Fully transitioned RHEL repositories and packages to their equivalent CentOS versions. Both changes required downloading and installing CentOS's repository keys, removing all packages that start with rhn (replaced with the CentOS versions), then cleaning all cached packages and running a standard yum upgrade. Details on the commands used can be found at http://knowledgelayer.softlayer.com/procedure/convert-redhat-centos
  • 2014/09/26 -- einstein Change: Stop the avahi-daemon service and take it out of the automatically started services. Avahi-daemon implements Apple's "bonjour" protocols, which I don't think we need.

Older Pages with Completed tasks

Completed in Jan/Feb/Mar 2009
Completed in July/Aug/Sep/Oct/Nov 2008
Completed in March/April/May/June 2008
Completed in February 2008
Completed in January 2008
Completed in November/December 2007
Completed in October 2007
Completed in September 2007
Completed in August 2007
Completed in July 2007
Completed in June 2007