Sysadmin Checklist

From Nuclear Physics Group Documentation Pages

These are the most basic things you need to do, every day if you can.

  1. Check the backups:
    1. Did the backup run? (You should have gotten an email.)
    2. Did all the machines, especially Gourd, Einstein, Roentgen, get backed up completely?
    3. Do we need to insert another disk and file an old one?
  2. Check the mail system:
    1. Is einstein up and can port 25 be reached? (E.g. "nc -z -w5 HOST 25; echo $?", with HOST set to einstein's address, should print 0; some versions of nc also print "succeeded".)
    2. Is fail2ban or denyhosts running properly?
    3. Check /var/log/maillog and /var/log/messages on einstein. Anything odd, any errors, any break-in attempts that may have been successful?
      1. This is where Splunk can be really helpful; unfortunately, nobody has been maintaining it.
      2. Don't kill yourself looking at the logs, but please do check them at times for oddities. Make a log of the oddities that are actually normal so others know what to look for. Link that log here.
  3. Check VMs
    1. Are the VMs running on Gourd?
    2. Is there sufficient disk space for / on all machines?
    3. What are the CPU usage levels? Are they normal, or is a machine using too much CPU or memory?
  4. Check LDAP:
    1. Does the LDAP server (einstein) return required information quickly?
  5. Check the DNS server(s):
    1. Does "nslookup taro" return the correct answer, and does it return quickly? If it takes too long, named is not working properly.
  6. Check the Web Server
    1. Is it running, is it reachable, is it getting attacked?
  7. Server room:
    1. Occasionally go to the server room. Is the air conditioning working? Any beeping noises? Are all machines powered? Is there water on the floor, or any sign of fire?
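
A few of the checks above (port 25 reachable, disk space on /, DNS answering quickly) lend themselves to small shell helpers. The following is only a sketch, not an existing script on these machines: the function names are made up for illustration, and any thresholds you pass in are your own choice. The nc flags are the ones used in the mail check above.

```shell
#!/bin/sh
# Hypothetical helpers for the daily checks. Names and thresholds are
# assumptions for illustration; adjust them to the real setup.

# Read POSIX-format "df -P" output on stdin and print the capacity
# figure (without the % sign) of the last filesystem listed.
disk_usage_pct() {
    awk 'NR > 1 { gsub(/%/, "", $5); pct = $5 } END { print pct }'
}

# Succeed if / on the local machine is used less than LIMIT percent.
# Usage: root_disk_ok LIMIT
root_disk_ok() {
    limit=$1
    used=$(df -P / | disk_usage_pct)
    [ "$used" -lt "$limit" ]
}

# Succeed if a TCP connection to HOST PORT opens within 5 seconds.
# Usage: port_open HOST PORT
port_open() {
    nc -z -w5 "$1" "$2" >/dev/null 2>&1
}

# Run a command and succeed only if it exits 0 and took at most
# SECONDS seconds. The command is not interrupted if it is slow.
# Usage: quick_enough SECONDS command [args...]
quick_enough() {
    limit=$1; shift
    start=$(date +%s)
    "$@" >/dev/null 2>&1 || return 1
    end=$(date +%s)
    [ $((end - start)) -le "$limit" ]
}
```

Assuming the short host names resolve from where you run it, the helpers could be used like "port_open einstein 25 || echo 'mail port unreachable'", "quick_enough 2 nslookup taro || echo 'DNS slow or failing'", and "root_disk_ok 90 || echo '/ is nearly full'". The 2 second and 90 percent limits are placeholders, not agreed values.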