Difference between revisions of "Sysadmin Todo List"

From Nuclear Physics Group Documentation Pages

Revision as of 12:16, 27 June 2007

General Info

This is an unordered set of tasks. Detailed information on any of the tasks typically goes in related topics' pages, although usually not until the task has been filed under Completed.

Important

  • Find out why Steve isn't getting paid what he's supposed to be getting paid.
  • Nobody is currently reading the mail that is sent to "root". Einstein had 3000+ unread messages. I deleted almost all of them. Some useful messages with diagnostics in them are sent to root, so we should find a solution for this. Temporarily, both Matt and Steve have email clients set up to access root's account.
  • Enable SMP on lentil
  • Printer queue for Copier: Konica Minolta Bizhub 750. IP=pita.unh.edu. Seems like we need info from the Konica guy to get it set up on Red Hat and OS X. The installation documentation for the driver doesn't mention things like the passcode, because those are machine-specific. Katie says that if he doesn't come on Monday, she'll make an inquiry. Mac OS X is now working; the IT guy should be here the week of June 26th.
  • Figure out what network devices on tomato are doing what
  • Look into monitoring RAID, disk usage, etc.
  • Look into getting computers to pull scheduled updates from rhn when they check in.
  • Need to get onto the "backups" shared folder, as well as be added as members to the lists. "backups" wasn't even a mailing list, according to the Mailman interface.
  • Figure out exactly what our backups are doing, and see if we can implement some sort of NFS user access. NPG_backup_on_Lentil.
  • I set up "splunk" on einstein (production 2.2.3 version) and taro (beta 3 v2). I like the beta's functionality better, but it has a memory leak. Look for an update to the beta that fixes this and install it. (See: www.splunk.com/base/forum:SplunkGeneral) While this sounds like it could only be indirectly related to our issue, it does sound close enough and is the only official word on splunk's memory usage that I could find:[1]
    When forwarding cooked data you may see the memory usage spike and kill the splunkd process. This should be fixed for beta 3.
    So, waiting for the next beta or later sounds like the best bet. I'm wary of running beta software on einstein, anyhow.
  • Learn how to use cacti on okra. Seems like a nice tool, mostly set up for us already.
  • Find out why lentil and okra (and tomato?) aren't being read by cacti. Could be related to the warnings that repeat in okra:/var/www/cacti/log/cacti.log
  • Learn how to set up evolution fully so we can support users. Need LDAP address book.
  • Matt's learning a bit of Perl so we can figure out exactly how the backup works, as well as create more programs in the future, specifically thinking of monitoring. Look into the CPAN modules under Net::, etc.
  • Figure out what happened to lentil's Perl binary. System logs don't show any obviously malicious logins, etc. My current suspicion is that a typo in some other script led to it (maybe something like > when they meant |).
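
For the "monitoring RAID, disk usage, etc." item above, a rough starting point for the disk-usage half might look like the sketch below. It is only a sketch: it assumes a Python 3 interpreter is available on the machine, the function names and the 90% threshold are made up for illustration, and it says nothing about RAID health.

```python
import shutil


def percent_used(path):
    """Return roughly how full the filesystem holding `path` is, in percent.

    This is used/total, so it can differ slightly from what `df` prints,
    since df accounts for reserved blocks differently.
    """
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def over_threshold(paths, threshold=90.0):
    """Return (path, percent) pairs for every path at or above `threshold`.

    The 90% default is an arbitrary illustrative value, not a site policy.
    """
    report = []
    for path in paths:
        pct = percent_used(path)
        if pct >= threshold:
            report.append((path, pct))
    return report
```

Calling over_threshold(["/", "/home"]) from a cron job and mailing any non-empty result would be one way to use it; which mount points matter on each machine is still to be decided.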

Ongoing

  • Maintain the Documentation of all systems!
    • Main function
    • Hardware
    • OS
    • Network
  • Clean up 202
    • Figure out what's worth keeping
    • Figure out what doesn't belong here
  • Take a look at spamassassin - Improve Performance of this if possible.
  • Test unknown equipment:
    • UPS
  • Printer in 323 is not hooked up to a dead network port. Actually managed to ping it. One person reportedly got it to print, nobody else has, and that user has been unable to print ever since. Is this printer dead? We need to find out.
  • Eventually come up with a plan to deal with pauli2's kernel issue
  • Look into making a centralized interface to monitor/maintain all the machines at once. Along the same lines: Continue homogenizing the configurations of the machines.
  • Figure out why jalapeno doesn't have 3dm software running. If we find that there's no good reason, maybe we should install it?
  • Certain settings are similar or identical for all machines. It would be beneficial to write a program to do remote configuration. This would also simplify the process of adding/upgrading machines.
  • Update Tomato to RHEL5 and check all services einstein currently provides. Then switch einstein <-> tomato, and then upgrade what was originally einstein. Look into making an einstein, tomato failsafe setup.
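
As a first step toward the remote configuration program mentioned above, something like the following could run one command on every machine over ssh. This is a sketch under assumptions: it presumes key-based ssh logins already work from wherever it is run, the host list is just the names that appear on this page, and build_ssh_command and run_everywhere are hypothetical helper names.

```python
import subprocess

# Hostnames copied from this page; illustrative only, not a full inventory.
HOSTS = ["einstein", "tomato", "taro", "lentil", "okra", "gourd", "jalapeno"]


def build_ssh_command(host, remote_command):
    """Build the argument list for running `remote_command` on `host`.

    BatchMode=yes makes ssh fail instead of stopping to ask for a password,
    so an unattended run does not hang on a machine without keys set up.
    """
    return ["ssh", "-o", "BatchMode=yes", host, remote_command]


def run_everywhere(remote_command, hosts=HOSTS, runner=subprocess.run):
    """Run `remote_command` on each host in turn.

    Returns {host: (exit_code, stdout)}. `runner` is injectable so the
    function can be exercised without touching the real machines.
    """
    results = {}
    for host in hosts:
        proc = runner(
            build_ssh_command(host, remote_command),
            capture_output=True,
            text=True,
        )
        results[host] = (proc.returncode, proc.stdout.strip())
    return results
```

For example, run_everywhere("uname -r") would collect the kernel version from each machine, which is also a cheap way to see how far the configurations have drifted apart. Pushing config files would need more than this (rsync or similar), and it would be wise to try any command on one non-critical machine before running it everywhere.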

Completed

  • Get ahold of a multimeter so we can test supplies and cables. Got a tiny portable one.
  • Order new power supply for Taro. Ordered from Newegg 5/24/2007.
  • Weed out unnecessary cables; we don't need a full box of ata33 and another of ata66. Consolidated to one box.
  • Installed new power supply for Taro.
  • Get printer (Myriad) working.
  • Set up skype.
  • Fix sound on improv so we can use skype and music. Sound was set to ALSA; setting it to OSS fixed it. Should have worked either way, though.
  • Check out 250gb sata drive found in old maxtor box. Clicks/buzzes when powered up.
  • Look into upgrades/patches for all our systems. Scheduled critical updates to be applied to all machines but einstein. If they don't break the other machines, like they didn't break the first few, I'll apply them to einstein too.
  • Started download of fedora 7 for the black computer. 520KB/s downstream? Wow.
  • Get Pauli computers up. I think they're up. They're plugged in and networked.
  • Find out what the black computer's name is! We called it blackbody.
  • "blackbody" is currently online. For some reason, grub wasn't properly installed in the MBR by the installer.
  • Consolidate backups to get a drive for gourd. Wrote a little script to get the amanda backup stuff into a usable state, taking up WAY less space.
  • Submitted DNS request for blackbody. blackbody.unh.edu is now online
  • Label hard disks. Labeled all the ones we could identify!
  • Labeled machines' farm Ethernet ports
  • Made RHEL5 server discs.
  • Replaced failed drive in Gourd - 251gb maxtor sata. Apparently the WD drives are 250, maxtor are 251.
  • Repair local network connection on Gourd.
  • Repair LDAP on Gourd (probably caused by net connection). Replacing the drive fixed every gourd problem!! Seems to have been related to the lack of space on the RAID when a disk was missing. If not that, IIAM (It is a mystery).
  • New combo set on door for 202.
  • Tested old PSU's, network cables, and fans.
  • All machines now report to rhn. None of them pull scheduled updates though. Client-side issue?
  • Documentation for networking - we have no idea which config files are actually doing anything on some machines. Pretty much figured out, but improv and ennui still aren't reachable via farm IPs. But considering the success of getting blackbody set up from scratch, it seems that we know enough to maintain the machines that are actually part of the farm.
  • Make network devices eth0 and eth0.2 on all machines self-documenting by giving them aliases "farm" and "unh". Remaining: jalapeno and tomato.
  • Figure out why pauli nodes don't like us. (Low-priority!) Aside from pauli2's kernel issue, this is taken care of.
  • Learned how to use the netboot stuff on tomato.
  • Set up the nuclear.unh.edu web pages to serve up the very old but working pages instead of the new but broken XML ones.
  • Scheduled downtime to install taro's power supply.
  • Successfully installed Fedora 7 on ennui. Getting this and blackbody set up leads me to believe that we have a good grasp on the configuration of the network, authentication, VLAN, etc.
  • Installed power supply in taro. A bit crooked since the mounting is non-standard, but it's secure.
  • Where is the backup script? Look in /usr/local/bin. Matt found it: rsync_backup.pl
  • Set up 3dm raid manager on gourd with remote access. The other machines don't have this tool installed.
  • Added NPG-Daily-28 drive to lentil.
  • Enabled SMP on jalapeno