Einstein Status (Nuclear Physics Group Documentation Pages, last edited by Steve, 2008-06-19)

[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]

# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable, and label/copy them.
## Network interfaces - UNH: ???? Farm: Done!
## [[Iptables]] - Not working, possibly because of Gourd's weirdness.
## [[DNS]] - Done!
## [[LDAP]] - Done!
## [[Postfix]] - Need to get TLS working. It's been set in main.cf but hasn't been applied yet; we need to make sure the certs are set up correctly.
## [[AMaViS]]
## [[ClamAV]]
## [[SpamAssassin]] - Done!
## [[Dovecot]] - Functioning; still need to migrate the Cyrus e-mail from the old einstein.
## [[automount|/home]] - Done!
## [[Samba]] - No LDAP integration, so anyone who needs Samba access has to contact us and have a Samba account made for them.
## [[Web Servers|Web]]? - This will be handled by roentgen, except SquirrelMail, which requires a basic Apache setup.
## Fortran compilers and things like that? (Also needs compat libs.) - Isn't this what pumpkin is for?
# Switch einstein.
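The Postfix TLS item above boils down to a few main.cf settings; a minimal sketch of what needs to be in place (the certificate paths here are placeholders, not the real ones, and nothing takes effect until the certs exist and <code>postfix reload</code> is run):
<pre>
# /etc/postfix/main.cf excerpt (Postfix 2.3 as shipped with RHEL 5)
# Cert/key paths are placeholders -- point them at the real certs.
smtpd_tls_security_level = may
smtpd_tls_cert_file = /etc/pki/tls/certs/einstein.pem
smtpd_tls_key_file  = /etc/pki/tls/private/einstein.key
</pre>
This matches the note above that the settings were "set in main.cf, but not applied": they do nothing until the certs are in place and Postfix is reloaded.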

Sudo was hanging on einstein2; we really needed to work that out before even considering the switch. Update: it doesn't hang indefinitely. After a long time it fails with '''"sudo: uid 4235 does not exist in the passwd file!"''', which <code>getent passwd 4235</code> shows to be untrue. The cause: /etc/ldap.conf should have said "ssl no" instead of "ssl start_tls", to match old einstein's setup. It now works.
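For the record, the relevant piece of the LDAP client config; only the ssl line needed to change, and the host/base values shown here are placeholders, not our real ones:
<pre>
# /etc/ldap.conf (nss_ldap/pam_ldap client config) on einstein2
# host/base are placeholders; only the ssl line changed.
host ldap.example.edu
base dc=example,dc=edu
# was: ssl start_tls -- made sudo's uid lookups hang and then fail
ssl no
</pre>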

==Current setup on einstein2==
/ is /dev/md0, a three-way RAID 1 mirror of sda1, sdb1, and sdc1. /var/spool/imap (the mount point can be changed to match our Dovecot configuration) is /dev/md1, a three-way mirror of sda2, sdb2, and sdc2. /home is a two-way mirror of sdb3 and sdc3, and sda3 is the swap partition.

It is set up this way because the system came installed on a 250 GB drive, and Matt wanted redundancy and space for mail and /home, since they're (some of) the most important data. Two 750 GB drives were added, and RHEL 5 was reinstalled without a hitch. GRUB should currently be installed on all three drives, so that if any one (or even two!) drives fail, the system can still boot and run. The arrays are standard software RAID 1, with three members for root/mail and two for /home, which lets us put the drives in any other system if need be.
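The arrays above can be pinned down in mdadm's config so they always assemble the same way at boot. A sketch follows; note that the /dev/md2 name for /home is an assumption (the page doesn't say which md device it is), and the real ARRAY lines, with UUIDs, come from <code>mdadm --detail --scan</code>:
<pre>
# /etc/mdadm.conf sketch for the layout above; generate the real
# ARRAY lines (with UUIDs) via: mdadm --detail --scan
DEVICE /dev/sda* /dev/sdb* /dev/sdc*
ARRAY /dev/md0 level=raid1 num-devices=3    # /               (sda1, sdb1, sdc1)
ARRAY /dev/md1 level=raid1 num-devices=3    # /var/spool/imap (sda2, sdb2, sdc2)
ARRAY /dev/md2 level=raid1 num-devices=2    # /home           (sdb3, sdc3)
</pre>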

This setup apparently confused Maurik, who wondered why the original 250 GB drive is still involved. The answer, in case it ever needs to be known in the future: Matt didn't want to waste a perfectly good drive that had no other purpose. It's perfectly reasonable to swap that drive for a 750 and complete the three-way mirroring fun, if desired.

==Mystery Behavior==
einstein2 isn't connected to the UNH network, yet it can somehow reach UNH and everything beyond it:
<pre>
traceroute to google.com (64.233.167.99), 30 hops max, 40 byte packets
1 gourd.farm.physics.unh.edu (10.0.0.252) 0.149 ms 0.137 ms 0.129 ms
2 faculty1-gw.unh.edu (132.177.88.1) 0.546 ms 0.640 ms 0.704 ms
3 tccat1-sup.unh.edu (132.177.84.134) 1.369 ms 1.456 ms 1.490 ms
4 catwan.unh.edu (132.177.100.1) 4.955 ms 4.971 ms 5.001 ms
5 64.215.24.177 (64.215.24.177) 8.399 ms 8.234 ms 8.248 ms
6 te7-1-10G.ar2.DCA3.gblx.net (67.17.109.34) 14.197 ms 28.104 ms 14.382 ms
7 google-2.ar2.DCA3.gblx.net (64.215.195.182) 16.258 ms 16.267 ms 16.283 ms
8 209.85.130.12 (209.85.130.12) 16.852 ms 29.388 ms 209.85.130.18 (209.85.130.18) 17.033 ms
9 209.85.248.221 (209.85.248.221) 34.669 ms 209.85.252.165 (209.85.252.165) 35.986 ms 216.239.46.224 (216.239.46.224) 41.596 ms
10 72.14.238.89 (72.14.238.89) 42.588 ms 42.414 ms 72.14.238.90 (72.14.238.90) 36.595 ms
11 72.14.232.70 (72.14.232.70) 42.013 ms 64.233.175.42 (64.233.175.42) 36.180 ms 72.14.232.70 (72.14.232.70) 36.979 ms
12 64.233.175.42 (64.233.175.42) 51.523 ms py-in-f99.google.com (64.233.167.99) 42.752 ms 50.091 ms
</pre>
Gourd, huh? What's this about, Aaron?<br />
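A guess at the answer: hop 1 is gourd's private farm address, so gourd is presumably NATing the farm subnet out to UNH. A rule along these lines on gourd would produce exactly this route (a sketch only; the /24 mask and the eth1 interface name are assumptions to check against gourd's real rules):
<pre>
# On gourd: masquerade farm traffic out the UNH-facing interface.
# 10.0.0.0/24 and eth1 are assumptions, not confirmed.
iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o eth1 -j MASQUERADE
</pre>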

==Mail server==
[http://wiki.dovecot.org/Migration/Cyrus This should come in handy for moving mail to the new einstein.] (cyrus2courier seemed like the best option and no harder to use than the other two. It isn't: it doesn't work with Cyrus 2.2+.)

Dovecot, Postfix, SquirrelMail, and Mailman have all been installed. The logical plan of attack is to get Postfix fully working, then Dovecot, then SquirrelMail, and finally Mailman, which satisfies both the dependencies and the order of importance.

Because single large files scare me, I'm initially setting up Postfix with the Maildir format. The default is mbox, but since mbox stores each user's mail as one big file, it looks far too easy to lose a lot of data to random disk errors. Mbox worked, but once I copied over einstein's configs, things stopped working due to hostname-resolution errors, as can be seen in <code>/var/log/maillog</code>. '''Maildir is far more robust anyhow; the Dovecot documentation even recommends it, so I don't know why mbox is the default.'''
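The Maildir switch described above is essentially a one-line change on each side; a sketch for the versions shipped with RHEL 5, assuming mail lives under each user's home directory:
<pre>
# /etc/dovecot.conf (Dovecot 1.x): read mail from each user's Maildir
mail_location = maildir:~/Maildir

# /etc/postfix/main.cf: the trailing slash makes local delivery use Maildir format
home_mailbox = Maildir/
</pre>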
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces - UNH: ???? Farm: Done!<br />
## [[Iptables]] - Not working. Possibly because of Gourd's weirdness.<br />
## [[DNS]] - Done!<br />
## [[LDAP]] - Done!<br />
## [[Postfix]] - Need to get TLS working. -It's been set in main.cf, but hasn't been applied. We need to make sure the certs are set up right.<br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]] - Functioning, have to migrate cyrus e-mails from old Einstein.<br />
## [[automount|/home]] - Done!<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]? - This will be handled by roentgen, except squirrelmail, which requires a basic apache setup.<br />
## Fortran compilers and things like that? (Also needs compat libs) - Isn't this what pumpkin is for?<br />
# Switch einstein.<br />
<br />
Sudo is currently hanging on einstein2. We really should work this out before even considering the switch. Update: It's not indefinite, though. After a long time, I got this message '''"sudo: uid 4235 does not exist in the passwd file!"''' Which is shown to be untrue when I do a <code>getent passwd 4235</code>.<br />
<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, which is a 3-way mirror comprised of sda1, sdb1, and sdc1. /var/spool/imap (can be changed to match our dovecot configuration) is /dev/md1, which is a 3-way mirror comprised of sda2, sdb2, and sdc2. /home is a 2-way mirror of sdb3 and sdc3. sda3 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a 250gb, and Matt wanted redundancy and space for mail and /home, since they're (some of) the most important things. Two 750gb's were added, and RHEL5 was reinstalled without a hitch. Grub should currently be installed on all three drives, so that if any one (or two!) drives fails, the system can still boot and run. The RAID setup is standard software raid1 using 3 elements for root/mail and 2 elements for home. This will allow us to put the drives in any other system if need be.<br />
<br />
This setup apparently confused Maurik, who wondered why we need to use the original 250gb drive involved. The answer, in case it ever needs to be known in the future, is that Matt didn't want to waste a perfectly good drive with no other purpose. It's perfectly reasonable to change the drive to 750 and complete the 3-way raiding fun, if desired.<br />
<br />
==Mystery Behavior==<br />
einstein2 isn't connected to the UNH network, yet somehow can access it and anybody else:<br />
<pre><br />
traceroute to google.com (64.233.167.99), 30 hops max, 40 byte packets<br />
1 gourd.farm.physics.unh.edu (10.0.0.252) 0.149 ms 0.137 ms 0.129 ms<br />
2 faculty1-gw.unh.edu (132.177.88.1) 0.546 ms 0.640 ms 0.704 ms<br />
3 tccat1-sup.unh.edu (132.177.84.134) 1.369 ms 1.456 ms 1.490 ms<br />
4 catwan.unh.edu (132.177.100.1) 4.955 ms 4.971 ms 5.001 ms<br />
5 64.215.24.177 (64.215.24.177) 8.399 ms 8.234 ms 8.248 ms<br />
6 te7-1-10G.ar2.DCA3.gblx.net (67.17.109.34) 14.197 ms 28.104 ms 14.382 ms<br />
7 google-2.ar2.DCA3.gblx.net (64.215.195.182) 16.258 ms 16.267 ms 16.283 ms<br />
8 209.85.130.12 (209.85.130.12) 16.852 ms 29.388 ms 209.85.130.18 (209.85.130.18) 17.033 ms<br />
9 209.85.248.221 (209.85.248.221) 34.669 ms 209.85.252.165 (209.85.252.165) 35.986 ms 216.239.46.224 (216.239.46.224) 41.596 ms<br />
10 72.14.238.89 (72.14.238.89) 42.588 ms 42.414 ms 72.14.238.90 (72.14.238.90) 36.595 ms<br />
11 72.14.232.70 (72.14.232.70) 42.013 ms 64.233.175.42 (64.233.175.42) 36.180 ms 72.14.232.70 (72.14.232.70) 36.979 ms<br />
12 64.233.175.42 (64.233.175.42) 51.523 ms py-in-f99.google.com (64.233.167.99) 42.752 ms 50.091 ms<br />
</pre><br />
Gourd, huh? What's this about, Aaron?<br />
<br />
==Mail server==<br />
[http://wiki.dovecot.org/Migration/Cyrus This should come in handy moving mail to the new einstein] ("cyrus2courier" seems like the best and not any harder to use than the other two options)<br />
<br />
Dovecot, postfix, squirrelmail, and mailman have all been installed. The logical plan of attack seems to be to get postfix working fully, then dovecot, then squirrel, and finally mailman, in order to satisfy dependencies as well as order of importance.<br />
Because single large files scare me, I'm initially trying to set up postfix with maildir format. Default is mbox, but since mbox stores each user's mail as a single big file, it just looks like it's too easy to lose lots of data from random disk errors. Mbox worked, but once I copied over einstein's configs, things stopped working due to hostname/resolution errors, as can be seen in <code>/var/log/maillog</code>. '''maildir is far more robust anyhow; the Dovecot documentation even recommends it, so I don't know why mbox is the default.'''</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3769Einstein Status2008-06-17T16:51:53Z<p>Steve: </p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces - UNH: ???? Farm: Done!<br />
## [[Iptables]] - Not working. Possibly because of Gourd's weirdness.<br />
## [[DNS]] - Done!<br />
## [[LDAP]] - Done!<br />
## [[Postfix]] - Need to get TLS working.<br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]] - Functioning, have to migrate cyrus e-mails from old Einstein.<br />
## [[automount|/home]] - Done!<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]? - This will be handled by roentgen.<br />
## Fortran compilers and things like that? (Also needs compat libs) - Isn't this what pumpkin is for?<br />
# Switch einstein.<br />
<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, which is a 3-way mirror comprised of sda1, sdb1, and sdc1. /var/spool/imap (can be changed to match our dovecot configuration) is /dev/md1, which is a 3-way mirror comprised of sda2, sdb2, and sdc2. /home is a 2-way mirror of sdb3 and sdc3. sda3 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a 250gb, and Matt wanted redundancy and space for mail and /home, since they're (some of) the most important things. Two 750gb's were added, and RHEL5 was reinstalled without a hitch. Grub should currently be installed on all three drives, so that if any one (or two!) drives fails, the system can still boot and run. The RAID setup is standard software raid1 using 3 elements for root/mail and 2 elements for home. This will allow us to put the drives in any other system if need be.<br />
<br />
This setup apparently confused Maurik, who wondered why we need to use the original 250gb drive involved. The answer, in case it ever needs to be known in the future, is that Matt didn't want to waste a perfectly good drive with no other purpose. It's perfectly reasonable to change the drive to 750 and complete the 3-way raiding fun, if desired.<br />
<br />
==Mystery Behavior==<br />
einstein2 isn't connected to the UNH network, yet somehow can access it and anybody else:<br />
<pre><br />
traceroute to google.com (64.233.167.99), 30 hops max, 40 byte packets<br />
1 gourd.farm.physics.unh.edu (10.0.0.252) 0.149 ms 0.137 ms 0.129 ms<br />
2 faculty1-gw.unh.edu (132.177.88.1) 0.546 ms 0.640 ms 0.704 ms<br />
3 tccat1-sup.unh.edu (132.177.84.134) 1.369 ms 1.456 ms 1.490 ms<br />
4 catwan.unh.edu (132.177.100.1) 4.955 ms 4.971 ms 5.001 ms<br />
5 64.215.24.177 (64.215.24.177) 8.399 ms 8.234 ms 8.248 ms<br />
6 te7-1-10G.ar2.DCA3.gblx.net (67.17.109.34) 14.197 ms 28.104 ms 14.382 ms<br />
7 google-2.ar2.DCA3.gblx.net (64.215.195.182) 16.258 ms 16.267 ms 16.283 ms<br />
8 209.85.130.12 (209.85.130.12) 16.852 ms 29.388 ms 209.85.130.18 (209.85.130.18) 17.033 ms<br />
9 209.85.248.221 (209.85.248.221) 34.669 ms 209.85.252.165 (209.85.252.165) 35.986 ms 216.239.46.224 (216.239.46.224) 41.596 ms<br />
10 72.14.238.89 (72.14.238.89) 42.588 ms 42.414 ms 72.14.238.90 (72.14.238.90) 36.595 ms<br />
11 72.14.232.70 (72.14.232.70) 42.013 ms 64.233.175.42 (64.233.175.42) 36.180 ms 72.14.232.70 (72.14.232.70) 36.979 ms<br />
12 64.233.175.42 (64.233.175.42) 51.523 ms py-in-f99.google.com (64.233.167.99) 42.752 ms 50.091 ms<br />
</pre><br />
Gourd, huh? What's this about, Aaron?<br />
<br />
==Mail server==<br />
[http://wiki.dovecot.org/Migration/Cyrus This should come in handy moving mail to the new einstein] ("cyrus2courier" seems like the best and not any harder to use than the other two options)<br />
<br />
Dovecot, postfix, squirrelmail, and mailman have all been installed. The logical plan of attack seems to be to get postfix working fully, then dovecot, then squirrel, and finally mailman, in order to satisfy dependencies as well as order of importance.<br />
Because single large files scare me, I'm initially trying to set up postfix with maildir format. Default is mbox, but since mbox stores each user's mail as a single big file, it just looks like it's too easy to lose lots of data from random disk errors. Mbox worked, but once I copied over einstein's configs, things stopped working due to hostname/resolution errors, as can be seen in <code>/var/log/maillog</code>. '''maildir is far more robust anyhow; the Dovecot documentation even recommends it, so I don't know why mbox is the default.'''</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3768Einstein Status2008-06-17T16:50:21Z<p>Steve: </p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces - UNH: ???? Farm: Done!<br />
## [[Iptables]] - Not working. Possibly because of Gourd's weirdness.<br />
## [[DNS]] - Done!<br />
## [[LDAP]] - Done!<br />
## [[Postfix]] - Done!<br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]] - Functioning, have to migrate cyrus e-mails from old Einstein.<br />
## [[automount|/home]] - Done!<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]? - This will be handled by roentgen.<br />
## Fortran compilers and things like that? (Also needs compat libs) - Isn't this what pumpkin is for?<br />
# Switch einstein.<br />
<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, which is a 3-way mirror comprised of sda1, sdb1, and sdc1. /var/spool/imap (can be changed to match our dovecot configuration) is /dev/md1, which is a 3-way mirror comprised of sda2, sdb2, and sdc2. /home is a 2-way mirror of sdb3 and sdc3. sda3 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a 250gb, and Matt wanted redundancy and space for mail and /home, since they're (some of) the most important things. Two 750gb's were added, and RHEL5 was reinstalled without a hitch. Grub should currently be installed on all three drives, so that if any one (or two!) drives fails, the system can still boot and run. The RAID setup is standard software raid1 using 3 elements for root/mail and 2 elements for home. This will allow us to put the drives in any other system if need be.<br />
<br />
This setup apparently confused Maurik, who wondered why we need to use the original 250gb drive involved. The answer, in case it ever needs to be known in the future, is that Matt didn't want to waste a perfectly good drive with no other purpose. It's perfectly reasonable to change the drive to 750 and complete the 3-way raiding fun, if desired.<br />
<br />
==Mystery Behavior==<br />
einstein2 isn't connected to the UNH network, yet somehow can access it and anybody else:<br />
<pre><br />
traceroute to google.com (64.233.167.99), 30 hops max, 40 byte packets<br />
1 gourd.farm.physics.unh.edu (10.0.0.252) 0.149 ms 0.137 ms 0.129 ms<br />
2 faculty1-gw.unh.edu (132.177.88.1) 0.546 ms 0.640 ms 0.704 ms<br />
3 tccat1-sup.unh.edu (132.177.84.134) 1.369 ms 1.456 ms 1.490 ms<br />
4 catwan.unh.edu (132.177.100.1) 4.955 ms 4.971 ms 5.001 ms<br />
5 64.215.24.177 (64.215.24.177) 8.399 ms 8.234 ms 8.248 ms<br />
6 te7-1-10G.ar2.DCA3.gblx.net (67.17.109.34) 14.197 ms 28.104 ms 14.382 ms<br />
7 google-2.ar2.DCA3.gblx.net (64.215.195.182) 16.258 ms 16.267 ms 16.283 ms<br />
8 209.85.130.12 (209.85.130.12) 16.852 ms 29.388 ms 209.85.130.18 (209.85.130.18) 17.033 ms<br />
9 209.85.248.221 (209.85.248.221) 34.669 ms 209.85.252.165 (209.85.252.165) 35.986 ms 216.239.46.224 (216.239.46.224) 41.596 ms<br />
10 72.14.238.89 (72.14.238.89) 42.588 ms 42.414 ms 72.14.238.90 (72.14.238.90) 36.595 ms<br />
11 72.14.232.70 (72.14.232.70) 42.013 ms 64.233.175.42 (64.233.175.42) 36.180 ms 72.14.232.70 (72.14.232.70) 36.979 ms<br />
12 64.233.175.42 (64.233.175.42) 51.523 ms py-in-f99.google.com (64.233.167.99) 42.752 ms 50.091 ms<br />
</pre><br />
Gourd, huh? What's this about, Aaron?<br />
<br />
==Mail server==<br />
[http://wiki.dovecot.org/Migration/Cyrus This should come in handy moving mail to the new einstein] ("cyrus2courier" seems like the best and not any harder to use than the other two options)<br />
<br />
Dovecot, postfix, squirrelmail, and mailman have all been installed. The logical plan of attack seems to be to get postfix working fully, then dovecot, then squirrel, and finally mailman, in order to satisfy dependencies as well as order of importance.<br />
Because single large files scare me, I'm initially trying to set up postfix with maildir format. Default is mbox, but since mbox stores each user's mail as a single big file, it just looks like it's too easy to lose lots of data from random disk errors. Mbox worked, but once I copied over einstein's configs, things stopped working due to hostname/resolution errors, as can be seen in <code>/var/log/maillog</code>. '''maildir is far more robust anyhow; the Dovecot documentation even recommends it, so I don't know why mbox is the default.'''</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3767Einstein Status2008-06-17T15:58:20Z<p>Steve: /* Mail server */</p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces - UNH: ???? Farm: Done!<br />
## [[Iptables]] - Not working. Possibly because of Gourd's weirdness.<br />
## [[DNS]] - Done!<br />
## [[LDAP]] - Done!<br />
## [[Postfix]] - Can send and receive locally, but remote e-mails are send-only.<br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]] - Functioning, have to migrate cyrus e-mails from old Einstein. '''rsyncing them to einstein2 now, it'll probably take a while'''<br />
## [[automount|/home]] - Done!<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]? - This will be handled by roentgen.<br />
## Fortran compilers and things like that? (Also needs compat libs) - Isn't this what pumpkin is for?<br />
# Switch einstein.<br />
<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, which is a 3-way mirror comprised of sda1, sdb1, and sdc1. /var/spool/imap (can be changed to match our dovecot configuration) is /dev/md1, which is a 3-way mirror comprised of sda2, sdb2, and sdc2. /home is a 2-way mirror of sdb3 and sdc3. sda3 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a 250gb, and Matt wanted redundancy and space for mail and /home, since they're (some of) the most important things. Two 750gb's were added, and RHEL5 was reinstalled without a hitch. Grub should currently be installed on all three drives, so that if any one (or two!) drives fails, the system can still boot and run. The RAID setup is standard software raid1 using 3 elements for root/mail and 2 elements for home. This will allow us to put the drives in any other system if need be.<br />
<br />
This setup apparently confused Maurik, who wondered why we need to use the original 250gb drive involved. The answer, in case it ever needs to be known in the future, is that Matt didn't want to waste a perfectly good drive with no other purpose. It's perfectly reasonable to change the drive to 750 and complete the 3-way raiding fun, if desired.<br />
<br />
==Mystery Behavior==<br />
einstein2 isn't connected to the UNH network, yet somehow can access it and anybody else:<br />
<pre><br />
traceroute to google.com (64.233.167.99), 30 hops max, 40 byte packets<br />
1 gourd.farm.physics.unh.edu (10.0.0.252) 0.149 ms 0.137 ms 0.129 ms<br />
2 faculty1-gw.unh.edu (132.177.88.1) 0.546 ms 0.640 ms 0.704 ms<br />
3 tccat1-sup.unh.edu (132.177.84.134) 1.369 ms 1.456 ms 1.490 ms<br />
4 catwan.unh.edu (132.177.100.1) 4.955 ms 4.971 ms 5.001 ms<br />
5 64.215.24.177 (64.215.24.177) 8.399 ms 8.234 ms 8.248 ms<br />
6 te7-1-10G.ar2.DCA3.gblx.net (67.17.109.34) 14.197 ms 28.104 ms 14.382 ms<br />
7 google-2.ar2.DCA3.gblx.net (64.215.195.182) 16.258 ms 16.267 ms 16.283 ms<br />
8 209.85.130.12 (209.85.130.12) 16.852 ms 29.388 ms 209.85.130.18 (209.85.130.18) 17.033 ms<br />
9 209.85.248.221 (209.85.248.221) 34.669 ms 209.85.252.165 (209.85.252.165) 35.986 ms 216.239.46.224 (216.239.46.224) 41.596 ms<br />
10 72.14.238.89 (72.14.238.89) 42.588 ms 42.414 ms 72.14.238.90 (72.14.238.90) 36.595 ms<br />
11 72.14.232.70 (72.14.232.70) 42.013 ms 64.233.175.42 (64.233.175.42) 36.180 ms 72.14.232.70 (72.14.232.70) 36.979 ms<br />
12 64.233.175.42 (64.233.175.42) 51.523 ms py-in-f99.google.com (64.233.167.99) 42.752 ms 50.091 ms<br />
</pre><br />
Gourd, huh? What's this about, Aaron?<br />
<br />
==Mail server==<br />
[http://wiki.dovecot.org/Migration/Cyrus This should come in handy moving mail to the new einstein] ("cyrus2courier" seems like the best and not any harder to use than the other two options)<br />
<br />
Dovecot, postfix, squirrelmail, and mailman have all been installed. The logical plan of attack seems to be to get postfix working fully, then dovecot, then squirrel, and finally mailman, in order to satisfy dependencies as well as order of importance.<br />
Because single large files scare me, I'm initially trying to set up postfix with maildir format. Default is mbox, but since mbox stores each user's mail as a single big file, it just looks like it's too easy to lose lots of data from random disk errors. Mbox worked, but once I copied over einstein's configs, things stopped working due to hostname/resolution errors, as can be seen in <code>/var/log/maillog</code>. '''maildir is far more robust anyhow; the Dovecot documentation even recommends it, so I don't know why mbox is the default.'''</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3766Einstein Status2008-06-17T15:38:23Z<p>Steve: </p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces - UNH: ???? Farm: Done!<br />
## [[Iptables]] - Not working. Possibly because of Gourd's weirdness.<br />
## [[DNS]] - Done!<br />
## [[LDAP]] - Done!<br />
## [[Postfix]] - Can send and receive locally, but remote e-mails are send-only.<br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]] - Functioning, have to migrate cyrus e-mails from old Einstein. '''rsyncing them to einstein2 now, it'll probably take a while'''<br />
## [[automount|/home]] - Done!<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]? - This will be handled by roentgen.<br />
## Fortran compilers and things like that? (Also needs compat libs) - Isn't this what pumpkin is for?<br />
# Switch einstein.<br />
<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, which is a 3-way mirror comprised of sda1, sdb1, and sdc1. /var/spool/imap (can be changed to match our dovecot configuration) is /dev/md1, which is a 3-way mirror comprised of sda2, sdb2, and sdc2. /home is a 2-way mirror of sdb3 and sdc3. sda3 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a 250gb, and Matt wanted redundancy and space for mail and /home, since they're (some of) the most important things. Two 750gb's were added, and RHEL5 was reinstalled without a hitch. Grub should currently be installed on all three drives, so that if any one (or two!) drives fails, the system can still boot and run. The RAID setup is standard software raid1 using 3 elements for root/mail and 2 elements for home. This will allow us to put the drives in any other system if need be.<br />
<br />
This setup apparently confused Maurik, who wondered why we need to use the original 250gb drive involved. The answer, in case it ever needs to be known in the future, is that Matt didn't want to waste a perfectly good drive with no other purpose. It's perfectly reasonable to change the drive to 750 and complete the 3-way raiding fun, if desired.<br />
<br />
==Mystery Behavior==<br />
einstein2 isn't connected to the UNH network, yet somehow can access it and anybody else:<br />
<pre><br />
traceroute to google.com (64.233.167.99), 30 hops max, 40 byte packets<br />
1 gourd.farm.physics.unh.edu (10.0.0.252) 0.149 ms 0.137 ms 0.129 ms<br />
2 faculty1-gw.unh.edu (132.177.88.1) 0.546 ms 0.640 ms 0.704 ms<br />
3 tccat1-sup.unh.edu (132.177.84.134) 1.369 ms 1.456 ms 1.490 ms<br />
4 catwan.unh.edu (132.177.100.1) 4.955 ms 4.971 ms 5.001 ms<br />
5 64.215.24.177 (64.215.24.177) 8.399 ms 8.234 ms 8.248 ms<br />
6 te7-1-10G.ar2.DCA3.gblx.net (67.17.109.34) 14.197 ms 28.104 ms 14.382 ms<br />
7 google-2.ar2.DCA3.gblx.net (64.215.195.182) 16.258 ms 16.267 ms 16.283 ms<br />
8 209.85.130.12 (209.85.130.12) 16.852 ms 29.388 ms 209.85.130.18 (209.85.130.18) 17.033 ms<br />
9 209.85.248.221 (209.85.248.221) 34.669 ms 209.85.252.165 (209.85.252.165) 35.986 ms 216.239.46.224 (216.239.46.224) 41.596 ms<br />
10 72.14.238.89 (72.14.238.89) 42.588 ms 42.414 ms 72.14.238.90 (72.14.238.90) 36.595 ms<br />
11 72.14.232.70 (72.14.232.70) 42.013 ms 64.233.175.42 (64.233.175.42) 36.180 ms 72.14.232.70 (72.14.232.70) 36.979 ms<br />
12 64.233.175.42 (64.233.175.42) 51.523 ms py-in-f99.google.com (64.233.167.99) 42.752 ms 50.091 ms<br />
</pre><br />
Gourd, huh? What's this about, Aaron?<br />
<br />
==Mail server==<br />
[http://wiki.dovecot.org/Migration/Cyrus This should come in handy moving mail to the new einstein]<br />
Dovecot, postfix, squirrelmail, and mailman have all been installed. The logical plan of attack seems to be to get postfix working fully, then dovecot, then squirrel, and finally mailman, in order to satisfy dependencies as well as order of importance.<br />
Because single large files scare me, I'm initially trying to set up postfix with maildir format. Default is mbox, but since mbox stores each user's mail as a single big file, it just looks like it's too easy to lose lots of data from random disk errors. Mbox worked, but once I copied over einstein's configs, things stopped working due to hostname/resolution errors, as can be seen in <code>/var/log/maillog</code>. '''maildir is far more robust anyhow; the Dovecot documentation even recommends it, so I don't know why mbox is the default.'''</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3765Einstein Status2008-06-17T14:13:44Z<p>Steve: </p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces - UNH: ???? Farm: Done!<br />
## [[Iptables]] - Not working. Possibly because of Gourd's weirdness.<br />
## [[DNS]] - Done!<br />
## [[LDAP]] - Done!<br />
## [[Postfix]] - Can send and receive locally, but remote e-mails are send-only.<br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]] - Functioning, have to migrate cyrus e-mails from old Einstein.<br />
## [[automount|/home]] - Done!<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]? - This will be handled by roentgen.<br />
## Fortran compilers and things like that? (Also needs compat libs) - Isn't this what pumpkin is for?<br />
# Switch einstein.<br />
<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, which is a 3-way mirror comprised of sda1, sdb1, and sdc1. /var/spool/imap (can be changed to match our dovecot configuration) is /dev/md1, which is a 3-way mirror comprised of sda2, sdb2, and sdc2. /home is a 2-way mirror of sdb3 and sdc3. sda3 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a 250gb, and Matt wanted redundancy and space for mail and /home, since they're (some of) the most important things. Two 750gb's were added, and RHEL5 was reinstalled without a hitch. Grub should currently be installed on all three drives, so that if any one (or two!) drives fails, the system can still boot and run. The RAID setup is standard software raid1 using 3 elements for root/mail and 2 elements for home. This will allow us to put the drives in any other system if need be.<br />
<br />
This setup apparently confused Maurik, who wondered why we need to use the original 250gb drive involved. The answer, in case it ever needs to be known in the future, is that Matt didn't want to waste a perfectly good drive with no other purpose. It's perfectly reasonable to change the drive to 750 and complete the 3-way raiding fun, if desired.<br />
<br />
==Mystery Behavior==<br />
einstein2 isn't connected to the UNH network, yet it can somehow reach both UNH and the outside world:<br />
<pre><br />
traceroute to google.com (64.233.167.99), 30 hops max, 40 byte packets<br />
1 gourd.farm.physics.unh.edu (10.0.0.252) 0.149 ms 0.137 ms 0.129 ms<br />
2 faculty1-gw.unh.edu (132.177.88.1) 0.546 ms 0.640 ms 0.704 ms<br />
3 tccat1-sup.unh.edu (132.177.84.134) 1.369 ms 1.456 ms 1.490 ms<br />
4 catwan.unh.edu (132.177.100.1) 4.955 ms 4.971 ms 5.001 ms<br />
5 64.215.24.177 (64.215.24.177) 8.399 ms 8.234 ms 8.248 ms<br />
6 te7-1-10G.ar2.DCA3.gblx.net (67.17.109.34) 14.197 ms 28.104 ms 14.382 ms<br />
7 google-2.ar2.DCA3.gblx.net (64.215.195.182) 16.258 ms 16.267 ms 16.283 ms<br />
8 209.85.130.12 (209.85.130.12) 16.852 ms 29.388 ms 209.85.130.18 (209.85.130.18) 17.033 ms<br />
9 209.85.248.221 (209.85.248.221) 34.669 ms 209.85.252.165 (209.85.252.165) 35.986 ms 216.239.46.224 (216.239.46.224) 41.596 ms<br />
10 72.14.238.89 (72.14.238.89) 42.588 ms 42.414 ms 72.14.238.90 (72.14.238.90) 36.595 ms<br />
11 72.14.232.70 (72.14.232.70) 42.013 ms 64.233.175.42 (64.233.175.42) 36.180 ms 72.14.232.70 (72.14.232.70) 36.979 ms<br />
12 64.233.175.42 (64.233.175.42) 51.523 ms py-in-f99.google.com (64.233.167.99) 42.752 ms 50.091 ms<br />
</pre><br />
Gourd, huh? What's this about, Aaron?<br />
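The usual explanation for a first hop like the one above is that gourd is NATing the farm's private 10.0.0.0 addresses out its UNH-facing interface. A quick way to check on gourd (this is a guess at the setup, not something confirmed):<br />
<pre><br />
# look for a MASQUERADE/SNAT rule covering the farm subnet<br />
iptables -t nat -L POSTROUTING -n -v<br />
# confirm the kernel is forwarding packets (1 = yes)<br />
cat /proc/sys/net/ipv4/ip_forward<br />
</pre><br />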
<br />
==Mail server==<br />
[http://wiki.dovecot.org/Migration/Cyrus This should come in handy moving mail to the new einstein]<br />
Dovecot, postfix, squirrelmail, and mailman have all been installed. The logical plan of attack seems to be to get postfix working fully, then dovecot, then squirrel, and finally mailman, in order to satisfy dependencies as well as order of importance.<br />
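Switching postfix from its default mbox delivery to maildir is, at minimum, a one-line change, with dovecot then pointed at the same location. A sketch — the file locations and directives here are what the stock RHEL5 packages use, not copied from our configs:<br />
<pre><br />
# /etc/postfix/main.cf -- the trailing slash selects maildir format<br />
home_mailbox = Maildir/<br />
<br />
# /etc/dovecot.conf -- read the same per-user maildir<br />
mail_location = maildir:~/Maildir<br />
</pre><br />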
Because single large files scare me, I'm initially setting up postfix with the maildir format. The default is mbox, but since mbox stores each user's mail as one big file, a single random disk error can take a lot of mail with it. Mbox worked, but once I copied over einstein's configs things stopped working, with hostname/resolution errors visible in <code>/var/log/maillog</code>. '''maildir is far more robust anyhow; the Dovecot documentation even recommends it, so I don't know why mbox is the default.'''</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3762Einstein Status2008-06-16T13:53:09Z<p>Steve: /* Mail server */</p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces - UNH: ???? Farm: Done!<br />
## [[Iptables]] - Not working.<br />
## [[DNS]] - Done!<br />
## [[LDAP]] - Done!<br />
## [[Postfix]] <br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]]<br />
## [[automount|/home]] - Done!<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]? - This will be handled by roentgen.<br />
## Fortran compilers and things like that? (Also needs compat libs) - Isn't this what pumpkin is for?<br />
# Switch einstein.<br />
<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, which is a 3-way mirror comprised of sda1, sdb1, and sdc1. /var/spool/imap (can be changed to match our dovecot configuration) is /dev/md1, which is a 3-way mirror comprised of sda2, sdb2, and sdc2. /home is a 2-way mirror of sdb3 and sdc3. sda3 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a 250gb, and Matt wanted redundancy and space for mail and /home, since they're (some of) the most important things. Two 750gb's were added, and RHEL5 was reinstalled without a hitch. Grub should currently be installed on all three drives, so that if any one (or two!) drives fails, the system can still boot and run. The RAID setup is standard software raid1 using 3 elements for root/mail and 2 elements for home. This will allow us to put the drives in any other system if need be.<br />
<br />
This setup apparently confused Maurik, who wondered why we need to use the original 250gb drive involved. The answer, in case it ever needs to be known in the future, is that Matt didn't want to waste a perfectly good drive with no other purpose. It's perfectly reasonable to change the drive to 750 and complete the 3-way raiding fun, if desired.<br />
<br />
==Mystery Behavior==<br />
einstein2 isn't connected to the UNH network, yet it can somehow reach both UNH and the outside world:<br />
<pre><br />
traceroute to google.com (64.233.167.99), 30 hops max, 40 byte packets<br />
1 gourd.farm.physics.unh.edu (10.0.0.252) 0.149 ms 0.137 ms 0.129 ms<br />
2 faculty1-gw.unh.edu (132.177.88.1) 0.546 ms 0.640 ms 0.704 ms<br />
3 tccat1-sup.unh.edu (132.177.84.134) 1.369 ms 1.456 ms 1.490 ms<br />
4 catwan.unh.edu (132.177.100.1) 4.955 ms 4.971 ms 5.001 ms<br />
5 64.215.24.177 (64.215.24.177) 8.399 ms 8.234 ms 8.248 ms<br />
6 te7-1-10G.ar2.DCA3.gblx.net (67.17.109.34) 14.197 ms 28.104 ms 14.382 ms<br />
7 google-2.ar2.DCA3.gblx.net (64.215.195.182) 16.258 ms 16.267 ms 16.283 ms<br />
8 209.85.130.12 (209.85.130.12) 16.852 ms 29.388 ms 209.85.130.18 (209.85.130.18) 17.033 ms<br />
9 209.85.248.221 (209.85.248.221) 34.669 ms 209.85.252.165 (209.85.252.165) 35.986 ms 216.239.46.224 (216.239.46.224) 41.596 ms<br />
10 72.14.238.89 (72.14.238.89) 42.588 ms 42.414 ms 72.14.238.90 (72.14.238.90) 36.595 ms<br />
11 72.14.232.70 (72.14.232.70) 42.013 ms 64.233.175.42 (64.233.175.42) 36.180 ms 72.14.232.70 (72.14.232.70) 36.979 ms<br />
12 64.233.175.42 (64.233.175.42) 51.523 ms py-in-f99.google.com (64.233.167.99) 42.752 ms 50.091 ms<br />
</pre><br />
Gourd, huh? What's this about, Aaron?<br />
<br />
==Mail server==<br />
[http://wiki.dovecot.org/Migration/Cyrus This should come in handy moving mail to the new einstein]<br />
Dovecot, postfix, squirrelmail, and mailman have all been installed. The logical plan of attack seems to be to get postfix working fully, then dovecot, then squirrel, and finally mailman, in order to satisfy dependencies as well as order of importance.<br />
Because single large files scare me, I'm initially setting up postfix with the maildir format. The default is mbox, but since mbox stores each user's mail as one big file, a single random disk error can take a lot of mail with it. Mbox worked, but once I copied over einstein's configs things stopped working, with hostname/resolution errors visible in <code>/var/log/maillog</code>. '''maildir is far more robust anyhow; the Dovecot documentation even recommends it, so I don't know why mbox is the default.'''</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3759Einstein Status2008-06-12T19:54:32Z<p>Steve: </p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces - UNH: ???? Farm: Done!<br />
## [[Iptables]] - Not working.<br />
## [[DNS]] - Done!<br />
## [[LDAP]] - Done!<br />
## [[Postfix]] <br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]]<br />
## [[automount|/home]] - Done!<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]? - This will be handled by roentgen.<br />
## Fortran compilers and things like that? (Also needs compat libs) - Isn't this what pumpkin is for?<br />
# Switch einstein.<br />
<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, which is a 3-way mirror comprised of sda1, sdb1, and sdc1. /var/spool/imap (can be changed to match our dovecot configuration) is /dev/md1, which is a 3-way mirror comprised of sda2, sdb2, and sdc2. /home is a 2-way mirror of sdb3 and sdc3. sda3 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a 250gb, and Matt wanted redundancy and space for mail and /home, since they're (some of) the most important things. Two 750gb's were added, and RHEL5 was reinstalled without a hitch. Grub should currently be installed on all three drives, so that if any one (or two!) drives fails, the system can still boot and run. The RAID setup is standard software raid1 using 3 elements for root/mail and 2 elements for home. This will allow us to put the drives in any other system if need be.<br />
==Mystery Behavior==<br />
einstein2 isn't connected to the UNH network, yet it can somehow reach both UNH and the outside world:<br />
<pre><br />
traceroute to google.com (64.233.167.99), 30 hops max, 40 byte packets<br />
1 gourd.farm.physics.unh.edu (10.0.0.252) 0.149 ms 0.137 ms 0.129 ms<br />
2 faculty1-gw.unh.edu (132.177.88.1) 0.546 ms 0.640 ms 0.704 ms<br />
3 tccat1-sup.unh.edu (132.177.84.134) 1.369 ms 1.456 ms 1.490 ms<br />
4 catwan.unh.edu (132.177.100.1) 4.955 ms 4.971 ms 5.001 ms<br />
5 64.215.24.177 (64.215.24.177) 8.399 ms 8.234 ms 8.248 ms<br />
6 te7-1-10G.ar2.DCA3.gblx.net (67.17.109.34) 14.197 ms 28.104 ms 14.382 ms<br />
7 google-2.ar2.DCA3.gblx.net (64.215.195.182) 16.258 ms 16.267 ms 16.283 ms<br />
8 209.85.130.12 (209.85.130.12) 16.852 ms 29.388 ms 209.85.130.18 (209.85.130.18) 17.033 ms<br />
9 209.85.248.221 (209.85.248.221) 34.669 ms 209.85.252.165 (209.85.252.165) 35.986 ms 216.239.46.224 (216.239.46.224) 41.596 ms<br />
10 72.14.238.89 (72.14.238.89) 42.588 ms 42.414 ms 72.14.238.90 (72.14.238.90) 36.595 ms<br />
11 72.14.232.70 (72.14.232.70) 42.013 ms 64.233.175.42 (64.233.175.42) 36.180 ms 72.14.232.70 (72.14.232.70) 36.979 ms<br />
12 64.233.175.42 (64.233.175.42) 51.523 ms py-in-f99.google.com (64.233.167.99) 42.752 ms 50.091 ms<br />
</pre><br />
Gourd, huh? What's this about, Aaron?</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Exports&diff=3757Exports2008-06-11T20:12:13Z<p>Steve: </p>
<hr />
<div>''/etc/exports'' contains the directories that a machine will export over NFS.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3744Einstein Status2008-06-10T15:31:10Z<p>Steve: </p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces - UNH: Done ('''VLAN''')! Farm: Done!<br />
## [[Iptables]] - Not working.<br />
## [[DNS]] - Done!<br />
## [[LDAP]] - Done!<br />
## [[Postfix]] <br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]]<br />
## [[automount|/home]] - Done!<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]?<br />
## Fortran compilers and things like that? (Also needs compat libs)<br />
# Switch einstein <-> tomato, and then upgrade what was originally einstein<br />
# Look into making an einstein, tomato failsafe setup.<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, which is a 3-way mirror comprised of sda1, sdb1, and sdc1. /var/spool/imap (can be changed to match our dovecot configuration) is /dev/md1, which is a 3-way mirror comprised of sda2, sdb2, and sdc2. /home is a 2-way mirror of sdb3 and sdc3. sda3 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a 250gb, and Matt wanted redundancy and space for mail and /home, since they're (some of) the most important things. Two 750gb's were added, and RHEL5 was reinstalled without a hitch. Grub should currently be installed on all three drives, so that if any one (or two!) drives fails, the system can still boot and run. The RAID setup is standard software raid1 using 3 elements for root/mail and 2 elements for home. This will allow us to put the drives in any other system if need be.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Services&diff=3743Services2008-06-10T15:23:12Z<p>Steve: </p>
<hr />
<div>The preferred way to manage services is system-config-services, which is a pretty self-explanatory graphical program.<br />
== chkconfig ==<br />
=== Useful incantations ===<br />
; <code>chkconfig --list [service]</code> : Lists all of the services' statuses, or just a particular service's.<br />
; <code>chkconfig --add name</code> : Adds the service "name" to chkconfig management, creating start/stop links according to the defaults in its init script.<br />
; <code>chkconfig --del name</code> : Removes the service "name" from chkconfig management, so it no longer starts during bootup.<br />
; <code>chkconfig --level levels name on|off</code> : Turns a service on or off only in the given runlevels. The levels are given as a string of digits from 0 to 6. E.g. "35" specifies runlevels 3 and 5.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Basic_Administration&diff=3742Basic Administration2008-06-10T15:13:43Z<p>Steve: </p>
<hr />
<div>These pages document some of the things we want to do frequently, like how to make sure the backup worked.<br />
* [[Add a new user]]<br />
* [[Burn a CD]]<br />
* [[Cacti]] server monitoring/graphing software<br />
* [[LVM|Logical Volume Management]]<br />
* [[Grub]]<br />
* How to [[Mounting a disk image|mount disk images]]<br />
* [[NPG backup on Lentil]]<br />
* [[Client Recipe|Recipe]] for getting a client machine up and running<br />
* Push updates through the [[RHN|RedHat Network]]<br />
* [[Skype info]]<br />
* Disk monitoring with [[SMARTD]]<br />
* [[Splunk]] IT search engine<br />
* Setting up [[sudoers]] with LDAP<br />
* Administration for [[Perl]] (adding modules etc)<br />
* Adding/editing [[Printer Settings]]<br />
* Modifying [[Default Applications]]<br />
* [[Package Management]] with <code>yum</code> and <code>rpm</code><br />
* Modifying system [[Services]]</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3741Einstein Status2008-06-10T14:22:07Z<p>Steve: </p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces - UNH: Done! Farm: Not done.<br />
## [[Iptables]] - Not working.<br />
## [[DNS]] - Done!<br />
## [[LDAP]] - Done!<br />
## [[Postfix]] <br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]]<br />
## [[automount|/home]] - Done!<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]?<br />
## Fortran compilers and things like that? (Also needs compat libs)<br />
# Switch einstein <-> tomato, and then upgrade what was originally einstein<br />
# Look into making an einstein, tomato failsafe setup.<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, which is a 3-way mirror comprised of sda1, sdb1, and sdc1. /var/spool/imap (can be changed to match our dovecot configuration) is /dev/md1, which is a 3-way mirror comprised of sda2, sdb2, and sdc2. /home is a 2-way mirror of sdb3 and sdc3. sda3 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a 250gb, and Matt wanted redundancy and space for mail and /home, since they're (some of) the most important things. Two 750gb's were added, and RHEL5 was reinstalled without a hitch. Grub should currently be installed on all three drives, so that if any one (or two!) drives fails, the system can still boot and run. The RAID setup is standard software raid1 using 3 elements for root/mail and 2 elements for home. This will allow us to put the drives in any other system if need be.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3740Einstein Status2008-06-10T14:21:20Z<p>Steve: Just had to change einstien2's hostname to einstein so that the TLS certs would work</p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces - UNH: Done. Farm: Not done.<br />
## [[Iptables]] - Not working.<br />
## [[DNS]] - Done.<br />
## [[LDAP]] - Done.<br />
## [[Postfix]] <br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]]<br />
## [[automount|/home]] - Done.<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]?<br />
## Fortran compilers and things like that? (Also needs compat libs)<br />
# Switch einstein <-> tomato, and then upgrade what was originally einstein<br />
# Look into making an einstein, tomato failsafe setup.<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, which is a 3-way mirror comprised of sda1, sdb1, and sdc1. /var/spool/imap (can be changed to match our dovecot configuration) is /dev/md1, which is a 3-way mirror comprised of sda2, sdb2, and sdc2. /home is a 2-way mirror of sdb3 and sdc3. sda3 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a 250gb, and Matt wanted redundancy and space for mail and /home, since they're (some of) the most important things. Two 750gb's were added, and RHEL5 was reinstalled without a hitch. Grub should currently be installed on all three drives, so that if any one (or two!) drives fails, the system can still boot and run. The RAID setup is standard software raid1 using 3 elements for root/mail and 2 elements for home. This will allow us to put the drives in any other system if need be.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3739Einstein Status2008-06-10T14:19:56Z<p>Steve: </p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces - UNH: Done. Farm: Not done.<br />
## [[Iptables]] - Not working.<br />
## [[DNS]] - Done.<br />
## [[LDAP]] - Can do searches but can't use for authentication. "hostname does not match CN in peer certificate"<br />
## [[Postfix]] <br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]]<br />
## [[automount|/home]] - Done.<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]?<br />
## Fortran compilers and things like that? (Also needs compat libs)<br />
# Switch einstein <-> tomato, and then upgrade what was originally einstein<br />
# Look into making an einstein, tomato failsafe setup.<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, which is a 3-way mirror comprised of sda1, sdb1, and sdc1. /var/spool/imap (can be changed to match our dovecot configuration) is /dev/md1, which is a 3-way mirror comprised of sda2, sdb2, and sdc2. /home is a 2-way mirror of sdb3 and sdc3. sda3 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a 250gb, and Matt wanted redundancy and space for mail and /home, since they're (some of) the most important things. Two 750gb's were added, and RHEL5 was reinstalled without a hitch. Grub should currently be installed on all three drives, so that if any one (or two!) drives fails, the system can still boot and run. The RAID setup is standard software raid1 using 3 elements for root/mail and 2 elements for home. This will allow us to put the drives in any other system if need be.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3738Einstein Status2008-06-10T14:08:09Z<p>Steve: </p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces - UNH: Done. Farm: Not done.<br />
## [[Iptables]] - Not working.<br />
## [[DNS]] - Done.<br />
## [[LDAP]] - Can do searches but users can't log in.<br />
## [[Postfix]] <br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]]<br />
## [[automount|/home]] - Done.<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]?<br />
## Fortran compilers and things like that? (Also needs compat libs)<br />
# Switch einstein <-> tomato, and then upgrade what was originally einstein<br />
# Look into making an einstein, tomato failsafe setup.<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, which is a 3-way mirror comprised of sda1, sdb1, and sdc1. /var/spool/imap (can be changed to match our dovecot configuration) is /dev/md1, which is a 3-way mirror comprised of sda2, sdb2, and sdc2. /home is a 2-way mirror of sdb3 and sdc3. sda3 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a 250gb, and Matt wanted redundancy and space for mail and /home, since they're (some of) the most important things. Two 750gb's were added, and RHEL5 was reinstalled without a hitch. Grub should currently be installed on all three drives, so that if any one (or two!) drives fails, the system can still boot and run. The RAID setup is standard software raid1 using 3 elements for root/mail and 2 elements for home. This will allow us to put the drives in any other system if need be.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3737Einstein Status2008-06-10T14:07:56Z<p>Steve: </p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces - UNH: Done. Farm: Not done.<br />
## [[Iptables]] - Not working.<br />
## [[DNS]] - Done.<br />
## [[LDAP]] - Can do searches but users can't log in.<br />
## [[Postfix]] <br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]]<br />
## [[automount|/home]] - Done.<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]?<br />
## Fortran compilers and things like that? (Also needs compat libs)<br />
# Switch einstein <-> tomato, and then upgrade what was originally einstein<br />
# Look into making an einstein, tomato failsafe setup.<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, which is a 3-way mirror comprised of sda1, sdb1, and sdc1. /var/spool/imap (can be changed to match our dovecot configuration) is /dev/md1, which is a 3-way mirror comprised of sda2, sdb2, and sdc2. /home is a 2-way mirror of sdb3 and sdc3. sda3 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a 250gb, and Matt wanted redundancy and space for mail and /home, since they're (some of) the most important things. Two 750gb's were added, and RHEL5 was reinstalled without a hitch. Grub should currently be installed on all three drives, so that if any one (or two!) drives fails, the system can still boot and run. The RAID setup is standard software raid1 using 3 elements for root/mail and 2 elements for home. This will allow us to put the drives in any other system if need be.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Automount&diff=3734Automount2008-06-09T19:18:01Z<p>Steve: </p>
<hr />
<div>According to its manual page, automount "manage[s] autofs mount points." It runs at startup and "sets up mount points for each entry in the master map, allowing them to be automatically mounted when accessed." ''autofs'' is the service program that does the mounting/unmounting when needed.<br />
== Server Configuration ==<br />
An NFS server must list the directories it serves in ''/etc/exports''.<br />
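An illustrative ''/etc/exports'' entry — the subnet and options here are made-up examples, not our actual exports:<br />
<pre><br />
# export /home read-write to the farm subnet<br />
/home 10.0.0.0/255.255.255.0(rw,sync)<br />
</pre><br />
After editing the file, <code>exportfs -ra</code> re-reads it without restarting the NFS service.<br />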
== General Configuration ==<br />
''automount'' looks in ''/etc/auto.master'' for a map. Our setup apparently requires a non-default mapping:<br />
/net /etc/auto.net<br />
It is important to note that that '''is not''' the default ''/etc/auto.net'' either.<br />
The detailed contents of ''auto.master'' and ''auto.net'' can be found here: [[Autofs Configuration Files]]. <br />
<br />
If SELinux is enabled, once the files are added/edited, their security context may get changed as well, and it may be necessary to run <code>restorecon -R -v /etc</code> before restarting ''autofs'', in order for ''autofs'' to work.<br />
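In other words, after touching the maps, something like:<br />
<pre><br />
# restore SELinux security contexts, then restart the automounter<br />
restorecon -R -v /etc<br />
service autofs restart<br />
</pre><br />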
<br />
== Backup Server Configuration ==<br />
In addition to the above changes, the backup server has the line<br />
/mnt/npg-daily /etc/auto.npg-daily<br />
in ''/etc/auto.master'', and<br />
# mount backup disks by label<br />
* -fstype=auto,noatime :-Lnpg-daily-&<br />
in ''/etc/auto.npg-daily'' . The '*' is for matching, and the '&' gets replaced by what is matched. So, for example, a search for "30" will mount the drive labelled "npg-daily-30". See the manual for ''autofs'' for more details.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3729Einstein Status2008-06-09T16:51:27Z<p>Steve: </p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces - UNH: Done. Farm: Not done.<br />
## [[Iptables]] - Done.<br />
## [[DNS]] - Done.<br />
## [[LDAP]] - Done.<br />
## [[Postfix]] <br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]]<br />
## [[automount|/home]]<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]?<br />
## Fortran compilers and things like that? (Also needs compat libs)<br />
# Switch einstein <-> tomato, and then upgrade what was originally einstein<br />
# Look into making an einstein, tomato failsafe setup.<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, which is a 3-way mirror comprised of sda1, sdb1, and sdc1. /var/spool/imap (can be changed to match our dovecot configuration) is /dev/md1, which is a 3-way mirror comprised of sda2, sdb2, and sdc2. /home is a 2-way mirror of sdb3 and sdc3. sda3 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a 250gb, and Matt wanted redundancy and space for mail and /home, since they're (some of) the most important things. Two 750gb's were added, and RHEL5 was reinstalled without a hitch. Grub should currently be installed on all three drives, so that if any one (or two!) drives fails, the system can still boot and run. The RAID setup is standard software raid1 using 3 elements for root/mail and 2 elements for home. This will allow us to put the drives in any other system if need be.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3728Einstein Status2008-06-09T15:18:23Z<p>Steve: </p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces - UNH: Done. Farm: Not done.<br />
## [[Iptables]]<br />
## [[DNS]] - Done.<br />
## [[LDAP]] <br />
## [[Postfix]] <br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]]<br />
## [[automount|/home]]<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]?<br />
## Fortran compilers and things like that? (Also needs compat libs)<br />
# Switch einstein <-> tomato, and then upgrade what was originally einstein<br />
# Look into making an einstein, tomato failsafe setup.<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, which is a 3-way mirror comprised of sda1, sdb1, and sdc1. /var/spool/imap (can be changed to match our dovecot configuration) is /dev/md1, which is a 3-way mirror comprised of sda2, sdb2, and sdc2. /home is a 2-way mirror of sdb3 and sdc3. sda3 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a 250gb, and Matt wanted redundancy and space for mail and /home, since they're (some of) the most important things. Two 750gb's were added, and RHEL5 was reinstalled without a hitch. Grub should currently be installed on all three drives, so that if any one (or two!) drives fails, the system can still boot and run. The RAID setup is standard software raid1 using 3 elements for root/mail and 2 elements for home. This will allow us to put the drives in any other system if need be.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3726Einstein Status2008-06-09T15:15:10Z<p>Steve: </p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces - Farm: Done. UNH: Done.<br />
## [[Iptables]]<br />
## [[DNS]] - Done.<br />
## [[LDAP]] <br />
## [[Postfix]] <br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]]<br />
## [[automount|/home]]<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]?<br />
## Fortran compilers and things like that? (Also needs compat libs)<br />
# Switch einstein <-> tomato, and then upgrade what was originally einstein<br />
# Look into making an einstein, tomato failsafe setup.<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, which is a 3-way mirror comprised of sda1, sdb1, and sdc1. /home is a 2-way mirror of sdb2 and sdc2. sda2 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a 250gb, and Matt wanted redundancy and space for /home, since it's (one of) the most important things. Two 750gb's were added, root was expanded to take up most of the original drive, thus dictating the size of root on the two new drives. Home was then made using the remaining space, giving PLENTY for anyone here. Grub should currently be installed on all three drives, so that if any one (or two!) drives fails, the system can still boot and run. The RAID setup is standard software raid1 using 3 elements for root and 2 elements for home. This will allow us to put the drives in any other system if need be.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3725Einstein Status2008-06-09T15:14:54Z<p>Steve: </p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces - Farm: Done<br />
## [[Iptables]]<br />
## [[DNS]] - Done.<br />
## [[LDAP]] <br />
## [[Postfix]] <br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]]<br />
## [[automount|/home]]<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]?<br />
## Fortran compilers and things like that? (Also needs compat libs)<br />
# Switch einstein <-> tomato, and then upgrade what was originally einstein<br />
# Look into making an einstein, tomato failsafe setup.<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, a 3-way mirror made up of sda1, sdb1, and sdc1. /home is a 2-way mirror of sdb2 and sdc2. sda2 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a single 250 GB drive, and Matt wanted redundancy and space for /home, since it's (one of) the most important things. Two 750 GB drives were added; root was expanded to take up most of the original drive, which dictated the size of root on the two new drives. Home was then made from the remaining space, giving plenty of room for anyone here. GRUB should currently be installed on all three drives, so that if any one (or even two!) drives fail, the system can still boot and run. The RAID setup is standard software RAID 1, using 3 elements for root and 2 elements for home. This will allow us to put the drives in any other system if need be.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3724Einstein Status2008-06-09T15:14:00Z<p>Steve: </p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces.<br />
## [[Iptables]]<br />
## [[DNS]] - Done.<br />
## [[LDAP]] <br />
## [[Postfix]] <br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]]<br />
## [[automount|/home]]<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]?<br />
## Fortran compilers and things like that? (Also needs compat libs)<br />
# Switch einstein <-> tomato, and then upgrade what was originally einstein<br />
# Look into making an einstein, tomato failsafe setup.<br />
<br />
==Current setup on einstein2==<br />
/ is /dev/md0, a 3-way mirror made up of sda1, sdb1, and sdc1. /home is a 2-way mirror of sdb2 and sdc2. sda2 is the swap partition.<br />
The reason it is set up this way is that the system came installed on a single 250 GB drive, and Matt wanted redundancy and space for /home, since it's (one of) the most important things. Two 750 GB drives were added; root was expanded to take up most of the original drive, which dictated the size of root on the two new drives. Home was then made from the remaining space, giving plenty of room for anyone here. GRUB should currently be installed on all three drives, so that if any one (or even two!) drives fail, the system can still boot and run. The RAID setup is standard software RAID 1, using 3 elements for root and 2 elements for home. This will allow us to put the drives in any other system if need be.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Client_Recipe&diff=3721Client Recipe2008-06-05T20:11:20Z<p>Steve: working on better tar/script</p>
<hr />
<div>A simple ''n''-step process to set up a client lickety-split:<br />
# Install Fedora in the typical fashion, skipping the steps for creating a default user and network authentication<br />
# Log in as root<br />
# Run system-config-network<br />
# If there isn't one already, add an ethernet device on eth0.<br />
# If this client is not in the server room (and therefore not going to use a VLAN), skip to the next full step<br />
## Choose to statically set the IP address to an available local number (10.0.0.*)<br />
## Give the device the alias "farm".<br />
## Run <code>vconfig add eth0 2</code> to create a virtual device "eth0.2"<br />
## Use system-config-network to add an ethernet device to eth0.2<br />
# Alias it "unh"<br />
# Choose to statically set the IP address to whatever was registered for the client<br />
# Set the gateway to 132.177.88.1<br />
# Under the general network configuration "DNS" tab, enter the appropriate IPs of einstein and roentgen as primary and secondary DNS (use their farm-local addresses if farm is the primary connection, their UNH addresses if unh is)<br />
# Save the changes made with system-config-network<br />
# If a virtual device was added:<br />
## Open /etc/sysconfig/network-scripts/ifcfg-unh in a text editor<br />
## Add the line <code>VLAN=yes</code>, and save<br />
# If there are any more devices already present, disable, remove or configure them as well. Whatever you do, don't leave them defaulted to DHCP mode; otherwise their existence will change /etc/resolv.conf!<br />
# Run authconfig-gtk<br />
# Check "Enable LDAP Support" under the "User Information" and "Authentication" tabs<br />
# Click "Configure LDAP..."<br />
# The base DN is dc=physics,dc=unh,dc=edu and the server is einstein.unh.edu.<br />
# "Download CA Certificate" doesn't ever seem to work, so get "unh_physics_ca.crt" from einstein and put it in /etc/openldap/cacerts (hint: <code>scp</code>).<br />
# Click OK in LDAP Settings.<br />
# Click OK in authconfig<br />
# Copy the appropriate content into the [[Autofs Configuration Files]]<br />
# Reboot</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Sysadmin_Todo_List&diff=3719Sysadmin Todo List2008-06-05T18:18:10Z<p>Steve: /* Einstein Upgrade */ einstein 2 instead</p>
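The VLAN steps in the recipe above boil down to two files under /etc/sysconfig/network-scripts/. A minimal sketch of what a server-room client should end up with (the IP addresses here are placeholders, not registered values, and the files are written to a temp directory for illustration):

```shell
# Sketch of the two interface configs the recipe produces for a server-room
# client. Addresses are placeholders; use the numbers registered for the host.
dir=$(mktemp -d)

# "farm" alias: plain eth0 on the local 10.0.0.* network.
cat > "$dir/ifcfg-farm" <<'EOF'
DEVICE=eth0
BOOTPROTO=static
IPADDR=10.0.0.99
NETMASK=255.255.255.0
ONBOOT=yes
EOF

# "unh" alias: the eth0.2 virtual device created by `vconfig add eth0 2`,
# with the VLAN=yes line the recipe says to add by hand.
cat > "$dir/ifcfg-unh" <<'EOF'
DEVICE=eth0.2
BOOTPROTO=static
IPADDR=132.177.88.99
NETMASK=255.255.255.0
GATEWAY=132.177.88.1
ONBOOT=yes
VLAN=yes
EOF

grep -q '^VLAN=yes' "$dir/ifcfg-unh" && echo "VLAN flag present"
```

A non-server-room client only needs the unh-style file, on plain eth0 and without the VLAN=yes line.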
<hr />
<div>This is an unordered set of tasks. Detailed information on any of the tasks typically goes in related topics' pages, although usually not until the task has been filed under [[Sysadmin Todo List#Completed|Completed]].<br />
== Daily Check off list ==<br />
Each day when you come in check the following:<br />
# <font color="red">Do not update nss_ldap on RHEL5 machines until they fix it</font><br />
# Einstein ([[Script Prototypes|script]]):<br />
## Up and running?<br />
## Disks are at less than 90% full?<br />
## Mail system OK? (spamassassin, amavisd, ...)<br />
# Temperature OK? No water blown into room?<br />
# Systems up: Taro, Pepper, Pumpkin/Corn ?<br />
# Backups:<br />
## Did backup succeed?<br />
## Does Lentil need a new disk?<br />
<br />
== Important ==<br />
<br />
=== Towards a stable setup ===<br />
<br />
Here are some thoughts, especially to Steve, about getting an über-stable setup for the servers. <br><br />
Some observations:<br />
# When we get to DeMeritt at the end of next summer, we need a setup that easily ports to the new environment. We will also be limited to a total of 10 kWatts heat load (36000 BTUs, or 3 tons of cooling), due to the cooling of the room. That sounds like a lot, but Silas and Jiang-Ming will also put servers in this space. Our footprint should be no more than 3 to 4 kWatts of heat load.<br />
# Virtual systems seem like the way to go. However, our experience with Xen is that it does not lead to highly portable VMs.<br />
# VMware Server is now a free product. They make money consulting and selling fancy add-ons. I have good experience with VMware Workstation on Macs and Linux. But it is possible (as with Red Hat, which was once free) that they will start charging once they reach 90% or more market share.<br />
<br />
Here are some options: <br><br />
* We get rid of Tomato, Jalapeno, Gourd and Okra and perhaps also Roentgen. If we want we can scavenge the parts from Tomato & Jalapeno (plus old einstein) for a toy system, or we park these systems in the corner. I don't want to waste time on them. The only bumps that I can think of here would be that Xemed/Aaron use Gourd. Otherwise I think we're all in favor of cutting down on the number of physical machines that we've got running. Oh, and what about the paulis? '''Since they're not under our "jurisdiction", they'll probably end up there anyhow.'''<br />
* Test VMware server (See [[VMWare Progress]]). Specifically, I would like to know:<br />
## How easy is it to move a VM from one hardware to another? (Can you simply move the disks?) '''Yes.'''<br />
## Specifically, if you need to service some hardware, can you move the host to other hardware with little down time? (Clearly not for large disk arrays, like pumpkin, but that is storage, not hosts). '''Considering portability of disks/files, the downtime is the time it takes to move the image around and start up on another machine.'''<br />
## Do we need a Red Hat license for each VM or do we only need a license for the host, as with Xen? '''It seems to consume a license per VM. Following [http://kbase.redhat.com/faq/FAQ_103_10754.shtm this] didn't work for the VMWare systems. The closest thing to an official word that I could find was [http://www.redhat.com/archives/taroon-list/2004-August/msg00292.html this].'''<br />
## VMware allows for "virtual appliances", but how good are these really? Are these fast enough?<br />
* Evaluate the hardware needs. Pumpkin, the new Einstein, Pepper, Taro and Lentil all seem to be of sufficient quality and up to date. Do we need any other hardware? If so, what?<br />
<br />
=== Einstein Upgrade ===<br />
<br />
Einstein upgrade project and status page: [[Einstein Status]]<br />
'''Note:''' Einstein (current one) has a problem with / getting full occasionally. See [[Einstein#Special_Considerations_for_Einstein]]<br />
<br />
=== Miscellaneous ===<br />
* '''Mariecurie''': New NVIDIA video card seems to have the same problems as the ATI ones.<br />
* Lepton constantly has problems printing. It seems that at least once a month the queue locks up. This machine has Fedora Core 3 installed, I wonder if it would be more worth it to just put CentOS on it and be done with this recurring problem.<br />
* Fermi has problems allowing me to log in. nsswitch.conf looks fine, getent passwd shows all the users like it's supposed to. There are no restrictions in ''/etc/security/access.conf'', either.<br />
* Gourd won't let me (Matt) log in, saying no such file or directory when trying to chdir to my home, and then it boots me off. Trying to log in as root from einstein is successful just long enough for it to tell me when the last login was, then boots me. '''(Steve here) I was able to log in and do stuff, but programs were intermittently slow.'''<br />
* Clean out some users who have left a while ago. (Maurik should do this.)<br />
* '''Monitoring''': I would like to see the new temp-monitor integrated with Cacti, and fix some of the cacti capabilities, i.e. tie it in with the sensors output from pepper and taro (and tomato/einstein). Set up sensors on the corn/pumpkin. Have an intelligent way in which we are warned when conditions are too hot, a drive has failed, a system is down. '''I'm starting to get the hang of getting this sort of data via snmp. I wrote a perl script that pulls the temperature data from the environmental monitor, as well as some nice info from einstein. We SHOULD be able to integrate a rudimentary script like this into cacti or splunk, getting a bit closer to an all-in-one monitoring solution. It's in Matt's home directory, under code/npgmon/'''<br />
* Check into smartd monitoring (and processing its output) on Pepper, Taro, Corn/Pumpkin, Einstein, Tomato.<br />
* Decommission Okra. - This system is way too outdated to bother with it. Move Cacti to another system. Perhaps a VM, once we get that figured out?<br />
* Learn how to use [[cacti]]. We should consider using a VM appliance to do this, so it's minimal configuration, and since okra's only purpose is to run cacti.<br />
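As a starting point for the "warn us when conditions are too hot" part of the monitoring bullet above, the sensors-style output can be reduced to a simple threshold check. A sketch only: the sample reading and the 75-degree limit are made up, and a real version would capture live `sensors` (or the perl script's SNMP fetch) instead of a hard-coded string:

```shell
# Reduce lm_sensors-style output to a pass/warn line. The sample reading is
# illustrative; swap the here-string for real `sensors` output in practice.
sample='CPU Temp:   +61.0 C  (high = +80.0 C)'
limit=75

# Grab the first +NN value on the line (the current reading, not the limit).
temp=$(echo "$sample" | grep -o '+[0-9]*' | head -n 1 | tr -d '+')

if [ "$temp" -ge "$limit" ]; then
    echo "WARN: CPU at ${temp} C"
else
    echo "OK: CPU at ${temp} C"
fi
```

For the sample above this prints "OK: CPU at 61 C"; wiring the WARN branch to mail or to cacti/splunk is the part that still needs doing.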
<br />
== Ongoing ==<br />
=== Documentation ===<br />
* '''<font color="red" size="+1">Maintain the Documentation of all systems!</font>'''<br />
** Main function<br />
** Hardware<br />
** OS<br />
** Network<br />
* Continue homogenizing the configurations of the machines.<br />
* Improve documentation of [[Software Issues#Mail Chain Dependencies|mail software]], specifically SpamAssassin, Cyrus, etc.<br />
=== Maintenance ===<br />
* Check e-mails to root every morning<br />
* Check up on security [http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/ch-sec-network.html#ch-wstation]<br />
* Clean up Room 202.<br />
** Start reorganizing things back into boxes for the August move.<br />
** Ask UNH if they are willing/able to recycle/reuse the three CRTs '''and old machines''' that we have sitting around. '''Give them away if we have to.'''<br />
<br />
=== On-the-Side ===<br />
* Learn how to use ssh-agent for task automation.<br />
* Backup stuff: We need exclude filters on the backups. We need to plan and execute extensive tests before modifying the production backup program. Also, see if we can implement some sort of NFS user access. '''I've set up both filters and read-only snapshot access to backups at home. Uses what essentially amounts to a bash script version of the fancy perl thing we use now, only far less sophisticated. However, the filtering and user access uses a standard rsync exclude file (syntax in man page) and the user access is fairly obvious NFS read-only hosting.''' <font color="green"> I am wondering if this is needed. The current scheme (i.e., the perl script) uses excludes by having a .rsync-filter in each of the directories whose contents you want excluded. This has worked well. See ~maurik/tmp/.rsync-filter. The current script takes care of some important issues, like incomplete backups.</font> Ah. So we need to get users to somehow keep that .rsync-filter file fairly updated. And to get them to use data to hold things, not home. Also, I wasn't suggesting we get rid of the perl script, I was saying that I've become familiar with a number of the things it does. [http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/ch-acls.html#s2-acls-mounting-nfs]<br />
* Continue purging NIS from ancient workstations, replacing it with files. The following remain:<br />
** pauli nodes -- Low priority!<br />
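For reference, the per-directory exclude scheme discussed in the backup bullet above works through rsync's dir-merge filters: a .rsync-filter file in any backed-up directory lists patterns to skip, and the transfer is run with -F, which is shorthand for --filter='dir-merge /.rsync-filter'. A sketch only — the patterns and the backuphost destination are examples, not our production setup:

```shell
# Example .rsync-filter a user might drop in their home directory.
# Patterns are illustrative, not our production excludes.
cat > ~/.rsync-filter <<'EOF'
- *.o
- scratch/
- core.*
EOF

# -F makes rsync honor every directory's .rsync-filter during the pass.
# (backuphost and paths are placeholders for whatever lentil's script uses.)
rsync -aF /home/ backuphost:/backups/home/
```

This is why keeping ~maurik/tmp/.rsync-filter-style files updated is all users need to do; the backup script itself doesn't change.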
<br />
== Waiting ==<br />
* That guy's computer has a BIOS checksum error. Flashing the BIOS to the newest version succeeds, but doesn't fix the problem. No obvious mobo damage either. What happened? '''Who was that guy, anyhow?''' (Silviu Covrig, probably) The machine is gluon, according to him. '''Waiting on ASUS tech support for warranty info''' Aaron said it might be power-supply-related. '''Nope. Definitely not. Used a known good PSU and still got the error, reflashed the BIOS with it and still got the error. '''Got RMA, sending out on Wed.''' Waiting on ASUS to send us a working one!''' Called ASUS on 8/6, they said it's getting repaired right now. '''Woohoo! Got a notification that it shipped!''' ...they didn't fix it... Still has the EXACT same error it had when we shipped it to them. '''What should we do about this?''' I'm going to call them up and have a talk, since the details on their shipment reveal that they sent us a different motherboard (different serial number and everything) but with the same problem.<br />
* Printer queue for Copier: Konica Minolta Bizhub 750. IP=pita.unh.edu '''Seems like we need info from the Konica guy to get it set up on Red Hat. The installation documentation for the driver doesn't mention things like the passcode, because those are machine-specific. Katie says that if he doesn't come on Monday, she'll make an inquiry.''' <font color="green">Mac OS X now working, IT guy should be here week of June 26th</font> '''Did he ever come?''' No, he didn't, and did not respond to a voice message left. Will call again.<br />
* Sent an email to UNH Property Control asking what the procedure is to get rid of untagged equipment, namely, the two old monitors in the corner. Apparently they want us to fill out lots of information on the scrapping form like if it was paid for with government money, etc, as well as give them serial numbers, model numbers, and everything we can get ahold of. Then, we get to hang onto them until the hazardous equipment people come in and take it out, at their leisure. Waiting to figure out what we want to do with them.<br />
<br />
== Completed ==<br />
* '''pumpkin/lentil/mariecurie/einstein2/corn''': <font color="red">It was all thanks to nss_ldap</font> Here's a summary of the symptoms, none of which occur for root:<br />
** bash has piping problems for Steve, but sometimes not for Matt. Something like <code>ls | wc</code> will print nothing and <code>echo $?</code> will print 141, aka SIGPIPE. Backticks will cause similar problems. Something like <code>echo `ls`</code> will print nothing, and <code>`ls`; echo $?</code> also prints 141. Since several system-provided startup scripts rely on strings returned from backticks, bash errors will print upon login.<br />
** '''None''' of the bash problems seem to happen when logged onto the physical machine, rather than over SSH, but '''all''' of the tcsh problems still occur. '''Turned out that the newest version of bash on el5 systems is broken. Replacing it with an older version fixes the issue with bash. Tcsh still gives issues, but that appears to be unrelated.'''<br />
** Everything else seems fine on these two machines: disk usage, other programs, network, etc.<br />
** Since this is now not just a pumpkin issue, the problem probably isn't corrupt files, but maybe some update messed something up. Figuring out what update that is could be tough though since there was a huge chunk of updates at some point (although I suspect the problem might be bash since tcsh depends on it). However, einstein2 is pretty much a clean slate, so a simple if tedious method would be to apply updates to it gradually until the symptoms pop up.<br />
** tcsh can only run built-in commands; anything else results in a "Broken pipe" and the program not running. "Broken pipe" appears even for non-existent programs (e.g. "hhhhhhhh" will make "Broken pipe" appear).<br />
* '''Lentil''': Gotta reinstall a whole bunch of things and/or a new disk; looks like there was some damage from the power problem on Monday (the size 0 files have returned).<br />
* '''jalapeno hangups:''' Look at sensors on jalapeno, so that cacti can monitor the temp. The crashing probably isn't the splunk beta (no longer beta!), since it runs entirely in userspace. '''lm_sensors fails to detect anything readable. Is there a way around this?''' Jalapeno's been on for two weeks with no problems, let's keep our fingers crossed&hellip; '''This system is too unstable to maintain, like tomato and old einstein.''' Got an e-mail today, saying it's got a degraded array. I just turned it off since it's just a crappy space heater at this point.<br />
* Resize/clean up partitions as necessary. Seems to be a running trend that a computer gets 0 free space and problems crop up. '''This hasn't happened in half a year. I think it was a coincidence that a few computers had it happen at once.'''<br />
* Put new drive in lentil, npg-daily-33. '''That's good, because 32 is almost full already. 81%!'''<br />
* <b><font color="red">CLAMAV died and no one noticed!</font></b> The update of clamav (mail virus scanner on einstein) on April 23rd killed this mail subsystem because some of the options in /etc/clamd.conf were now obsolete. See http://www.sfr-fresh.com/unix/misc/clamav-0.93.tar.gz:a/clamav-0.93/NEWS. This seems to have gone unnoticed for a while. Are we asleep at the wheel? Edited /etc/clamd.conf to comment out these options.<br />
* When I came in today (22nd), taro had kernel panicked and einstein was acting strangely. Checking root's email, I saw that all day the 21st and 22nd, there were SMTP errors, around 2 per minute. A quick glance at them gives me the impression that they're spam attempts, due to ridiculous FROM fields like <code>pedrofinancialcompany.inc.net@tiscali.dk</code>. I rebooted taro and einstein, everything seems fine now.<br />
* Pauli crashes nearly every day, not when backups come around. We need to set up detailed system logging to find out why. Pauli2 and 4 don't give out their data via /net to the other paulis. This doesn't seem to be an autofs setting, since I see nothing about it in the working nodes' configs. Similarly, 2,4, and 6 won't access the other paulis via /net. 2,4 were nodes we rebuilt this summer, so it makes sense they don't have the right settings, but 6 is a mystery. Pauli2's hard drive may be dying. Some files in /data are inaccessible, and smartctl shows a large number of errors (98 if I'm reading this right...). Time to get Heisenberg a new hard drive? '''Or maybe just wean him off of NPG&hellip;''' It may be done for; can't connect to pauli2 and rebooting didn't seem to work. Need to set up the monitor/keyboard for it & check things out. '''The pauli nodes are all off for now. They've been deemed to produce more heat than they're worth. We'll leave them off until Heisenberg complains.''' Heisenberg's complaining now. Fixed his pauli machine by walking in the room (still don't know what he was talking about) and dirac had LDAP shut off. He wants the paulis up whenever possible, which I explained could be awhile because of the heat issues. ''' Pauli doesn't crash anymore, as far as I can tell. Switching the power supply seems to have done it.'''<br />
* Pumpkin is now stable. Read more on the configuration at [[Pumpkin]] and [[Xen]].<br />
* Roentgen was plugged into one of the non-battery-backup slots of its UPS, so I shut it down and moved the plug. After starting back up, root got a couple of mysterious e-mails about /dev/md0 and /dev/md2: <code>Array /dev/md2 has experienced event "DeviceDisappeared"</code>. However, <code>mount</code> seems to indicate that everything important is around:<br />
<pre><br />
/dev/vg_roentgen/rhel3 on / type ext3 (rw,acl)<br />
none on /proc type proc (rw)<br />
none on /dev/pts type devpts (rw,gid=5,mode=620)<br />
usbdevfs on /proc/bus/usb type usbdevfs (rw)<br />
/dev/md1 on /boot type ext3 (rw)<br />
none on /dev/shm type tmpfs (rw)<br />
/dev/vg_roentgen/rhel3_var on /var type ext3 (rw)<br />
/dev/vg_roentgen/wheel on /wheel type ext3 (rw,acl)<br />
/dev/vg_roentgen/srv on /srv type ext3 (rw,acl)<br />
/dev/vg_roentgen/dropbox on /var/www/dropbox type ext3 (rw)<br />
/usr/share/ssl on /etc/ssl type none (rw,bind)<br />
/proc on /var/lib/bind/proc type none (rw,bind)<br />
automount(pid1503) on /net type autofs (rw,fd=5,pgrp=1503,minproto=2,maxproto=4)<br />
</pre>and all of the sites listed on [[Web Servers]] work. Were those just old arrays that aren't around anymore but are still listed in some config file? '''We haven't seen any issues, and roentgen's going to be virtualized in the not-too-distant future, so this is fairly irrelevant.'''<br />
* Gourd's been giving smartd errors, namely<code><br />
Offline uncorrectable sectors detected:<br />
/dev/sda [3ware_disk_00] - 48 Time(s)<br />
1 offline uncorrectable sectors detected<br />
</code>Okra also has an offline uncorrectable sector! '''No sign of problems since this was posted.'''<br />
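The 141 exit status mentioned in the nss_ldap item above is not arbitrary: bash reports a command killed by signal N as 128+N, and SIGPIPE is 13, so 128+13=141. A quick way to reproduce a legitimate 141 (this demonstrates the status convention only, not the nss_ldap bug itself; PIPESTATUS is a bash-ism):

```shell
# `yes` is killed by SIGPIPE when `head` exits and closes the pipe,
# so its recorded status is 128 + 13 = 141.
yes | head -n 1 > /dev/null
echo "yes exited with ${PIPESTATUS[0]}"
```

The broken machines were returning 141 from commands that should never receive SIGPIPE, which is what pointed at the pipe handling rather than the commands themselves.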
<br />
== Previous Months Completed ==<br />
[[Completed in June 2007|June 2007]]<br />
<br />
[[Completed in July 2007|July 2007]]<br />
<br />
[[Completed in August 2007|August 2007]]<br />
<br />
[[Completed in September 2007|September 2007]]<br />
<br />
[[Completed in October 2007|October 2007]]<br />
<br />
[[Completed in November/December 2007|NovDec 2007]]<br />
<br />
[[Completed in January 2008|January 2008]]<br />
<br />
[[Completed in February 2008|February 2008]]</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Einstein_Status&diff=3718Einstein Status2008-06-05T13:48:51Z<p>Steve: Time to start over</p>
<hr />
<div>[http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/index.html Massive amount of deployment documentation for RHEL 5]<br />
<br />
# Check all services einstein currently provides. Locate as many custom scripts, etc. as is reasonable and label/copy them.<br />
## Network interfaces.<br />
## [[Iptables]]<br />
## [[DNS]]<br />
## [[LDAP]] <br />
## [[Postfix]] <br />
## [[AMaViS]] <br />
## [[ClamAV]]<br />
## [[SpamAssassin]] <br />
## [[Dovecot]]<br />
## [[automount|/home]]<br />
## [[Samba]] If anyone needs samba access, they need to find us and have us make them a samba account. No LDAP integration.<br />
## [[Web Servers|Web]]?<br />
## Fortran compilers and things like that? (Also needs compat libs)<br />
# Switch einstein <-> tomato, and then upgrade what was originally einstein<br />
# Look into making an einstein, tomato failsafe setup.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Sysadmin_Todo_List&diff=3714Sysadmin Todo List2008-06-03T17:42:54Z<p>Steve: /* Miscellaneous */</p>
<hr />
<div>This is an unordered set of tasks. Detailed information on any of the tasks typically goes in related topics' pages, although usually not until the task has been filed under [[Sysadmin Todo List#Completed|Completed]].<br />
== Daily Check off list ==<br />
Each day when you come in check the following:<br />
# <font color="red">Do not update nss_ldap on RHEL5 machines until they fix it</font><br />
# Einstein ([[Script Prototypes|script]]):<br />
## Up and running?<br />
## Disks are at less than 90% full?<br />
## Mail system OK? (spamassassin, amavisd, ...)<br />
# Temperature OK? No water blown into room?<br />
# Systems up: Taro, Pepper, Pumpkin/Corn ?<br />
# Backups:<br />
## Did backup succeed?<br />
## Does Lentil need a new disk?<br />
<br />
== Important ==<br />
<br />
=== Towards a stable setup ===<br />
<br />
Here are some thoughts, especially to Steve, about getting an über-stable setup for the servers. <br><br />
Some observations:<br />
# When we get to DeMeritt at the end of next summer, we need a setup that easily ports to the new environment. We will also be limited to a total of 10 kWatts heat load (36000 BTUs, or 3 tons of cooling), due to the cooling of the room. That sounds like a lot, but Silas and Jiang-Ming will also put servers in this space. Our footprint should be no more than 3 to 4 kWatts of heat load.<br />
# Virtual systems seem like the way to go. However, our experience with Xen is that it does not lead to highly portable VMs.<br />
# VMware Server is now a free product. They make money consulting and selling fancy add-ons. I have good experience with VMware Workstation on Macs and Linux. But it is possible (as with Red Hat, which was once free) that they will start charging once they reach 90% or more market share.<br />
<br />
Here are some options: <br><br />
* We get rid of Tomato, Jalapeno, Gourd and Okra and perhaps also Roentgen. If we want we can scavenge the parts from Tomato & Jalapeno (plus old einstein) for a toy system, or we park these systems in the corner. I don't want to waste time on them. The only bumps that I can think of here would be that Xemed/Aaron use Gourd. Otherwise I think we're all in favor of cutting down on the number of physical machines that we've got running. Oh, and what about the paulis? '''Since they're not under our "jurisdiction", they'll probably end up there anyhow.'''<br />
* Test VMware server (See [[VMWare Progress]]). Specifically, I would like to know:<br />
## How easy is it to move a VM from one hardware to another? (Can you simply move the disks?) '''Yes.'''<br />
## Specifically, if you need to service some hardware, can you move the host to other hardware with little down time? (Clearly not for large disk arrays, like pumpkin, but that is storage, not hosts). '''Considering portability of disks/files, the downtime is the time it takes to move the image around and start up on another machine.'''<br />
## Do we need a Red Hat license for each VM or do we only need a license for the host, as with Xen? '''It seems to consume a license per VM. Following [http://kbase.redhat.com/faq/FAQ_103_10754.shtm this] didn't work for the VMWare systems. The closest thing to an official word that I could find was [http://www.redhat.com/archives/taroon-list/2004-August/msg00292.html this].'''<br />
## VMware allows for "virtual appliances", but how good are these really? Are these fast enough?<br />
* Evaluate the hardware needs. Pumpkin, the new Einstein, Pepper, Taro and Lentil all seem to be of sufficient quality and up to date. Do we need any other hardware? If so, what?<br />
<br />
=== Einstein Upgrade ===<br />
<br />
Einstein upgrade project and status page: [[Einstein Status]]<br />
'''Note:''' Einstein (current one) has a problem with / getting full occasionally. See [[Einstein#Special_Considerations_for_Einstein]]<br />
<br />
It seems this is not moving forward sufficiently. I think we need a new strategy to get this accomplished. My new thought is to abandon the Tomato hardware, which may have been a source of the difficulties, and use what we learned for the setup of "Einstein on RHEL5" to create a virtual machine Tomato, where we test the upgrade to RHEL5. <br />
<br />
<br />
<br />
=== Miscellaneous ===<br />
* '''Mariecurie''': New NVIDIA video card seems to have the same problems as the ATI ones.<br />
* Lepton constantly has problems printing. It seems that at least once a month the queue locks up. This machine has Fedora Core 3 installed, I wonder if it would be more worth it to just put CentOS on it and be done with this recurring problem.<br />
* Fermi has problems allowing me to log in. nsswitch.conf looks fine, getent passwd shows all the users like it's supposed to. There are no restrictions in ''/etc/security/access.conf'', either.<br />
* Gourd won't let me (Matt) log in, saying no such file or directory when trying to chdir to my home, and then it boots me off. Trying to log in as root from einstein is successful just long enough for it to tell me when the last login was, then boots me. '''(Steve here) I was able to log in and do stuff, but programs were intermittently slow.'''<br />
* Clean out some users who have left a while ago. (Maurik should do this.)<br />
* '''Monitoring''': I would like to see the new temp-monitor integrated with Cacti, and fix some of the cacti capabilities, i.e. tie it in with the sensors output from pepper and taro (and tomato/einstein). Set up sensors on the corn/pumpkin. Have an intelligent way in which we are warned when conditions are too hot, a drive has failed, a system is down. '''I'm starting to get the hang of getting this sort of data via snmp. I wrote a perl script that pulls the temperature data from the environmental monitor, as well as some nice info from einstein. We SHOULD be able to integrate a rudimentary script like this into cacti or splunk, getting a bit closer to an all-in-one monitoring solution. It's in Matt's home directory, under code/npgmon/'''<br />
* Check into smartd monitoring (and processing its output) on Pepper, Taro, Corn/Pumpkin, Einstein, Tomato.<br />
* Decommission Okra. - This system is way too outdated to bother with it. Move Cacti to another system. Perhaps a VM, once we get that figured out?<br />
* Learn how to use [[cacti]]. We should consider using a VM appliance to do this, so it's minimal configuration, and since okra's only purpose is to run cacti.<br />
<br />
== Ongoing ==<br />
=== Documentation ===<br />
* '''<font color="red" size="+1">Maintain the Documentation of all systems!</font>'''<br />
** Main function<br />
** Hardware<br />
** OS<br />
** Network<br />
* Continue homogenizing the configurations of the machines.<br />
* Improve documentation of [[Software Issues#Mail Chain Dependencies|mail software]], specifically SpamAssassin, Cyrus, etc.<br />
=== Maintenance ===<br />
* Check e-mails to root every morning<br />
* Check up on security [http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/ch-sec-network.html#ch-wstation]<br />
* Clean up Room 202.<br />
** Start reorganizing things back into boxes for the August move.<br />
** Ask UNH if they are willing/able to recycle/reuse the three CRTs '''and old machines''' that we have sitting around. '''Give them away if we have to.'''<br />
<br />
=== On-the-Side ===<br />
* Learn how to use ssh-agent for task automation.<br />
* Backup stuff: We need exclude filters on the backups. We need to plan and execute extensive tests before modifying the production backup program. Also, see if we can implement some sort of NFS user access. '''I've set up both filters and read-only snapshot access to backups at home. It uses what essentially amounts to a bash script version of the fancy perl thing we use now, only far less sophisticated. However, the filtering uses a standard rsync exclude file (syntax in the man page) and the user access is fairly obvious NFS read-only hosting.''' <font color="green"> I am wondering if this is needed. The current scheme (i.e. the perl script) uses excludes by having a .rsync-filter in each of the directories where you want contents excluded. This has worked well. See ~maurik/tmp/.rsync-filter. The current script takes care of some important issues, like incomplete backups.</font> Ah. So we need to get users to somehow keep that .rsync-filter file fairly updated, and to get them to use data to hold things, not home. Also, I wasn't suggesting we get rid of the perl script; I was saying that I've become familiar with a number of the things it does. [http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/ch-acls.html#s2-acls-mounting-nfs]<br />
* Continue purging NIS from ancient workstations, replacing it with flat files. The following remain:<br />
** pauli nodes -- Low priority!<br />
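For reference on the backup-filter discussion above: rsync's <code>-F</code> option is shorthand for <code>--filter='dir-merge /.rsync-filter'</code>, which makes rsync read a <code>.rsync-filter</code> file in every directory it descends into. A sketch of what such a file might contain (the patterns are examples, not the rules actually in use):<br />

```shell
#!/bin/sh
# Write an example per-directory rsync filter file. The patterns are
# illustrative only; each user would keep their own .rsync-filter in the
# directories whose contents they want excluded from backup.
demo=$(mktemp -d)
cat > "$demo/.rsync-filter" <<'EOF'
- *.o
- core.*
- tmp/
EOF
# The backup run would then look something like (not executed here,
# hostname and paths are placeholders):
#   rsync -aF --delete /home/ lentil:/backup/home/
cat "$demo/.rsync-filter"
```

Since the filter file travels with the directory, users can maintain their own exclusions without anyone touching the production perl script.<br />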
<br />
== Waiting ==<br />
* That guy's computer has a BIOS checksum error. Flashing the BIOS to the newest version succeeds, but doesn't fix the problem. No obvious mobo damage either. What happened? '''Who was that guy, anyhow?''' (Silviu Covrig, probably) The machine is gluon, according to him. '''Waiting on ASUS tech support for warranty info.''' Aaron said it might be power-supply-related. '''Nope, definitely not: used a known-good PSU and still got the error, and reflashed the BIOS with it and still got the error.''' '''Got an RMA, sending it out on Wednesday.''' Waiting on ASUS to send us a working one! Called ASUS on 8/6; they said it's getting repaired right now. '''Woohoo! Got a notification that it shipped!''' ...they didn't fix it... Still has the EXACT same error it had when we shipped it to them. '''What should we do about this?''' I'm going to call them up and have a talk, since the details on their shipment reveal that they sent us a different motherboard, different serial number and everything, but with the same problem.<br />
* Printer queue for Copier: Konica Minolta Bizhub 750. IP=pita.unh.edu '''Seems like we need info from the Konica guy to get it set up on Red Hat. The installation documentation for the driver doesn't mention things like the passcode, because those are machine-specific. Katie says that if he doesn't come on Monday, she'll make an inquiry.''' <font color="green">Mac OS X now working, IT guy should be here week of June 26th</font> '''Did he ever come?''' No, he didn't, and did not respond to a voice message left. Will call again.<br />
* Sent an email to UNH Property Control asking what the procedure is to get rid of untagged equipment, namely, the two old monitors in the corner. Apparently they want us to fill out lots of information on the scrapping form like if it was paid for with government money, etc, as well as give them serial numbers, model numbers, and everything we can get ahold of. Then, we get to hang onto them until the hazardous equipment people come in and take it out, at their leisure. Waiting to figure out what we want to do with them.<br />
<br />
== Completed ==<br />
* '''pumpkin/lentil/mariecurie/einstein2/corn''': <font color="red">It was all thanks to nss_ldap.</font> Here's a summary of the symptoms, none of which occur for root:<br />
** bash has piping problems for Steve, but sometimes not for Matt. Something like <code>ls | wc</code> will print nothing and <code>echo $?</code> will print 141, aka SIGPIPE. Backticks will cause similar problems. Something like <code>echo `ls`</code> will print nothing, and <code>`ls`; echo $?</code> also prints 141. Since several system-provided startup scripts rely on strings returned from backticks, bash errors will print upon login.<br />
** '''None''' of the bash problems seem to happen when logged onto the physical machine, rather than over SSH, but '''all''' of the tcsh problems still occur. '''Turned out that the newest version of bash on el5 systems is broken. Replacing it with an older version fixes the issue with bash. Tcsh still gives issues, but that appears to be unrelated.'''<br />
** Everything else seems fine on these two machines: disk usage, other programs, network, etc.<br />
** Since this is now not just a pumpkin issue, the problem probably isn't corrupt files, but maybe some update messed something up. Figuring out what update that is could be tough though since there was a huge chunk of updates at some point (although I suspect the problem might be bash since tcsh depends on it). However, einstein2 is pretty much a clean slate, so a simple if tedious method would be to apply updates to it gradually until the symptoms pop up.<br />
** tcsh can only run built-in commands; anything else results in a "Broken pipe" and the program not running. "Broken pipe" appears even for non-existent programs (e.g. "hhhhhhhh" will make "Broken pipe" appear).<br />
* '''Lentil''': Gotta reinstall a whole bunch of things and/or a new disk; looks like there was some damage from the power problem on Monday (the size 0 files have returned).<br />
* '''jalapeno hangups:''' Look at sensors on jalapeno, so that cacti can monitor the temp. The crashing probably isn't the splunk beta (no longer beta!), since it runs entirely in userspace. '''lm_sensors fails to detect anything readable. Is there a way around this?''' Jalapeno's been on for two weeks with no problems, let's keep our fingers crossed&hellip; '''This system is too unstable to maintain, like tomato and old einstein.''' Got an e-mail today, saying it's got a degraded array. I just turned it off since it's just a crappy space heater at this point.<br />
* Resize/clean up partitions as necessary. Seems to be a running trend that a computer gets 0 free space and problems crop up. '''This hasn't happened in half a year. I think it was a coincidence that a few computers had it happen at once.'''<br />
* Put new drive in lentil, npg-daily-33. '''That's good, because 32 is almost full already. 81%!'''<br />
* <b><font color="red">CLAMAV died and no one noticed!</font></b> The update of clamav (the mail virus scanner on einstein) on April 23rd killed this mail subsystem because some of the options in /etc/clamd.conf were now obsolete. See http://www.sfr-fresh.com/unix/misc/clamav-0.93.tar.gz:a/clamav-0.93/NEWS. This went unnoticed for a while. Are we asleep at the wheel? Edited /etc/clamd.conf to comment out these options.<br />
* When I came in today (22nd), taro had kernel panicked and einstein was acting strangely. Checking root's email, I saw that all day the 21st and 22nd, there were SMTP errors, around 2 per minute. A quick glance at them gives me the impression that they're spam attempts, due to ridiculous FROM fields like <code>pedrofinancialcompany.inc.net@tiscali.dk</code>. I rebooted taro and einstein, everything seems fine now.<br />
* Pauli crashes nearly every day, not when backups come around. We need to set up detailed system logging to find out why. Pauli2 and 4 don't give out their data via /net to the other paulis. This doesn't seem to be an autofs setting, since I see nothing about it in the working nodes' configs. Similarly, 2,4, and 6 won't access the other paulis via /net. 2,4 were nodes we rebuilt this summer, so it makes sense they don't have the right settings, but 6 is a mystery. Pauli2's hard drive may be dying. Some files in /data are inaccessible, and smartctl shows a large number of errors (98 if I'm reading this right...). Time to get Heisenberg a new hard drive? '''Or maybe just wean him off of NPG&hellip;''' It may be done for; can't connect to pauli2 and rebooting didn't seem to work. Need to set up the monitor/keyboard for it & check things out. '''The pauli nodes are all off for now. They've been deemed to produce more heat than they're worth. We'll leave them off until Heisenberg complains.''' Heisenberg's complaining now. Fixed his pauli machine by walking in the room (still don't know what he was talking about) and dirac had LDAP shut off. He wants the paulis up whenever possible, which I explained could be awhile because of the heat issues. ''' Pauli doesn't crash anymore, as far as I can tell. Switching the power supply seems to have done it.'''<br />
* Pumpkin is now stable. Read more on the configuration at [[Pumpkin]] and [[Xen]].<br />
* Roentgen was plugged into one of the non-battery-backup slots of its UPS, so I shut it down and moved the plug. After starting back up, root got a couple of mysterious e-mails about /dev/md0 and /dev/md2: <code>Array /dev/md2 has experienced event "DeviceDisappeared"</code>. However, <code>mount</code> seems to indicate that everything important is around:<br />
<pre><br />
/dev/vg_roentgen/rhel3 on / type ext3 (rw,acl)<br />
none on /proc type proc (rw)<br />
none on /dev/pts type devpts (rw,gid=5,mode=620)<br />
usbdevfs on /proc/bus/usb type usbdevfs (rw)<br />
/dev/md1 on /boot type ext3 (rw)<br />
none on /dev/shm type tmpfs (rw)<br />
/dev/vg_roentgen/rhel3_var on /var type ext3 (rw)<br />
/dev/vg_roentgen/wheel on /wheel type ext3 (rw,acl)<br />
/dev/vg_roentgen/srv on /srv type ext3 (rw,acl)<br />
/dev/vg_roentgen/dropbox on /var/www/dropbox type ext3 (rw)<br />
/usr/share/ssl on /etc/ssl type none (rw,bind)<br />
/proc on /var/lib/bind/proc type none (rw,bind)<br />
automount(pid1503) on /net type autofs (rw,fd=5,pgrp=1503,minproto=2,maxproto=4)<br />
</pre>and all of the sites listed on [[Web Servers]] work. Were those just old arrays that aren't around anymore but are still listed in some config file? '''We haven't seen any issues, and roentgen's going to be virtualized in the not-too-distant future, so this is fairly irrelevant.'''<br />
* Gourd's been giving smartd errors, namely<code><br />
Offline uncorrectable sectors detected:<br />
/dev/sda [3ware_disk_00] - 48 Time(s)<br />
1 offline uncorrectable sectors detected<br />
</code>Okra also has an offline uncorrectable sector! '''No sign of problems since this was posted.'''<br />
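A side note on the nss_ldap item above: the 141 exit status is not arbitrary. Shells report death-by-signal as 128 plus the signal number, and SIGPIPE is signal 13, so 141 really does mean SIGPIPE. A healthy bash reproduces it on demand:<br />

```shell
#!/bin/sh
# Ask bash for the first pipeline component's status via PIPESTATUS.
# `head` exits after one line and closes the pipe; `yes` then dies of
# SIGPIPE on its next write. 141 = 128 + 13 (SIGPIPE).
st=$(bash -c 'yes | head -n 1 >/dev/null; echo "${PIPESTATUS[0]}"')
echo "yes exited with status $st (128 + 13 = SIGPIPE)"
# prints: yes exited with status 141 (128 + 13 = SIGPIPE)
```

The difference on the broken machines was that ordinary commands were dying of SIGPIPE when they shouldn't have, which is why replacing the bad bash build made the 141s go away.<br />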
<br />
== Previous Months Completed ==<br />
[[Completed in June 2007|June 2007]]<br />
<br />
[[Completed in July 2007|July 2007]]<br />
<br />
[[Completed in August 2007|August 2007]]<br />
<br />
[[Completed in September 2007|September 2007]]<br />
<br />
[[Completed in October 2007|October 2007]]<br />
<br />
[[Completed in November/December 2007|NovDec 2007]]<br />
<br />
[[Completed in January 2008|January 2008]]<br />
<br />
[[Completed in February 2008|February 2008]]</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Sysadmin_Todo_List&diff=3713Sysadmin Todo List2008-06-03T17:41:52Z<p>Steve: /* Miscellaneous */ old info</p>
<hr />
<div>This is an unordered set of tasks. Detailed information on any of the tasks typically goes in related topics' pages, although usually not until the task has been filed under [[Sysadmin Todo List#Completed|Completed]].<br />
== Daily Check off list ==<br />
Each day when you come in check the following:<br />
# <font color="red">Do not update nss_ldap on RHEL5 machines until they fix it</font><br />
# Einstein ([[Script Prototypes|script]]):<br />
## Up and running?<br />
## Disks are at less than 90% full?<br />
## Mail system OK? (spamassassin, amavisd, ...)<br />
# Temperature OK? No water blown into room?<br />
# Systems up: Taro, Pepper, Pumpkin/Corn ?<br />
# Backups:<br />
## Did backup succeed?<br />
## Does Lentil need a new disk?<br />
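The disk check in step 2 is easy to script rather than eyeball. A minimal sketch (this is not the existing einstein check script; the 90% threshold and plain <code>df -P</code> parsing are the only assumptions):<br />

```shell
#!/bin/sh
# Print any filesystem at or above the threshold. `df -P` guarantees
# POSIX one-line-per-filesystem output, so column 5 is always Capacity.
THRESHOLD=90

check_df() {
    awk -v max="$THRESHOLD" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)
        if (pct + 0 >= max + 0)
            printf "FULL: %s at %s (mounted on %s)\n", $1, $5, $6
    }'
}

df -P | check_df
```

Run from cron with output mailed to root, this doubles as part of the morning email check.<br />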
<br />
== Important ==<br />
<br />
=== Towards a stable setup ===<br />
<br />
Here are some thoughts, especially to Steve, about getting an über-stable setup for the servers. <br><br />
Some observations:<br />
# When we get to DeMeritt at the end of next summer, we need a setup that easily ports to the new environment. We will also be limited to a total of 10 kW of heat load (about 34000 BTU/hr, or just under 3 tons of cooling), due to the cooling capacity of the room. That sounds like a lot, but Silas and Jiang-Ming will also put servers in this space. Our footprint should be no more than 3 to 4 kW of heat load.<br />
# Virtual systems seem like the way to go. However, our experience with Xen is that it does not lead to highly portable VMs.<br />
# VMware Server is now a free product. They make money consulting and selling fancy add-ons. I have good experience with VMware Workstation on Macs and Linux. But it is possible (as with RedHat, which was once free) that they will start charging once they reach 90% or more market share.<br />
<br />
Here are some options: <br><br />
* We get rid of Tomato, Jalapeno, Gourd and Okra, and perhaps also Roentgen. If we want, we can scavenge the parts from Tomato & Jalapeno (plus old einstein) for a toy system, or we park these systems in the corner. I don't want to waste time on them. The only bumps that I can think of here would be that Xemed/Aaron use Gourd. Otherwise I think we're all in favor of cutting down on the number of physical machines that we've got running. Oh, and what about the paulis? '''Since they're not under our "jurisdiction", they'll probably end up there anyhow.'''<br />
* Test VMware server (See [[VMWare Progress]]). Specifically, I would like to know:<br />
## How easy is it to move a VM from one hardware to another? (Can you simply move the disks?) '''Yes.'''<br />
## Specifically, if you need to service some hardware, can you move the host to other hardware with little down time? (Clearly not for large disk arrays, like pumpkin, but that is storage, not hosts). '''Considering portability of disks/files, the downtime is the time it takes to move the image around and start up on another machine.'''<br />
## Do we need a RedHat license for each VM or do we only need a license for the host, as with Xen? '''It seems to consume a license per VM. Following [http://kbase.redhat.com/faq/FAQ_103_10754.shtm this] didn't work for the VMWare systems. The closest thing to an official word that I could find was [http://www.redhat.com/archives/taroon-list/2004-August/msg00292.html this].'''<br />
## VMware allows for "virtual appliances", but how good are these really? Are these fast enough?<br />
* Evaluate the hardware needs. Pumpkin, the new Einstein, Pepper, Taro and Lentil all seem to be of sufficient quality and up to date. Do we need other hardware? If so, what?<br />
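A quick sanity check on the cooling numbers in the observations above, using 1 kW = 3412 BTU/hr and 1 ton of cooling = 12000 BTU/hr:<br />

```shell
#!/bin/sh
# Convert the 10 kW room budget into BTU/hr and tons of cooling.
awk 'BEGIN {
    kw  = 10
    btu = kw * 3412          # 1 kW ~ 3412 BTU/hr
    printf "%.0f BTU/hr = %.2f tons of cooling\n", btu, btu / 12000
}'
# prints: 34120 BTU/hr = 2.84 tons of cooling
```

At our target footprint of 3 to 4 kW, that works out to only about 1 ton of the room's cooling.<br />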
<br />
=== Einstein Upgrade ===<br />
<br />
Einstein upgrade project and status page: [[Einstein Status]]<br />
'''Note:''' Einstein (current one) has a problem with / getting full occasionally. See [[Einstein#Special_Considerations_for_Einstein]]<br />
<br />
It seems this is not moving forward fast enough; we need a new strategy to get it accomplished. My new thought is to abandon the Tomato hardware, which may have been a source of the difficulties, and use what we learned from the "Einstein on RHEL5" setup to create a virtual-machine Tomato on which we can test the upgrade to RHEL5.<br />
<br />
=== Miscellaneous ===<br />
* Lepton constantly has problems printing; at least once a month the queue locks up. This machine has Fedora Core 3 installed, so it might be worthwhile to just put CentOS on it and be done with this recurring problem.<br />
</div>
<hr />
<div>This is an unordered set of tasks. Detailed information on any of the tasks typically goes in related topics' pages, although usually not until the task has been filed under [[Sysadmin Todo List#Completed|Completed]].<br />
== Daily Check off list ==<br />
Each day when you come in check the following:<br />
# <font color="red">Do not update nss_ldap on RHEL5 machines until they fix it</font><br />
# Einstein ([[Script Prototypes|script]]):<br />
## Up and running?<br />
## Disks are at less than 90% full?<br />
## Mail system OK? (spamassasin, amavisd, ...)<br />
# Temperature OK? No water blown into room?<br />
# Systems up: Taro, Pepper, Pumpkin/Corn ?<br />
# Backups:<br />
## Did backup succeed?<br />
## Does Lentil need a new disk?<br />
<br />
== Important ==<br />
<br />
=== Towards a stable setup ===<br />
<br />
Here are some thoughts, especially to Steve, about getting an über-stable setup for the servers. <br><br />
Some observations:<br />
# When we get to DeMeritt at the end of next summer, we need a setup that easily ports to the new environment. We will also be limited to a total of 10 kWatts heat load (36000 BTUs, or 3 tons of cooling), due to the cooling of the room. That sounds like a lot, but Silas and Jiang-Ming will also put servers in this space. Our footprint should be no more than 3 to 4 kWatts of heat load.<br />
# Virtual systems seems like the way to go. However, our experience with Xen is that it does not lead to highly portable VMs.<br />
# VMware Server is now a free product. They make money consulting and selling fancy add-ons. I have good experience with VMware Workstation on Mac's and Linux. But it is possible (like RedHat which was once free) that they will start charging when they reach 90% or more market share.<br />
<br />
Here are some options: <br><br />
* We get rid of Tomato, Jalapeno, Gourd and Okra and perhaps also Roentgen. If we want we can scavenge the parts from Tomato & Jalapeno (plus old einstein) for a toy system, or we park these systems in the corner. I don't want to waste time on them. The only bumps that I can think of here would be that Xemed/Aaron use Gourd. Otherwise I think we're all in favor of cutting down on the number of physical machines that we've got running. Oh, and what about the paulis? '''Since they're not under our "jurisdiction", they'll probably end up there anyhow.'''<br />
* Test VMware server (See [[VMWare Progress]]). Specifically, I would like to know:<br />
## How easy is it to move a VM from one hardware to another? (Can you simply move the disks?) '''Yes.'''<br />
## Specifically, if you need to service some hardware, can you move the host to other hardware with little down time? (Clearly not for large disk arrays, like pumpkin, but that is storage, not hosts). '''Considering portability of disks/files, the downtime is the time it takes to move the image around and start up on another machine.'''<br />
## Do we need a RedHat license for each VM or do we only need a license for the host, as with Xen? '''It seems to consume a license per VM. Following [http://kbase.redhat.com/faq/FAQ_103_10754.shtm this] didn't work for the VMWare systems. The closest thing to an official word that I could find was [http://www.redhat.com/archives/taroon-list/2004-August/msg00292.html this].'''<br />
## VMware allows for "virtual appliances", but how good are these really? Are these fast enough?<br />
* Evaluate the hardware needs. Pumpkin, the new Einstein, Pepper, Taro and Lentil seem to be all sufficient quality and up to date. Do we need another HW? If so, what?<br />
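On portability (point 1 in the list): if moving a VM really is just moving its disks, the migration step reduces to copying the guest's directory. A hedged sketch; the move_vm helper and the paths are invented for illustration, and the guest must be powered off first:<br />

```shell
#!/bin/sh
# Hypothetical sketch: a VMware Server guest lives in one directory of
# .vmx/.vmdk files, so migrating it to other hardware is a file copy.
# move_vm and the example paths below are invented.
move_vm() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    cp -a "$src" "$dest"   # rsync -a --sparse also works and keeps vmdks sparse
}

# Example (guest powered off):
# move_vm "/var/lib/vmware/Virtual Machines/roentgen" /mnt/taro-vms/
```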
<br />
=== Einstein Upgrade ===<br />
<br />
Einstein upgrade project and status page: [[Einstein Status]]<br />
'''Note:''' Einstein (current one) has a problem with / getting full occasionally. See [[Einstein#Special_Considerations_for_Einstein]]<br />
<br />
It seems this is not moving forward sufficiently. I think we need a new strategy to get this accomplished. My new thought is to abandon the Tomato hardware, which may have been a source of the difficulties, and use what we learned for the setup of "Einstein on RHEL5" to create a virtual machine Tomato, where we test the upgrade to RHEL5. <br />
<br />
=== Miscellaneous ===<br />
* '''MarieCurie''': Feynman's video card doesn't fit in any of mariecurie's slots (is it AGP or something?). I'm going to see if blackbody's Nvidia (which can do widescreen with the "nv" driver) fits, whenever Sarah isn't busy with her machine. If it does, then she can take the card, at least while I mess around with her ATI card in blackbody. '''blackbody's card doesn't fit in any of the slots, either. I have no clue what kind of connection it needs. The next step is to just look for and order a PCI/PCI-X NVidia card that's known to work at 1680x1050 on RedHat.''' Tried the 6200LE, had the same problems as the ATI cards!<br />
* Lepton constantly has problems printing: at least once a month the queue locks up. This machine has Fedora Core 3 installed; it might be worthwhile to just put CentOS on it and be done with this recurring problem.<br />
* Fermi has problems allowing me to log in. nsswitch.conf looks fine, getent passwd shows all the users like it's supposed to. There are no restrictions in ''/etc/security/access.conf'', either.<br />
* Gourd won't let me (Matt) log in, saying no such file or directory when trying to chdir to my home, and then it boots me off. Trying to log in as root from einstein is successful just long enough for it to tell me when the last login was, then boots me. '''(Steve here) I was able to log in and do stuff, but programs were intermittently slow.'''<br />
* Clean out some users who have left a while ago. (Maurik should do this.)<br />
* '''Monitoring''': I would like to see the new temp-monitor integrated with Cacti, and fix some of the cacti capabilities, i.e. tie it in with the sensors output from pepper and taro (and tomato/einstein). Set up sensors on corn/pumpkin. Have an intelligent way in which we are warned when conditions are too hot, a drive has failed, or a system is down. '''I'm starting to get the hang of getting this sort of data via snmp. I wrote a perl script that pulls the temperature data from the environmental monitor, as well as some nice info from einstein. We SHOULD be able to integrate a rudimentary script like this into cacti or splunk, getting a bit closer to an all-in-one monitoring solution. It's in Matt's home directory, under code/npgmon/'''<br />
* Check into smartd monitoring (and processing its output) on Pepper, Taro, Corn/Pumpkin, Einstein, Tomato.<br />
* Decommission Okra. - This system is way too outdated to bother with it. Move Cacti to another system. Perhaps a VM, once we get that figured out?<br />
* Learn how to use [[cacti]]. We should consider using a VM appliance to do this, so it's minimal configuration, and since okra's only purpose is to run cacti.<br />
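As a sketch of the snmp polling mentioned under Monitoring (the OID, community string, and hostname below are placeholders, not our real setup):<br />

```shell
#!/bin/sh
# Placeholder OID/host; actual polling requires net-snmp's snmpget.
# The parsing stage just extracts the numeric reading from an INTEGER line.
parse_temp() {
    # expects lines like: SNMPv2-SMI::enterprises.x.y.z = INTEGER: 23
    awk -F': ' '/INTEGER/ { print $NF }'
}

if command -v snmpget > /dev/null 2>&1; then
    snmpget -v1 -c public envmon.example.edu .1.3.6.1.4.1.99999.1.1 | parse_temp
fi
```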
<br />
== Ongoing ==<br />
=== Documentation ===<br />
* '''<font color="red" size="+1">Maintain the Documentation of all systems!</font>'''<br />
** Main function<br />
** Hardware<br />
** OS<br />
** Network<br />
* Continue homogenizing the configurations of the machines.<br />
* Improve documentation of [[Software Issues#Mail Chain Dependencies|mail software]], specifically SpamAssassin, Cyrus, etc.<br />
=== Maintenance ===<br />
* Check e-mails to root every morning<br />
* Check up on security [http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/ch-sec-network.html#ch-wstation]<br />
* Clean up Room 202.<br />
** Start reorganizing things back into boxes for the August move.<br />
** Ask UNH if they are willing/able to recycle/reuse the three CRTs '''and old machines''' that we have sitting around. '''Give them away if we have to.'''<br />
<br />
=== On-the-Side ===<br />
* Learn how to use ssh-agent for task automation.<br />
* Backup stuff: We need exclude filters on the backups. We need to plan and execute extensive tests before modifying the production backup program. Also, see if we can implement some sort of NFS user access. '''I've set up both filters and read-only snapshot access to backups at home. Uses what essentially amounts to a bash script version of the fancy perl thing we use now, only far less sophisticated. However, the filtering and user access use a standard rsync exclude file (syntax in man page) and the user access is fairly obvious NFS read-only hosting.''' <font color="green"> I am wondering if this is needed. The current scheme (i.e. the perl script) uses excludes by having a .rsync-filter in each of the directories where you want excluded contents. This has worked well. See ~maurik/tmp/.rsync-filter . The current script takes care of some important issues, like incomplete backups.</font> Ah. So we need to get users to somehow keep that .rsync-filter file fairly updated. And to get them to use data to hold things, not home. Also, I wasn't suggesting we get rid of the perl script, I was saying that I've become familiar with a number of the things it does. [http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/ch-acls.html#s2-acls-mounting-nfs]<br />
* Continue purging NIS from ancient workstations, replacing it with files. The following remain:<br />
** pauli nodes -- Low priority!<br />
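The per-directory .rsync-filter scheme discussed under "Backup stuff" works through rsync's -F option, which picks up filter files as it descends the tree. A minimal sketch with invented paths (filter syntax per the rsync man page):<br />

```shell
#!/bin/sh
# Invented paths; demonstrates a per-directory .rsync-filter that excludes
# a cache directory from the backup. -F tells rsync to honor these files.
src=$(mktemp -d)
dest=$(mktemp -d)
mkdir -p "$src/project/cache"
echo keep > "$src/project/thesis.tex"
echo junk > "$src/project/cache/tmp.o"
echo '- cache/' > "$src/project/.rsync-filter"

command -v rsync > /dev/null 2>&1 && rsync -aF "$src/" "$dest/" || echo "rsync not installed"
```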
<br />
== Waiting ==<br />
* That guy's computer has a BIOS checksum error. Flashing the BIOS to the newest version succeeds, but doesn't fix the problem. No obvious mobo damage either. What happened? '''Who was that guy, anyhow?''' (Silviu Covrig, probably) The machine is gluon, according to him. '''Waiting on ASUS tech support for warranty info.''' Aaron said it might be power-supply-related. '''Nope. Definitely not. Used a known good PSU and still got the error, reflashed the BIOS with it and still got the error.''' Got RMA, sending out on Wednesday. '''Waiting on ASUS to send us a working one!''' Called ASUS on 8/6, they said it's getting repaired right now. '''Wohoo! Got a notification that it shipped!''' ...they didn't fix it... Still has the EXACT same error it had when we shipped it to them. '''What should we do about this?''' I'm going to call them up and have a talk; the details on their shipment show that they sent us a different motherboard (different serial number and everything) but with the same problem.<br />
* Printer queue for Copier: Konica Minolta Bizhub 750. IP=pita.unh.edu '''Seems like we need info from the Konica guy to get it set up on Red Hat. The installation documentation for the driver doesn't mention things like the passcode, because those are machine-specific. Katie says that if he doesn't come on Monday, she'll make an inquiry.''' <font color="green">Mac OS X now working, IT guy should be here week of June 26th</font> '''Did he ever come?''' No, he didn't, and did not respond to a voice message left. Will call again.<br />
* Sent an email to UNH Property Control asking what the procedure is to get rid of untagged equipment, namely, the two old monitors in the corner. Apparently they want us to fill out lots of information on the scrapping form like if it was paid for with government money, etc, as well as give them serial numbers, model numbers, and everything we can get ahold of. Then, we get to hang onto them until the hazardous equipment people come in and take it out, at their leisure. Waiting to figure out what we want to do with them.<br />
<br />
== Completed ==<br />
* '''pumpkin/lentil/mariecurie/einstein2/corn''': <font color="red">It was all thanks to nss_ldap.</font> Here's a summary of the symptoms, none of which occur for root:<br />
** bash has piping problems for Steve, but sometimes not for Matt. Something like <code>ls | wc</code> will print nothing and <code>echo $?</code> will print 141, aka SIGPIPE. Backticks will cause similar problems. Something like <code>echo `ls`</code> will print nothing, and <code>`ls`; echo $?</code> also prints 141. Since several system-provided startup scripts rely on strings returned from backticks, bash errors will print upon login.<br />
** '''None''' of the bash problems seem to happen when logged onto the physical machine, rather than over SSH, but '''all''' of the tcsh problems still occur. '''Turned out that the newest version of bash on el5 systems is broken. Replacing it with an older version fixes the issue with bash. Tcsh still gives issues, but that appears to be unrelated.'''<br />
** Everything else seems fine on these machines: disk usage, other programs, network, etc.<br />
** Since this is now not just a pumpkin issue, the problem probably isn't corrupt files, but maybe some update messed something up. Figuring out what update that is could be tough though since there was a huge chunk of updates at some point (although I suspect the problem might be bash since tcsh depends on it). However, einstein2 is pretty much a clean slate, so a simple if tedious method would be to apply updates to it gradually until the symptoms pop up.<br />
** tcsh can only run built-in commands; anything else results in a "Broken pipe" and the program not running. "Broken pipe" appears even for non-existent programs (e.g. "hhhhhhhh" will make "Broken pipe" appear).<br />
* '''Lentil''': Gotta reinstall a whole bunch of things and/or a new disk; looks like there was some damage from the power problem on Monday (the size 0 files have returned).<br />
* '''jalapeno hangups:''' Look at sensors on jalapeno, so that cacti can monitor the temp. The crashing probably isn't the splunk beta (no longer beta!), since it runs entirely in userspace. '''lm_sensors fails to detect anything readable. Is there a way around this?''' Jalapeno's been on for two weeks with no problems, let's keep our fingers crossed&hellip; '''This system is too unstable to maintain, like tomato and old einstein.''' Got an e-mail today, saying it's got a degraded array. I just turned it off since it's just a crappy space heater at this point.<br />
* Resize/clean up partitions as necessary. Seems to be a running trend that a computer gets 0 free space and problems crop up. '''This hasn't happened in half a year. I think it was a coincidence that a few computers had it happen at once.'''<br />
* Put new drive in lentil, npg-daily-33. '''That's good, because 32 is almost full already. 81%!'''<br />
* <b><font color="red">CLAMAV died and no one noticed!</font></b> The update of clamav (mail virus scanner on einstein) on April 23rd killed this mail subsystem because some of the options in /etc/clamd.conf were now obsolete. See http://www.sfr-fresh.com/unix/misc/clamav-0.93.tar.gz:a/clamav-0.93/NEWS. This seemed to have gone unnoticed for a while. Are we asleep at the wheel? Edited /etc/clamd.conf to comment out these options.<br />
* When I came in today (22nd), taro had kernel panicked and einstein was acting strangely. Checking root's email, I saw that all day the 21st and 22nd, there were SMTP errors, around 2 per minute. A quick glance at them gives me the impression that they're spam attempts, due to ridiculous FROM fields like <code>pedrofinancialcompany.inc.net@tiscali.dk</code>. I rebooted taro and einstein, everything seems fine now.<br />
* Pauli crashes nearly every day, not when backups come around. We need to set up detailed system logging to find out why. Pauli2 and 4 don't give out their data via /net to the other paulis. This doesn't seem to be an autofs setting, since I see nothing about it in the working nodes' configs. Similarly, 2,4, and 6 won't access the other paulis via /net. 2,4 were nodes we rebuilt this summer, so it makes sense they don't have the right settings, but 6 is a mystery. Pauli2's hard drive may be dying. Some files in /data are inaccessible, and smartctl shows a large number of errors (98 if I'm reading this right...). Time to get Heisenberg a new hard drive? '''Or maybe just wean him off of NPG&hellip;''' It may be done for; can't connect to pauli2 and rebooting didn't seem to work. Need to set up the monitor/keyboard for it & check things out. '''The pauli nodes are all off for now. They've been deemed to produce more heat than they're worth. We'll leave them off until Heisenberg complains.''' Heisenberg's complaining now. Fixed his pauli machine by walking in the room (still don't know what he was talking about) and dirac had LDAP shut off. He wants the paulis up whenever possible, which I explained could be awhile because of the heat issues. ''' Pauli doesn't crash anymore, as far as I can tell. Switching the power supply seems to have done it.'''<br />
* Pumpkin is now stable. Read more on the configuration at [[Pumpkin]] and [[Xen]].<br />
* Roentgen was plugged into one of the non-battery-backup slots of its UPS, so I shut it down and moved the plug. After starting back up, root got a couple of mysterious e-mails about /dev/md0 and /dev/md2: <code>Array /dev/md2 has experienced event "DeviceDisappeared"</code>. However, <code>mount</code> seems to indicate that everything important is around:<br />
<pre><br />
/dev/vg_roentgen/rhel3 on / type ext3 (rw,acl)<br />
none on /proc type proc (rw)<br />
none on /dev/pts type devpts (rw,gid=5,mode=620)<br />
usbdevfs on /proc/bus/usb type usbdevfs (rw)<br />
/dev/md1 on /boot type ext3 (rw)<br />
none on /dev/shm type tmpfs (rw)<br />
/dev/vg_roentgen/rhel3_var on /var type ext3 (rw)<br />
/dev/vg_roentgen/wheel on /wheel type ext3 (rw,acl)<br />
/dev/vg_roentgen/srv on /srv type ext3 (rw,acl)<br />
/dev/vg_roentgen/dropbox on /var/www/dropbox type ext3 (rw)<br />
/usr/share/ssl on /etc/ssl type none (rw,bind)<br />
/proc on /var/lib/bind/proc type none (rw,bind)<br />
automount(pid1503) on /net type autofs (rw,fd=5,pgrp=1503,minproto=2,maxproto=4)<br />
</pre> and all of the sites listed on [[Web Servers]] work. Were those just old arrays that aren't around anymore but are still listed in some config file? '''We haven't seen any issues, and roentgen's going to be virtualized in the not-too-distant future, so this is fairly irrelevant.'''<br />
* Gourd's been giving smartd errors, namely <code><br />
Offline uncorrectable sectors detected:<br />
/dev/sda [3ware_disk_00] - 48 Time(s)<br />
1 offline uncorrectable sectors detected<br />
</code> Okra also has an offline uncorrectable sector! '''No sign of problems since this was posted.'''<br />
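A footnote on the nss_ldap/bash item at the top of this section: exit status 141 decodes as 128 + 13, and signal 13 is SIGPIPE, so a status of 141 always means the process died writing to a closed pipe. A healthy shell shows the same status when SIGPIPE is legitimate:<br />

```shell
#!/bin/sh
# 141 = 128 + SIGPIPE(13). `head` exits after one line and closes the pipe,
# so `yes` dies with SIGPIPE on its next write; bash's pipefail surfaces
# yes's failure status instead of head's 0.
status=$(bash -c 'set -o pipefail; yes | head -n 1 > /dev/null; echo $?')
echo "pipeline exit status: $status"   # 141 on Linux
```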
<br />
== Previous Months Completed ==<br />
[[Completed in June 2007|June 2007]]<br />
<br />
[[Completed in July 2007|July 2007]]<br />
<br />
[[Completed in August 2007|August 2007]]<br />
<br />
[[Completed in September 2007|September 2007]]<br />
<br />
[[Completed in October 2007|October 2007]]<br />
<br />
[[Completed in November/December 2007|NovDec 2007]]<br />
<br />
[[Completed in January 2008|January 2008]]<br />
<br />
[[Completed in February 2008|February 2008]]</div>
* Roentgen was plugged into one of the non-battery-backup slots of its UPS, so I shut it down and moved the plug. After starting back up, root got a couple of mysterious e-mails about /dev/md0 and /dev/md2: <code>Array /dev/md2 has experienced event "DeviceDisappeared"</code>. However, <code>mount</code> seems to indicate that everything important is around:<br />
<pre><br />
/dev/vg_roentgen/rhel3 on / type ext3 (rw,acl)<br />
none on /proc type proc (rw)<br />
none on /dev/pts type devpts (rw,gid=5,mode=620)<br />
usbdevfs on /proc/bus/usb type usbdevfs (rw)<br />
/dev/md1 on /boot type ext3 (rw)<br />
none on /dev/shm type tmpfs (rw)<br />
/dev/vg_roentgen/rhel3_var on /var type ext3 (rw)<br />
/dev/vg_roentgen/wheel on /wheel type ext3 (rw,acl)<br />
/dev/vg_roentgen/srv on /srv type ext3 (rw,acl)<br />
/dev/vg_roentgen/dropbox on /var/www/dropbox type ext3 (rw)<br />
/usr/share/ssl on /etc/ssl type none (rw,bind)<br />
/proc on /var/lib/bind/proc type none (rw,bind)<br />
automount(pid1503) on /net type autofs (rw,fd=5,pgrp=1503,minproto=2,maxproto=4)<br />
</pre>and all of the sites listed on [[Web Servers]] work. Were those just old arrays that aren't around anymore but are still listed in some config file? '''We haven't seen any issues, and roentgen's going to be virtualized in the not-too-distant future, so this is fairly irrelevant.'''<br />
* Gourd's been giving smartd errors, namely:<pre><br />
Offline uncorrectable sectors detected:<br />
/dev/sda [3ware_disk_00] - 48 Time(s)<br />
1 offline uncorrectable sectors detected<br />
</pre>Okra also has an offline uncorrectable sector! '''No sign of problems since this was posted.'''<br />
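For the record, the 141 exit status reported in the bash/tcsh pipe bullet above is not mysterious: shells report death-by-signal as 128 + the signal number, and signal 13 is SIGPIPE, which matches the "Broken pipe" symptoms. A quick bash demonstration:<br />

```shell
# 141 = 128 + 13, and signal 13 is SIGPIPE:
kill -l 13                # prints: PIPE

# Reproduce it deliberately: head closes the pipe after one line, so the
# writer (yes) dies of SIGPIPE and its entry in PIPESTATUS is 141.
yes | head -n 1 > /dev/null
echo "${PIPESTATUS[0]}"   # prints: 141
```

So any command whose pipeline status decodes to 141 was killed by a broken pipe, which is why the symptom shows up even for nonexistent programs once the shell's own startup pipelines are affected.<br />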
<br />
== Previous Months Completed ==<br />
[[Completed in June 2007|June 2007]]<br />
<br />
[[Completed in July 2007|July 2007]]<br />
<br />
[[Completed in August 2007|August 2007]]<br />
<br />
[[Completed in September 2007|September 2007]]<br />
<br />
[[Completed in October 2007|October 2007]]<br />
<br />
[[Completed in November/December 2007|NovDec 2007]]<br />
<br />
[[Completed in January 2008|January 2008]]<br />
<br />
[[Completed in February 2008|February 2008]]</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Sysadmin_Todo_List&diff=3710Sysadmin Todo List2008-06-02T16:07:13Z<p>Steve: /* Miscellaneous */</p>
<hr />
<div>This is an unordered set of tasks. Detailed information on any of the tasks typically goes in related topics' pages, although usually not until the task has been filed under [[Sysadmin Todo List#Completed|Completed]].<br />
== Daily Check off list ==<br />
Each day when you come in check the following:<br />
# Einstein ([[Script Prototypes|script]]):<br />
## Up and running?<br />
## Disks are at less than 90% full?<br />
## Mail system OK? (spamassassin, amavisd, ...)<br />
# Temperature OK? No water blown into room?<br />
# Systems up: Taro, Pepper, Pumpkin/Corn ?<br />
# Backups:<br />
## Did backup succeed?<br />
## Does Lentil need a new disk?<br />
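A minimal sketch of the einstein portion of this check-off list as a script; the 90% threshold comes from the list above, but the exact daemon process names (spamassassin, amavisd, clamd, dovecot, postfix) are assumptions about einstein's setup, not verified config:<br />

```shell
#!/bin/bash
# Daily-check sketch: flag any filesystem at 90% or more full, and any
# mail-chain daemon that isn't running. Daemon names are assumptions.
full=$(df -P | awk 'NR>1 { sub(/%/,"",$5); if ($5+0 >= 90) print $6, $5"%" }')
[ -n "$full" ] && { echo "Disks at/over 90% full:"; echo "$full"; }

for svc in spamassassin amavisd clamd dovecot postfix; do
    pgrep -x "$svc" > /dev/null || echo "WARNING: $svc not running"
done
```

It only prints when something needs attention, so it could be run from cron and mailed to root along with everything else.<br />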
<br />
== Important ==<br />
<br />
=== Towards a stable setup ===<br />
<br />
Here are some thoughts, especially to Steve, about getting an über-stable setup for the servers. <br><br />
Some observations:<br />
# When we get to DeMeritt at the end of next summer, we need a setup that easily ports to the new environment. We will also be limited to a total of 10 kWatts heat load (36000 BTUs, or 3 tons of cooling), due to the cooling of the room. That sounds like a lot, but Silas and Jiang-Ming will also put servers in this space. Our footprint should be no more than 3 to 4 kWatts of heat load.<br />
# Virtual systems seem like the way to go. However, our experience with Xen is that it does not lead to highly portable VMs.<br />
# VMware Server is now a free product. They make money consulting and selling fancy add-ons. I have good experience with VMware Workstation on Macs and Linux. But it is possible (as with Red Hat, which was once free) that they will start charging once they reach 90% or more market share.<br />
<br />
Here are some options: <br><br />
* We get rid of Tomato, Jalapeno, Gourd and Okra and perhaps also Roentgen. If we want we can scavenge the parts from Tomato & Jalapeno (plus old einstein) for a toy system, or we park these systems in the corner. I don't want to waste time on them. The only bumps that I can think of here would be that Xemed/Aaron use Gourd. Otherwise I think we're all in favor of cutting down on the number of physical machines that we've got running. Oh, and what about the paulis? '''Since they're not under our "jurisdiction", they'll probably end up there anyhow.'''<br />
* Test VMware server (See [[VMWare Progress]]). Specifically, I would like to know:<br />
## How easy is it to move a VM from one hardware to another? (Can you simply move the disks?) '''Yes.'''<br />
## Specifically, if you need to service some hardware, can you move the host to other hardware with little down time? (Clearly not for large disk arrays, like pumpkin, but that is storage, not hosts). '''Considering portability of disks/files, the downtime is the time it takes to move the image around and start up on another machine.'''<br />
## Do we need a RedHat license for each VM or do we only need a license for the host, as with Xen? '''It seems to consume a license per VM. Following [http://kbase.redhat.com/faq/FAQ_103_10754.shtm this] didn't work for the VMWare systems. The closest thing to an official word that I could find was [http://www.redhat.com/archives/taroon-list/2004-August/msg00292.html this].'''<br />
## VMware allows for "virtual appliances", but how good are these really? Are these fast enough?<br />
* Evaluate the hardware needs. Pumpkin, the new Einstein, Pepper, Taro, and Lentil all seem to be of sufficient quality and up to date. Do we need any other hardware? If so, what?<br />
<br />
=== Einstein Upgrade ===<br />
<br />
Einstein upgrade project and status page: [[Einstein Status]]<br />
'''Note:''' Einstein (current one) has a problem with / getting full occasionally. See [[Einstein#Special_Considerations_for_Einstein]]<br />
<br />
It seems this is not moving forward sufficiently. I think we need a new strategy to get this accomplished. My new thought is to abandon the Tomato hardware, which may have been a source of the difficulties, and use what we learned for the setup of "Einstein on RHEL5" to create a virtual machine Tomato, where we test the upgrade to RHEL5. <br />
<br />
<br />
<br />
=== Miscellaneous ===<br />
* '''MarieCurie''': Feynman's video card doesn't fit in any of mariecurie's slots (is it AGP or something?). I'm going to see if blackbody's Nvidia (which can do widescreen with the "nv" driver) fits, whenever Sarah isn't busy with her machine. If it does, then she can take the card, at least while I mess around with her ATI card in blackbody. '''blackbody's card doesn't fit in any of the slots, either. I have no clue what kind of connection it needs. The next step is to just look for and order a PCI/PCI-X NVidia card that's known to work at 1680x1050 on RedHat.''' Tried the 6200LE, had the same problems as the ATI cards!<br />
* Lepton constantly has problems printing. It seems that at least once a month the queue locks up. This machine has Fedora Core 3 installed; I wonder if it would be more worthwhile to just put CentOS on it and be done with this recurring problem.<br />
* Fermi has problems allowing me to log in. nsswitch.conf looks fine, getent passwd shows all the users like it's supposed to. There are no restrictions in ''/etc/security/access.conf'', either.<br />
* Gourd won't let me (Matt) log in, saying no such file or directory when trying to chdir to my home, and then it boots me off. Trying to log in as root from einstein is successful just long enough for it to tell me when the last login was, then boots me. '''(Steve here) I was able to log in and do stuff, but programs were intermittently slow.'''<br />
* Clean out some users who have left a while ago. (Maurik should do this.)<br />
* '''Monitoring''': I would like to see the new temp-monitor integrated with Cacti, and fix some of the cacti capabilities, i.e. tie it in with the sensors output from pepper and taro (and tomato/einstein). Setup sensors on the corn/pumpkin. Have an intelligent way in which we are warned when conditions are too hot, a drive has failed, a system is down. '''I'm starting to get the hang of getting this sort of data via snmp. I wrote a perl script that pulls the temperature data from the environmental monitor, as well as some nice info from einstein. We SHOULD be able to integrate a rudimentary script like this into cacti or splunk, getting a bit closer to an all-in-one monitoring solution. It's in Matt's home directory, under code/npgmon/'''<br />
* Check into smartd monitoring (and processing its output) on Pepper, Taro, Corn/Pumpkin, Einstein, Tomato.<br />
* Decommission Okra. - This system is way too outdated to bother with it. Move Cacti to another system. Perhaps a VM, once we get that figured out?<br />
* Learn how to use [[cacti]]. We should consider using a VM appliance to do this, so it's minimal configuration, and since okra's only purpose is to run cacti.<br />
<br />
== Ongoing ==<br />
=== Documentation ===<br />
* '''<font color="red" size="+1">Maintain the Documentation of all systems!</font>'''<br />
** Main function<br />
** Hardware<br />
** OS<br />
** Network<br />
* Continue homogenizing the configurations of the machines.<br />
* Improve documentation of [[Software Issues#Mail Chain Dependencies|mail software]], specifically SpamAssassin, Cyrus, etc.<br />
=== Maintenance ===<br />
* Check e-mails to root every morning<br />
* Check up on security [http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/ch-sec-network.html#ch-wstation]<br />
* Clean up Room 202.<br />
** Start reorganizing things back into boxes for the August move.<br />
** Ask UNH if they are willing/able to recycle/reuse the three CRTs '''and old machines''' that we have sitting around. '''Give them away if we have to.'''<br />
<br />
=== On-the-Side ===<br />
* Learn how to use ssh-agent for task automation.<br />
* Backup stuff: We need exclude filters on the backups. We need to plan and execute extensive tests before modifying the production backup program. Also, see if we can implement some sort of NFS user access. '''I've set up both filters and read-only snapshot access to backups at home. Uses what essentially amounts to a bash script version of the fancy perl thing we use now, only far less sophisticated. However, the filtering and user access uses a standard rsync exclude file (syntax in man page) and the user access is fairly obvious NFS read-only hosting.''' <font color="green"> I am wondering if this is needed. The current scheme (ie the perl script) uses excludes by having a .rsync-filter in each of the directories whose contents you want excluded. This has worked well. See ~maurik/tmp/.rsync-filter . The current script takes care of some important issues, like incomplete backups.</font> Ah. So we need to get users to somehow keep that .rsync-filter file fairly updated. And to get them to use data to hold things, not home. Also, I wasn't suggesting we get rid of the perl script, I was saying that I've become familiar with a number of the things it does. [http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/ch-acls.html#s2-acls-mounting-nfs]<br />
* Continue purging NIS from ancient workstations, replacing it with flat files. The following remain:<br />
** pauli nodes -- Low priority!<br />
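The per-directory .rsync-filter scheme described in the backup bullet above is rsync's dir-merge mechanism; a sketch of how it works (the destination host and paths are made up for illustration):<br />

```shell
# -F is shorthand for --filter='dir-merge /.rsync-filter': rsync reads a
# .rsync-filter file in each directory it visits and applies its rules
# to that directory's subtree. Destination below is hypothetical.
rsync -aF /home/ lentil:/backups/npg-daily/home/

# A user's ~/.rsync-filter might contain, one rule per line:
#   - *.o
#   - scratch/
```

This is why the perl script only needs users to maintain their own .rsync-filter files rather than a central exclude list.<br />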
<br />
== Waiting ==<br />
* That guy's computer has a BIOS checksum error. Flashing the BIOS to the newest version succeeds, but doesn't fix the problem. No obvious mobo damage either. What happened? '''Who was that guy, anyhow?''' (Silviu Covrig, probably) The machine is gluon, according to him. '''Waiting on ASUS tech support for warranty info.''' Aaron said it might be power-supply-related. '''Nope. Definitely not. Used a known good PSU and still got the error; reflashed the BIOS with it and still got the error.''' Got RMA, sending out on Wed. '''Waiting on ASUS to send us a working one!''' Called ASUS on 8/6, they said it's getting repaired right now. '''Woohoo! Got a notification that it shipped!''' ...they didn't fix it... Still has the EXACT same error it had when we shipped it to them. '''What should we do about this?''' I'm going to call them up and have a talk, considering that the details on their shipment reveal they sent us a different motherboard, different serial number and everything, but with the same problem.<br />
* Printer queue for Copier: Konica Minolta Bizhub 750. IP=pita.unh.edu '''Seems like we need info from the Konica guy to get it set up on Red Hat. The installation documentation for the driver doesn't mention things like the passcode, because those are machine-specific. Katie says that if he doesn't come on Monday, she'll make an inquiry.''' <font color="green">Mac OS X now working, IT guy should be here week of June 26th</font> '''Did he ever come?''' No, he didn't, and did not respond to a voice message left. Will call again.<br />
* Sent an email to UNH Property Control asking what the procedure is to get rid of untagged equipment, namely, the two old monitors in the corner. Apparently they want us to fill out lots of information on the scrapping form, such as whether it was paid for with government money, as well as give them serial numbers, model numbers, and everything we can get ahold of. Then we get to hang onto them until the hazardous equipment people come in and take it out, at their leisure. Waiting to figure out what we want to do with them.<br />
<br />
== Completed ==<br />
* '''pumpkin/lentil/mariecurie/einstein2/corn''': Here's a summary of the symptoms, none of which occur for root:<br />
** bash has piping problems for Steve, but sometimes not for Matt. Something like <code>ls | wc</code> will print nothing and <code>echo $?</code> will print 141, aka SIGPIPE. Backticks will cause similar problems. Something like <code>echo `ls`</code> will print nothing, and <code>`ls`; echo $?</code> also prints 141. Since several system-provided startup scripts rely on strings returned from backticks, bash errors will print upon login.<br />
** '''None''' of the bash problems seem to happen when logged onto the physical machine, rather than over SSH, but '''all''' of the tcsh problems still occur. '''Turned out that the newest version of bash on el5 systems is broken. Replacing it with an older version fixes the issue with bash. Tcsh still gives issues, but that appears to be unrelated.'''<br />
** Everything else seems fine on these two machines: disk usage, other programs, network, etc.<br />
** Since this is now not just a pumpkin issue, the problem probably isn't corrupt files, but maybe some update messed something up. Figuring out which update could be tough, though, since there was a huge chunk of updates at some point (although I suspect the problem might be bash, since tcsh depends on it). However, einstein2 is pretty much a clean slate, so a simple if tedious method would be to apply updates to it gradually until the symptoms pop up.<br />
* '''Lentil''': Gotta reinstall a whole bunch of things and/or a new disk; looks like there was some damage from the power problem on Monday (the size 0 files have returned).<br />
* '''jalapeno hangups:''' Look at sensors on jalapeno, so that cacti can monitor the temp. The crashing probably isn't the splunk beta (no longer beta!), since it runs entirely in userspace. '''lm_sensors fails to detect anything readable. Is there a way around this?''' Jalapeno's been on for two weeks with no problems, let's keep our fingers crossed&hellip; '''This system is too unstable to maintain, like tomato and old einstein.''' Got an e-mail today, saying it's got a degraded array. I just turned it off since it's just a crappy space heater at this point.<br />
* Resize/clean up partitions as necessary. Seems to be a running trend that a computer gets 0 free space and problems crop up. '''This hasn't happened in half a year. I think it was a coincidence that a few computers had it happen at once.'''<br />
* Put new drive in lentil, npg-daily-33. '''That's good, because 32 is almost full already. 81%!'''<br />
* <b><font color="red">CLAMAV died and no one noticed!</font></b> The update of clamav (mail virus scanner on einstein) on April 23rd killed this mail subsystem because some of the options in /etc/clamd.conf were now obsolete. See http://www.sfr-fresh.com/unix/misc/clamav-0.93.tar.gz:a/clamav-0.93/NEWS. This seems to have gone unnoticed for a while. Are we asleep at the wheel? Edited /etc/clamd.conf to comment out these options.<br />
* When I came in today (22nd), taro had kernel panicked and einstein was acting strangely. Checking root's email, I saw that all day the 21st and 22nd, there were SMTP errors, around 2 per minute. A quick glance at them gives me the impression that they're spam attempts, due to ridiculous FROM fields like <code>pedrofinancialcompany.inc.net@tiscali.dk</code>. I rebooted taro and einstein, everything seems fine now.<br />
* Pauli crashes nearly every day, not when backups come around. We need to set up detailed system logging to find out why. Pauli2 and 4 don't give out their data via /net to the other paulis. This doesn't seem to be an autofs setting, since I see nothing about it in the working nodes' configs. Similarly, 2,4, and 6 won't access the other paulis via /net. 2,4 were nodes we rebuilt this summer, so it makes sense they don't have the right settings, but 6 is a mystery. Pauli2's hard drive may be dying. Some files in /data are inaccessible, and smartctl shows a large number of errors (98 if I'm reading this right...). Time to get Heisenberg a new hard drive? '''Or maybe just wean him off of NPG&hellip;''' It may be done for; can't connect to pauli2 and rebooting didn't seem to work. Need to set up the monitor/keyboard for it & check things out. '''The pauli nodes are all off for now. They've been deemed to produce more heat than they're worth. We'll leave them off until Heisenberg complains.''' Heisenberg's complaining now. Fixed his pauli machine by walking in the room (still don't know what he was talking about) and dirac had LDAP shut off. He wants the paulis up whenever possible, which I explained could be awhile because of the heat issues. ''' Pauli doesn't crash anymore, as far as I can tell. Switching the power supply seems to have done it.'''<br />
* Pumpkin is now stable. Read more on the configuration at [[Pumpkin]] and [[Xen]].<br />
* Roentgen was plugged into one of the non-battery-backup slots of its UPS, so I shut it down and moved the plug. After starting back up, root got a couple of mysterious e-mails about /dev/md0 and /dev/md2: <code>Array /dev/md2 has experienced event "DeviceDisappeared"</code>. However, <code>mount</code> seems to indicate that everything important is around:<br />
<pre><br />
/dev/vg_roentgen/rhel3 on / type ext3 (rw,acl)<br />
none on /proc type proc (rw)<br />
none on /dev/pts type devpts (rw,gid=5,mode=620)<br />
usbdevfs on /proc/bus/usb type usbdevfs (rw)<br />
/dev/md1 on /boot type ext3 (rw)<br />
none on /dev/shm type tmpfs (rw)<br />
/dev/vg_roentgen/rhel3_var on /var type ext3 (rw)<br />
/dev/vg_roentgen/wheel on /wheel type ext3 (rw,acl)<br />
/dev/vg_roentgen/srv on /srv type ext3 (rw,acl)<br />
/dev/vg_roentgen/dropbox on /var/www/dropbox type ext3 (rw)<br />
/usr/share/ssl on /etc/ssl type none (rw,bind)<br />
/proc on /var/lib/bind/proc type none (rw,bind)<br />
automount(pid1503) on /net type autofs (rw,fd=5,pgrp=1503,minproto=2,maxproto=4)<br />
</pre>and all of the sites listed on [[Web Servers]] work. Were those just old arrays that aren't around anymore but are still listed in some config file? '''We haven't seen any issues, and roentgen's going to be virtualized in the not-too-distant future, so this is fairly irrelevant.'''<br />
* Gourd's been giving smartd errors, namely:<pre><br />
Offline uncorrectable sectors detected:<br />
/dev/sda [3ware_disk_00] - 48 Time(s)<br />
1 offline uncorrectable sectors detected<br />
</pre>Okra also has an offline uncorrectable sector! '''No sign of problems since this was posted.'''<br />
<br />
== Previous Months Completed ==<br />
[[Completed in June 2007|June 2007]]<br />
<br />
[[Completed in July 2007|July 2007]]<br />
<br />
[[Completed in August 2007|August 2007]]<br />
<br />
[[Completed in September 2007|September 2007]]<br />
<br />
[[Completed in October 2007|October 2007]]<br />
<br />
[[Completed in November/December 2007|NovDec 2007]]<br />
<br />
[[Completed in January 2008|January 2008]]<br />
<br />
[[Completed in February 2008|February 2008]]</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Sysadmin_Todo_List&diff=3706Sysadmin Todo List2008-05-29T13:22:24Z<p>Steve: /* Miscellaneous */</p>
<hr />
<div>This is an unordered set of tasks. Detailed information on any of the tasks typically goes in related topics' pages, although usually not until the task has been filed under [[Sysadmin Todo List#Completed|Completed]].<br />
== Daily Check off list ==<br />
Each day when you come in check the following:<br />
# Einstein ([[Script Prototypes|script]]):<br />
## Up and running?<br />
## Disks are at less than 90% full?<br />
## Mail system OK? (spamassassin, amavisd, ...)<br />
# Temperature OK? No water blown into room?<br />
# Systems up: Taro, Pepper, Pumpkin/Corn ?<br />
# Backups:<br />
## Did backup succeed?<br />
## Does Lentil need a new disk?<br />
<br />
== Important ==<br />
<br />
=== Towards a stable setup ===<br />
<br />
Here are some thoughts, especially to Steve, about getting an über-stable setup for the servers. <br><br />
Some observations:<br />
# When we get to DeMeritt at the end of next summer, we need a setup that easily ports to the new environment. We will also be limited to a total of 10 kWatts heat load (36000 BTUs, or 3 tons of cooling), due to the cooling of the room. That sounds like a lot, but Silas and Jiang-Ming will also put servers in this space. Our footprint should be no more than 3 to 4 kWatts of heat load.<br />
# Virtual systems seem like the way to go. However, our experience with Xen is that it does not lead to highly portable VMs.<br />
# VMware Server is now a free product. They make money consulting and selling fancy add-ons. I have good experience with VMware Workstation on Macs and Linux. But it is possible (as with Red Hat, which was once free) that they will start charging once they reach 90% or more market share.<br />
<br />
Here are some options: <br><br />
* We get rid of Tomato, Jalapeno, Gourd and Okra and perhaps also Roentgen. If we want we can scavenge the parts from Tomato & Jalapeno (plus old einstein) for a toy system, or we park these systems in the corner. I don't want to waste time on them. The only bumps that I can think of here would be that Xemed/Aaron use Gourd. Otherwise I think we're all in favor of cutting down on the number of physical machines that we've got running. Oh, and what about the paulis? '''Since they're not under our "jurisdiction", they'll probably end up there anyhow.'''<br />
* Test VMware server (See [[VMWare Progress]]). Specifically, I would like to know:<br />
## How easy is it to move a VM from one hardware to another? (Can you simply move the disks?) '''Yes.'''<br />
## Specifically, if you need to service some hardware, can you move the host to other hardware with little down time? (Clearly not for large disk arrays, like pumpkin, but that is storage, not hosts). '''Considering portability of disks/files, the downtime is the time it takes to move the image around and start up on another machine.'''<br />
## Do we need a RedHat license for each VM or do we only need a license for the host, as with Xen? '''It seems to consume a license per VM. Following [http://kbase.redhat.com/faq/FAQ_103_10754.shtm this] didn't work for the VMWare systems. The closest thing to an official word that I could find was [http://www.redhat.com/archives/taroon-list/2004-August/msg00292.html this].'''<br />
## VMware allows for "virtual appliances", but how good are these really? Are these fast enough?<br />
* Evaluate the hardware needs. Pumpkin, the new Einstein, Pepper, Taro, and Lentil all seem to be of sufficient quality and up to date. Do we need any other hardware? If so, what?<br />
<br />
=== Einstein Upgrade ===<br />
<br />
Einstein upgrade project and status page: [[Einstein Status]]<br />
'''Note:''' Einstein (current one) has a problem with / getting full occasionally. See [[Einstein#Special_Considerations_for_Einstein]]<br />
<br />
It seems this is not moving forward sufficiently. I think we need a new strategy to get this accomplished. My new thought is to abandon the Tomato hardware, which may have been a source of the difficulties, and use what we learned for the setup of "Einstein on RHEL5" to create a virtual machine Tomato, where we test the upgrade to RHEL5. <br />
<br />
<br />
<br />
=== Miscellaneous ===<br />
* '''pumpkin/lentil''': Here's a summary of the symptoms, none of which occur for root:<br />
** tcsh can only run built-in commands; anything else results in a "Broken pipe" and the program not running. "Broken pipe" appears even for non-existent programs (e.g. "hhhhhhhh" will make "Broken pipe" appear).<br />
** bash has piping problems for Steve, but sometimes not for Matt. Something like <code>ls | wc</code> will print nothing and <code>echo $?</code> will print 141, aka SIGPIPE. Backticks will cause similar problems. Something like <code>echo `ls`</code> will print nothing, and <code>`ls`; echo $?</code> also prints 141. Since several system-provided startup scripts rely on strings returned from backticks, bash errors will print upon login.<br />
** '''None''' of the bash problems seem to happen when logged onto the physical machine, rather than over SSH, but '''all''' of the tcsh problems still occur.<br />
** Everything else seems fine on these two machines: disk usage, other programs, network, etc.<br />
** Since this is now not just a pumpkin issue, the problem probably isn't corrupt files, but maybe some update messed something up. Figuring out which update could be tough, though, since there was a huge chunk of updates at some point (although I suspect the problem might be bash, since tcsh depends on it). However, einstein2 is pretty much a clean slate, so a simple if tedious method would be to apply updates to it gradually until the symptoms pop up.<br />
* '''MarieCurie''': Feynman's video card doesn't fit in any of mariecurie's slots (is it AGP or something?). I'm going to see if blackbody's Nvidia (which can do widescreen with the "nv" driver) fits, whenever Sarah isn't busy with her machine. If it does, then she can take the card, at least while I mess around with her ATI card in blackbody. '''blackbody's card doesn't fit in any of the slots, either. I have no clue what kind of connection it needs. The next step is to just look for and order a PCI/PCI-X NVidia card that's known to work at 1680x1050 on RedHat.''' Tried the 6200LE, had the same problems as the ATI cards!<br />
* Lepton constantly has problems printing. It seems that at least once a month the queue locks up. This machine has Fedora Core 3 installed; I wonder if it would be more worthwhile to just put CentOS on it and be done with this recurring problem.<br />
* Fermi has problems allowing me to log in. nsswitch.conf looks fine, getent passwd shows all the users like it's supposed to. There are no restrictions in ''/etc/security/access.conf'', either.<br />
* Gourd won't let me (Matt) log in, saying no such file or directory when trying to chdir to my home, and then it boots me off. Trying to log in as root from einstein is successful just long enough for it to tell me when the last login was, then boots me. '''(Steve here) I was able to log in and do stuff, but programs were intermittently slow.'''<br />
* Clean out some users who have left a while ago. (Maurik should do this.)<br />
* '''Monitoring''': I would like to see the new temp-monitor integrated with Cacti, and fix some of the cacti capabilities, i.e. tie it in with the sensors output from pepper and taro (and tomato/einstein). Setup sensors on the corn/pumpkin. Have an intelligent way in which we are warned when conditions are too hot, a drive has failed, a system is down. '''I'm starting to get the hang of getting this sort of data via snmp. I wrote a perl script that pulls the temperature data from the environmental monitor, as well as some nice info from einstein. We SHOULD be able to integrate a rudimentary script like this into cacti or splunk, getting a bit closer to an all-in-one monitoring solution. It's in Matt's home directory, under code/npgmon/'''<br />
* Check into smartd monitoring (and processing its output) on Pepper, Taro, Corn/Pumpkin, Einstein, Tomato.<br />
* Decommission Okra. - This system is way too outdated to bother with it. Move Cacti to another system. Perhaps a VM, once we get that figured out?<br />
* Learn how to use [[cacti]]. We should consider using a VM appliance to do this, so it's minimal configuration, and since okra's only purpose is to run cacti.<br />
<br />
== Ongoing ==<br />
=== Documentation ===<br />
* '''<font color="red" size="+1">Maintain the Documentation of all systems!</font>'''<br />
** Main function<br />
** Hardware<br />
** OS<br />
** Network<br />
* Continue homogenizing the configurations of the machines.<br />
* Improve documentation of [[Software Issues#Mail Chain Dependencies|mail software]], specifically SpamAssassin, Cyrus, etc.<br />
=== Maintenance ===<br />
* Check e-mails to root every morning<br />
* Check up on security [http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/ch-sec-network.html#ch-wstation]<br />
* Clean up Room 202.<br />
** Start reorganizing things back into boxes for the August move.<br />
** Ask UNH whether they are willing/able to recycle/reuse the three CRTs '''and old machines''' that we have sitting around. '''Give them away if we have to.'''<br />
<br />
=== On-the-Side ===<br />
* Learn how to use ssh-agent for task automation.<br />
* Backup stuff: We need exclude filters on the backups. We need to plan and execute extensive tests before modifying the production backup program. Also, see if we can implement some sort of NFS user access. '''I've set up both filters and read-only snapshot access to backups at home. Uses what essentially amounts to a bash script version of the fancy perl thing we use now, only far less sophisticated. However, the filtering and user access uses a standard rsync exclude file (syntax in man page) and the user access is fairly obvious NFS read-only hosting.''' <font color="green"> I am wondering if this is needed. The current scheme (i.e. the perl script) handles excludes by having a .rsync-filter in each of the directories whose contents you want excluded. This has worked well. See ~maurik/tmp/.rsync-filter . The current script takes care of some important issues, like incomplete backups.</font> Ah. So we need to get users to somehow keep that .rsync-filter file fairly updated, and to get them to use /data to hold things, not /home. Also, I wasn't suggesting we get rid of the perl script, I was saying that I've become familiar with a number of the things it does. [http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/ch-acls.html#s2-acls-mounting-nfs]<br />
* Continue purging NIS from ancient workstations, replacing it with local files. The following remain:<br />
** pauli nodes -- Low priority!<br />
<br />
== Waiting ==<br />
* That guy's computer has a BIOS checksum error. Flashing the BIOS to the newest version succeeds, but doesn't fix the problem. No obvious mobo damage either. What happen? '''Who was that guy, anyhow?''' (Silviu Covrig, probably) The machine is gluon, according to him. '''Waiting on ASUS tech support for warranty info''' Aaron said it might be power-supply-related. '''Nope. Definitely not. Used a known good PSU and still got error, reflashed bios with it and still got error. '''Got RMA, sending out on wed.''' Waiting on ASUS to send us a working one!''' Called ASUS on 8/6, they said it's getting repaired right now. '''Wohoo! Got a notification that it shipped!''' ...they didn't fix it... Still has the EXACT same error it had when we shipped it to them. '''What should we do about this?''' I'm going to call them up and have a talk, considering looking at the details on their shipment reveals that they sent us a different motherboard, different serial number and everything but with the same problem.<br />
* Printer queue for Copier: Konica Minolta Bizhub 750. IP=pita.unh.edu '''Seems like we need info from the Konica guy to get it set up on Red Hat. The installation documentation for the driver doesn't mention things like the passcode, because those are machine-specific. Katie says that if he doesn't come on Monday, she'll make an inquiry.''' <font color="green">Mac OS X now working, IT guy should be here week of June 26th</font> '''Did he ever come?''' No, he didn't, and did not respond to a voice message left. Will call again.<br />
* Sent an email to UNH Property Control asking what the procedure is to get rid of untagged equipment, namely, the two old monitors in the corner. Apparently they want us to fill out lots of information on the scrapping form like if it was paid for with government money, etc, as well as give them serial numbers, model numbers, and everything we can get ahold of. Then, we get to hang onto them until the hazardous equipment people come in and take it out, at their leisure. Waiting to figure out what we want to do with them.<br />
<br />
== Completed ==<br />
* '''Lentil''': Gotta reinstall a whole bunch of things and/or a new disk; looks like there was some damage from the power problem on Monday (the size 0 files have returned).<br />
* '''jalapeno hangups:''' Look at sensors on jalapeno, so that cacti can monitor the temp. The crashing probably isn't the splunk beta (no longer beta!), since it runs entirely in userspace. '''lm_sensors fails to detect anything readable. Is there a way around this?''' Jalapeno's been on for two weeks with no problems, let's keep our fingers crossed&hellip; '''This system is too unstable to maintain, like tomato and old einstein.''' Got an e-mail today, saying it's got a degraded array. I just turned it off since it's just a crappy space heater at this point.<br />
* Resize/clean up partitions as necessary. Seems to be a running trend that a computer gets 0 free space and problems crop up. '''This hasn't happened in half a year. I think it was a coincidence that a few computers had it happen at once.'''<br />
* Put new drive in lentil, npg-daily-33. '''That's good, because 32 is almost full already. 81%!'''<br />
* <b><font color="red">CLAMAV died and no one noticed!</font></b> The update of clamav (mail virus scanner on einstein) on April 23rd killed this mail subsystem because some of the options in /etc/clamd.conf were now obsolete. See http://www.sfr-fresh.com/unix/misc/clamav-0.93.tar.gz:a/clamav-0.93/NEWS. This seems to have gone unnoticed for a while. Are we asleep at the wheel? Edited /etc/clamd.conf to comment out these options.<br />
* When I came in today (22nd), taro had kernel panicked and einstein was acting strangely. Checking root's email, I saw that all day the 21st and 22nd, there were SMTP errors, around 2 per minute. A quick glance at them gives me the impression that they're spam attempts, due to ridiculous FROM fields like <code>pedrofinancialcompany.inc.net@tiscali.dk</code>. I rebooted taro and einstein, everything seems fine now.<br />
* Pauli crashes nearly every day, not when backups come around. We need to set up detailed system logging to find out why. Pauli2 and 4 don't give out their data via /net to the other paulis. This doesn't seem to be an autofs setting, since I see nothing about it in the working nodes' configs. Similarly, 2,4, and 6 won't access the other paulis via /net. 2,4 were nodes we rebuilt this summer, so it makes sense they don't have the right settings, but 6 is a mystery. Pauli2's hard drive may be dying. Some files in /data are inaccessible, and smartctl shows a large number of errors (98 if I'm reading this right...). Time to get Heisenberg a new hard drive? '''Or maybe just wean him off of NPG&hellip;''' It may be done for; can't connect to pauli2 and rebooting didn't seem to work. Need to set up the monitor/keyboard for it & check things out. '''The pauli nodes are all off for now. They've been deemed to produce more heat than they're worth. We'll leave them off until Heisenberg complains.''' Heisenberg's complaining now. Fixed his pauli machine by walking in the room (still don't know what he was talking about) and dirac had LDAP shut off. He wants the paulis up whenever possible, which I explained could be awhile because of the heat issues. ''' Pauli doesn't crash anymore, as far as I can tell. Switching the power supply seems to have done it.'''<br />
* Pumpkin is now stable. Read more on the configuration at [[Pumpkin]] and [[Xen]].<br />
* Roentgen was plugged into one of the non-battery-backup slots of its UPS, so I shut it down and moved the plug. After starting back up, root got a couple of mysterious e-mails about /dev/md0 and /dev/md2: <code>Array /dev/md2 has experienced event "DeviceDisappeared"</code>. However, <code>mount</code> seems to indicate that everything important is around:<br />
<pre><br />
/dev/vg_roentgen/rhel3 on / type ext3 (rw,acl)<br />
none on /proc type proc (rw)<br />
none on /dev/pts type devpts (rw,gid=5,mode=620)<br />
usbdevfs on /proc/bus/usb type usbdevfs (rw)<br />
/dev/md1 on /boot type ext3 (rw)<br />
none on /dev/shm type tmpfs (rw)<br />
/dev/vg_roentgen/rhel3_var on /var type ext3 (rw)<br />
/dev/vg_roentgen/wheel on /wheel type ext3 (rw,acl)<br />
/dev/vg_roentgen/srv on /srv type ext3 (rw,acl)<br />
/dev/vg_roentgen/dropbox on /var/www/dropbox type ext3 (rw)<br />
/usr/share/ssl on /etc/ssl type none (rw,bind)<br />
/proc on /var/lib/bind/proc type none (rw,bind)<br />
automount(pid1503) on /net type autofs (rw,fd=5,pgrp=1503,minproto=2,maxproto=4)<br />
</pre>and all of the sites listed on [[Web Servers]] work. Were those just old arrays that aren't around anymore but are still listed in some config file? '''We haven't seen any issues, and roentgen's going to be virtualized in the not-too-distant future, so this is fairly irrelevant.'''<br />
* Gourd's been giving smartd errors, namely<code><br />
Offline uncorrectable sectors detected:<br />
/dev/sda [3ware_disk_00] - 48 Time(s)<br />
1 offline uncorrectable sectors detected<br />
</code>Okra also has an offline uncorrectable sector! '''No sign of problems since this was posted.'''<br />
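On the roentgen "DeviceDisappeared" question a few items up: a quick way to spot ARRAY lines in mdadm.conf that no longer correspond to an assembled array is to compare the config against /proc/mdstat. A sketch with canned sample data so it runs anywhere; on the real machine you would read /etc/mdadm.conf and /proc/mdstat instead.<br />

```shell
#!/bin/sh
# Find arrays listed in mdadm.conf but absent from /proc/mdstat.
# Sample files stand in for /etc/mdadm.conf and /proc/mdstat.
CONF=$(mktemp); MDSTAT=$(mktemp)
cat > "$CONF" <<'EOF'
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=000...
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=111...
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=222...
EOF
cat > "$MDSTAT" <<'EOF'
Personalities : [raid1]
md1 : active raid1 sda1[0] sdb1[1]
EOF

# Config entries with no assembled counterpart are exactly what
# mdadm's monitor reports as "DeviceDisappeared" events.
STALE=$(awk '/^ARRAY/ {print $2}' "$CONF" | while read dev; do
    grep -q "^${dev#/dev/} :" "$MDSTAT" || echo "stale: $dev"
done)
echo "$STALE"
rm -f "$CONF" "$MDSTAT"
```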
<br />
== Previous Months Completed ==<br />
[[Completed in June 2007|June 2007]]<br />
<br />
[[Completed in July 2007|July 2007]]<br />
<br />
[[Completed in August 2007|August 2007]]<br />
<br />
[[Completed in September 2007|September 2007]]<br />
<br />
[[Completed in October 2007|October 2007]]<br />
<br />
[[Completed in November/December 2007|NovDec 2007]]<br />
<br />
[[Completed in January 2008|January 2008]]<br />
<br />
[[Completed in February 2008|February 2008]]</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Sysadmin_Todo_List&diff=3705Sysadmin Todo List2008-05-29T13:19:27Z<p>Steve: /* Miscellaneous */ damn you rhel 5</p>
<hr />
<div>This is an unordered set of tasks. Detailed information on any of the tasks typically goes in related topics' pages, although usually not until the task has been filed under [[Sysadmin Todo List#Completed|Completed]].<br />
== Daily Check off list ==<br />
Each day when you come in check the following:<br />
# Einstein ([[Script Prototypes|script]]):<br />
## Up and running?<br />
## Disks are at less than 90% full?<br />
## Mail system OK? (spamassassin, amavisd, ...)<br />
# Temperature OK? No water blown into room?<br />
# Systems up: Taro, Pepper, Pumpkin/Corn ?<br />
# Backups:<br />
## Did backup succeed?<br />
## Does Lentil need a new disk?<br />
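The morning checks above are exactly what the [[Script Prototypes|script]] automates; here is a minimal sketch of the disk and mail-daemon parts. The 90% threshold matches the list above, but the daemon names are assumptions and the maintained version lives on the Script Prototypes page.<br />

```shell
#!/bin/sh
# Sketch of the morning checklist (disk fullness + mail-chain daemons).
LIMIT=90

# Disks at less than 90% full? df -P prints one line per filesystem;
# column 5 is the use percentage, column 6 the mount point.
df -P | awk -v limit="$LIMIT" 'NR > 1 {
    use = $5; sub(/%/, "", use)
    if (use + 0 >= limit) printf "FULL: %s at %s%%\n", $6, use
}'

# Mail chain daemons alive? (einstein only; names are a guess)
for svc in spamd amavisd clamd; do
    pgrep -x "$svc" > /dev/null || echo "DOWN: $svc"
done

echo "daily check complete"
```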
<br />
== Important ==<br />
<br />
=== Towards a stable setup ===<br />
<br />
Here are some thoughts, especially to Steve, about getting an über-stable setup for the servers. <br><br />
Some observations:<br />
# When we get to DeMeritt at the end of next summer, we need a setup that easily ports to the new environment. We will also be limited to a total of 10 kWatts heat load (36000 BTUs, or 3 tons of cooling), due to the cooling of the room. That sounds like a lot, but Silas and Jiang-Ming will also put servers in this space. Our footprint should be no more than 3 to 4 kWatts of heat load.<br />
# Virtual systems seem like the way to go. However, our experience with Xen is that it does not lead to highly portable VMs.<br />
# VMware Server is now a free product. They make money consulting and selling fancy add-ons. I have good experience with VMware Workstation on Macs and Linux. But it is possible that, like RedHat (which was once free), they will start charging when they reach 90% or more market share.<br />
<br />
Here are some options: <br><br />
* We get rid of Tomato, Jalapeno, Gourd and Okra and perhaps also Roentgen. If we want we can scavenge the parts from Tomato & Jalapeno (plus old einstein) for a toy system, or we park these systems in the corner. I don't want to waste time on them. The only bumps that I can think of here would be that Xemed/Aaron use Gourd. Otherwise I think we're all in favor of cutting down on the number of physical machines that we've got running. Oh, and what about the paulis? '''Since they're not under our "jurisdiction", they'll probably end up there anyhow.'''<br />
* Test VMware server (See [[VMWare Progress]]). Specifically, I would like to know:<br />
## How easy is it to move a VM from one machine to another? (Can you simply move the disks?) '''Yes.'''<br />
## Specifically, if you need to service some hardware, can you move the host to other hardware with little down time? (Clearly not for large disk arrays, like pumpkin, but that is storage, not hosts). '''Considering portability of disks/files, the downtime is the time it takes to move the image around and start up on another machine.'''<br />
## Do we need a RedHat license for each VM or do we only need a license for the host, as with Xen? '''It seems to consume a license per VM. Following [http://kbase.redhat.com/faq/FAQ_103_10754.shtm this] didn't work for the VMWare systems. The closest thing to an official word that I could find was [http://www.redhat.com/archives/taroon-list/2004-August/msg00292.html this].'''<br />
## VMware allows for "virtual appliances", but how good are these really? Are these fast enough?<br />
* Evaluate the hardware needs. Pumpkin, the new Einstein, Pepper, Taro and Lentil all seem to be of sufficient quality and up to date. Do we need other hardware? If so, what?<br />
<br />
=== Einstein Upgrade ===<br />
<br />
Einstein upgrade project and status page: [[Einstein Status]]<br />
'''Note:''' Einstein (current one) has a problem with / getting full occasionally. See [[Einstein#Special_Considerations_for_Einstein]]<br />
<br />
It seems this is not moving forward sufficiently. I think we need a new strategy to get this accomplished. My new thought is to abandon the Tomato hardware, which may have been a source of the difficulties, and use what we learned for the setup of "Einstein on RHEL5" to create a virtual machine Tomato, where we test the upgrade to RHEL5. <br />
<br />
=== Miscellaneous ===<br />
* '''pumpkin/lentil''': Here's a summary of the symptoms, none of which occur for root:<br />
** tcsh can only run built-in commands; anything else results in a "Broken pipe" and the program not running. "Broken pipe" appears even for non-existent programs (e.g. "hhhhhhhh" will make "Broken pipe" appear).<br />
** bash has piping problems for Steve, but sometimes not for Matt. Something like <code>ls | wc</code> will print nothing and <code>echo $?</code> will print 141, aka SIGPIPE. Backticks will cause similar problems. Something like <code>echo `ls`</code> will print nothing, and <code>`ls`; echo $?</code> also prints 141. Since several system-provided startup scripts rely on strings returned from backticks, bash errors will print upon login.<br />
** '''None''' of the bash problems seem to happen when logged onto the physical machine, rather than over SSH, but '''all''' of the tcsh problems still occur.<br />
** Since this is now not just a pumpkin issue, the problem probably isn't corrupt files, but maybe some update messed something up. <br />
** Everything else seems fine on these two machines: disk usage, other programs, network, etc.<br />
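A clarifying detail on the 141 above: the shell reports a command killed by signal N as exit status 128+N, and SIGPIPE is signal 13, so 141 means the process died of SIGPIPE. A quick sanity check:<br />

```shell
#!/bin/sh
# Decode exit status 141: 128 + signal number, and SIGPIPE is 13.
SIGPIPE=13
EXPECTED=$((128 + SIGPIPE))
echo "expected status for death-by-SIGPIPE: $EXPECTED"

# Live check (result may differ if SIGPIPE is ignored in this
# environment): a child killing itself with SIGPIPE should yield 141.
sh -c 'kill -s PIPE $$'
echo "observed: $?"
```

Whatever is breaking non-root logins is therefore delivering (or mis-handling) SIGPIPE, which narrows the search to things that touch signal disposition at login time.<br />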
* '''MarieCurie''': Feynman's video card doesn't fit in any of mariecurie's slots (is it AGP or something?). I'm going to see if blackbody's Nvidia (which can do widescreen with the "nv" driver) fits, whenever Sarah isn't busy with her machine. If it does, then she can take the card, at least while I mess around with her ATI card in blackbody. '''blackbody's card doesn't fit in any of the slots, either. I have no clue what kind of connection it needs. The next step is to just look for and order a PCI/PCI-X NVidia card that's known to work at 1680x1050 on RedHat.''' Tried the 6200LE, had the same problems as the ATI cards!<br />
* Lepton constantly has problems printing. It seems that at least once a month the queue locks up. This machine has Fedora Core 3 installed; I wonder whether it would be worth just putting CentOS on it and being done with this recurring problem.<br />
* Fermi has problems allowing me to log in. nsswitch.conf looks fine, getent passwd shows all the users like it's supposed to. There are no restrictions in ''/etc/security/access.conf'', either.<br />
* Gourd won't let me (Matt) log in, saying no such file or directory when trying to chdir to my home, and then it boots me off. Trying to log in as root from einstein is successful just long enough for it to tell me when the last login was, then boots me. '''(Steve here) I was able to log in and do stuff, but programs were intermittently slow.'''<br />
* Clean out the accounts of users who left a while ago. (Maurik should do this.)<br />
* '''Monitoring''': I would like to see the new temp-monitor integrated with Cacti, and to fix some of the cacti capabilities, i.e. tie it in with the sensors output from pepper and taro (and tomato/einstein). Set up sensors on corn/pumpkin. Have an intelligent way of being warned when conditions are too hot, a drive has failed, or a system is down. '''I'm starting to get the hang of getting this sort of data via snmp. I wrote a perl script that pulls the temperature data from the environmental monitor, as well as some nice info from einstein. We SHOULD be able to integrate a rudimentary script like this into cacti or splunk, getting a bit closer to an all-in-one monitoring solution. It's in Matt's home directory, under code/npgmon/'''<br />
* Check into smartd monitoring (and processing its output) on Pepper, Taro, Corn/Pumpkin, Einstein, Tomato.<br />
* Decommission Okra. - This system is way too outdated to bother with it. Move Cacti to another system. Perhaps a VM, once we get that figured out?<br />
* Learn how to use [[cacti]]. We should consider using a VM appliance for this, since it needs minimal configuration and okra's only purpose is to run cacti.<br />
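For the smartd item above, monitoring is mostly a matter of /etc/smartd.conf entries plus making sure the daemon's mail lands somewhere it gets read. A sketch; the device names and schedules are guesses and vary per machine, and gourd's 3ware controller needs the <code>-d 3ware,N</code> form:<br />

```
# /etc/smartd.conf sketch -- device names and schedules are guesses.
# -a: monitor everything; -o on / -S on: automatic offline testing and
# attribute autosave; -s: short self-test daily at 02:00, long test
# Saturdays at 03:00; -m root: mail root on failures or new bad sectors.
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
# 3ware-attached disks (gourd) are addressed through the controller:
/dev/twe0 -d 3ware,0 -a -m root
```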
<br />
== Ongoing ==<br />
=== Documentation ===<br />
* '''<font color="red" size="+1">Maintain the Documentation of all systems!</font>'''<br />
** Main function<br />
** Hardware<br />
** OS<br />
** Network<br />
* Continue homogenizing the configurations of the machines.<br />
* Improve documentation of [[Software Issues#Mail Chain Dependencies|mail software]], specifically SpamAssassin, Cyrus, etc.<br />
=== Maintenance ===<br />
* Check e-mails to root every morning<br />
* Check up on security [http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/ch-sec-network.html#ch-wstation]<br />
* Clean up Room 202.<br />
** Start reorganizing things back into boxes for the August move.<br />
** Ask UNH whether they are willing/able to recycle/reuse the three CRTs '''and old machines''' that we have sitting around. '''Give them away if we have to.'''<br />
<br />
=== On-the-Side ===<br />
* Learn how to use ssh-agent for task automation.<br />
* Backup stuff: We need exclude filters on the backups. We need to plan and execute extensive tests before modifying the production backup program. Also, see if we can implement some sort of NFS user access. '''I've set up both filters and read-only snapshot access to backups at home. Uses what essentially amounts to a bash script version of the fancy perl thing we use now, only far less sophisticated. However, the filtering and user access uses a standard rsync exclude file (syntax in man page) and the user access is fairly obvious NFS read-only hosting.''' <font color="green"> I am wondering if this is needed. The current scheme (i.e. the perl script) handles excludes by having a .rsync-filter in each of the directories whose contents you want excluded. This has worked well. See ~maurik/tmp/.rsync-filter . The current script takes care of some important issues, like incomplete backups.</font> Ah. So we need to get users to somehow keep that .rsync-filter file fairly updated, and to get them to use /data to hold things, not /home. Also, I wasn't suggesting we get rid of the perl script, I was saying that I've become familiar with a number of the things it does. [http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/ch-acls.html#s2-acls-mounting-nfs]<br />
* Continue purging NIS from ancient workstations, replacing it with local files. The following remain:<br />
** pauli nodes -- Low priority!<br />
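A sketch of the ssh-agent workflow for the automation item above. The key path is an example; the idea is that the agent is started once, keys are loaded interactively, and unattended jobs inherit the agent's socket:<br />

```shell
#!/bin/sh
# Sketch: one agent per session; scripts reuse it instead of prompting.
eval "$(ssh-agent -s)" > /dev/null   # exports SSH_AUTH_SOCK, SSH_AGENT_PID

# Interactively, load a key once (prompts for its passphrase):
#   ssh-add ~/.ssh/id_rsa
ssh-add -l || true                   # lists keys; "no identities" until added

# Anything launched from this environment (cron wrappers, backup runs)
# inherits SSH_AUTH_SOCK and can ssh/rsync without a passphrase prompt.
echo "agent socket: $SSH_AUTH_SOCK"

ssh-agent -k > /dev/null             # kill the agent when finished
```

The catch for fully unattended tasks is that someone still has to load the key after each reboot, which is the usual trade-off against a passphrase-less key.<br />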
<br />
== Waiting ==<br />
* That guy's computer has a BIOS checksum error. Flashing the BIOS to the newest version succeeds, but doesn't fix the problem. No obvious mobo damage either. What happen? '''Who was that guy, anyhow?''' (Silviu Covrig, probably) The machine is gluon, according to him. '''Waiting on ASUS tech support for warranty info''' Aaron said it might be power-supply-related. '''Nope. Definitely not. Used a known good PSU and still got error, reflashed bios with it and still got error. '''Got RMA, sending out on wed.''' Waiting on ASUS to send us a working one!''' Called ASUS on 8/6, they said it's getting repaired right now. '''Wohoo! Got a notification that it shipped!''' ...they didn't fix it... Still has the EXACT same error it had when we shipped it to them. '''What should we do about this?''' I'm going to call them up and have a talk, considering looking at the details on their shipment reveals that they sent us a different motherboard, different serial number and everything but with the same problem.<br />
* Printer queue for Copier: Konica Minolta Bizhub 750. IP=pita.unh.edu '''Seems like we need info from the Konica guy to get it set up on Red Hat. The installation documentation for the driver doesn't mention things like the passcode, because those are machine-specific. Katie says that if he doesn't come on Monday, she'll make an inquiry.''' <font color="green">Mac OS X now working, IT guy should be here week of June 26th</font> '''Did he ever come?''' No, he didn't, and did not respond to a voice message left. Will call again.<br />
* Sent an email to UNH Property Control asking what the procedure is to get rid of untagged equipment, namely, the two old monitors in the corner. Apparently they want us to fill out lots of information on the scrapping form like if it was paid for with government money, etc, as well as give them serial numbers, model numbers, and everything we can get ahold of. Then, we get to hang onto them until the hazardous equipment people come in and take it out, at their leisure. Waiting to figure out what we want to do with them.<br />
<br />
== Completed ==<br />
* '''Lentil''': Gotta reinstall a whole bunch of things and/or a new disk; looks like there was some damage from the power problem on Monday (the size 0 files have returned).<br />
* '''jalapeno hangups:''' Look at sensors on jalapeno, so that cacti can monitor the temp. The crashing probably isn't the splunk beta (no longer beta!), since it runs entirely in userspace. '''lm_sensors fails to detect anything readable. Is there a way around this?''' Jalapeno's been on for two weeks with no problems, let's keep our fingers crossed&hellip; '''This system is too unstable to maintain, like tomato and old einstein.''' Got an e-mail today, saying it's got a degraded array. I just turned it off since it's just a crappy space heater at this point.<br />
* Resize/clean up partitions as necessary. Seems to be a running trend that a computer gets 0 free space and problems crop up. '''This hasn't happened in half a year. I think it was a coincidence that a few computers had it happen at once.'''<br />
* Put new drive in lentil, npg-daily-33. '''That's good, because 32 is almost full already. 81%!'''<br />
* <b><font color="red">CLAMAV died and no one noticed!</font></b> The update of clamav (mail virus scanner on einstein) on April 23rd killed this mail subsystem because some of the options in /etc/clamd.conf were now obsolete. See http://www.sfr-fresh.com/unix/misc/clamav-0.93.tar.gz:a/clamav-0.93/NEWS. This seems to have gone unnoticed for a while. Are we asleep at the wheel? Edited /etc/clamd.conf to comment out these options.<br />
* When I came in today (22nd), taro had kernel panicked and einstein was acting strangely. Checking root's email, I saw that all day the 21st and 22nd, there were SMTP errors, around 2 per minute. A quick glance at them gives me the impression that they're spam attempts, due to ridiculous FROM fields like <code>pedrofinancialcompany.inc.net@tiscali.dk</code>. I rebooted taro and einstein, everything seems fine now.<br />
* Pauli crashes nearly every day, not when backups come around. We need to set up detailed system logging to find out why. Pauli2 and 4 don't give out their data via /net to the other paulis. This doesn't seem to be an autofs setting, since I see nothing about it in the working nodes' configs. Similarly, 2,4, and 6 won't access the other paulis via /net. 2,4 were nodes we rebuilt this summer, so it makes sense they don't have the right settings, but 6 is a mystery. Pauli2's hard drive may be dying. Some files in /data are inaccessible, and smartctl shows a large number of errors (98 if I'm reading this right...). Time to get Heisenberg a new hard drive? '''Or maybe just wean him off of NPG&hellip;''' It may be done for; can't connect to pauli2 and rebooting didn't seem to work. Need to set up the monitor/keyboard for it & check things out. '''The pauli nodes are all off for now. They've been deemed to produce more heat than they're worth. We'll leave them off until Heisenberg complains.''' Heisenberg's complaining now. Fixed his pauli machine by walking in the room (still don't know what he was talking about) and dirac had LDAP shut off. He wants the paulis up whenever possible, which I explained could be awhile because of the heat issues. ''' Pauli doesn't crash anymore, as far as I can tell. Switching the power supply seems to have done it.'''<br />
* Pumpkin is now stable. Read more on the configuration at [[Pumpkin]] and [[Xen]].<br />
* Roentgen was plugged into one of the non-battery-backup slots of its UPS, so I shut it down and moved the plug. After starting back up, root got a couple of mysterious e-mails about /dev/md0 and /dev/md2: <code>Array /dev/md2 has experienced event "DeviceDisappeared"</code>. However, <code>mount</code> seems to indicate that everything important is around:<br />
<pre><br />
/dev/vg_roentgen/rhel3 on / type ext3 (rw,acl)<br />
none on /proc type proc (rw)<br />
none on /dev/pts type devpts (rw,gid=5,mode=620)<br />
usbdevfs on /proc/bus/usb type usbdevfs (rw)<br />
/dev/md1 on /boot type ext3 (rw)<br />
none on /dev/shm type tmpfs (rw)<br />
/dev/vg_roentgen/rhel3_var on /var type ext3 (rw)<br />
/dev/vg_roentgen/wheel on /wheel type ext3 (rw,acl)<br />
/dev/vg_roentgen/srv on /srv type ext3 (rw,acl)<br />
/dev/vg_roentgen/dropbox on /var/www/dropbox type ext3 (rw)<br />
/usr/share/ssl on /etc/ssl type none (rw,bind)<br />
/proc on /var/lib/bind/proc type none (rw,bind)<br />
automount(pid1503) on /net type autofs (rw,fd=5,pgrp=1503,minproto=2,maxproto=4)<br />
</pre>and all of the sites listed on [[Web Servers]] work. Were those just old arrays that aren't around anymore but are still listed in some config file? '''We haven't seen any issues, and roentgen's going to be virtualized in the not-too-distant future, so this is fairly irrelevant.'''<br />
* Gourd's been giving smartd errors, namely<code><br />
Offline uncorrectable sectors detected:<br />
/dev/sda [3ware_disk_00] - 48 Time(s)<br />
1 offline uncorrectable sectors detected<br />
</code>Okra also has an offline uncorrectable sector! '''No sign of problems since this was posted.'''<br />
<br />
== Previous Months Completed ==<br />
[[Completed in June 2007|June 2007]]<br />
<br />
[[Completed in July 2007|July 2007]]<br />
<br />
[[Completed in August 2007|August 2007]]<br />
<br />
[[Completed in September 2007|September 2007]]<br />
<br />
[[Completed in October 2007|October 2007]]<br />
<br />
[[Completed in November/December 2007|NovDec 2007]]<br />
<br />
[[Completed in January 2008|January 2008]]<br />
<br />
[[Completed in February 2008|February 2008]]</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Sysadmin_Todo_List&diff=3704Sysadmin Todo List2008-05-29T12:57:08Z<p>Steve: /* Completed */</p>
<hr />
<div>This is an unordered set of tasks. Detailed information on any of the tasks typically goes in related topics' pages, although usually not until the task has been filed under [[Sysadmin Todo List#Completed|Completed]].<br />
== Daily Check off list ==<br />
Each day when you come in check the following:<br />
# Einstein ([[Script Prototypes|script]]):<br />
## Up and running?<br />
## Disks are at less than 90% full?<br />
## Mail system OK? (spamassassin, amavisd, ...)<br />
# Temperature OK? No water blown into room?<br />
# Systems up: Taro, Pepper, Pumpkin/Corn ?<br />
# Backups:<br />
## Did backup succeed?<br />
## Does Lentil need a new disk?<br />
<br />
== Important ==<br />
<br />
=== Towards a stable setup ===<br />
<br />
Here are some thoughts, especially to Steve, about getting an über-stable setup for the servers. <br><br />
Some observations:<br />
# When we get to DeMeritt at the end of next summer, we need a setup that easily ports to the new environment. We will also be limited to a total of 10 kWatts heat load (36000 BTUs, or 3 tons of cooling), due to the cooling of the room. That sounds like a lot, but Silas and Jiang-Ming will also put servers in this space. Our footprint should be no more than 3 to 4 kWatts of heat load.<br />
# Virtual systems seem like the way to go. However, our experience with Xen is that it does not lead to highly portable VMs.<br />
# VMware Server is now a free product. They make money consulting and selling fancy add-ons. I have good experience with VMware Workstation on Macs and Linux. But it is possible that, like RedHat (which was once free), they will start charging when they reach 90% or more market share.<br />
<br />
Here are some options: <br><br />
* We get rid of Tomato, Jalapeno, Gourd and Okra and perhaps also Roentgen. If we want we can scavenge the parts from Tomato & Jalapeno (plus old einstein) for a toy system, or we park these systems in the corner. I don't want to waste time on them. The only bumps that I can think of here would be that Xemed/Aaron use Gourd. Otherwise I think we're all in favor of cutting down on the number of physical machines that we've got running. Oh, and what about the paulis? '''Since they're not under our "jurisdiction", they'll probably end up there anyhow.'''<br />
* Test VMware server (See [[VMWare Progress]]). Specifically, I would like to know:<br />
## How easy is it to move a VM from one machine to another? (Can you simply move the disks?) '''Yes.'''<br />
## Specifically, if you need to service some hardware, can you move the host to other hardware with little down time? (Clearly not for large disk arrays, like pumpkin, but that is storage, not hosts). '''Considering portability of disks/files, the downtime is the time it takes to move the image around and start up on another machine.'''<br />
## Do we need a RedHat license for each VM or do we only need a license for the host, as with Xen? '''It seems to consume a license per VM. Following [http://kbase.redhat.com/faq/FAQ_103_10754.shtm this] didn't work for the VMWare systems. The closest thing to an official word that I could find was [http://www.redhat.com/archives/taroon-list/2004-August/msg00292.html this].'''<br />
## VMware allows for "virtual appliances", but how good are these really? Are these fast enough?<br />
* Evaluate the hardware needs. Pumpkin, the new Einstein, Pepper, Taro and Lentil all seem to be of sufficient quality and up to date. Do we need other hardware? If so, what?<br />
<br />
=== Einstein Upgrade ===<br />
<br />
Einstein upgrade project and status page: [[Einstein Status]]<br />
'''Note:''' Einstein (current one) has a problem with / getting full occasionally. See [[Einstein#Special_Considerations_for_Einstein]]<br />
<br />
It seems this is not moving forward sufficiently. I think we need a new strategy to get this accomplished. My new thought is to abandon the Tomato hardware, which may have been a source of the difficulties, and use what we learned for the setup of "Einstein on RHEL5" to create a virtual machine Tomato, where we test the upgrade to RHEL5. <br />
<br />
<br />
<br />
=== Miscellaneous ===<br />
* '''MarieCurie''': Feynman's video card doesn't fit in any of mariecurie's slots (is it AGP or something?). I'm going to see if blackbody's Nvidia (which can do widescreen with the "nv" driver) fits, whenever Sarah isn't busy with her machine. If it does, then she can take the card, at least while I mess around with her ATI card in blackbody. '''blackbody's card doesn't fit in any of the slots, either. I have no clue what kind of connection it needs. The next step is to just look for and order a PCI/PCI-X NVidia card that's known to work at 1680x1050 on RedHat.''' Tried the 6200LE, had the same problems as the ATI cards!<br />
* Lepton constantly has problems printing. It seems that at least once a month the queue locks up. This machine has Fedora Core 3 installed; I wonder if it would be worthwhile to just put CentOS on it and be done with this recurring problem.<br />
* Fermi has problems allowing me to log in. nsswitch.conf looks fine, getent passwd shows all the users like it's supposed to. There are no restrictions in ''/etc/security/access.conf'', either.<br />
* Gourd won't let me (Matt) log in, saying "no such file or directory" when trying to chdir to my home, and then it boots me off. Trying to log in as root from einstein is successful just long enough for it to tell me when the last login was, then it boots me. '''(Steve here) I was able to log in and do stuff, but programs were intermittently slow.'''<br />
* Clean out some users who left a while ago. (Maurik should do this.)<br />
* '''Monitoring''': I would like to see the new temp-monitor integrated with Cacti, and fix some of the cacti capabilities, i.e. tie it in with the sensors output from pepper and taro (and tomato/einstein). Set up sensors on the corn/pumpkin. Have an intelligent way in which we are warned when conditions are too hot, a drive has failed, or a system is down. '''I'm starting to get the hang of getting this sort of data via snmp. I wrote a perl script that pulls the temperature data from the environmental monitor, as well as some nice info from einstein. We SHOULD be able to integrate a rudimentary script like this into cacti or splunk, getting a bit closer to an all-in-one monitoring solution. It's in Matt's home directory, under code/npgmon/'''<br />
* Check into smartd monitoring (and processing its output) on Pepper, Taro, Corn/Pumpkin, Einstein, Tomato.<br />
* Decommission Okra. - This system is way too outdated to bother with. Move Cacti to another system. Perhaps a VM, once we get that figured out?<br />
* Learn how to use [[cacti]]. We should consider using a VM appliance to do this, so it's minimal configuration, and since okra's only purpose is to run cacti.<br />
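The monitoring item above mentions pulling temperature readings via snmp; here is a minimal shell sketch of that step, assuming net-snmp's <code>snmpget</code> output format. The host, OID, community string, and threshold are made-up examples, not what the perl script in code/npgmon/ actually uses.<br />

```shell
# Parse a temperature value out of net-snmp "snmpget" output.
# A real poll would look something like (not executed here; host/OID/community are examples):
#   snmpget -v1 -c public envmon.example.edu .1.3.6.1.4.1.3854.1.2.2.1.16.1.3.0
parse_temp() {
    # net-snmp prints lines like: "SNMPv2-SMI::enterprises... = INTEGER: 27"
    sed -n 's/.*INTEGER: *\([0-9][0-9]*\).*/\1/p'
}

reading='SNMPv2-SMI::enterprises.3854.1.2.2.1.16.1.3.0 = INTEGER: 27'
temp=$(printf '%s\n' "$reading" | parse_temp)
echo "temperature: $temp"

# Warn when too hot (35 is an example threshold).
if [ "$temp" -ge 35 ]; then
    echo "WARNING: machine room too hot"
fi
```

A wrapper like this could feed cacti or mail the warning to root, which is roughly the "intelligent warning" the bullet asks for.<br />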
<br />
== Ongoing ==<br />
=== Documentation ===<br />
* '''<font color="red" size="+1">Maintain the Documentation of all systems!</font>'''<br />
** Main function<br />
** Hardware<br />
** OS<br />
** Network<br />
* Continue homogenizing the configurations of the machines.<br />
* Improve documentation of [[Software Issues#Mail Chain Dependencies|mail software]], specifically SpamAssassin, Cyrus, etc.<br />
=== Maintenance ===<br />
* Check e-mails to root every morning<br />
* Check up on security [http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/ch-sec-network.html#ch-wstation]<br />
* Clean up Room 202.<br />
** Start reorganizing things back into boxes for the August move.<br />
** Ask UNH if they are willing/able to recycle/reuse the three CRTs '''and old machines''' that we have sitting around. '''Give them away if we have to.'''<br />
<br />
=== On-the-Side ===<br />
* Learn how to use ssh-agent for task automation.<br />
* Backup stuff: We need exclude filters on the backups. We need to plan and execute extensive tests before modifying the production backup program. Also, see if we can implement some sort of NFS user access. '''I've set up both filters and read-only snapshot access to backups at home. It uses what essentially amounts to a bash script version of the fancy perl thing we use now, only far less sophisticated. However, the filtering uses a standard rsync exclude file (syntax in the man page) and the user access is fairly obvious NFS read-only hosting.''' <font color="green"> I am wondering if this is needed. The current scheme (i.e. the perl script) uses excludes by having a .rsync-filter in each of the directories where you want excluded contents. This has worked well. See ~maurik/tmp/.rsync-filter . The current script takes care of some important issues, like incomplete backups.</font> Ah. So we need to get users to somehow keep that .rsync-filter file fairly updated. And to get them to use data to hold things, not home. Also, I wasn't suggesting we get rid of the perl script, I was saying that I've become familiar with a number of the things it does. [http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/ch-acls.html#s2-acls-mounting-nfs]<br />
* Continue purging NIS from ancient workstations, replacing it with files. The following remain:<br />
** pauli nodes -- Low priority!<br />
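Since the backup item above revolves around per-directory <code>.rsync-filter</code> files, here is a sketch of what one might contain and how rsync is told to honor it. The patterns and paths are examples only; the real logic lives in the perl backup script.<br />

```shell
# Create an example per-directory filter file of the kind the
# backup scheme honors (the exclude patterns are examples).
home=$(mktemp -d)
cat > "$home/.rsync-filter" <<'EOF'
- scratch/
- *.tmp
- core.*
EOF

# The backup side would then run something along these lines (not executed here):
#   rsync -aF /path/to/home/ /mnt/npg-daily/NN/home/
# where -F is shorthand for a dir-merge rule that makes rsync read the
# .rsync-filter file it finds in each directory it descends into.

# Count the exclude rules we just wrote.
rules=$(grep -c '^- ' "$home/.rsync-filter")
echo "exclude rules: $rules"
```

The point of the per-directory approach is that users maintain their own filter files; the backup script never needs a central exclude list.<br />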
<br />
== Waiting ==<br />
* That guy's computer has a BIOS checksum error. Flashing the BIOS to the newest version succeeds, but doesn't fix the problem. No obvious mobo damage either. What happened? '''Who was that guy, anyhow?''' (Silviu Covrig, probably) The machine is gluon, according to him. '''Waiting on ASUS tech support for warranty info.''' Aaron said it might be power-supply-related. '''Nope. Definitely not. Used a known good PSU and still got the error; reflashed the BIOS with it and still got the error.''' Got RMA, sending out on Wednesday. '''Waiting on ASUS to send us a working one!''' Called ASUS on 8/6; they said it's getting repaired right now. '''Woohoo! Got a notification that it shipped!''' ...they didn't fix it... Still has the EXACT same error it had when we shipped it to them. '''What should we do about this?''' I'm going to call them up and have a talk, since looking at the details of their shipment reveals that they sent us a different motherboard, different serial number and everything, but with the same problem.<br />
* Printer queue for Copier: Konica Minolta Bizhub 750. IP=pita.unh.edu '''Seems like we need info from the Konica guy to get it set up on Red Hat. The installation documentation for the driver doesn't mention things like the passcode, because those are machine-specific. Katie says that if he doesn't come on Monday, she'll make an inquiry.''' <font color="green">Mac OS X now working, IT guy should be here week of June 26th</font> '''Did he ever come?''' No, he didn't, and did not respond to a voice message left. Will call again.<br />
* Sent an email to UNH Property Control asking what the procedure is to get rid of untagged equipment, namely, the two old monitors in the corner. Apparently they want us to fill out lots of information on the scrapping form like if it was paid for with government money, etc, as well as give them serial numbers, model numbers, and everything we can get ahold of. Then, we get to hang onto them until the hazardous equipment people come in and take it out, at their leisure. Waiting to figure out what we want to do with them.<br />
<br />
== Completed ==<br />
* '''Lentil''': Gotta reinstall a whole bunch of things and/or a new disk; looks like there was some damage from the power problem on Monday (the size 0 files have returned).<br />
* '''jalapeno hangups:''' Look at sensors on jalapeno, so that cacti can monitor the temp. The crashing probably isn't the splunk beta (no longer beta!), since it runs entirely in userspace. '''lm_sensors fails to detect anything readable. Is there a way around this?''' Jalapeno's been on for two weeks with no problems, let's keep our fingers crossed&hellip; '''This system is too unstable to maintain, like tomato and old einstein.''' Got an e-mail today, saying it's got a degraded array. I just turned it off since it's just a crappy space heater at this point.<br />
* Resize/clean up partitions as necessary. Seems to be a running trend that a computer gets 0 free space and problems crop up. '''This hasn't happened in half a year. I think it was a coincidence that a few computers had it happen at once.'''<br />
* Put new drive in lentil, npg-daily-33. '''That's good, because 32 is almost full already. 81%!'''<br />
* <b><font color="red">CLAMAV died and no one noticed!</font></b> The update of clamav (the mail virus scanner on einstein) on April 23rd killed this mail subsystem because some of the options in /etc/clamd.conf were now obsolete. See http://www.sfr-fresh.com/unix/misc/clamav-0.93.tar.gz:a/clamav-0.93/NEWS. This seems to have gone unnoticed for a while. Are we asleep at the wheel? Edited /etc/clamd.conf to comment out these options.<br />
* When I came in today (22nd), taro had kernel panicked and einstein was acting strangely. Checking root's email, I saw that all day the 21st and 22nd, there were SMTP errors, around 2 per minute. A quick glance at them gives me the impression that they're spam attempts, due to ridiculous FROM fields like <code>pedrofinancialcompany.inc.net@tiscali.dk</code>. I rebooted taro and einstein, everything seems fine now.<br />
* Pauli crashes nearly every day, not when backups come around. We need to set up detailed system logging to find out why. Pauli2 and 4 don't give out their data via /net to the other paulis. This doesn't seem to be an autofs setting, since I see nothing about it in the working nodes' configs. Similarly, 2,4, and 6 won't access the other paulis via /net. 2,4 were nodes we rebuilt this summer, so it makes sense they don't have the right settings, but 6 is a mystery. Pauli2's hard drive may be dying. Some files in /data are inaccessible, and smartctl shows a large number of errors (98 if I'm reading this right...). Time to get Heisenberg a new hard drive? '''Or maybe just wean him off of NPG&hellip;''' It may be done for; can't connect to pauli2 and rebooting didn't seem to work. Need to set up the monitor/keyboard for it & check things out. '''The pauli nodes are all off for now. They've been deemed to produce more heat than they're worth. We'll leave them off until Heisenberg complains.''' Heisenberg's complaining now. Fixed his pauli machine by walking in the room (still don't know what he was talking about) and dirac had LDAP shut off. He wants the paulis up whenever possible, which I explained could be awhile because of the heat issues. ''' Pauli doesn't crash anymore, as far as I can tell. Switching the power supply seems to have done it.'''<br />
* Pumpkin is now stable. Read more on the configuration at [[Pumpkin]] and [[Xen]].<br />
* Roentgen was plugged into one of the non-battery-backup slots of its UPS, so I shut it down and moved the plug. After starting back up, root got a couple of mysterious e-mails about /dev/md0 and /dev/md2: <code>Array /dev/md2 has experienced event "DeviceDisappeared"</code>. However, <code>mount</code> seems to indicate that everything important is around:<br />
<pre><br />
/dev/vg_roentgen/rhel3 on / type ext3 (rw,acl)<br />
none on /proc type proc (rw)<br />
none on /dev/pts type devpts (rw,gid=5,mode=620)<br />
usbdevfs on /proc/bus/usb type usbdevfs (rw)<br />
/dev/md1 on /boot type ext3 (rw)<br />
none on /dev/shm type tmpfs (rw)<br />
/dev/vg_roentgen/rhel3_var on /var type ext3 (rw)<br />
/dev/vg_roentgen/wheel on /wheel type ext3 (rw,acl)<br />
/dev/vg_roentgen/srv on /srv type ext3 (rw,acl)<br />
/dev/vg_roentgen/dropbox on /var/www/dropbox type ext3 (rw)<br />
/usr/share/ssl on /etc/ssl type none (rw,bind)<br />
/proc on /var/lib/bind/proc type none (rw,bind)<br />
automount(pid1503) on /net type autofs (rw,fd=5,pgrp=1503,minproto=2,maxproto=4)<br />
</pre>and all of the sites listed on [[Web Servers]] work. Were those just old arrays that aren't around anymore but are still listed in some config file? '''We haven't seen any issues, and roentgen's going to be virtualized in the not-too-distant future, so this is fairly irrelevant.'''<br />
* Gourd's been giving smartd errors, namely<code><br />
Offline uncorrectable sectors detected:<br />
/dev/sda [3ware_disk_00] - 48 Time(s)<br />
1 offline uncorrectable sectors detected<br />
</code>Okra also has an offline uncorrectable sector! '''No sign of problems since this was posted.'''<br />
<br />
== Previous Months Completed ==<br />
[[Completed in June 2007|June 2007]]<br />
<br />
[[Completed in July 2007|July 2007]]<br />
<br />
[[Completed in August 2007|August 2007]]<br />
<br />
[[Completed in September 2007|September 2007]]<br />
<br />
[[Completed in October 2007|October 2007]]<br />
<br />
[[Completed in November/December 2007|NovDec 2007]]<br />
<br />
[[Completed in January 2008|January 2008]]<br />
<br />
[[Completed in February 2008|February 2008]]</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Sysadmin_Todo_List&diff=3703Sysadmin Todo List2008-05-29T12:56:53Z<p>Steve: /* Miscellaneous */</p>
<hr />
<div>This is an unordered set of tasks. Detailed information on any of the tasks typically goes in related topics' pages, although usually not until the task has been filed under [[Sysadmin Todo List#Completed|Completed]].<br />
== Daily Check off list ==<br />
Each day when you come in, check the following:<br />
# Einstein ([[Script Prototypes|script]]):<br />
## Up and running?<br />
## Disks are at less than 90% full?<br />
## Mail system OK? (spamassassin, amavisd, ...)<br />
# Temperature OK? No water blown into room?<br />
# Systems up: Taro, Pepper, Pumpkin/Corn ?<br />
# Backups:<br />
## Did backup succeed?<br />
## Does Lentil need a new disk?<br />
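The "disks less than 90% full" item in the checklist above is easy to script; here is a hedged sketch, not the actual [[Script Prototypes]] script. The helper name, sample data, and threshold are examples.<br />

```shell
# Print every mount point at or above a given fill percentage,
# reading "df -P"-style output on stdin.
check_disks() {
    awk -v t="$1" 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 >= t) print $6 }'
}

# On einstein itself one would run:  df -P | check_disks 90
# Demonstrated here on canned output so the result is predictable:
sample='Filesystem 1024-blocks Used Available Capacity Mounted-on
/dev/sda1 1000 950 50 95% /
/dev/sdb1 1000 500 500 50% /home'

full=$(printf '%s\n' "$sample" | check_disks 90)
echo "over threshold: $full"
```

Run from cron, a check like this could mail root only when something crosses the threshold, instead of relying on the morning walk-through.<br />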
<br />
== Important ==<br />
<br />
=== Towards a stable setup ===<br />
<br />
Here are some thoughts, especially to Steve, about getting an über-stable setup for the servers. <br><br />
Some observations:<br />
# When we get to DeMeritt at the end of next summer, we need a setup that easily ports to the new environment. We will also be limited to a total of 10 kWatts heat load (36000 BTUs, or 3 tons of cooling), due to the cooling of the room. That sounds like a lot, but Silas and Jiang-Ming will also put servers in this space. Our footprint should be no more than 3 to 4 kWatts of heat load.<br />
# Virtual systems seem like the way to go. However, our experience with Xen is that it does not lead to highly portable VMs.<br />
# VMware Server is now a free product. They make money consulting and selling fancy add-ons. I have good experience with VMware Workstation on Macs and Linux. But it is possible (like RedHat, which was once free) that they will start charging when they reach 90% or more market share.<br />
<br />
</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Client_Recipe&diff=3701Client Recipe2008-05-23T20:35:38Z<p>Steve: </p>
<hr />
<div>Much of this work is done by copying the file ''einstein:/root/client.tar'' to the new system's root and doing <code>tar -xf client.tar</code>. It's still necessary to customize the network settings, of course.<br />
<br />
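For the VLAN case, the recipe below ends up producing an interface file roughly like the following. This is a sketch only: the IP address is a placeholder, and only the gateway (132.177.88.1), the device/alias names, and the <code>VLAN=yes</code> line come from the recipe itself.<br />

```shell
# /etc/sysconfig/network-scripts/ifcfg-unh  (illustrative values only)
DEVICE=eth0.2
ONBOOT=yes
BOOTPROTO=static
IPADDR=132.177.88.50      # placeholder: whatever was registered for the client
NETMASK=255.255.255.0
GATEWAY=132.177.88.1
VLAN=yes                  # the line added by hand in the recipe
```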
A simple ''n''-step process to set up a client lickety-split:<br />
# Install Fedora in the typical fashion, skipping the steps for creating a default user and network authentication<br />
# Log in as root<br />
# Run system-config-network<br />
# If there isn't one already, add an ethernet device on eth0.<br />
# If this client is not in the server room (and therefore not going to use a VLAN), skip to the next full step<br />
## Choose to statically set the IP address to an available local number (10.0.0.*)<br />
## Give the device the alias "farm".<br />
## Run <code>vconfig add eth0 2</code> to create a virtual device "eth0.2"<br />
## Use system-config-network to add an ethernet device to eth0.2<br />
# Alias it "unh"<br />
# Choose to statically set the IP address to whatever was registered for the client<br />
# Set the gateway to 132.177.88.1<br />
# Under the general network configuration "DNS" tab, put the appropriate IPs of einstein and roentgen for primary and secondary DNS (local for farm as the primary connection, unh for unh as the primary connection)<br />
# Save the changes made with system-config-network<br />
# If a virtual device was added:<br />
## Open /etc/sysconfig/network-scripts/ifcfg-unh in a text editor<br />
## Add the line <code>VLAN=yes</code>, and save<br />
# If there are any more devices already present, disable, remove or configure them as well. Whatever you do, don't leave them defaulted to DHCP mode, otherwise their existence will change /etc/resolv.conf !<br />
# Run gtk-authconfig<br />
# Check "Enable LDAP Support" under the "User Information" and "Authentication" tabs<br />
# Click "Configure LDAP..."<br />
# The base DN is dc=physics,dc=unh,dc=edu and the server is einstein.unh.edu.<br />
# "Download CA Certificate" doesn't ever seem to work, so get "unh_physics_ca.crt" from einstein and put it in /etc/openldap/cacerts (hint: <code>scp</code>).<br />
# Click OK in LDAP Settings.<br />
# Click OK in authconfig<br />
# Copy the appropriate content into the [[Autofs Configuration Files]]<br />
# Reboot</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Backups&diff=3696Backups2008-05-23T14:25:30Z<p>Steve: /* Current Backup System */</p>
<hr />
<div>NPG backups run from the dedicated backup server, [[Lentil]], which has four hot-swappable drive bays, generally containing SATA drives.<br />
<br />
To put in a new (fresh) drive:<br />
# Locate the oldest disk.<br />
# Make sure it is not mounted.<br />
# Open the appropriate drive door and slide out drive.<br />
# Put new drive in. (there are 4 screws holding the drive in place).<br />
# Slide it back in. Take note which Linux drive it registers as: /dev/sdb or /dev/sdc or /dev/sdd or /dev/sde <br/> NOTE: The order does NOT correspond with the slots, and this order can change after a reboot!<br />
# Run <code>/usr/local/bin/format_archive_disk.pl <disk no> <device></code> <br/>e.g. <code>/usr/local/bin/format_archive_disk.pl 29 /dev/sde</code><br />
# Check that the drive is available: <code>ls /mnt/npg-daily/29</code><br />
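Step 1 above ("locate the oldest disk") can be scripted. A minimal sketch, assuming the archive disks show up as numbered directories under the mount base (as in <code>/mnt/npg-daily/29</code>) and that lower numbers mean older disks; <code>oldest_archive_disk</code> is an invented helper name, not an existing script:

```shell
#!/bin/bash
# oldest_archive_disk BASE: print the lowest-numbered (oldest) archive
# disk directory under BASE, e.g. "29" for /mnt/npg-daily/29.
# Hypothetical helper -- assumes lower disk numbers are older disks.
oldest_archive_disk() {
    local base="$1"
    ls "$base" | grep -E '^[0-9]+$' | sort -n | head -n 1
}
```

Running <code>oldest_archive_disk /mnt/npg-daily</code> would then tell you which disk to pull.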
== Current Backup System ==<br />
Newer backups hard-link files that haven't changed between backup sessions, so unchanged data is stored only once. It seems like a good idea, but we need to learn exactly how it works. (It's a poor man's version of Apple's Time Machine.)<br />
For old backups in the new format, consolidation works by putting all the data from each backup session into one place, overwriting with the newest data. Nobody's going to look for a specific version of a file they had in 2004 that only existed for three days, so this method is relatively safe, in terms of data retention.<br />
<br />
The script that does the backing-up is ''/usr/local/bin/rsync_backup.pl'' and the script that periodically runs it and sends out a notification e-mail is ''/etc/cron.daily/rsync_backup''. ''[[rsync_backup.pl]]'' determines what disk to put the backup onto, etc. Client machines must have Lentil's public SSH key, and Lentil must have the appropriate [[Autofs Configuration Files#auto.npg-daily for lentil|automount configuration]].<br />
<br />
[http://www.mikerubel.org/computers/rsync_snapshots/ Here] is a nice little guide on incremental, hardlinked backups via rsync. He sets up some nice little tricks with NFS mounts so that users can access their stuff read-only for any backup that's actually stored. We should do this.<br />
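The core trick from that guide can be sketched with rsync's <code>--link-dest</code> option. This is only an illustration of the technique, not our production setup (''rsync_backup.pl'' handles disk selection, incomplete backups, etc.); the <code>snapshot</code> helper and its paths are invented for the example, and absolute paths should be used so <code>--link-dest</code> resolves correctly:

```shell
#!/bin/bash
# snapshot SRC DEST: make a dated snapshot of SRC under DEST (absolute
# paths). Files unchanged since the previous snapshot become hard links
# into it, so each snapshot looks complete but only changed data uses
# extra space.
snapshot() {
    local src="$1" dest="$2"
    local today="$dest/$(date +%Y-%m-%d_%H%M%S)"
    mkdir -p "$dest"
    if [ -e "$dest/latest" ]; then
        rsync -a --delete --link-dest="$dest/latest" "$src/" "$today/"
    else
        rsync -a "$src/" "$today/"
    fi
    # "latest" always points at the newest snapshot
    ln -snf "$today" "$dest/latest"
}
```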
<br />
On lentil, pre-HDD-change, perl was obliterated at 0:16:00 on 2007-6-21. This date is BEFORE we started even looking at perl stuff. Its filesize was 0 bytes. A quick fix was to overwrite the 0-byte lentil perl binary with a copy of improv's perl binary. Using rpm to force reinstall perl-5.8.5 from yum's cache restored the correct version. The cause was later found to be due to the drive going bad. <font color="red">2008-05-23:</font> Since this has happened again, I've tar'd up the backup script, ''/etc/ssh/'', and the automount configs and saved them to ''einstein:/root/lentil.tar''.<br />
<br />
[http://rsync.samba.org/documentation.html rsync documentation]<br />
<br />
===Client Configuration===<br />
An important aspect of the current backup system is that it uses ssh to reach the node being backed up: the rsync program pulls data from the node over ssh. This requires a special ssh setup on each node. Each node has a file ''/etc/rsync-backup.conf'' that controls what is backed up for that node. The backup system then executes a remote command on the node: <code>rsync --server --daemon --config=/etc/rsync-backup.conf .</code>. In the ssh configuration file (''/root/.ssh/authorized_keys'') the node has a special line allowing this command to be executed by Lentil. Don't forget to have the file ''/etc/rsync-backup.conf'' on each machine, with something meaningful in it. ''/root/debug_rsync'' is also needed.<br />
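For reference, a hedged sketch of what the pieces look like. The restriction options and key material in this ''authorized_keys'' line are placeholders, not the actual entry on our nodes:

<pre>
command="rsync --server --daemon --config=/etc/rsync-backup.conf .",no-pty,no-port-forwarding,no-X11-forwarding ssh-rsa AAAAB3...lentil-public-key... root@lentil
</pre>

Likewise, a minimal ''/etc/rsync-backup.conf'' (it uses ordinary rsyncd.conf module syntax; the module name and path here are examples only):

<pre>
[home]
    path = /home
    read only = yes
</pre>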
<br />
===Important to note===<br />
* '''Do NOT use disks smaller than 350 GB for backup!!''', since those will not even fit one copy of what needs to be backed up.<br />
* The link /mnt/npg-daily-current must exist and point to an actual drive.<br />
<br />
== Legacy Backups ==<br />
The really old amanda-style backups are tar'ed, gzip'ed, and have weird header info. Looking at the head of them gives instructions for extraction.<br />
A script was written to extract and consolidate these backups. It shrinks hundreds of gigs down to tens of gigs, and zipping that shrinks it further. Very handy for OLD files we're not going to look at ever again.<br />
<br />
===Amanda backup consolidator===<br />
<pre>#!/bin/bash<br />
# This script was designed to extract data from the old tape-style backups<br />
# and put the data in an up-to-date (according to the backups) directory<br />
# tree. We can then, either manually or a different script, tar that into a<br />
# comprehensive backup. This should be quite a bit more space-efficient than<br />
# incrementals.<br />
# -----------------------------<br />
# My first attempt at a "smart" bash script, and one that takes input.<br />
# Probably not the best way to do it, but it works!<br />
# ~Matt Minuti<br />
if [ -z "$1" ]<br />
then<br />
echo "Syntax: amandaextract.sh [string to search for]"<br />
echo "This script searches /mnt/tmp for files containing the"<br />
echo "given string, and does the appropriate extraction"<br />
exit<br />
fi<br />
# Test to see if destination directory exists<br />
if [ -d "/mnt/tmp2/$1" ]<br />
then<br />
echo "Directory /mnt/tmp2/$1 already exists."<br />
else<br />
mkdir "/mnt/tmp2/$1" # If it doesn't, make it!<br />
fi<br />
<br />
NPG=$( ls /mnt/tmp/ ) # Set where to look for backup tree<br />
for i in $NPG; do # Cycle through the folders in order to...<br />
cd /mnt/tmp/$i<br />
cd ./data/<br />
FILES=$( ls ) # Get a listing of files<br />
for j in $( echo "$FILES" | grep "\.$1" ) ; do <br />
echo "Extracting file $j from $( pwd ) to /mnt/tmp2/$1"<br />
dd if=$j bs=32k skip=1 | /usr/bin/gzip -dc | /bin/tar -xf - -C /mnt/tmp2/$1<br />
done # The above for statement takes each matching file and extracts<br />
done # it to the desired location.<br />
</pre><br />
An example of how I've been using it: I have an amanda backup drive on /mnt/tmp, and an empty drive (or one with enough space) on /mnt/tmp2. Running <code>amandaextract.sh einstein</code> goes into each folder in /mnt/tmp, looks for anything with a name containing "einstein", and does the appropriate extraction into /mnt/tmp2/einstein. The effect is that the oldest backups are extracted first and then overwritten with newer and newer partial backups, ending with the most up-to-date backup in that amanda set. I then tar and bzip2 the resultant folder structure to save even more space.<br />
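The header-stripping pipeline in the loop above can also be isolated into a standalone function. This is only a sketch based on the script, assuming (as the script does) that each amanda dump is a 32 KiB header followed by a gzipped tar stream; <code>extract_amanda_dump</code> is an invented name:

```shell
#!/bin/bash
# extract_amanda_dump DUMPFILE DESTDIR: skip the 32 KiB amanda header,
# then decompress and untar the payload into DESTDIR.
extract_amanda_dump() {
    local dump="$1" dest="$2"
    mkdir -p "$dest"
    dd if="$dump" bs=32k skip=1 2>/dev/null | gzip -dc | tar -xf - -C "$dest"
}
```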
<br />
== Emergency Backup ==<br />
An easy way to make a backup of, say, lentil when its root drive is dying, is to use the program ''dd_rescue'' from a rescue disc, to copy the drive contents to another. The backup can then be mounted as a loopback device to access its contents.<br />
<br />
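In practice that looks something like the following; the device and image path are examples only (run from a rescue environment, and image the partition rather than the whole disk so it can be loop-mounted directly):

<pre>
# Copy the dying root partition to an image file on a healthy disk
dd_rescue /dev/sda1 /mnt/spare/lentil-root.img

# Later: mount the image read-only via loopback to pull files out
mkdir -p /mnt/rescue
mount -o loop,ro /mnt/spare/lentil-root.img /mnt/rescue
</pre>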
This came in handy in the case of a certain computer, say, lentil. We forgot to copy the autofs scripts and ssh keys, but it wasn't a big deal since we just mounted the drive image and bam! Everything was nice.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Backups&diff=3695Backups2008-05-23T13:05:51Z<p>Steve: /* Current Backup System */</p>
<hr />
<div>NPG backups run from the dedicated backup server, [[Lentil]], which has four hot-swappable drive bays, generally containing SATA drives.<br />
<br />
To put in a new (fresh) drive:<br />
# Locate the oldest disk.<br />
# Make sure it is not mounted.<br />
# Open the appropriate drive door and slide out drive.<br />
# Put new drive in. (there are 4 screws holding the drive in place).<br />
# Slide it back in. Take note which Linux drive it registers as: /dev/sdb or /dev/sdc or /dev/sdd or /dev/sde <br/> NOTE: The order does NOT correspond with the slots, and this order can change after a reboot!<br />
# Run <code>/usr/local/bin/format_archive_disk.pl <disk no> <device></code> <br/>e.g. <code>/usr/local/bin/format_archive_disk.pl 29 /dev/sde</code><br />
# Check that the drive is available: <code>ls /mnt/npg-daily/29</code><br />
== Current Backup System ==<br />
Newer backups do something involving hard-linking files that haven't changed between backup sessions. Seems like a good idea, but we need to learn exactly how it works. (It's a poor-man's version of the Apple Time-Machine.)<br />
For old backups in the new format, consolidation works by putting all the data from each backup session into one place, overwriting with the newest data. Nobody's going to look for a specific version of a file they had in 2004 that only existed for three days, so this method is relatively safe, in terms of data retention.<br />
<br />
The script that does the backing-up is ''/usr/local/bin/rsync_backup.pl'' and the script that periodically runs it and sends out a notification e-mail is ''/etc/cron.daily/rsync_backup''. ''[[rsync_backup.pl]]'' determines what disk to put the backup onto, etc. Client machines must have Lentil's public SSH key, and Lentil must have the appropriate [[Autofs Configuration Files#auto.npg-daily for lentil|automount configuration]].<br />
<br />
[http://www.mikerubel.org/computers/rsync_snapshots/ Here] is a nice little guide on incremental, hardlinked backups via rsync. He sets up some nice little tricks with NFS mounts so that users can access their stuff read-only for any backup that's actually stored. We should do this.<br />
<br />
On lentil, <font color="red">pre-HDD-change</font>, perl was obliterated at 0:16:00 on 2007-6-21. This date is BEFORE we started even looking at perl stuff. Its filesize was 0 bytes. A quick fix was to overwrite the 0-byte lentil perl binary with a copy of improv's perl binary. Using rpm to force reinstall perl-5.8.5 from yum's cache restored the correct version. The cause was later found to be due to the drive going bad.<br />
<br />
[http://rsync.samba.org/documentation.html rsync documentation]<br />
<br />
===Client Configuration===<br />
An important aspect of the current backup system is that it uses ssh to reach the node being backed up: the rsync program pulls data from the node over ssh. This requires a special ssh setup on each node. Each node has a file ''/etc/rsync-backup.conf'' that controls what is backed up for that node. The backup system then executes a remote command on the node: <code>rsync --server --daemon --config=/etc/rsync-backup.conf .</code>. In the ssh configuration file (''/root/.ssh/authorized_keys'') the node has a special line allowing this command to be executed by Lentil. Don't forget to have the file ''/etc/rsync-backup.conf'' on each machine, with something meaningful in it. ''/root/debug_rsync'' is also needed.<br />
<br />
===Important to note===<br />
* '''Do NOT use disks smaller than 350 GB for backup!!''', since those will not even fit one copy of what needs to be backed up.<br />
* The link /mnt/npg-daily-current must exist and point to an actual drive.<br />
<br />
== Legacy Backups ==<br />
The really old amanda-style backups are tar'ed, gzip'ed, and have weird header info. Looking at the head of them gives instructions for extraction.<br />
A script was written to extract and consolidate these backups. It shrinks hundreds of gigs down to tens of gigs, and zipping that shrinks it further. Very handy for OLD files we're not going to look at ever again.<br />
<br />
===Amanda backup consolidator===<br />
<pre>#!/bin/bash<br />
# This script was designed to extract data from the old tape-style backups<br />
# and put the data in an up-to-date (according to the backups) directory<br />
# tree. We can then, either manually or a different script, tar that into a<br />
# comprehensive backup. This should be quite a bit more space-efficient than<br />
# incrementals.<br />
# -----------------------------<br />
# My first attempt at a "smart" bash script, and one that takes input.<br />
# Probably not the best way to do it, but it works!<br />
# ~Matt Minuti<br />
if [ -z "$1" ]<br />
then<br />
echo "Syntax: amandaextract.sh [string to search for]"<br />
echo "This script searches /mnt/tmp for files containing the"<br />
echo "given string, and does the appropriate extraction"<br />
exit<br />
fi<br />
# Test to see if destination directory exists<br />
if [ -d "/mnt/tmp2/$1" ]<br />
then<br />
echo "Directory /mnt/tmp2/$1 already exists."<br />
else<br />
mkdir "/mnt/tmp2/$1" # If it doesn't, make it!<br />
fi<br />
<br />
NPG=$( ls /mnt/tmp/ ) # Set where to look for backup tree<br />
for i in $NPG; do # Cycle through the folders in order to...<br />
cd /mnt/tmp/$i<br />
cd ./data/<br />
FILES=$( ls ) # Get a listing of files<br />
for j in $( echo "$FILES" | grep "\.$1" ) ; do <br />
echo "Extracting file $j from $( pwd ) to /mnt/tmp2/$1"<br />
dd if=$j bs=32k skip=1 | /usr/bin/gzip -dc | /bin/tar -xf - -C /mnt/tmp2/$1<br />
done # The above for statement takes each matching file and extracts<br />
done # it to the desired location.<br />
</pre><br />
An example of how I've been using it: I have an amanda backup drive on /mnt/tmp, and an empty drive (or one with enough space) on /mnt/tmp2. Running <code>amandaextract.sh einstein</code> goes into each folder in /mnt/tmp, looks for anything with a name containing "einstein", and does the appropriate extraction into /mnt/tmp2/einstein. The effect is that the oldest backups are extracted first and then overwritten with newer and newer partial backups, ending with the most up-to-date backup in that amanda set. I then tar and bzip2 the resultant folder structure to save even more space.<br />
<br />
== Emergency Backup ==<br />
An easy way to make a backup of, say, lentil when its root drive is dying, is to use the program ''dd_rescue'' from a rescue disc, to copy the drive contents to another. The backup can then be mounted as a loopback device to access its contents.<br />
<br />
This came in handy in the case of a certain computer, say, lentil. We forgot to copy the autofs scripts and ssh keys, but it wasn't a big deal since we just mounted the drive image and bam! Everything was nice.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Sysadmin_Todo_List&diff=3694Sysadmin Todo List2008-05-22T20:36:10Z<p>Steve: /* Miscellaneous */</p>
<hr />
<div>This is an unordered set of tasks. Detailed information on any of the tasks typically goes in related topics' pages, although usually not until the task has been filed under [[Sysadmin Todo List#Completed|Completed]].<br />
== Daily Check off list ==<br />
Each day when you come in, check the following:<br />
# Einstein ([[Script Prototypes|script]]):<br />
## Up and running?<br />
## Disks are at less than 90% full?<br />
## Mail system OK? (SpamAssassin, amavisd, ...)<br />
# Temperature OK? No water blown into room?<br />
# Systems up: Taro, Pepper, Pumpkin/Corn ?<br />
# Backups:<br />
## Did backup succeed?<br />
## Does Lentil need a new disk?<br />
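The "disks less than 90% full" check above is easy to script. A minimal sketch, assuming POSIX <code>df -P</code> output on stdin; the function name is invented and this is not the actual [[Script Prototypes|script]]:

```shell
#!/bin/bash
# disks_over_threshold [PERCENT]: read `df -P` output on stdin and print
# any mount point at or above PERCENT (default 90) capacity.
disks_over_threshold() {
    local limit="${1:-90}"
    awk -v lim="$limit" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= lim) print $6, use "%" }'
}

# Typical use: df -P | disks_over_threshold 90
```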
<br />
== Important ==<br />
<br />
=== Towards a stable setup ===<br />
<br />
Here are some thoughts, especially to Steve, about getting an über-stable setup for the servers. <br><br />
Some observations:<br />
# When we get to DeMeritt at the end of next summer, we need a setup that easily ports to the new environment. We will also be limited to a total of 10 kWatts heat load (36000 BTUs, or 3 tons of cooling), due to the cooling of the room. That sounds like a lot, but Silas and Jiang-Ming will also put servers in this space. Our footprint should be no more than 3 to 4 kWatts of heat load.<br />
# Virtual systems seems like the way to go. However, our experience with Xen is that it does not lead to highly portable VMs.<br />
# VMware Server is now a free product. They make money consulting and selling fancy add-ons. I have good experience with VMware Workstation on Mac's and Linux. But it is possible (like RedHat which was once free) that they will start charging when they reach 90% or more market share.<br />
<br />
Here are some options: <br><br />
* We get rid of Tomato, Jalapeno, Gourd and Okra and perhaps also Roentgen. If we want we can scavenge the parts from Tomato & Jalapeno (plus old einstein) for a toy system, or we park these systems in the corner. I don't want to waste time on them. The only bumps that I can think of here would be that Xemed/Aaron use Gourd. Otherwise I think we're all in favor of cutting down on the number of physical machines that we've got running. Oh, and what about the paulis? '''Since they're not under our "jurisdiction", they'll probably end up there anyhow.'''<br />
* Test VMware server (See [[VMWare Progress]]). Specifically, I would like to know:<br />
## How easy is it to move a VM from one hardware to another? (Can you simply move the disks?) '''Yes.'''<br />
## Specifically, if you need to service some hardware, can you move the host to other hardware with little down time? (Clearly not for large disk arrays, like pumpkin, but that is storage, not hosts). '''Considering portability of disks/files, the downtime is the time it takes to move the image around and start up on another machine.'''<br />
## Do we need a RedHat license for each VM or do we only need a license for the host, as with Xen? '''It seems to consume a license per VM. Following [http://kbase.redhat.com/faq/FAQ_103_10754.shtm this] didn't work for the VMWare systems. The closest thing to an official word that I could find was [http://www.redhat.com/archives/taroon-list/2004-August/msg00292.html this].'''<br />
## VMware allows for "virtual appliances", but how good are these really? Are these fast enough?<br />
* Evaluate the hardware needs. Pumpkin, the new Einstein, Pepper, Taro and Lentil seem to be all sufficient quality and up to date. Do we need another HW? If so, what?<br />
<br />
=== Einstein Upgrade ===<br />
<br />
Einstein upgrade project and status page: [[Einstein Status]]<br />
'''Note:''' Einstein (current one) has a problem with / getting full occasionally. See [[Einstein#Special_Considerations_for_Einstein]]<br />
<br />
It seems this is not moving forward sufficiently. I think we need a new strategy to get this accomplished. My new thought is to abandon the Tomato hardware, which may have been a source of the difficulties, and use what we learned for the setup of "Einstein on RHEL5" to create a virtual machine Tomato, where we test the upgrade to RHEL5. <br />
<br />
<br />
<br />
=== Miscellaneous ===<br />
* '''Lentil''': Gotta reinstall a whole bunch of things and/or a new disk; looks like there was some damage from the power problem on Monday (the size 0 files have returned).<br />
* '''MarieCurie''': Feynman's video card doesn't fit in any of mariecurie's slots (is it AGP or something?). I'm going to see if blackbody's Nvidia (which can do widescreen with the "nv" driver) fits, whenever Sarah isn't busy with her machine. If it does, then she can take the card, at least while I mess around with her ATI card in blackbody. '''blackbody's card doesn't fit in any of the slots, either. I have no clue what kind of connection it needs. The next step is to just look for and order a PCI/PCI-X NVidia card that's known to work at 1680x1050 on RedHat.''' Tried the 6200LE, had the same problems as the ATI cards!<br />
* Lepton constantly has problems printing. It seems that at least once a month the queue locks up. This machine has Fedora Core 3 installed; I wonder if it would be worth it to just put CentOS on it and be done with this recurring problem.<br />
* Fermi has problems allowing me to log in. nsswitch.conf looks fine, getent passwd shows all the users like it's supposed to. There are no restrictions in ''/etc/security/access.conf'', either.<br />
* Gourd won't let me (Matt) log in, saying no such file or directory when trying to chdir to my home, and then it boots me off. Trying to log in as root from einstein is successful just long enough for it to tell me when the last login was, then boots me. '''(Steve here) I was able to log in and do stuff, but programs were intermittently slow.'''<br />
* Clean out some users who have left a while ago. (Maurik should do this.)<br />
* '''Monitoring''': I would like to see the new temp-monitor integrated with Cacti, and fix some of the cacti capabilities, i.e. tie it in with the sensors output from pepper and taro (and tomato/einstein). Setup sensors on the corn/pumpkin. Have an intelligent way in which we are warned when conditions are too hot, a drive has failed, a system is down. '''I'm starting to get the hang of getting this sort of data via snmp. I wrote a perl script that pulls the temperature data from the environmental monitor, as well as some nice info from einstein. We SHOULD be able to integrate a rudimentary script like this into cacti or splunk, getting a bit closer to an all-in-one monitoring solution. It's in Matt's home directory, under code/npgmon/'''<br />
* Check into smartd monitoring (and processing its output) on Pepper, Taro, Corn/Pumpkin, Einstein, Tomato.<br />
* Decommission Okra. - This system is way too outdated to bother with it. Move Cacti to another system. Perhaps a VM, once we get that figured out?<br />
* Learn how to use [[cacti]]. We should consider using a VM appliance to do this, so it's minimal configuration, and since okra's only purpose is to run cacti.<br />
<br />
== Ongoing ==<br />
=== Documentation ===<br />
* '''<font color="red" size="+1">Maintain the Documentation of all systems!</font>'''<br />
** Main function<br />
** Hardware<br />
** OS<br />
** Network<br />
* Continue homogenizing the configurations of the machines.<br />
* Improve documentation of [[Software Issues#Mail Chain Dependencies|mail software]], specifically SpamAssassin, Cyrus, etc.<br />
=== Maintenance ===<br />
* Check e-mails to root every morning<br />
* Check up on security [http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/ch-sec-network.html#ch-wstation]<br />
* Clean up Room 202.<br />
** Start reorganizing things back into boxes for the August move.<br />
** Ask UNH if they are willing/able to recycle/reuse the three CRTs '''and old machines''' that we have sitting around. '''Give them away if we have to.'''<br />
<br />
=== On-the-Side ===<br />
* Learn how to use ssh-agent for task automation.<br />
* Backup stuff: We need exclude filters on the backups. We need to plan and execute extensive tests before modifying the production backup program. Also, see if we can implement some sort of NFS user access. '''I've set up both filters and read-only snapshot access to backups at home. Uses what essentially amounts to a bash script version of the fancy perl thing we use now, only far less sophisticated. However, the filtering and user access uses a standard rsync exclude file (syntax in man page) and the user access is fairly obvious NFS read-only hosting.''' <font color="green"> I am wondering if this is needed. The current scheme (i.e. the perl script) uses excludes by having a .rsync-filter in each of the directories where you want excluded contents. This has worked well. See ~maurik/tmp/.rsync-filter . The current script takes care of some important issues, like incomplete backups.</font> Ah. So we need to get users to somehow keep that .rsync-filter file fairly updated. And to get them to use data to hold things, not home. Also, I wasn't suggesting we get rid of the perl script, I was saying that I've become familiar with a number of the things it does. [http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/Deployment_Guide-en-US/ch-acls.html#s2-acls-mounting-nfs]<br />
* Continue purging NIS from ancient workstations, replacing it with files. The following remain:<br />
** pauli nodes -- Low priority!<br />
<br />
== Waiting ==<br />
* That guy's computer has a BIOS checksum error. Flashing the BIOS to the newest version succeeds, but doesn't fix the problem. No obvious mobo damage either. What happened? '''Who was that guy, anyhow?''' (Silviu Covrig, probably) The machine is gluon, according to him. '''Waiting on ASUS tech support for warranty info''' Aaron said it might be power-supply-related. '''Nope. Definitely not. Used a known good PSU and still got the error; reflashed the BIOS with it and still got the error. '''Got RMA, sending out on Wednesday.''' Waiting on ASUS to send us a working one!''' Called ASUS on 8/6, they said it's getting repaired right now. '''Wohoo! Got a notification that it shipped!''' ...they didn't fix it... Still has the EXACT same error it had when we shipped it to them. '''What should we do about this?''' I'm going to call them up and have a talk, considering looking at the details on their shipment reveals that they sent us a different motherboard, different serial number and everything, but with the same problem.<br />
* Printer queue for Copier: Konica Minolta Bizhub 750. IP=pita.unh.edu '''Seems like we need info from the Konica guy to get it set up on Red Hat. The installation documentation for the driver doesn't mention things like the passcode, because those are machine-specific. Katie says that if he doesn't come on Monday, she'll make an inquiry.''' <font color="green">Mac OS X now working, IT guy should be here week of June 26th</font> '''Did he ever come?''' No, he didn't, and did not respond to a voice message left. Will call again.<br />
* Sent an email to UNH Property Control asking what the procedure is to get rid of untagged equipment, namely the two old monitors in the corner. Apparently they want us to fill out lots of information on the scrapping form, such as whether it was paid for with government money, as well as give them serial numbers, model numbers, and everything else we can get ahold of. Then we get to hang onto the monitors until the hazardous-equipment people come in and take them out, at their leisure. Waiting to figure out what we want to do with them.<br />
<br />
== Completed ==<br />
* '''jalapeno hangups:''' Look at sensors on jalapeno, so that cacti can monitor the temp. The crashing probably isn't the splunk beta (no longer beta!), since it runs entirely in userspace. '''lm_sensors fails to detect anything readable. Is there a way around this?''' Jalapeno's been on for two weeks with no problems, let's keep our fingers crossed&hellip; '''This system is too unstable to maintain, like tomato and old einstein.''' Got an e-mail today, saying it's got a degraded array. I just turned it off since it's just a crappy space heater at this point.<br />
* Resize/clean up partitions as necessary. Seems to be a running trend that a computer gets 0 free space and problems crop up. '''This hasn't happened in half a year. I think it was a coincidence that a few computers had it happen at once.'''<br />
* Put new drive in lentil, npg-daily-33. '''That's good, because 32 is almost full already. 81%!'''<br />
* <b><font color="red">CLAMAV died and no one noticed!</font></b> The update of clamav (mail virus scanner on einstein) on April 23rd killed this mail subsystem because some of the options in /etc/clamd.conf were now obsolete. See http://www.sfr-fresh.com/unix/misc/clamav-0.93.tar.gz:a/clamav-0.93/NEWS. This seems to have gone unnoticed for a while. Are we asleep at the wheel? Edited /etc/clamd.conf to comment out these options.<br />
* When I came in today (22nd), taro had kernel panicked and einstein was acting strangely. Checking root's email, I saw that all day the 21st and 22nd, there were SMTP errors, around 2 per minute. A quick glance at them gives me the impression that they're spam attempts, due to ridiculous FROM fields like <code>pedrofinancialcompany.inc.net@tiscali.dk</code>. I rebooted taro and einstein, everything seems fine now.<br />
* Pauli crashes nearly every day, not when backups come around. We need to set up detailed system logging to find out why. Pauli2 and 4 don't give out their data via /net to the other paulis. This doesn't seem to be an autofs setting, since I see nothing about it in the working nodes' configs. Similarly, 2,4, and 6 won't access the other paulis via /net. 2,4 were nodes we rebuilt this summer, so it makes sense they don't have the right settings, but 6 is a mystery. Pauli2's hard drive may be dying. Some files in /data are inaccessible, and smartctl shows a large number of errors (98 if I'm reading this right...). Time to get Heisenberg a new hard drive? '''Or maybe just wean him off of NPG&hellip;''' It may be done for; can't connect to pauli2 and rebooting didn't seem to work. Need to set up the monitor/keyboard for it & check things out. '''The pauli nodes are all off for now. They've been deemed to produce more heat than they're worth. We'll leave them off until Heisenberg complains.''' Heisenberg's complaining now. Fixed his pauli machine by walking in the room (still don't know what he was talking about) and dirac had LDAP shut off. He wants the paulis up whenever possible, which I explained could be awhile because of the heat issues. ''' Pauli doesn't crash anymore, as far as I can tell. Switching the power supply seems to have done it.'''<br />
* Pumpkin is now stable. Read more on the configuration at [[Pumpkin]] and [[Xen]].<br />
* Roentgen was plugged into one of the non-battery-backup slots of its UPS, so I shut it down and moved the plug. After starting back up, root got a couple of mysterious e-mails about /dev/md0 and /dev/md2: <code>Array /dev/md2 has experienced event "DeviceDisappeared"</code>. However, <code>mount</code> seems to indicate that everything important is around:<br />
<pre><br />
/dev/vg_roentgen/rhel3 on / type ext3 (rw,acl)<br />
none on /proc type proc (rw)<br />
none on /dev/pts type devpts (rw,gid=5,mode=620)<br />
usbdevfs on /proc/bus/usb type usbdevfs (rw)<br />
/dev/md1 on /boot type ext3 (rw)<br />
none on /dev/shm type tmpfs (rw)<br />
/dev/vg_roentgen/rhel3_var on /var type ext3 (rw)<br />
/dev/vg_roentgen/wheel on /wheel type ext3 (rw,acl)<br />
/dev/vg_roentgen/srv on /srv type ext3 (rw,acl)<br />
/dev/vg_roentgen/dropbox on /var/www/dropbox type ext3 (rw)<br />
/usr/share/ssl on /etc/ssl type none (rw,bind)<br />
/proc on /var/lib/bind/proc type none (rw,bind)<br />
automount(pid1503) on /net type autofs (rw,fd=5,pgrp=1503,minproto=2,maxproto=4)<br />
</pre>and all of the sites listed on [[Web Servers]] work. Were those just old arrays that aren't around anymore but are still listed in some config file? '''We haven't seen any issues, and roentgen's going to be virtualized in the not-too-distant future, so this is fairly irrelevant.'''<br />
* Gourd's been giving smartd errors, namely<code><br />
Offline uncorrectable sectors detected:<br />
/dev/sda [3ware_disk_00] - 48 Time(s)<br />
1 offline uncorrectable sectors detected<br />
</code>Okra also has an offline uncorrectable sector! '''No sign of problems since this was posted.'''<br />
<br />
== Previous Months Completed ==<br />
[[Completed in June 2007|June 2007]]<br />
<br />
[[Completed in July 2007|July 2007]]<br />
<br />
[[Completed in August 2007|August 2007]]<br />
<br />
[[Completed in September 2007|September 2007]]<br />
<br />
[[Completed in October 2007|October 2007]]<br />
<br />
[[Completed in November/December 2007|NovDec 2007]]<br />
<br />
[[Completed in January 2008|January 2008]]<br />
<br />
[[Completed in February 2008|February 2008]]</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Backups&diff=3693Backups2008-05-22T16:54:47Z<p>Steve: /* Amanda backup consolidator */</p>
<hr />
<div>NPG backups run from the dedicated backup server, [[Lentil]], which has 4 hot-swappable drive bays, generally containing SATA drives.<br />
<br />
To put in a new (fresh) drive:<br />
# Locate the oldest disk.<br />
# Make sure it is not mounted.<br />
# Open the appropriate drive door and slide out drive.<br />
# Put the new drive in. (There are 4 screws holding the drive in place.)<br />
# Slide it back in. Take note of which Linux drive it registers as: /dev/sdb, /dev/sdc, /dev/sdd, or /dev/sde. <br/> NOTE: The order does NOT correspond with the slots, and this order can change after a reboot!<br />
# Run <code>/usr/local/bin/format_archive_disk.pl <disk no> <device></code> <br/>e.g.: <code>/usr/local/bin/format_archive_disk.pl 29 /dev/sde</code><br />
# Check that the drive is available: <code>ls /mnt/npg-daily/29</code><br />
== Current Backup System ==<br />
Newer backups hard-link files that haven't changed between backup sessions, so each session looks like a full copy while unchanged files are stored only once. Seems like a good idea, but we need to learn exactly how it works. (It's a poor man's version of Apple's Time Machine.)<br />
For old backups in the new format, consolidation works by putting all the data from each backup session into one place, overwriting with the newest data. Nobody's going to look for a specific version of a file they had in 2004 that only existed for three days, so this method is relatively safe in terms of data retention.<br />
<br />
The script that does the backing-up is ''/usr/local/bin/rsync_backup.pl'' and the script that periodically runs it and sends out a notification e-mail is ''/etc/cron.daily/rsync_backup''. ''[[rsync_backup.pl]]'' determines what disk to put the backup onto, etc.<br />
<br />
[http://www.mikerubel.org/computers/rsync_snapshots/ Here] is a nice little guide on incremental, hardlinked backups via rsync. He sets up some nice little tricks with NFS mounts so that users can access their stuff read-only for any backup that's actually stored. We should do this.<br />
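The trick in that guide can be demonstrated with nothing but coreutils. The sketch below is illustrative (paths are made up, and our real script uses rsync rather than <code>cp -al</code>): unchanged files in a new snapshot share inodes with the previous one, and a changed file must be replaced with a new inode rather than edited in place.

```shell
set -e
work=$(mktemp -d)
mkdir -p "$work/snap.0"
echo "unchanged" > "$work/snap.0/keep.txt"
echo "old"       > "$work/snap.0/edit.txt"

# "Copy" the previous snapshot as hard links -- costs almost no space.
cp -al "$work/snap.0" "$work/snap.1"

# A changed file must be REPLACED (new inode), never edited in place,
# or the old snapshot would silently change too. rsync does this for us;
# by hand, write to a temp name and mv over the link.
echo "new" > "$work/snap.1/edit.txt.new"
mv "$work/snap.1/edit.txt.new" "$work/snap.1/edit.txt"

# keep.txt now has two directory entries but one inode.
stat -c '%h' "$work/snap.1/keep.txt"   # prints: 2
```

Each extra snapshot directory then costs only the space of what changed, which is why dozens of daily snapshots fit on one disk.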
<br />
On lentil, <font color="red">pre-HDD-change</font>, perl was obliterated at 0:16:00 on 2007-6-21. This date is BEFORE we started even looking at perl stuff. Its filesize was 0 bytes. A quick fix was to overwrite the 0-byte lentil perl binary with a copy of improv's perl binary. Using rpm to force reinstall perl-5.8.5 from yum's cache restored the correct version. The cause was later found to be due to the drive going bad.<br />
<br />
[http://rsync.samba.org/documentation.html rsync documentation]<br />
<br />
===Client Configuration===<br />
An important aspect of the current backup system is that it requires ssh access to the node you want to back up. The rsync program uses ssh to pull data from the node, which requires a special ssh setup on each node. Each node has a file /etc/rsync-backup.conf that controls what is backed up for that node. The backup system then executes a remote command on the node: <code>rsync --server --daemon --config=/etc/rsync-backup.conf .</code> In the ssh configuration file (''/root/.ssh/authorized_keys'') the node has a special line allowing Lentil to execute this command. Don't forget to have the file ''/etc/rsync-backup.conf'' on each machine, and to have something meaningful in it. ''/root/debug_rsync'' is also needed.<br />
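In practice the client side is one line in ''/root/.ssh/authorized_keys''. The entry below is a sketch of its shape, not copied from a real node: the key material is elided and the exact option list may differ.

```
command="rsync --server --daemon --config=/etc/rsync-backup.conf .",no-port-forwarding,no-X11-forwarding,no-pty ssh-rsa AAAA...(lentil's public key)... root@lentil
```

The <code>command=</code> option means lentil's key can run only that one rsync invocation, nothing else, even though it logs in as root.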
<br />
===Important to note===<br />
* '''Do NOT use disks smaller than 350 GB for backup!!''', since those will not even fit one copy of what needs to be backed up.<br />
* The link /mnt/npg-daily-current must exist and point to an actual drive.<br />
<br />
== Legacy Backups ==<br />
The really old amanda-style backups are tar'ed, gzip'ed, and have weird header info. Looking at the head of them gives instructions for extraction.<br />
A script was written to extract and consolidate these backups; it shrinks hundreds of gigs down to tens of gigs, and zipping the result shrinks it further. Very handy for OLD files we're not going to look at ever again.<br />
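The extraction recipe (the same <code>dd | gzip | tar</code> pipeline the consolidator script uses) can be demonstrated end-to-end on a fake dump: prepend a 32 kB header block to a gzipped tarball, then skip it with dd. The header here is just zeros; real amanda headers contain the extraction instructions mentioned above.

```shell
set -e
work=$(mktemp -d)
mkdir -p "$work/orig" "$work/out"
echo "payload" > "$work/orig/file.txt"

# Build a fake amanda-style dump: 32 kB header + gzipped tar body.
tar -C "$work/orig" -cf - . | gzip -c > "$work/body.tgz"
head -c 32768 /dev/zero > "$work/dump"
cat "$work/body.tgz" >> "$work/dump"

# The recovery recipe: skip the 32 kB header block, then gunzip + untar.
dd if="$work/dump" bs=32k skip=1 2>/dev/null | gzip -dc | tar -xf - -C "$work/out"
cat "$work/out/file.txt"   # prints: payload
```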
<br />
===Amanda backup consolidator===<br />
<pre>#!/bin/bash<br />
# This script was designed to extract data from the old tape-style backups<br />
# and put the data in an up-to-date (according to the backups) directory<br />
# tree. We can then, either manually or a different script, tar that into a<br />
# comprehensive backup. This should be quite a bit more space-efficient than<br />
# incrementals.<br />
# -----------------------------<br />
# My first attempt at a "smart" bash script, and one that takes input.<br />
# Probably not the best way to do it, but it works!<br />
# ~Matt Minuti<br />
if [ -z "$1" ]<br />
then<br />
echo "Syntax: amandaextract.sh [string to search for]"<br />
echo "This script searches /mnt/tmp for files containing the"<br />
echo "given string, and does the appropriate extraction"<br />
exit<br />
fi<br />
# Test to see if destination directory exists<br />
if [ -d /mnt/tmp2/$1 ]<br />
then<br />
echo "Directory /mnt/tmp2/$1 already exists."<br />
else<br />
mkdir /mnt/tmp2/$1 # If it doesn't, make it!<br />
fi<br />
<br />
NPG=$( ls /mnt/tmp/ ) # Set where to look for backup tree<br />
for i in $NPG; do # Cycle through the folders in order to...<br />
cd /mnt/tmp/$i<br />
cd ./data/<br />
FILES=$( ls ) # Get a listing of files<br />
for j in $( echo "$FILES" | grep "\.$1" ) ; do <br />
echo "Extracting file $j from $( pwd ) to /mnt/tmp2/$1"<br />
dd if="$j" bs=32k skip=1 | /usr/bin/gzip -dc | /bin/tar -xf - -C /mnt/tmp2/$1<br />
done # The above for statement takes each matching file and extracts<br />
done # it to the desired location.<br />
</pre><br />
An example of how I've been using it: with an amanda backup drive mounted on /mnt/tmp and an empty drive (or one with enough space) on /mnt/tmp2, running <code>amandaextract.sh einstein</code> will go into each folder in /mnt/tmp, look for anything with a name containing "einstein," and do the appropriate extraction into /mnt/tmp2/einstein. The effect is that the oldest backups are extracted first and then overwritten with newer and newer partial backups, ending with the most up-to-date state of that amanda set. I then tar and bzip the resulting folder structure to save even more space.<br />
<br />
== Emergency Backup ==<br />
An easy way to make a backup of, say, lentil when its root drive is dying, is to use the program ''dd_rescue'' from a rescue disc, to copy the drive contents to another. The backup can then be mounted as a loopback device to access its contents.<br />
<br />
This came in handy in the case of a certain computer, say, lentil. We forgot to copy the autofs scripts and ssh keys, but it wasn't a big deal since we just mounted the drive image and bam! Everything was nice.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Package_Management&diff=3692Package Management2008-05-22T15:54:48Z<p>Steve: /* yum */</p>
<hr />
<div>Every machine has yum and rpm installed, and these are the main methods of adding software to a machine.<br />
==yum==<br />
This program is a higher-level interface to <code>rpm</code> and repositories. RedHat systems must be registered with RHN to make full use of yum, because registration is the only way to access the official RedHat repositories. Other repositories can be added by adding entries in ''/etc/yum.repos.d/'', but this should be avoided unless some non-RedHat-supplied software is really needed.<br />
<br />
It's written in Python, so be careful when updating python.<br />
<br />
On RHEL 5, yum has replaced "up2date" for getting updates/packages/etc. that are requested on the RHN website.<br />
===Useful invocations===<br />
; <code>yum search string1 [string2] [&hellip;]</code> : Search for packages whose name or description match the given string(s)<br />
; <code>yum install package1 [package2] [&hellip;]</code> : Installs the latest version of a package or group of packages while ensuring that all dependencies are satisfied. If no package matches the given package name(s), they are assumed to be a shell glob and any matches are then installed. Sometimes a specific version of a package must be specified (e.g. 32-bit versus 64-bit, or if multiple versions of the same program are installed for some insane reason); see [[#Package Names|Package Names]].<br />
; <code>yum update [package1] [package2] [&hellip;]</code> : Update packages with same name rules as "install". If no packages are named, all installed packages will be updated. Sometimes that's a bad thing; always use "check-update" first.<br />
; <code>yum check-update</code> : Lets you find out whether your machine has any updates that need to be applied, without running yum interactively. Returns an exit value of 100 if there are packages available for an update, and also prints the list of packages to be updated. Returns 0 if no packages are available for update.<br />
; <code>yum remove package1 [package2] [&hellip;]</code> : Remove packages ''and all packages that depend on them''.<br />
Note that there is no way to simply "reinstall" a package when using yum. Instead, a remove-install combo may be necessary. However, this can be trouble, because it could wipe out a lot of innocent packages. So, either use rpm for this (''preferred'') or make sure to copy the list of packages that yum reports will be removed and install the whole list again.<br />
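The check-update exit-code convention makes it easy to script. Below is a hypothetical nightly wrapper; <code>check_updates</code> is a stand-in for <code>yum -q check-update</code> so the dispatch logic can run anywhere, without a registered machine:

```shell
# Stand-in for `yum -q check-update`; swap in the real command on a
# registered machine. Exit 100 = updates pending, 0 = up to date,
# anything else = the check itself failed.
check_updates() { return 100; }

report_updates() {
  check_updates
  case $? in
    0)   echo "system up to date" ;;
    100) echo "updates available" ;;
    *)   echo "yum check-update failed" ;;
  esac
}

STATUS=$(report_updates)
echo "$STATUS"   # prints: updates available
```

A real wrapper would mail $STATUS to npg-admins instead of echoing it.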
===Package Names===<br />
A package can be referred to for install,update,list,remove etc. with any of the following:<br />
* name<br />
* name.arch<br />
* name-ver<br />
* name-ver-rel<br />
* name-ver-rel.arch<br />
* name-epoch:ver-rel.arch<br />
* epoch:name-ver-rel.arch<br />
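For instance, all of the following could select the same package; the names are typical examples, not pinned to what's actually in our repositories:

```
yum install openssl                       # name
yum install openssl.i686                  # name.arch (e.g. force 32-bit)
yum install openssl-0.9.8b                # name-ver
yum install openssl-0.9.8b-8.3.el5        # name-ver-rel
yum install openssl-0.9.8b-8.3.el5.i686   # name-ver-rel.arch
```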
<br />
==rpm==<br />
Mostly used by us to install packages that aren't in a repository and to reinstall broken packages without messing with packages that depend on them.<br />
===Useful invocations===<br />
; <code>rpm -i package_file1 [package_file2] [&hellip;]</code> : Installs the listed packages. Note that these are pathnames and not package names as used for yum.<br />
; <code>rpm -e --nodeps package_name1 [package_name2] [&hellip;]</code> : Removes the named packages, without removing packages that depend on them. Note that these are package names and not pathnames like when installing. I'm pretty sure the names don't follow the same rules as for yum, so the whole package name might be necessary.<br />
==up2date==<br />
This is, for pre-RHEL5 systems, the only method of getting updates/packages/etc. that are requested on the RHN website.<br />
===Useful invocations===<br />
; <code>up2date</code> : Runs the interactive, GUI version, so if you SSH'd in, make sure it was with X forwarding. Not the preferred method, just because the interactive aspect isn't usually needed.<br />
; <code>up2date -u</code> : Gets and installs the updates from RHN in "unattended" mode, with no GUI. Generally the best method, although some systems have some updates flagged to be skipped (things like new kernels).</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Talk:Improv&diff=3691Talk:Improv2008-05-21T19:02:43Z<p>Steve: </p>
<hr />
<div>This is outdated now. Isn't improv now feynman, with the old improv machine now in Maurik's office?</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Talk:Exports&diff=3690Talk:Exports2008-05-21T19:01:15Z<p>Steve: </p>
<hr />
<div>Um, what is this?</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Iptables&diff=3689Iptables2008-05-21T19:00:33Z<p>Steve: /* Details */</p>
<hr />
<div>iptables is part of the standard Red Hat / Linux firewall. The usual way to configure it is through the GUIs, but BEWARE: we have a customized setup.<br />
<br />
The reason for the customization is that this allows us to use netgroups, i.e. we pull lists of system names from the LDAP database and allow certain services to every system in that list.<br />
<br />
== Configuration ==<br />
<br />
The normal configuration for the iptables is in /etc/sysconfig/iptables and /etc/sysconfig/iptables-config. The startup script is /etc/init.d/iptables<br />
<br />
We have customizations as follows:<br />
* /etc/init.d/iptables-netgroups: runs /usr/local/bin/netgroup2iptables.pl.<br />
* /usr/local/bin/netgroup2iptables.pl: a Perl script that pulls the netgroup information from LDAP. It uses the system command "iptables-save" to get the current iptables.<br />
* /etc/sysconfig/iptables-npg: the iptables rules file that iptables-config points to for the data.<br />
<br />
Note that this system has a vulnerability: the iptables-npg file can become corrupted by an <code>/etc/init.d/iptables save</code> command.<br />
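The core transformation netgroup2iptables.pl performs can be sketched in a few lines of shell. Everything below is illustrative: the netgroup line is canned rather than fetched from LDAP (a real run would use something like <code>getent netgroup</code>), and the real script's chain names, ports, and rule set differ.

```shell
# A netgroup entry as getent/LDAP would present it (canned example):
NETGROUP='npg-hosts (einstein,-,-) (gourd,-,-) (lentil,-,-)'

# Emit one ACCEPT rule per member host, in iptables-save format.
RULES=""
for member in $(echo "$NETGROUP" | grep -o '([^,]*' | tr -d '('); do
  RULES="$RULES-A RH-Firewall-1-INPUT -s $member -p tcp --dport 22 -j ACCEPT
"
done
printf '%s' "$RULES"
```

The generated rules are then spliced into the saved iptables dump and reloaded, which is why corruption of iptables-npg is painful: it is both input and output of this process.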
<br />
== Details ==<br />
These should be added. Specifically, documenting who/what/where/why things are blocked would be nice. It'd also be nice to see if it's possible to move to a simpler setup &mdash; one that doesn't require LDAP bootstrapping ugliness.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Perl&diff=3688Perl2008-05-21T18:57:53Z<p>Steve: </p>
<hr />
<div>==Perl Administration==<br />
<br />
Install a new module for Perl from CPAN:<br />
 perl -MCPAN -e 'install Base::Modulename' # to install Base::Modulename, e.g. Crypt::PasswdMD5<br />
Make sure there isn't an official package first, though. We've had problems with module dependencies bite us (ldapcat and friends on einstein).<br />
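A quick way to check whether a module is already present before reaching for CPAN (List::Util ships with core perl, so the first probe should succeed; No::Such::Module is deliberately bogus):

```shell
# A module probe either loads the module and prints "yes", or fails
# (stderr suppressed) and falls through to "no".
HAVE_CORE=$(perl -MList::Util -e 'print "yes"' 2>/dev/null || echo "no")
HAVE_BOGUS=$(perl -MNo::Such::Module -e 'print "yes"' 2>/dev/null || echo "no")
echo "List::Util: $HAVE_CORE, No::Such::Module: $HAVE_BOGUS"
```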
<br />
You can find what modules exist at [http://search.cpan.org/ search cpan], the [http://www.cpan.org/modules/00modlist.long.html cpan modules list], and [http://kobesearch.cpan.org/ kobesearch].</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Pumpkin&diff=3687Pumpkin2008-05-21T18:56:20Z<p>Steve: </p>
<hr />
<div>Pumpkin is our new 8-CPU, 24-disk monster machine. It is really, really nice.<br />
<br />
== Basic Setup == <br />
We run Xen on this so that it has two RHEL5 personalities: Pumpkin, 64-bit, and Corn, 32-bit. More information is at our [[Xen]] page.<br />
<br />
<font size = "+1"><font color = "#0000BB">'''Current Xen domU's:''' </font></font><br />
; Domain0: Pumpkin<br />
Red Hat EL5 - 64-bit 8-CPUs 24 GB of memory.<br />
; DomU: [[Corn]]<br>Red Hat EL5 - 32-bit 2 CPUs, 4 GB of memory. Para-virtualized, boots from /dev/sdb <br />
; DomU: [[Fermi]]<br>Red Hat EL4 - 64-bit 2 CPUs, 2 to 4 GB of memory, Para-virtualized, boots from /dev/sdg. <br />
; DomU: [[Compton]]<br>Red Hat EL4 - 32-bit 2 CPUs, 2 to 4 GB of memory, '''Fully'''-virtualized (cannot mix and match x86_64 and i686 for RHEL4), boots from /dev/sdh. <br />
; DomU: [[Landau]]<br> Red Hat EL4 or EL5 - boots from /dev/sdj. -- System for experimenting.<br />
<br />
<font size = "+1"><font color = "#0000BB">'''RAID Setup'''</font></font><br />
<br />
The RAID is currently split. This allows for much easier maintenance and, in the future, possible upgrades.<br />
; Disks 1 to 11 : RAID Set 0, which holds the RAID Volumes: System (300GB, RAID6, SCSI:0.0.0), System1 (300GB, RAID6, SCSI:0.0.1), Data1 (6833GB, RAID5, SCSI:0.0.2)<br />
; Disks 12 to 22 : RAID Set 1, which holds the RAID Volume: Data2 (7499GB, RAID5, SCSI:0.0.3)<br />
; Disks 23 and 24 : Passthrough (single disks) at SCSI:0.0.6 and SCSI:0.0.7. These can be used as spares, as backup, or to expand the other RAID sets later on. Currently they are seen as /dev/sde* and /dev/sdf*. /dev/sdf and /dev/sde are currently used for Virtual Systems.<br />
The RAID card can be monitored at http://10.0.0.99/. Log in as "admin" with a password that is the same as the door combo.<br />
* To use this card with Linux you need a driver: arcmsr. This '''must be part of the initrd''' for the kernel, else you cannot boot from the RAID. You can also install from the CDs, if you have a driver floppy. It will then add the arcmsr driver into the initrd for you. You will still '''always need to have this driver!'''<br />
* The kernel module can be built from the sources located on /dev/sdf in ''/usr/src/kernels/Acera_RAID''. Just run make.<br />
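If the initrd ever needs to be rebuilt by hand with the driver forced in, the usual RHEL pattern is the following; this is the generic <code>mkinitrd</code> invocation, not a command recorded from pumpkin's history:

```
mkinitrd -f --with=arcmsr /boot/initrd-$(uname -r).img $(uname -r)
```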
<br />
There exists a temporary drive which holds a RHEL5 distro and the original RHEL4 distro from the manufacturer. It is currently disconnected from pumpkin.<br />
<br />
== Virtual Host: Corn ==<br />
We run a '''32-bit''' personality "corn" using Xen virtualization on pumpkin's /dev/sdb. Corn is a para-virtualized RHEL5 system, with pumpkin as the master host, or "domain0". It is a fully separate system (it could even be booted as the main system with a few config-file modifications; hint: don't do that!). This means that any system stuff installed on pumpkin needs to be installed on corn separately.<br />
<br />
Subscription issue: a virtual host needs special setup. See the [http://kbase.redhat.com/faq/FAQ_103_10754.shtm RedHat documentation]. Both host and guest need rhn-virtualization-common and rhn-virtualization-host installed. This is now fixed.<br />
<br />
The virtual host needs to have both ethernets bridged. According to the [http://wiki.xensource.com/xenwiki/XenNetworking Xen wiki], this is done by modifying the ''/etc/xen/scripts/network-bridge'' script; ours is now ''network-bridge-two'', which calls the original twice. For the host, create two interfaces: first the one to xenbr1 and then the one to xenbr0, so that the first one ends up being eth1 and the second one eth0. Yes, it seems backwards, but it now works. The key is to have the lines<br />
alias eth0 xennet<br />
alias eth1 xennet<br />
in the /etc/modprobe.conf file. This is now working.<br />
<br />
<br />
<br />
== To Do ==<br />
* There must be other things....<br />
* Setup sensors so that we can monitor the system. '''Will have to wait for a kernel that supports it'''<br />
<br />
== Done ==<br />
** Pumpkin's [[iptables]] seem messed up after this morning's (1/8/2008) GRUB trouble. With the old config (pepper's), iptables wouldn't let anything in at all, it seemed (specifically things like pingbacks, LDAP&hellip;). I've copied roentgen's ''/etc/sysconfig/iptables-npg'' to pumpkin for now, and everything seems to be working as usual. Previously it had a copy of pepper's, and pepper works, so I wonder what the real problem is. <font color="green">'''This was the wrong iptables!'''. I fixed it with a new set.</font><br />
* Sane iptables using ldap.<br />
* Setup ethernet.<br />
* Setup RAID volumes.<br />
* Setup partitions and create file systems.<br />
* Move the system to System drive and remove the current temp drive.<br />
* Setup mount points for the data drives.<br />
* Setup LDAP for users to log in.<br />
* Setup [[Exports]], so other systems can see the drives. '''There were issues with firewall, so I modeled the firewall after taro's.''' Seems to be working, I can successfully <code>ls /net/data/pumpkin1</code> and <code>ls /net/data/pumpkin2</code> on einstein.<br />
* Setup autofs so that it can see other drives. '''What other drives? It's working for einstein:/home.''' Other drives, such as the data drives.<br />
* Setup [[smartd]] so we will know when a disk is going bad. '''This can be done inside the RAID card''' using a system to send SNMP and EMAIL. but it needs to be done. '''E-mail seems to be set up, let's see if we get any through npg-admins'''<br />
* Restrict access (/etc/security/access.conf)<br />
* Setup sudo on both pumpkin and corn.<br />
* Add the new systems to the lentil backup script. '''They're on there; lentil just needs the right SSH keys to rsync them.'''<br />
* Setup SNMP for cacti monitoring.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Talk:Ssh_known_hosts&diff=3686Talk:Ssh known hosts2008-05-21T18:53:21Z<p>Steve: </p>
<hr />
<div>This file and page probably need to be updated.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=SNMPD&diff=3685SNMPD2008-05-21T18:51:37Z<p>Steve: </p>
<hr />
<div>Really useful, but hard to understand the details.<br />
<br />
== Access control and config ==<br />
<br />
See /etc/snmpd/snmpd.conf. This is where you allow/restrict access and can control what is presented. If only you could figure out the file.<br />
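As a starting point, a minimal read-only configuration has this rough shape; the community string, subnet, and contact below are made-up examples, not our real values:

```
# Minimal read-only snmpd.conf sketch -- all values illustrative
rocommunity  example-community  10.0.0.0/24   # read-only access from one subnet
syslocation  "server room"
syscontact   root@localhost
```

With a <code>rocommunity</code> line like this, the cacti host can walk the MIB tree but nothing can write via SNMP.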
<br />
== Log Level control ==<br />
<br />
Logging is controlled on the command line, see the '''logging options''' in<br />
[http://net-snmp.sourceforge.net/docs/man/snmpcmd.html snmpcmd man page]<br />
The actual file to change is /etc/sysconfig/snmpd.options or /etc/snmpd/snmpd.options where the line<br />
OPTIONS="-LS 0-4 d -Lf /dev/null -p /var/run/snmpd.pid -a"<br />
has the necessary magic. See /etc/init.d/snmpd for the default.</div>Stevehttps://nuclear.unh.edu/wiki/index.php?title=Package_Management&diff=3683Package Management2008-05-22T15:32:48Z<p>Steve: </p>
<hr />
<div>Every machine has yum and rpm installed, and these are the main methods of adding software to a machine.<br />
==yum==<br />
This program is a higher-level interface to <code>rpm</code> and repositories. RedHat systems must be registered with RHN to make full use of yum, because registration is the only way to access the official RedHat repositories. Other repositories can be added by adding entries in ''/etc/yum.repos.d/'', but this should be avoided unless some non-RedHat-supplied software is really needed.<br />
<br />
On RHEL 5, yum has replaced "up2date" for getting updates/packages/etc. that are requested on the RHN website.<br />
===Useful invocations===<br />
; <code>yum search string1 [string2] [&hellip;]</code> : Search for packages whose name or description match the given string(s)<br />
; <code>yum install package1 [package2] [&hellip;]</code> : Installs the latest version of a package or group of packages while ensuring that all dependencies are satisfied. If no package matches the given package name(s), they are assumed to be a shell glob and any matches are then installed. Sometimes a specific version of a package must be specified (e.g. 32-bit versus 64-bit, or if multiple versions of the same program are installed for some insane reason); see [[#Package Names|Package Names]].<br />
; <code>yum update [package1] [package2] [&hellip;]</code> : Update packages with same name rules as "install". If no packages are named, all installed packages will be updated. Sometimes that's a bad thing; always use "check-update" first.<br />
; <code>yum check-update</code> : Lets you find out whether your machine has any updates that need to be applied, without running yum interactively. Returns an exit value of 100 if there are packages available for an update, and also prints the list of packages to be updated. Returns 0 if no packages are available for update.<br />
; <code>yum remove package1 [package2] [&hellip;]</code> : Remove packages ''and all packages that depend on them''.<br />
Note that there is no way to simply "reinstall" a package when using yum. Instead, a remove-install combo may be necessary. However, this can be trouble, because it could wipe out a lot of innocent packages. So, either use rpm for this (''preferred'') or make sure to copy the list of packages that yum reports will be removed and install the whole list again.<br />
===Package Names===<br />
A package can be referred to for install,update,list,remove etc. with any of the following:<br />
* name<br />
* name.arch<br />
* name-ver<br />
* name-ver-rel<br />
* name-ver-rel.arch<br />
* name-epoch:ver-rel.arch<br />
* epoch:name-ver-rel.arch<br />
==rpm==<br />
Mostly used by us to install packages that aren't in a repository and to reinstall broken packages without messing with packages that depend on them.<br />
===Useful invocations===<br />
; <code>rpm -i package_file1 [package_file2] [&hellip;]</code> : Installs the listed packages. Note that these are pathnames and not package names as used for yum.<br />
; <code>rpm -e --nodeps package_name1 [package_name2] [&hellip;]</code> : Removes the named packages, without removing packages that depend on them. Note that these are package names and not pathnames like when installing. I'm pretty sure the names don't follow the same rules as for yum, so the whole package name might be necessary.<br />
==up2date==<br />
This is, for pre-RHEL5 systems, the only method of getting updates/packages/etc. that are requested on the RHN website.<br />
===Useful invocations===<br />
; <code>up2date</code> : Runs the interactive, GUI version, so if you SSH'd in, make sure it was with X forwarding. Not the preferred method, just because the interactive aspect isn't usually needed.<br />
; <code>up2date -u</code> : Gets and installs the updates from RHN in "unattended" mode, with no GUI. Generally the best method, although some systems have some updates flagged to be skipped (things like new kernels).</div>Steve