Revision as of 18:54, 1 May 2009
Endeavour
Here are the notes on this system.
Notes on configuration status/changes and the to-do list are at the bottom.
The Endeavour web server is active: Endeavour
It runs the Ganglia monitoring software: Endeavour Ganglia
The Endeavour RAID card is connected to 10.0.0.96
The Endeavour SWITCH is connected to 10.0.0.253
System Usage
This section explains some of the special use for this system.
OpenPBS = Torque = Portable Batch System
PBS is a system for scheduling compute jobs onto nodes, aka "workload management software", that was first created by NASA in the '90s. We ran this early version on our farm back then. It is very sophisticated and thus not so trivial to configure. Some things are already set up.
The company supporting the old open source version is PBS Gridworks, which seems to be a division of "Altair". They haven't touched their free open version since 2001.
There are no manuals for OpenPBS from Altair, only for PBS Pro. To get to them, you need to create a username/password at the PBS Pro User Area; you can then get to the Documentation. Do not expect a one-to-one correspondence between the OpenPBS and PBS Pro versions (for example, you don't need a FLEX license for the open one).
The newer development of OpenPBS has been renamed Torque, which is what is installed on our systems. See Cluster Resources and go to Torque Resource Manager. This includes documentation.
Commands
- pbsnodes
- This gives a quick overview of all the known nodes and whether they are up; if they are up, it shows their status.
- xpbs
- Graphical interface to PBS
- xpbsmon
- Graphical interface to monitor nodes. (nice)
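Beyond the commands above, a quick way to exercise the queueing system is to submit a trivial job. The script below is only a sketch: the queue name "batch" and the resource request are assumptions and may need adjusting to our local Torque configuration.

```shell
# hello.pbs -- minimal Torque/PBS job script (queue name "batch" is an assumption)
#PBS -N hello            # job name as shown by qstat
#PBS -q batch            # target queue; list real queues with: qstat -Q
#PBS -l nodes=1:ppn=1    # one processor on one node
#PBS -j oe               # merge stdout and stderr into one output file
cd $PBS_O_WORKDIR        # PBS starts jobs in $HOME by default
hostname                 # report which node ran the job
```

Submit with `qsub hello.pbs`, watch it with `qstat -a`, and check which node ran it against `pbsnodes`.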
Initial setup and Configuration
- Set the UNH IP address (endeavour.unh.edu) on eth1. [done]
- This made the system think of itself as "endeavor" rather than "master", causing PBS to get confused. PBS in /var/spool/pbs adjusted, also the maui scheduler in /usr/local/maui/maui.cfg modified. [done]
- I switched the IP address on eth0 to 10.0.0.100 from 10.0.0.1 (since that is the usual gateway address, and we want to bridge the two backend networks.) [done]
- This requires ALL "hosts" files on the nodes to be modified [done,all nodes but 25]
- Also, the /root/.shosts /root/.rhosts and /etc/ssh/ssh_known_hosts files need to be copied from node2 to node* [done,all nodes but 25]
- The file /var/spool/pbs/server_name needs to be updated as well [done,all nodes but 25]
- Set the root password to standard scheme. [done,master only]
- Set up the LDAP client side. [done,master only]
- Recompiled PBS to include the xpbs and xpbsmon commands. [done]
- Configured and started the iptables firewall [done,master only]
- Integrated the backend network with the farm backend network (bridged the network switches) [done]
- Set up automount, standard /net/data and /net/home [done,master only]
- TODO: We need a new rule that resolves /net/data/node2 for the disk in node2 etc. The nodes need to export their /scratch partition. The other partitions may not be needed, since the "rcpf" command (a foreach with rcp) can copy files in batch.
- The /etc/nodes file included the "master" node. This is too dangerous: it means that in a batch copy the file is also automatically copied back to the master, with potentially dangerous results.
- To add users to the "Microway Ganglia control" part, add them to /etc/mcms.users. The password is the login password; LDAP is honored.
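The /net/data/nodeN rule mentioned in the TODO above could be handled with an autofs wildcard map, assuming the nodes export /scratch over NFS. The file names and option strings below are a hypothetical sketch, not our tested configuration.

```shell
# On each node: export /scratch to the backend network (hypothetical /etc/exports line)
#   /scratch 10.0.0.0/24(rw,no_root_squash,sync)

# On the master, hand the /net/data directory to a dedicated map
# (assumed /etc/auto.master entry):
#   /net/data  /etc/auto.node_data

# /etc/auto.node_data: the wildcard key "*" matches the node name, and "&"
# substitutes it back, so /net/data/node2 mounts node2:/scratch
#   *  -rw,hard,intr  &:/scratch
```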
TO DO
- Figure out the monitoring system, Ganglia, and other Microway goodies. [partially done]
- LDAP (client) on nodes?
- Practice PBS
- Test the Infiniband and MPICH setup.
- 24 hour full system burnin?
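For the Infiniband/MPICH item above, a first smoke test could be the standard MPI hello-world across a few nodes. This is a sketch: the mpicc/mpirun wrappers must be the ones from our MPICH install, and using /etc/nodes as the machinefile is an assumption.

```shell
# Write a trivial MPI program (each rank reports itself)
cat > hello_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
mpicc -o hello_mpi hello_mpi.c
# Run 4 ranks across the nodes listed in the machinefile (path is an assumption)
mpirun -np 4 -machinefile /etc/nodes ./hello_mpi
```

If this runs over the Infiniband fabric rather than ethernet depends on how MPICH was built; that is exactly what the test should establish.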
Long Term To Do
Possible long term tasks if manpower is available.
The long term goal is to have Endeavour as an independent system if need be.
- Run a replica LDAP server on Endeavour.
- Run a replica named (DNS) server on Endeavour and Roentgen.
- Install Splunk and pass Splunk data on to Pumpkin.
- Replicate home directories for selected users (this may be too tricky, really)? Else create a local copy of each user.
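For the replica LDAP server, OpenLDAP's syncrepl consumer would let Endeavour track the master directory. The fragment below is a placeholder sketch: the provider host, suffix, bind DN, and credentials are all made-up values to be replaced with ours.

```shell
# Fragment for slapd.conf on Endeavour (all values are placeholders)
#   syncrepl rid=001
#     provider=ldap://ldap-master.example.edu
#     type=refreshAndPersist
#     searchbase="dc=example,dc=edu"
#     bindmethod=simple
#     binddn="cn=replicator,dc=example,dc=edu"
#     credentials=secret
```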