Quick Integrated System Checklist
From UMaine Supercomputer
Every cluster on the open side is now controlled from one Torque/Moab/LDAP server. This means that when you type showq or qstat you will see jobs running on Kearney2, Bender and Fawlty. This may be confusing at first, because you will see free nodes for all of the clusters. The benefit is that you can manage all of your jobs from one host.
- Use OpenMPI.
  - The great advantage here is that you no longer need separate binaries for Ethernet and Myrinet runs.
- Use the mpiexec included with OpenMPI.
  - /usr/local/ompi-<compiler>/bin/mpiexec
  - This requires that you specify the following:
    - -np <number of processors>
    - -mca btl self,tcp (Ethernet) or -mca btl self,gm (Myrinet)
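As a sketch of how the two required flags fit together, the snippet below builds the mpiexec command line for either interconnect and prints it without running it. The helper names `btl_for` and `mpiexec_cmd`, the `COMPILER` variable and the program name are hypothetical. Only the mpiexec path pattern and the `-np` and `-mca btl` flags come from this page. It assumes `tcp` selects Ethernet and `gm` selects Myrinet (GM), which is how OpenMPI names those transports.

```shell
#!/bin/sh
# Hypothetical helper: map an interconnect name to an OpenMPI BTL list.
# "self" handles a process sending to itself, "tcp" is Ethernet, "gm" is Myrinet.
btl_for() {
    case "$1" in
        ethernet) echo "self,tcp" ;;
        myrinet)  echo "self,gm" ;;
        *)        echo "unknown interconnect: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print (do not run) the mpiexec command for
#   mpiexec_cmd <number of processors> <ethernet|myrinet> <program>
# COMPILER must hold the <compiler> part of /usr/local/ompi-<compiler>.
mpiexec_cmd() {
    if [ -z "$COMPILER" ]; then
        echo "set COMPILER to the <compiler> part of /usr/local/ompi-<compiler>" >&2
        return 1
    fi
    btl=$(btl_for "$2") || return 1
    echo "/usr/local/ompi-$COMPILER/bin/mpiexec -np $1 -mca btl $btl $3"
}
```

For example, `mpiexec_cmd 8 myrinet ./my_mpi_program` prints the Myrinet command line for eight processes. Once the printed line looks right, you can paste it into your job script.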
- Update PBS submission scripts.
  - The following lines are now required:
    - #PBS -l arch=<string>, where <string> is either ppc64 for the Xserves or x86_32 for the PIIIs
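A minimal job script containing the required architecture line might look like the one below. It is written out through a here-document so you can inspect it before submitting it. The job name, the nodes/ppn request, the walltime, the `COMPILER` value and the program name are all hypothetical placeholders. Only `#PBS -l arch=ppc64` (or `x86_32` for the PIIIs) is required by this page. The process count is taken from Torque's `$PBS_NODEFILE`, which lists one line per assigned processor slot.

```shell
#!/bin/sh
# Write an example Xserve job script to job.pbs.
# Everything except the arch line is a hypothetical placeholder.
cat > job.pbs <<'EOF'
#!/bin/sh
#PBS -N example_job
#PBS -l nodes=4:ppn=2
#PBS -l walltime=01:00:00
# Required on the integrated system: ppc64 = Xserves, x86_32 = PIIIs
#PBS -l arch=ppc64

cd "$PBS_O_WORKDIR"
# One MPI process per processor slot Torque assigned to this job.
NP=$(wc -l < "$PBS_NODEFILE")
# Hypothetical: set to the <compiler> part of /usr/local/ompi-<compiler>.
COMPILER=your_compiler
# self,gm = Myrinet; use self,tcp instead for an Ethernet run.
"/usr/local/ompi-$COMPILER/bin/mpiexec" -np "$NP" -mca btl self,gm ./my_mpi_program
EOF
echo "wrote job.pbs; submit it with: qsub job.pbs"
```

For the PIIIs, change the arch line to `x86_32`.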
- Diagnose problems with Moab commands first, then with Torque if necessary.
  - Moab is far more advanced than the default scheduler included with Torque. It therefore knows much more than Torque about the state of your job and of the nodes.
  - As a result, nodes listed as free may not be nodes your job can actually run on. Use the following commands:
    - /opt/moab/bin/showq -- Replacement for qstat
    - /opt/moab/bin/showstate -- Shows a brief status of running jobs and nodes, including node locations. Racks 1x and 3x are Xserves, whereas racks 5x are the PIIIs. By default, jobs CANNOT cross racks. This greatly improves the performance of your jobs.
    - /opt/moab/bin/mdiag -n -- Diagnoses the state of all the nodes. A free node cannot necessarily accept a job. Rogue processes, connection failures, high load and other issues can make a node unsuitable, and Moab will not schedule jobs on such a node.
    - /opt/moab/bin/checkjob <jobnumber> -- Replacement for qstat -f
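The diagnosis order above, Moab first and Torque second, can be wrapped in a small shell function. The function name `diagnose_job` and the `MOAB_BIN` override are hypothetical conveniences. The commands it runs are the Moab commands listed on this page, followed by Torque's `qstat -f` as the fallback.

```shell
#!/bin/sh
# Directory holding the Moab client commands listed above.
# MOAB_BIN is a hypothetical override; the default is the path on this page.
MOAB_BIN=${MOAB_BIN:-/opt/moab/bin}

# Hypothetical helper: show why a job is (or is not) running,
# asking Moab first and Torque only afterwards.
diagnose_job() {
    if [ $# -ne 1 ]; then
        echo "usage: diagnose_job <jobnumber>" >&2
        return 2
    fi
    if [ ! -x "$MOAB_BIN/checkjob" ]; then
        echo "Moab commands not found in $MOAB_BIN" >&2
        return 1
    fi
    echo "=== Moab: checkjob $1 ==="
    "$MOAB_BIN/checkjob" "$1"   # Moab's view of the job (replaces qstat -f)
    echo "=== Moab: mdiag -n ==="
    "$MOAB_BIN/mdiag" -n        # state of every node, including unusable "free" ones
    echo "=== Moab: showstate ==="
    "$MOAB_BIN/showstate"       # running jobs, node states and rack locations
    echo "=== Torque: qstat -f $1 ==="
    qstat -f "$1"               # Torque's view, only as a fallback
}
```

Calling `diagnose_job <jobnumber>` prints each tool's output under its own header. Because checkjob's view of the job comes first, Moab's explanation appears before the less detailed Torque one.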

