Running a Batch Job
From UMaine Supercomputer
Introduction
Once you have built a binary for the application you wish to run, the next step is to create a PBS script, which is just a plain text file. The PBS script tells the scheduler which resources the job requires. The scheduler in use at the ACRL is Moab from Cluster Resources. The resource manager is Torque, an open source solution from the same company. Torque informs Moab of the available resources in the cluster, and Moab decides when to run the queued jobs.
PBS/Torque Commands
Any line that begins with #PBS is a PBS/Torque command. Any other line that begins with #, not followed by PBS, is a comment and is ignored by both the shell and Torque.
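The distinction can be checked on any script with grep. The sketch below is only an illustration: the function name list_pbs_directives is hypothetical and not part of Torque. It prints the lines that start with #PBS and skips ordinary comments and shell commands.

```shell
#!/bin/bash
# Hypothetical helper (not a Torque tool): print only the lines that
# start with "#PBS". Reads the named file, or standard input if no
# file is given.
list_pbs_directives() {
    grep '^#PBS' "${1:--}"
}

# Demonstration on a small inline script.
list_pbs_directives <<'EOF'
#!/bin/bash
#PBS -q linux-spool
# an ordinary comment
echo hello
EOF
```

Run as written, the demonstration prints the single line `#PBS -q linux-spool`.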
Mandatory Items
Below is a snippet from a valid PBS script. These items must be included in any PBS script.
#!/bin/bash
#PBS -l nodes=2:ppn=2
#PBS -l walltime=30:00
#PBS -q linux-spool
#PBS -A systemTest
- The first line specifies the shell interpreter I wish to use. In this instance, /bin/bash.
- The second line specifies the number of processors I require. I need two nodes, and two processors per node, for a total of four processors.
- The third line specifies the maximum amount of time I believe my job will run. In this instance, 30 minutes. Syntax is <days>:<hours>:<minutes>:<seconds>; leading fields may be omitted, as they are here.
Walltime is important. Moab is smart enough to allow backfill and preemptive runs of jobs if possible. This greatly increases the efficiency of the cluster. For Moab to do its job, however, the walltime needs to be as close to reality as possible.
- The fourth line specifies the queue I am submitting to, linux-spool.
- The fifth line specifies the MyPBS account that the job will be billed to, systemTest.
For more information on the various queues, see Queue Explanation
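To double-check a walltime request, it can help to convert it to seconds. The following is a minimal sketch, assuming the colon-separated <days>:<hours>:<minutes>:<seconds> syntax described above with leading fields optional. The function name walltime_to_seconds is hypothetical, and the input is assumed to be well formed.

```shell
#!/bin/bash
# Hypothetical helper: convert a walltime string such as 30:00 or
# 1:00:00:00 to seconds. Fields are read right to left as seconds,
# minutes, hours, days. Input is assumed to be well formed.
walltime_to_seconds() {
    local IFS=':'
    local -a fields
    local -a multipliers=(1 60 3600 86400)
    local total=0 i n field
    read -r -a fields <<< "$1"
    n=${#fields[@]}
    for (( i = 0; i < n; i++ )); do
        field=${fields[n - 1 - i]}
        # 10# forces base 10 so that values like "08" are not read as octal.
        total=$(( total + 10#$field * multipliers[i] ))
    done
    echo "$total"
}

walltime_to_seconds 30:00
```

For the 30:00 request used above, this prints 1800.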
Some Variations on Requested Nodes
#Request 40 processors with Myrinet communication
#PBS -l nodes=20:ppn=2:myrinet

#Request 4 processors with Gigabit Ethernet.
#PBS -l nodes=2:ppn=2:GigE

#Request 4 processors, each on a separate node. This is good if your job
#requires a lot of memory per process ( > 1GB ). No other jobs will run on
#these nodes while your job is running.
#PBS -l nodes=4:ppn=1
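The processor totals quoted in the comments above are simply nodes multiplied by ppn. As a rough sketch, the hypothetical helper total_procs below (not a Torque command) works that out from a request string and ignores other properties such as myrinet or GigE. Treating a missing ppn as 1 is an assumption of this sketch.

```shell
#!/bin/bash
# Hypothetical helper: compute nodes * ppn from a request string like
# "nodes=20:ppn=2:myrinet". Other colon-separated properties are ignored.
# Assumption: a missing nodes= or ppn= field counts as 1.
total_procs() {
    local nodes=1 ppn=1 part
    local -a parts
    IFS=':' read -r -a parts <<< "$1"
    for part in "${parts[@]}"; do
        case $part in
            nodes=*) nodes=${part#nodes=} ;;
            ppn=*)   ppn=${part#ppn=} ;;
        esac
    done
    echo $(( nodes * ppn ))
}

total_procs "nodes=20:ppn=2:myrinet"
```

This prints 40 for the Myrinet request, matching the comment in the first variation.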
Various Useful PBS Commands
#Redirect Output and Error Files
#PBS -o <output file>
#PBS -e <error file>

#Define an environment variable
#PBS -v LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/user/lib"

#Import all current environment variables from submitting shell
#PBS -V

#Give the Job a Name
#PBS -N <some identifying string>
For documentation on other PBS commands, see the references below.
Shell Commands
Once you have filled in all of the PBS commands, it's time to actually launch your job. The PBS script is an ordinary shell script, so anything you typically do in the shell can be done here. Below are ways to launch most of the job types I can think of.
#Some Shell commands just to demonstrate

#Change to binaries/ in my home directory
cd ~/binaries

#Echo the date, this will be printed to the output file.
echo `date`

#Launch a job without communication on every requested processor.
/usr/bin/mpiexec -np <number of processors> -mca btl self <binary>

#Launch a myrinet job on all processors.
/usr/bin/mpiexec -np <number of processors> -mca btl self,gm <binary>

#Launch Ethernet job on all processors
/usr/bin/mpiexec -np <number of processors> -mca btl self,tcp <binary>
We use the Open MPI implementation installed across the cluster. Open MPI uses /usr/bin/mpiexec. For more information on the flags for the Open MPI mpiexec, see the references below.
If you need to use another MPI implementation, such as MPICH, please send a request to noc at clusters dot umaine dot edu
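Rather than typing the value for -np by hand, it can be derived inside the job. The sketch below assumes the usual Torque behavior of setting PBS_NODEFILE to a file with one line per allocated processor. The helper name count_procs is hypothetical, and the mpiexec command is only printed, not run, with <binary> left as the placeholder used above.

```shell
#!/bin/bash
# Hypothetical helper: count the lines in the node file. Uses the file
# given as an argument, else $PBS_NODEFILE. Outside a job, where no node
# file is readable, it falls back to 1.
count_procs() {
    local nodefile=${1:-${PBS_NODEFILE:-}}
    if [ -n "$nodefile" ] && [ -r "$nodefile" ]; then
        # grep -c '' counts every line, even a last one with no newline.
        grep -c '' "$nodefile" || true
    else
        echo 1
    fi
}

np=$(count_procs)
# Print the command instead of running it.
echo "/usr/bin/mpiexec -np $np -mca btl self,tcp <binary>"
```

Inside a job requested with nodes=2:ppn=2, np would be expected to come out as 4 under that assumption.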
Submitting the Job
This is the easiest part of the whole process. If you have done everything correctly up to this point, you may simply qsub the job. For example, here is a working script and its submission. The job is accepted and assigned the number 1424.
user@panopticon ~/src $ cat go.hostname
#!/bin/bash
#PBS -l nodes=2:ppn=2:myrinet
#PBS -l walltime=30:00
#PBS -q linux-spool
#PBS -o out
#PBS -e err
#PBS -A systemTest
echo $PBS_JOBID
date
/usr/local/mpiexec/bin/mpiexec -comm=none hostname
user@panopticon ~/src $ qsub go.hostname
1424.echelon.acrl.clusters.umaine.edu
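qsub prints the full job identifier, while commands such as checkjob are used with just the number. A small sketch of extracting it follows. The helper name short_job_id is hypothetical, and the identifier is hard-coded from the example above rather than taken from a live qsub call.

```shell
#!/bin/bash
# Hypothetical helper: keep everything before the first dot of a full
# job identifier, i.e. the job number.
short_job_id() {
    echo "${1%%.*}"
}

# In a real session this would be: full_id=$(qsub go.hostname)
full_id="1424.echelon.acrl.clusters.umaine.edu"
job_id=$(short_job_id "$full_id")
echo "checkjob $job_id"
```

As written, this prints `checkjob 1424`.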
Errors
A list of errors and how to fix them. Some of the qsub error messages are kind of cryptic.
- You forgot to include the #PBS -q <queue name> line.
user@panopticon ~/src $ qsub go.hostname
qsub: No default queue specified MSG=cannot locate queue
- You specified an invalid queue.
user@panopticon ~/src $ qsub go.hostname
qsub: Unknown queue MSG=cannot locate queue
- You do not have access to the queue specified
user@panopticon ~/src $ qsub go.hostname
1427.echelon.acrl.clusters.umaine.edu
user@panopticon ~/src $ checkjob 1427
job 1427
AName: go.hostname
State: Idle
....
Holds: Batch:PolicyViolation
NOTE: job cannot run (job has hold in place)
NOTE: job hold active - Batch
- Torque is unable to look up your account (email the NOC list to let us know).
user@panopticon ~/src $ qsub go.hostname
qsub: Bad UID for job execution
The above is the list of common errors that I can recall at the moment. If you experience any other errors, let me know and I'll update the list.
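A missing queue line, and the other mandatory items, can be caught before calling qsub. The following is a rough sketch of such a check, not a site-provided tool. The function name check_pbs_script is hypothetical, and it only recognizes the one-directive-per-line form used on this page.

```shell
#!/bin/bash
# Hypothetical pre-submission check: report which of the mandatory
# directives (-l nodes, -l walltime, -q, -A) are missing from a job
# script. Returns 0 if all are present, 1 otherwise.
check_pbs_script() {
    local script=$1 status=0 label pattern
    # Each line below is "label|extended regular expression".
    while IFS='|' read -r label pattern; do
        if ! grep -Eq -- "$pattern" "$script"; then
            echo "missing: $label"
            status=1
        fi
    done <<'EOF'
-l nodes=...|^#PBS[[:space:]]+-l[[:space:]]+nodes=
-l walltime=...|^#PBS[[:space:]]+-l[[:space:]]+walltime=
-q <queue name>|^#PBS[[:space:]]+-q[[:space:]]+[^[:space:]]
-A <account>|^#PBS[[:space:]]+-A[[:space:]]+[^[:space:]]
EOF
    return "$status"
}

# Check the script named on the command line, if one was given.
if [ -r "${1:-}" ]; then
    check_pbs_script "$1"
fi
```

This check cannot detect an invalid queue name or a queue you do not have access to. Those errors still only show up through qsub or checkjob.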
Notes
I found that if you want an exact number of CPUs per node, you should replace "ppn=#" with "tpn=#" in your Torque scripts. This will give you the exact number specified rather than a minimum number. Overall, it shouldn't matter since each node has only 2 CPUs, but it is an interesting thing to note. The command syntax can be found at http://www.clusterresources.com/products/mwm/docs/13.3rmextensions.shtml#tpn
References
- Open MPI Documentation: see the tuning sections for mpiexec commands. (Replace occurrences of mpirun with mpiexec.)
- Torque Documentation: PBS commands are listed here.

