The Portable Batch System (PBS) is available as Open Source software from http://www.OpenPbs.org/. A commercial version can be bought from http://www.PBSPro.com/. PBSPro also offers support for OpenPBS, at a decent price for academic institutions.
There exists a very useful collection of user-contributed software/patches for OpenPBS at http://www-unix.mcs.anl.gov/openpbs/.
This HowTo document outlines all the steps required to compile and install the Portable Batch System (PBS) versions 2.1, 2.2 and 2.3. Most likely the steps will be the same for the PBSPro software.
The latest version of PBS is available from http://www.OpenPbs.org/. The PBS documentation available at the Web-site should be handy for in-depth discussion of the points covered in this HowTo.
We also discuss how to create a PBS script for parallel or serial jobs. The cleanup in an epilogue script may be required for parallel jobs.
Accounting Reports may be generated from PBS' accounting
files. We provide a simple tool pbsacct that processes
and formats the accounting into a useful report.
Download the latest version of pbsacct from the
ftp://ftp.fysik.dtu.dk/pub/PBS/ directory.
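To illustrate the kind of processing such a tool performs, here is a minimal sketch (it is not pbsacct itself) that sums the CPU time used per user. It assumes that each accounting record has the form date;type;jobid;key=value ..., that job-end records have type E, and that they carry user= and resources_used.cput=HH:MM:SS fields. Please check this against your own accounting files before relying on it. The function name cput_per_user is made up for this example.

```sh
#!/bin/sh
# cput_per_user [FILE...]
# Prints "user seconds" for every user found in job-end (E) records.
# ASSUMPTION: records look like  date;E;jobid;user=... resources_used.cput=HH:MM:SS ...
cput_per_user() {
    awk -F';' '
    $2 == "E" {
        user = ""; secs = 0
        n = split($4, kv, " ")
        for (i = 1; i <= n; i++) {
            if (kv[i] ~ /^user=/) {
                user = substr(kv[i], 6)
            } else if (kv[i] ~ /^resources_used\.cput=/) {
                # Convert HH:MM:SS to seconds
                split(substr(kv[i], 21), t, ":")
                secs = t[1] * 3600 + t[2] * 60 + t[3]
            }
        }
        if (user != "") total[user] += secs
    }
    END { for (u in total) print u, total[u] }
    ' "$@" | sort
}
```

A call such as cput_per_user /var/spool/PBS/server_priv/accounting/* would then print one line per user. The accounting directory is assumed here from the server home used in this HowTo.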
Feedback to this document was kindly provided by:
The following steps are what we use to install PBS from scratch on our systems. Please send corrections and additions to Ole.H.Nielsen (at) fysik.dtu.dk.
./configure --set-server-home=/var/spool/PBS --set-default-server=zeise
On Compaq Tru64 UNIX make sure that you use the Compaq C-compiler instead of the GNU gcc by doing "setenv CC cc". You should add these flags to the above configure command: --set-cflags="-g3 -O2". It is also important that the path /var/spool/PBS does not include any soft-links, such as /var -> /usr/var, since this triggers a bug in the PBS code.
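The soft-link requirement can be checked before running configure. The following is a minimal sketch, assuming an absolute path is given; the function name path_has_symlink is made up for this example and is not part of PBS.

```sh
#!/bin/sh
# path_has_symlink PATH
# Succeeds (exit status 0) and prints the offending component if any
# existing component of the absolute PATH is a symbolic link.
path_has_symlink() {
    rest=${1#/}                      # drop the leading slash
    current=""
    while [ -n "$rest" ]; do
        part=${rest%%/*}             # next path component
        case $rest in
            */*) rest=${rest#*/} ;;  # more components follow
            *)   rest="" ;;
        esac
        [ -z "$part" ] && continue   # skip empty parts caused by "//"
        current="$current/$part"
        if [ -L "$current" ]; then
            echo "$current"
            return 0
        fi
    done
    return 1
}
```

For example, "if path_has_symlink /var/spool/PBS; then echo Choose another server home; fi" would warn when /var is a soft-link to /usr/var.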
If you compiled PBS for a different architecture before, make sure to clean up before running configure:
gmake distclean
On AIX 4.1.5 edit src/tools/Makefile to add a library: LIBS = -lld
On Compaq Tru64 UNIX use the native Compaq C-compiler:
gmake CC=cc
The default CFLAGS are "-g -O2", but the Compaq compiler requires "-g3 -O2" for optimization. Set this with:
./configure (flags) --set-cflags="-g3 -O2"
After the make has completed, install the PBS files as the root superuser:
gmake install
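If the same build procedure is used on several platforms, the choice of compiler flags described above can be made by a small helper. This is only a sketch: the function name pbs_cflags is made up, and it assumes that "uname -s" reports OSF1 on Tru64 UNIX.

```sh
#!/bin/sh
# pbs_cflags [SYSTEM]
# Prints the CFLAGS value suggested in this HowTo for the given system
# name (as reported by "uname -s"; defaults to the current system).
pbs_cflags() {
    system=${1:-$(uname -s)}
    case $system in
        OSF1) echo "-g3 -O2" ;;   # ASSUMPTION: Tru64 UNIX reports itself as OSF1
        *)    echo "-g -O2" ;;    # the PBS default flags
    esac
}
```

The result could be passed on as ./configure (flags) --set-cflags="$(pbs_cflags)".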
/usr/local/sbin/pbs_server -t create
/usr/local/sbin/pbs_sched
The "-t create" should only be executed once, at the time of installation!
The pbs_server and pbs_sched should be started at boot time: On Linux this is done automatically by /etc/rc.d/init.d/pbs. Otherwise use your UNIX's standard method (e.g. /etc/rc.local) to run the following commands at boot time:
/usr/local/sbin/pbs_server -a true
/usr/local/sbin/pbs_sched
The "-a true" sets the scheduling attribute to True, so that jobs may start running.
Our current configuration is:
# qmgr
Max open servers: 4
Qmgr: print server
#
# Create queues and set their attributes.
#
#
# Create and define queue verylong
#
create queue verylong
set queue verylong queue_type = Execution
set queue verylong Priority = 40
set queue verylong max_running = 10
set queue verylong resources_max.cput = 72:00:00
set queue verylong resources_min.cput = 12:00:01
set queue verylong resources_default.cput = 72:00:00
set queue verylong enabled = True
set queue verylong started = True
#
# Create and define queue long
#
create queue long
set queue long queue_type = Execution
set queue long Priority = 60
set queue long max_running = 10
set queue long resources_max.cput = 12:00:00
set queue long resources_min.cput = 02:00:01
set queue long resources_default.cput = 12:00:00
set queue long enabled = True
set queue long started = True
#
# Create and define queue medium
#
create queue medium
set queue medium queue_type = Execution
set queue medium Priority = 80
set queue medium max_running = 10
set queue medium resources_max.cput = 02:00:00
set queue medium resources_min.cput = 00:20:01
set queue medium resources_default.cput = 02:00:00
set queue medium enabled = True
set queue medium started = True
#
# Create and define queue small
#
create queue small
set queue small queue_type = Execution
set queue small Priority = 100
set queue small max_running = 10
set queue small resources_max.cput = 00:20:00
set queue small resources_default.cput = 00:20:00
set queue small enabled = True
set queue small started = True
#
# Create and define queue default
#
create queue default
set queue default queue_type = Route
set queue default max_running = 10
set queue default route_destinations = small
set queue default route_destinations += medium
set queue default route_destinations += long
set queue default route_destinations += verylong
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = True
set server max_user_run = 6
set server acl_host_enable = True
set server acl_hosts = *.fysik.dtu.dk
set server acl_hosts = *.alpha.fysik.dtu.dk
set server default_queue = default
set server log_events = 63
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.cput = 01:00:00
set server resources_default.neednodes = 1
set server resources_default.nodect = 1
set server resources_default.nodes = 1
set server scheduler_iteration = 60
set server default_node = 1#shared
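With this configuration, a job submitted to the routing queue "default" should end up in the first destination queue whose cput limits accept it. The sketch below mimics that selection for a job that requests cput explicitly. It is our reading of the limits listed above, not the logic of pbs_server itself, and the function names are made up for this example.

```sh
#!/bin/sh
# hms_to_seconds HH:MM:SS
# awk is used so that fields like "08" are not taken as octal numbers.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# queue_for_cput HH:MM:SS
# Prints the queue that the configuration above would select,
# trying the destinations in the order small, medium, long, verylong.
queue_for_cput() {
    secs=$(hms_to_seconds "$1")
    if [ "$secs" -le 1200 ]; then        # up to 00:20:00
        echo small
    elif [ "$secs" -le 7200 ]; then      # 00:20:01 to 02:00:00
        echo medium
    elif [ "$secs" -le 43200 ]; then     # 02:00:01 to 12:00:00
        echo long
    elif [ "$secs" -le 259200 ]; then    # 12:00:01 to 72:00:00
        echo verylong
    else
        echo rejected                    # exceeds every resources_max.cput
    fi
}
```

For example, queue_for_cput 01:30:00 prints medium.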
Create the file /var/spool/PBS/mom_priv/config on all PBS nodes (server and clients) with the contents:
# The central server must be listed:
$clienthost zeise
where the correct servername must replace "zeise". You may add other relevant lines as recommended in the manual, for example for restricting access and for logging:
$logevent 0x1ff
$restricted *.your.domain.name
(list the domain names that you want to give access).
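When the server name or domain differs between installations, the file can be generated instead of edited by hand. This is a minimal sketch that prints the lines shown above; the function name mom_config and its two arguments are made up for this example.

```sh
#!/bin/sh
# mom_config SERVER DOMAIN
# Prints a MOM configuration file with the lines used in this HowTo.
mom_config() {
    server=$1
    domain=$2
    cat <<EOF
# The central server must be listed:
\$clienthost $server
\$logevent 0x1ff
\$restricted *.$domain
EOF
}
```

Running "mom_config zeise your.domain.name > /var/spool/PBS/mom_priv/config" on the server would produce the file that is then distributed to the nodes.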
For maintenance of the configuration file, we use rdist to duplicate /var/spool/PBS/mom_priv/config from the server to all PBS nodes.
/usr/local/sbin/pbs_mom
or "/etc/rc.d/init.d/pbs start" on Linux. Make sure that MOM is started at boot time. See discussion under point 5.
On Compaq Tru64 UNIX 4.0E+F there may be a problem with starting
pbs_mom too soon. Some network problem makes pbs_mom report
errors in an infinite loop, which fills up the logfiles'
filesystem within a short time !
Several people told me that they don't have this problem,
so it's not understood at present.
The following section is only relevant if you have this problem
on Tru64 UNIX.
On Tru64 UNIX start pbs_mom from the last entry in /etc/inittab:
# Portable Batch System batch execution mini-server
pbsmom::once:/etc/rc.pbs > /dev/console 2>&1
The file /etc/rc.pbs delays the startup of pbs_mom:
#!/bin/sh
#
# Portable Batch System (PBS) startup
#
# On Digital UNIX, pbs_mom fills up the mom_logs directory
# within minutes after reboot. Try to sleep at startup
# in order to avoid this.
PBSDIR=/usr/local/sbin
if [ -x ${PBSDIR}/pbs_mom ]; then
    echo PBS startup.
    # Sleep for a while
    sleep 120
    ${PBSDIR}/pbs_mom # MOM
    echo Done.
else
    echo Could not execute PBS commands !
fi
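Instead of a fixed delay, one could poll until some condition holds before starting pbs_mom. Since the cause of the problem is not understood, the sketch below is only an idea to experiment with: it assumes that the trouble goes away once the network is usable, which this document does not confirm. The function name wait_until is made up for this example.

```sh
#!/bin/sh
# wait_until MAX_TRIES DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most MAX_TRIES times, sleeping
# DELAY seconds after each failed attempt. Returns 0 on success and
# 1 if COMMAND never succeeded.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}
```

In /etc/rc.pbs the "sleep 120" could then be replaced by something like "wait_until 24 5 ping -c 1 zeise", where the choice of ping and its -c option as the readiness test is an assumption that must be adapted to your system.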
qstart default small medium long verylong
qenable default small medium long verylong
This needs to be done only once and for all, at the time when you install PBS.
Add nodes using the qmgr command:
# qmgr
Max open servers: 4
Qmgr: create node node99 properties=ev67
where the node-name is node99 with the properties=ev67. Alternatively, you may simply list the nodes in the file /var/spool/PBS/server_priv/nodes:
server:ts ev67
node99 ev67
The :ts indicates a time-shared node; nodes without :ts are cluster nodes where batch jobs may execute. The second column lists the properties that you associate with the node. Restart the pbs_server after manually editing the nodes file.
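For a larger cluster, the qmgr commands for all nodes can be generated rather than typed one by one. This is a minimal sketch; the function name node_create_commands is made up for this example, and it relies on qmgr reading its commands from standard input.

```sh
#!/bin/sh
# node_create_commands PROPERTY NODE...
# Prints one "create node" command per node name, for feeding to qmgr.
node_create_commands() {
    property=$1
    shift
    for node in "$@"; do
        echo "create node $node properties=$property"
    done
}
```

The output can be inspected first and then piped into qmgr, for example "node_create_commands ev67 node97 node98 node99 | qmgr".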
# qmgr
Max open servers: 4
Qmgr: set server scheduling=true
Your PBS batch system ought to be fully functional at this point
so that you can submit batch jobs using the qsub command.
For debugging purposes, PBS offers you an "interactive batch job"
by using the command qsub -I.
As an example, you may use the PBS batch scripts listed under
"Batch job scripts" below as templates for creating your own batch scripts.
The first script runs an MPI parallel job on the available
processors.
If you specify #PBS -l nodes=1 in the script, you
will be running a non-parallel (or serial) batch job, as in the second script.
If a parallel job dies prematurely for any reason, PBS will
clean up user processes on the master-node only.
We (and others) have found that often MPI slave-processes
are lingering on all of the slave-nodes waiting for
communication from the (dead) master-process.
At present the only generally applicable way to clean up user processes
on the nodes allocated to a PBS job is to use the
PBS epilogue capability (see the PBS documentation).
The epilogue is executed on the job's master-node only.
An epilogue script /var/spool/PBS/mom_priv/epilogue
should be created on every node, containing for example the script
listed under "Clean-up after parallel jobs" below.
On SMP nodes one cannot use the skill (Super-kill) command in this way, since
the user's processes belonging to other PBS jobs might be terminated.
The present solution works correctly only on single-CPU nodes.
An alternative cleanup solution for Linux systems
is provided by Benjamin Webb of Oxford University.
This solution may work more reliably than the above.
Batch job scripts
#!/bin/sh
### Job name
#PBS -N test
### Declare job non-rerunnable
#PBS -r n
### Output files
#PBS -e test.err
#PBS -o test.log
### Mail to user
#PBS -m ae
### Queue name (small, medium, long, verylong)
#PBS -q long
### Number of nodes (node property ev67 wanted)
#PBS -l nodes=8:ev67
# This job's working directory
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo This job runs on the following processors:
echo `cat $PBS_NODEFILE`
# Define number of processors
NPROCS=`wc -l < $PBS_NODEFILE`
echo This job has allocated $NPROCS nodes
# Run the parallel MPI executable "a.out"
mpirun -v -machinefile $PBS_NODEFILE -np $NPROCS a.out
#!/bin/sh
### Job name
#PBS -N test
### Declare job non-rerunnable
#PBS -r n
### Output files
#PBS -e test.err
#PBS -o test.log
### Mail to user
#PBS -m ae
### Queue name (small, medium, long, verylong)
#PBS -q long
### Number of nodes (node property ev6 wanted)
#PBS -l nodes=1:ev6
# This job's working directory
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
# Run your executable
a.out
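The parallel script above counts the lines of $PBS_NODEFILE to obtain the number of processors. If several processors per node are allocated, we assume that the file lists a node once per processor, so the number of distinct nodes can be smaller than the number of lines. The following sketch reports both numbers; the function name nodefile_summary is made up for this example.

```sh
#!/bin/sh
# nodefile_summary FILE
# Prints "<processors> <distinct nodes>" for a PBS node file,
# ignoring empty lines.
nodefile_summary() {
    awk 'NF { procs++; if (!seen[$1]++) nodes++ }
         END { print procs + 0, nodes + 0 }' "$1"
}
```

Inside a batch script it could be called as nodefile_summary $PBS_NODEFILE.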
Clean-up after parallel jobs
#!/bin/sh
echo '--------------------------------------'
echo Running PBS epilogue script
# Set key variables
USER=$2
NODEFILE=/var/spool/PBS/aux/$1
echo
echo Killing processes of user $USER on the batch nodes
for node in `cat $NODEFILE`
do
    echo Doing node $node
    su $USER -c "ssh -a -k -n -x $node skill -v -9 -u $USER"
done
echo Done.
The Secure Shell command ssh may be replaced by the remote-shell
command of your choice.
The skill (Super-kill) command is a nice tool available from
ftp://fast.cs.utah.edu/pub/skill/,
or as part of the Linux procps RPM-package.
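Before installing an epilogue that kills processes, it may be useful to see which commands it would run. The following sketch only prints the remote skill commands, one per distinct node, and executes nothing. The function name cleanup_commands is made up for this example, and visiting each node once does not remove the SMP limitation of this approach.

```sh
#!/bin/sh
# cleanup_commands USER NODEFILE
# Prints (but does not run) one remote skill command per distinct
# node listed in NODEFILE.
cleanup_commands() {
    user=$1
    nodefile=$2
    sort -u "$nodefile" | while read -r node; do
        [ -n "$node" ] || continue    # ignore empty lines
        echo "ssh -a -k -n -x $node skill -v -9 -u $user"
    done
}
```

Called with a user name and a copy of a job's node file, it lists the commands that the epilogue loop would issue.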
Last update: 07 Jan 2003
Copyright © 2003 Center for Atomic-scale Materials Physics. All rights reserved.