Batch jobs on Big Red
Note: Big Red is scheduled to be retired from service in June 2013. Indiana University is replacing it with Big Red II, the fastest university-owned supercomputer in the nation, capable of performing one quadrillion floating-point operations per second (1 petaflop). Based on Cray XE/XK technology, Big Red II has 676 XK nodes (each containing one AMD "Interlagos" processor and one NVIDIA "Kepler" GPU) and 344 XE nodes (each containing two AMD "Abu Dhabi" processors). For more, see Big Red II at Indiana University.
On this page:
- Resource manager and scheduler overview
- Resource manager specifics
- Scheduler specifics
- Additional information
Resource manager and scheduler overview
Big Red includes 1,020 compute nodes intended for the non-interactive processing of batch jobs. Users submit jobs to run on the system through IBM's LoadLeveler resource manager. LoadLeveler, in turn, relies on Adaptive Computing's Moab scheduler to dispatch user jobs to appropriate and available compute nodes. This document provides additional detail regarding the operation of these components.
Resource manager specifics
The LoadLeveler resource manager consists primarily of a number of daemons that run on Big Red compute nodes, and a set of commands for user interaction. The compute node daemons are responsible for starting, monitoring, and stopping jobs on the compute nodes. The commands allow users to submit, modify, and cancel jobs, and monitor job status.
The compute nodes are organized into classes, or queues, with various features and restrictions intended to maximize the utilization of the cluster. For example, a few nodes are allocated to the DEBUG class, which allows only brief, small jobs to run. The purpose of this is to provide a fast-turnaround class for identifying and eliminating command-line errors or typos, rather than testing your job command file in a queue full of long-running large parallel jobs.
A LoadLeveler job ("batch job", or simply "job") is defined with a job command file. The command file includes keywords that specify various attributes of the job, such as the number of nodes required, the amount of time necessary for the job to complete, and the commands the job will run on the compute nodes. Once a job command file is prepared, the job may be submitted to LoadLeveler with the llsubmit command.
Keywords in the job command file begin with a hash (or pound sign), followed by any number of spaces, followed by the @ symbol. Comments are lines that start with a hash followed by anything except spaces and the @ symbol; thus, you may easily comment out keyword lines, as in the sketch below. You may continue keyword lines with the backslash (\) character.
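For example, in the fragment below the first keyword line has been commented out by adding a second hash in front of the @ symbol (the keyword and value shown are only illustrative):

  ## @ wall_clock_limit = 00:30:00
  # @ executable = hello_world.sh
  # @ queue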
Perhaps the simplest command file for a job on Big Red would be something like:

  # @ executable = hello_world.sh
  # @ queue
The executable keyword specifies the program that should be run (its location is relative to the directory in which the llsubmit command is run, or the directory specified with the initialdir keyword, demonstrated below). The queue keyword marks the end of a job step (jobs may consist of multiple steps; more on that later).
In the previous example, any output the script prints to STDOUT or STDERR will be lost (directed to /dev/null). Such output can be saved with the output and error keywords, as in the sketch below.
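For example, a minimal sketch (the filenames here are only illustrative):

  # @ executable = hello_world.sh
  # @ output = hello_world.out
  # @ error = hello_world.err
  # @ queue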
The node class in which the job runs can be specified with the class keyword, and the length of time required with the wall_clock_limit keyword. The default value for wall_clock_limit is two hours. You should specify accurate wall-clock limit values to prevent your jobs from either terminating before completion or skewing the scheduler's utilization calculations.
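A sketch combining these keywords (the SERIAL class and the four-hour limit are just example values):

  # @ executable = hello_world.sh
  # @ class = SERIAL
  # @ wall_clock_limit = 04:00:00
  # @ output = hello_world.out
  # @ error = hello_world.err
  # @ queue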
Any lines in the command file that follow the queue keyword are processed as if the command file were a shell script (using your default login shell). For example, a parallel NAMD job could be run with a command file along the lines of the sketch below.
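In this sketch, the node and task counts, class, limits, paths, and the exact NAMD launch line are assumptions and would need to be adapted to the actual application and input files:

  # @ job_type = MPICH
  # @ node = 2
  # @ tasks_per_node = 4
  # @ class = NORMAL
  # @ wall_clock_limit = 24:00:00
  # @ output = namd.$(Cluster).out
  # @ error = namd.$(Cluster).err
  # @ node_usage = not_shared
  # @ initialdir = /N/u/username/BigRed/namd_run
  # @ queue
  # Lines after the queue keyword are processed as a shell script.
  mpirun -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE namd2 apoa1.namd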
There are a number of new keywords in the above example. Setting job_type to MPICH causes LoadLeveler to define the environment variables LOADL_TOTAL_TASKS and LOADL_HOSTFILE. The former is the product of the values of the node and tasks_per_node keywords (LOADL_TOTAL_TASKS is 8, in this example). The latter is a temporary file created by LoadLeveler that contains the names of the hosts allocated for this job. Additionally, note the output and error keywords in this example. Including the variable $(Cluster) causes the output and error filenames to contain the unique job id assigned to this job, which prevents these files from being overwritten by subsequent jobs using the same command file. Specifying not_shared as the value for the node_usage keyword will prevent other jobs from running on the nodes of this job, even if sufficient resources are available to accommodate them.
Following are a few useful (and in some cases critical) keywords; a short sketch using them appears after this list:
notification: Defines under what conditions email will be sent to the job submitter; possible values include always, complete, error, start, and never.
environment: Specifies, in a comma-separated list, which environment variables should be available in the job's environment; COPY_ALL will cause all variables to be available.
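For instance, choosing to be emailed only when the job completes, and passing the full login environment to the job:

  # @ executable = hello_world.sh
  # @ notification = complete
  # @ environment = COPY_ALL
  # @ queue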
Each node class has properties defined that maximize job throughput and minimize time spent waiting in queue. Submit jobs to the queue most appropriate to the specific job requirements. If no queue is specified, jobs will go into the LONG queue. Node classes overlap to some extent (e.g., the SERIAL class is made up of nodes that are also members of either the NORMAL or LONG classes). Brief descriptions follow:
Note: The total number of idle jobs you may have over all queues is 16. In the DEBUG queue, you may have only one idle job.
NORMAL: This class is best suited for large parallel jobs.
  Nodes: 716
  Maximum nodes per job: 256 (1,024 cores)
  Maximum nodes per user: 512
  Maximum wall time: 48 hours
  Maximum CPU time: 48 hours per core per job
  Maximum running jobs in queue: Unlimited
  Maximum idle jobs in queue: 128
LONG: Jobs requiring more than 48 hours of wall
time should be run in the LONG class.
  Nodes: 300
  Maximum nodes per job: 32 (128 cores)
  Maximum nodes per user: 64
  Maximum wall time: 336 hours (14 days)
  Maximum CPU time: 336 hours (14 days) per core per job
  Maximum running jobs in queue: Unlimited
  Maximum idle jobs in queue: 256
SERIAL: Jobs that require only a single node may
be submitted to the SERIAL class.
  Nodes: 1,016
  Maximum nodes per job: 1 (4 cores)
  Maximum nodes per user: 64
  Maximum wall time: 168 hours (1 week)
  Maximum CPU time: 192 hours (48 hours x 4 cores)
  Maximum running jobs in queue: Unlimited
  Maximum idle jobs in queue: 128
DEBUG: This is a small class (4 nodes/16 cores) for short-running jobs, and is intended for use in diagnosing workflow problems, such as errors in job command files, rather than for production runs.
  Nodes: 4
  Maximum nodes per job: 4 (16 cores)
  Maximum nodes per user: 4
  Maximum wall time: 15 minutes
  Maximum CPU time: 15 minutes per core per job
  Maximum running jobs in queue: 1
  Maximum idle jobs in queue: 1
IEDC: This class is dedicated to Big Red's IEDC nodes.
  Nodes: 512
  Maximum nodes per job: 256 (1,024 cores)
  Maximum nodes per user: 256
  Maximum wall time: 336 hours (14 days)
  Maximum CPU time: 336 hours per core per job
  Maximum running jobs in queue: Unlimited
  Maximum idle jobs in queue: 128
Some common LoadLeveler commands are described below:
llq: List the contents of the LoadLeveler job queues. When run without arguments, a multi-column listing of jobs will be printed:

  Id               Owner      Submitted   ST PRI Class        Running On
  ---------------  ---------- ----------- -- --- ------------ -----------
  s10c2b5.12345.0  johndoe    5/19 13:31  R  50  LONG         s14c1b12
  s10c2b4.12346.0  janedoe    6/2  06:30  R  50  NORMAL       s9c1b9
  ...
  s10c2b5.12468    pvenkman   6/8  15:11  I  50  SERIAL
For the most part, the column headings should be self-explanatory. The ST column indicates job status (e.g., R = running, I = idle, C = complete). For more, see the llq man page.
llsubmit <command file>: Submit a job command file to the queues:

  username@BigRed:~> llsubmit job.cmd
  llsubmit: Processed command file through Submit Filter: "/home/loadl/scripts/submit_filter.pl".
  llsubmit: The job "s10c2b5.dim.2343465" has been submitted.
llhold <jobid>: Place a hold on a job. This will prevent it from being dispatched by the scheduler.
llcancel <jobid>: Cancel a job (idle or running).
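For example, using the job ID format shown in the llq listing above (the ID itself is made up; the -r flag releases a previously placed hold):

  llhold s10c2b5.12345.0       # place a user hold on the job
  llhold -r s10c2b5.12345.0    # release the hold
  llcancel s10c2b5.12345.0     # cancel the job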
Scheduler specifics
The Moab scheduler on Big Red communicates with LoadLeveler, and determines when and where jobs should run on the cluster. Most of Moab's operation is behind the scenes; it calculates dispatch order for queued jobs once every minute, directing LoadLeveler to start jobs when nodes are ready for them, and to stop jobs that have exceeded their wall-clock limit.
Job dispatch is determined by job priority, which is in turn determined by a customizable algorithm. On Big Red, the factors that most impact job priority are the fairshare usage and the ratio of queue time to wall-clock time (the expansion factor). Fairshare is best described in the Moab documentation:
Fairshare is a mechanism that allows historical resource utilization information to be incorporated into job feasibility and priority decisions. Moab's fairshare implementation allows organizations to set system utilization targets for users, groups, accounts, classes, and QoS levels. You can use both local and global (multi-cluster) fairshare information to make local scheduling decisions.
On Big Red, the user fairshare target is 5 percent of the cluster. As a specific user's use of the cluster exceeds this target, priority is decreased; if use is below the target, priority is increased. These adjustments are heavily weighted in the prioritization algorithm, so they become the primary determinant of a job's priority.
The expansion factor, while less influential than fairshare, may come into play if a job sits idle for significantly longer than it is expected to run. For example, a job of only an hour's duration begins to rise in priority after sitting idle in the queue for an hour.
In addition to these factors, administrators may adjust baseline priorities to favor certain users in cases of research deadlines or special, high-priority projects. For more information, email High Performance Systems.
When will my job run?
Use the showstart command to display the scheduler's best estimate of when your job will be dispatched.
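For example (the job ID is made up, and the exact output format varies with the Moab version):

  $ showstart s10c2b5.12345.0
  job s10c2b5.12345.0 requires 16 procs for 2:00:00
  Estimated Rsv based start in          1:05:00 on Fri Jun  7 17:45:00
  Estimated Rsv based completion in     3:05:00 on Fri Jun  7 19:45:00
  Best Partition: ALL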
When will my job really run?
The result of the
showstart command may be inaccurate,
particularly if running jobs (which are preventing your job from
starting) exit prematurely, or if a high-priority job
is submitted before your job's projected start time. The job queue on
Big Red is very dynamic, and all start time estimates should be taken
with a grain of salt.
Some common Moab commands are described below:
showq: Display the Moab scheduler's job queue. This includes jobs from all of the LoadLeveler classes. Jobs may be in a number of states ("Running" and "Idle" are the most common).
showstart <jobid>: Described above; displays the scheduler's best estimate of when your job will dispatch (i.e., begin running).
showbf: Displays intervals and node counts available for backfill jobs at the time the command is run. For example:

  $ showbf -c NORMAL
  Partition   Tasks  Nodes   StartOffset      Duration       StartDate
  ---------   -----  -----  ------------  ------------  --------------
  ALL           408    102      00:00:00      21:18:24  16:41:36_06/07
The above indicates that 102 nodes in the NORMAL queue are available for 21 hours, 18 minutes, and 24 seconds, starting at 4:41pm (the time the command was run). A job that would fit in that geometry should dispatch almost as soon as submitted.
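As an illustration, a command file whose resource request fits inside that window might look like the following (the node count, class, limit, and script name are chosen only to match the showbf output above; my_analysis.sh stands in for a user-provided script):

  # @ job_type = MPICH
  # @ node = 8
  # @ tasks_per_node = 4
  # @ class = NORMAL
  # @ wall_clock_limit = 12:00:00
  # @ executable = my_analysis.sh
  # @ queue

Requesting 8 nodes for 12 hours fits comfortably within the 102 nodes and roughly 21-hour window reported above, so such a job would be a good backfill candidate.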
Additional information
The LoadLeveler manual "Using and Administering", particularly the job command file reference (Chapter 14), is an excellent resource for LoadLeveler job scripts.
Adaptive Computing's online Moab documentation, as well as the man pages (available locally), provide useful information on Moab's function and commands.
Last modified on February 19, 2013.