Batch jobs on Big Red
Note: Big Red is scheduled to be retired from service in June 2013. Indiana University is replacing it with Big Red II, the fastest university-owned supercomputer in the nation, capable of performing one quadrillion floating-point operations per second (1 petaflop). Based on Cray XE/XK technology, Big Red II has 676 XK nodes (each containing one AMD "Interlagos" processor and one NVIDIA "Kepler" GPU) and 344 XE nodes (each containing two AMD "Abu Dhabi" processors). For more, see Big Red II at Indiana University.
On this page:
- Resource manager and scheduler overview
- Resource manager specifics
- Scheduler specifics
- Additional information
Resource manager and scheduler overview
Big Red includes 1,020 compute nodes intended for the non-interactive processing of batch jobs. Users submit jobs to run on the system through IBM's LoadLeveler resource manager. LoadLeveler, in turn, relies on Adaptive Computing's Moab scheduler to dispatch user jobs to appropriate and available compute nodes. This document provides additional detail regarding the operation of these components.
Resource manager specifics
The LoadLeveler resource manager consists primarily of a number of daemons that run on Big Red compute nodes, and a set of commands for user interaction. The compute node daemons are responsible for starting, monitoring, and stopping jobs on the compute nodes. The commands allow users to submit, modify, and cancel jobs, and monitor job status.
The compute nodes are organized into classes, or queues, with various features and restrictions intended to maximize the utilization of the cluster. For example, a few nodes are allocated to the DEBUG class, which allows only brief, small jobs to run. The purpose of this is to provide a fast-turnaround class for identifying and eliminating command-line errors or typos, rather than testing your job command file in a queue full of long-running large parallel jobs.
A LoadLeveler job ("batch job", or simply "job") is defined with a job command file. The command file includes keywords that specify various attributes of the job, such as the number of nodes required, the amount of time necessary for the job to complete, and the commands the job will run on the compute nodes. Once a job command file is prepared, the job may be submitted to LoadLeveler with the llsubmit command.
Keywords in the job command file begin with a hash (or pound sign), followed by any number of spaces, followed by the @ symbol. Comments are lines that start with a hash followed by anything except spaces and the @ symbol; thus, you may easily comment out keyword lines, as in the sketch below. You may continue keyword lines with the backslash (\) character.
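For example, in the fragment below the first keyword line has been commented out by adding a second hash in front of the @ symbol (the keyword and value shown are only illustrative):

  ## @ wall_clock_limit = 00:30:00
  # @ executable = hello_world.sh
  # @ queue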
Perhaps the simplest command file for a job on Big Red would be something like:

  # @ executable = hello_world.sh
  # @ queue
The executable keyword specifies the program that should be run (its location is relative to the directory in which the llsubmit command is run, or the directory specified with the initialdir keyword, demonstrated below). The queue keyword marks the end of a job step (jobs may consist of multiple steps; more on that later).
In the previous example, any output the script prints to STDOUT or STDERR will be lost (directed to /dev/null). Such output can be saved with the output and error keywords, as in the sketch below.
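For example, a minimal sketch (the filenames here are only illustrative):

  # @ executable = hello_world.sh
  # @ output = hello_world.out
  # @ error = hello_world.err
  # @ queue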
The node class in which the job runs can be specified with the class keyword, and the length of time required with the wall_clock_limit keyword. The default value for wall_clock_limit is two hours. You should specify accurate wall-clock limit values to prevent your jobs from either terminating before completion or skewing the scheduler's utilization calculations.
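A sketch combining these keywords (the SERIAL class and the four-hour limit are just example values):

  # @ executable = hello_world.sh
  # @ class = SERIAL
  # @ wall_clock_limit = 04:00:00
  # @ output = hello_world.out
  # @ error = hello_world.err
  # @ queue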
Any lines in the command file that follow the queue keyword are processed as if the command file were a shell script (using your default login shell). For example, a parallel NAMD job could be run with a command file along the lines of the sketch below.
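In this sketch, the node and task counts, class, limits, paths, and the exact NAMD launch line are assumptions and would need to be adapted to the actual application and input files:

  # @ job_type = MPICH
  # @ node = 2
  # @ tasks_per_node = 4
  # @ class = NORMAL
  # @ wall_clock_limit = 24:00:00
  # @ output = namd.$(Cluster).out
  # @ error = namd.$(Cluster).err
  # @ node_usage = not_shared
  # @ initialdir = /N/u/username/BigRed/namd_run
  # @ queue
  # Lines after the queue keyword are processed as a shell script.
  mpirun -np $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE namd2 apoa1.namd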
There are a number of new keywords in the above example. Setting job_type to MPICH causes LoadLeveler to define the environment variables LOADL_TOTAL_TASKS and LOADL_HOSTFILE. The former is the product of the values of the node and tasks_per_node keywords (LOADL_TOTAL_TASKS is 8, in this example). The latter is a temporary file created by LoadLeveler that contains the names of the hosts allocated for this job. Additionally, note the output and error keywords in this example. Including the variable $(Cluster) causes the output and error filenames to contain the unique job id assigned to this job, which prevents these files from being overwritten by subsequent jobs using the same command file. Specifying not_shared as the value for the node_usage keyword will prevent other jobs from running on the nodes of this job, even if sufficient resources are available to accommodate them.
Following are a few useful (and in some cases critical) keywords; a short sketch using them appears after this list:
notification: Defines under what conditions email will be sent to the job submitter; possible values include always, complete, error, start, and never.
environment: Specifies, in a comma-separated list, which environment variables should be available in the job's environment; COPY_ALL will cause all variables to be available.
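For instance, choosing to be emailed only when the job completes, and passing the full login environment to the job:

  # @ executable = hello_world.sh
  # @ notification = complete
  # @ environment = COPY_ALL
  # @ queue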
Each node class has properties defined that maximize job throughput and minimize time spent waiting in queue. Submit jobs to the queue most appropriate to the specific job requirements. If no queue is specified, jobs will go into the LONG queue. Node classes overlap to some extent (e.g., the SERIAL class is made up of nodes that are also members of either the NORMAL or LONG classes). Brief descriptions follow:
Note: The total number of idle jobs you may have over all queues is 16. In the DEBUG queue, you may have only one idle job.
NORMAL: This class is best suited for large parallel jobs.
  Nodes: 716
  Maximum nodes per job: 256 (1,024 cores)
  Maximum nodes per user: 512
  Maximum wall time: 48 hours
  Maximum CPU time: 48 hours per core per job
  Maximum running jobs in queue: Unlimited
  Maximum idle jobs in queue: 128
LONG: Jobs requiring more than 48 hours of wall
time should be run in the LONG class.
  Nodes: 300
  Maximum nodes per job: 32 (128 cores)
  Maximum nodes per user: 64
  Maximum wall time: 336 hours (14 days)
  Maximum CPU time: 336 hours (14 days) per core per job
  Maximum running jobs in queue: Unlimited
  Maximum idle jobs in queue: 256
SERIAL: Jobs that require only a single node may
be submitted to the SERIAL class.
  Nodes: 1,016
  Maximum nodes per job: 1 (4 cores)
  Maximum nodes per user: 64
  Maximum wall time: 168 hours (1 week)
  Maximum CPU time: 192 hours (48 hours x 4 cores)
  Maximum running jobs in queue: Unlimited
  Maximum idle jobs in queue: 128
DEBUG: This is a small class (4 nodes/16 cores) for short-running jobs, and is intended for use in diagnosing workflow problems, such as errors in job command files, rather than for production runs.
  Nodes: 4
  Maximum nodes per job: 4 (16 cores)
  Maximum nodes per user: 4
  Maximum wall time: 15 minutes
  Maximum CPU time: 15 minutes per core per job
  Maximum running jobs in queue: 1
  Maximum idle jobs in queue: 1
IEDC: This class is dedicated to Big Red's IEDC nodes.
  Nodes: 512
  Maximum nodes per job: 256 (1,024 cores)
  Maximum nodes per user: 256
  Maximum wall time: 336 hours (14 days)
  Maximum CPU time: 336 hours per core per job
  Maximum running jobs in queue: Unlimited
  Maximum idle jobs in queue: 128
Some common LoadLeveler commands are described below:
llq: List the contents of the LoadLeveler job queues. When run without arguments, a multi-column listing of jobs will be printed:

  Id               Owner      Submitted   ST PRI Class        Running On
  ---------------  ---------- ----------- -- --- ------------ -----------
  s10c2b5.12345.0  johndoe    5/19 13:31  R  50  LONG         s14c1b12
  s10c2b4.12346.0  janedoe    6/2  06:30  R  50  NORMAL       s9c1b9
  ...
  s10c2b5.12468    pvenkman   6/8  15:11  I  50  SERIAL
For the most part, the column headings should be self-explanatory. The ST column indicates job status (e.g., R = running, I = idle, C = complete). For more, see the llq man page.
llsubmit <command file>: Submit a job command file to the queues:

  username@BigRed:~> llsubmit job.cmd
  llsubmit: Processed command file through Submit Filter: "/home/loadl/scripts/submit_filter.pl".
  llsubmit: The job "s10c2b5.dim.2343465" has been submitted.
llhold <jobid>: Place a hold on a job. This will prevent it from being dispatched by the scheduler.
llcancel <jobid>: Cancel a job (idle or running).
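For example, using the job ID format shown in the llq listing above (the ID itself is made up; the -r flag releases a previously placed hold):

  llhold s10c2b5.12345.0       # place a user hold on the job
  llhold -r s10c2b5.12345.0    # release the hold
  llcancel s10c2b5.12345.0     # cancel the job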
Scheduler specifics
The Moab scheduler on Big Red communicates with LoadLeveler, and determines when and where jobs should run on the cluster. Most of Moab's operation is behind the scenes; it calculates dispatch order for queued jobs once every minute, directing LoadLeveler to start jobs when nodes are ready for them, and to stop jobs that have exceeded their wall-clock limit.
Job dispatch is determined by job priority, which is in turn determined by a customizable algorithm. On Big Red, the factors that most impact job priority are the fairshare usage and the ratio of queue time to wall-clock time (the expansion factor). Fairshare is best described in the Moab documentation:
Fairshare is a mechanism that allows historical resource utilization information to be incorporated into job feasibility and priority decisions. Moab's fairshare implementation allows organizations to set system utilization targets for users, groups, accounts, classes, and QoS levels. You can use both local and global (multi-cluster) fairshare information to make local scheduling decisions.
On Big Red, the user fairshare target is 5 percent of the cluster. As a specific user's use of the cluster exceeds this target, priority is decreased; if use is below the target, priority is increased. These adjustments are heavily weighted in the prioritization algorithm, so they become the primary determinant of a job's priority.
The expansion factor, while less influential than fairshare, may come into play if a job sits idle for significantly longer than it is expected to run. For example, a job of only an hour's duration begins to rise in priority after sitting idle in the queue for an hour.
In addition to these factors, administrators may adjust baseline priorities to favor certain users in cases of research deadlines or special, high-priority projects. For more information, email High Performance Systems.
When will my job run?
Use the showstart command to display the scheduler's best estimate of when your job will be dispatched.
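For example (the job ID is made up, and the exact output format varies with the Moab version):

  $ showstart s10c2b5.12345.0
  job s10c2b5.12345.0 requires 16 procs for 2:00:00
  Estimated Rsv based start in          1:05:00 on Fri Jun  7 17:45:00
  Estimated Rsv based completion in     3:05:00 on Fri Jun  7 19:45:00
  Best Partition: ALL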
When will my job really run?
The result of the
showstart command may be inaccurate,
particularly if running jobs (which are preventing your job from
starting) exit prematurely, or if a high-priority job
is submitted before your job's projected start time. The job queue on
Big Red is very dynamic, and all start time estimates should be taken
with a grain of salt.
Some common Moab commands are described below:
showq: Display the Moab scheduler's job queue. This includes jobs from all of the LoadLeveler classes. Jobs may be in a number of states ("Running" and "Idle" are the most common).
showstart <jobid>: Described above; displays the scheduler's best estimate of when your job will dispatch (i.e., begin running).
showbf: Displays intervals and node counts available for backfill jobs at the time the command is run. For example:

  $ showbf -c NORMAL
  Partition   Tasks  Nodes   StartOffset      Duration       StartDate
  ---------   -----  -----  ------------  ------------  --------------
  ALL           408    102      00:00:00      21:18:24  16:41:36_06/07
The above indicates that 102 nodes in the NORMAL queue are available for 21 hours, 18 minutes, and 24 seconds, starting at 4:41pm (the time the command was run). A job that would fit in that geometry should dispatch almost as soon as submitted.
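As an illustration, a command file whose resource request fits inside that window might look like the following (the node count, class, limit, and script name are chosen only to match the showbf output above; my_analysis.sh stands in for a user-provided script):

  # @ job_type = MPICH
  # @ node = 8
  # @ tasks_per_node = 4
  # @ class = NORMAL
  # @ wall_clock_limit = 12:00:00
  # @ executable = my_analysis.sh
  # @ queue

Requesting 8 nodes for 12 hours fits comfortably within the 102 nodes and roughly 21-hour window reported above, so such a job would be a good backfill candidate.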
Additional information
The LoadLeveler manual "Using and Administering", particularly the job command file reference (Chapter 14), is an excellent resource for LoadLeveler job scripts.
Adaptive Computing's online Moab documentation, as well as the man pages (available locally), provide useful information on Moab's function and commands.
Last modified on February 19, 2013.