ARCHIVED: Batch jobs on Big Red

This content has been archived, and is no longer maintained by Indiana University. Information here may no longer be accurate, and links may no longer be available or reliable.

Note: Big Red, originally commissioned in 2006, was retired from service on September 30, 2013. Its replacement, Big Red II, is a hybrid CPU/GPU Cray XE6/XK7 system capable of achieving a theoretical peak performance (Rpeak) of 1 petaFLOPS, making it one of the two fastest university-owned supercomputers in the US. For more, see ARCHIVED: Get started on Big Red II. If you have questions or concerns about the retirement of Big Red, contact the High Performance Systems group.

On this page:

  • Resource manager and scheduler overview
  • Resource manager specifics
  • Scheduler specifics
  • Additional information

Resource manager and scheduler overview

Big Red includes 1,020 compute nodes intended for the non-interactive processing of batch jobs. Users submit jobs to run on the system through IBM's LoadLeveler resource manager. LoadLeveler, in turn, relies on Adaptive Computing's Moab scheduler to dispatch user jobs to appropriate and available compute nodes. This document provides additional detail regarding the operation of these components.

Resource manager specifics

The LoadLeveler resource manager consists primarily of a number of daemons that run on Big Red compute nodes, and a set of commands for user interaction. The compute node daemons are responsible for starting, monitoring, and stopping jobs on the compute nodes. The commands allow users to submit, modify, and cancel jobs, and monitor job status.

The compute nodes are organized into classes, or queues, with various features and restrictions intended to maximize the utilization of the cluster. For example, a few nodes are allocated to the DEBUG class, which allows only brief, small jobs to run. This provides a fast-turnaround class for identifying and eliminating command file errors or typos, rather than testing your job command file in a queue full of long-running, large parallel jobs.

Job definition

A LoadLeveler job ("batch job", or simply "job") is defined with a job command file. The command file includes keywords that specify various attributes of the job, such as the number of nodes required, the amount of time necessary for the job to complete, and the commands the job will run on the compute nodes. Once a job command file is prepared, the job may be submitted to LoadLeveler with the llsubmit command.

Keywords in the job command file begin with a hash (or pound sign) followed by any number of spaces, followed by the @ symbol:

  # @ <keyword> = <value>

Comments are lines that start with a hash that is not followed by the @ symbol (even after any number of spaces). Thus, you may easily comment out keyword lines:

  ## @ <keyword> = <value> # ignored by LoadLeveler

You may continue keyword lines with the backslash (\) character:

  # @ <keyword> = <a long long long> \
                  <list of values>

Perhaps the simplest command file for a job on Big Red would be something like:

  # @ executable = hello_world.sh
  # @ queue

The executable keyword specifies the program that should be run (its location is relative to the directory in which the llsubmit command is run, or to the directory specified with the initialdir keyword). The queue keyword marks the end of a job step; jobs may consist of multiple steps.
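
For instance, a command file might define two steps, with the second running only if the first exits successfully (a minimal sketch; the step names and script names are illustrative):

  # @ step_name  = prepare
  # @ executable = prepare_data.sh
  # @ queue
  # @ step_name  = analyze
  # @ dependency = (prepare == 0)       # run only if the prepare step exits 0
  # @ executable = analyze_data.sh
  # @ queue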

In the single-step hello_world.sh example above, any output the script prints to STDOUT or STDERR will be lost (directed to /dev/null). Such output can be saved with the output and error keywords, respectively:

  # @ output     = hw.out
  # @ error      = hw.err
  # @ executable = hello_world.sh
  # @ queue

The node class in which the job runs can be specified, as well as the length of time required (class and wall_clock_limit, respectively):

  # @ class            = DEBUG
  # @ wall_clock_limit = 10:00          # 10 minutes
  # @ output           = hw.out
  # @ error            = hw.err
  # @ executable       = hello_world.sh
  # @ queue

The default value for wall_clock_limit is two hours. Specify an accurate wall-clock limit so your job is neither terminated before it completes nor left to skew the scheduler's utilization calculations.

Any lines in the command file that follow the queue keyword are processed as if the command file were a shell script (using your default login shell). For example, a parallel NAMD job could be run with the following command file:

  # @ output = test_namd.$(Cluster).out
  # @ error = test_namd.$(Cluster).err
  # @ wall_clock_limit = 1:00:00
  # @ class = NORMAL
  # @ node_usage = not_shared
  # @ job_type = MPICH
  # @ node = 4
  # @ tasks_per_node = 2
  # @ queue    
  mpirun -np $LOADL_TOTAL_TASKS \
         -machinefile $LOADL_HOSTFILE \
         namd2 apoa1.namd

There are several new keywords in the above example. Setting job_type to MPICH causes LoadLeveler to define the environment variables LOADL_TOTAL_TASKS and LOADL_HOSTFILE. The former is the product of the values of node and tasks_per_node (LOADL_TOTAL_TASKS is 8 in this example); the latter is a temporary file, created by LoadLeveler, that contains the names of the hosts allocated to the job.

Also note the output and error keywords in this example. Including the variable $(Cluster) causes the output and error filenames to contain the unique job ID assigned to the job, which prevents these files from being overwritten by subsequent jobs that use the same command file.

Finally, specifying not_shared as the value of the node_usage keyword prevents other jobs from running on this job's nodes, even if sufficient resources are available to accommodate them.

Following are a couple of other useful (and in some cases critical) keywords; both appear in the sketch after this list:

  • notification: Defines under what conditions email will be sent to the job submitter; possible values are always, error, start, never, and complete
  • environment: A comma-separated list of environment variables that should be available in the job's environment; the special value COPY_ALL makes all of the submitting shell's variables available
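
As a minimal sketch, building on the hello_world.sh example above (the particular values chosen here are illustrative):

  # @ notification = error              # send mail only if the job fails
  # @ environment  = COPY_ALL           # copy all variables from the submitting shell
  # @ output       = hw.out
  # @ error        = hw.err
  # @ executable   = hello_world.sh
  # @ queue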

Queue properties

Each queue on Big Red has properties defined that maximize job throughput and minimize time spent waiting to run. Submit jobs to the queue most appropriate to your job's specific requirements. If you don't specify a queue, your job will go into the LONG queue. Queues overlap to some extent (e.g., the SERIAL queue is made up of nodes that are also members of either the NORMAL or LONG queues).

Note: The total number of idle jobs you may have over all queues is 16. In the DEBUG queue, you may have only one idle job.

  • NORMAL: The NORMAL queue is best suited for large parallel jobs.
    Nodes: 716
    Maximum nodes per job: 256 (1,024 cores)
    Maximum nodes per user: 512
    Maximum wall time: 48 hours
    Maximum CPU time: 48 hours per core per job
    Maximum running jobs in queue: Unlimited
    Maximum idle jobs in queue: 128
  • LONG: Use the LONG queue to run jobs requiring more than 48 hours of wall time.
    Nodes: 300
    Maximum nodes per job: 32 (128 cores)
    Maximum nodes per user: 64
    Maximum wall time: 336 hours (14 days)
    Maximum CPU time: 336 hours (14 days) per core per job
    Maximum running jobs in queue: Unlimited
    Maximum idle jobs in queue: 256
  • SERIAL: Use the SERIAL queue to run jobs requiring only a single node.
    Nodes: 1,016
    Maximum nodes per job: 1 (4 cores)
    Maximum nodes per user: 64
    Maximum wall time: 168 hours (1 week)
    Maximum CPU time: 192 hours (48 hours x 4 cores)
    Maximum running jobs in queue: Unlimited
    Maximum idle jobs in queue: 128
  • DEBUG: The DEBUG queue is a relatively small queue (4 nodes/16 cores) designed for short-running jobs and intended for use in diagnosing workflow issues.
    Nodes: 4
    Maximum nodes per job: 4 (16 cores)
    Maximum nodes per user: 4
    Maximum wall time: 15 minutes
    Maximum CPU time: 15 minutes per core per job
    Maximum running jobs in queue: 1
    Maximum idle jobs in queue: 1
  • IEDC: The IEDC queue is dedicated to Big Red's commercial users.
    Nodes: 512
    Maximum nodes per job: 256 (1,024 cores)
    Maximum nodes per user: 256
    Maximum wall time: 336 hours (14 days)
    Maximum CPU time: 336 hours per core per job
    Maximum running jobs in queue: Unlimited
    Maximum idle jobs in queue: 128
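
For example, a job that needs a single node for a day fits comfortably in the SERIAL queue; the following is a minimal sketch (the script name is a hypothetical placeholder):

  # @ class            = SERIAL
  # @ wall_clock_limit = 24:00:00       # 24 hours, well under the SERIAL limits
  # @ output           = serial_job.out
  # @ error            = serial_job.err
  # @ executable       = my_serial_job.sh
  # @ queue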

Useful commands

Some common LoadLeveler commands are described below:

  • llq: List the contents of the LoadLeveler job queues. When run without arguments, a multi-column listing of jobs will be printed:
      Id              Owner      Submitted   ST PRI Class        Running On 
      --------------- ---------- ----------- -- --- ------------ -----------
      s10c2b5.12345.0 johndoe    5/19 13:31  R  50  LONG         s14c1b12
      s10c2b4.12346.0 janedoe    6/2  06:30  R  50  NORMAL       s9c1b9
      ...
      s10c2b5.12468   pvenkman   6/8  15:11  I  50  SERIAL       

    For the most part, the column headings should be self-explanatory. The ST column indicates job status (R = running, I = idle, C = complete). For more, see the llq man page.

  • llsubmit <command file>: Submit a job command file to the queues:
      username@BigRed:~> llsubmit job.cmd
      llsubmit: Processed command file through Submit Filter:
      "/home/loadl/scripts/submit_filter.pl".
      llsubmit: The job "s10c2b5.dim.2343465" has been submitted.
  • llhold <jobid>: Place a hold on a job. This will prevent it from being dispatched by the scheduler.
  • llcancel <jobid>: Cancel a job (idle or running).
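
For example, to hold a queued job and later release or cancel it (the job ID here is hypothetical; llhold's -r option releases a previously placed hold):

  username@BigRed:~> llhold s10c2b5.2343465.0       # place a user hold
  username@BigRed:~> llhold -r s10c2b5.2343465.0    # release the hold
  username@BigRed:~> llcancel s10c2b5.2343465.0     # cancel the job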

Scheduler specifics

The Moab scheduler on Big Red communicates with LoadLeveler, and determines when and where jobs should run on the cluster. Most of Moab's operation is behind the scenes; it calculates dispatch order for queued jobs once every minute, directing LoadLeveler to start jobs when nodes are ready for them, and to stop jobs that have exceeded their wall-clock limit.

Prioritization algorithm

Job dispatch is determined by job priority, which is in turn determined by a customizable algorithm. On Big Red, the factors that most impact job priority are the fairshare usage and the ratio of queue time to wall-clock time (the expansion factor). Fairshare is best described in the Moab documentation:

Fairshare is a mechanism that allows historical resource utilization information to be incorporated into job feasibility and priority decisions. Moab's fairshare implementation allows organizations to set system utilization targets for users, groups, accounts, classes, and QoS levels. You can use both local and global (multi-cluster) fairshare information to make local scheduling decisions.

On Big Red, the user fairshare target is 5 percent of the cluster. As a user's utilization exceeds this target, that user's job priority is decreased; if utilization falls below the target, priority is increased. These adjustments are heavily weighted in the prioritization algorithm, so they become the primary determinant of a job's priority.

The expansion factor, while less influential than fairshare, may come into play if a job sits idle for significantly longer than it is expected to run. For example, a job of only an hour's duration begins to rise in priority after sitting idle in the queue for an hour.
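
As a rough sketch (assuming Moab's usual definition, expansion factor = (queue time + wall-clock limit) / wall-clock limit): a job with a 1-hour limit that has waited 1 hour has an expansion factor of (1 + 1) / 1 = 2, while a 48-hour job that has waited the same hour has only (1 + 48) / 48 ≈ 1.02, so the shorter job's priority grows faster as it waits.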

In addition to these factors, administrators may adjust baseline priorities to favor certain users in cases of research deadlines or special, high-priority projects. For more information, email High Performance Systems.

When will my job run?

Use the showstart command to display the scheduler's best estimate of when your job will be dispatched:

  $ showstart -v s10c2b5.2434434.0
  job s10c2b5.2434434.0 requires 1 proc for 2:00:00:00
  Estimated Rsv based start in        00:24:30 on Mon Jun  8 17:00:25
  Estimated Rsv based completion in 2:00:24:30 on Wed Jun 10 17:00:25
  Best Partition: base

When will my job really run?

The result of the showstart command may be inaccurate, particularly if running jobs (which are preventing your job from starting) exit prematurely, or if a high-priority job is submitted before your job's projected start time. The job queue on Big Red is very dynamic, and all start time estimates should be taken with a grain of salt.

Useful commands

Some common Moab commands are described below:

  • showq: Display the Moab scheduler's job queue. This includes jobs from all of the LoadLeveler classes. Jobs may be in a number of states ("Running" and "Idle" are the most common).
  • showstart <jobid>: Described above; displays the scheduler's best estimate of when your job will dispatch (i.e., begin running).
  • showbf: Displays intervals and node counts available for backfill jobs at the time the command is run. For example:
         $ showbf -c NORMAL
         Partition  Tasks  Nodes   StartOffset      Duration       StartDate
         ---------  -----  -----  ------------  ------------  --------------
         ALL          408    102      00:00:00      21:18:24  16:41:36_06/07

    The above indicates that 102 nodes in the NORMAL queue are available for 21 hours, 18 minutes, and 24 seconds, starting at 4:41pm (the time the command was run). A job that would fit in that geometry should dispatch almost as soon as submitted.
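
    For instance, the following command file requests a geometry that fits within that window (a sketch only; the application name and filenames are placeholders):

         # @ class            = NORMAL
         # @ node             = 64                # within the 102 free nodes
         # @ tasks_per_node   = 4
         # @ wall_clock_limit = 20:00:00          # under the 21:18:24 window
         # @ job_type         = MPICH
         # @ node_usage       = not_shared
         # @ output           = backfill_test.$(Cluster).out
         # @ error            = backfill_test.$(Cluster).err
         # @ queue
         mpirun -np $LOADL_TOTAL_TASKS \
                -machinefile $LOADL_HOSTFILE \
                my_parallel_app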

Additional information

The LoadLeveler manual "Using and Administering", particularly the job command file reference (Chapter 14), is an excellent resource for LoadLeveler job scripts.

Adaptive Computing's online Moab documentation, as well as the man pages (available locally), provide useful information on Moab's function and commands.

This is document azvs in the Knowledge Base.
Last modified on 2018-01-18 16:39:52.