ARCHIVED: Run batch jobs on Big Red II

This content has been archived, and is no longer maintained by Indiana University. Information here may no longer be accurate, and links may no longer be available or reliable.

Note:
Big Red II was retired from service on December 15, 2019; for more, see ARCHIVED: About Big Red II at Indiana University (Retired).

Overview

To run a batch job on Big Red II at Indiana University, first prepare a TORQUE script that specifies the application you want to run, the execution environment in which it should run, and the resources your job will require, and then submit your job script from the command line using the TORQUE qsub command. You can monitor your job's progress using the TORQUE qstat command, the Moab showq command, or the HPC everywhere beta.
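
For illustration, a typical workflow might look like the following shell session (my_job_script.pbs and username are placeholder names; complete example scripts appear later in this document):

qsub my_job_script.pbs    # submit the job script; qsub prints the job ID
qstat -u username         # check the status of your queued and running jobs
showq | less              # view the Moab job queue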

Because Big Red II runs the Cray Linux Environment (CLE), which comprises two distinct execution environments for running batch jobs (Extreme Scalability Mode and Cluster Compatibility Mode), TORQUE scripts used to run jobs on typical high-performance computing clusters (for example, Karst) will not work on Big Red II without modifications. For your application to run properly in either of Big Red II's execution environments, your TORQUE script must invoke one of two proprietary application launch commands:

Execution environment Application launch command
Extreme Scalability Mode (ESM)

aprun

The aprun command is part of Cray's Application Level Placement Scheduler (ALPS), the interface for compute node job placement and execution.

Use aprun to specify the required parameters for launching your application in the native (ESM) execution environment. Include the -n option, at least, to specify the number of processing elements required to run your job.

Cluster Compatibility Mode (CCM)

ccmrun

The ccmrun command targets your application for launch on a compute node provisioned for the CCM execution environment.

To run a CCM job, you first must load the ccm module. For the job to launch properly, you also must add the following TORQUE directive to your batch job script or qsub command:

-l gres=ccm

Applications launched without the aprun or ccmrun command will not be placed on Big Red II's compute nodes. Instead, they will execute on the aprun service nodes, which are intended only for passing job requests. Because the aprun nodes are shared by all currently running jobs, any memory- or computationally-intensive job launched there will be terminated to avoid disrupting service for every user on the system.

Make sure to run your applications on Big Red II's compute nodes, not on the service (login or aprun) nodes.
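
For illustration, application launch lines in a job script might look like the following (my_app is a hypothetical executable name; complete example scripts appear later in this document):

# ESM: launch 32 processing elements of my_app on a compute node
aprun -n 32 my_app

# CCM: launch my_app on a compute node provisioned for the CCM execution environment
ccmrun my_app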

For more about Big Red II's CLE execution environments, see ARCHIVED: Execution environments on Big Red II at IU: Extreme Scalability Mode (ESM) and Cluster Compatibility Mode (CCM).

Prepare a TORQUE script for a Big Red II batch job

A TORQUE job script is a text file you prepare that specifies the application to run and the resources required to run it.

"Shebang" line

Your TORQUE job script should begin with a "shebang" (#!) line that provides the path to the command interpreter the operating system should use to execute your script.

A basic job script for Big Red II could contain only the "shebang" line followed by an executable line; for example:

#!/bin/bash
aprun -n 1 date

The above example script would run one instance of the date application on a compute node in the Extreme Scalability Mode (ESM) execution environment.

TORQUE directives

Your job script also may include TORQUE directives that specify required resources, such as the number of nodes and processors needed to run your job, and the wall-clock time needed to complete your job. TORQUE directives are indicated in your script by lines that begin with #PBS; for example:

#PBS -l nodes=2:ppn=32:dc2
#PBS -l walltime=24:00:00

The above example TORQUE directives indicate a job requires two nodes, 32 processors per node, the Data Capacitor II parallel file system (/N/dc2), and 24 hours of wall-clock time.

Scripts for CCM jobs must include the following TORQUE directive:

#PBS -l gres=ccm
Note:
The TORQUE directives in your script must precede your executable lines; any directive that appears after the first executable line will be ignored, as the sketch below illustrates.
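
As a minimal sketch of this behavior (the specific directives and module are placeholders drawn from the examples in this document):

#!/bin/bash
# This directive is honored because it precedes all executable lines:
#PBS -l walltime=00:10:00
# First executable line; directive scanning stops here:
module load ccm
# This directive is IGNORED because it appears after an executable line:
#PBS -l gres=ccm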

Executable lines

Executable lines are used to invoke basic commands and launch applications; for example:

module load ccm
module load sas

cd /N/u/davader/BigRed2/sas_files
ccmrun sas SAS_input.sas

The above example executable lines load the ccm and sas modules, change the working directory to the location of the SAS_input.sas file, and launch the SAS application on a compute node in the CCM execution environment.

Note:
Your application's executable line must begin with one of the application launch commands (aprun for ESM jobs; ccmrun for CCM jobs), or else your application will launch on an aprun service node instead of a compute node. Launching an application on an aprun node can cause a service disruption for all users on the system. Consequently, any job running on an aprun node will be terminated.

Example serial job scripts

Run an ESM serial job

To run a compiled program on one or more compute nodes in the Extreme Scalability Mode (ESM) environment, the application execution line in your batch script must begin with the aprun application launching command.

The following example batch script will run a serial job that executes my_binary on all 32 cores of one compute node in the ESM environment:

#!/bin/bash
#PBS -l nodes=1:ppn=32
#PBS -l walltime=00:10:00
#PBS -N my_job
#PBS -q cpu
#PBS -V

aprun -n 32 my_binary

Run a CCM serial job

To run a compiled program on one or more compute nodes in the CCM environment:

  • You must add the ccm module to your user environment with this module load command:
    module load ccm
    

    You can add this line to your TORQUE batch job script (after your TORQUE directives and before your application execution line). Alternatively, to permanently add the ccm module to your user environment, add the line to your ~/.modules file; see ARCHIVED: Use a .modules file in your home directory to save your user environment on an IU research supercomputer.

  • Your script must include a TORQUE directive that invokes the -l gres=ccm flag. Alternatively, when you submit your job, you can add the -l gres=ccm flag as a command-line option to qsub.
  • The application execution line in your batch script must begin with the ccmrun application launching command.

The following example batch script will run a serial job that executes my_binary on all 32 cores of one compute node in the CCM environment:

#!/bin/bash
#PBS -l nodes=1:ppn=32
#PBS -l walltime=00:10:00
#PBS -N my_job
#PBS -q cpu
#PBS -l gres=ccm
#PBS -V

ccmrun my_binary
Note:
Big Red II has 32 cores per CPU node and 16 cores per GPU node. UITS recommends setting ppn=32 or ppn=16 to ensure full access to all the cores on a node. Single-processor applications will not use more than one core. To pack multiple single-processor jobs onto a single node, see Use PCP to bundle multiple serial jobs to run in parallel on IU research supercomputers.

Example MPI job scripts

Run an ESM MPI job

You can use the aprun command to run MPI jobs in the ESM environment. The aprun command functions similarly to the mpirun and mpiexec commands commonly used on high-performance computing clusters, such as IU's Karst system.

The following example batch script will run a job that executes my_binary on two nodes and 64 cores in the ESM environment on Big Red II:

#!/bin/bash
#PBS -l nodes=2:ppn=32
#PBS -l walltime=00:10:00
#PBS -N my_job
#PBS -q cpu
#PBS -V

aprun -n 64 my_binary

Run a CCM MPI job

To run MPI jobs in the CCM environment:

  • Your code must be compiled with the CCM Open MPI library; to add the library to your environment, load the openmpi module.

    To load the openmpi module, you must have the GNU programming environment (PrgEnv-gnu) module loaded. To verify that the PrgEnv-gnu module is loaded, on the command line, run module list, and then review the list of currently loaded modules. If another programming environment module (for example, PrgEnv-cray) is loaded, use the module swap command to replace it with the PrgEnv-gnu module; for example, on the command line, enter:

    module swap PrgEnv-cray PrgEnv-gnu
    
  • You must add the ccm module to your user environment. To permanently add the ccm module to your user environment, add the module load ccm line to your ~/.modules file; see ARCHIVED: Use a .modules file in your home directory to save your user environment on an IU research supercomputer. Alternatively, you can add the module load ccm command as a line in your TORQUE batch job script (after your TORQUE directives and before your application execution line).
  • Your job script must include a TORQUE directive that invokes the -l gres=ccm flag. Alternatively, when you submit your job, you can add the -l gres=ccm flag as a command-line option to qsub.
  • The application execution line in your batch script must begin with the ccmrun application launch command.

Assuming the ccm module is already loaded, the following example batch script will run a job that loads the openmpi/ccm/gnu/1.7.2 module and executes my_binary on two nodes and 64 cores in the CCM environment on Big Red II:

#!/bin/bash
#PBS -l nodes=2:ppn=32
#PBS -l walltime=00:10:00
#PBS -l gres=ccm
#PBS -N my_job
#PBS -q cpu
#PBS -V 
  
module load openmpi/ccm/gnu/1.7.2 
  
ccmrun mpirun -np 64 my_binary

Submit, monitor, and delete jobs

Submit jobs

To submit a job script (for example, my_job_script.pbs), use the TORQUE qsub command:

qsub [options] my_job_script.pbs

For a full description of the qsub command and available options, see its manual page.
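
For example, to submit a CCM job script and supply the gres requirement on the command line rather than in the script (my_job_script.pbs is a placeholder file name):

qsub -l gres=ccm my_job_script.pbs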

Monitor jobs

To monitor the status of a queued or running job, you can use the TORQUE qstat command. Useful options include:

Option Function
-u user_list Displays jobs for users listed in user_list
-a Displays all jobs
-r Displays running jobs
-f Displays a full listing of jobs (very detailed output)
-n Displays nodes allocated to jobs

For example, to see all the jobs running in the long queue, use:

qstat -r long | less
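
Similarly, to display your own jobs along with the nodes allocated to them (replace username with your IU username):

qstat -u username -n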

The Moab job scheduler also provides several useful commands for monitoring jobs and batch system information:

Moab command Function
showq Display the jobs in the Moab job queue.
(Jobs may be in a number of states; "running" and "idle" are the most common.)
checkjob jobid Check the status of a job (jobid). For verbose mode, add -v (for example, checkjob -v jobid).
showstart jobid Show an estimate of when your job (jobid) might start.
mdiag -f Show fairshare information.
checknode node_name Check the status of a node (node_name).
showres Show current reservations.
showbf Show intervals and node counts presently available for backfill jobs.

For example, to list idle (queued) jobs in priority order, on the command line, enter:

showq -i | less

For a full description of the showq command and available options, see its manual page.
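
Similarly, to see an estimate of when a queued job might start (replace jobid with your job ID):

showstart jobid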

Alternatively, you can monitor your job via the HPC everywhere beta.

Directly monitor processes on CCM nodes

If you have a job running in the CCM execution environment, you can SSH directly to the compute node(s) on which it is running, and from there use the ps command and/or top program to monitor the number of processes on the node, the status of each process, and the percentage of memory and CPU usage per process. If necessary, you can use the kill command to kill or suspend processes.

Note:
You can do this only if you own the job, and only on CCM nodes, not on nodes provisioned for the ESM execution environment. Also, do not launch any computations directly from the CCM compute nodes.

To SSH directly to a compute node running your CCM job:

  1. Determine which node is running your CCM job:
    1. On the command line, enter (replacing jobid with your job ID):
      qstat -f jobid
      

      If you don't remember your job ID, look it up with the qstat command; on the command line, enter (replace username with your IU username):

      qstat -u username
      
    2. Derive the compute node's name from the exec_host value listed in the output. Node names on Big Red II have five numerical characters, so prepend zeroes (as needed) to any values with fewer than five characters; for example:
      • If your job is running on node nid00786, you will see:
        exec_host = 786/0
        
      • If your job is running on node nid00023, you will see:
        exec_host = 23
        

      If your job is running across multiple hosts, you'll see multiple values for exec_host. As long as you are the owner of the CCM job, you'll be able to access any of the nodes on which it is running.

  2. SSH to one of the aprun service nodes; on the command line, enter:
    ssh aprun1
    

    When prompted for a password, enter your IU passphrase.

  3. From the aprun node, connect via SSH to port 203 on the desired compute node; for example, to connect to node nid00786, on the command line, enter:
    ssh -p 203 nid00786
    

For information about using ps, top, and kill to monitor and manage processes, see their respective manual pages.
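
For example, once connected to the compute node, you might use commands like the following (replace username with your IU username and pid with a process ID; exact options can vary):

ps -u username -o pid,pcpu,pmem,stat,comm    # list your processes with CPU and memory usage
top -u username                              # interactively monitor your processes
kill -STOP pid                               # suspend a process
kill pid                                     # terminate a process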

Delete jobs

To delete queued or running jobs, use the TORQUE qdel command:

Command Function
qdel jobid Delete a specific job (jobid).
qdel all Delete all of your jobs.

Occasionally, a node becomes unresponsive and won't respond to the TORQUE server's requests to delete a job. If that occurs, add the -W (uppercase W) option:

qdel -W jobid

If that doesn't work, email the High Performance Systems group for help.

Get help

Although UITS Research Technologies cannot provide dedicated access to an entire research supercomputer during the course of normal operations, "single user time" is made available by request one day a month during each system's regularly scheduled maintenance window to accommodate IU researchers with tasks requiring dedicated access to an entire compute system. To request "single user time" or ask for more information, contact UITS Research Technologies.

Note:
To best meet the needs of all research projects affiliated with Indiana University, UITS Research Technologies administers the batch job queues on IU's research supercomputers using resource management and job scheduling policies that optimize the overall efficiency and performance of workloads on those systems. If the structure or configuration of the batch queues on any of IU's research supercomputers does not meet the needs of your research project, contact UITS Research Technologies.

Research computing support at IU is provided by the Research Technologies division of UITS. To ask a question or get help regarding Research Technologies services, including IU's research supercomputers and research storage systems, and the scientific, statistical, and mathematical applications available on those systems, contact UITS Research Technologies. For service-specific support contact information, see Research computing support at IU.

This is document bdkt in the Knowledge Base.
Last modified on 2023-04-21 16:57:19.