Use Slurm to submit and manage jobs on high performance computing systems

Overview

The Indiana University research supercomputers use the Slurm Workload Manager to coordinate resource management and job scheduling.

Note:

UITS Research Technologies transitioned the job submission and scheduling environment on Carbonate from TORQUE/Moab to Slurm on April 12, 2021. TORQUE and Moab are no longer available on IU research supercomputers. If you need help converting a job script from TORQUE to Slurm, email the UITS Research Applications and Deep Learning team.

Slurm user commands include numerous options for specifying the resources and other attributes needed to run batch jobs or interactive sessions. Options can be invoked on the command line or with directives contained in a job script.

Common user commands in Slurm include:

Command    Description
sbatch     Submit a batch script to Slurm. The command exits immediately when the script is transferred to the Slurm controller daemon and assigned a Slurm job ID. For more, see the Batch jobs section below.
srun       Run a job on allocated resources. Commonly used in job scripts to launch programs, srun is also used to request resources for interactive jobs.
squeue     Monitor job status information. For more, see the Monitor or delete your job section below.
scancel    Terminate a queued or running job prior to its completion. For more, see the Monitor or delete your job section below.
sinfo      View partition information. For more, see the View partition and node information section below.

Batch jobs

About job scripts

To run a job in batch mode, first prepare a job script that specifies the application you want to launch and the resources required to run it. Then, use the sbatch command to submit your job script to Slurm. For example, if your script is named my_job.script, enter sbatch my_job.script. If the command runs successfully, it returns a job ID to standard output; for example:

[username@h1 ~]$ sbatch my_job.script
Submitted batch job 9472
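
If a later step needs the job ID (for example, to pass it to squeue or scancel), it can be read from the confirmation line. The following is a minimal sketch, assuming the output has the form shown above; the function name extract_job_id is hypothetical and not part of Slurm.

```shell
# Hypothetical helper: print the numeric job ID from an sbatch
# confirmation line such as "Submitted batch job 9472".
extract_job_id() {
    # The job ID is the last whitespace-separated field of the line.
    # Lines that do not match (for example, error messages) print nothing.
    printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $NF }'
}

# Example using the sample output shown above (no job is submitted here).
extract_job_id "Submitted batch job 9472"
```

In a real session, this could be used as jobid=$(extract_job_id "$(sbatch my_job.script)"). The sbatch --parsable option, which prints the job ID without the surrounding text, may be a simpler alternative.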

Slurm job scripts most commonly have at least one executable line preceded by a list of options that specify the resources and attributes needed to run your job (for example, wall-clock time, the number of nodes and processors, and filenames for job output and errors). When you write a job script, tailor it to the needs of your program. Most importantly, make sure your job script requests the resources, including memory and time, that your program requires.

Serial jobs

A job script for running a serial batch job may look similar to the following:

#!/bin/bash

#SBATCH -J job_name
#SBATCH -p general
#SBATCH -o filename_%j.txt
#SBATCH -e filename_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@iu.edu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=02:00:00
#SBATCH --mem=16G

#Load any modules that your program needs
module load modulename

#Run your program
srun ./my_program my_program_arguments

In the above example:

  • The first line indicates that the script should be read using the Bash command interpreter.
  • The #SBATCH lines are directives that pass options to the sbatch command:
    • -J job_name specifies a name for the job allocation. The specified name will appear along with the job ID number when you query running jobs on the system.
    • -p general specifies that the job should run in the general partition.
    • -o filename_%j.txt and -e filename_%j.err instruct Slurm to connect the job's standard output and standard error, respectively, to the file names specified, where %j is automatically replaced by the job ID.
    • --mail-type=<type> directs Slurm to send job-related email when an event of the specified type(s) occurs; valid type values include ALL, BEGIN, END, and FAIL.
    • --mail-user=username@iu.edu indicates the email address to which Slurm will send job-related mail.
    • --nodes=1 requests that a minimum of one node be allocated to this job.
    • --ntasks-per-node=1 specifies that one task should be launched per node.
    • --time=02:00:00 requests two hours for the job to run.
    • --mem=16G requests 16 GB of memory.
  • At the bottom are the two executable lines that the job will run. In this case, the module command is used to load a module (modulename), and then srun is used to execute the application with the arguments specified. In your script, replace my_program and my_program_arguments with your program's name and any necessary arguments, respectively.

For information about running GPU-enabled jobs, see About Carbonate's deep learning (DL) and GPU partitions.

OpenMP jobs

If your program can take advantage of multiple processors (for example, if it uses OpenMP), you can add a #SBATCH directive to pass the --cpus-per-task option to sbatch. For example, you could add this line to request that 12 CPUs per task be allocated to your job:

#SBATCH --cpus-per-task=12

If you include this line, make sure it does not request more than the maximum number of CPUs available per node (each system has a different maximum). This type of parallel program can only take advantage of multiple CPUs that are on a single node. Typically, before calling such a program, you should set the OMP_NUM_THREADS environment variable to indicate the number of OpenMP threads that can be used. Unless you want more than one thread running on each CPU, this value is typically equal to the number of CPUs requested. For example:

#Run your program
export OMP_NUM_THREADS=12
srun ./my_program my_program_arguments
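
To keep the thread count in step with the --cpus-per-task request, a script can derive OMP_NUM_THREADS from the SLURM_CPUS_PER_TASK environment variable, which Slurm sets inside the job when --cpus-per-task is specified. The following is a minimal sketch; the function name omp_threads is hypothetical, and the fallback to one thread is an assumption of this example.

```shell
# Hypothetical helper: choose an OpenMP thread count for the current job.
# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is requested;
# falling back to 1 thread when it is unset is an assumption of this sketch.
omp_threads() {
    printf '%s\n' "${SLURM_CPUS_PER_TASK:-1}"
}

OMP_NUM_THREADS=$(omp_threads)
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

With this approach, changing the --cpus-per-task value in the #SBATCH directive is enough; the export line does not need to be edited to match.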

MPI jobs

If your program uses MPI (that is, the code makes MPI library calls), it can take advantage of multiple processors on more than one node. Request more than one node only if your program is specifically structured to communicate across nodes. MPI programs launch multiple copies of the same program, which then communicate through MPI. One Slurm task is used to run each MPI process. For example, if your MPI program can make effective use of 48 processes, and the maximum number of processors available on each node is 24, you could alter the above serial job script example to set --nodes=2 (to request two nodes) and --ntasks-per-node=24 (to request 24 tasks per node):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24

You also may want to indicate the total number of tasks in your srun command:

srun -n 48 ./my_mpi_program my_program_arguments

The number of processes that can run successfully on one node is limited by the amount of memory available on that node. If each process of a program needs 20 GB of memory, and the node has 240 GB of memory available, you could run a maximum of 12 tasks on each node. In such a case, to run 48 tasks, your script would set --nodes=4 (to request four nodes) and --ntasks-per-node=12 (to request 12 tasks per node). Your script should also set --mem to request the maximum amount of memory per node, because not all of the processors on the node would be requested. To determine the correct values for your job script, make sure you know the amount of memory available per node and the number of processors available per node for the system you are using.
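
The arithmetic above can be sketched in a few lines of shell. The function names tasks_per_node and nodes_needed are hypothetical, and the sketch assumes whole-number memory sizes in GB.

```shell
# Hypothetical helper: largest number of tasks that fit on one node,
# limited by both memory and processor count (integer arithmetic, GB).
tasks_per_node() {
    mem_per_node=$1; mem_per_task=$2; cpus_per_node=$3
    by_mem=$(( mem_per_node / mem_per_task ))
    if [ "$by_mem" -lt "$cpus_per_node" ]; then
        echo "$by_mem"
    else
        echo "$cpus_per_node"
    fi
}

# Hypothetical helper: number of nodes needed for a total task count,
# rounded up so that every task has a place to run.
nodes_needed() {
    total_tasks=$1; per_node=$2
    echo $(( (total_tasks + per_node - 1) / per_node ))
}

# Values from the example above: 240 GB per node, 20 GB per task,
# 24 processors per node, 48 tasks in total.
per_node=$(tasks_per_node 240 20 24)
echo "--ntasks-per-node=$per_node --nodes=$(nodes_needed 48 "$per_node")"
```

For the values in the example, this prints --ntasks-per-node=12 --nodes=4, matching the settings described above.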

Hybrid OpenMP-MPI jobs

In a hybrid OpenMP-MPI job, each MPI process uses multiple threads. In addition to #SBATCH directives for MPI, your script should include a #SBATCH directive that requests multiple CPUs per task and an executable line placed before your srun command that sets the OMP_NUM_THREADS environment variable. Typically, one CPU should be allocated to each thread of each process. If each node has 24 processors, and you want to give each process four threads, then a maximum of six tasks can run on each node (if each node has enough memory available to run six copies of the program).

For example, if you want to run 12 processes, each with four threads, include the following #SBATCH directives in your script:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=6
#SBATCH --cpus-per-task=4

Also, include a line that sets OMP_NUM_THREADS=4 before your srun command:

export OMP_NUM_THREADS=4
srun -n 12 ./my_mpi_program my_program_arguments
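
The layout for a hybrid job can be worked out with a small helper. This is a sketch only: the function name hybrid_directives is hypothetical, and it assumes one CPU per thread, a process count that divides evenly across nodes, and enough memory on each node for the resulting number of tasks.

```shell
# Hypothetical helper: print #SBATCH directives for a hybrid job.
# Arguments: total MPI processes, threads per process, CPUs per node.
# Assumes one CPU per thread and that memory is not the limiting factor.
hybrid_directives() {
    procs=$1; threads=$2; cpus_per_node=$3
    per_node=$(( cpus_per_node / threads ))          # tasks that fit on one node
    if [ "$per_node" -lt 1 ]; then
        echo "error: more threads per process than CPUs on a node" >&2
        return 1
    fi
    nodes=$(( (procs + per_node - 1) / per_node ))   # round up
    printf '#SBATCH --nodes=%s\n' "$nodes"
    printf '#SBATCH --ntasks-per-node=%s\n' "$per_node"
    printf '#SBATCH --cpus-per-task=%s\n' "$threads"
}

# Values from the example above: 12 processes, 4 threads each, 24 CPUs per node.
hybrid_directives 12 4 24
```

For these values, the helper prints the same three directives shown in the example above.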

Other sbatch options

Depending on the resources needed to run your executable lines, you may need to include other sbatch options in your job script. Here are a few other useful ones:

Option                         Action
--begin=YYYY-MM-DDTHH:MM:SS    Defer allocation of your job until the specified date and time, after which the job is eligible to execute. For example, to defer allocation of your job until 10:30pm October 31, 2022, use: --begin=2022-10-31T22:30:00
--no-requeue                   Specify that the job is not rerunnable. Setting this option prevents the job from being requeued after it has been interrupted, for example, by a scheduled downtime or preemption by a higher priority job.
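
A timestamp in the format expected by --begin can be generated rather than typed by hand. The following sketch assumes GNU date (its -d option is not available in the same form in every date implementation); the function name begin_stamp is hypothetical.

```shell
# Hypothetical helper: format a date description as a --begin value.
# Relies on the -d option of GNU date (an assumption of this sketch;
# BSD and macOS date use different options).
begin_stamp() {
    date -d "$1" +%Y-%m-%dT%H:%M:%S
}

# Reproduces the value used in the example above.
begin_stamp "2022-10-31 22:30:00"
```

A job could then be submitted with, for example, sbatch --begin="$(begin_stamp '2022-10-31 22:30:00')" my_job.script.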

For complete documentation about the sbatch command and its options, see the sbatch manual page (on the web, see sbatch; on the IU research supercomputers, enter man sbatch).

Interactive jobs

To request resources for an interactive job, use the srun command with the --pty option.

For example:

  • To launch a Bash session that uses one node in the general partition, on the command line, enter:
    srun -p general --pty bash
    
  • To perform debugging, submit an interactive job to the debug or general partition; for example:
    • To request an hour of wall time in the debug partition, on the command line, enter:
      srun -p debug --time=01:00:00 --pty bash
      
    • To request an hour of wall time in the general partition, on the command line, enter:
      srun -p general --time=01:00:00 --pty bash
      
  • To run an interactive job with X11 forwarding enabled, add the --x11 flag; for example:
    srun -p general --x11 --time=01:00:00 --pty bash
    

    When the requested resources are allocated to your job, you will be placed at the command prompt on a compute node. Once you are placed on a compute node, you can launch graphical X applications and your own binaries from the command line. You may need to load the module for a desired X client before launching the application.

When you are finished with your interactive session, on the command line, enter exit to release the allocated resources.

Note:

If you use srun to launch an interactive session as described above, you will not be able to run additional srun commands on the allocated resources. If you need this functionality, you can instead use the salloc command to get a Slurm job allocation, execute a command (such as srun or a shell script containing srun commands), and then, when the command finishes, enter exit to release the allocated resources.

If you run salloc without specifying a command, your default shell is executed. From that shell, you can issue any number of commands (including srun commands), and those commands will run on the allocation. When the commands are finished, enter exit to quit the shell and release the allocated resources.

For example:

$ salloc --nodes=1 --ntasks-per-node=24 --time=2:00:00 --mem=128G
salloc: Granted job allocation 109347
salloc: Waiting for resource configuration
salloc: Nodes c18 are ready for job
$ srun -n 24 python my_great_python_mpi_program.py
$ srun <any other commands you want to run>
$ exit
exit
salloc: Relinquishing job allocation 109347

For complete documentation about the srun command, see the srun manual page (on the web, see srun; on the IU research supercomputers, enter man srun).

For complete documentation about the salloc command, see the salloc manual page (on the web, see salloc; on the IU research supercomputers, enter man salloc).

Monitor or delete your job

To monitor the status of jobs in a Slurm partition, use the squeue command. Some useful squeue options include:

Option                 Description
-a                     Display information for all jobs.
-j <jobid>             Display information for the specified job ID.
-j <jobid> -o %all     Display all information fields (with a vertical bar separating each field) for the specified job ID.
-l                     Display information in long format.
-n <job_name>          Display information for the specified job name.
-p <partition_name>    Display jobs in the specified partition.
-t <state_list>        Display jobs that have the specified state(s). Valid job states include PENDING, RUNNING, SUSPENDED, COMPLETED, CANCELLED, FAILED, TIMEOUT, NODE_FAIL, PREEMPTED, BOOT_FAIL, DEADLINE, OUT_OF_MEMORY, COMPLETING, CONFIGURING, RESIZING, REVOKED, and SPECIAL_EXIT.
-u <username>          Display jobs owned by the specified user.

For example:

  • To see all jobs running in the general partition, enter:
    squeue -p general -t RUNNING
    
  • To see pending jobs in the dl partition (on Carbonate) that belong to username, enter:
    squeue -u username -p dl -t PENDING
    

For complete documentation about the squeue command, see the squeue manual page (on the web, see squeue; on the IU research supercomputers, enter man squeue).
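
In scripts, it can be handy to reduce squeue output to a single word. The following sketch wraps squeue in a hypothetical function named job_state; it uses the -h option to suppress the header line and the %T format field to print only the job state. How squeue responds for a job that has already left the queue may vary, so treat an empty result or an error as "no longer queued" only with caution.

```shell
# Hypothetical helper: print the state (for example, PENDING or RUNNING)
# of one job. -h suppresses the header line, -j selects the job, and
# -o "%T" limits the output to the job state field.
job_state() {
    squeue -h -j "$1" -o "%T"
}
```

For example, job_state 9472 might print RUNNING while the job from the earlier sbatch example is executing.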

To delete your pending or running job, use the scancel command with your job's job ID; for example, to delete your job that has a job ID of 8990, on the command line, enter:

scancel 8990

Alternatively:

  • To cancel a job named my_job, enter:
    scancel -n my_job
    
  • To cancel all jobs owned by username, enter:
    scancel -u username
    

For complete documentation about the scancel command, see the scancel manual page (on the web, see scancel; on the IU research supercomputers, enter man scancel).

View partition and node information

To view information about the nodes and partitions that Slurm manages, use the sinfo command.

By default, sinfo (without any options) displays:

  • All partition names
  • Availability of each partition
  • Maximum wall time allowed for jobs in each partition
  • Number of nodes in each partition
  • State of the nodes in each partition
  • Names of the nodes in each partition

To display node-specific information, use sinfo -N, which lists:

  • All node names
  • Partition to which each node belongs
  • State of each node

To display additional node-specific information, use sinfo -lN, which adds the following fields to the previous output:

  • Number of cores per node
  • Number of sockets per node, cores per socket, and threads per core
  • Size of memory per node in megabytes

Alternatively, to specify which information fields are displayed and control the formatting of the output, use sinfo with the -o option; for example (replace # with a number to set the display width of the field, and field1 and field2 with the desired field specifications):

sinfo -o "%<#><field1> %<#><field2>"

Available field specifications include:

Specification  Field displayed
%<#>P          Partition name (set field width to # characters)
%<#>N          List of node names (set field width to # characters)
%<#>c          Number of cores per node (set field width to # characters)
%<#>m          Size of memory per node in megabytes (set field width to # characters)
%<#>l          Maximum wall time allowed (set field width to # characters)
%<#>s          Maximum number of nodes allowed per job (set field width to # characters)
%<#>G          Generic resource associated with a node (set field width to # characters)

For example, on Carbonate, the following sinfo command outputs a node-specific list that includes partition names, node names, the number of cores per node, the amount of memory per node, the maximum wall time allowed per job, and the number and type of generic resources (GPUs) available on each node:

sinfo -No "%10P %8N  %4c  %7m  %12l %10G"

The resulting output looks similar to this:

PARTITION  NODELIST  CPUS  MEMORY   TIMELIMIT    GRES
dl         dl1       24    192888   2-00:00:00   gpu:v100:2
dl         dl2       24    192888   2-00:00:00   gpu:v100:2
dl         dl3       24    192888   2-00:00:00   gpu:p100:2
dl         dl4       24    192888   2-00:00:00   gpu:p100:2
dl         dl5       24    192888   2-00:00:00   gpu:p100:2
dl         dl6       24    192888   2-00:00:00   gpu:p100:2
dl         dl7       24    192888   2-00:00:00   gpu:p100:2
dl         dl8       24    192888   2-00:00:00   gpu:p100:2
dl         dl9       24    192888   2-00:00:00   gpu:p100:2
dl-debug   dl10      24    192888   8:00:00      gpu:p100:2
dl         dl11      24    192888   2-00:00:00   gpu:v100:2
dl         dl12      24    192888   2-00:00:00   gpu:v100:2
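
Because this output is column-oriented, standard text tools can filter it. The following sketch uses awk to list the nodes that offer a given GPU type; the function name nodes_with_gpu is hypothetical, and it assumes the six-column layout shown above with the GRES value in the form gpu:<type>:<count>.

```shell
# Hypothetical helper: read sinfo output in the column layout shown above
# (PARTITION NODELIST CPUS MEMORY TIMELIMIT GRES) on standard input and
# print the names of nodes whose GRES column starts with gpu:<type>:.
nodes_with_gpu() {
    awk -v gpu="$1" 'NR > 1 && index($6, "gpu:" gpu ":") == 1 { print $2 }'
}

# Sample lines copied from the output above; no Slurm command is run here.
nodes_with_gpu v100 <<'EOF'
PARTITION  NODELIST  CPUS  MEMORY   TIMELIMIT    GRES
dl         dl1       24    192888   2-00:00:00   gpu:v100:2
dl         dl3       24    192888   2-00:00:00   gpu:p100:2
dl-debug   dl10      24    192888   8:00:00      gpu:p100:2
dl         dl11      24    192888   2-00:00:00   gpu:v100:2
EOF
```

On a system with Slurm, the same function could read live data with sinfo -No "%10P %8N  %4c  %7m  %12l %10G" | nodes_with_gpu v100.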

For complete documentation about the sinfo command, see the sinfo manual page (on the web, see sinfo; on the IU research supercomputers, enter man sinfo).

Note:
To best meet the needs of all research projects affiliated with Indiana University, UITS Research Technologies administers the batch job queues on IU's research supercomputers using resource management and job scheduling policies that optimize the overall efficiency and performance of workloads on those systems. If the structure or configuration of the batch queues on any of IU's research supercomputers does not meet the needs of your research project, contact UITS Research Technologies.

Get help

SchedMD, the company that distributes and maintains the canonical version of Slurm, provides online user documentation, including a summary of Slurm commands and options, manual pages for all Slurm commands, and a Rosetta Stone of Workload Managers for help determining the Slurm equivalents of commands and options used in other resource management and scheduling systems (for example, TORQUE/PBS).

Support for IU research supercomputers, software, and services is provided by various teams within the Research Technologies division of UITS.

For general questions about research computing at IU, contact UITS Research Technologies.

For more options, see Research computing support at IU.

This is document awrz in the Knowledge Base.
Last modified on 2021-09-28 14:36:01.