Run GPU-accelerated jobs on Quartz or Big Red 200 at IU

On this page:

  • Partitions with GPU-accelerated nodes
  • System access
  • Deep learning tools and GPU-capable applications
  • Run GPU-accelerated applications
  • Partition (queue) information
  • Get help

Partitions with GPU-accelerated nodes

To facilitate the support of deep learning and GPU-accelerated applications, Quartz and Big Red 200 provide partitions for running jobs on GPU-accelerated nodes.

  • Quartz GPU partition: Quartz's GPU partition consists of 24 GPU-accelerated Apollo 6500 nodes. Each node is equipped with two Intel Xeon Gold 6248 2.5 GHz 20-core CPUs, four NVIDIA Tesla V100 PCIe 32 GB GPUs, one 1.92 TB solid-state drive, and 768 GB of RAM.

    All nodes in the GPU partition are housed in the IU Bloomington Data Center and run Red Hat Enterprise Linux 8.x. The GPU nodes are connected to the IU Science DMZ via 40-gigabit Ethernet.

  • Big Red 200: Big Red 200's GPU partition consists of 64 GPU-accelerated nodes, each with 256 GB of memory, a single 64-core, 2.0 GHz, 225-watt AMD EPYC 7713 processor, and four NVIDIA A100 GPUs.

    Big Red 200 is managed with HPE's Performance Cluster Manager (HPCM) and runs SUSE Enterprise Linux Server (SLES) version 15 on the compute, GPU, and login nodes.

                Quartz GPU partition    Big Red 200 GPU partition
Architecture    Apollo 6500             HPE Cray EX
Nodes           24                      64
GPUs/node       4 V100 GPUs             4 A100 GPUs
Cores/node      40                      64
Memory/node     768 GB                  256 GB
Memory/GPU      32 GB                   40 GB

The Indiana University research supercomputers use the Slurm workload manager for resource management and job scheduling; see Use Slurm to submit and manage jobs on IU's research computing systems.

The Slate, Slate-Project, and Slate-Scratch high-performance file systems are mounted for persistent storage of research data.

System access

To set up access to run GPU jobs on Quartz or Big Red 200, IU faculty, staff, and graduate students can use RT Projects to create projects, request allocations, and add users (research collaborators, lab members, and/or students) who should be permitted to use their allocations.

For more about RT Projects, see Use RT Projects to request and manage access to specialized Research Technologies resources.

Notes:
  • For enhanced security, SSH connections that have been idle for 60 minutes will be disconnected. To protect your data from misuse, remember to log off or lock your computer whenever you leave it.
  • User processes on the login nodes are limited to 20 minutes of CPU time. Processes that exceed this limit are terminated automatically (without warning).
  • The scheduled monthly maintenance window for IU's research supercomputers is the second Sunday of each month, 7am-7pm.

Deep learning tools and GPU-capable applications

The IU research supercomputers use module-based environment management systems that provide a convenient method for dynamically customizing your software environment.

Quartz and Big Red 200 use the Lmod module management system.

For more, see Use modules to manage your software environment on IU research supercomputers.
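
A few Lmod commands cover most day-to-day use. The commands below are standard Lmod syntax, shown here as a general illustration; replace modulename with the module you are interested in:

module list                 # show the modules currently loaded in your environment
module spider modulename    # search for a module and see which versions exist and how to load them
module unload modulename    # remove a loaded module from your environment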

Popular deep learning tools (including TensorFlow, scikit-learn, NumPy, SciPy, NLTK, Torch, and MXNet) are bundled together in python/gpu modules available on Quartz and Big Red 200.

  • To list available versions, on the command line, enter:
    module avail python/gpu
  • To see which packages are included in a version of the python/gpu module, load that version and then list its installed packages; on the command line, enter (replace x.x.x with the module's version number):
    module load python/gpu/x.x.x
    pip freeze
  • To add a python/gpu module to your user environment, on the command line, enter (replace x.x.x with the module's version number):
    module load python/gpu/x.x.x
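
As a quick sanity check after loading a python/gpu module on a GPU node (for example, in an interactive job as described below), you can confirm that the bundled frameworks detect a GPU. This is a general sketch that assumes the module's Torch and TensorFlow packages are importable as torch and tensorflow:

module load python/gpu/x.x.x    # replace x.x.x with the module's version number
python -c "import torch; print(torch.cuda.is_available())"
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"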

Individual modules for GPU-capable applications are also available on Quartz and Big Red 200; their module names include gpu.

To see which applications are available on a particular system, on the command line, enter module avail.
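
Because the module command writes its listings to standard error, one way to narrow the output to GPU-capable applications is to redirect it through grep; a minimal sketch:

module avail 2>&1 | grep -i gpu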

Note:

Quartz and Big Red 200 users are free to install software in their home directories and may request the installation of software for use by all users on the system. Only faculty or staff can request software. If students require special software packages, their advisors must request them. For details, see Software requests in Policies regarding UITS research systems.
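
For Python packages, one common way to install into your home directory is to load a python/gpu module and use pip's --user option, which installs under ~/.local by default. This is a general sketch rather than an IU-specific procedure, and some-package is only a placeholder:

module load python/gpu/x.x.x       # replace x.x.x with the module's version number
pip install --user some-package    # replace some-package with the package you need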

Run GPU-accelerated applications

To run a GPU-accelerated application on the GPU partition on Quartz or Big Red 200:

  • Specify the GPU partition by including the -p gpu flag either as an SBATCH directive in your batch job script or as an option in your srun command.
  • Specify how many GPUs per node (up to 4) should be allocated to your job by including the --gpus-per-node flag either as an SBATCH directive in your batch job script or as an option in your srun command.
  • Specify your allocation's Slurm Account Name by including the -A (or --account) flag either as an SBATCH directive in your batch job script or as an option in your srun command.

    Users belonging to projects approved through RT Projects can find their allocation's Slurm Account Name on the "Home" page in RT Projects, under "Submitting Slurm Jobs with your Project's Account". Alternatively, on the "Home" page, under "Allocations", select an allocation and look in the table under "Allocation Attributes".

Submit an interactive job

To request resources for an interactive job, use the srun command with the --pty option. For example:

  • To launch a Bash session that uses one V100 GPU on a node in Quartz's GPU partition, on the command line, enter (replace slurm-account-name with your allocation's Slurm Account Name):
    srun -p gpu -A slurm-account-name --gpus-per-node v100:1 --pty bash
  • To launch a Bash session that uses one A100 GPU on a node in Big Red 200's GPU partition, on the command line, enter:
    srun -p gpu -A slurm-account-name --gpus-per-node 1 --pty bash

When the requested resources are allocated to your job, you will be placed at the command prompt on one of the nodes in the partition you specified. When you are finished with your interactive session, on the command line, enter exit to free the allocated resources.
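
Once the interactive session starts, you can confirm which GPU(s) were allocated to it. The nvidia-smi utility and the CUDA_VISIBLE_DEVICES environment variable (which Slurm typically sets for GPU jobs) are standard tools, shown here as a general check:

nvidia-smi                    # list the GPUs visible to this job
echo $CUDA_VISIBLE_DEVICES    # indices of the GPUs assigned to this job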

For complete documentation about the srun command, see the srun manual page (on the web, see srun; on Quartz or Big Red 200, enter man srun).

Submit a batch job

To run a batch job, prepare a Slurm job script (for example, job.sh) that includes SBATCH directives for specifying the required resources, and then use the sbatch command to submit it. If the command exits successfully, it will return a job ID; for example:

sgerrera@login2:~> sbatch job.sh
Submitted batch job 99999999
sgerrera@login2:~>

If your job has resource requirements that differ from the defaults (but do not exceed the allowed maximums), specify them with SBATCH directives in your job script. Also, if you want to know how much memory your job used, add the following SBATCH directives to your job script (replace username@iu.edu with your IU email address):

#SBATCH --mail-user=username@iu.edu
#SBATCH --mail-type=ALL

When the job completes, Slurm will email the specified address with a summary of the job's resource utilization.
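
If you prefer to query resource usage yourself after a job finishes, Slurm's sacct command can report it. The job ID and field list below are only an example:

sacct -j 99999999 --format=JobID,JobName,Elapsed,MaxRSS,State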

For example, a job script for running a batch job on Big Red 200's GPU partition may look similar to the following:

#!/bin/bash

#SBATCH -J job_name
#SBATCH -p gpu
#SBATCH -A slurm-account-name
#SBATCH -o filename_%j.txt
#SBATCH -e filename_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@iu.edu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --time=02:00:00

#Load any modules that your program needs
module load modulename

#Run your program
srun ./my_program my_program_arguments

In the example script above:

  • The first line indicates that the script should be read using the Bash command interpreter.
  • The next lines are #SBATCH directives used to pass options to the sbatch command:
    • -J job_name specifies a name for the job allocation. The specified name will appear along with the job ID number when you query running jobs on the systems.
    • -p gpu specifies that the job should run in the GPU partition.
    • -A slurm-account-name indicates the Slurm Account Name to which resources used by this job should be charged.
    • -o filename_%j.txt and -e filename_%j.err instruct Slurm to redirect the job's standard output and standard error, respectively, to the file names specified (Slurm automatically replaces %j with the job ID).
    • --mail-user=username@iu.edu indicates the email address to which Slurm will send job-related mail.
    • --mail-type=<type> directs Slurm to send job-related email when an event of the specified type(s) occurs; valid type values include ALL, BEGIN, END, and FAIL.
    • --gpus-per-node=1 requests that one GPU be allocated to this job.
    • --time=02:00:00 sets a wall-clock time limit of two hours for the job; Slurm will terminate the job if it is still running when the limit is reached.
  • At the bottom are the two executable lines that the job will run. In this case, the module command is used to load a module (modulename), and then srun is used to execute the application with the arguments specified. In your script, replace my_program and my_program_arguments with your program's name and any necessary arguments, respectively.
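
After submitting the script with sbatch, you can monitor or cancel the job with standard Slurm commands (replace 99999999 with the job ID that sbatch returned):

squeue -u $USER     # list your pending and running jobs
scancel 99999999    # cancel the job if needed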

For more, see Use Slurm to submit and manage jobs on IU's research computing systems.

Partition (queue) information

To view current information about a partition, use the sinfo command. For example, on Big Red 200, to view information about the GPU partition, on the command line, enter:

sinfo -p gpu

You should see output similar to the following:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu          up 2-00:00:00      2 drain* nid[0672,0696]
gpu          up 2-00:00:00     12  down* nid[0641,0653,0658,0679,0688-0689,0692-0693,0697-0699,0702]
gpu          up 2-00:00:00      8    mix nid[0642,0651,0673-0674,0676-0677,0700-0701]
gpu          up 2-00:00:00     40  alloc nid[0643-0650,0652,0654-0657,0659-0671,0675,0678,0680-0687,0690-0691,0694-0695]

In the above sample output:

  • The PARTITION column shows the partition name.
  • The AVAIL column shows the status of the partition.
  • The TIMELIMIT column shows the maximum wall time that users can request.
  • The NODES column shows the number of nodes in each state.
  • The STATE column shows the current state of those nodes (for example, drain, down, mix, or alloc).
  • The NODELIST column shows which nodes are in each state.
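
For per-node detail in the GPU partition, such as each node's generic resources (GRES) and state, you can also give sinfo an output format string; a minimal sketch:

sinfo -p gpu -N -o "%N %G %t %m"    # node name, GRES (GPUs), node state, memory (MB)
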
Note:
To best meet the needs of all research projects affiliated with Indiana University, UITS Research Technologies administers the batch job queues on IU's research supercomputers using resource management and job scheduling policies that optimize the overall efficiency and performance of workloads on those systems. If the structure or configuration of the batch queues on any of IU's research supercomputers does not meet the needs of your research project, contact UITS Research Technologies.

Get help

Support for IU research supercomputers, software, and services is provided by various teams within the Research Technologies division of UITS.

For general questions about research computing at IU, contact UITS Research Technologies.

For more options, see Research computing support at IU.

This is document avjk in the Knowledge Base.
Last modified on 2024-01-12 17:05:24.