Run GPU-accelerated jobs on Carbonate or Big Red 200 at IU

Partitions with GPU-accelerated nodes

To facilitate the support of deep learning and GPU-accelerated applications, Carbonate and Big Red 200 provide partitions for running jobs on GPU-accelerated nodes.

  • Carbonate GPU partition: Carbonate's GPU partition consists of 24 GPU-accelerated Apollo 6500 nodes. Each node is equipped with two Intel 6248 2.5 GHz 20-core CPUs, four NVIDIA Tesla V100 PCIe 32 GB GPUs, one 1.92 TB solid-state drive, and 768 GB of RAM.

    All nodes in the GPU partition are housed in the IU Bloomington Data Center and run Red Hat Enterprise Linux 7.x. The GPU nodes are connected to the IU Science DMZ via 40-gigabit Ethernet.

  • Big Red 200: Big Red 200's GPU partition consists of 64 GPU-accelerated nodes, each with 256 GB of memory, a single 64-core, 2.0 GHz, 225-watt AMD EPYC 7713 processor, and four NVIDIA A100 GPUs.

    Big Red 200 is managed with HPE's Performance Cluster Manager (HPCM) and runs SUSE Enterprise Linux Server (SLES) version 15 on the compute, GPU, and login nodes.

              Carbonate GPU partition   Big Red 200 GPU partition
Architecture  Apollo 6500               HPE Cray EX
Nodes         24                        64
GPUs/node     4 V100 GPUs               4 A100 GPUs
Cores/node    40                        64
Memory/node   768 GB                    256 GB

IU research supercomputers use the Slurm Workload Manager to coordinate resource management and job scheduling. The Slate, Slate-Project, and Slate-Scratch high performance file systems are mounted for persistent storage of research data; Data Capacitor Wide Area Network 2 (DC-WAN2) provides temporary storage for projects that require remote mounts.

Notes:
  • For enhanced security, SSH connections that have been idle for 60 minutes will be disconnected. To protect your data from misuse, remember to log off or lock your computer whenever you leave it.
  • User processes on the login nodes are limited to 20 minutes of CPU time. Processes on the login nodes that run longer than 20 minutes are terminated automatically (without warning).
  • The scheduled monthly maintenance window for IU's high performance computing systems is the second Sunday of each month, 7am-7pm.

System access

  • Carbonate: To set up access to Carbonate's GPU partition, IU faculty, staff, and graduate students can use RT Projects to create projects, request allocations, and add users (research collaborators, lab members, and/or students) who should be permitted to use their allocations. To use the GPU partition, project members will need their own accounts on Carbonate, which they can request using the instructions in Get additional IU computing accounts. For instructions on setting up a project that provides access to Carbonate's GPU partition, see Use RT Projects to request and manage access to specialized Research Technologies resources.
    Note:

    As of September 11, 2022, requests for access to the Carbonate GPU partition must be submitted via RT Projects; they are no longer accepted through the IU HPC Projects platform.

    Users belonging to current projects approved through IU HPC Projects can retain their access to Carbonate's GPU partition through November 14, 2022, by adding the Slurm Account Name "legacy-projects" to their job submissions (for example, by including #SBATCH -A legacy-projects in a Slurm batch job script). To maintain access beyond that date, the project's submitter or indicated Principal Investigator must create a new project via RT Projects. Project descriptions from IU HPC Projects may be reused or updated; the My Projects tab at IU HPC Projects will remain available for referencing previous submissions.

    If you have questions about requesting or updating projects through RT Projects, email the RT Projects development team.

  • Big Red 200: The Big Red 200 GPU partition is accessible to any user with an account on Big Red 200. If you don't already have an account on Big Red 200, request one using the instructions in Get additional IU computing accounts.

Deep learning tools and GPU-capable applications

The IU research supercomputers use module-based environment management systems that provide a convenient method for dynamically customizing your software environment. Carbonate uses the Modules module management system; Big Red 200 uses Lmod. For more, see Use modules to manage your software environment on IU research supercomputers.

Popular deep learning tools (including TensorFlow, scikit-learn, NumPy, SciPy, NLTK, Torch, and MXNet) are bundled together in deeplearning modules available on Carbonate and Big Red 200.

  • To list available versions, on the command line, enter:
    module avail deeplearning
    
  • To see which packages are included in a version of the deeplearning module, on the command line, enter (replace x.x.x with the module's version number):
    module show deeplearning/x.x.x
    
  • To add a deeplearning module to your user environment, on the command line, enter (replace x.x.x with the module's version number):
    module load deeplearning/x.x.x

Individual modules for GPU-capable applications are also available on Carbonate and Big Red 200; these modules include gpu in the module name.

For a list of all packages available on Carbonate or Big Red 200, see HPC Applications.

Note:
Carbonate and Big Red 200 users are free to install software in their home directories and may request the installation of software for use by all users on the system. Only faculty or staff can request software. If students require special software packages, their advisors must request them. For details, see Software requests in Policies regarding UITS research systems.

Run GPU-accelerated applications

To run a GPU-accelerated application on the GPU partition on Carbonate or Big Red 200:

  • Specify the GPU partition by including the -p gpu flag either as an SBATCH directive in your batch job script or as an option in your srun command.
  • Specify how many GPUs per node (up to 4) should be allocated to your job by including the --gpus-per-node flag either as an SBATCH directive in your batch job script or as an option in your srun command.

Additionally, to use the Carbonate GPU partition, you must specify the allocation's Slurm Account Name by including the -A (or --account) flag either as an SBATCH directive in your batch job script or as an option in your srun command.

Note:
  • Users belonging to projects approved through IU HPC Projects can retain their access to Carbonate's GPU partition through November 14, 2022, by using the Slurm Account Name "legacy-projects" in their job submissions (for example, by including the -A legacy-projects flag either as an SBATCH directive in a batch job script or as an option in an srun command).
  • Users belonging to projects approved through RT Projects can find their allocation's Slurm Account Name on the "Home" page in RT Projects; look under "Submitting Slurm Jobs with your Project's Account"; alternatively, on the "Home" page, under "Allocations", select an allocation and look in the table under "Allocation Attributes".

Submit an interactive job

To request resources for an interactive job, use the srun command with the --pty option. For example:

  • To launch a Bash session that uses one V100 GPU on a node in Carbonate's GPU partition, on the command line, enter (replace slurm-account-name with your allocation's Slurm Account Name):
    srun -p gpu -A slurm-account-name --gpus-per-node v100:1 --pty bash
    
    Note:
    Include the -A flag only when using the Carbonate GPU partition. Do not include it for jobs on the Big Red 200 GPU partition.
  • To launch a Bash session that uses one A100 GPU on a node in Big Red 200's GPU partition, on the command line, enter:
    srun -p gpu --gpus-per-node 1 --pty bash
    

When the requested resources are allocated to your job, you will be placed at the command prompt on one of the nodes in the partition you specified. When you are finished with your interactive session, on the command line, enter exit to free the allocated resources.

For complete documentation about the srun command, see the srun manual page (on the web, see srun; on Carbonate or Big Red 200, enter man srun).

Submit a batch job

To run a batch job, prepare a Slurm job script (for example, job.sh) that includes SBATCH directives for specifying the required resources, and then use the sbatch command to submit it. If the command exits successfully, it will return a job ID; for example:

[sgerrera@h1]$ sbatch job.sh
sbatch: Submitted batch job 99999999
[sgerrera@h1]$
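
In shell scripts, it can be handy to capture the returned job ID for later queries. A minimal sketch, assuming the confirmation line ends with the job ID; here the sbatch output is simulated with echo so the snippet runs anywhere:

```shell
# Capture the job ID from an sbatch confirmation line.  On Carbonate
# or Big Red 200 you would replace the echo with the real command:
#   jobid=$(sbatch job.sh | awk '{print $NF}')
line='sbatch: Submitted batch job 99999999'
jobid=$(echo "$line" | awk '{print $NF}')
echo "$jobid"   # prints 99999999
```

The captured ID can then be passed to commands such as squeue -j "$jobid" or scancel "$jobid".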

If your job has resource requirements that are different from the defaults (but not exceeding the maximums allowed), specify them with SBATCH directives in your job script. Also, if you need help determining how much memory your job is using, add the following SBATCH directives to your job script (replace username@iu.edu with your IU email address):

#SBATCH --mail-user=username@iu.edu
#SBATCH --mail-type=ALL

When the job completes, Slurm will email the specified address with a summary of the job's resource utilization.

For example, a job script for running a batch job on Carbonate's GPU partition may look similar to the following:

#!/bin/bash

#SBATCH -J job_name
#SBATCH -p gpu
#SBATCH -A slurm-account-name
#SBATCH -o filename_%j.txt
#SBATCH -e filename_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@iu.edu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node v100:1
#SBATCH --time=02:00:00

#Load any modules that your program needs
module load modulename

#Run your program
srun ./my_program my_program_arguments

In the example script above:

  • The first line indicates that the script should be read using the Bash command interpreter.
  • The next lines are #SBATCH directives used to pass options to the sbatch command:
    • -J job_name specifies a name for the job allocation. The specified name will appear along with the job ID number when you query running jobs on the systems.
    • -p gpu specifies that the job should run in Carbonate's GPU partition.
    • -A slurm-account-name indicates the Slurm Account Name to which resources used by this job should be charged.
      Note:
      Include the -A flag only when using the Carbonate GPU partition. Do not include it for jobs on the Big Red 200 GPU partition.
    • -o filename_%j.txt and -e filename_%j.err instruct Slurm to redirect the job's standard output and standard error, respectively, to the file names specified (Slurm automatically replaces %j with the job ID).
    • --mail-user=username@iu.edu indicates the email address to which Slurm will send job-related mail.
    • --mail-type=<type> directs Slurm to send job-related email when an event of the specified type(s) occurs; valid type values include all, begin, end, and fail.
    • --gpus-per-node v100:1 requests that one V100 GPU be allocated to this job.
    • --time=02:00:00 sets the job's maximum wall time to two hours; Slurm will terminate the job if it is still running when that limit expires.
  • At the bottom are the two executable lines that the job will run. In this case, the module command is used to load a module (modulename), and then srun is used to execute the application with the arguments specified. In your script, replace my_program and my_program_arguments with your program's name and any necessary arguments, respectively.
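
A corresponding script for Big Red 200's GPU partition has the same structure minus the -A line. The sketch below writes such a script to a file so you can inspect it before submitting; modulename and my_program are placeholders for your own module and program:

```shell
# Write a minimal Big Red 200 GPU job script to a file.  Sketch only:
# "modulename" and "./my_program" are placeholders, and there is no
# "#SBATCH -A" line because -A applies only to Carbonate's GPU partition.
cat > br200_gpu_job.sh <<'EOF'
#!/bin/bash

#SBATCH -J job_name
#SBATCH -p gpu
#SBATCH -o filename_%j.txt
#SBATCH -e filename_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node 1
#SBATCH --time=02:00:00

module load modulename
srun ./my_program my_program_arguments
EOF

# On Big Red 200, submit it with:
#   sbatch br200_gpu_job.sh
```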

For more, see Use Slurm to submit and manage jobs on high performance computing systems.

Partition (queue) information

To view current information about a partition, use the sinfo command. For example, on Big Red 200, to view information about the GPU partition, on the command line, enter:

sinfo -p gpu

You should see output similar to the following:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu          up 2-00:00:00      2 drain* nid[0672,0696]
gpu          up 2-00:00:00     12  down* nid[0641,0653,0658,0679,0688-0689,0692-0693,0697-0699,0702]
gpu          up 2-00:00:00      8    mix nid[0642,0651,0673-0674,0676-0677,0700-0701]
gpu          up 2-00:00:00     40  alloc nid[0643-0650,0652,0654-0657,0659-0671,0675,0678,0680-0687,0690-0691,0694-0695]

In the above sample output:

  • The PARTITION column shows the partition name.
  • The AVAIL column shows the status of the partition.
  • The TIMELIMIT column shows the maximum wall time that users can request.
  • The NODES column shows the number of nodes in each state.
  • The STATE column shows the current state of that group of nodes.
  • The NODELIST column shows the actual nodes that are part of each partition.
Note:
To best meet the needs of all research projects affiliated with Indiana University, UITS Research Technologies administers the batch job queues on IU's research supercomputers using resource management and job scheduling policies that optimize the overall efficiency and performance of workloads on those systems. If the structure or configuration of the batch queues on any of IU's research supercomputers does not meet the needs of your research project, contact UITS Research Technologies.

Get help

Support for IU research supercomputers, software, and services is provided by various teams within the Research Technologies division of UITS.

For general questions about research computing at IU, contact UITS Research Technologies.

For more options, see Research computing support at IU.

This is document avjk in the Knowledge Base.
Last modified on 2022-09-12 15:01:31.