About Carbonate's deep learning (DL) and GPU partitions

System overview

To support deep learning (DL) and GPU-accelerated applications and research, Indiana University's Carbonate cluster includes two separate partitions of GPU-accelerated compute nodes.

The DL partition consists of 12 GPU-accelerated Lenovo ThinkSystem SD530 compute nodes. Each of these deep learning nodes is equipped with two Intel Xeon Gold 6126 12-core CPUs, two NVIDIA GPU accelerators (eight with Tesla P100s; four with Tesla V100s), four 1.92 TB solid-state drives, and 192 GB of RAM.

The GPU partition consists of an additional 24 GPU-accelerated Apollo 6500 nodes. Each node is equipped with two Intel Xeon Gold 6248 2.5 GHz 20-core CPUs, 768 GB of RAM, four NVIDIA V100-PCIE-32GB GPUs, and one 1.92 TB solid-state drive.

Carbonate's DL and GPU partitions use the Slurm Workload Manager to coordinate resource management and job scheduling. The Slate and Slate-Project high performance file systems are mounted for persistent storage of research data; DC-WAN2 provides temporary storage for projects that require remote mounts.

All DL and GPU nodes are housed in the IU Bloomington Data Center and run Red Hat Enterprise Linux 7.x. The DL nodes are connected to the IU Science DMZ via 10-gigabit Ethernet, and the GPU nodes are connected via 40-gigabit Ethernet.

Below is an overview of the DL and GPU partitions:

               DL partition              GPU partition
Architecture   Lenovo ThinkSystem SD530  Apollo 6500
Nodes          12 (8 P100, 4 V100)       24
GPUs/node      2                         4
Cores/node     24                        40
Memory/node    192 GB                    768 GB
Scheduler      Slurm                     Slurm

System access

The Carbonate DL partition is intended specifically for deep learning workloads, while the GPU partition is intended for any workload that can benefit from GPUs. IU students, faculty, and staff can request accounts on the Carbonate DL and GPU partitions by following these steps:

  1. If you don't already have an account on Carbonate, request one using the instructions in Get additional IU computing accounts.
  2. Use IU HPC Projects to request access to either Carbonate Deep Learning Partition or Carbonate GPU Partition; for instructions, see Request and manage access to specialized high performance computing (HPC) resources.
Notes:
  • For enhanced security, SSH connections that have been idle for 60 minutes will be disconnected. To protect your data from misuse, remember to log off or lock your computer whenever you leave it.
  • The scheduled monthly maintenance window for IU's high performance computing systems is the second Sunday of each month, 7am-7pm.

Deep learning and GPU software

For a list of packages available on Carbonate, see HPC Applications.

Popular DL software is bundled together on Carbonate in the deeplearning modules. Key versions include:

  • deeplearning/1.13.1 (based on Python 3 and TensorFlow 1.13.1)
  • deeplearning/2.3.0 (based on Python 3 and TensorFlow 2.3.0)

Both modules contain frequently used deep learning tools, including scikit-learn, NumPy, SciPy, NLTK, Torch, Caffe2, and MXNet; both also include Keras.
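For example, to use the TensorFlow 2.3.0 bundle, you could load the corresponding module and confirm that TensorFlow imports (a minimal sketch; the python command below assumes the module places a suitable Python 3 interpreter on your path):

module load deeplearning/2.3.0
python -c "import tensorflow as tf; print(tf.__version__)"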

Individual modules are also available for other GPU-enabled applications. Modules for applications that can take advantage of GPUs include gpu in their names (for example, lammps/gpu/gnu/3Mar20).
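To see which GPU-enabled builds of a particular package are installed, you can search the module tree; for example (lammps is used here only as an illustration):

module avail lammps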

Note:
Carbonate users are free to install software in their home directories and may request the installation of software for use by all users on the system. Only faculty or staff can request software. If students require software packages on Carbonate, their advisors must request them. For details, see IU policies relative to installing software on Carbonate. To request software, use the HPC Software Request form.

Run jobs on Carbonate's DL and GPU partitions

Carbonate's DL and GPU partitions use the Slurm workload manager to manage and schedule interactive sessions and batch jobs.

Carbonate DL partition

To use the Carbonate DL partition, you must:

  • Specify the dl or dl-debug partition by including the -p flag either as an SBATCH directive in your batch job script or as an option in your srun command.
  • Indicate the type of GPU (p100 or v100) and the number of GPUs per node (1 or 2) that should be allocated to your job by including the --gpus flag either as an SBATCH directive in your batch job script or as an option in your srun command.
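For example, a batch job script requesting one V100 GPU in the dl partition could include directives similar to the following (a minimal sketch; adjust the GPU type and count for your workload):

#SBATCH -p dl
#SBATCH --gpus v100:1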

Carbonate GPU partition

To use the Carbonate GPU partition, you must:

  • Specify the gpu or gpu-debug partition by including the -p flag either as an SBATCH directive in your batch job script or as an option in your srun command.
  • Indicate the number of GPUs per node (up to 4) that should be allocated to your job by including the --gpus flag either as an SBATCH directive in your batch job script or as an option in your srun command.
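For example, a batch job script requesting two GPUs on a node in the gpu partition could include directives similar to the following (a minimal sketch; the GPU count shown is illustrative):

#SBATCH -p gpu
#SBATCH --gpus v100:2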
Note:
User processes on the login nodes are limited to 20 minutes of CPU time. Processes on the login nodes that run longer than 20 minutes are terminated automatically (without warning).

Submit an interactive job

To request resources for an interactive job, use the srun command with the --pty option.

For example, to launch a Bash session that uses one V100 GPU on a node in the gpu partition, on the command line, enter:

srun -p gpu --gpus v100:1 --pty bash

To launch a Bash session that uses one V100 GPU on a node in the dl partition, on the command line, enter:

srun -p dl --gpus v100:1 --pty bash

To perform GPU debugging, submit an interactive job to the dl-debug or dl partition:

  • To request 12 cores and one P100 GPU for four hours of wall time in the dl-debug partition, on the command line, enter:
    srun -p dl-debug -N 1 --ntasks-per-node=12 --gpus p100:1 --time=4:00:00 --pty bash
    
  • To request 12 cores and one P100 GPU for 12 hours of wall time in the dl partition, on the command line, enter:
    srun -p dl -N 1 --ntasks-per-node=12 --gpus p100:1 --time=12:00:00 --pty bash
    

When the requested resources are allocated to your job, you will be placed at the command prompt on one of Carbonate's DL or GPU nodes. When you are finished with your interactive session, on the command line, enter exit to free the allocated resources.
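Once your interactive session starts, you can confirm which GPUs were allocated to it; for example, the standard NVIDIA nvidia-smi utility (a general NVIDIA tool, shown here only as an illustration) lists the GPUs visible to your session:

nvidia-smi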

For complete documentation about the srun command, see the srun manual page (on the web, see srun; on Carbonate, enter man srun).

Submit a batch job

To run a batch job on the DL nodes, first prepare a Slurm job script (for example, job.sh) that includes the #SBATCH -p dl directive (for routing your job to the dl partition), and then use the sbatch command to submit it. If the command exits successfully, it will return a job ID; for example:

[sgerrera@h1]$ sbatch job.sh
sbatch: Submitted batch job 99999999
[sgerrera@h1]$

If your job has resource requirements that differ from the defaults (but do not exceed the allowed maximums), specify them with SBATCH directives in your job script. Also, if you want a summary of the resources (such as memory) your job used, add the following SBATCH directives to your job script (replace username@iu.edu with your IU email address):

#SBATCH --mail-user=username@iu.edu
#SBATCH --mail-type=ALL

When the job completes, Slurm will email the specified address with a summary of the job's resource utilization.

For example, a job script for running a batch job on the Carbonate DL nodes could look similar to the following:

#!/bin/bash

#SBATCH -J job_name
#SBATCH -p dl
#SBATCH -o filename_%j.txt
#SBATCH -e filename_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@iu.edu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus p100:1
#SBATCH --time=02:00:00

module load modulename
./a.out

In the example script above:

  • The first line indicates that the script should be read using the Bash command interpreter.
  • The next lines are #SBATCH directives used to pass options to the sbatch command:
    • -J job_name specifies a name for the job allocation. The specified name will appear along with the job ID number when you query running jobs on the systems.
    • -p dl specifies that the job should run in Carbonate's dl partition.
    • -o filename_%j.txt and -e filename_%j.err instruct Slurm to redirect the job's standard output and standard error, respectively, to the file names specified (Slurm automatically replaces %j with the job ID).
    • --mail-user=username@iu.edu indicates the email address to which Slurm will send job-related mail.
    • --mail-type=<type> directs Slurm to send job-related email when an event of the specified type(s) occurs; valid type values include ALL, BEGIN, END, and FAIL.
    • --gpus p100:1 requests that one P100 GPU be allocated to this job.
    • --time=02:00:00 requests a maximum wall time of two hours for the job.
  • The last two lines are the two executable lines that the job will run. In this case, the module command is used to load the modulename module before the a.out binary is executed.

For more, see Use Slurm to submit and manage jobs on high performance computing systems.

Partition (queue) information

To view current information about Carbonate's DL and GPU partitions, use the sinfo command. You should see output similar to the following:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
dl           up 2-00:00:00     11   idle dl[1-9,11-12]
dl-debug     up    8:00:00      1   idle dl10
gpu          up 2-00:00:00     22   idle g[1-22]
gpu-debug    up    1:00:00      2   idle g[23-24]

In the above sample output:

  • The PARTITION column shows the names of the available partitions.
  • The AVAIL column shows the status of each partition.
  • The TIMELIMIT column shows the maximum wall time that users can request.
  • The NODES column shows the number of nodes in each partition.
  • The STATE column shows the current status of each partition.
  • The NODELIST column shows the actual nodes that are part of each partition.
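To restrict the output to particular partitions, you can pass a comma-separated list to sinfo's -p option; for example:

sinfo -p dl,dl-debug,gpu,gpu-debug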

To submit a batch job to the dl partition, add the #SBATCH -p dl directive to your job script; to submit a batch job to the dl-debug partition, add the #SBATCH -p dl-debug directive instead.

Note:
To best meet the needs of all research projects affiliated with Indiana University, UITS Research Technologies administers the batch job queues on IU's research supercomputers using resource management and job scheduling policies that optimize the overall efficiency and performance of workloads on those systems. If the structure or configuration of the batch queues on any of IU's research supercomputers does not meet the needs of your research project, contact UITS Research Technologies.

Get help

For an overview of Carbonate documentation, see Get started on Carbonate.

Support for IU research supercomputers, software, and services is provided by various teams within the Research Technologies division of UITS.

For general questions about research computing at IU, contact UITS Research Technologies.

For more options, see Research computing support at IU.

