About Carbonate's deep learning (DL) and GPU partitions
On this page:
- System overview
- System access
- Deep learning and GPU software
- Run jobs on Carbonate's DL and GPU partitions
- Partition (queue) information
- Get help
System overview
To support deep learning (DL) and other GPU-accelerated applications and research, Indiana University's Carbonate cluster includes two partitions of GPU-equipped compute nodes.
The DL partition consists of 12 GPU-accelerated Lenovo ThinkSystem SD530 compute nodes. Each deep learning node is equipped with two Intel Xeon Gold 6126 12-core CPUs, two NVIDIA GPU accelerators (eight nodes have Tesla P100s; four have Tesla V100s), one 1.92 TB solid-state drive, and 192 GB of RAM.
The GPU partition consists of an additional 24 GPU-accelerated Apollo 6500 nodes. Each node is equipped with two Intel Xeon Gold 6248 2.5 GHz 20-core CPUs, 768 GB of RAM, four NVIDIA V100-PCIE-32GB GPUs, and one 1.92 TB solid-state drive.
Carbonate's DL and GPU partitions use the Slurm Workload Manager to coordinate resource management and job scheduling. The Slate, Slate-Project, and Slate-Scratch high performance file systems are mounted for persistent storage of research data; Data Capacitor Wide Area Network 2 (DC-WAN2) provides temporary storage for projects that require remote mounts.
All DL and GPU nodes are housed in the IU Bloomington Data Center and run Red Hat Enterprise Linux 7.x. The DL nodes are connected to the IU Science DMZ via 10-gigabit Ethernet, and the GPU nodes are connected via 40-gigabit Ethernet.
Below is an overview of the DL and GPU partitions:
 | DL partition | GPU partition |
---|---|---|
Architecture | Lenovo ThinkSystem SD530 | Apollo 6500 |
Nodes | 12 (8 P100, 4 V100) | 24 |
GPUs/node | 2 | 4 |
Cores/node | 24 | 40 |
Memory/node | 192 GB | 768 GB |
Scheduler | Slurm | Slurm |
System access
The Carbonate DL partition is intended specifically for deep learning workloads, while the GPU partition is intended for any workload that can benefit from GPUs. IU students, faculty, and staff can request accounts on the Carbonate DL and GPU partitions by following these steps:
- If you don't already have an account on Carbonate, request one using the instructions in Get additional IU computing accounts.
- Use IU HPC Projects to request access to either the DL or the GPU partition; for instructions, see Request and manage access to specialized high performance computing (HPC) resources.
- For enhanced security, SSH connections that have been idle for 60 minutes will be disconnected. To protect your data from misuse, remember to log off or lock your computer whenever you leave it.
- The scheduled monthly maintenance window for IU's high performance computing systems is the second Sunday of each month, 7am-7pm.
Deep learning and GPU software
The IU research supercomputers use module-based environment management systems that provide a convenient method for dynamically customizing your software environment. Carbonate uses the Modules module management system. For more, see Use modules to manage your software environment on IU research supercomputers.
Popular deep learning tools (including TensorFlow, scikit-learn, NumPy, SciPy, NLTK, Torch, and MXNet) are bundled together in deeplearning modules. Several versions of the deeplearning module are available on Carbonate:
- To list available versions, on the command line, enter:
module avail deeplearning
- To see which packages are included in a version of the deeplearning module, on the command line, enter (replace x.x.x with the module's version number):
module show deeplearning/x.x.x
- To add a deeplearning module to your user environment, on the command line, enter (replace x.x.x with the module's version number):
module load deeplearning/x.x.x
Individual modules for GPU-capable applications are also available on Carbonate; these modules include gpu in the module name.
For a list of all packages available on Carbonate, see HPC Applications.
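As a quick sanity check after loading a deeplearning module, you can ask the bundled TensorFlow build whether it can see a GPU. The session below is only a sketch; replace x.x.x with a version reported by module avail, and note that on a login node the GPU list will normally be empty, so run the check inside a GPU job:

module avail deeplearning
module load deeplearning/x.x.x
# Ask the bundled TensorFlow whether a GPU is visible (prints an empty list on non-GPU nodes)
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"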
Run jobs on Carbonate's DL and GPU partitions
Carbonate's DL and GPU partitions use the Slurm workload manager to manage and schedule interactive sessions and batch jobs.
Carbonate DL partition
To use the Carbonate DL partition, you must:
- Specify the dl or dl-debug partition by including the -p flag either as an SBATCH directive in your batch job script or as an option in your srun command.
- Indicate the type of GPU (p100 or v100) and the number of GPUs per node (1 or 2) that should be allocated to your job by including the --gpus-per-node flag either as an SBATCH directive in your batch job script or as an option in your srun command. (A minimal set of directives is sketched below.)
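For example, a batch script bound for the DL partition might include directives like the following (a minimal sketch; p100:1, v100:2, and the other valid type/count combinations work the same way):

#SBATCH -p dl                   # route the job to the dl partition
#SBATCH --gpus-per-node v100:1  # request one V100 GPU on the node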
Carbonate GPU partition
To use the Carbonate GPU partition, you must:
- Specify the gpu or gpu-debug partition by including the -p flag either as an SBATCH directive in your batch job script or as an option in your srun command.
- Indicate the number of GPUs per node (up to 4) that should be allocated to your job by including the --gpus-per-node flag either as an SBATCH directive in your batch job script or as an option in your srun command. (See the sketch below.)
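The equivalent directives for the GPU partition look similar (again a sketch; every GPU-partition node has four V100s, so adjust the count to your job):

#SBATCH -p gpu                   # route the job to the gpu partition
#SBATCH --gpus-per-node v100:2   # request two of the node's four V100 GPUs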
Submit an interactive job
To request resources for an interactive job, use the srun command with the --pty option.
For example, to launch a Bash session that uses one V100 GPU on a node in the gpu partition, on the command line, enter:
srun -p gpu --gpus-per-node v100:1 --pty bash
To launch a Bash session that uses one V100 GPU on a node in the dl partition, on the command line, enter:
srun -p dl --gpus-per-node v100:1 --pty bash
To perform GPU debugging, submit an interactive job to the dl-debug or dl partition:
- To request 12 cores and one P100 GPU for one hour of wall time in the dl-debug partition, on the command line, enter:
srun -p dl-debug -N 1 --ntasks-per-node=12 --gpus-per-node p100:1 --time=1:00:00 --pty bash
- To request 12 cores and one P100 GPU for 12 hours of wall time in the dl partition, on the command line, enter:
srun -p dl -N 1 --ntasks-per-node=12 --gpus-per-node p100:1 --time=12:00:00 --pty bash
When the requested resources are allocated to your job, you will be placed at the command prompt on one of Carbonate's DL or GPU nodes. When you are finished with your interactive session, on the command line, enter exit to free the allocated resources.
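Once the session starts, it can be useful to confirm what the scheduler actually allocated. The commands below are a sketch of that check; nvidia-smi lists the GPUs visible to the job, and Slurm normally exports CUDA_VISIBLE_DEVICES for GPU allocations:

# List the GPUs visible to this interactive job
nvidia-smi
# Show which GPU indices Slurm assigned to the job (set by Slurm for GPU requests)
echo $CUDA_VISIBLE_DEVICES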
For complete documentation about the srun command, see the srun manual page (on the web, see srun; on Carbonate, enter man srun).
Submit a batch job
To run a batch job on the DL nodes, first prepare a Slurm job script (for example, job.sh) that includes the #SBATCH -p dl directive (for routing your job to the dl partition), and then use the sbatch command to submit it. If the command exits successfully, it will return a job ID; for example:
[sgerrera@h1]$ sbatch job.sh
sbatch: Submitted batch job 99999999
[sgerrera@h1]$
If your job has resource requirements that are different from the defaults (but not exceeding the maximums allowed), specify them with SBATCH directives in your job script. Also, if you need help determining how much memory your job is using, add the following SBATCH directives to your job script (replace username@iu.edu with your IU email address):
#SBATCH --mail-user=username@iu.edu
#SBATCH --mail-type=ALL
When the job completes, Slurm will email the specified address with a summary of the job's resource utilization.
For example, a job script for running a batch job on Carbonate's DL partition may look similar to the following:
#!/bin/bash

#SBATCH -J job_name
#SBATCH -p dl
#SBATCH -o filename_%j.txt
#SBATCH -e filename_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@iu.edu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node p100:1
#SBATCH --time=02:00:00

#Load any modules that your program needs
module load modulename

#Run your program
srun ./my_program my_program_arguments
In the example script above:
- The first line indicates that the script should be read using the Bash command interpreter.
- The next lines are #SBATCH directives used to pass options to the sbatch command:
  - -J job_name specifies a name for the job allocation. The specified name will appear along with the job ID number when you query running jobs on the system.
  - -p dl specifies that the job should run in Carbonate's dl partition.
  - -o filename_%j.txt and -e filename_%j.err instruct Slurm to redirect the job's standard output and standard error, respectively, to the file names specified (Slurm automatically replaces %j with the job ID).
  - --mail-user=username@iu.edu indicates the email address to which Slurm will send job-related mail.
  - --mail-type=<type> directs Slurm to send job-related email when an event of the specified type(s) occurs; valid type values include all, begin, end, and fail.
  - --gpus-per-node p100:1 requests that one P100 GPU be allocated to this job.
  - --time=02:00:00 requests a maximum of two hours of wall time for the job.
- At the bottom are the two executable lines that the job will run. In this case, the module command is used to load a module (modulename), and then srun is used to execute the application with the arguments specified. In your script, replace my_program and my_program_arguments with your program's name and any necessary arguments, respectively.
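After saving the script, a typical submit-and-monitor sequence might look like the following sketch (job.sh, username, and the job ID are placeholders):

# Submit the job script; sbatch prints the assigned job ID
sbatch job.sh
# Check the job's state in the queue (replace username with your IU username)
squeue -u username
# Inspect the scheduler's full record for a specific job (replace 99999999 with your job ID)
scontrol show job 99999999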
For more, see Use Slurm to submit and manage jobs on high performance computing systems.
Partition (queue) information
To view current information about Carbonate's DL and GPU partitions, use the sinfo command. You should see output similar to the following:
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
dl         up     2-00:00:00      1  drng   dl3
dl         up     2-00:00:00      1  comp   dl2
dl         up     2-00:00:00      8  mix    dl[1,4-6,8-9,11-12]
dl         up     2-00:00:00      1  alloc  dl7
gpu        up     2-00:00:00      2  resv   g[15,18]
gpu        up     2-00:00:00     18  mix    g[1-6,9-14,16-17,19-22]
gpu        up     2-00:00:00      2  alloc  g[7-8]
dl-debug   up     1:00:00         1  alloc  dl10
gpu-debug  up     1:00:00         2  idle   g[23-24]
In the above sample output:
- The PARTITION column shows the names of the available partitions.
- The AVAIL column shows the status of each partition.
- The TIMELIMIT column shows the maximum wall time that users can request.
- The NODES column shows the number of nodes in each partition.
- The STATE column shows the current state of the nodes listed in each row.
- The NODELIST column shows the actual nodes that are part of each partition.
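If you only need a subset of this information, sinfo accepts partition filters and output-format options; the following is a sketch using standard Slurm options:

# Show only the dl and gpu partitions
sinfo -p dl,gpu
# Compact view: partition, generic resources (GPUs), node count, and node state
sinfo -p dl,gpu -o "%P %G %D %t"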
To submit a batch job to the dl partition, add the #SBATCH -p dl directive to your job script. To submit a batch job to the dl-debug partition, add the #SBATCH -p dl-debug directive to your job script.
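You can also choose the partition at submission time instead of editing the script; options passed on the sbatch command line override the corresponding #SBATCH directives (job.sh is a placeholder):

# Send the same script to the dl-debug partition without modifying it
sbatch -p dl-debug job.sh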
Get help
For an overview of Carbonate documentation, see Get started on Carbonate.
Support for IU research supercomputers, software, and services is provided by various teams within the Research Technologies division of UITS.
- If you have a system-specific question, contact the High Performance Systems (HPS) team.
- If you have a programming question about compilers, scientific/numerical libraries, or debuggers, contact the UITS Research Applications and Deep Learning team.
For general questions about research computing at IU, contact UITS Research Technologies.
For more options, see Research computing support at IU.
This is document avjk in the Knowledge Base.
Last modified on 2022-04-08 16:23:08.