About Carbonate's deep learning (DL) nodes

System overview

To support deep learning applications and research, Indiana University's Carbonate cluster has been expanded to include 12 GPU-accelerated Lenovo ThinkSystem SD530 compute nodes. Each of these deep learning (DL) nodes is equipped with two Intel Xeon Gold 6126 12-core CPUs, two NVIDIA GPU accelerators (eight nodes with Tesla P100s; four nodes with Tesla V100s), four 1.92 TB solid-state drives, and 192 GB of RAM. All DL nodes are housed in the IU Bloomington Data Center, run Red Hat Enterprise Linux 7.x, and are connected to the IU Science DMZ via 10-gigabit Ethernet.

Carbonate's DL nodes use the Slurm Workload Manager to coordinate resource management and job scheduling. The Data Capacitor II (DC2), DC-WAN2, Slate, and Slate-Project high-performance file systems are mounted for temporary storage of research data.
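
If you want to confirm which of these file systems are visible from the node you are logged in to, one quick check is to filter the output of df; the pattern below simply matches the file system names listed above and is illustrative only (adjust it if the actual mount names differ):

# Show mounted file systems whose names mention dc2, wan2, or slate
# (illustrative pattern; adjust to the actual mount names on your node)
df -h | grep -iE 'dc2|wan2|slate'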

System access

The Carbonate DL nodes are intended for users with deep learning workloads. IU students, faculty, and staff can request accounts on the Carbonate DL nodes by following these steps:

  1. If you don't already have an account on Carbonate, request one using the instructions in Get additional IU computing accounts.
  2. Fill out and submit the Request Access to Specialized HPC Resources form. Include a description of your deep learning application(s), and your hardware and software requirements.
Notes:
  • For enhanced security, SSH connections that have been idle for 60 minutes will be disconnected. To protect your data from misuse, remember to log off or lock your computer whenever you leave it.
  • The scheduled monthly maintenance window for IU's high-performance computing systems is the second Sunday of each month, 7am-7pm.

Deep learning software

For a list of packages available on Carbonate, see HPC Applications.

Carbonate has two deep learning modules available:

  • deeplearning/1.13.1 (based on Python 3)
  • deeplearning/python2.7/1.13.1 (based on Python 2)

Both modules contain frequently used deep learning tools including scikit-learn, NumPy, SciPy, NLTK, Torch, Caffe2, and MXNet. The deeplearning/1.13.1 module also includes TensorFlow and Keras.
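
For example, to use the Python 3 module, you can load it and run a quick import check to confirm that its TensorFlow build is on your path (a minimal sanity check using the module name listed above):

# Load the Python 3 deep learning module and print the bundled TensorFlow version
module load deeplearning/1.13.1
python -c "import tensorflow as tf; print(tf.__version__)"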

Note:
Carbonate users are free to install software in their home directories and may request that software be installed for use by all users on the system. Only faculty and staff can request software installations; if students need software packages on Carbonate, their advisors must request them on their behalf. For details, see the IU policies on installing software on Carbonate. To request software, use the HPC Software Request form.

Run jobs on Carbonate's DL nodes

Carbonate's DL nodes use the Slurm workload manager to manage and schedule interactive sessions and batch jobs. To use the Carbonate DL nodes, you must:

  • Specify the dl or dl-debug partition by including the -p flag either as an SBATCH directive in your batch job script or as an option in your srun command.
  • Indicate the type of GPU (p100 or v100) and the number of GPUs (1 or 2) that should be allocated to your job by including the --gres flag either as an SBATCH directive in your batch job script or as an option in your srun command.

    If you omit the --gres flag when requesting resources in the dl or dl-debug partition, srun and sbatch will return an error:

    Must specify a gpu resource:
    Resubmit with --gres=gpu:type:count, where type is p100 or v100.
    Batch job submission failed: Unspecified error
    
Note:
User processes on the login nodes are limited to 20 minutes of CPU time. Processes on the login nodes that run longer than 20 minutes are terminated automatically (without warning).

Submit an interactive job

To request resources for an interactive job, use the srun command with the --pty option.

For example, to launch a Bash session that uses one V100 GPU on a node in the dl partition, on the command line, enter:

srun -p dl --gres=gpu:v100:1 --pty bash

To perform GPU debugging, submit an interactive job to the dl-debug or dl partition:

  • To request 12 cores and one P100 GPU for four hours of wall time in the dl-debug partition, on the command line, enter:
    srun -p dl-debug -N 1 --ntasks-per-node=12 --gres=gpu:p100:1 --time=4:00:00 --pty bash
    
  • To request 12 cores and one P100 GPU for 12 hours of wall time in the dl partition, on the command line, enter:
    srun -p dl -N 1 --ntasks-per-node=12 --gres=gpu:p100:1 --time=12:00:00 --pty bash
    

When the requested resources are allocated to your job, you will be placed at the command prompt on one of Carbonate's DL nodes. When you are finished with your interactive session, on the command line, enter exit to free the allocated resources.
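
Once your prompt appears on a DL node, you can confirm that the requested GPU was allocated before starting work; for example, the standard NVIDIA nvidia-smi utility lists the device(s) visible to your session:

# Run inside the interactive session to list the allocated GPU(s) and their current utilization
nvidia-smi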

For complete documentation about the srun command, see the srun manual page (on the web, see srun; on Carbonate, enter man srun).

Submit a batch job

To run a batch job on the DL nodes, first prepare a Slurm job script (for example, job.sh) that includes the #SBATCH -p dl directive (for routing your job to the dl partition), and then use the sbatch command to submit it. If the command exits successfully, it will return a job ID; for example:

[sgerrera@h1]$ sbatch job.sh
sbatch: Submitted batch job 99999999
[sgerrera@h1]$
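
After the job is submitted, you can track or cancel it with standard Slurm commands; for example (replace 99999999 with the job ID that sbatch returned):

# List your own queued and running jobs
squeue -u $USER

# Cancel a job that is no longer needed
scancel 99999999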

If your job has resource requirements that are different from the defaults (but not exceeding the maximums allowed), specify them with SBATCH directives in your job script. Also, if you need help determining how much memory your job is using, add the following SBATCH directives to your job script (replace username@iu.edu with your IU email address):

#SBATCH --mail-user=username@iu.edu
#SBATCH --mail-type=ALL

When the job completes, Slurm will email the specified address with a summary of the job's resource utilization.
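
If you prefer to check usage from the command line, Slurm's accounting tools can also report it after the job finishes, assuming job accounting is enabled on the system; for example:

# Report elapsed time, peak memory use (MaxRSS), and final state for a completed job
# (replace 99999999 with your job ID)
sacct -j 99999999 --format=JobID,JobName,Elapsed,MaxRSS,State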

For example, a job script for running a batch job on the Carbonate DL nodes could look similar to the following:

#!/bin/bash

#SBATCH -J job_name
#SBATCH -p dl
#SBATCH -o filename_%j.txt
#SBATCH -e filename_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@iu.edu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:p100:1
#SBATCH --time=02:00:00

module load modulename
./a.out

In the example script above:

  • The first line indicates that the script should be executed using the Bash command interpreter.
  • The next lines are #SBATCH directives used to pass options to the sbatch command:
    • -J job_name specifies a name for the job allocation. The specified name will appear along with the job ID number when you query running jobs on the systems.
    • -p dl specifies that the job should run in Carbonate's dl partition.
    • -o filename_%j.txt and -e filename_%j.err instruct Slurm to redirect the job's standard output and standard error, respectively, to the specified file names (Slurm automatically replaces %j with the job ID).
    • --mail-user=username@iu.edu indicates the email address to which Slurm will send job-related mail.
    • --mail-type=<type> directs Slurm to send job-related email when an event of the specified type(s) occurs; valid type values include ALL, BEGIN, END, and FAIL.
    • --gres=gpu:p100:1 requests that one P100 GPU be allocated to this job.
    • --time=02:00:00 sets a maximum wall time of two hours for the job; Slurm terminates the job if it runs longer than this limit.
  • The last two lines are the commands the job will run. In this case, the module command loads the modulename module before the a.out binary is executed.
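
As a concrete sketch, a version of this script adapted for a deep learning workload might load one of the modules described above and run a training script; train.py below is a placeholder for your own code, and the resource requests are only an example:

#!/bin/bash

#SBATCH -J train_model
#SBATCH -p dl
#SBATCH -o train_%j.txt
#SBATCH -e train_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:v100:1
#SBATCH --time=04:00:00

# Load the Python 3 deep learning module, then run a placeholder training script
module load deeplearning/1.13.1
python train.py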

For more, see Use Slurm to submit and manage jobs on high-performance computing systems.

Partition (queue) information

To view current information about Carbonate's DL nodes, use the sinfo command. You should see output similar to the following:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
dl           up 2-00:00:00     11   idle dl[1-9,11-12]
dl-debug     up    8:00:00      1   idle dl10

In the above sample output:

  • The PARTITION column shows the names of the available partitions.
  • The AVAIL column shows the status of each partition.
  • The TIMELIMIT column shows the maximum wall time that users can request.
  • The NODES column shows the number of nodes in each partition.
  • The STATE column shows the current state of the listed nodes (for example, idle).
  • The NODELIST column shows the actual nodes that are part of each partition.
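
To limit the output to just these partitions, you can pass their names to sinfo directly:

# Show only the dl and dl-debug partitions
sinfo -p dl,dl-debug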

To submit a batch job to the dl partition, add the #SBATCH -p dl directive to your job script. To submit a batch job to the dl-debug partition, add the #SBATCH -p dl-debug directive to your job script.

Note:
To best meet the needs of all research projects affiliated with Indiana University, UITS Research Technologies administers the batch job queues on IU's research supercomputers using resource management and job scheduling policies that optimize the overall efficiency and performance of workloads on those systems. If the structure or configuration of the batch queues on any of IU's research supercomputers does not meet the needs of your research project, contact UITS Research Technologies.

Get help

For an overview of Carbonate documentation, see Get started on Carbonate.

Support for IU research supercomputers, software, and services is provided by various teams within the Research Technologies division of UITS.

For general questions about research computing at IU, contact UITS Research Technologies.

For more options, see Research computing support at IU.

This is document avjk in the Knowledge Base.
Last modified on 2019-12-26 13:02:36.
